The dark side of the Cloud; network connectivity issues and service unavailability - AKA Transient Faults
Posted by : Michael Mortensen
Thursday, April 25, 2013
These days, when everything is moving to "the cloud", new challenges arise that most of us were spared from before: temporary conditions such as network connectivity issues or plain service unavailability. The technical term for dealing with them is Transient Fault Handling, and it is one you must be conversant with.
If it seems a bit over the top for your solution, you could consider the static TransientFaultUtility class found in the Cuemon namespace; it is fully compatible with cloud providers such as Windows Azure. It has several overloads for invoking a fault-sensitive method, and it keeps retrying until the operation succeeds, the number of retry attempts has been reached, or a failed operation is not considered related to a transient fault condition.
The minimum required parameters for invoking a transient-fault-protected method are an integer specifying retryAttempts, a function delegate that determines whether an exception isTransientFault and, last but not least, a function delegate or action delegate pointing to the faultSensitiveMethod.
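To make the role of the three parameters concrete, here is a minimal sketch of a retry loop built around them. It is not the Cuemon implementation: RetrySketch, its Execute method and the recoveryWaitTime parameter are hypothetical names chosen for illustration, and the real utility may differ in how it counts attempts and what it throws.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration only; this is NOT Cuemon's TransientFaultUtility.
public static class RetrySketch
{
    public static void Execute(
        int retryAttempts,
        Func<Exception, bool> isTransientFault,
        Action faultSensitiveMethod,
        Func<int, TimeSpan> recoveryWaitTime)
    {
        Exception lastFault = null;
        for (int attempt = 0; attempt < retryAttempts; attempt++)
        {
            try
            {
                faultSensitiveMethod();
                return; // the operation succeeded, so stop retrying
            }
            catch (Exception ex)
            {
                // A fault that is not transient is rethrown immediately.
                if (!isTransientFault(ex)) { throw; }
                lastFault = ex;
                // Give the fault time to recover before the next attempt.
                Thread.Sleep(recoveryWaitTime(attempt));
            }
        }
        // Every attempt failed with a transient fault; report the last one.
        throw new InvalidOperationException("The amount of retry attempts has been reached.", lastFault);
    }
}
```

Calling RetrySketch.Execute(3, ex => ex is TimeoutException, operation, attempt => TimeSpan.Zero) would invoke operation up to three times, giving up early only if it throws something other than a TimeoutException.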
To see the utility in action, have a look at Figure 1. Here we simply throw an HttpException if we encounter an HTTP 502 (Bad Gateway). The check can easily be extended to the other status codes listed in the IsTransientFault callback method. Otherwise we just write some debug information. Figure 2 shows how to consume the TransientFaultExample class; the first call is intentionally set up to fail. Figure 3 shows the Debug Trace.
public class TransientFaultExample
{
    public TransientFaultExample()
    {
        NetHttpUtility.DefaultHttpTimeout = TimeSpan.FromSeconds(15);
    }

    public void OpenWebsite(Uri location)
    {
        using (HttpWebResponse response = NetHttpUtility.HttpGet(location))
        {
            // Surface a 502 as an exception so the retry logic can classify it.
            if (response.StatusCode == HttpStatusCode.BadGateway) { throw new HttpException(502, response.StatusDescription); }
            Debug.WriteLine("Status code in response was {0} - {1}.", (int)response.StatusCode, response.StatusDescription);
            // The cast to object selects the format overload of Debug.WriteLine rather than (message, category).
            Debug.WriteLine("The headers of the response was {0}.", ConvertUtility.ToDelimitedString(response.Headers.AllKeys, ", ", HeaderConverter, response.Headers) as object);
        }
    }

    private string HeaderConverter(string header, WebHeaderCollection headers)
    {
        return string.Format("{0}: {1}", header, headers[header] ?? "null");
    }

    public bool IsTransientFault(Exception exception)
    {
        HttpException httpException = exception as HttpException;
        if (httpException != null)
        {
            // HTTP status codes that this example treats as transient.
            switch (httpException.GetHttpCode())
            {
                case 404:
                case 408:
                case 410:
                case 500:
                case 502:
                case 503:
                case 504:
                    return true;
                default:
                    return false;
            }
        }
        // Anything else counts as transient only if it reports a timeout.
        return (exception.Message.IndexOf("timed out", StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
Figure 1: The TransientFaultExample class
[TestClass]
public class TransientFaultExampleTest
{
    [TestMethod]
    public void TransientFault()
    {
        TransientFaultExample transient = new TransientFaultExample();

        // First run: port 88 does not answer, so every attempt times out.
        try
        {
            TransientFaultUtility.ExecuteAction(5, transient.IsTransientFault, transient.OpenWebsite, new Uri("http://www.google.com:88/"));
        }
        catch (TransientFaultException ex)
        {
            Debug.WriteLine("TransientFaultException was thrown (which is good): {0}", ConvertUtility.ToString(ex, Encoding.Default, true) as object);
        }

        // Second run: the regular address is expected to succeed.
        try
        {
            TransientFaultUtility.ExecuteAction(5, transient.IsTransientFault, transient.OpenWebsite, new Uri("http://www.google.com/"));
        }
        catch (TransientFaultException ex)
        {
            Debug.WriteLine("TransientFaultException was thrown (which is not so good - for Google at least): {0}", ConvertUtility.ToString(ex, Encoding.Default, true) as object);
            Assert.Fail();
        }
    }
}
Figure 2: Consuming the TransientFaultExample class from a unit test
Debug Trace:
TransientFaultException was thrown (which is good): TransientFaultException (Cuemon)
Source:
Cuemon
Message:
The amount of retry attempts has been reached.
Data:
Key: Attempts
Value: 5
Key: RecoveryWaitTimeInSeconds
Value: 21
Key: TotalRecoveryWaitTimeInSeconds
Value: 56
InnerException [of TransientFaultException]:
TimeoutException (System)
Source:
Cuemon
Message:
The operation has timed out.
Status code in response was 200 - OK.
The headers of the response was Cache-Control: private, max-age=0, Content-Type: text/html; charset=ISO-8859-1, Date: Thu, 25 Apr 2013 01:09:52 GMT, Expires: -1, Set-Cookie: PREF=ID=fd0ca0865d752f5b:FF=0:TM=1366852192:LM=1366852192:S=KNM_xhwUAUmLa1-f; expires=Sat, 25-Apr-2015 01:09:52 GMT; path=/; domain=.google.dk,NID=67=vYXlkUfWQ_paZ7fdrkXaq2gmgUati-Y3FfiPzpRLQTTWn7lQgWowKZgE53z4_D1G04SmEk0N_4YdaUKC2RZkajhrCZ69QrCRHKBumdejJg4Z2MKak7fUF0QUbL7nKf3F; expires=Fri, 25-Oct-2013 01:09:52 GMT; path=/; domain=.google.dk; HttpOnly, P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.", Server: gws, X-XSS-Protection: 1; mode=block, X-Frame-Options: SAMEORIGIN, Transfer-Encoding: chunked.
Figure 3: The output of the test in Figure 2
If you don't like the default recovery wait time, it can easily be changed using one of the overloads on the TransientFaultUtility class. For your convenience, I have included the default implementation in Figure 4.
/// <summary>
/// Specifies the amount of time to wait for a transient fault to recover gracefully before trying a new attempt.
/// </summary>
/// <param name="currentAttempt">The current attempt.</param>
/// <returns>A <see cref="TimeSpan"/> that defines the amount of time to wait for a transient fault to recover gracefully.</returns>
/// <remarks>Default implementation is <see cref="DefaultRecoveryWaitTime"/> + 2^<paramref name="currentAttempt"/>, with the exponent capped at 5; at most 5 (default) + 32 = 37 seconds.</remarks>
public static TimeSpan RecoveryWaitTime(int currentAttempt)
{
    TimeSpan sleep = DefaultRecoveryWaitTime;
    // Exponential back-off: add 2^currentAttempt seconds, never more than 2^5 = 32.
    sleep = sleep.Add(TimeSpan.FromSeconds(Math.Pow(2, currentAttempt > 5 ? 5 : currentAttempt)));
    return sleep;
}
Figure 4: The default implementation of RecoveryWaitTime
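As a worked example, the formula from Figure 4 can be used to reproduce the wait times reported in the Debug Trace. The sketch below is a standalone copy under two assumptions: that DefaultRecoveryWaitTime is 5 seconds (as the remarks suggest) and that attempts are numbered from 0 (which is what makes the numbers in the trace add up). WaitSchedule and TotalWaitSeconds are hypothetical names and not part of Cuemon.

```csharp
using System;

// Standalone illustration; WaitSchedule is a hypothetical class, not part of Cuemon.
public static class WaitSchedule
{
    // Assumption: the default base wait is 5 seconds, as the remarks in Figure 4 suggest.
    private static readonly TimeSpan DefaultRecoveryWaitTime = TimeSpan.FromSeconds(5);

    // Same formula as Figure 4: base wait + 2^currentAttempt seconds, exponent capped at 5.
    public static TimeSpan RecoveryWaitTime(int currentAttempt)
    {
        return DefaultRecoveryWaitTime.Add(TimeSpan.FromSeconds(Math.Pow(2, Math.Min(currentAttempt, 5))));
    }

    // Sums the waits for attempts 0 .. attempts - 1 (assumes zero-based attempt numbering).
    public static double TotalWaitSeconds(int attempts)
    {
        double total = 0;
        for (int attempt = 0; attempt < attempts; attempt++)
        {
            total += RecoveryWaitTime(attempt).TotalSeconds;
        }
        return total;
    }
}
```

Under these assumptions the five attempts wait 6, 7, 9, 13 and 21 seconds, so RecoveryWaitTime(4) is 21 seconds and TotalWaitSeconds(5) is 56, matching the RecoveryWaitTimeInSeconds and TotalRecoveryWaitTimeInSeconds values in Figure 3. From attempt 5 onwards the wait stays at 37 seconds.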
I hope this little introduction has inspired you to master Transient Fault Handling one way or another. Happy coding.
