Retry Policies and Dead Letter Queues (DLQs)

In distributed systems, Retry Policies and Dead Letter Queues (DLQs) are complementary strategies used to handle failures gracefully. A retry policy attempts to fix transient issues automatically, while a DLQ provides a safety net for messages that cannot be processed even after those retries.

🔄 Retry Policies

Retry policies define how your system handles transient failures, such as network hiccups or temporary service unavailability. Instead of failing immediately, the operation (for example, an API call or a database write) is retried.

Example

  • Scenario: A C# service calling an external payment API.
  • Retry Policy: Retry up to 3 times with exponential backoff (e.g., wait 2s, 4s, 8s).
  • Implementation: Use libraries like Polly in .NET to configure retry logic.
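To make the scenario above concrete, here is a minimal hand-written sketch of what such a policy does under the hood, without Polly. The helper name `RetryWithBackoffAsync`, the injectable `delay` parameter, and the fake payment call are all hypothetical and exist only for illustration; a real service would normally rely on a library instead.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper: retries `operation` up to `maxRetries` times on timeout,
// doubling the wait after each failure (baseDelay, 2x, 4x, ...).
// `delay` is injectable so the schedule can be observed without actually sleeping.
static async Task<T> RetryWithBackoffAsync<T>(
    Func<Task<T>> operation,
    int maxRetries,
    TimeSpan baseDelay,
    Func<TimeSpan, Task> delay)
{
    for (int attempt = 0; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (TimeoutException) when (attempt < maxRetries)
        {
            // attempt 0 -> baseDelay, attempt 1 -> 2 * baseDelay, attempt 2 -> 4 * baseDelay
            await delay(TimeSpan.FromTicks(baseDelay.Ticks * (1L << attempt)));
        }
    }
}

// Usage: a fake payment call that times out twice, then succeeds.
var waits = new List<TimeSpan>();
int calls = 0;
string result = await RetryWithBackoffAsync(
    () => ++calls < 3
        ? Task.FromException<string>(new TimeoutException())
        : Task.FromResult("payment-accepted"),
    maxRetries: 3,
    baseDelay: TimeSpan.FromSeconds(2),
    delay: wait => { waits.Add(wait); return Task.CompletedTask; });

Console.WriteLine($"{result} after {calls} calls; waits recorded: {waits.Count}");
```

Once the retries are used up, the `when` filter no longer matches and the last exception propagates to the caller, which is the point where a circuit breaker or a DLQ would take over.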

When to Use

  • Transient errors (timeouts, temporary network issues).
  • External dependencies that are usually reliable but occasionally fail.
  • Operations where retrying increases success probability.

When Not to Use

  • Permanent errors (e.g., invalid input, authentication failure).
  • Operations that are not guaranteed to be idempotent (retrying could cause duplicate charges).
  • Real-time systems where latency is critical.
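One way to act on the two lists above is to classify failures before deciding to retry. The sketch below does this for HTTP status codes. The function name `IsTransient` is hypothetical, and the set of codes treated as transient (408, 429, and 5xx) is an assumption that should be adjusted to the API you call.

```csharp
using System;
using System.Net;

// Hypothetical classifier: returns true for status codes that are usually worth retrying.
// Assumption: 408, 429, and all 5xx codes are transient; everything else is permanent.
static bool IsTransient(HttpStatusCode status) => status switch
{
    HttpStatusCode.RequestTimeout => true,   // 408: the request may succeed if sent again
    HttpStatusCode.TooManyRequests => true,  // 429: back off, then try again
    _ => (int)status >= 500                  // 5xx: server-side failures
};

foreach (var status in new[]
{
    HttpStatusCode.BadRequest,
    HttpStatusCode.Unauthorized,
    HttpStatusCode.TooManyRequests,
    HttpStatusCode.ServiceUnavailable
})
{
    Console.WriteLine($"{(int)status} {status}: retry = {IsTransient(status)}");
}
```

Invalid input (400) and authentication failures (401) fall through to "permanent", so they fail immediately instead of consuming the retry budget.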

Best Practices

  • Use exponential backoff with jitter to avoid thundering herd problems.
  • Limit retries to a reasonable number.
  • Ensure operations are idempotent (safe to retry).
  • Log retries for observability.
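The first practice, backoff with jitter, can be sketched in a few lines. This example uses the "full jitter" variant, which picks a random wait between zero and the exponential ceiling. The helper name `JitteredDelay` and the 30-second cap are assumptions, and this is not necessarily the algorithm a given retry library uses internally.

```csharp
using System;

// Hypothetical helper: "full jitter" picks a random delay between zero and
// the exponential ceiling baseDelay * 2^attempt, capped at maxDelay.
static TimeSpan JitteredDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random random)
{
    double ceilingMs = Math.Min(
        maxDelay.TotalMilliseconds,
        baseDelay.TotalMilliseconds * Math.Pow(2, attempt));
    return TimeSpan.FromMilliseconds(random.NextDouble() * ceilingMs);
}

var random = new Random();
for (int attempt = 0; attempt < 4; attempt++)
{
    TimeSpan wait = JitteredDelay(attempt, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30), random);
    Console.WriteLine($"attempt {attempt}: wait {wait.TotalMilliseconds:F0} ms");
}
```

Because each client draws its own random wait, clients that failed at the same moment spread their retries out instead of hitting the server together.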

Pitfalls

  • Infinite retry loops → resource exhaustion.
  • Retrying non-idempotent operations → duplicate side effects.
  • Adding too much latency if retries are excessive.

In C#, retry policies are typically implemented with Polly.

Retry Policy Example (using Polly)

For HTTP calls or database operations, use Exponential Backoff with Jitter. Jitter adds a random delay to prevent multiple clients from retrying simultaneously and overloading your server.

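//csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Retry;

// In production, prefer a shared or factory-created HttpClient.
var httpClient = new HttpClient();

// 1. Define the strategy with Exponential Backoff + Jitter
var retryPipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        // Retry only on HttpRequestException (e.g., connection failures)
        ShouldHandle = new PredicateBuilder().Handle<HttpRequestException>(),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,  // Best practice: prevents thundering herd
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(1) // Base delay: roughly 1s, 2s, 4s before jitter
    })
    .Build();

// 2. Execute your code within the pipeline
await retryPipeline.ExecuteAsync(async token =>
{
    // GetAsync throws HttpRequestException on connection failures;
    // non-success status codes are returned in the response, not thrown.
    using var response = await httpClient.GetAsync("https://api.example.com/data", token);
});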

📦 Dead Letter Queues (DLQs)

A DLQ is a special queue where messages that cannot be processed successfully after retries are sent for later inspection.

Example

  • Scenario: A message broker (Azure Service Bus, RabbitMQ, Kafka) processing orders.
  • DLQ: If a message fails processing after 5 retries, it's moved to DLQ for manual review or automated handling.
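The mechanics of that scenario can be sketched without any broker, using in-memory collections. Everything here is hypothetical: the `order:<number>` message format, the `ProcessOrder` handler, and the limit of five delivery attempts (an approximation of the "5 retries" above). A real broker does this bookkeeping for you.

```csharp
using System;
using System.Collections.Generic;

const int MaxDeliveryCount = 5; // assumption: five delivery attempts before dead-lettering

var mainQueue = new Queue<(string Body, int DeliveryCount)>();
var deadLetters = new List<(string Body, string Reason, DateTimeOffset FailedAt)>();
var processed = new List<string>();

// Hypothetical handler: orders must look like "order:<number>".
void ProcessOrder(string body)
{
    if (!body.StartsWith("order:") || !int.TryParse(body.Substring(6), out _))
        throw new FormatException($"Malformed order: '{body}'");
    processed.Add(body);
}

mainQueue.Enqueue(("order:42", 0));
mainQueue.Enqueue(("ordr-oops", 0)); // will never parse
mainQueue.Enqueue(("order:43", 0));

while (mainQueue.Count > 0)
{
    var (body, deliveryCount) = mainQueue.Dequeue();
    try
    {
        ProcessOrder(body);
    }
    catch (Exception ex)
    {
        deliveryCount++;
        if (deliveryCount >= MaxDeliveryCount)
            deadLetters.Add((body, ex.Message, DateTimeOffset.UtcNow)); // keep context for debugging
        else
            mainQueue.Enqueue((body, deliveryCount)); // put back for redelivery
    }
}

Console.WriteLine($"processed={processed.Count}, dead-lettered={deadLetters.Count}");
// prints "processed=2, dead-lettered=1"
```

The malformed message is set aside with its error reason and timestamp, while the two valid orders behind it are still processed.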

When to Use

  • As a safety net for unprocessable messages.
  • To isolate problematic data without blocking the main queue.
  • For auditing and debugging failed messages.

When Not to Use

  • For transient errors (use retry policies first).
  • As a substitute for fixing root causes (DLQ is for containment, not resolution).
  • If message volume is extremely high and DLQ handling isn't scalable.

Best Practices

  • Monitor DLQs actively (alerts, dashboards).
  • Automate DLQ processing (e.g., reprocess after fix, notify teams).
  • Include metadata (error reason, timestamp) with DLQ messages.
  • Keep DLQ retention policies aligned with compliance needs.

Pitfalls

  • Ignoring DLQ → silent data loss.
  • Treating DLQ as permanent storage.
  • Not differentiating between transient and permanent errors before DLQ routing.

Dead Letter Queue (DLQ) Example (Azure Service Bus)

In messaging systems, you don't "implement" a DLQ so much as you configure it and then handle messages that land there.

A. Automatic DLQ (Configuration)

When creating a queue, you set MaxDeliveryCount (default: 10). Each failed delivery attempt counts against that limit; once a message has used up its delivery attempts, Azure Service Bus automatically moves it to the DLQ.
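As a rough sketch, the setting can be applied when the queue is created through the administration client in the Azure.Messaging.ServiceBus package. The queue name, the value of 5, and the environment variable name `SERVICEBUS_CONNECTION_STRING` are assumptions chosen for illustration.

```csharp
using System;
using Azure.Messaging.ServiceBus.Administration;

// Assumption: the connection string is stored in this (hypothetical) environment variable.
string connectionString = Environment.GetEnvironmentVariable("SERVICEBUS_CONNECTION_STRING")
    ?? throw new InvalidOperationException("Connection string not set.");

var adminClient = new ServiceBusAdministrationClient(connectionString);

await adminClient.CreateQueueAsync(new CreateQueueOptions("my-queue")
{
    MaxDeliveryCount = 5,                    // dead-letter after 5 failed deliveries instead of the default 10
    DeadLetteringOnMessageExpiration = true  // also dead-letter messages whose time-to-live expires
});
```

A lower MaxDeliveryCount moves poison messages out of the way sooner, at the cost of giving transient failures fewer chances to recover.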

B. Manual Dead-Lettering (Code)

Sometimes you detect a "Poison Message" (e.g., malformed JSON) that will never succeed. You should move it to the DLQ immediately to save resources.

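//csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

// Assumes `client` is an existing ServiceBusClient and ProcessOrder is your own handler.
ServiceBusProcessor processor = client.CreateProcessor("my-queue",
    new ServiceBusProcessorOptions { AutoCompleteMessages = false }); // settle messages explicitly

processor.ProcessMessageAsync += async args =>
{
    try
    {
        // Attempt processing
        ProcessOrder(args.Message.Body);
        await args.CompleteMessageAsync(args.Message);
    }
    catch (InvalidDataException ex)
    {
        // USE CASE: Permanent error. Move to DLQ with a reason.
        await args.DeadLetterMessageAsync(args.Message,
            deadLetterReason: "InvalidOrderFormat",
            deadLetterErrorDescription: ex.Message);
    }
    // Any other exception propagates to the processor, and the message
    // stays on the queue for redelivery until its delivery attempts run out.
};

// An error handler must be registered before the processor can start.
processor.ProcessErrorAsync += args =>
{
    Console.Error.WriteLine(args.Exception);
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();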
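Messages that land in the DLQ still need to be read back at some point ("reprocess after fix" in the best practices above). The following is a minimal sketch of inspecting and resubmitting them, assuming the same `my-queue` queue and a hypothetical `SERVICEBUS_CONNECTION_STRING` environment variable. Blindly resubmitting only makes sense once the root cause has been fixed; otherwise the messages will simply fail again.

```csharp
using System;
using Azure.Messaging.ServiceBus;

// Assumption: the connection string is stored in this (hypothetical) environment variable.
string connectionString = Environment.GetEnvironmentVariable("SERVICEBUS_CONNECTION_STRING")
    ?? throw new InvalidOperationException("Connection string not set.");

await using var client = new ServiceBusClient(connectionString);

// The DLQ is a sub-queue of the main queue, addressed through SubQueue.DeadLetter.
await using ServiceBusReceiver dlqReceiver = client.CreateReceiver(
    "my-queue", new ServiceBusReceiverOptions { SubQueue = SubQueue.DeadLetter });
await using ServiceBusSender sender = client.CreateSender("my-queue");

var batch = await dlqReceiver.ReceiveMessagesAsync(
    maxMessages: 10, maxWaitTime: TimeSpan.FromSeconds(5));

foreach (ServiceBusReceivedMessage message in batch)
{
    // The reason and description are the values set at dead-lettering time.
    Console.WriteLine(
        $"{message.MessageId}: {message.DeadLetterReason} ({message.DeadLetterErrorDescription})");

    // Resubmit a copy to the main queue, then remove the original from the DLQ.
    await sender.SendMessageAsync(new ServiceBusMessage(message));
    await dlqReceiver.CompleteMessageAsync(message);
}
```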

βš–οΈ Putting It Together

  • Retry Policies handle temporary failures automatically.
  • Dead Letter Queues handle persistent failures safely without blocking the system.

👉 Think of retries as your first line of defense, and DLQs as your last safety net.

Comparison Summary

| Feature | Retry Policy | Dead Letter Queue (DLQ) |
|---|---|---|
| Purpose | Automatic recovery from transient errors. | Isolation of failed/invalid messages. |
| Trigger | Immediate failure (e.g., timeout). | Reaching max retries or permanent error. |
| Primary Goal | Minimize manual intervention. | Prevent data loss and queue blocking. |
| Action | Re-run the same code. | Store for manual analysis or later replay. |

Best Practices & Pitfalls

| Feature | Best Practice | Pitfall to Avoid |
|---|---|---|
| Retry Policy | Use Jitter: Always randomize retry intervals to protect your backend. | Infinite Retries: Never retry indefinitely; you will lock up threads and hide deeper issues. |
| Retry Policy | Combine with Circuit Breaker: If a service is down for minutes, stop retrying and "fail fast". | Retrying 4xx Errors: Do not retry Client Errors (400, 401, 404). These are permanent and will fail again. |
| DLQ | Monitor Depth: Set alerts (e.g., in Azure Monitor) if the DLQ has more than 0 messages. | Treating DLQ as Storage: The DLQ is for errors. If you don't process them, you lose data and eventually hit storage limits. |
| DLQ | Capture Context: Always attach the Exception Message to the DLQ reason for easier debugging. | Ignoring Order: Moving a message to a DLQ can break the sequence if you are using FIFO Sessions. |
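
The circuit breaker recommendation in the table can be sketched with Polly by adding a second strategy to the same pipeline. The thresholds below (50% failures over 30 seconds, at least 10 calls, 30-second break) are illustrative assumptions, not recommended values, and the URL is a placeholder.

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

var pipeline = new ResiliencePipelineBuilder()
    // Outer strategy: retry transient HTTP failures with backoff and jitter.
    .AddRetry(new RetryStrategyOptions
    {
        ShouldHandle = new PredicateBuilder().Handle<HttpRequestException>(),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(1)
    })
    // Inner strategy: stop calling the service while it keeps failing.
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        ShouldHandle = new PredicateBuilder().Handle<HttpRequestException>(),
        FailureRatio = 0.5,                          // open when half of the calls fail...
        SamplingDuration = TimeSpan.FromSeconds(30), // ...within a 30-second window...
        MinimumThroughput = 10,                      // ...that saw at least 10 calls
        BreakDuration = TimeSpan.FromSeconds(30)     // stay open for 30 seconds
    })
    .Build();

using var httpClient = new HttpClient();
try
{
    await pipeline.ExecuteAsync(async token =>
    {
        using var response = await httpClient.GetAsync("https://api.example.com/data", token);
    });
}
catch (BrokenCircuitException)
{
    // The circuit is open: fail fast instead of waiting on a service that is down.
    Console.Error.WriteLine("Circuit is open; failing fast.");
}
catch (HttpRequestException ex)
{
    // Retries were exhausted before the circuit opened.
    Console.Error.WriteLine($"Request failed after retries: {ex.Message}");
}
```

Because the retry strategy only handles HttpRequestException, a BrokenCircuitException is not retried, which is what produces the "fail fast" behavior.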