Himanshu Kukreja

Week 2 — Day 3: Circuit Breakers

System Design Mastery Series


Preface

Over the past two days, we've built increasingly resilient payment systems:

Day 1: We added timeouts so slow services don't hang our system forever.
Day 2: We added idempotency so retries don't cause duplicate charges.

But there's still a problem. Let me show you:

Black Friday, 10:00 AM:
  Bank API starts failing (50% error rate)
  
  Your payment service:
    Request 1: Call bank → timeout after 3.5s → retry → timeout → fail
    Request 2: Call bank → timeout after 3.5s → retry → timeout → fail
    Request 3: Call bank → timeout after 3.5s → retry → timeout → fail
    ... (1000 concurrent requests)
    
  Each request waits 3.5s before failing.
  Your threads are all blocked waiting.
  Users see 3.5s delays followed by errors.
  You're hammering an already struggling bank with retries.
  The bank gets even more overloaded.

You're doing everything "right" — timeouts, retries, idempotency — but you're still:

  1. Wasting resources waiting for a service that's clearly broken
  2. Making the broken service worse by continuing to call it
  3. Giving users a terrible experience (wait 3.5s just to see an error)

Circuit breakers solve this.

When a service is clearly failing, stop calling it. Fail immediately. Give the user an error in 10ms instead of 3500ms. Give the struggling service a chance to recover without your traffic piling on.

The Philosophy Behind Circuit Breakers

Think about what happens in a cascade failure:

  1. Service A depends on Service B
  2. Service B becomes slow (not dead, just slow)
  3. Service A keeps calling Service B, waiting for each timeout
  4. Service A's threads fill up waiting
  5. Service A becomes slow
  6. Services that depend on A start slowing down
  7. The entire system grinds to a halt

The cruel irony? Service B might recover in 30 seconds. But by then, you've created a system-wide outage that takes 30 minutes to fix.

Circuit breakers are about failing fast and protecting the system. They embody a key principle of distributed systems:

"It's better to give users a fast error than a slow error."

A user who sees "Service temporarily unavailable" in 50ms can retry, use an alternative, or come back later. A user who waits 30 seconds for a timeout has wasted their time and patience.

Circuit breakers also protect the failing service. When you keep hammering a struggling service with requests, you're preventing it from recovering. By backing off, you give it breathing room.


Part I: Foundations

Chapter 1: The Circuit Breaker Pattern

1.1 The Electrical Analogy

The pattern is named after electrical circuit breakers in your home:

Normal operation:
  Power flows through circuit breaker
  Everything works
  
Electrical fault (short circuit):
  Too much current flows
  Circuit breaker TRIPS (opens)
  Power is cut off
  House doesn't burn down
  
After repair:
  You manually reset the breaker
  Power flows again

Software circuit breakers work the same way:

Normal operation:
  Requests flow through to downstream service
  Everything works
  
Service fault (errors, timeouts):
  Too many failures
  Circuit breaker TRIPS (opens)
  Requests fail immediately (don't call downstream)
  System doesn't collapse
  
After recovery:
  Circuit breaker tests if service is back
  If healthy, traffic flows again

1.2 The Three States

┌─────────────────────────────────────────────────────────────────────────┐
│                      Circuit Breaker State Machine                       │
│                                                                          │
│                                                                          │
│     ┌──────────────────┐                    ┌──────────────────┐        │
│     │                  │   failure_count    │                  │        │
│     │     CLOSED       │   > threshold      │      OPEN        │        │
│     │                  │ ─────────────────▶ │                  │        │
│     │  (Normal flow)   │                    │  (Fail fast)     │        │
│     │                  │                    │                  │        │
│     └────────┬─────────┘                    └────────┬─────────┘        │
│              ▲                                       │                   │
│              │                                       │                   │
│              │ success                    timeout    │                   │
│              │                            expires    │                   │
│              │                                       │                   │
│              │                                       ▼                   │
│              │                              ┌──────────────────┐        │
│              │                              │                  │        │
│              │                              │    HALF-OPEN     │        │
│              └──────────────────────────────│                  │        │
│                        success              │   (Testing)      │        │
│                                             │                  │        │
│                                             └────────┬─────────┘        │
│                                                      │                   │
│                                                      │ failure           │
│                                                      │                   │
│                                                      ▼                   │
│                                             Back to OPEN                │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

CLOSED: Normal operation. Requests pass through. Failures are counted.
        When failures exceed threshold → transition to OPEN.

OPEN:   Failing fast. Requests immediately rejected without calling 
        downstream. After timeout → transition to HALF-OPEN.

HALF-OPEN: Testing. Allow ONE request through.
           If success → CLOSED
           If failure → back to OPEN

Let's understand each state in depth:

CLOSED State (The Happy Path)

In the CLOSED state, the circuit breaker is invisible. Every request passes through to the downstream service. But silently, the circuit breaker is watching:

  • It records every failure (timeouts, errors, exceptions)
  • It maintains a sliding window of recent calls
  • It calculates the failure rate or count

Think of it like a watchful guardian. Everything seems normal, but it's ready to act the moment things go wrong.

Key question: What counts as a "failure"?

  • Timeouts: Yes, always
  • 5xx errors: Yes, server is having problems
  • 4xx errors: Usually no — that's the client's fault, not the server's
  • Connection refused: Yes, server is unreachable
  • Business exceptions (e.g., "insufficient funds"): No — the service is working correctly

This distinction is crucial. A payment being declined is NOT a failure of the payment service. The service correctly evaluated the request and said "no." Counting business rejections as failures would cause your circuit to open when nothing is actually wrong.
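
As a rough sketch, that classification might look like this in code (the helper and the status-code rules here are illustrative, not part of the implementation we build later):

import requests

def is_circuit_failure(exc: Exception) -> bool:
    """Decide whether an exception should count against the circuit breaker."""
    # Timeouts and connection problems: the service is slow or unreachable
    if isinstance(exc, (requests.Timeout, requests.ConnectionError)):
        return True
    # HTTP errors: only 5xx means the server itself is in trouble
    if isinstance(exc, requests.HTTPError) and exc.response is not None:
        return exc.response.status_code >= 500
    # Everything else (4xx, business rejections like "insufficient funds")
    # is not a service failure and should not trip the circuit
    return False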

OPEN State (The Protection Mode)

When failures cross the threshold, the circuit "trips" and enters the OPEN state. This is the protection mode:

  • No requests reach the downstream service — they fail immediately
  • Response time drops from seconds to milliseconds — fail-fast in action
  • The downstream service gets breathing room — no traffic to overwhelm it
  • Your resources are freed — threads aren't blocked waiting

The OPEN state has a timer. After a configured duration (typically 30-60 seconds), the circuit transitions to HALF-OPEN to test if the service has recovered.

Why not stay OPEN forever?

Services recover. Networks heal. Bugs get fixed. If we never tested, we'd permanently cut off a service that's been healthy for hours. The timeout ensures we eventually try again.

Why not transition immediately?

If we tested immediately, we'd never give the service time to recover. The timeout is the "cooling off" period.

HALF-OPEN State (The Testing Mode)

This is the careful, probing state. The circuit breaker allows a LIMITED number of requests through (often just one) to test if the service has recovered.

  • If the test succeeds: The service is back! Transition to CLOSED.
  • If the test fails: Still broken. Go back to OPEN and wait again.

The HALF-OPEN state is critical for preventing oscillation (rapidly switching between OPEN and CLOSED). By requiring success before closing, we ensure the service is actually healthy.

Some implementations require multiple successes in HALF-OPEN before transitioning to CLOSED. This provides extra confidence that recovery is real, not just a lucky request.

1.3 Why This Works

Without circuit breaker:

Time    Bank Status    Your Behavior              Result
────────────────────────────────────────────────────────────
0:00    Healthy        Call bank, 100ms response  Success
0:01    Failing        Call bank, 3.5s timeout    Fail
0:02    Failing        Call bank, 3.5s timeout    Fail
0:03    Failing        Call bank, 3.5s timeout    Fail
...
0:10    Recovering     Call bank, 3.5s timeout    Fail
0:11    Healthy        Call bank, 100ms response  Success

Total: 10 minutes of terrible experience
Each request: 3.5s of waiting before failure

With circuit breaker:

Time    Bank Status    Circuit State    Your Behavior              Result
─────────────────────────────────────────────────────────────────────────
0:00    Healthy        CLOSED           Call bank, 100ms           Success
0:01    Failing        CLOSED           Call bank, 3.5s timeout    Fail (count: 1)
0:02    Failing        CLOSED           Call bank, 3.5s timeout    Fail (count: 2)
0:03    Failing        CLOSED           Call bank, 3.5s timeout    Fail (count: 3)
0:04    Failing        CLOSED           Call bank, 3.5s timeout    Fail (count: 4)
0:05    Failing        CLOSED           Call bank, 3.5s timeout    Fail (count: 5)
        [5 failures in 60s → threshold exceeded → OPEN]
0:06    Failing        OPEN             Fail immediately, 10ms     Fail
0:07    Failing        OPEN             Fail immediately, 10ms     Fail
...
0:35    Recovering     OPEN             [30s timeout expires → HALF-OPEN]
0:35    Recovering     HALF-OPEN        Test: call bank            Success!
        [Success in HALF-OPEN → CLOSED]
0:36    Healthy        CLOSED           Call bank, 100ms           Success

Total: 35 seconds of degraded experience (vs 10 minutes)
Most failures: 10ms instead of 3.5s

Let's break down the math:

Without circuit breaker:

  • 10 minutes of failures × 60 seconds = 600 seconds
  • Each failure takes 3.5 seconds of user waiting
  • If 100 users are affected, that's 35,000 seconds of wasted user time

With circuit breaker:

  • 5 initial failures × 3.5s = 17.5 seconds of slow failures
  • 30 seconds of fast failures (10ms each) while OPEN
  • Total: ~35 seconds of degraded experience
  • Fast failures mean users can retry quickly or see a helpful error message

The "fail fast" principle in action:

When the circuit is OPEN, you're not making anything worse — you're making it better:

  1. Users get immediate feedback instead of hanging
  2. Your server threads are free for other work
  3. The struggling downstream service isn't being hammered
  4. You can show meaningful error messages ("Try again in 30 seconds")

A critical insight:

Notice that the circuit breaker didn't prevent failures. Users still couldn't complete their payments during the outage. But it dramatically reduced the IMPACT of those failures:

  • Faster feedback
  • Better user experience
  • System stability maintained
  • Faster recovery once the downstream service heals

1.4 A Real-World Analogy: The Restaurant

You discover a new restaurant. 

Visit 1: Great food, 30 min wait. (Success)
Visit 2: Okay food, 45 min wait. (Success)
Visit 3: Food poisoning! (Failure)
Visit 4: They messed up your order. (Failure)
Visit 5: 2 hour wait, then wrong order. (Failure)

Your mental circuit breaker trips:
  "I'm not going back there."
  
For the next few months:
  Friend: "Want to try that restaurant?"
  You: "No way." (Fail fast - you don't even try)
  
Six months later (timeout expires):
  You: "Maybe they've improved? Let me try once." (Half-open)
  
Visit 6 (test):
  Great food, 20 min wait! (Success)
  
Your circuit closes:
  You start going there again.

This analogy reveals something profound: humans naturally implement circuit breakers in their daily lives. We stop going to unreliable stores. We stop calling friends who never answer. We stop using apps that always crash.

The software circuit breaker pattern is just formalizing this intuitive behavior into code.

Key insights from the analogy:

  1. Threshold matters: One bad meal might not trip your circuit. Three in a row probably will. The threshold should match the severity of failures.

  2. Time heals: You don't blacklist the restaurant forever. After enough time (the recovery timeout), you're willing to give it another chance.

  3. One test is enough: You don't need 10 good visits to trust the restaurant again. One successful test visit (HALF-OPEN) is usually enough to restore confidence.

  4. Context matters: A bad experience at a fast-food joint has a lower threshold than a bad experience at a fine-dining restaurant. Similarly, critical services might have lower failure thresholds than non-critical ones.


Chapter 2: Implementing a Circuit Breaker

Now that we understand the concept, let's build one. We'll start with a basic implementation and then explore more sophisticated patterns.

2.1 The Core Components

Before diving into code, let's identify what a circuit breaker needs:

  1. State tracking: Know if we're CLOSED, OPEN, or HALF-OPEN
  2. Failure counting: Track failures within a time window
  3. Threshold checking: Know when to open the circuit
  4. Timing: Know when to transition from OPEN to HALF-OPEN
  5. Success tracking: Know when to close the circuit from HALF-OPEN
  6. Thread safety: Handle concurrent requests correctly

2.2 Basic Implementation

import time
import threading
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Any, Optional
from collections import deque

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreakerConfig:
    """Configuration for circuit breaker behavior."""
    
    # How many failures before opening
    failure_threshold: int = 5
    
    # Time window for counting failures (seconds)
    failure_window: int = 60
    
    # How long to stay open before testing (seconds)
    recovery_timeout: int = 30
    
    # How many successes needed in half-open to close
    success_threshold: int = 1
    
    # What exceptions count as failures
    failure_exceptions: tuple = (Exception,)

class CircuitBreaker:
    """
    Circuit breaker implementation.
    
    Wraps calls to external services and fails fast when the service
    is detected as unhealthy.
    """
    
    def __init__(self, name: str, config: CircuitBreakerConfig = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        
        self.state = CircuitState.CLOSED
        self.failures = deque()  # Timestamps of recent failures
        self.last_failure_time: Optional[float] = None
        self.half_open_successes = 0
        
        self.lock = threading.Lock()
    
    def call(self, func: Callable, *args, **kwargs) -> Any:
        """
        Execute function through circuit breaker.
        
        Raises CircuitOpenError if circuit is open.
        """
        with self.lock:
            self._check_state_transition()
            
            if self.state == CircuitState.OPEN:
                raise CircuitOpenError(
                    f"Circuit breaker '{self.name}' is open"
                )
        
        try:
            result = func(*args, **kwargs)
            self._record_success()
            return result
        
        except self.config.failure_exceptions as e:
            self._record_failure()
            raise
    
    def _check_state_transition(self):
        """Check if we should transition states."""
        
        if self.state == CircuitState.OPEN:
            # Check if recovery timeout has passed
            if self.last_failure_time:
                elapsed = time.time() - self.last_failure_time
                if elapsed >= self.config.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_successes = 0
    
    def _record_success(self):
        """Record successful call."""
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_successes += 1
                if self.half_open_successes >= self.config.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failures.clear()
    
    def _record_failure(self):
        """Record failed call."""
        with self.lock:
            now = time.time()
            self.last_failure_time = now
            
            if self.state == CircuitState.HALF_OPEN:
                # Single failure in half-open → back to open
                self.state = CircuitState.OPEN
                return
            
            if self.state == CircuitState.CLOSED:
                # Add failure to window
                self.failures.append(now)
                
                # Remove old failures outside window
                cutoff = now - self.config.failure_window
                while self.failures and self.failures[0] < cutoff:
                    self.failures.popleft()
                
                # Check if threshold exceeded
                if len(self.failures) >= self.config.failure_threshold:
                    self.state = CircuitState.OPEN
    
    def get_state(self) -> CircuitState:
        """Get current circuit state."""
        with self.lock:
            self._check_state_transition()
            return self.state

class CircuitOpenError(Exception):
    """Raised when circuit breaker is open."""
    pass

Understanding the implementation:

Let's walk through the key design decisions:

  1. Why use a deque for failures? A deque (double-ended queue) is perfect for sliding window implementations. We append new failures to the right and remove old ones from the left. Both operations are O(1).

  2. Why use a lock? Circuit breakers are called from multiple threads simultaneously. Without the lock, we could have race conditions where two threads both see 4 failures and both increment to 5, or where the state transitions happen inconsistently.

  3. Why check state transition at the start of each call? The transition from OPEN to HALF-OPEN is time-based. By checking at call time, we lazily evaluate the transition only when needed, rather than running a background timer.

  4. Why re-raise the exception after recording failure? The circuit breaker shouldn't hide errors from the caller. It records the failure for its statistics, but the caller still needs to handle the actual exception.

2.3 Using the Circuit Breaker

Let's see how to integrate the circuit breaker with a real service call:

import requests  # assumed available; BANK_API_URL and the Payment* exceptions come from the wider codebase

# Create circuit breaker for bank API
bank_circuit = CircuitBreaker(
    name="bank_api",
    config=CircuitBreakerConfig(
        failure_threshold=5,      # Open after 5 failures
        failure_window=60,        # Within 60 seconds
        recovery_timeout=30,      # Try again after 30 seconds
        success_threshold=2,      # Need 2 successes to close
    )
)

def call_bank_api(user_id: str, amount: float) -> dict:
    """Call bank API through circuit breaker."""
    
    def _make_request():
        response = requests.post(
            BANK_API_URL,
            json={'user_id': user_id, 'amount': amount},
            timeout=3.5
        )
        response.raise_for_status()
        return response.json()
    
    try:
        return bank_circuit.call(_make_request)
    
    except CircuitOpenError:
        # Circuit is open - fail fast with fallback
        raise PaymentServiceUnavailable(
            "Payment processing is temporarily unavailable. "
            "Please try again in a few minutes."
        )
    
    except requests.Timeout:
        # Will be recorded as failure by circuit breaker
        raise PaymentTimeout("Payment processing timed out")
    
    except requests.HTTPError as e:
        # Will be recorded as failure by circuit breaker
        raise PaymentFailed(f"Payment failed: {e}")

2.4 Circuit Breaker as Decorator

from functools import wraps

def circuit_breaker(
    name: str,
    failure_threshold: int = 5,
    failure_window: int = 60,
    recovery_timeout: int = 30
):
    """Decorator to wrap function with circuit breaker."""
    
    config = CircuitBreakerConfig(
        failure_threshold=failure_threshold,
        failure_window=failure_window,
        recovery_timeout=recovery_timeout,
    )
    breaker = CircuitBreaker(name, config)
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, **kwargs)
        
        wrapper.circuit_breaker = breaker  # Expose for monitoring
        return wrapper
    
    return decorator

# Usage
@circuit_breaker("bank_api", failure_threshold=5, recovery_timeout=30)
def charge_bank(user_id: str, amount: float) -> dict:
    response = requests.post(BANK_API_URL, ...)
    return response.json()

# Calling the function - circuit breaker handles failures
try:
    result = charge_bank("user_123", 99.00)
except CircuitOpenError:
    # Handle gracefully, e.g. surface a friendly message to the caller
    result = {"status": "unavailable",
              "message": "Payment service temporarily unavailable"}

Chapter 3: Advanced Circuit Breaker Patterns

The basic circuit breaker works, but real-world systems need more sophisticated approaches. Let's explore the patterns used by production systems at scale.

3.1 Sliding Window vs Count-Based

There are three main approaches to deciding when to open a circuit:

Count-based (simple):

Open after N consecutive failures.
Problem: 5 failures over 24 hours shouldn't open circuit.

This is the simplest approach but has a major flaw: it doesn't consider TIME. Five failures in 5 seconds is very different from five failures in 24 hours.

Time-based sliding window (better):

Open after N failures within T seconds.
Failures "expire" after window passes.

This adds the concept of a sliding window. We only count failures that happened recently. A failure from an hour ago doesn't affect the current state. This is what our basic implementation uses.

Rate-based (most sophisticated):

Open when failure rate exceeds X% within window.
Better handles high-throughput services.

This is the gold standard for high-traffic systems. Instead of counting failures, we calculate the failure RATE. This makes the circuit breaker scale-independent.

Why rate-based is important:

Consider a service handling 1000 requests/second:

  • Normal failure rate: 1%
  • That's 10 failures per second, 600 per minute

With count-based (threshold=5):

  • Circuit opens in 0.5 seconds!
  • But the service is 99% healthy!

With rate-based (threshold=50%, minimum=100 calls):

  • We check: 10 failures out of 1000 = 1%
  • 1% < 50%
  • Circuit stays closed (correct!)

The minimum_calls parameter is crucial — it prevents the circuit from opening on small sample sizes. One failure out of 2 calls is 50%, but that's not statistically significant.

class RateBasedCircuitBreaker:
    """Circuit breaker that uses failure rate, not count."""
    
    def __init__(
        self,
        name: str,
        failure_rate_threshold: float = 0.5,  # 50% failure rate
        minimum_calls: int = 10,               # Need at least 10 calls
        window_size: int = 60,                 # 60 second window
        recovery_timeout: int = 30
    ):
        self.name = name
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_calls = minimum_calls
        self.window_size = window_size
        self.recovery_timeout = recovery_timeout
        
        self.calls = deque()  # (timestamp, success: bool)
        self.state = CircuitState.CLOSED
        self.last_failure_time = None
        self.lock = threading.Lock()
    
    def _get_stats(self) -> tuple[int, int]:
        """Get success and failure counts in current window."""
        now = time.time()
        cutoff = now - self.window_size
        
        # Remove old entries
        while self.calls and self.calls[0][0] < cutoff:
            self.calls.popleft()
        
        successes = sum(1 for _, success in self.calls if success)
        failures = sum(1 for _, success in self.calls if not success)
        
        return successes, failures
    
    def _record_call(self, success: bool):
        """Record a call result."""
        with self.lock:
            now = time.time()
            self.calls.append((now, success))
            
            if not success:
                self.last_failure_time = now
            
            if self.state == CircuitState.HALF_OPEN:
                if success:
                    self.state = CircuitState.CLOSED
                    self.calls.clear()
                else:
                    self.state = CircuitState.OPEN
                return
            
            if self.state == CircuitState.CLOSED:
                successes, failures = self._get_stats()
                total = successes + failures
                
                if total >= self.minimum_calls:
                    failure_rate = failures / total
                    if failure_rate >= self.failure_rate_threshold:
                        self.state = CircuitState.OPEN
    
    def get_state(self) -> CircuitState:
        """Current state, moving OPEN -> HALF_OPEN once the recovery timeout has passed."""
        with self.lock:
            if (self.state == CircuitState.OPEN
                    and self.last_failure_time is not None
                    and time.time() - self.last_failure_time >= self.recovery_timeout):
                self.state = CircuitState.HALF_OPEN
            return self.state

3.2 Per-Operation Circuit Breakers

Different operations might have different failure characteristics:

class MultiCircuitBreaker:
    """Manage multiple circuit breakers for different operations."""
    
    def __init__(self):
        self.breakers: dict[str, CircuitBreaker] = {}
        self.lock = threading.Lock()
    
    def get_breaker(self, operation: str, config: CircuitBreakerConfig = None) -> CircuitBreaker:
        """Get or create circuit breaker for operation."""
        with self.lock:
            if operation not in self.breakers:
                self.breakers[operation] = CircuitBreaker(
                    name=operation,
                    config=config or CircuitBreakerConfig()
                )
            return self.breakers[operation]

# Usage
breakers = MultiCircuitBreaker()

# Different thresholds for different operations
charge_breaker = breakers.get_breaker("bank_charge", CircuitBreakerConfig(
    failure_threshold=3,  # Payments are critical - open fast
    recovery_timeout=60   # But wait longer to recover
))

refund_breaker = breakers.get_breaker("bank_refund", CircuitBreakerConfig(
    failure_threshold=10,  # Refunds less critical
    recovery_timeout=30
))

3.3 Circuit Breaker with Fallback

One of the most powerful patterns is combining circuit breakers with fallbacks. When the circuit is open, instead of just failing, we return a sensible default value.

When to use fallbacks:

Service                  Fallback Strategy
──────────────────────────────────────────────────────────────────────────
Fraud check              Return medium risk score (allow with extra verification)
Recommendation engine    Return popular items instead of personalized
User preferences         Return defaults
Analytics/Tracking       Skip silently
Price calculation        Return cached prices

When NOT to use fallbacks:

Service                  Why No Fallback
─────────────────────────────────────────────────
Payment processing       Can't fake a payment
Authentication           Can't fake a login
Inventory check          Can't fake stock levels
Legal compliance         Can't skip compliance checks

The key question: "Is a degraded experience better than no experience?"

For fraud checking: Yes. A 0.5 risk score is better than failing all payments. For payments: No. You can't pretend a payment succeeded.

from typing import Callable, TypeVar, Optional

T = TypeVar('T')

class CircuitBreakerWithFallback:
    """Circuit breaker that can return fallback value when open."""
    
    def __init__(
        self,
        name: str,
        config: CircuitBreakerConfig = None,
        fallback: Optional[Callable[[], T]] = None
    ):
        self.breaker = CircuitBreaker(name, config)
        self.fallback = fallback
    
    def call(
        self,
        func: Callable[[], T],
        fallback: Optional[Callable[[], T]] = None
    ) -> T:
        """
        Execute function with circuit breaker and optional fallback.
        
        If circuit is open and fallback is provided, return fallback value
        instead of raising exception.
        """
        try:
            return self.breaker.call(func)
        
        except CircuitOpenError:
            effective_fallback = fallback or self.fallback
            if effective_fallback:
                return effective_fallback()
            raise

# Usage
fraud_circuit = CircuitBreakerWithFallback(
    name="fraud_service",
    config=CircuitBreakerConfig(failure_threshold=5),
    fallback=lambda: {"risk_score": 0.5, "fallback": True}  # Default medium risk
)

def check_fraud(user_id: str, amount: float) -> dict:
    def _call():
        return fraud_service.check(user_id, amount)
    
    # If fraud service is down, return default risk score
    # This allows payments to continue with extra verification
    return fraud_circuit.call(_call)

Important: When using a fallback, always mark the response so downstream code knows it's a fallback. In the example above, we include "fallback": True in the response. This allows the calling code to adjust behavior (e.g., require additional verification for payments made with fallback fraud scores).
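
For example, the caller of check_fraud() above might branch on that flag (a sketch; require_extra_verification is a hypothetical helper, not part of the code above):

result = check_fraud(user_id="user_123", amount=250.00)

if result.get("fallback"):
    # Fraud service was down: this score is a default, not a real evaluation,
    # so require extra verification before proceeding
    require_extra_verification(user_id="user_123")  # hypothetical helper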


Chapter 4: When Circuit Breakers Cause Harm

Circuit breakers are powerful, but like any tool, they can cause problems when misused. Understanding these failure modes is crucial for production systems.

The paradox of circuit breakers:

A circuit breaker is designed to protect your system from failures. But a poorly configured circuit breaker can CAUSE failures — by cutting off healthy services or creating oscillation patterns.

Let's explore the common pitfalls:

4.1 Problem 1: Opening During Legitimate Load Spikes

This is the most common and most devastating problem with circuit breakers.

Black Friday scenario:
  
Normal day: 100 requests/second to bank, 1% timeout
Circuit breaker: threshold=5 failures in 60s → won't trigger

Black Friday: 10,000 requests/second to bank, 1% timeout
  1% of 10,000 = 100 timeouts per second!
  Circuit breaker: 5 failures in first second → OPENS
  
Result: You've cut off all payments during your busiest day!

The bank isn't failing more — you just have more traffic,
so you see more of the normal 1% failures.

Why this happens:

Count-based thresholds are absolute numbers. At low traffic, "5 failures in 60 seconds" is significant. At high traffic, it's noise.

Think of it this way:

  • 5 failures out of 100 requests = 5% failure rate (concerning!)
  • 5 failures out of 10,000 requests = 0.05% failure rate (excellent!)

The count-based circuit breaker can't tell the difference.

Solution: Use failure RATE, not failure COUNT

# Bad: Opens on 5 failures (broken at high traffic)
CircuitBreakerConfig(failure_threshold=5)

# Good: Opens on 50% failure rate with minimum sample
RateBasedCircuitBreaker(
    name="bank_api",
    failure_rate_threshold=0.5,  # 50% failure rate
    minimum_calls=100,            # Need 100 calls before deciding
)

How to choose the rate threshold:

  • Normal services: 50% (open when half of requests fail)
  • Critical services: 30% (more sensitive)
  • Best-effort services: 80% (very tolerant)

The minimum_calls prevents premature decisions. You need enough data to calculate a meaningful rate.
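
Translated into configuration, those guidelines might look like this (a sketch; the service names and exact numbers are illustrative):

# Normal service: open only when roughly half of all calls are failing
recommendations_breaker = RateBasedCircuitBreaker(
    name="recommendations",
    failure_rate_threshold=0.5,
    minimum_calls=100,
)

# Critical service: more sensitive, surface problems early
bank_breaker = RateBasedCircuitBreaker(
    name="bank_api",
    failure_rate_threshold=0.3,
    minimum_calls=100,
)

# Best-effort service: very tolerant, only open when it's nearly dead
analytics_breaker = RateBasedCircuitBreaker(
    name="analytics",
    failure_rate_threshold=0.8,
    minimum_calls=100,
)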

4.2 Problem 2: Thundering Herd on Recovery

The "thundering herd" is a classic distributed systems problem that circuit breakers can accidentally trigger.

Circuit opens → all requests fail fast
30 seconds later → circuit goes HALF-OPEN
One test request succeeds → circuit CLOSES
IMMEDIATELY: 10,000 queued requests hit the recovering service
Service crashes again → circuit opens again

This can create an oscillation:
  OPEN → HALF-OPEN → CLOSED → instant overload → OPEN

Why this happens:

During the OPEN period, requests are accumulating. Users are retrying. Background jobs are queueing. The moment the circuit closes, ALL of this backed-up demand floods the recovering service.

It's like a traffic jam clearing: the moment the road opens, everyone accelerates — and causes another jam.

The oscillation pattern:

Time    State       Traffic to service
────────────────────────────────────────
0:00    CLOSED      1000 req/s (normal)
0:01    CLOSED      1000 req/s
0:02    OPEN        0 req/s (failing fast)
0:03    OPEN        0 req/s
0:32    HALF-OPEN   1 req/s (test request)
0:33    CLOSED      5000 req/s (backed-up demand!)
0:34    OPEN        0 req/s (service crashed again)
...repeats forever...

Solution: Gradual recovery (ramp-up)

Instead of instantly allowing all traffic, we gradually increase the percentage of requests allowed:

import random

class GradualRecoveryCircuitBreaker:
    """Circuit breaker with gradual traffic ramp-up on recovery."""
    
    def __init__(self, name: str, config: CircuitBreakerConfig = None):
        self.breaker = CircuitBreaker(name, config)
        self.recovery_percentage = 100.0  # % of traffic to allow
        self.recovery_start_time = None
        self.recovery_duration = 60       # seconds to reach 100%
        self._last_state = CircuitState.CLOSED
    
    def call(self, func: Callable) -> Any:
        state = self.breaker.get_state()
        
        # The moment the circuit closes after having been open, start the ramp-up
        if state == CircuitState.CLOSED and self._last_state != CircuitState.CLOSED:
            self.recovery_percentage = 0.0
            self.recovery_start_time = time.time()
        self._last_state = state
        
        if state == CircuitState.CLOSED and self.recovery_percentage < 100:
            # We're in recovery ramp-up - shed a fraction of the traffic
            if self._should_allow_request():
                return self.breaker.call(func)
            else:
                raise CircuitOpenError("Request shed during recovery")
        
        return self.breaker.call(func)
    
    def _should_allow_request(self) -> bool:
        """Probabilistically allow request based on recovery percentage."""
        self._update_recovery_percentage()
        return random.random() * 100 < self.recovery_percentage
    
    def _update_recovery_percentage(self):
        """Update recovery percentage based on time since recovery started."""
        if self.recovery_start_time is None:
            self.recovery_start_time = time.time()
        
        elapsed = time.time() - self.recovery_start_time
        self.recovery_percentage = min(100.0, (elapsed / self.recovery_duration) * 100)
        
        if self.recovery_percentage >= 100:
            self.recovery_start_time = None  # Reset for next recovery

4.3 Problem 3: Single Point of Failure (Inconsistent State)

In distributed systems, you typically have multiple instances of your service. If each instance has its own circuit breaker, they can have different views of the world.

If your circuit breaker state is stored in memory:
  - Server 1: Circuit OPEN (saw failures)
  - Server 2: Circuit CLOSED (hasn't seen failures yet)
  - Server 3: Circuit CLOSED (hasn't seen failures yet)
  
Only Server 1 is protecting itself.
Servers 2 and 3 are still hammering the broken service.

Why this happens:

Each server's circuit breaker only sees the failures IT experiences. If you have 3 servers and the downstream service starts failing, each server will independently count failures. By the time one server's circuit opens, the others might still be sending traffic.

The math:

  • 100 failures/second from downstream
  • 3 servers sharing load equally ≈ 33 failures/second/server
  • Threshold: 5 failures
  • Each server opens in ~150ms
  • But they open at DIFFERENT times

For 150ms, some servers are protecting themselves while others are still sending traffic. At high scale, this matters.

Trade-offs of distributed vs local state:

Aspect              Local State                 Distributed State
────────────────────────────────────────────────────────────────
Latency             Fastest                     Adds Redis RTT
Consistency         Each server independent     All servers agree
Complexity          Simple                      More complex
Redis dependency    None                        Redis must be up

Solution: Shared circuit breaker state

class DistributedCircuitBreaker:
    """Circuit breaker with shared state in Redis."""
    
    def __init__(self, name: str, redis_client, config: CircuitBreakerConfig = None):
        self.name = name
        self.redis = redis_client
        self.config = config or CircuitBreakerConfig()
    
    def _state_key(self) -> str:
        return f"circuit_breaker:{self.name}:state"
    
    def _recovering_key(self) -> str:
        return f"circuit_breaker:{self.name}:recovering"
    
    def _failures_key(self) -> str:
        return f"circuit_breaker:{self.name}:failures"
    
    def get_state(self) -> CircuitState:
        """Get circuit state from Redis."""
        state = self.redis.get(self._state_key())
        if state:
            return CircuitState(state.decode())
        # The OPEN key expires after recovery_timeout. If the "recovering"
        # marker is still set, we're in HALF_OPEN (testing); otherwise CLOSED.
        if self.redis.exists(self._recovering_key()):
            return CircuitState.HALF_OPEN
        return CircuitState.CLOSED
    
    def _open_circuit(self):
        """Open the circuit; the state key's TTL drives the OPEN -> HALF_OPEN transition."""
        pipe = self.redis.pipeline()
        pipe.setex(self._state_key(), self.config.recovery_timeout, CircuitState.OPEN.value)
        pipe.set(self._recovering_key(), "1")
        pipe.execute()
    
    def record_failure(self):
        """Record failure in Redis."""
        if self.get_state() == CircuitState.HALF_OPEN:
            # Failed the recovery test - go straight back to OPEN
            self._open_circuit()
            return
        
        pipe = self.redis.pipeline()
        pipe.lpush(self._failures_key(), time.time())
        pipe.ltrim(self._failures_key(), 0, 99)  # Keep last 100
        pipe.expire(self._failures_key(), self.config.failure_window)
        pipe.execute()
        
        # Check if should open
        failures = self.redis.lrange(self._failures_key(), 0, -1)
        recent = [f for f in failures 
                  if float(f) > time.time() - self.config.failure_window]
        
        if len(recent) >= self.config.failure_threshold:
            self._open_circuit()
    
    def record_success(self):
        """Record success, potentially closing circuit."""
        if self.get_state() == CircuitState.HALF_OPEN:
            pipe = self.redis.pipeline()
            pipe.delete(self._recovering_key())
            pipe.delete(self._failures_key())
            pipe.execute()

Hybrid approach:

Many production systems use a hybrid: local state with periodic sync. The circuit breaker checks local state first (fast), but synchronizes with Redis periodically or on state changes. This balances performance with consistency.
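
One possible shape for that hybrid, as a sketch built on the DistributedCircuitBreaker above (the 5-second sync interval is an arbitrary choice):

class HybridCircuitBreaker:
    """Local cached state for speed, refreshed from Redis periodically."""
    
    def __init__(self, name: str, redis_client,
                 config: CircuitBreakerConfig = None, sync_interval: float = 5.0):
        self.shared = DistributedCircuitBreaker(name, redis_client, config)
        self.sync_interval = sync_interval
        self._local_state = CircuitState.CLOSED
        self._last_sync = 0.0
    
    def get_state(self) -> CircuitState:
        """Return the locally cached state, re-reading Redis only every few seconds."""
        now = time.time()
        if now - self._last_sync >= self.sync_interval:
            self._local_state = self.shared.get_state()
            self._last_sync = now
        return self._local_state
    
    def record_failure(self):
        # Push failures to Redis immediately so other instances see them,
        # and refresh the local cache right away
        self.shared.record_failure()
        self._local_state = self.shared.get_state()
        self._last_sync = time.time()
    
    def record_success(self):
        self.shared.record_success()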


4.4 Problem 4: Hiding Real Problems

This is subtle but dangerous: circuit breakers can mask problems, making them harder to detect and fix.

Scenario: You deploy buggy code that always times out. Circuit breaker opens. Errors stop (circuit is open, not calling broken code). Alert clears! You think it's fixed, but it's not.

30 seconds later, circuit tests → fails → stays open. This repeats forever.

Your monitoring shows "service degraded" not "service broken." It might take hours before someone investigates.


Why this is dangerous:

  1. Error rate drops when circuit opens: Your "error rate" alert might clear because errors are being prevented, not fixed.

  2. The circuit periodically tests and fails: Every 30 seconds, one request fails. This looks like "occasional flakiness" not "complete outage."

  3. Users experience degraded service: They're seeing fallbacks or errors, but your alerts don't reflect the severity.

Real-world example:

Timeline of a hidden outage:

10:00 AM - Bad deploy breaks authentication service
10:01 AM - Circuit breaker opens, error rate drops
10:02 AM - On-call engineer sees "circuit open" but error rate is low
10:02 AM - Engineer thinks: "Hmm, weird, but errors are low. Maybe transient."
10:03 AM - Circuit tests, fails, stays open
10:30 AM - Circuit has been open for 30 minutes
10:30 AM - Engineer notices "average latency" is great (because most calls fail fast!)
11:00 AM - Customer complains about login issues
11:15 AM - Someone finally investigates the open circuit
11:20 AM - Bad deploy found and rolled back
11:25 AM - Service recovered

Total outage: 1 hour 25 minutes
Time to detection: 1 hour 15 minutes (!)


Solution: Alert on circuit state changes

You need alerts that fire when the circuit opens, not just when errors are high.

from prometheus_client import Counter, Gauge

circuit_state_gauge = Gauge(
    'circuit_breaker_state',
    'Current circuit breaker state (0=closed, 1=half-open, 2=open)',
    ['name']
)

circuit_state_changes = Counter(
    'circuit_breaker_state_changes_total',
    'Circuit breaker state transitions',
    ['name', 'from_state', 'to_state']
)

import logging

logger = logging.getLogger(__name__)

class ObservableCircuitBreaker(CircuitBreaker):
    """
    Circuit breaker with metrics and alerting.
    
    Assumes the base CircuitBreaker routes every state change through
    _transition_state() rather than assigning self.state directly.
    """
    
    def _transition_state(self, new_state: CircuitState):
        old_state = self.state
        self.state = new_state
        
        # Update metrics
        state_value = {'closed': 0, 'half_open': 1, 'open': 2}
        circuit_state_gauge.labels(name=self.name).set(state_value[new_state.value])
        
        circuit_state_changes.labels(
            name=self.name,
            from_state=old_state.value,
            to_state=new_state.value
        ).inc()
        
        # Log for alerting
        if new_state == CircuitState.OPEN:
            logger.error(
                f"Circuit breaker '{self.name}' OPENED",
                extra={'circuit': self.name, 'event': 'circuit_open'}
            )
        elif new_state == CircuitState.CLOSED and old_state == CircuitState.HALF_OPEN:
            logger.info(
                f"Circuit breaker '{self.name}' recovered",
                extra={'circuit': self.name, 'event': 'circuit_recovered'}
            )

Alert rules:

groups:
  - name: circuit_breakers
    rules:
      # Alert when any circuit is open
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state == 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is open"
          description: "The circuit has been open for over 1 minute. Investigate the downstream service."
          
      # Alert when circuit is flapping (sign of borderline failure)
      - alert: CircuitBreakerFlapping
        expr: increase(circuit_breaker_state_changes_total[10m]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is flapping"
          description: "The circuit has changed state {{ $value }} times in 10 minutes. The downstream service may be unstable."

Key insight: An open circuit is ALWAYS worth investigating. Even if it's working as intended (protecting your system), you want to know about it.


Part II: The Design Challenge

Chapter 5: Adding Circuit Breakers to Payment Service

5.1 Our Payment System So Far

From Days 1 and 2, we have:

class PaymentService:
    def process_payment(self, user_id, amount, idempotency_key):
        # Day 2: Check idempotency
        is_new, cached = self.idempotency.check_and_set(idempotency_key, {...})
        if not is_new:
            return cached
        
        # Day 1: Timeout budget
        budget = TimeoutBudget(4500)
        
        # Step 1: Fraud check
        fraud_result = self._check_fraud(budget, user_id, amount)
        
        # Step 2: Bank charge
        bank_result = self._charge_bank(budget, user_id, amount, idempotency_key)
        
        # Step 3: Notification
        self._send_notification(user_id, amount, bank_result.transaction_id)
        
        return bank_result

5.2 Adding Circuit Breakers

from typing import Optional
import asyncio

class PaymentServiceWithCircuitBreaker:
    """
    Payment service with all three reliability patterns:
    - Timeouts (Day 1)
    - Idempotency (Day 2)  
    - Circuit Breakers (Day 3)
    """
    
    def __init__(
        self,
        config: PaymentConfig,
        idempotency_store: IdempotencyStore,
        redis_client
    ):
        self.config = config
        self.idempotency = idempotency_store
        
        # Circuit breakers for each downstream service
        self.fraud_circuit = CircuitBreakerWithFallback(
            name="fraud_service",
            config=CircuitBreakerConfig(
                failure_threshold=5,
                failure_window=60,
                recovery_timeout=30,
            ),
            # If fraud service is down, use default medium risk
            fallback=lambda: {"risk_score": 0.5, "source": "fallback"}
        )
        
        self.bank_circuit = CircuitBreaker(
            name="bank_api",
            config=CircuitBreakerConfig(
                failure_threshold=3,     # More sensitive for payments
                failure_window=30,
                recovery_timeout=60,     # Wait longer before retry
            )
        )
        
        self.notification_circuit = CircuitBreakerWithFallback(
            name="notification_service",
            config=CircuitBreakerConfig(
                failure_threshold=10,    # Less sensitive, not critical
                failure_window=60,
                recovery_timeout=15,
            ),
            fallback=lambda: {"status": "queued"}  # Queue for later
        )
    
    async def process_payment(
        self,
        user_id: str,
        amount: float,
        idempotency_key: str
    ) -> PaymentResult:
        """
        Process payment with full reliability stack.
        """
        request_body = {'user_id': user_id, 'amount': amount}
        
        # =====================================================================
        # Layer 1: Idempotency (Day 2)
        # =====================================================================
        try:
            is_new, cached = await self.idempotency.check_and_claim(
                idempotency_key, request_body, user_id
            )
            if not is_new:
                return PaymentResult(**cached)
        except IdempotencyInProgressError:
            return PaymentResult(
                status=PaymentStatus.PENDING,
                error_message="Payment is being processed"
            )
        
        # =====================================================================
        # Layer 2: Timeout Budget (Day 1)
        # =====================================================================
        budget = TimeoutBudget(self.config.total_budget_ms)
        
        try:
            result = await self._process_with_circuit_breakers(
                budget, user_id, amount, idempotency_key
            )
            await self.idempotency.complete(idempotency_key, result.__dict__)
            return result
        
        except Exception as e:
            await self.idempotency.fail(idempotency_key, str(e))
            raise
    
    async def _process_with_circuit_breakers(
        self,
        budget: TimeoutBudget,
        user_id: str,
        amount: float,
        idempotency_key: str
    ) -> PaymentResult:
        """Process payment with circuit breakers on each step."""
        
        # =====================================================================
        # Step 1: Fraud Check (with fallback)
        # =====================================================================
        fraud_result = await self._check_fraud_with_circuit(budget, user_id, amount)
        
        if fraud_result.get('risk_score', 0) > 0.9:
            return PaymentResult(
                status=PaymentStatus.REJECTED,
                error_message="Transaction flagged as high risk"
            )
        
        # If using fallback (fraud service down), require additional verification
        if fraud_result.get('source') == 'fallback':
            if amount > 100:  # High value + no fraud check = reject
                return PaymentResult(
                    status=PaymentStatus.REJECTED,
                    error_message="Unable to verify transaction. Please try a smaller amount."
                )
        
        # =====================================================================
        # Step 2: Bank Charge (no fallback - critical)
        # =====================================================================
        try:
            bank_result = await self._charge_bank_with_circuit(
                budget, user_id, amount, idempotency_key
            )
        except CircuitOpenError:
            # Bank circuit is open - provide clear message
            return PaymentResult(
                status=PaymentStatus.SERVICE_UNAVAILABLE,
                error_message="Payment processing is temporarily unavailable."
            )
        
        if bank_result.status != PaymentStatus.SUCCESS:
            return bank_result
        
        # =====================================================================
        # Step 3: Notification (with fallback)
        # =====================================================================
        await self._notify_with_circuit(user_id, amount, bank_result.transaction_id)
        
        return bank_result

5.3 The Black Friday Scenario

Challenge: What if the circuit opens during Black Friday?

Black Friday, 2:00 PM:
  - 10,000 payment attempts per minute
  - Bank API under heavy load
  - Bank latency increases from 500ms to 3000ms
  - Some requests timeout
  
What happens with our circuit breaker?

With the count-based bank circuit from Section 5.2 (failure_threshold=3 within 30 seconds), even a small fraction of timeouts at 10,000 requests per minute produces 3 failures within seconds. The circuit opens and cuts off ALL payments, even though the vast majority of requests were succeeding.

The fix - smarter circuit breaker:

class SmartPaymentCircuitBreaker:
    """
    Circuit breaker designed for high-stakes, high-traffic scenarios.
    """
    
    def __init__(self):
        self.breaker = RateBasedCircuitBreaker(
            name="bank_api",
            failure_rate_threshold=0.5,  # Only open at 50% failure rate
            minimum_calls=100,            # Need 100 calls before deciding
            window_size=30,
            recovery_timeout=30,          # Shorter recovery time
        )
        
        # Track by error type - not all errors should count
        self.retryable_errors = (requests.Timeout, ConnectionError)
        self.non_retryable_errors = (PaymentDeclined, InsufficientFunds)
    
    def call(self, func):
        # Note: a full version would also fail fast here when the breaker
        # reports OPEN; this sketch focuses on classifying errors.
        try:
            result = func()
            self.breaker._record_call(success=True)
            return result
        
        except self.retryable_errors as e:
            # These indicate service problems - count them
            self.breaker._record_call(success=False)
            raise
        
        except self.non_retryable_errors as e:
            # These are business errors, not service problems
            # Don't count against circuit breaker
            self.breaker._record_call(success=True)  # Service worked, just said "no"
            raise

5.4 User Experience When Circuit Opens

When the bank circuit opens, what does the user see?

async def handle_payment_request(request):
    try:
        result = await payment_service.process_payment(
            user_id=request.user_id,
            amount=request.amount,
            idempotency_key=request.idempotency_key
        )
        return PaymentResponse(result)
    
    except CircuitOpenError:
        # Log for monitoring
        logger.warning("Payment circuit open, returning friendly error")
        
        return PaymentResponse(
            status="temporarily_unavailable",
            message="We're experiencing high demand. Your payment could not be processed.",
            retry_after=60,  # Tell client when to retry
            actions=[
                {
                    "type": "retry",
                    "label": "Try Again",
                    "delay_seconds": 60
                },
                {
                    "type": "alternative",
                    "label": "Pay with PayPal",
                    "url": "/checkout/paypal"
                },
                {
                    "type": "save",
                    "label": "Save Cart for Later",
                    "url": "/cart/save"
                }
            ]
        )

Part III: Comparing Resilience Patterns

Chapter 6: Circuit Breaker vs Retry vs Bulkhead

6.1 The Three Patterns

┌─────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│  RETRY                    CIRCUIT BREAKER           BULKHEAD            │
│                                                                          │
│  "Try again"              "Stop trying"             "Isolate failures"   │
│                                                                          │
│  Request fails            Too many failures         Limit resources      │
│       ↓                         ↓                   per dependency       │
│  Wait (backoff)           Stop calling                  ↓                │
│       ↓                         ↓                   If one fails,        │
│  Try again                Fail immediately          others continue      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Let's understand each pattern in depth:

Retry

What it does: When a request fails, wait a bit and try again.

The intuition: Many failures are transient. A network blip, a temporary overload, a GC pause. If you just try again, it often works.

Key components:

  • Max retries: How many times to try (usually 2-3)
  • Backoff: How long to wait between tries (exponential: 1s, 2s, 4s...)
  • Jitter: Random delay to prevent thundering herd

The danger: Retries amplify load. If a service is overloaded and 100 clients each retry 3 times, you've turned 100 requests into 300 requests — making the problem worse.
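
A minimal retry helper with exponential backoff and jitter might look like this (a sketch; which exception types count as transient is an assumption you'd adapt to your client library):

import random
import time

def call_with_retry(func, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a callable on transient errors with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            # Assumed transient - retry unless we've used up our attempts
            if attempt == max_retries - 1:
                raise
            # Exponential backoff (1s, 2s, 4s, ...) plus jitter to avoid
            # synchronized retries from many clients
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)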

Circuit Breaker

What it does: When a service is clearly broken, stop calling it entirely.

The intuition: If the last 10 calls all failed, the 11th will probably fail too. Why waste time and resources trying?

Key components:

  • Failure detection: Counting failures or measuring failure rate
  • State machine: CLOSED → OPEN → HALF-OPEN → CLOSED
  • Recovery testing: Periodically checking if the service has recovered

The danger: If configured poorly, circuit breakers can open during legitimate load spikes, cutting off healthy services.

Bulkhead

What it does: Isolates resources so that one slow/failing dependency can't consume everything.

The intuition: Think of a ship with watertight compartments. If one compartment floods, the others stay dry. The ship doesn't sink.

Key components:

  • Thread pools per dependency: Each downstream service gets its own pool
  • Semaphores: Limit concurrent calls to each service
  • Queue limits: Reject requests if queue is too long

The danger: Over-provisioning wastes resources. Under-provisioning causes false rejections.
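
A minimal bulkhead built on a semaphore per dependency could look like this (a sketch; the pool sizes are illustrative):

import threading

class Bulkhead:
    """Limit concurrent calls to one dependency so it can't exhaust every thread."""
    
    def __init__(self, name: str, max_concurrent: int = 10):
        self.name = name
        self.semaphore = threading.BoundedSemaphore(max_concurrent)
    
    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: if the pool is full, reject instead of queueing
        if not self.semaphore.acquire(blocking=False):
            raise RuntimeError(f"Bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            self.semaphore.release()

# Each downstream service gets its own pool
bank_bulkhead = Bulkhead("bank_api", max_concurrent=20)
fraud_bulkhead = Bulkhead("fraud_service", max_concurrent=10)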

6.2 When to Use Each

Pattern            Use When                                                 Don't Use When
────────────────────────────────────────────────────────────────────────────────────────────────────
Retry              Transient failures (network blip, temporary overload)    Service is clearly down
Circuit Breaker    Service is failing consistently                          Single request fails
Bulkhead           Want to prevent cascade failures                         All dependencies equally critical

Decision flowchart:

Request failed
     │
     ├─ Was it a transient error (timeout, 503)?
     │      YES → RETRY with backoff
     │      NO  → Don't retry (4xx, validation error)
     │
     ├─ Have many requests failed recently?
     │      YES → CIRCUIT BREAKER should open
     │      NO  → Keep trying (might be bad luck)
     │
     └─ Is this dependency isolated from others?
            NO  → Use BULKHEAD to isolate
            YES → Bulkhead already in place

6.3 Combined Strategy

In production, you typically use ALL THREE patterns together. They're not alternatives — they're layers of defense.

class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free capacity."""
    pass

class ResilientClient:
    """
    Client with all three patterns working together.
    """
    
    def __init__(
        self,
        name: str,
        max_concurrent: int = 10,      # Bulkhead
        max_retries: int = 3,          # Retry
        failure_threshold: int = 5,    # Circuit breaker
    ):
        self.name = name
        self.circuit = CircuitBreaker(name, CircuitBreakerConfig(
            failure_threshold=failure_threshold
        ))
        self.semaphore = asyncio.Semaphore(max_concurrent)  # Bulkhead
        self.max_retries = max_retries
    
    async def call(self, func: Callable, timeout: float = 5.0) -> Any:
        """Execute with all resilience patterns."""
        
        # Layer 1: Circuit Breaker (cheapest check first)
        if self.circuit.get_state() == CircuitState.OPEN:
            raise CircuitOpenError(f"{self.name} circuit is open")
        
        # Layer 2: Bulkhead (limit concurrent calls)
        try:
            await asyncio.wait_for(
                self.semaphore.acquire(),
                timeout=1.0
            )
        except asyncio.TimeoutError:
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        
        try:
            # Layer 3: Retry with backoff
            last_exception = None
            
            for attempt in range(self.max_retries):
                try:
                    result = await asyncio.wait_for(func(), timeout=timeout)
                    self.circuit.record_success()
                    return result
                
                except asyncio.TimeoutError as e:
                    last_exception = e
                    self.circuit.record_failure()
                    
                    if attempt < self.max_retries - 1:
                        delay = (2 ** attempt) + random.uniform(0, 1)
                        await asyncio.sleep(delay)
                
                except Exception as e:
                    last_exception = e
                    self.circuit.record_failure()
                    raise
            
            raise last_exception
        
        finally:
            self.semaphore.release()

Why this order matters:

  1. Circuit breaker first: It's the cheapest check (just reading state). If the circuit is open, we fail immediately without consuming bulkhead slots.

  2. Bulkhead second: Before we start waiting on the downstream service, we acquire a slot. This prevents one slow service from consuming all resources.

  3. Retry last: Within the bulkhead, we retry on transient failures. Each retry updates the circuit breaker state.


Part IV: Discussion and Trade-offs

Chapter 7: The Hard Questions

7.1 "What if the circuit opens during Black Friday?"

Strong Answer:

"This is a critical concern. A naive circuit breaker could turn a partial outage into a complete outage.

1. Use failure rate, not count: At 10x traffic, 1% failures = 10x more failures but same rate. Circuit stays closed.

2. Only count retryable errors: 'Card declined' is not a service failure. Don't count business rejections.

3. Have fallback payment paths: If primary fails, try backup processor.

4. Provide clear UX: If all paths fail: 'Try again in 60 seconds', 'Save cart', 'Alternative payment'."
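
As a sketch of points 1 and 2, using the ProductionCircuitBreaker and CircuitBreakerConfig from the appendix at the end of this post (CardDeclinedError is a hypothetical business exception):

class CardDeclinedError(Exception):
    """Business rejection, not a service failure."""

bank_circuit = ProductionCircuitBreaker(
    name="bank_api",
    config=CircuitBreakerConfig(
        failure_rate_threshold=0.5,     # a rate, not a count: survives 10x traffic
        minimum_calls=50,               # don't open on a handful of samples
        ignore_exceptions=(CardDeclinedError,),              # business rejections don't count
        record_exceptions=(TimeoutError, ConnectionError),   # only infrastructure failures do
    )
)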

7.2 "Circuit breaker vs retry vs bulkhead - when each?"

Strong Answer:

"Retry: Transient failures. Service generally healthy.

Circuit breaker: Persistent failures. Stop wasting resources.

Bulkhead: Isolate failures. One slow dependency shouldn't kill everything.

They work together:

  • Circuit breaker first (cheapest check)
  • Bulkhead second (limit concurrent)
  • Retry last (within the call)"

Chapter 8: Session Summary

What You've Learned This Week

Day     Pattern            Problem Solved
Day 1   Timeouts           Stop waiting forever for slow services
Day 2   Idempotency        Safe to retry without duplicate effects
Day 3   Circuit Breakers   Stop calling services that are clearly broken

How They Work Together

User clicks "Pay"
        ↓
[Idempotency Check] - Day 2
        ↓
[Circuit Breaker Check] - Day 3
        ↓
[Make Request with Timeout] - Day 1
        ↓
[Record Result]

Part V: Interview Questions

Chapter 9: Key Questions

Question 1: "Explain the circuit breaker pattern and its states."

Answer: "Three states: CLOSED (normal), OPEN (failing fast), HALF-OPEN (testing). Opens after threshold failures, closes after successful test. Converts slow failures to fast failures."

Question 2: "Count-based vs rate-based circuit breaker?"

Answer: "Count-based opens after N failures - breaks at high traffic. Rate-based opens at X% failure rate with minimum sample - scales properly."

Question 3: "When do circuit breakers cause harm?"

Answer: "Opening during load spikes, thundering herd on recovery, hiding real problems, inconsistent state across servers. Fix with rate-based thresholds, gradual recovery, alerting, distributed state."


Exercises

  1. Implement rate-based circuit breaker with gradual recovery
  2. Add circuit breakers to Day 2 payment service
  3. Create chaos tests for circuit breaker behavior



Appendix: Production Implementation

"""
Production-ready circuit breaker implementation.
Completes the reliability stack from Days 1-3.
"""

import time
import random
import asyncio
import logging
from enum import Enum
from dataclasses import dataclass
from typing import Callable, Any, Optional, TypeVar
from collections import deque
import threading
from prometheus_client import Counter, Gauge

# =============================================================================
# Metrics
# =============================================================================

circuit_state_gauge = Gauge(
    'circuit_breaker_state',
    'Current circuit state (0=closed, 1=half_open, 2=open)',
    ['name']
)

circuit_calls = Counter(
    'circuit_breaker_calls_total',
    'Circuit breaker call results',
    ['name', 'result']  # success, failure, rejected
)

circuit_state_changes = Counter(
    'circuit_breaker_state_changes_total',
    'Circuit breaker state transitions',
    ['name', 'from_state', 'to_state']
)

# =============================================================================
# Core Types
# =============================================================================

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitBreakerConfig:
    # Rate-based thresholds
    failure_rate_threshold: float = 0.5
    minimum_calls: int = 10
    
    # Time windows
    sliding_window_size: int = 60  # seconds
    recovery_timeout: int = 30     # seconds
    
    # Recovery behavior
    success_threshold: int = 3     # successes needed to close
    gradual_recovery: bool = True
    recovery_ramp_duration: int = 60  # seconds to reach 100%
    
    # Error classification
    record_exceptions: tuple = (Exception,)
    ignore_exceptions: tuple = ()

T = TypeVar('T')

# =============================================================================
# Production Circuit Breaker
# =============================================================================

class ProductionCircuitBreaker:
    """
    Production circuit breaker with:
    - Rate-based failure detection
    - Gradual recovery
    - Prometheus metrics
    - Distributed state option
    """
    
    def __init__(self, name: str, config: Optional[CircuitBreakerConfig] = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self.logger = logging.getLogger(f'circuit_breaker.{name}')
        
        # State
        self.state = CircuitState.CLOSED
        self.calls: deque = deque()  # (timestamp, success: bool)
        self.last_failure_time: Optional[float] = None
        self.half_open_successes = 0
        self.recovery_start_time: Optional[float] = None
        
        self.lock = threading.Lock()
        
        # Initialize metrics
        circuit_state_gauge.labels(name=name).set(0)
    
    def call(self, func: Callable[[], T]) -> T:
        """Execute function through circuit breaker."""
        
        with self.lock:
            self._maybe_transition_state()
            
            if self.state == CircuitState.OPEN:
                circuit_calls.labels(name=self.name, result='rejected').inc()
                retry_after = self._get_retry_after()
                raise CircuitOpenError(self.name, retry_after)
            
            if self.state == CircuitState.CLOSED and self._in_recovery():
                if not self._should_allow_during_recovery():
                    circuit_calls.labels(name=self.name, result='rejected').inc()
                    raise CircuitOpenError(self.name, retry_after=5)
        
        try:
            result = func()
            self._record_success()
            return result
        
        except self.config.ignore_exceptions:
            self._record_success()
            raise
        
        except self.config.record_exceptions:
            self._record_failure()
            raise
    
    def _maybe_transition_state(self):
        """Check if state should transition."""
        if self.state == CircuitState.OPEN:
            if self.last_failure_time:
                elapsed = time.time() - self.last_failure_time
                if elapsed >= self.config.recovery_timeout:
                    self._transition_to(CircuitState.HALF_OPEN)
    
    def _record_success(self):
        """Record successful call."""
        with self.lock:
            now = time.time()
            self.calls.append((now, True))
            self._clean_old_calls()
            
            circuit_calls.labels(name=self.name, result='success').inc()
            
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_successes += 1
                if self.half_open_successes >= self.config.success_threshold:
                    self._transition_to(CircuitState.CLOSED)
                    self._start_recovery()
    
    def _record_failure(self):
        """Record failed call."""
        with self.lock:
            now = time.time()
            self.last_failure_time = now
            self.calls.append((now, False))
            self._clean_old_calls()
            
            circuit_calls.labels(name=self.name, result='failure').inc()
            
            if self.state == CircuitState.HALF_OPEN:
                self._transition_to(CircuitState.OPEN)
                return
            
            if self.state == CircuitState.CLOSED:
                if self._should_open():
                    self._transition_to(CircuitState.OPEN)
    
    def _should_open(self) -> bool:
        """Check if circuit should open based on failure rate."""
        total = len(self.calls)
        
        if total < self.config.minimum_calls:
            return False
        
        failures = sum(1 for _, success in self.calls if not success)
        failure_rate = failures / total
        
        return failure_rate >= self.config.failure_rate_threshold
    
    def _clean_old_calls(self):
        """Remove calls outside sliding window."""
        cutoff = time.time() - self.config.sliding_window_size
        while self.calls and self.calls[0][0] < cutoff:
            self.calls.popleft()
    
    def _transition_to(self, new_state: CircuitState):
        """Transition to new state with logging and metrics."""
        old_state = self.state
        self.state = new_state
        
        state_values = {
            CircuitState.CLOSED: 0,
            CircuitState.HALF_OPEN: 1,
            CircuitState.OPEN: 2
        }
        
        circuit_state_gauge.labels(name=self.name).set(state_values[new_state])
        circuit_state_changes.labels(
            name=self.name,
            from_state=old_state.value,
            to_state=new_state.value
        ).inc()
        
        if new_state == CircuitState.OPEN:
            self.logger.warning(f"Circuit '{self.name}' OPENED")
        elif new_state == CircuitState.CLOSED:
            self.logger.info(f"Circuit '{self.name}' closed")
        elif new_state == CircuitState.HALF_OPEN:
            self.logger.info(f"Circuit '{self.name}' half-open, testing...")
            self.half_open_successes = 0
    
    def _start_recovery(self):
        """Start gradual recovery period."""
        if self.config.gradual_recovery:
            self.recovery_start_time = time.time()
    
    def _in_recovery(self) -> bool:
        """Check if in gradual recovery period."""
        if not self.recovery_start_time:
            return False
        
        elapsed = time.time() - self.recovery_start_time
        if elapsed >= self.config.recovery_ramp_duration:
            self.recovery_start_time = None
            return False
        
        return True
    
    def _should_allow_during_recovery(self) -> bool:
        """Probabilistically allow request during recovery."""
        elapsed = time.time() - self.recovery_start_time
        recovery_percentage = (elapsed / self.config.recovery_ramp_duration) * 100
        return random.random() * 100 < recovery_percentage
    
    def _get_retry_after(self) -> int:
        """Get seconds until retry might succeed."""
        if self.last_failure_time:
            elapsed = time.time() - self.last_failure_time
            remaining = self.config.recovery_timeout - elapsed
            return max(1, int(remaining))
        return self.config.recovery_timeout
    
    def get_state(self) -> CircuitState:
        """Get current circuit state."""
        with self.lock:
            self._maybe_transition_state()
            return self.state
    
    def get_stats(self) -> dict:
        """Get circuit breaker statistics."""
        with self.lock:
            total = len(self.calls)
            failures = sum(1 for _, success in self.calls if not success)
            
            return {
                'name': self.name,
                'state': self.state.value,
                'total_calls': total,
                'failures': failures,
                'failure_rate': failures / total if total > 0 else 0,
                'in_recovery': self._in_recovery(),
            }


class CircuitOpenError(Exception):
    """Raised when circuit breaker is open."""
    def __init__(self, name: str, retry_after: Optional[int] = None):
        self.name = name
        self.retry_after = retry_after
        super().__init__(f"Circuit breaker '{name}' is open")


# =============================================================================
# Complete Payment Service with All Patterns
# =============================================================================

class ResilientPaymentService:
    """
    Payment service demonstrating all three reliability patterns:
    - Day 1: Timeouts
    - Day 2: Idempotency  
    - Day 3: Circuit Breakers
    """
    
    def __init__(self, config, idempotency_store):
        self.config = config
        self.idempotency = idempotency_store
        
        # Different circuit breakers for different services
        self.fraud_circuit = ProductionCircuitBreaker(
            name="fraud_service",
            config=CircuitBreakerConfig(
                failure_rate_threshold=0.3,
                minimum_calls=20,
                recovery_timeout=30,
            )
        )
        
        self.bank_circuit = ProductionCircuitBreaker(
            name="bank_api",
            config=CircuitBreakerConfig(
                failure_rate_threshold=0.5,
                minimum_calls=50,
                recovery_timeout=60,
                success_threshold=3,
                gradual_recovery=True,
            )
        )
        
        self.notification_circuit = ProductionCircuitBreaker(
            name="notification_service",
            config=CircuitBreakerConfig(
                failure_rate_threshold=0.5,
                minimum_calls=10,
                recovery_timeout=15,
            )
        )
    
    async def process_payment(
        self,
        user_id: str,
        amount: float,
        idempotency_key: str
    ):
        """
        Process payment with complete reliability stack.
        
        Layer 1: Idempotency (Day 2) - prevent duplicates
        Layer 2: Timeout Budget (Day 1) - don't wait forever
        Layer 3: Circuit Breakers (Day 3) - fail fast on broken services
        """
        
        # Check idempotency first
        is_new, cached = await self.idempotency.check_and_claim(
            idempotency_key,
            {'user_id': user_id, 'amount': amount}
        )
        
        if not is_new:
            return cached
        
        # Create timeout budget (TimeoutBudget is the budget class from Day 1)
        budget = TimeoutBudget(self.config.total_budget_ms)
        
        try:
            # Process with circuit breakers
            result = await self._process(budget, user_id, amount, idempotency_key)
            await self.idempotency.complete(idempotency_key, result)
            return result
        
        except Exception as e:
            await self.idempotency.fail(idempotency_key, str(e))
            raise
    
    async def _process(self, budget, user_id, amount, idempotency_key):
        """Internal processing with all patterns."""
        
        # Fraud check with fallback.
        # NOTE: ProductionCircuitBreaker.call() is synchronous, so _check_fraud,
        # _charge_bank and _send_notification are assumed to be blocking helpers
        # here; wrapping coroutines would need an async-aware call method
        # (see the sketch after this listing).
        try:
            fraud_result = self.fraud_circuit.call(
                lambda: self._check_fraud(budget, user_id, amount)
            )
        except CircuitOpenError:
            # Use fallback risk score
            fraud_result = {"risk_score": 0.5, "fallback": True}
        
        if fraud_result.get('risk_score', 0) > 0.9:
            return {'status': 'rejected', 'reason': 'high_risk'}
        
        # Bank charge - no fallback, critical
        try:
            bank_result = self.bank_circuit.call(
                lambda: self._charge_bank(budget, user_id, amount, idempotency_key)
            )
        except CircuitOpenError as e:
            return {
                'status': 'unavailable',
                'reason': 'payment_service_unavailable',
                'retry_after': e.retry_after
            }
        
        # Notification - with fallback
        try:
            self.notification_circuit.call(
                lambda: self._send_notification(user_id, amount, bank_result['transaction_id'])
            )
        except CircuitOpenError:
            # Queue for later - not critical
            pass
        
        return {
            'status': 'success',
            'transaction_id': bank_result['transaction_id']
        }
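
The call() method above wraps synchronous functions; an asyncio service whose downstream helpers are coroutines would need an async-aware entry point. A minimal sketch of such an adapter (it reuses the breaker's internal record methods for brevity, and is not part of the implementation above):

async def call_async(breaker: ProductionCircuitBreaker, coro_factory):
    """Run an async callable through the synchronous breaker's state machine."""
    if breaker.get_state() == CircuitState.OPEN:
        raise CircuitOpenError(breaker.name, breaker._get_retry_after())
    try:
        result = await coro_factory()
        breaker._record_success()
        return result
    except breaker.config.ignore_exceptions:
        breaker._record_success()
        raise
    except breaker.config.record_exceptions:
        breaker._record_failure()
        raise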

Further Reading

  • "Release It!" by Michael Nygard: The original circuit breaker pattern
  • Netflix Hystrix Wiki: Detailed implementation guide (now maintenance mode)
  • resilience4j Documentation: Modern Java circuit breaker library
  • Microsoft Azure Architecture Center: Circuit Breaker pattern
  • Martin Fowler's Blog: CircuitBreaker article

End of Day 3: Circuit Breakers

Tomorrow: Day 4 — Webhook Delivery System. We shift from synchronous request/response to asynchronous event delivery. How do you guarantee delivery to external systems you don't control? What happens when a receiver is down for hours?