Week 2 — Day 3: Circuit Breakers
System Design Mastery Series
Preface
Over the past two days, we've built increasingly resilient payment systems:
Day 1: We added timeouts so slow services don't hang our system forever. Day 2: We added idempotency so retries don't cause duplicate charges.
But there's still a problem. Let me show you:
Black Friday, 10:00 AM:
Bank API starts failing (50% error rate)
Your payment service:
Request 1: Call bank → timeout after 3.5s → retry → timeout → fail
Request 2: Call bank → timeout after 3.5s → retry → timeout → fail
Request 3: Call bank → timeout after 3.5s → retry → timeout → fail
... (1000 concurrent requests)
Each request waits 3.5s before failing.
Your threads are all blocked waiting.
Users see 3.5s delays followed by errors.
You're hammering an already struggling bank with retries.
The bank gets even more overloaded.
You're doing everything "right" — timeouts, retries, idempotency — but you're still:
- Wasting resources waiting for a service that's clearly broken
- Making the broken service worse by continuing to call it
- Giving users a terrible experience (wait 3.5s just to see an error)
Circuit breakers solve this.
When a service is clearly failing, stop calling it. Fail immediately. Give the user an error in 10ms instead of 3500ms. Give the struggling service a chance to recover without your traffic piling on.
The Philosophy Behind Circuit Breakers
Think about what happens in a cascade failure:
- Service A depends on Service B
- Service B becomes slow (not dead, just slow)
- Service A keeps calling Service B, waiting for each timeout
- Service A's threads fill up waiting
- Service A becomes slow
- Services that depend on A start slowing down
- The entire system grinds to a halt
The cruel irony? Service B might recover in 30 seconds. But by then, you've created a system-wide outage that takes 30 minutes to fix.
Circuit breakers are about failing fast and protecting the system. They embody a key principle of distributed systems:
"It's better to give users a fast error than a slow error."
A user who sees "Service temporarily unavailable" in 50ms can retry, use an alternative, or come back later. A user who waits 30 seconds for a timeout has wasted their time and patience.
Circuit breakers also protect the failing service. When you keep hammering a struggling service with requests, you're preventing it from recovering. By backing off, you give it breathing room.
Part I: Foundations
Chapter 1: The Circuit Breaker Pattern
1.1 The Electrical Analogy
The pattern is named after electrical circuit breakers in your home:
Normal operation:
Power flows through circuit breaker
Everything works
Electrical fault (short circuit):
Too much current flows
Circuit breaker TRIPS (opens)
Power is cut off
House doesn't burn down
After repair:
You manually reset the breaker
Power flows again
Software circuit breakers work the same way:
Normal operation:
Requests flow through to downstream service
Everything works
Service fault (errors, timeouts):
Too many failures
Circuit breaker TRIPS (opens)
Requests fail immediately (don't call downstream)
System doesn't collapse
After recovery:
Circuit breaker tests if service is back
If healthy, traffic flows again
1.2 The Three States
┌─────────────────────────────────────────────────────────────────────────┐
│ Circuit Breaker State Machine │
│ │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ │ failure_count │ │ │
│ │ CLOSED │ > threshold │ OPEN │ │
│ │ │ ─────────────────▶ │ │ │
│ │ (Normal flow) │ │ (Fail fast) │ │
│ │ │ │ │ │
│ └────────┬─────────┘ └────────┬─────────┘ │
│ │ │ │
│ │ │ │
│ │ success timeout │ │
│ │ expires │ │
│ │ │ │
│ │ ▼ │
│ │ ┌──────────────────┐ │
│ │ │ │ │
│ │ │ HALF-OPEN │ │
│ └─────────────────────────────▶│ │ │
│ success │ (Testing) │ │
│ │ │ │
│ └────────┬─────────┘ │
│ │ │
│ │ failure │
│ │ │
│ ▼ │
│ Back to OPEN │
│ │
└─────────────────────────────────────────────────────────────────────────┘
CLOSED: Normal operation. Requests pass through. Failures are counted.
When failures exceed threshold → transition to OPEN.
OPEN: Failing fast. Requests immediately rejected without calling
downstream. After timeout → transition to HALF-OPEN.
HALF-OPEN: Testing. Allow ONE request through.
If success → CLOSED
If failure → back to OPEN
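To make the diagram concrete before we build a full implementation in Chapter 2, here is a minimal sketch of just the transition rules as a lookup table. The event names are illustrative; in a real breaker the triggers are failure counters and timers, as we'll see shortly.
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

# (current state, event) -> next state, mirroring the diagram above.
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_exceeded"): State.OPEN,
    (State.OPEN, "recovery_timeout_expired"): State.HALF_OPEN,
    (State.HALF_OPEN, "probe_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}

def next_state(current: State, event: str) -> State:
    """Events not listed leave the state unchanged."""
    return TRANSITIONS.get((current, event), current)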
Let's understand each state in depth:
CLOSED State (The Happy Path)
In the CLOSED state, the circuit breaker is invisible. Every request passes through to the downstream service. But silently, the circuit breaker is watching:
- It records every failure (timeouts, errors, exceptions)
- It maintains a sliding window of recent calls
- It calculates the failure rate or count
Think of it like a watchful guardian. Everything seems normal, but it's ready to act the moment things go wrong.
Key question: What counts as a "failure"?
- Timeouts: Yes, always
- 5xx errors: Yes, server is having problems
- 4xx errors: Usually no — that's the client's fault, not the server's
- Connection refused: Yes, server is unreachable
- Business exceptions (e.g., "insufficient funds"): No — the service is working correctly
This distinction is crucial. A payment being declined is NOT a failure of the payment service. The service correctly evaluated the request and said "no." Counting business rejections as failures would cause your circuit to open when nothing is actually wrong.
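To make the classification concrete, here is a minimal sketch of a failure-classifier predicate that a circuit breaker could consult. It assumes the requests library for the HTTP error types; the function name and the exact exception types are illustrative and depend on your client and your domain exceptions.
import requests

def is_circuit_failure(exc: Exception) -> bool:
    """Should this exception count against the circuit? (illustrative sketch)"""
    # Timeouts and connection problems: the service is slow or unreachable
    if isinstance(exc, (requests.Timeout, requests.ConnectionError)):
        return True
    # HTTP errors: only 5xx indicates a server-side problem
    if isinstance(exc, requests.HTTPError) and exc.response is not None:
        return exc.response.status_code >= 500
    # 4xx responses and business exceptions ("insufficient funds") mean
    # the service worked correctly - don't count them
    return False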
OPEN State (The Protection Mode)
When failures cross the threshold, the circuit "trips" and enters the OPEN state. This is the protection mode:
- No requests reach the downstream service — they fail immediately
- Response time drops from seconds to milliseconds — fail-fast in action
- The downstream service gets breathing room — no traffic to overwhelm it
- Your resources are freed — threads aren't blocked waiting
The OPEN state has a timer. After a configured duration (typically 30-60 seconds), the circuit transitions to HALF-OPEN to test if the service has recovered.
Why not stay OPEN forever?
Services recover. Networks heal. Bugs get fixed. If we never tested, we'd permanently cut off a service that's been healthy for hours. The timeout ensures we eventually try again.
Why not transition immediately?
If we tested immediately, we'd never give the service time to recover. The timeout is the "cooling off" period.
HALF-OPEN State (The Testing Mode)
This is the careful, probing state. The circuit breaker allows a LIMITED number of requests through (often just one) to test if the service has recovered.
- If the test succeeds: The service is back! Transition to CLOSED.
- If the test fails: Still broken. Go back to OPEN and wait again.
The HALF-OPEN state is critical for preventing oscillation (rapidly switching between OPEN and CLOSED). By requiring success before closing, we ensure the service is actually healthy.
Some implementations require multiple successes in HALF-OPEN before transitioning to CLOSED. This provides extra confidence that recovery is real, not just a lucky request.
1.3 Why This Works
Without circuit breaker:
Time     Bank Status   Your Behavior                  Result
────────────────────────────────────────────────────────────
0:00     Healthy       Call bank, 100ms response      Success
1:00     Failing       Call bank, 3.5s timeout        Fail
2:00     Failing       Call bank, 3.5s timeout        Fail
3:00     Failing       Call bank, 3.5s timeout        Fail
...
10:00    Recovering    Call bank, 3.5s timeout        Fail
11:00    Healthy       Call bank, 100ms response      Success
Total: 10 minutes of terrible experience
Each failed request: 3.5 seconds of waiting before the error
With circuit breaker:
Time Bank Status Circuit State Your Behavior Result
─────────────────────────────────────────────────────────────────────────
0:00 Healthy CLOSED Call bank, 100ms Success
0:01 Failing CLOSED Call bank, 3.5s timeout Fail (count: 1)
0:02 Failing CLOSED Call bank, 3.5s timeout Fail (count: 2)
0:03 Failing CLOSED Call bank, 3.5s timeout Fail (count: 3)
0:04 Failing CLOSED Call bank, 3.5s timeout Fail (count: 4)
0:05 Failing CLOSED Call bank, 3.5s timeout Fail (count: 5)
[5 failures in 60s → threshold exceeded → OPEN]
0:06 Failing OPEN Fail immediately, 10ms Fail
0:07 Failing OPEN Fail immediately, 10ms Fail
...
0:35 Recovering OPEN [30s timeout expires → HALF-OPEN]
0:35 Recovering HALF-OPEN Test: call bank Success!
[Success in HALF-OPEN → CLOSED]
0:36 Healthy CLOSED Call bank, 100ms Success
Total: 35 seconds of degraded experience (vs 10 minutes)
Most failures: 10ms instead of 3.5s
Note that the bank also recovers sooner in this scenario: because the circuit stopped hammering it after the first 5 failures, the struggling service gets breathing room and comes back in about 30 seconds instead of staying overloaded for minutes.
Let's break down the math:
Without circuit breaker:
- 10 minutes of failures = 600 seconds
- Each failed attempt costs the user 3.5 seconds of waiting
- If 100 affected users keep retrying throughout the outage (roughly 100 attempts each, about 10,000 attempts in total), that's ~35,000 seconds of wasted user time
With circuit breaker:
- 5 initial failures × 3.5s = 17.5 seconds of slow failures
- 30 seconds of fast failures (10ms each) while OPEN
- Total: ~35 seconds of degraded experience
- Fast failures mean users can retry quickly or see a helpful error message
The "fail fast" principle in action:
When the circuit is OPEN, you're not making anything worse — you're making it better:
- Users get immediate feedback instead of hanging
- Your server threads are free for other work
- The struggling downstream service isn't being hammered
- You can show meaningful error messages ("Try again in 30 seconds")
A critical insight:
Notice that the circuit breaker didn't prevent failures. Users still couldn't complete their payments during the outage. But it dramatically reduced the IMPACT of those failures:
- Faster feedback
- Better user experience
- System stability maintained
- Faster recovery once the downstream service heals
1.4 A Real-World Analogy: The Restaurant
You discover a new restaurant.
Visit 1: Great food, 30 min wait. (Success)
Visit 2: Okay food, 45 min wait. (Success)
Visit 3: Food poisoning! (Failure)
Visit 4: They messed up your order. (Failure)
Visit 5: 2 hour wait, then wrong order. (Failure)
Your mental circuit breaker trips:
"I'm not going back there."
For the next few months:
Friend: "Want to try that restaurant?"
You: "No way." (Fail fast - you don't even try)
Six months later (timeout expires):
You: "Maybe they've improved? Let me try once." (Half-open)
Visit 6 (test):
Great food, 20 min wait! (Success)
Your circuit closes:
You start going there again.
This analogy reveals something profound: humans naturally implement circuit breakers in their daily lives. We stop going to unreliable stores. We stop calling friends who never answer. We stop using apps that always crash.
The software circuit breaker pattern is just formalizing this intuitive behavior into code.
Key insights from the analogy:
- Threshold matters: One bad meal might not trip your circuit. Three in a row probably will. The threshold should match the severity of failures.
- Time heals: You don't blacklist the restaurant forever. After enough time (the recovery timeout), you're willing to give it another chance.
- One test is enough: You don't need 10 good visits to trust the restaurant again. One successful test visit (HALF-OPEN) is usually enough to restore confidence.
- Context matters: A bad experience at a fast-food joint has a lower threshold than a bad experience at a fine-dining restaurant. Similarly, critical services might have lower failure thresholds than non-critical ones.
Chapter 2: Implementing a Circuit Breaker
Now that we understand the concept, let's build one. We'll start with a basic implementation and then explore more sophisticated patterns.
2.1 The Core Components
Before diving into code, let's identify what a circuit breaker needs:
- State tracking: Know if we're CLOSED, OPEN, or HALF-OPEN
- Failure counting: Track failures within a time window
- Threshold checking: Know when to open the circuit
- Timing: Know when to transition from OPEN to HALF-OPEN
- Success tracking: Know when to close the circuit from HALF-OPEN
- Thread safety: Handle concurrent requests correctly
2.2 Basic Implementation
import time
import threading
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Any, Optional
from collections import deque
class CircuitState(Enum):
CLOSED = "closed"
OPEN = "open"
HALF_OPEN = "half_open"
@dataclass
class CircuitBreakerConfig:
"""Configuration for circuit breaker behavior."""
# How many failures before opening
failure_threshold: int = 5
# Time window for counting failures (seconds)
failure_window: int = 60
# How long to stay open before testing (seconds)
recovery_timeout: int = 30
# How many successes needed in half-open to close
success_threshold: int = 1
# What exceptions count as failures
failure_exceptions: tuple = (Exception,)
class CircuitBreaker:
"""
Circuit breaker implementation.
Wraps calls to external services and fails fast when the service
is detected as unhealthy.
"""
def __init__(self, name: str, config: CircuitBreakerConfig = None):
self.name = name
self.config = config or CircuitBreakerConfig()
self.state = CircuitState.CLOSED
self.failures = deque() # Timestamps of recent failures
self.last_failure_time: Optional[float] = None
self.half_open_successes = 0
self.lock = threading.Lock()
def call(self, func: Callable, *args, **kwargs) -> Any:
"""
Execute function through circuit breaker.
Raises CircuitOpenError if circuit is open.
"""
with self.lock:
self._check_state_transition()
if self.state == CircuitState.OPEN:
raise CircuitOpenError(
f"Circuit breaker '{self.name}' is open"
)
try:
result = func(*args, **kwargs)
self._record_success()
return result
except self.config.failure_exceptions as e:
self._record_failure()
raise
def _check_state_transition(self):
"""Check if we should transition states."""
if self.state == CircuitState.OPEN:
# Check if recovery timeout has passed
if self.last_failure_time:
elapsed = time.time() - self.last_failure_time
if elapsed >= self.config.recovery_timeout:
self.state = CircuitState.HALF_OPEN
self.half_open_successes = 0
def _record_success(self):
"""Record successful call."""
with self.lock:
if self.state == CircuitState.HALF_OPEN:
self.half_open_successes += 1
if self.half_open_successes >= self.config.success_threshold:
self.state = CircuitState.CLOSED
self.failures.clear()
def _record_failure(self):
"""Record failed call."""
with self.lock:
now = time.time()
self.last_failure_time = now
if self.state == CircuitState.HALF_OPEN:
# Single failure in half-open → back to open
self.state = CircuitState.OPEN
return
if self.state == CircuitState.CLOSED:
# Add failure to window
self.failures.append(now)
# Remove old failures outside window
cutoff = now - self.config.failure_window
while self.failures and self.failures[0] < cutoff:
self.failures.popleft()
# Check if threshold exceeded
if len(self.failures) >= self.config.failure_threshold:
self.state = CircuitState.OPEN
def get_state(self) -> CircuitState:
"""Get current circuit state."""
with self.lock:
self._check_state_transition()
return self.state
class CircuitOpenError(Exception):
"""Raised when circuit breaker is open."""
pass
Understanding the implementation:
Let's walk through the key design decisions:
- Why use a deque for failures? A deque (double-ended queue) is perfect for sliding window implementations. We append new failures to the right and remove old ones from the left. Both operations are O(1).
- Why use a lock? Circuit breakers are called from multiple threads simultaneously. Without the lock, we could have race conditions where two threads both see 4 failures and both increment to 5, or where the state transitions happen inconsistently.
- Why check state transition at the start of each call? The transition from OPEN to HALF-OPEN is time-based. By checking at call time, we lazily evaluate the transition only when needed, rather than running a background timer.
- Why re-raise the exception after recording failure? The circuit breaker shouldn't hide errors from the caller. It records the failure for its statistics, but the caller still needs to handle the actual exception.
2.3 Using the Circuit Breaker
Let's see how to integrate the circuit breaker with a real service call:
# Create circuit breaker for bank API
bank_circuit = CircuitBreaker(
name="bank_api",
config=CircuitBreakerConfig(
failure_threshold=5, # Open after 5 failures
failure_window=60, # Within 60 seconds
recovery_timeout=30, # Try again after 30 seconds
success_threshold=2, # Need 2 successes to close
)
)
def call_bank_api(user_id: str, amount: float) -> dict:
"""Call bank API through circuit breaker."""
def _make_request():
response = requests.post(
BANK_API_URL,
json={'user_id': user_id, 'amount': amount},
timeout=3.5
)
response.raise_for_status()
return response.json()
try:
return bank_circuit.call(_make_request)
except CircuitOpenError:
# Circuit is open - fail fast with fallback
raise PaymentServiceUnavailable(
"Payment processing is temporarily unavailable. "
"Please try again in a few minutes."
)
except requests.Timeout:
# Will be recorded as failure by circuit breaker
raise PaymentTimeout("Payment processing timed out")
except requests.HTTPError as e:
# Will be recorded as failure by circuit breaker
raise PaymentFailed(f"Payment failed: {e}")
2.4 Circuit Breaker as Decorator
from functools import wraps
def circuit_breaker(
name: str,
failure_threshold: int = 5,
failure_window: int = 60,
recovery_timeout: int = 30
):
"""Decorator to wrap function with circuit breaker."""
config = CircuitBreakerConfig(
failure_threshold=failure_threshold,
failure_window=failure_window,
recovery_timeout=recovery_timeout,
)
breaker = CircuitBreaker(name, config)
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
return breaker.call(func, *args, **kwargs)
wrapper.circuit_breaker = breaker # Expose for monitoring
return wrapper
return decorator
# Usage
@circuit_breaker("bank_api", failure_threshold=5, recovery_timeout=30)
def charge_bank(user_id: str, amount: float) -> dict:
response = requests.post(BANK_API_URL, ...)
return response.json()
# Calling the function - circuit breaker handles failures
try:
result = charge_bank("user_123", 99.00)
except CircuitOpenError:
# Handle gracefully
return "Payment service temporarily unavailable"
Chapter 3: Advanced Circuit Breaker Patterns
The basic circuit breaker works, but real-world systems need more sophisticated approaches. Let's explore the patterns used by production systems at scale.
3.1 Sliding Window vs Count-Based
There are three main approaches to deciding when to open a circuit:
Count-based (simple):
Open after N consecutive failures.
Problem: 5 failures over 24 hours shouldn't open circuit.
This is the simplest approach but has a major flaw: it doesn't consider TIME. Five failures in 5 seconds is very different from five failures in 24 hours.
Time-based sliding window (better):
Open after N failures within T seconds.
Failures "expire" after window passes.
This adds the concept of a sliding window. We only count failures that happened recently. A failure from an hour ago doesn't affect the current state. This is what our basic implementation uses.
Rate-based (most sophisticated):
Open when failure rate exceeds X% within window.
Better handles high-throughput services.
This is the gold standard for high-traffic systems. Instead of counting failures, we calculate the failure RATE. This makes the circuit breaker scale-independent.
Why rate-based is important:
Consider a service handling 1000 requests/second:
- Normal failure rate: 1%
- That's 10 failures per second, 600 per minute
With count-based (threshold=5):
- Circuit opens in 0.5 seconds!
- But the service is 99% healthy!
With rate-based (threshold=50%, minimum=100 calls):
- We check: 10 failures out of 1000 = 1%
- 1% < 50%
- Circuit stays closed (correct!)
The minimum_calls parameter is crucial — it prevents the circuit from opening on small sample sizes. One failure out of 2 calls is 50%, but that's not statistically significant.
class RateBasedCircuitBreaker:
"""Circuit breaker that uses failure rate, not count."""
def __init__(
self,
name: str,
failure_rate_threshold: float = 0.5, # 50% failure rate
minimum_calls: int = 10, # Need at least 10 calls
window_size: int = 60, # 60 second window
recovery_timeout: int = 30
):
self.name = name
self.failure_rate_threshold = failure_rate_threshold
self.minimum_calls = minimum_calls
self.window_size = window_size
self.recovery_timeout = recovery_timeout
self.calls = deque() # (timestamp, success: bool)
self.state = CircuitState.CLOSED
self.last_failure_time = None
self.lock = threading.Lock()
def _get_stats(self) -> tuple[int, int]:
"""Get success and failure counts in current window."""
now = time.time()
cutoff = now - self.window_size
# Remove old entries
while self.calls and self.calls[0][0] < cutoff:
self.calls.popleft()
successes = sum(1 for _, success in self.calls if success)
failures = sum(1 for _, success in self.calls if not success)
return successes, failures
def _record_call(self, success: bool):
"""Record a call result."""
with self.lock:
now = time.time()
self.calls.append((now, success))
if not success:
self.last_failure_time = now
if self.state == CircuitState.HALF_OPEN:
if success:
self.state = CircuitState.CLOSED
self.calls.clear()
else:
self.state = CircuitState.OPEN
return
if self.state == CircuitState.CLOSED:
successes, failures = self._get_stats()
total = successes + failures
if total >= self.minimum_calls:
failure_rate = failures / total
if failure_rate >= self.failure_rate_threshold:
self.state = CircuitState.OPEN
3.2 Per-Operation Circuit Breakers
Different operations might have different failure characteristics:
class MultiCircuitBreaker:
"""Manage multiple circuit breakers for different operations."""
def __init__(self):
self.breakers: dict[str, CircuitBreaker] = {}
self.lock = threading.Lock()
def get_breaker(self, operation: str, config: CircuitBreakerConfig = None) -> CircuitBreaker:
"""Get or create circuit breaker for operation."""
with self.lock:
if operation not in self.breakers:
self.breakers[operation] = CircuitBreaker(
name=operation,
config=config or CircuitBreakerConfig()
)
return self.breakers[operation]
# Usage
breakers = MultiCircuitBreaker()
# Different thresholds for different operations
charge_breaker = breakers.get_breaker("bank_charge", CircuitBreakerConfig(
failure_threshold=3, # Payments are critical - open fast
recovery_timeout=60 # But wait longer to recover
))
refund_breaker = breakers.get_breaker("bank_refund", CircuitBreakerConfig(
failure_threshold=10, # Refunds less critical
recovery_timeout=30
))
3.3 Circuit Breaker with Fallback
One of the most powerful patterns is combining circuit breakers with fallbacks. When the circuit is open, instead of just failing, we return a sensible default value.
When to use fallbacks:
| Service | Fallback Strategy |
|---|---|
| Fraud check | Return medium risk score (allow with extra verification) |
| Recommendation engine | Return popular items instead of personalized |
| User preferences | Return defaults |
| Analytics/Tracking | Skip silently |
| Price calculation | Return cached prices |
When NOT to use fallbacks:
| Service | Why No Fallback |
|---|---|
| Payment processing | Can't fake a payment |
| Authentication | Can't fake a login |
| Inventory check | Can't fake stock levels |
| Legal compliance | Can't skip compliance checks |
The key question: "Is a degraded experience better than no experience?"
For fraud checking: Yes. A 0.5 risk score is better than failing all payments. For payments: No. You can't pretend a payment succeeded.
from typing import Callable, TypeVar, Optional
T = TypeVar('T')
class CircuitBreakerWithFallback:
"""Circuit breaker that can return fallback value when open."""
def __init__(
self,
name: str,
config: CircuitBreakerConfig = None,
fallback: Optional[Callable[[], T]] = None
):
self.breaker = CircuitBreaker(name, config)
self.fallback = fallback
def call(
self,
func: Callable[[], T],
fallback: Optional[Callable[[], T]] = None
) -> T:
"""
Execute function with circuit breaker and optional fallback.
If circuit is open and fallback is provided, return fallback value
instead of raising exception.
"""
try:
return self.breaker.call(func)
except CircuitOpenError:
effective_fallback = fallback or self.fallback
if effective_fallback:
return effective_fallback()
raise
# Usage
fraud_circuit = CircuitBreakerWithFallback(
name="fraud_service",
config=CircuitBreakerConfig(failure_threshold=5),
fallback=lambda: {"risk_score": 0.5, "fallback": True} # Default medium risk
)
def check_fraud(user_id: str, amount: float) -> dict:
def _call():
return fraud_service.check(user_id, amount)
# If fraud service is down, return default risk score
# This allows payments to continue with extra verification
return fraud_circuit.call(_call)
Important: When using a fallback, always mark the response so downstream code knows it's a fallback. In the example above, we include "fallback": True in the response. This allows the calling code to adjust behavior (e.g., require additional verification for payments made with fallback fraud scores).
Chapter 4: When Circuit Breakers Cause Harm
Circuit breakers are powerful, but like any tool, they can cause problems when misused. Understanding these failure modes is crucial for production systems.
The paradox of circuit breakers:
A circuit breaker is designed to protect your system from failures. But a poorly configured circuit breaker can CAUSE failures — by cutting off healthy services or creating oscillation patterns.
Let's explore the common pitfalls:
4.1 Problem 1: Opening During Legitimate Load Spikes
This is the most common and most devastating problem with circuit breakers.
Black Friday scenario:
Normal day: 100 requests/minute to bank, 1% timeout
Circuit breaker: threshold=5 failures in 60s → won't trigger
Black Friday: 10,000 requests/minute to bank, 1% timeout
1% of 10,000 = 100 timeouts per minute!
Circuit breaker: 5 failures within the first few seconds → OPENS
Result: You've cut off all payments during your busiest day!
The bank isn't failing more — you just have more traffic,
so you see more of the normal 1% failures.
Why this happens:
Count-based thresholds are absolute numbers. At low traffic, "5 failures in 60 seconds" is significant. At high traffic, it's noise.
Think of it this way:
- 5 failures out of 100 requests = 5% failure rate (concerning!)
- 5 failures out of 10,000 requests = 0.05% failure rate (excellent!)
The count-based circuit breaker can't tell the difference.
Solution: Use failure RATE, not failure COUNT
# Bad: Opens on 5 failures (broken at high traffic)
CircuitBreakerConfig(failure_threshold=5)
# Good: Opens on 50% failure rate with minimum sample
RateBasedCircuitBreaker(
failure_rate_threshold=0.5, # 50% failure rate
minimum_calls=100, # Need 100 calls before deciding
)
How to choose the rate threshold:
- Normal services: 50% (open when half of requests fail)
- Critical services: 30% (more sensitive)
- Best-effort services: 80% (very tolerant)
The minimum_calls prevents premature decisions. You need enough data to calculate a meaningful rate.
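As a rough sketch of how those profiles might look using the RateBasedCircuitBreaker from Section 3.1 (the breaker names and exact numbers here are illustrative, not prescriptive):
# Illustrative profiles - tune the numbers to your own traffic patterns
normal_breaker = RateBasedCircuitBreaker(
    name="recommendations",
    failure_rate_threshold=0.5,  # open when half of requests fail
    minimum_calls=100,           # need a meaningful sample before deciding
)
critical_breaker = RateBasedCircuitBreaker(
    name="bank_api",
    failure_rate_threshold=0.3,  # more sensitive for critical services
    minimum_calls=100,
)
best_effort_breaker = RateBasedCircuitBreaker(
    name="analytics",
    failure_rate_threshold=0.8,  # very tolerant for best-effort services
    minimum_calls=100,
)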
4.2 Problem 2: Thundering Herd on Recovery
The "thundering herd" is a classic distributed systems problem that circuit breakers can accidentally trigger.
Circuit opens → all requests fail fast
30 seconds later → circuit goes HALF-OPEN
One test request succeeds → circuit CLOSES
IMMEDIATELY: 10,000 queued requests hit the recovering service
Service crashes again → circuit opens again
This can create an oscillation:
OPEN → HALF-OPEN → CLOSED → instant overload → OPEN
Why this happens:
During the OPEN period, requests are accumulating. Users are retrying. Background jobs are queueing. The moment the circuit closes, ALL of this backed-up demand floods the recovering service.
It's like a traffic jam clearing: the moment the road opens, everyone accelerates — and causes another jam.
The oscillation pattern:
Time State Traffic to service
────────────────────────────────────────
0:00 CLOSED 1000 req/s (normal)
0:01 CLOSED 1000 req/s
0:02 OPEN 0 req/s (failing fast)
0:03 OPEN 0 req/s
0:32 HALF-OPEN 1 req/s (test request)
0:33 CLOSED 5000 req/s (backed-up demand!)
0:34 OPEN 0 req/s (service crashed again)
...repeats forever...
Solution: Gradual recovery (ramp-up)
Instead of instantly allowing all traffic, we gradually increase the percentage of requests allowed:
import random

class GradualRecoveryCircuitBreaker:
    """Circuit breaker with gradual traffic ramp-up on recovery."""
    def __init__(self, name: str, config: CircuitBreakerConfig = None):
        self.breaker = CircuitBreaker(name, config)
        self.recovery_percentage = 100  # % of traffic to allow (100 = fully recovered)
        self.recovery_start_time = None
        self.recovery_duration = 60     # seconds to reach 100%
        self.last_state = CircuitState.CLOSED
    def call(self, func: Callable) -> Any:
        state = self.breaker.get_state()
        # Start the ramp-up the moment the circuit closes after being open
        if state == CircuitState.CLOSED and self.last_state != CircuitState.CLOSED:
            self.recovery_start_time = time.time()
            self.recovery_percentage = 0
        self.last_state = state
        if state == CircuitState.CLOSED and self.recovery_percentage < 100:
            # We're in recovery ramp-up - shed a fraction of traffic
            if self._should_allow_request():
                return self.breaker.call(func)
            else:
                raise CircuitOpenError("Request shed during recovery")
        return self.breaker.call(func)
    def _should_allow_request(self) -> bool:
        """Probabilistically allow request based on recovery percentage."""
        self._update_recovery_percentage()
        return random.random() * 100 < self.recovery_percentage
    def _update_recovery_percentage(self):
        """Update recovery percentage based on time since recovery started."""
        if self.recovery_start_time is None:
            return
        elapsed = time.time() - self.recovery_start_time
        self.recovery_percentage = min(100, (elapsed / self.recovery_duration) * 100)
        if self.recovery_percentage >= 100:
            self.recovery_start_time = None  # Ramp-up complete
4.3 Problem 3: Single Point of Failure (Inconsistent State)
In distributed systems, you typically have multiple instances of your service. If each instance has its own circuit breaker, they can have different views of the world.
If your circuit breaker state is stored in memory:
- Server 1: Circuit OPEN (saw failures)
- Server 2: Circuit CLOSED (hasn't seen failures yet)
- Server 3: Circuit CLOSED (hasn't seen failures yet)
Only Server 1 is protecting itself.
Servers 2 and 3 are still hammering the broken service.
Why this happens:
Each server's circuit breaker only sees the failures IT experiences. If you have 3 servers and the downstream service starts failing, each server will independently count failures. By the time one server's circuit opens, the others might still be sending traffic.
The math:
- 100 failures/second from downstream
- 3 servers sharing load equally ≈ 33 failures/second/server
- Threshold: 5 failures
- Each server opens in ~150ms
- But they open at DIFFERENT times
For 150ms, some servers are protecting themselves while others are still sending traffic. At high scale, this matters.
Trade-offs of distributed vs local state:
| Aspect | Local State | Distributed State |
|---|---|---|
| Latency | Fastest | Adds Redis RTT |
| Consistency | Each server independent | All servers agree |
| Complexity | Simple | More complex |
| Redis dependency | None | Redis must be up |
Solution: Shared circuit breaker state
class DistributedCircuitBreaker:
"""Circuit breaker with shared state in Redis."""
def __init__(self, name: str, redis_client, config: CircuitBreakerConfig = None):
self.name = name
self.redis = redis_client
self.config = config or CircuitBreakerConfig()
def _state_key(self) -> str:
return f"circuit_breaker:{self.name}:state"
def _failures_key(self) -> str:
return f"circuit_breaker:{self.name}:failures"
def get_state(self) -> CircuitState:
"""Get circuit state from Redis."""
state = self.redis.get(self._state_key())
if state:
return CircuitState(state.decode())
return CircuitState.CLOSED
def _set_state(self, state: CircuitState, ttl: int = None):
"""Set circuit state in Redis."""
if ttl:
self.redis.setex(self._state_key(), ttl, state.value)
else:
self.redis.set(self._state_key(), state.value)
def record_failure(self):
"""Record failure in Redis."""
pipe = self.redis.pipeline()
pipe.lpush(self._failures_key(), time.time())
pipe.ltrim(self._failures_key(), 0, 99) # Keep last 100
pipe.expire(self._failures_key(), self.config.failure_window)
pipe.execute()
# Check if should open
failures = self.redis.lrange(self._failures_key(), 0, -1)
recent = [f for f in failures
if float(f) > time.time() - self.config.failure_window]
if len(recent) >= self.config.failure_threshold:
self._set_state(CircuitState.OPEN, ttl=self.config.recovery_timeout)
def record_success(self):
"""Record success, potentially closing circuit."""
if self.get_state() == CircuitState.HALF_OPEN:
self._set_state(CircuitState.CLOSED)
self.redis.delete(self._failures_key())
Hybrid approach:
Many production systems use a hybrid: local state with periodic sync. The circuit breaker checks local state first (fast), but synchronizes with Redis periodically or on state changes. This balances performance with consistency.
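A minimal sketch of that hybrid idea, assuming a redis-py style client and the same state key used by DistributedCircuitBreaker above (the class name and the one-second sync interval are illustrative):
import time
from typing import Optional

class HybridCircuitBreaker:
    """Sketch: read local state on the hot path, sync with Redis periodically."""

    def __init__(self, name: str, redis_client, sync_interval: float = 1.0):
        self.name = name
        self.redis = redis_client
        self.sync_interval = sync_interval
        self.local_state = CircuitState.CLOSED
        self.last_sync = 0.0

    def get_state(self) -> CircuitState:
        # Fast path: local state; refresh from Redis at most once per interval
        now = time.time()
        if now - self.last_sync >= self.sync_interval:
            raw = self.redis.get(f"circuit_breaker:{self.name}:state")
            if raw:
                self.local_state = CircuitState(raw.decode())
            self.last_sync = now
        return self.local_state

    def set_state(self, state: CircuitState, ttl: Optional[int] = None):
        # Write-through: update local state immediately and publish to Redis
        self.local_state = state
        if ttl:
            self.redis.setex(f"circuit_breaker:{self.name}:state", ttl, state.value)
        else:
            self.redis.set(f"circuit_breaker:{self.name}:state", state.value)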
4.4 Problem 4: Hiding Real Problems
This is subtle but dangerous: circuit breakers can mask problems, making them harder to detect and fix.
Scenario: You deploy buggy code that always times out. Circuit breaker opens. Errors stop (circuit is open, not calling broken code). Alert clears! You think it's fixed, but it's not.
30 seconds later, circuit tests → fails → stays open. This repeats forever.
Your monitoring shows "service degraded" not "service broken." It might take hours before someone investigates.
Why this is dangerous:
1. Error rate drops when circuit opens: Your "error rate" alert might clear because errors are being prevented, not fixed.
2. The circuit periodically tests and fails: Every 30 seconds, one request fails. This looks like "occasional flakiness" not "complete outage."
3. Users experience degraded service: They're seeing fallbacks or errors, but your alerts don't reflect the severity.
Real-world example:
Timeline of a hidden outage:
10:00 AM - Bad deploy breaks authentication service
10:01 AM - Circuit breaker opens, error rate drops
10:02 AM - On-call engineer sees "circuit open" but error rate is low
10:02 AM - Engineer thinks: "Hmm, weird, but errors are low. Maybe transient."
10:03 AM - Circuit tests, fails, stays open
10:30 AM - Circuit has been open for 30 minutes
10:30 AM - Engineer notices "average latency" is great (because most calls fail fast!)
11:00 AM - Customer complains about login issues
11:15 AM - Someone finally investigates the open circuit
11:20 AM - Bad deploy found and rolled back
11:25 AM - Service recovered
Total outage: 1 hour 25 minutes
Time to detection: 1 hour 15 minutes (!)
Solution: Alert on circuit state changes
You need alerts that fire when the circuit opens, not just when errors are high.
from prometheus_client import Counter, Gauge
circuit_state_gauge = Gauge(
'circuit_breaker_state',
'Current circuit breaker state (0=closed, 1=half-open, 2=open)',
['name']
)
circuit_state_changes = Counter(
'circuit_breaker_state_changes_total',
'Circuit breaker state transitions',
['name', 'from_state', 'to_state']
)
class ObservableCircuitBreaker(CircuitBreaker):
"""Circuit breaker with metrics and alerting."""
def _transition_state(self, new_state: CircuitState):
old_state = self.state
self.state = new_state
# Update metrics
state_value = {'closed': 0, 'half_open': 1, 'open': 2}
circuit_state_gauge.labels(name=self.name).set(state_value[new_state.value])
circuit_state_changes.labels(
name=self.name,
from_state=old_state.value,
to_state=new_state.value
).inc()
# Log for alerting
if new_state == CircuitState.OPEN:
logger.error(
f"Circuit breaker '{self.name}' OPENED",
extra={'circuit': self.name, 'event': 'circuit_open'}
)
elif new_state == CircuitState.CLOSED and old_state == CircuitState.HALF_OPEN:
logger.info(
f"Circuit breaker '{self.name}' recovered",
extra={'circuit': self.name, 'event': 'circuit_recovered'}
)
Alert rules:
groups:
- name: circuit_breakers
rules:
# Alert when any circuit is open
- alert: CircuitBreakerOpen
expr: circuit_breaker_state == 2
for: 1m
labels:
severity: critical
annotations:
summary: "Circuit breaker {{ $labels.name }} is open"
description: "The circuit has been open for over 1 minute. Investigate the downstream service."
# Alert when circuit is flapping (sign of borderline failure)
- alert: CircuitBreakerFlapping
expr: increase(circuit_breaker_state_changes_total[10m]) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Circuit breaker {{ $labels.name }} is flapping"
description: "The circuit has changed state {{ $value }} times in 10 minutes. The downstream service may be unstable."
Key insight: An open circuit is ALWAYS worth investigating. Even if it's working as intended (protecting your system), you want to know about it.
Part II: The Design Challenge
Chapter 5: Adding Circuit Breakers to Payment Service
5.1 Our Payment System So Far
From Days 1 and 2, we have:
class PaymentService:
def process_payment(self, user_id, amount, idempotency_key):
# Day 2: Check idempotency
is_new, cached = self.idempotency.check_and_set(idempotency_key, {...})
if not is_new:
return cached
# Day 1: Timeout budget
budget = TimeoutBudget(4500)
# Step 1: Fraud check
fraud_result = self._check_fraud(budget, user_id, amount)
# Step 2: Bank charge
bank_result = self._charge_bank(budget, user_id, amount, idempotency_key)
# Step 3: Notification
self._send_notification(user_id, amount, bank_result.transaction_id)
return bank_result
5.2 Adding Circuit Breakers
from typing import Optional
import asyncio
class PaymentServiceWithCircuitBreaker:
"""
Payment service with all three reliability patterns:
- Timeouts (Day 1)
- Idempotency (Day 2)
- Circuit Breakers (Day 3)
"""
def __init__(
self,
config: PaymentConfig,
idempotency_store: IdempotencyStore,
redis_client
):
self.config = config
self.idempotency = idempotency_store
# Circuit breakers for each downstream service
self.fraud_circuit = CircuitBreakerWithFallback(
name="fraud_service",
config=CircuitBreakerConfig(
failure_threshold=5,
failure_window=60,
recovery_timeout=30,
),
# If fraud service is down, use default medium risk
fallback=lambda: {"risk_score": 0.5, "source": "fallback"}
)
self.bank_circuit = CircuitBreaker(
name="bank_api",
config=CircuitBreakerConfig(
failure_threshold=3, # More sensitive for payments
failure_window=30,
recovery_timeout=60, # Wait longer before retry
)
)
self.notification_circuit = CircuitBreakerWithFallback(
name="notification_service",
config=CircuitBreakerConfig(
failure_threshold=10, # Less sensitive, not critical
failure_window=60,
recovery_timeout=15,
),
fallback=lambda: {"status": "queued"} # Queue for later
)
async def process_payment(
self,
user_id: str,
amount: float,
idempotency_key: str
) -> PaymentResult:
"""
Process payment with full reliability stack.
"""
request_body = {'user_id': user_id, 'amount': amount}
# =====================================================================
# Layer 1: Idempotency (Day 2)
# =====================================================================
try:
is_new, cached = await self.idempotency.check_and_claim(
idempotency_key, request_body, user_id
)
if not is_new:
return PaymentResult(**cached)
except IdempotencyInProgressError:
return PaymentResult(
status=PaymentStatus.PENDING,
error_message="Payment is being processed"
)
# =====================================================================
# Layer 2: Timeout Budget (Day 1)
# =====================================================================
budget = TimeoutBudget(self.config.total_budget_ms)
try:
result = await self._process_with_circuit_breakers(
budget, user_id, amount, idempotency_key
)
await self.idempotency.complete(idempotency_key, result.__dict__)
return result
except Exception as e:
await self.idempotency.fail(idempotency_key, str(e))
raise
async def _process_with_circuit_breakers(
self,
budget: TimeoutBudget,
user_id: str,
amount: float,
idempotency_key: str
) -> PaymentResult:
"""Process payment with circuit breakers on each step."""
# =====================================================================
# Step 1: Fraud Check (with fallback)
# =====================================================================
fraud_result = await self._check_fraud_with_circuit(budget, user_id, amount)
if fraud_result.get('risk_score', 0) > 0.9:
return PaymentResult(
status=PaymentStatus.REJECTED,
error_message="Transaction flagged as high risk"
)
# If using fallback (fraud service down), require additional verification
if fraud_result.get('source') == 'fallback':
if amount > 100: # High value + no fraud check = reject
return PaymentResult(
status=PaymentStatus.REJECTED,
error_message="Unable to verify transaction. Please try a smaller amount."
)
# =====================================================================
# Step 2: Bank Charge (no fallback - critical)
# =====================================================================
try:
bank_result = await self._charge_bank_with_circuit(
budget, user_id, amount, idempotency_key
)
except CircuitOpenError:
# Bank circuit is open - provide clear message
return PaymentResult(
status=PaymentStatus.SERVICE_UNAVAILABLE,
error_message="Payment processing is temporarily unavailable."
)
if bank_result.status != PaymentStatus.SUCCESS:
return bank_result
# =====================================================================
# Step 3: Notification (with fallback)
# =====================================================================
await self._notify_with_circuit(user_id, amount, bank_result.transaction_id)
return bank_result
5.3 The Black Friday Scenario
Challenge: What if the circuit opens during Black Friday?
Black Friday, 2:00 PM:
- 10,000 payment attempts per minute
- Bank API under heavy load
- Bank latency increases from 500ms to 3000ms
- Some requests timeout
What happens with our circuit breaker?
With a naive count-based threshold (3 to 5 failures), the handful of timeouts that naturally accompany 10,000 requests per minute trips the circuit within seconds, even though the vast majority of payments are still succeeding. We would be cutting off all payments on the busiest day of the year.
The fix - a smarter circuit breaker:
class SmartPaymentCircuitBreaker:
"""
Circuit breaker designed for high-stakes, high-traffic scenarios.
"""
def __init__(self):
self.breaker = RateBasedCircuitBreaker(
name="bank_api",
failure_rate_threshold=0.5, # Only open at 50% failure rate
minimum_calls=100, # Need 100 calls before deciding
window_size=30,
recovery_timeout=30, # Shorter recovery time
)
# Track by error type - not all errors should count
self.retryable_errors = (Timeout, ConnectionError)
self.non_retryable_errors = (PaymentDeclined, InsufficientFunds)
def call(self, func):
try:
result = func()
self.breaker._record_call(success=True)
return result
except self.retryable_errors as e:
# These indicate service problems - count them
self.breaker._record_call(success=False)
raise
except self.non_retryable_errors as e:
# These are business errors, not service problems
# Don't count against circuit breaker
self.breaker._record_call(success=True) # Service worked, just said "no"
raise
5.4 User Experience When Circuit Opens
When the bank circuit opens, what does the user see?
async def handle_payment_request(request):
try:
result = await payment_service.process_payment(
user_id=request.user_id,
amount=request.amount,
idempotency_key=request.idempotency_key
)
return PaymentResponse(result)
except CircuitOpenError:
# Log for monitoring
logger.warning("Payment circuit open, returning friendly error")
return PaymentResponse(
status="temporarily_unavailable",
message="We're experiencing high demand. Your payment could not be processed.",
retry_after=60, # Tell client when to retry
actions=[
{
"type": "retry",
"label": "Try Again",
"delay_seconds": 60
},
{
"type": "alternative",
"label": "Pay with PayPal",
"url": "/checkout/paypal"
},
{
"type": "save",
"label": "Save Cart for Later",
"url": "/cart/save"
}
]
)
Part III: Comparing Resilience Patterns
Chapter 6: Circuit Breaker vs Retry vs Bulkhead
6.1 The Three Patterns
┌─────────────────────────────────────────────────────────────────────────┐
│ │
│ RETRY CIRCUIT BREAKER BULKHEAD │
│ │
│ "Try again" "Stop trying" "Isolate failures" │
│ │
│ Request fails Too many failures Limit resources │
│ ↓ ↓ per dependency │
│ Wait (backoff) Stop calling ↓ │
│ ↓ ↓ If one fails, │
│ Try again Fail immediately others continue │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Let's understand each pattern in depth:
Retry
What it does: When a request fails, wait a bit and try again.
The intuition: Many failures are transient. A network blip, a temporary overload, a GC pause. If you just try again, it often works.
Key components:
- Max retries: How many times to try (usually 2-3)
- Backoff: How long to wait between tries (exponential: 1s, 2s, 4s...)
- Jitter: Random delay to prevent thundering herd
The danger: Retries amplify load. If a service is overloaded and 100 clients each retry 3 times, you've turned 100 requests into 300 requests — making the problem worse.
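A minimal sketch of those retry mechanics (max attempts, exponential backoff, jitter); the function name and the retryable exception types are illustrative:
import random
import time

def call_with_retry(func, max_retries: int = 3,
                    retryable=(TimeoutError, ConnectionError)):
    """Sketch: retry a callable with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts - surface the error
            # Exponential backoff (1s, 2s, 4s, ...) plus random jitter
            delay = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)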
Circuit Breaker
What it does: When a service is clearly broken, stop calling it entirely.
The intuition: If the last 10 calls all failed, the 11th will probably fail too. Why waste time and resources trying?
Key components:
- Failure detection: Counting failures or measuring failure rate
- State machine: CLOSED → OPEN → HALF-OPEN → CLOSED
- Recovery testing: Periodically checking if the service has recovered
The danger: If configured poorly, circuit breakers can open during legitimate load spikes, cutting off healthy services.
Bulkhead
What it does: Isolates resources so that one slow/failing dependency can't consume everything.
The intuition: Think of a ship with watertight compartments. If one compartment floods, the others stay dry. The ship doesn't sink.
Key components:
- Thread pools per dependency: Each downstream service gets its own pool
- Semaphores: Limit concurrent calls to each service
- Queue limits: Reject requests if queue is too long
The danger: Over-provisioning wastes resources. Under-provisioning causes false rejections.
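A minimal sketch of a semaphore-based bulkhead using asyncio (the same mechanism the combined client uses later in this chapter); the class and exception names are illustrative:
import asyncio

class BulkheadFullError(Exception):
    """Raised when no slot is available for this dependency."""

class Bulkhead:
    """Sketch: cap concurrent calls to one dependency with a semaphore."""

    def __init__(self, name: str, max_concurrent: int = 10, acquire_timeout: float = 1.0):
        self.name = name
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.acquire_timeout = acquire_timeout

    async def call(self, func):
        # Wait briefly for a slot; reject instead of queueing forever
        try:
            await asyncio.wait_for(self.semaphore.acquire(), timeout=self.acquire_timeout)
        except asyncio.TimeoutError:
            raise BulkheadFullError(f"Bulkhead '{self.name}' is full")
        try:
            return await func()
        finally:
            self.semaphore.release()

# Each downstream dependency gets its own bulkhead, so one slow service
# can't exhaust the shared resources:
bank_bulkhead = Bulkhead("bank_api", max_concurrent=20)
fraud_bulkhead = Bulkhead("fraud_service", max_concurrent=10)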
6.2 When to Use Each
| Pattern | Use When | Don't Use When |
|---|---|---|
| Retry | Transient failures (network blip, temporary overload) | Service is clearly down |
| Circuit Breaker | Service is failing consistently | Single request fails |
| Bulkhead | Want to prevent cascade failures | All dependencies equally critical |
Decision flowchart:
Request failed
│
├─ Was it a transient error (timeout, 503)?
│ YES → RETRY with backoff
│ NO → Don't retry (4xx, validation error)
│
├─ Have many requests failed recently?
│ YES → CIRCUIT BREAKER should open
│ NO → Keep trying (might be bad luck)
│
└─ Is this dependency isolated from others?
NO → Use BULKHEAD to isolate
YES → Bulkhead already in place
6.3 Combined Strategy
In production, you typically use ALL THREE patterns together. They're not alternatives — they're layers of defense.
class ResilientClient:
"""
Client with all three patterns working together.
"""
def __init__(
self,
name: str,
max_concurrent: int = 10, # Bulkhead
max_retries: int = 3, # Retry
failure_threshold: int = 5, # Circuit breaker
):
self.name = name
self.circuit = CircuitBreaker(name, CircuitBreakerConfig(
failure_threshold=failure_threshold
))
self.semaphore = asyncio.Semaphore(max_concurrent) # Bulkhead
self.max_retries = max_retries
async def call(self, func: Callable, timeout: float = 5.0) -> Any:
"""Execute with all resilience patterns."""
# Layer 1: Circuit Breaker (cheapest check first)
if self.circuit.get_state() == CircuitState.OPEN:
raise CircuitOpenError(f"{self.name} circuit is open")
# Layer 2: Bulkhead (limit concurrent calls)
try:
acquired = await asyncio.wait_for(
self.semaphore.acquire(),
timeout=1.0
)
except asyncio.TimeoutError:
raise BulkheadFullError(f"{self.name} bulkhead is full")
try:
# Layer 3: Retry with backoff
last_exception = None
for attempt in range(self.max_retries):
try:
result = await asyncio.wait_for(func(), timeout=timeout)
self.circuit._record_success()
return result
except asyncio.TimeoutError as e:
last_exception = e
self.circuit._record_failure()
if attempt < self.max_retries - 1:
delay = (2 ** attempt) + random.uniform(0, 1)
await asyncio.sleep(delay)
except Exception as e:
last_exception = e
self.circuit._record_failure()
raise
raise last_exception
finally:
self.semaphore.release()
Why this order matters:
- Circuit breaker first: It's the cheapest check (just reading state). If the circuit is open, we fail immediately without consuming bulkhead slots.
- Bulkhead second: Before we start waiting on the downstream service, we acquire a slot. This prevents one slow service from consuming all resources.
- Retry last: Within the bulkhead, we retry on transient failures. Each retry updates the circuit breaker state.
Part IV: Discussion and Trade-offs
Chapter 7: The Hard Questions
7.1 "What if the circuit opens during Black Friday?"
Strong Answer:
"This is a critical concern. A naive circuit breaker could turn a partial outage into a complete outage.
1. Use failure rate, not count: At 10x traffic, 1% failures = 10x more failures but same rate. Circuit stays closed.
2. Only count retryable errors: 'Card declined' is not a service failure. Don't count business rejections.
3. Have fallback payment paths: If primary fails, try backup processor.
4. Provide clear UX: If all paths fail: 'Try again in 60 seconds', 'Save cart', 'Alternative payment'."
7.2 "Circuit breaker vs retry vs bulkhead - when each?"
Strong Answer:
"Retry: Transient failures. Service generally healthy.
Circuit breaker: Persistent failures. Stop wasting resources.
Bulkhead: Isolate failures. One slow dependency shouldn't kill everything.
They work together:
- Circuit breaker first (cheapest check)
- Bulkhead second (limit concurrent)
- Retry last (within the call)"
Chapter 8: Session Summary
What You've Learned This Week
| Day | Pattern | Problem Solved |
|---|---|---|
| Day 1 | Timeouts | Stop waiting forever for slow services |
| Day 2 | Idempotency | Safe to retry without duplicate effects |
| Day 3 | Circuit Breakers | Stop calling services that are clearly broken |
How They Work Together
User clicks "Pay"
↓
[Idempotency Check] - Day 2
↓
[Circuit Breaker Check] - Day 3
↓
[Make Request with Timeout] - Day 1
↓
[Record Result]
Part V: Interview Questions
Chapter 9: Key Questions
Question 1: "Explain the circuit breaker pattern and its states."
Answer: "Three states: CLOSED (normal), OPEN (failing fast), HALF-OPEN (testing). Opens after threshold failures, closes after successful test. Converts slow failures to fast failures."
Question 2: "Count-based vs rate-based circuit breaker?"
Answer: "Count-based opens after N failures - breaks at high traffic. Rate-based opens at X% failure rate with minimum sample - scales properly."
Question 3: "When do circuit breakers cause harm?"
Answer: "Opening during load spikes, thundering herd on recovery, hiding real problems, inconsistent state across servers. Fix with rate-based thresholds, gradual recovery, alerting, distributed state."
Exercises
- Implement rate-based circuit breaker with gradual recovery
- Add circuit breakers to Day 2 payment service
- Create chaos tests for circuit breaker behavior
Appendix: Production Implementation
"""
Production-ready circuit breaker implementation.
Completes the reliability stack from Days 1-3.
"""
import time
import random
import asyncio
import logging
from enum import Enum
from dataclasses import dataclass
from typing import Callable, Any, Optional, TypeVar, Generic
from collections import deque
import threading
from prometheus_client import Counter, Gauge, Histogram
# =============================================================================
# Metrics
# =============================================================================
circuit_state_gauge = Gauge(
'circuit_breaker_state',
'Current circuit state (0=closed, 1=half_open, 2=open)',
['name']
)
circuit_calls = Counter(
'circuit_breaker_calls_total',
'Circuit breaker call results',
['name', 'result'] # success, failure, rejected
)
circuit_state_changes = Counter(
'circuit_breaker_state_changes_total',
'Circuit breaker state transitions',
['name', 'from_state', 'to_state']
)
# =============================================================================
# Core Types
# =============================================================================
class CircuitState(Enum):
CLOSED = "closed"
OPEN = "open"
HALF_OPEN = "half_open"
@dataclass
class CircuitBreakerConfig:
# Rate-based thresholds
failure_rate_threshold: float = 0.5
minimum_calls: int = 10
# Time windows
sliding_window_size: int = 60 # seconds
recovery_timeout: int = 30 # seconds
# Recovery behavior
success_threshold: int = 3 # successes needed to close
gradual_recovery: bool = True
recovery_ramp_duration: int = 60 # seconds to reach 100%
# Error classification
record_exceptions: tuple = (Exception,)
ignore_exceptions: tuple = ()
T = TypeVar('T')
# =============================================================================
# Production Circuit Breaker
# =============================================================================
class ProductionCircuitBreaker:
"""
Production circuit breaker with:
- Rate-based failure detection
- Gradual recovery
- Prometheus metrics
- Distributed state option
"""
def __init__(self, name: str, config: CircuitBreakerConfig = None):
self.name = name
self.config = config or CircuitBreakerConfig()
self.logger = logging.getLogger(f'circuit_breaker.{name}')
# State
self.state = CircuitState.CLOSED
self.calls: deque = deque() # (timestamp, success: bool)
self.last_failure_time: Optional[float] = None
self.half_open_successes = 0
self.recovery_start_time: Optional[float] = None
self.lock = threading.Lock()
# Initialize metrics
circuit_state_gauge.labels(name=name).set(0)
def call(self, func: Callable[[], T]) -> T:
"""Execute function through circuit breaker."""
with self.lock:
self._maybe_transition_state()
if self.state == CircuitState.OPEN:
circuit_calls.labels(name=self.name, result='rejected').inc()
retry_after = self._get_retry_after()
raise CircuitOpenError(self.name, retry_after)
if self.state == CircuitState.CLOSED and self._in_recovery():
if not self._should_allow_during_recovery():
circuit_calls.labels(name=self.name, result='rejected').inc()
raise CircuitOpenError(self.name, retry_after=5)
try:
result = func()
self._record_success()
return result
except self.config.ignore_exceptions:
self._record_success()
raise
except self.config.record_exceptions:
self._record_failure()
raise
def _maybe_transition_state(self):
"""Check if state should transition."""
if self.state == CircuitState.OPEN:
if self.last_failure_time:
elapsed = time.time() - self.last_failure_time
if elapsed >= self.config.recovery_timeout:
self._transition_to(CircuitState.HALF_OPEN)
def _record_success(self):
"""Record successful call."""
with self.lock:
now = time.time()
self.calls.append((now, True))
self._clean_old_calls()
circuit_calls.labels(name=self.name, result='success').inc()
if self.state == CircuitState.HALF_OPEN:
self.half_open_successes += 1
if self.half_open_successes >= self.config.success_threshold:
self._transition_to(CircuitState.CLOSED)
self._start_recovery()
def _record_failure(self):
"""Record failed call."""
with self.lock:
now = time.time()
self.last_failure_time = now
self.calls.append((now, False))
self._clean_old_calls()
circuit_calls.labels(name=self.name, result='failure').inc()
if self.state == CircuitState.HALF_OPEN:
self._transition_to(CircuitState.OPEN)
return
if self.state == CircuitState.CLOSED:
if self._should_open():
self._transition_to(CircuitState.OPEN)
def _should_open(self) -> bool:
"""Check if circuit should open based on failure rate."""
total = len(self.calls)
if total < self.config.minimum_calls:
return False
failures = sum(1 for _, success in self.calls if not success)
failure_rate = failures / total
return failure_rate >= self.config.failure_rate_threshold
def _clean_old_calls(self):
"""Remove calls outside sliding window."""
cutoff = time.time() - self.config.sliding_window_size
while self.calls and self.calls[0][0] < cutoff:
self.calls.popleft()
def _transition_to(self, new_state: CircuitState):
"""Transition to new state with logging and metrics."""
old_state = self.state
self.state = new_state
state_values = {
CircuitState.CLOSED: 0,
CircuitState.HALF_OPEN: 1,
CircuitState.OPEN: 2
}
circuit_state_gauge.labels(name=self.name).set(state_values[new_state])
circuit_state_changes.labels(
name=self.name,
from_state=old_state.value,
to_state=new_state.value
).inc()
if new_state == CircuitState.OPEN:
self.logger.warning(f"Circuit '{self.name}' OPENED")
elif new_state == CircuitState.CLOSED:
self.logger.info(f"Circuit '{self.name}' closed")
elif new_state == CircuitState.HALF_OPEN:
self.logger.info(f"Circuit '{self.name}' half-open, testing...")
self.half_open_successes = 0
def _start_recovery(self):
"""Start gradual recovery period."""
if self.config.gradual_recovery:
self.recovery_start_time = time.time()
def _in_recovery(self) -> bool:
"""Check if in gradual recovery period."""
if not self.recovery_start_time:
return False
elapsed = time.time() - self.recovery_start_time
if elapsed >= self.config.recovery_ramp_duration:
self.recovery_start_time = None
return False
return True
def _should_allow_during_recovery(self) -> bool:
"""Probabilistically allow request during recovery."""
elapsed = time.time() - self.recovery_start_time
recovery_percentage = (elapsed / self.config.recovery_ramp_duration) * 100
return random.random() * 100 < recovery_percentage
def _get_retry_after(self) -> int:
"""Get seconds until retry might succeed."""
if self.last_failure_time:
elapsed = time.time() - self.last_failure_time
remaining = self.config.recovery_timeout - elapsed
return max(1, int(remaining))
return self.config.recovery_timeout
def get_state(self) -> CircuitState:
"""Get current circuit state."""
with self.lock:
self._maybe_transition_state()
return self.state
def get_stats(self) -> dict:
"""Get circuit breaker statistics."""
with self.lock:
total = len(self.calls)
failures = sum(1 for _, success in self.calls if not success)
return {
'name': self.name,
'state': self.state.value,
'total_calls': total,
'failures': failures,
'failure_rate': failures / total if total > 0 else 0,
'in_recovery': self._in_recovery(),
}
class CircuitOpenError(Exception):
"""Raised when circuit breaker is open."""
def __init__(self, name: str, retry_after: int = None):
self.name = name
self.retry_after = retry_after
super().__init__(f"Circuit breaker '{name}' is open")
# =============================================================================
# Complete Payment Service with All Patterns
# =============================================================================
class ResilientPaymentService:
"""
Payment service demonstrating all three reliability patterns:
- Day 1: Timeouts
- Day 2: Idempotency
- Day 3: Circuit Breakers
"""
def __init__(self, config, idempotency_store):
self.config = config
self.idempotency = idempotency_store
# Different circuit breakers for different services
self.fraud_circuit = ProductionCircuitBreaker(
name="fraud_service",
config=CircuitBreakerConfig(
failure_rate_threshold=0.3,
minimum_calls=20,
recovery_timeout=30,
)
)
self.bank_circuit = ProductionCircuitBreaker(
name="bank_api",
config=CircuitBreakerConfig(
failure_rate_threshold=0.5,
minimum_calls=50,
recovery_timeout=60,
success_threshold=3,
gradual_recovery=True,
)
)
self.notification_circuit = ProductionCircuitBreaker(
name="notification_service",
config=CircuitBreakerConfig(
failure_rate_threshold=0.5,
minimum_calls=10,
recovery_timeout=15,
)
)
async def process_payment(
self,
user_id: str,
amount: float,
idempotency_key: str
):
"""
Process payment with complete reliability stack.
Layer 1: Idempotency (Day 2) - prevent duplicates
Layer 2: Timeout Budget (Day 1) - don't wait forever
Layer 3: Circuit Breakers (Day 3) - fail fast on broken services
"""
# Check idempotency first
is_new, cached = await self.idempotency.check_and_claim(
idempotency_key,
{'user_id': user_id, 'amount': amount}
)
if not is_new:
return cached
# Create timeout budget
budget = TimeoutBudget(self.config.total_budget_ms)
try:
# Process with circuit breakers
result = await self._process(budget, user_id, amount, idempotency_key)
await self.idempotency.complete(idempotency_key, result)
return result
except Exception as e:
await self.idempotency.fail(idempotency_key, str(e))
raise
async def _process(self, budget, user_id, amount, idempotency_key):
"""Internal processing with all patterns."""
# Fraud check with fallback
try:
fraud_result = self.fraud_circuit.call(
lambda: self._check_fraud(budget, user_id, amount)
)
except CircuitOpenError:
# Use fallback risk score
fraud_result = {"risk_score": 0.5, "fallback": True}
if fraud_result.get('risk_score', 0) > 0.9:
return {'status': 'rejected', 'reason': 'high_risk'}
# Bank charge - no fallback, critical
try:
bank_result = self.bank_circuit.call(
lambda: self._charge_bank(budget, user_id, amount, idempotency_key)
)
except CircuitOpenError as e:
return {
'status': 'unavailable',
'reason': 'payment_service_unavailable',
'retry_after': e.retry_after
}
# Notification - with fallback
try:
self.notification_circuit.call(
lambda: self._send_notification(user_id, amount, bank_result['transaction_id'])
)
except CircuitOpenError:
# Queue for later - not critical
pass
return {
'status': 'success',
'transaction_id': bank_result['transaction_id']
}
Further Reading
- "Release It!" by Michael Nygard: The original circuit breaker pattern
- Netflix Hystrix Wiki: Detailed implementation guide (now maintenance mode)
- resilience4j Documentation: Modern Java circuit breaker library
- Microsoft Azure Architecture Center: Circuit Breaker pattern
- Martin Fowler's Blog: CircuitBreaker article
End of Day 3: Circuit Breakers
Tomorrow: Day 4 — Webhook Delivery System. We shift from synchronous request/response to asynchronous event delivery. How do you guarantee delivery to external systems you don't control? What happens when a receiver is down for hours?