The Circuit Breaker Pattern, Explained for Resilient Systems
How the circuit breaker pattern stops one slow dependency from taking down your whole service — states, thresholds, and the defaults real libraries ship with.
One dependency starts answering slowly. Your service keeps calling it, every request piling up behind a 30-second timeout. Threads block. The connection pool drains. Now requests that never touched the slow dependency are timing out too, because there are no threads left to serve them. A single degraded downstream has taken down a service that didn’t need it.
The circuit breaker pattern exists to cut that chain. It wraps a call to a remote dependency and watches the failure rate. When failures cross a threshold, it stops calling — it “trips” — and fails fast instead of waiting on timeouts. The name comes straight from the electrical panel: when current spikes, the breaker opens so the wiring behind it doesn’t melt.
The three states
A circuit breaker is a small state machine with three states, and understanding them is most of understanding the pattern.
Closed is the normal state. Calls pass through to the dependency. The breaker counts outcomes — successes, failures, and (in better implementations) slow calls that succeed but took too long. As long as the failure rate stays under the threshold, nothing changes.
Open is the tripped state. The breaker has seen too many failures and stops forwarding calls entirely. Every request returns immediately with an error or a fallback value — no network call, no timeout wait. This is the whole point: a call that fails in microseconds instead of blocking a thread for 30 seconds. The breaker stays open for a fixed wait duration.
Half-open is the probe. After the wait duration elapses, the breaker lets a small number of trial calls through. If they succeed, it assumes the dependency recovered and returns to closed. If they fail, it snaps back to open and waits again. Half-open is what lets the system heal on its own without a human restarting anything.
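The three states can be sketched as a small class. This is a minimal illustration, not how any particular library implements it: the names `CircuitBreaker`, `CircuitOpenError`, and every parameter are hypothetical, and it trips on consecutive failures rather than a failure rate to keep the state logic easy to read.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative three-state breaker; all names here are hypothetical."""

    def __init__(self, failure_threshold=3, wait_seconds=60.0,
                 half_open_calls=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures that trip it
        self.wait_seconds = wait_seconds            # time spent open before probing
        self.half_open_calls = half_open_calls      # successful probes needed to close
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.probe_successes = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.wait_seconds:
                # Fail fast: the dependency is never touched.
                raise CircuitOpenError("circuit open; dependency not called")
            # Wait duration elapsed: start letting trial calls through.
            self.state = "half_open"
            self.probe_successes = 0
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Any failed probe reopens; in closed state, wait for the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_calls:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # a success breaks the run of consecutive failures
```

Wrapping a call then looks like `breaker.call(lambda: client.fetch(order_id))`, where `client.fetch` stands in for whatever remote call you are protecting. The injectable `clock` is a design choice for testability, so the wait duration can be simulated instead of slept through.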
What the thresholds actually are
The pattern is simple. Tuning it is where teams get it wrong, usually by guessing at numbers. It helps to look at what a mature library ships as defaults. Resilience4j, the widely used JVM implementation, defaults to:
- Failure rate threshold: 50%. The breaker trips when at least half the calls in its window fail.
- Sliding window size: 100 calls, count-based. Failure rate is computed over the last 100 calls, not all-time.
- Minimum number of calls: 100. The breaker won’t trip until it has seen at least 100 calls, so a single early failure can’t open it on a cold start.
- Wait duration in open state: 60 seconds. How long it stays open before probing.
- Permitted calls in half-open: 10. The number of trial calls it allows before deciding recovered-or-not.
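- Slow call duration threshold: 60 seconds, with a slow-call rate threshold of 100%. A call counts as slow only if it takes longer than 60 seconds, and slowness alone trips the breaker only when every call in the window is slow.
The two slow-call settings matter more than people expect. A dependency that returns HTTP 200 in 25 seconds is technically succeeding while it quietly destroys your latency budget, and under the defaults above it would not even count as slow. Lowering the slow-call thresholds so that slow-but-successful calls are treated as failures is what separates a useful breaker from one that only reacts to hard errors.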
The right values are workload-specific. A high-traffic service might use a time-based sliding window, measured in seconds, rather than a call count, because 100 calls can arrive in a blink. A low-traffic internal endpoint needs a small minimum-calls value, or the breaker will take far too long to accumulate enough samples to trip, if it ever does.
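To make the interaction between window size, minimum calls, and the two rate thresholds concrete, here is a rough sketch of a count-based window. `CountBasedWindow` and its parameter names are hypothetical; the default values mirror the numbers listed above, but this is not Resilience4j's code, only an illustration of the arithmetic.

```python
from collections import deque


class CountBasedWindow:
    """Hypothetical helper: failure and slow-call rates over the last N calls."""

    def __init__(self, window_size=100, minimum_calls=100,
                 failure_rate_threshold=50.0,
                 slow_call_seconds=60.0, slow_rate_threshold=100.0):
        # deque(maxlen=...) drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold
        self.slow_call_seconds = slow_call_seconds
        self.slow_rate_threshold = slow_rate_threshold

    def record(self, failed, duration_seconds):
        slow = duration_seconds > self.slow_call_seconds
        self.outcomes.append((failed, slow))

    def rates(self):
        """Return (failure_rate, slow_rate) as percentages of recorded calls."""
        total = len(self.outcomes)
        if total == 0:
            return 0.0, 0.0
        failed = sum(1 for f, _ in self.outcomes if f)
        slow = sum(1 for _, s in self.outcomes if s)
        return 100.0 * failed / total, 100.0 * slow / total

    def should_trip(self):
        # Too few samples: never trip, whatever the rates look like.
        if len(self.outcomes) < self.minimum_calls:
            return False
        failure_rate, slow_rate = self.rates()
        return (failure_rate >= self.failure_rate_threshold
                or slow_rate >= self.slow_rate_threshold)
```

With `slow_call_seconds=2.0` and `slow_rate_threshold=50.0`, a window in which half the calls take three seconds would trip even though none of them failed, which is the behavior the slow-call settings exist to provide.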
Fallbacks, and the parts people skip
A tripped breaker has to return something. The fallback is what your code does when the circuit is open, and a good one is the difference between graceful degradation and a stack trace in the user’s face.
Good fallbacks are honest about being degraded: a cached value from the last successful call, an empty list with a “results may be incomplete” flag, a default recommendation set instead of personalized ones. Bad fallbacks hide the problem — silently swallowing the error and returning success makes the outage invisible until something downstream corrupts.
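One way to keep a fallback honest is to return the degraded flag alongside the data, so the caller cannot mistake stale results for fresh ones. The sketch below assumes a hypothetical `get_recommendations` function, a hypothetical `DependencyUnavailable` error standing in for whatever your breaker raises, and a plain in-memory dict as the cache.

```python
class DependencyUnavailable(Exception):
    """Stands in for whatever error an open breaker or failed call raises."""


_last_good = {}  # hypothetical in-memory cache of last successful responses


def get_recommendations(user_id, fetch):
    """Return (items, degraded) so callers can tell fresh data from fallback data."""
    try:
        items = fetch(user_id)
    except DependencyUnavailable:
        # Honest fallback: serve the last good value, or an empty list, and flag it.
        return _last_good.get(user_id, []), True
    _last_good[user_id] = items
    return items, False
```

The `degraded` flag is what lets the UI show a "results may be incomplete" notice instead of pretending nothing is wrong.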
Three details routinely get missed:
Per-dependency breakers, not one global breaker. Each remote dependency needs its own breaker instance. A shared breaker means a flaky recommendations service can trip the circuit for your healthy payments service. Isolate them.
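Isolation can be as simple as a registry keyed by dependency name. In this sketch, `Breaker` is a deliberately stripped-down stand-in (it only counts failures) and `BreakerRegistry` is a hypothetical helper; the point is only that each name gets its own instance and its own counters.

```python
class Breaker:
    """Stripped-down stand-in for a real breaker: opens after N failures."""

    def __init__(self, name, failure_threshold=3):
        self.name = name
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.failure_threshold

    def record_failure(self):
        self.failures += 1


class BreakerRegistry:
    """Hands out one breaker per dependency name, created on first use."""

    def __init__(self):
        self._breakers = {}

    def for_dependency(self, name):
        if name not in self._breakers:
            self._breakers[name] = Breaker(name)
        return self._breakers[name]


registry = BreakerRegistry()

# The flaky service trips its own breaker...
for _ in range(3):
    registry.for_dependency("recommendations").record_failure()

# ...while the healthy one is unaffected.
payments_open = registry.for_dependency("payments").is_open
```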
Breakers pair with timeouts, they don’t replace them. A breaker counts failures, but a call has to fail before it counts. Without an aggressive timeout, your first wave of calls still blocks for the full 30 seconds before the breaker has anything to react to. The timeout produces the failures; the breaker reacts to the pattern of them.
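The division of labor can be shown with a small wrapper. `call_with_timeout` and `slow_dependency` below are hypothetical, and the tiny durations stand in for the 30-second case. Note the limitation called out in the comment: this wrapper releases the caller but does not cancel the underlying work, so in real code the client library's own connect and read timeouts are usually the better tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_seconds):
    """Wait at most timeout_seconds for fn, then raise TimeoutError."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The caller is released here; the worker thread still runs fn to the end.
        raise TimeoutError(f"no response within {timeout_seconds}s") from None


def slow_dependency():
    time.sleep(0.2)  # stands in for a dependency that answers far too late
    return "late"


failures = 0
try:
    call_with_timeout(slow_dependency, timeout_seconds=0.05)
except TimeoutError:
    failures += 1  # this is the failure a breaker would get to count
```

Without the timeout, the call would simply return "late" and the breaker would record a success.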
Emit metrics on state transitions. Every open/half-open/closed transition should be a logged event and a metric. A breaker that trips and recovers silently robs you of the single clearest signal that a dependency is struggling. The transition log is often the first thing that tells you which downstream broke.
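A minimal sketch of that instrumentation, assuming a hypothetical `ObservedState` holder and using a `Counter` as a stand-in for a real metrics client. Established libraries typically expose their own event hooks for this, so treat the code as the shape of the idea rather than something to copy.

```python
import logging
from collections import Counter

logger = logging.getLogger("circuit_breaker")
transition_counts = Counter()  # stand-in for a real metrics client


class ObservedState:
    """Holds a breaker's state and reports every change of it."""

    def __init__(self, name):
        self.name = name
        self.state = "closed"

    def transition_to(self, new_state):
        if new_state == self.state:
            return  # not a transition; nothing to report
        old_state, self.state = self.state, new_state
        # One log line and one counter increment per transition.
        logger.warning("breaker %s: %s -> %s", self.name, old_state, new_state)
        transition_counts[(self.name, old_state, new_state)] += 1
```

Keying the counter by breaker name as well as by the old and new state is what lets a dashboard answer "which downstream broke" at a glance.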
If you implement this yourself, you’ll reinvent these one painful incident at a time. Reach for an established library — Resilience4j on the JVM, Polly in .NET, or the breaker built into your service mesh (Istio and Envoy both do this at the proxy layer, no code change) — before you hand-roll a state machine.
The pattern is small, but it changes the failure mode of a distributed system from “one slow dependency takes everything down” to “one dependency degrades and the rest keeps serving.” That’s the entire reason it’s worth the bookkeeping.