Circuit breaker policy
The circuit breaker cordons off an unhealthy upstream so failures stop wasting latency on a bad endpoint. It has three states: closed (normal traffic), open (instant rejection), and half-open (limited probe traffic to test recovery). The policy is only valid at the upstream level — at the network level, the selection policy already routes around open breakers.
Full configuration
```yaml
projects:
  - id: main
    upstreams:
      - id: my-upstream
        endpoint: https://rpc.example.com
        failsafe:
          - matchMethod: "*" # applies to all methods; narrow with "!eth_call" to exclude noisy ones
            # matchFinality: [finalized] # optional: scope to specific finality states
            circuitBreaker:
              failureThresholdCount: 160    # trip after 160 failures...
              failureThresholdCapacity: 200 # ...out of the most recent 200 requests (80% failure rate)
              halfOpenAfter: 5m             # stay open for 5 min before allowing probes
              successThresholdCount: 3      # close again after 3 consecutive successes...
              successThresholdCapacity: 10  # ...within a 10-sample probe window
```

Use `matchMethod` and `matchFinality` to scope a breaker to a specific method group. An upstream can have multiple `failsafe[]` entries — each creates an independent breaker with its own state and counters.
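For example, here is a sketch of one upstream split into two independent breakers; the thresholds are illustrative, not recommendations:

```yaml
failsafe:
  # Breaker 1: every method except eth_call, trips at an 80% failure rate
  - matchMethod: "!eth_call"
    circuitBreaker:
      failureThresholdCount: 160
      failureThresholdCapacity: 200
  # Breaker 2: eth_call only. Reverts make errors routine here, so tolerate
  # a much higher failure rate before cordoning the upstream.
  - matchMethod: "eth_call"
    circuitBreaker:
      failureThresholdCount: 190
      failureThresholdCapacity: 200
```

Each entry keeps its own rolling counters, so a spike of `eth_call` reverts cannot open the breaker for the other methods.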
How it works
State machine
- Closed (normal) — every request goes through. Failures and successes both count toward the rolling sample window. When `failureThresholdCount` failures land within the most recent `failureThresholdCapacity` samples, the breaker transitions to open.
- Open — every request is rejected immediately with a breaker-open error. The network's selection policy routes to the next-best upstream. The breaker stays open for the `halfOpenAfter` duration.
- Half-open — after `halfOpenAfter` elapses, the next request is allowed through as a probe. Probes continue (concurrently, capped at `successThresholdCapacity`) until `successThresholdCount` succeed, then the breaker transitions back to closed. Any failure during half-open transitions back to open and restarts the `halfOpenAfter` timer.
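Each transition maps onto one or two config fields. A conservative tuning, annotated per transition (the values here are illustrative):

```yaml
circuitBreaker:
  failureThresholdCount: 160    # closed -> open: 160 failures...
  failureThresholdCapacity: 200 # ...within the most recent 200 samples
  halfOpenAfter: 10m            # open -> half-open: wait 10 minutes before probing
  successThresholdCount: 3      # half-open -> closed: 3 probe successes...
  successThresholdCapacity: 10  # ...within a 10-sample probe window
  # half-open -> open: any single probe failure, which restarts the 10m timer
```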
Rolling windows
Both `failureThreshold*` and `successThreshold*` use sample counters, not time windows. A `160/200` failure threshold means "trip when 160 of the most recent 200 outcomes were failures" — regardless of how long that took. For sparse traffic the window naturally extends over longer wall-clock time.
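The ratio sets the trip sensitivity; the window size sets how much evidence is required. Two hypothetical configurations with the same 80% ratio behave very differently:

```yaml
# Small window: can trip after just 8 failures (fast reaction, noisy on low traffic)
circuitBreaker:
  failureThresholdCount: 8
  failureThresholdCapacity: 10
---
# Large window: same 80% ratio, but 160 failures must accumulate (smoother
# signal; on sparse traffic this window may span hours of wall-clock time)
circuitBreaker:
  failureThresholdCount: 160
  failureThresholdCapacity: 200
```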
What counts as a failure
The breaker uses the same retryable-error classification as retry: HTTP 5xx, 408, 429, network errors, transport timeouts, and missing-data conditions. HTTP 4xx (non-408/429) and successful responses — including legitimate empty results — do not count as failures.
Hedge interaction
Hedge attempts are never counted toward the breaker's failure or success windows. Speculation would distort the health signal — if the primary is just slow but eventually succeeds, the hedge race itself should not open the breaker.
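Both policies can sit on the same `failsafe[]` entry. A minimal sketch, assuming the hedge policy's `delay` and `maxCount` fields (check the hedge policy docs; the values here are illustrative):

```yaml
failsafe:
  - matchMethod: "*"
    hedge:
      delay: 500ms  # fire a speculative second attempt after 500ms of silence
      maxCount: 1
    circuitBreaker:
      failureThresholdCount: 160    # counts primary outcomes only;
      failureThresholdCapacity: 200 # hedge results never enter this window
```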
Network-level behavior
When an upstream's breaker opens, the network's selection policy marks that upstream as cordoned (the `erpc_upstream_cordoned` gauge reports `1`) and treats it as inactive until the breaker closes. The network routes to the next healthy upstream automatically — no manual intervention required.
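A minimal two-upstream sketch (the endpoints are placeholders): while `primary`'s breaker is open, the selection policy sends everything to `fallback`, and probe traffic restores `primary` automatically once it recovers.

```yaml
upstreams:
  - id: primary
    endpoint: https://rpc-primary.example.com
    failsafe:
      - matchMethod: "*"
        circuitBreaker:
          failureThresholdCount: 160
          failureThresholdCapacity: 200
          halfOpenAfter: 5m
  - id: fallback
    endpoint: https://rpc-fallback.example.com
    # no breaker configured: absorbs all traffic while "primary" is cordoned
```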
Defaults
All fields have defaults, applied when `circuitBreaker: {}` is set. The breaker is opt-in — you must explicitly add a `circuitBreaker` block for it to take effect.
| Field | Default | Notes |
|---|---|---|
| `failureThresholdCount` | 20 | Failures required to open. |
| `failureThresholdCapacity` | 80 | Rolling sample window size. Default ratio: 25% failure rate. |
| `halfOpenAfter` | 5m | How long to stay open before probing. |
| `successThresholdCount` | 8 | Successes required to close from half-open. |
| `successThresholdCapacity` | 10 | Probe window in half-open state. |
The default thresholds (20/80 = 25% failure rate) are deliberately sensitive. Tune the ratio upward (e.g. `160/200` = 80%) for upstreams that experience legitimate transient errors under load.
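Because every field has a default, the smallest opt-in is an empty block:

```yaml
failsafe:
  - matchMethod: "*"
    circuitBreaker: {} # defaults apply: 20/80 failures, 5m half-open, 8/10 successes
```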
Gotchas
- Network-level `circuitBreaker` is silently ignored — the config parser accepts it, but it has no effect at network scope. Only upstream-level breakers trip.
- Threshold ratio, not absolute count — set `failureThresholdCount` and `failureThresholdCapacity` together. `160/200` = 80% failure rate; `1/10` = 10%, which is much more sensitive to any fault.
- `halfOpenAfter` too short — the upstream gets probed before it has time to recover, immediately opens again, and oscillates. Start at `5m` for real outages; lower it only when tuning in a staging environment.
- Per-method scoping for noisy methods — if a method legitimately returns errors often (e.g. `eth_call` on contracts that revert), use `matchMethod: "!eth_call"` so those failures don't open the breaker for unrelated traffic.
- Hedge attempts are excluded from the failure window — by design. Counting hedge outcomes would cause false positives on slow-but-functional upstreams.
- Open breaker doesn't disable scoring — the upstream's latency and health scores continue to be tracked (it just receives zero traffic while open). On close, scoring resumes from where it left off.
- One breaker per `failsafe[]` entry — without `matchMethod`/`matchFinality` scoping, a single bad method group opens the breaker for the entire upstream.
If you use both `retry` and `circuitBreaker` on the same upstream entry, retries happen first. A failed retry sequence counts as a single failure against the breaker — not one failure per attempt.
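A sketch of that combination, assuming the retry policy's `maxAttempts` field (check the retry docs; the values are illustrative):

```yaml
failsafe:
  - matchMethod: "*"
    retry:
      maxAttempts: 3 # all attempts for one request may fail...
    circuitBreaker:
      failureThresholdCount: 160    # ...yet the breaker records a single failure
      failureThresholdCapacity: 200
```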
Metrics
- `erpc_upstream_cordoned` (gauge) — `0` = active, `1` = cordoned by an open breaker or the selection policy.
- `erpc_upstream_request_total{outcome="breaker_open"}` (counter) — rejections while the breaker is open.
PromQL — alert when any upstream has been cordoned for more than 10 minutes:
```promql
avg_over_time(erpc_upstream_cordoned[10m]) > 0.95
```
CircuitBreakerPolicyConfig — every field
| Field | Type | Default | Notes |
|---|---|---|---|
| `failureThresholdCount` | uint | 20 | Number of failures within the rolling window required to open the breaker. |
| `failureThresholdCapacity` | uint | 80 | Size of the rolling sample window (total outcomes tracked). Trip ratio = `failureThresholdCount / failureThresholdCapacity`. |
| `halfOpenAfter` | Duration | 5m | How long to remain in the open state before transitioning to half-open and allowing probe requests. |
| `successThresholdCount` | uint | 8 | Number of successful probe responses required to close the breaker from half-open. |
| `successThresholdCapacity` | uint | 10 | Size of the probe window in half-open state. Any failure resets the probe counter and transitions back to open. |
Only valid at the upstream level. Setting `circuitBreaker` on a network `failsafe[]` entry is accepted by the config parser but has no runtime effect.
Hedge attempts are never counted toward the breaker's failure or success windows — they are speculative fan-out and would otherwise distort the health signal.
State transitions
| From | To | Trigger |
|---|---|---|
| Closed | Open | `failureThresholdCount` failures within `failureThresholdCapacity` samples. |
| Open | Half-open | `halfOpenAfter` duration elapses. |
| Half-open | Closed | `successThresholdCount` successes within `successThresholdCapacity` probe samples. |
| Half-open | Open | Any single failure during probing; the `halfOpenAfter` timer restarts. |
What counts as a failure
Same classifier as retry: HTTP 5xx, 408, 429, network/transport errors, timeouts, and missing-data conditions. HTTP 4xx (non-408/429) and successful responses — including empty results — are not failures.
Scoping
Use `matchMethod` and `matchFinality` to create independent breakers per method group on the same upstream. Each `failsafe[]` entry with a `circuitBreaker` block maintains its own state machine and counters.
See also
- Failsafe overview — scoping rules and per-attempt observability
- Retry — the breaker uses the same "what's a failure" classifier
- Selection policies — how the network routes around an open breaker
- Production guidelines — recommended thresholds for production deployments