# Circuit breaker

> Source: https://docs.erpc.cloud/config/failsafe/circuit-breaker
> Temporarily remove an upstream from rotation after sustained failure — three-state breaker with rolling-window thresholds.
> Format: machine-readable markdown export of the docs page above.
> All collapsible AI sections are inlined and fully expanded.

# Circuit breaker policy

The circuit breaker cordons off an unhealthy upstream so requests stop wasting latency on a bad endpoint. Three states: **closed** (normal traffic), **open** (instant rejection), **half-open** (limited probe traffic to test recovery).

Only valid at the **upstream** level — at the network level, the selection policy already routes around open breakers.

## Full configuration

**Config path:** `projects > upstreams[] > failsafe[] > circuitBreaker`

**YAML — `erpc.yaml`:**

```yaml
projects:
  - id: main
    upstreams:
      - id: my-upstream
        endpoint: https://rpc.example.com
        failsafe:
          - matchMethod: "*" # applies to all methods; narrow with "!eth_call" to exclude noisy ones
            # matchFinality: [finalized] # optional: scope to specific finality states
            circuitBreaker:
              failureThresholdCount: 160    # trip after 160 failures...
              failureThresholdCapacity: 200 # ...out of the most recent 200 requests (80% failure rate)
              halfOpenAfter: 5m             # stay open for 5 min before allowing probes
              successThresholdCount: 3      # close again after 3 successes...
              successThresholdCapacity: 10  # ...within a 10-sample probe window
```

**TypeScript — `erpc.ts`:**

```typescript
import { createConfig } from "@erpc-cloud/config";

export default createConfig({
  projects: [
    {
      id: "main",
      upstreams: [
        {
          id: "my-upstream",
          endpoint: "https://rpc.example.com",
          failsafe: [
            {
              matchMethod: "*", // applies to all methods; narrow with "!eth_call" to exclude noisy ones
              // matchFinality: ["finalized"], // optional: scope to specific finality states
              circuitBreaker: {
                failureThresholdCount: 160, // trip after 160 failures...
                failureThresholdCapacity: 200, // ...out of the most recent 200 requests (80% failure rate)
                halfOpenAfter: "5m", // stay open for 5 min before allowing probes
                successThresholdCount: 3, // close again after 3 successes...
                successThresholdCapacity: 10, // ...within a 10-sample probe window
              },
            },
          ],
        },
      ],
    },
  ],
});
```

Use `matchMethod` and `matchFinality` to scope a breaker to a specific method group. An upstream can have multiple `failsafe[]` entries — each creates an independent breaker with its own state and counters.

## How it works

### State machine

1. **Closed** (normal) — every request goes through. Failures and successes both count toward the rolling sample window. When `failureThresholdCount` failures land within the most recent `failureThresholdCapacity` samples, the breaker transitions to **open**.
2. **Open** — every request is rejected immediately with a breaker-open error. The network's selection policy routes to the next-best upstream. The breaker stays open for the `halfOpenAfter` duration.
3. **Half-open** — after `halfOpenAfter` elapses, the next request is allowed through as a probe. Probes continue (concurrently, capped at `successThresholdCapacity`) until `successThresholdCount` succeed — then the breaker transitions back to **closed**. Any failure during half-open transitions back to **open** and restarts the `halfOpenAfter` timer.

### Rolling windows

Both `failureThreshold*` and `successThreshold*` use sample counters, not time windows. A 160/200 failure threshold means "trip when 160 of the most recent 200 outcomes were failures" — regardless of how long that took. For sparse traffic the window naturally extends over a longer wall-clock period.
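To make the sample-window semantics concrete, here is a minimal sketch in plain TypeScript. It is not eRPC's implementation: all names are illustrative, and it simplifies the half-open probe window to a consecutive-success counter, since any probe failure reopens the breaker immediately anyway.

```typescript
// Minimal sketch of the rolling-window trip logic and the three-state
// machine described above. Illustrative only — not eRPC's internals.

type BreakerState = "closed" | "open" | "half-open";

class RollingWindowBreaker {
  private state: BreakerState = "closed";
  private samples: boolean[] = []; // newest last; true = failure
  private probeSuccesses = 0;
  private openedAt = 0; // epoch ms of the last transition to open

  constructor(
    private readonly failureThresholdCount: number,    // e.g. 160
    private readonly failureThresholdCapacity: number, // e.g. 200
    private readonly halfOpenAfterMs: number,          // e.g. 5 * 60_000
    private readonly successThresholdCount: number,    // e.g. 3
  ) {}

  /** Call before sending a request. False means reject instantly. */
  allowRequest(now = Date.now()): boolean {
    if (this.state === "open" && now - this.openedAt >= this.halfOpenAfterMs) {
      this.state = "half-open"; // timer elapsed: admit probe traffic
      this.probeSuccesses = 0;
    }
    return this.state !== "open";
  }

  /** Call with the outcome of every admitted request. */
  record(failure: boolean, now = Date.now()): void {
    if (this.state === "half-open") {
      if (failure) {
        this.open(now); // any probe failure reopens and restarts the timer
      } else if (++this.probeSuccesses >= this.successThresholdCount) {
        this.state = "closed"; // recovered
        this.samples = [];
      }
      return;
    }
    // closed: push the outcome into the rolling sample window
    this.samples.push(failure);
    if (this.samples.length > this.failureThresholdCapacity) this.samples.shift();
    const failures = this.samples.filter(Boolean).length;
    if (failures >= this.failureThresholdCount) this.open(now);
  }

  private open(now: number): void {
    this.state = "open";
    this.openedAt = now;
  }
}

// The 160/200 example from the config above: trips only once 80% of the
// most recent 200 outcomes are failures, however long that takes.
const breaker = new RollingWindowBreaker(160, 200, 5 * 60_000, 3);
```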
### What counts as a failure

The breaker uses the same retryable-error classification as `retry`: HTTP 5xx, 408, 429, network errors, transport timeouts, and missing-data conditions. HTTP 4xx (non-408/429) and successful responses — including legitimate empty results — do **not** count as failures.

### Hedge interaction

Hedge attempts are **never counted** toward the breaker's failure or success windows. Speculation would distort the health signal — if the primary is just slow but eventually succeeds, the hedge race itself should not open the breaker.

### Network-level behavior

When an upstream's breaker opens, the network's selection policy marks that upstream as cordoned (the `erpc_upstream_cordoned` gauge goes to `1`) and treats it as inactive until the breaker closes. The network routes to the next healthy upstream automatically — no manual intervention required.

## Defaults

All fields have defaults applied when `circuitBreaker: {}` is set. The breaker is opt-in — you must explicitly add a `circuitBreaker` block for it to take effect.

| Field | Default | Notes |
|---|---|---|
| `failureThresholdCount` | `20` | Failures required to open. |
| `failureThresholdCapacity` | `80` | Rolling sample window size. Default ratio: 25% failure rate. |
| `halfOpenAfter` | `5m` | How long to stay open before probing. |
| `successThresholdCount` | `8` | Successes required to close from half-open. |
| `successThresholdCapacity` | `10` | Probe window in half-open state. |

> **INFO**
> The default thresholds (20/80 = 25% failure rate) are deliberately sensitive. Tune upward (`160/200`) for upstreams that experience legitimate transient errors under load.

## Gotchas

- **Network-level `circuitBreaker` is silently ignored** — the config parser accepts it but it has no effect at network scope. Only upstream-level breakers trip.
- **Threshold ratio, not absolute count** — set `failureThresholdCount` and `failureThresholdCapacity` together. `160/200` = 80% failure rate. `1/10` = 10% — much more sensitive to any fault.
- **`halfOpenAfter` too short** — the upstream gets probed before it has time to recover, immediately opens again, and oscillates. Start at `5m` for real outages; lower only when tuning in a staging environment.
- **Per-method scoping for noisy methods** — if a method legitimately returns errors often (e.g. `eth_call` on contracts that revert), use `matchMethod: "!eth_call"` so those failures don't open the breaker for unrelated traffic (sketched after this list).
- **Hedge attempts are excluded** from the failure window — by design. Counting hedge outcomes would cause false positives on slow-but-functional upstreams.
- **Open breaker doesn't disable scoring** — the upstream's latency and health scores continue to be tracked (it just receives zero traffic while open). On close, scoring resumes from where it left off.
- **One breaker per `failsafe[]` entry** — without `matchMethod`/`matchFinality` scoping, a single bad method group opens the breaker for the entire upstream.

> **WARNING**
> If you use both `retry` and `circuitBreaker` on the same upstream entry, retries happen first. A failed retry sequence counts as a single failure against the breaker — not one failure per attempt. See the second sketch below.
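The per-method scoping gotcha, as a hedged config sketch. The endpoint and thresholds are placeholders; it assumes two `failsafe[]` entries with complementary `matchMethod` patterns, each maintaining its own breaker as described above. Omitted breaker fields fall back to the defaults table.

```typescript
import { createConfig } from "@erpc-cloud/config";

// Illustrative only: two independent breakers on one upstream, so
// revert-heavy eth_call traffic cannot open the breaker for everything else.
export default createConfig({
  projects: [{
    id: "main",
    upstreams: [{
      id: "my-upstream",
      endpoint: "https://rpc.example.com",
      failsafe: [
        {
          matchMethod: "!eth_call", // everything except eth_call
          circuitBreaker: {
            failureThresholdCount: 20,
            failureThresholdCapacity: 80, // sensitive default ratio (25%)
            halfOpenAfter: "5m",
          },
        },
        {
          matchMethod: "eth_call", // noisy method gets its own, looser breaker
          circuitBreaker: {
            failureThresholdCount: 160,
            failureThresholdCapacity: 200, // tolerate up to 80% failures
            halfOpenAfter: "5m",
          },
        },
      ],
    }],
  }],
});
```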
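And the retry interaction from the warning, sketched. Treat the `retry` field names as assumptions here (the Retry docs page is authoritative); the point is the arithmetic in the comments.

```typescript
import { createConfig } from "@erpc-cloud/config";

export default createConfig({
  projects: [{
    id: "main",
    upstreams: [{
      id: "my-upstream",
      endpoint: "https://rpc.example.com",
      failsafe: [{
        matchMethod: "*",
        // Retries run first; an exhausted retry sequence is ONE breaker sample.
        retry: { maxAttempts: 3 },
        circuitBreaker: {
          failureThresholdCount: 20,    // 20 exhausted sequences (up to 60 raw
          failureThresholdCapacity: 80, // failed attempts) before the breaker opens
        },
      }],
    }],
  }],
});
```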
## Metrics

- `erpc_upstream_cordoned` (gauge) — `0` = active, `1` = cordoned by open breaker or selection policy.
- `erpc_upstream_request_total{outcome="breaker_open"}` (counter) — rejections while the breaker is open.

PromQL — alert when any upstream has been cordoned for more than 10 minutes:

```promql
avg_over_time(erpc_upstream_cordoned[10m]) > 0.95
```

### `CircuitBreakerPolicyConfig` — every field

| Field | Type | Default | Notes |
|---|---|---|---|
| `failureThresholdCount` | uint | `20` | Number of failures within the rolling window required to open the breaker. |
| `failureThresholdCapacity` | uint | `80` | Size of the rolling sample window (total outcomes tracked). Trip ratio = `failureThresholdCount / failureThresholdCapacity`. |
| `halfOpenAfter` | Duration | `5m` | How long to remain in open state before transitioning to half-open and allowing probe requests. |
| `successThresholdCount` | uint | `8` | Number of successful probe responses required to close the breaker from half-open. |
| `successThresholdCapacity` | uint | `10` | Size of the probe window in half-open state. Any failure resets the probe counter and transitions back to open. |

Only valid at the **upstream** level. Setting `circuitBreaker` on a network `failsafe[]` entry is accepted by the config parser but has no runtime effect.

**Hedge attempts are never counted** toward the breaker's failure or success windows — they are speculative fan-out and would otherwise distort the health signal.

### State transitions

| From | To | Trigger |
|---|---|---|
| Closed | Open | `failureThresholdCount` failures within `failureThresholdCapacity` samples. |
| Open | Half-open | `halfOpenAfter` duration elapses. |
| Half-open | Closed | `successThresholdCount` successes within `successThresholdCapacity` probe samples. |
| Half-open | Open | Any single failure during probing; the `halfOpenAfter` timer restarts. |

### What counts as a failure

Same classifier as `retry`: HTTP 5xx, 408, 429, network/transport errors, timeout, missing-data conditions. HTTP 4xx (non-408/429) and successful responses — including empty results — are not failures.

### Scoping

Use `matchMethod` and `matchFinality` to create independent breakers per method group on the same upstream. Each `failsafe[]` entry with a `circuitBreaker` block maintains its own state machine and counters.

## See also

- [Failsafe overview](/config/failsafe.llms.txt) — scoping rules and per-attempt observability
- [Retry](/config/failsafe/retry.llms.txt) — the breaker uses the same "what's a failure" classifier
- [Selection policies](/config/projects/selection-policies.llms.txt) — how the network routes around an open breaker
- [Production guidelines](/operation/production.llms.txt) — recommended thresholds for production deployments