Circuit breaker

In eRPC, taking a bad upstream out of rotation is the Selection Policy's job, not a separate failsafe policy. One unified mechanism covers every kind of "bad": failing, slow, throttled, lagging. Exclusion is decided at the network level, so there is no separate per-upstream state machine to configure.

How it works

The selection policy runs on a tick (default every 1s) and returns the ordered list of upstreams eligible to serve requests on that network. Order is law; missing means excluded. Each tick the policy can drop an upstream based on any signal in its rolling-window health metrics:

upstreams
  .excludeIf(all(samplesAbove(10), errorRateAbove(0.7)))                     // failing
  .excludeIf(all(samplesAbove(10), throttleRateAbove(0.4)))                  // throttled
  .excludeIf(any(all(samplesAbove(20), latencyDeviationAbove(3)),            // slow
                 latencyAbove(30_000)))
  .excludeIf(any(blockNumberLagAbove(16), blockSecondsLagAbove(30)))         // lagging
  ...
  .probeExcluded({ sampleRate: 0.1, minSamples: 10 })                        // shadow-mirror real traffic at excluded upstreams

An excluded upstream stays out until its tracker metrics improve below the same excludeIf thresholds — there is no separate readmit timer. The probeExcluded step subscribes the network's probe subsystem to the request feed; the prober shadow-mirrors a sampled stream of real requests against any currently-excluded upstream, feeding the same tracker counters as real traffic. Once those counters cross back below the threshold, the upstream falls out of the excluded set on the next tick and re-enters rotation.
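The recovery path described above can be sketched with a small rolling-window model. This is an illustration only, not eRPC's tracker: RollingTracker, isExcluded, the 60 s window, and the timestamps are all hypothetical, and the sample guard is read as "at least 10" to match the default thresholds quoted on this page.

```typescript
// Hypothetical model of a rolling-window tracker; not eRPC internals.
type Outcome = { ok: boolean; atMs: number };

class RollingTracker {
  private outcomes: Outcome[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Real traffic and mirrored probe traffic both land here.
  record(ok: boolean, nowMs: number): void {
    this.outcomes.push({ ok, atMs: nowMs });
  }

  // Drop outcomes older than the window, then report samples and error rate.
  snapshot(nowMs: number): { samples: number; errorRate: number } {
    this.outcomes = this.outcomes.filter((o) => nowMs - o.atMs < this.windowMs);
    const samples = this.outcomes.length;
    const errors = this.outcomes.filter((o) => !o.ok).length;
    return { samples, errorRate: samples === 0 ? 0 : errors / samples };
  }
}

// Stand-in for all(samplesAbove(10), errorRateAbove(0.7)).
function isExcluded(t: RollingTracker, nowMs: number): boolean {
  const { samples, errorRate } = t.snapshot(nowMs);
  return samples >= 10 && errorRate > 0.7;
}

const tracker = new RollingTracker(60_000); // window length is an assumption

// Twenty failed real requests: the upstream crosses the threshold.
for (let i = 0; i < 20; i++) tracker.record(false, i * 1_000);
const excludedAfterFailures = isExcluded(tracker, 20_000);

// While excluded, only mirrored probes reach it; they succeed once it recovers.
for (let i = 0; i < 20; i++) tracker.record(true, 30_000 + i * 1_000);
const excludedAfterProbes = isExcluded(tracker, 50_000);
```

In this model there is no readmit timer anywhere: the same predicate that excluded the upstream stops matching once probe successes dilute the error rate, which is the behavior the paragraph above describes.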

This is the default policy: you get it by omitting selectionPolicy entirely. The default thresholds are tuned for production use: error rate above 70 % or throttle rate above 40 %, each gated on ≥ 10 samples; p70 latency above 30 s (no sample guard) or above 3× the fastest peer, the latter gated on ≥ 20 samples; head lag above 16 blocks or 30 s. The samplesAbove(N) guards on the error, throttle, and relative-latency predicates exist so that a single failing call on a fresh-pod tracker (errorRate = 1/1 = 1.0) cannot cascade-evict every upstream before the rolling window has a meaningful denominator.
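To make the fresh-pod case concrete, here is a minimal sketch of why the sample guard matters. The combinators below are simplified stand-ins that borrow the names samplesAbove, errorRateAbove, and all; their real signatures and the exact comparison (strictly above versus at least) are assumptions.

```typescript
// Simplified stand-ins for the policy predicates; not the real implementations.
type Metrics = { samples: number; errorRate: number };
type Predicate = (m: Metrics) => boolean;

const samplesAbove = (n: number): Predicate => (m) => m.samples >= n;
const errorRateAbove = (r: number): Predicate => (m) => m.errorRate > r;
const all = (...ps: Predicate[]): Predicate => (m) => ps.every((p) => p(m));

// A tracker on a fresh pod that has seen exactly one call, and it failed.
const freshPod: Metrics = { samples: 1, errorRate: 1 / 1 };

// Without the guard, one failure is enough to trip the rule.
const unguarded = errorRateAbove(0.7)(freshPod);

// With the guard, the rule waits until the window holds enough samples.
const guarded = all(samplesAbove(10), errorRateAbove(0.7))(freshPod);
```

Here unguarded evaluates to true and guarded to false, so only the unguarded rule would evict the upstream on its first failed call.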

State, mapped to selection-policy concepts

Concept	Surfaced as
Closed (upstream healthy)	Present in the eval's returned ordered list. `erpc_selection_position{upstream}` is `0` (primary) or `1+` (runner-up).
Open (upstream excluded)	Missing from the returned list. `erpc_selection_position{upstream}` is `-1`. The reason ("errorRate>0.7", "p70>30000ms", "p70>3xFastest", ...) appears on the per-tick `Decision.Output.Excluded[]` entry (as `Reason` for display + `LeafReasons[]` slugs that drive `selection_exclusion_total{reason}`) and in DEBUG logs.
Half-open (probing recovery)	`probeExcluded` shadow-mirrors sampled real traffic to currently-excluded upstreams in the background. The mirrored calls feed the same tracker counters as real traffic; once they cross back below the `excludeIf` thresholds, the upstream falls out of the excluded set on the next tick. No real user traffic is served by an excluded upstream during this — the probing is entirely behind the scenes. Per-upstream opt-out via `routing.probe: 'off'`.

Tuning the thresholds

Loosen, tighten, or add signals by overriding selectionPolicy.evalFunc on the network. Example — strict latency rule (p95 > 10 s on any sample count) for a workload that ALWAYS needs sub-10s responses:

(upstreams, ctx) =>
  upstreams
    .removeCordoned()
    .excludeIf(all(samplesAbove(10), errorRateAbove(0.5)))   // stricter error ceiling
    .excludeIf(all(samplesAbove(10), throttleRateAbove(0.3)))
    .excludeIf(latencyAbove(10_000, 95))                     // hard p95 ceiling, no sample guard
    .whenEmpty(() => upstreams)
    .sortByScore(PREFER_FASTEST)
    .stickyPrimary({ hysteresis: 0.3, minSwitchInterval: '30s' })
    .probeExcluded({ sampleRate: 0.5, maxConcurrent: 2, timeout: '10s' })

See Selection policies for the full chainable vocabulary and every predicate factory.

Manual exclusion

For operator-driven exclusion that doesn't depend on metrics — "this vendor's status page just went red, take it out now", planned maintenance windows, forced failover testing — use the admin cordon RPC. Cordon is independent of the metric-driven excludeIf chain: it sets a sticky flag the policy's .removeCordoned() step honors, and it stays until you uncordon.
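The independence of cordoning from metrics can be illustrated with a toy model. The Upstream shape, the cordoned field, and both filter functions are hypothetical; the sketch models only the effect of the sticky flag, not the admin RPC or its request format.

```typescript
// Toy model: a cordon flag and a health metric are checked by separate steps.
type Upstream = { id: string; errorRate: number; cordoned: boolean };

// Stand-in for the .removeCordoned() step: looks only at the sticky flag.
function removeCordoned(list: Upstream[]): Upstream[] {
  return list.filter((u) => !u.cordoned);
}

// Stand-in for a metric-driven excludeIf step: looks only at health.
function excludeFailing(list: Upstream[]): Upstream[] {
  return list.filter((u) => !(u.errorRate > 0.7));
}

const upstreams: Upstream[] = [
  { id: "vendor-a", errorRate: 0.0, cordoned: true }, // healthy, but cordoned by an operator
  { id: "vendor-b", errorRate: 0.9, cordoned: false }, // failing on metrics
  { id: "vendor-c", errorRate: 0.1, cordoned: false }, // healthy and not cordoned
];

const eligible = excludeFailing(removeCordoned(upstreams)).map((u) => u.id);
```

Only vendor-c survives both steps. vendor-a stays out even with perfect metrics, because nothing in the metric path clears the flag.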

Observability

Where	Signal
Prometheus	`erpc_selection_position{project,network,method,upstream}` — `0` = primary, `1+` = runner-up, `-1` = excluded. `erpc_selection_rejection_total{...,step}` — per-step rejection counter. `erpc_selection_primary_switch_total{...,from,to}` — primary changes.
OTLP	Per-tick eval span on the network with selected upstream + tick id.
Logs (DEBUG)	One line per `excludeIf` rejection with the upstream id and the predicate's `policyReason` (e.g. `errorRate>0.5`, `p70>30000ms`, `p70>3xFastest`, `blockHeadLag>16`).
Simulator	The policy-history pane shows the per-step trail (every chainable method's input → output → dropped/added/reordered) with timestamps.
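As one way to consume the position metric, the sketch below parses Prometheus text exposition lines for erpc_selection_position and lists the upstreams currently at -1. The metric and label names come from the table above; the sample label values, the parsePositions helper, and the PositionSample type are made up for illustration.

```typescript
type PositionSample = { network: string; upstream: string; position: number };

// Extract erpc_selection_position samples from Prometheus text exposition.
function parsePositions(exposition: string): PositionSample[] {
  const samples: PositionSample[] = [];
  for (const rawLine of exposition.split("\n")) {
    const line = rawLine.trim();
    const m = /^erpc_selection_position\{([^}]*)\}\s+(-?\d+(?:\.\d+)?)/.exec(line);
    if (m === null) continue;
    const [, labelText = "", valueText = ""] = m;

    // Collect key="value" label pairs.
    const labels: Record<string, string> = {};
    const labelRe = /(\w+)="([^"]*)"/g;
    let pair: RegExpExecArray | null;
    while ((pair = labelRe.exec(labelText)) !== null) {
      const [, key = "", value = ""] = pair;
      labels[key] = value;
    }

    samples.push({
      network: labels["network"] ?? "",
      upstream: labels["upstream"] ?? "",
      position: Number(valueText),
    });
  }
  return samples;
}

// Illustrative scrape output; label values are invented.
const scrape = [
  'erpc_selection_position{project="main",network="evm:1",method="*",upstream="up-a"} 0',
  'erpc_selection_position{project="main",network="evm:1",method="*",upstream="up-b"} 1',
  'erpc_selection_position{project="main",network="evm:1",method="*",upstream="up-c"} -1',
].join("\n");

// -1 means excluded (the "open" state in circuit-breaker terms).
const excluded = parsePositions(scrape)
  .filter((s) => s.position === -1)
  .map((s) => s.upstream);
```

With the sample scrape, excluded contains only up-c. The same filter on position 0 would give the current primary per network.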

The cordon admin endpoint (erpc_cordonUpstream) is the ONLY way to manually exclude an upstream. Everything else — health-based exclusion, recovery probing, primary changes — lives in the selection policy and is driven by live metrics. See Selection policies for the full picture and Cordoning for the operator runbook.
