# Selection & scoring

> Source: https://docs.erpc.cloud/config/projects/selection-policies
> eRPC ranks your upstreams every 15 seconds using live health data — bad actors drop out automatically, the fastest healthy provider goes first, and re-admission is metric-driven, not timer-driven.
> Format: machine-readable markdown export of the docs page above.
> All collapsible AI sections are inlined and fully expanded.

Every 15 seconds eRPC scores all your upstreams using real error rates, latency quantiles, throttle rates, and block-head lag. The worst performers drop out automatically; a shadow prober keeps sampling them in the background so they rejoin the moment their numbers recover — no timer, no manual intervention. Your requests always flow to the fastest healthy provider available.

**What you get**

- Slow or flapping providers silently demoted without any config change
- Fallback tier activates only when healthy upstreams run out
- Sticky-primary routing prevents flapping between nearly-equal upstreams
- Dry-run `shadowExcludeIf` lets you test new exclusion rules before they bite
- Near-zero routing overhead: the request path does a single atomic load

## Quick taste

This is the **exact default policy that ships in eRPC** ([`internal/policy/default_policy.js`](https://github.com/erpc/erpc/blob/main/internal/policy/default_policy.js)) — you get this behavior with zero config. It's shown wired explicitly so you can use it as the starting point for your own tweaks:

**Config path:** `projects[].networks[].selectionPolicy`

**YAML — `erpc.yaml`:**

```yaml
projects:
  - id: main
    networks:
      - architecture: evm
        evm: { chainId: 1 }
        selectionPolicy:
          evalFunc: |
            (upstreams, ctx) => upstreams
              .removeCordoned()

              // Errors / throttle: drop upstreams that are clearly broken, gated
              // on samplesAbove(10) so a single failed call can't evict a
              // fresh-pod upstream. Thresholds are loose — failsafe (retry/hedge/
              // consensus) already absorbs occasional failures.
              .excludeIf(all(samplesAbove(10), errorRateAbove(0.7)))
              .excludeIf(all(samplesAbove(10), throttleRateAbove(0.4)))

              // Latency: only exclude when the upstream is BOTH absolutely slow
              // AND consistently slower than peers across most of what it
              // serves. Two-stage check:
              //
              //   1. `latencyAbove(3000)` — absolute-latency floor. Drops the
              //      deviation predicate entirely for upstreams below 3s p70.
              //      Sub-3s upstreams stay in rotation regardless of how they
              //      compare to a faster peer; `sortByScore(PREFER_FASTEST)` puts
              //      them later in the ordered list and hedge catches the slack
              //      on the request path. The 3s threshold is the "user-visible
              //      pain" boundary — anything below is recoverable via hedge,
              //      anything above starts hitting tail-latency SLOs.
              //
              //   2. `latencyDeviationAbove(3, majority)` — only AFTER passing
              //      the absolute floor, also check the upstream is consistently
              //      slower than peers. `mode: 'majority'` (more than half of the
              //      per-method comparisons exceed 3×) keeps the predicate robust
              //      against per-tick spikes on a single rare method that
              //      would otherwise push a geomean over the threshold. The
              //      lower multiplier (3 vs the older 10) is intentional — once
              //      we've established the upstream is absolutely slow, a 3×
              //      ratio is a strong "consistently behind peers" signal.
              //
              // Outer `latencyAbove(10_000)` is the catastrophic safety net
              // (>10s absolute, unconditional exclusion regardless of peers or
              // sample count).
              //
              // Outer `samplesAbove(20)` gates on aggregate counts so the
              // predicate doesn't even run on cold-start pods.
              .excludeIf(any(all(samplesAbove(20), latencyAbove(3000), latencyDeviationAbove(3, { mode: 'majority' })), latencyAbove(10_000)))

              // Block-head lag: drop if behind tip by ≥16 blocks or ≥30s.
              .excludeIf(any(blockNumberLagAbove(16), blockSecondsLagAbove(30)))

              // Outage safety net: if everyone failed the health excludes, fall
              // back to the raw set rather than failing closed.
              .whenEmpty(() => upstreams)

              // Tier split: prefer non-fallback; fall back to tier:fallback if no
              // primary survives.
              .preferTag('!tier:fallback', { minHealthy: 1, fallback: 'tier:fallback' })

              // Rank survivors by p70 latency.
              .sortByScore(PREFER_FASTEST)

              // Hold the primary stable across ticks unless a meaningfully better
              // option exists for ≥30s.
              .stickyPrimary({ hysteresis: 0.30, minSwitchInterval: '30s' })

              // Shadow-mirror sampled real traffic to currently-excluded
              // upstreams in the background so they accumulate fresh tracker
              // samples without touching real user traffic. Per-upstream
              // opt-out via `routing.probe: off`.
              .probeExcluded({ sampleRate: 0.1, minSamples: 10, minSamplesWindow: '60s', maxConcurrent: 4, timeout: '10s' })
```

**TypeScript — `erpc.ts`:**

```typescript
projects: [{
  id: "main",
  networks: [{
    architecture: "evm",
    evm: { chainId: 1 },
    selectionPolicy: {
      evalFunc: (upstreams, ctx) => upstreams
        .removeCordoned()

        // Errors / throttle: drop upstreams that are clearly broken, gated
        // on samplesAbove(10) so a single failed call can't evict a
        // fresh-pod upstream. Thresholds are loose — failsafe (retry/hedge/
        // consensus) already absorbs occasional failures.
        .excludeIf(all(samplesAbove(10), errorRateAbove(0.7)))
        .excludeIf(all(samplesAbove(10), throttleRateAbove(0.4)))

        // Latency: two-stage check — absolute floor (3s p70) AND consistently
        // slower than peers (3× majority), with a catastrophic >10s safety net.
        // samplesAbove(20) gates cold-start pods. (Full rationale: see the
        // shipped default_policy.js.)
        .excludeIf(any(all(samplesAbove(20), latencyAbove(3000), latencyDeviationAbove(3, { mode: "majority" })), latencyAbove(10_000)))

        // Block-head lag: drop if behind tip by ≥16 blocks or ≥30s.
        .excludeIf(any(blockNumberLagAbove(16), blockSecondsLagAbove(30)))

        // Outage safety net: fall back to the raw set rather than failing closed.
        .whenEmpty(() => upstreams)

        // Tier split: prefer non-fallback; use tier:fallback when no primary survives.
        .preferTag("!tier:fallback", { minHealthy: 1, fallback: "tier:fallback" })

        // Rank survivors by p70 latency.
        .sortByScore(PREFER_FASTEST)

        // Hold the primary stable unless a meaningfully better option exists ≥30s.
        .stickyPrimary({ hysteresis: 0.30, minSwitchInterval: "30s" })

        // Background-probe excluded upstreams with sampled real traffic.
        .probeExcluded({ sampleRate: 0.1, minSamples: 10, minSamplesWindow: "60s", maxConcurrent: 4, timeout: "10s" }),
    },
  }],
}]
```

## Agent reference

Copy one of these prompts into your AI agent session (Claude Code, Cursor, …) — each one points the agent at this page's machine-readable reference so it can do the work correctly:

**Prompt Example #1: configure selection policy from scratch**

```text
I need to set up a smart upstream selection policy for my eRPC config so that
slow or erroring providers are demoted automatically and my fastest healthy
upstreams always serve traffic. Work with my existing eRPC config.

Read the full reference first:
https://docs.erpc.cloud/config/projects/selection-policies.llms.txt
```

**Prompt Example #2: tune my exclusion thresholds without outages**

```text
Audit the selectionPolicy in my eRPC config and tell me if my errorRateAbove /
throttleRateAbove / latencyAbove thresholds are too aggressive (would exclude
upstreams spuriously) or too loose (would keep broken upstreams in rotation too
long). Also show me how to use shadowExcludeIf to validate a tighter rule
safely before promoting it.

Reference: https://docs.erpc.cloud/config/projects/selection-policies.llms.txt
```

**Prompt Example #3: debug why an upstream is stuck excluded**

```text
One of my upstreams is stuck excluded and never re-admits even after the outage
clears. Help me diagnose using erpc_selection_position,
erpc_selection_exclusion_total, and erpc_selection_probe_requests_total, and
fix the policy or routing.probe setting so re-admission is metric-driven.
Work with my existing eRPC config.

Reference: https://docs.erpc.cloud/config/projects/selection-policies.llms.txt
```

**Prompt Example #4: add a cost-tier fallback upstream**

```text
I want to add an expensive archival node to my eRPC setup that only receives
traffic when all my free-tier upstreams are unhealthy. Wire it up with
scoreMultipliers overall demotion and a preferTag tier:fallback split so it
stays on standby. Work with my existing eRPC config.

Reference: https://docs.erpc.cloud/config/projects/selection-policies.llms.txt
```

**Prompt Example #5: switch to per-method ranking for eth_getLogs**

```text
My eRPC setup uses the default evalScope: network which means all methods share
one upstream ranking. I want eth_getLogs to rank upstreams by freshness
(PREFER_FRESHEST) while reads use speed (PREFER_FASTEST). Walk me through
switching evalScope and any scoreMultipliers.method gotchas I need to know.
Work with my existing eRPC config.

Reference: https://docs.erpc.cloud/config/projects/selection-policies.llms.txt
```

---

### Selection & scoring — full agent reference

### How it works

The entry point is `policy.Engine`, which owns a map of `*Slot` values keyed by `(network, method, finality)`. Each slot has its own ticker goroutine; the request path reads `slot.cache` via `atomic.Pointer.Load()` — completely wait-free. On every tick the engine takes a metric snapshot from the health tracker, builds an `EvalContext`, acquires a pooled Sobek JS runtime, calls the compiled eval function, materializes the ordered result, and atomically swaps the cache. The whole cycle completes in under a millisecond for typical policy chains.

When `Engine.RegisterNetwork` is called at network bootstrap it:

1. Upgrades the placeholder `evalFunc` to `default_policy.js` (embedded at compile time).
2. Creates the wildcard `("*","*")` slot and runs a **synchronous** initial tick — the first request always sees a populated cache.
3. Starts the slot's ticker goroutine.
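The tick-and-swap cycle described above can be sketched in a few lines. This is a simplified, single-threaded TypeScript illustration and not eRPC's Go implementation: `Slot`, `tick`, `order`, and `samplePolicy` are hypothetical names, and a plain reference assignment stands in for the `atomic.Pointer` swap.

```typescript
// Hypothetical shapes for illustration only; not eRPC's real types.
type Upstream = { id: string; errorRate: number };
type EvalFn = (upstreams: Upstream[]) => Upstream[];

class Slot {
  // The request path only ever reads this reference.
  private cache: readonly string[] = [];
  private readonly evalFn: EvalFn;
  private readonly snapshot: () => Upstream[];

  constructor(evalFn: EvalFn, snapshot: () => Upstream[]) {
    this.evalFn = evalFn;
    this.snapshot = snapshot;
    this.tick(); // synchronous initial tick: the first read sees a populated cache
  }

  // One evaluation cycle: snapshot metrics, run the policy, swap the cache.
  tick(): boolean {
    try {
      const ordered = this.evalFn(this.snapshot()).map((u) => u.id);
      this.cache = ordered; // single reference swap
      return true;
    } catch {
      return false; // eval failed: retain the previous cache
    }
  }

  // Request path: no evaluation, just a read of the current ordering.
  order(): readonly string[] {
    return this.cache;
  }
}

// Sample policy (an assumption, far simpler than the shipped default):
// drop upstreams above 70% errors, never return an empty list,
// then put the lowest error rate first.
const samplePolicy: EvalFn = (upstreams) => {
  const healthy = upstreams.filter((u) => u.errorRate <= 0.7);
  const pool = healthy.length > 0 ? healthy : upstreams;
  return [...pool].sort((a, b) => a.errorRate - b.errorRate);
};

let metrics: Upstream[] = [
  { id: "a", errorRate: 0.9 },
  { id: "b", errorRate: 0.1 },
  { id: "c", errorRate: 0.3 },
];
const slot = new Slot(samplePolicy, () => metrics);
console.log(slot.order()); // [ 'b', 'c' ]
```

The point of the design is that `order()` never waits on an evaluation: changes to the metrics only become visible after the next `tick()`, and a failed tick leaves the last good ordering in place.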
The default `evalScope: "network"` creates one slot per network. Setting `evalScope: "network-method"` lazy-creates additional slots on first request for each distinct method, enabling per-method primary divergence. `evalScope: "network-finality"` splits on finality bucket; `"network-method-finality"` is the most granular. Narrow slots fall back to the wildcard slot cache during cold-start until their first tick completes.

**Score inputs (health tracker).** Every metric available to `evalFunc` comes from the health tracker's rolling window. The tracker maintains 10 atomic sub-buckets per `(upstream, method, finality)` tuple; one bucket rotates out every `windowSize / 10`, so data drips out with no cliff. A DDSketch per bucket (1% relative accuracy) produces p50/p70/p90/p95/p99 latency quantiles.

`errorRate` counts only genuine upstream-quality failures: connection refused, 5xx, genuine timeouts. EVM reverts, rate-limit 429s, client disconnects, and hedge cancellations are deliberately excluded so a misbehaving caller or aggressive hedger cannot inflate a healthy upstream's error rate. `throttledRate` counts remote 429 responses specifically — vendor quota exhaustion. `misbehaviorRate` tracks upstreams returning semantically wrong data.

Latency quantiles are populated only from _successful_ responses — a fast-failing upstream that rejects connections in 1ms has empty quantile sketches and reports `p70 = 0` (unknown quality, not ultra-fast). A well-written policy should guard `latencyAbove` with `samplesAbove(N)`.

`blockHeadLag` and `finalizationLag` are atomic point-in-time state, not rolling counters. They persist across window rotations and stay non-zero until the upstream catches up to the network tip. `blockHeadLagSeconds` and `finalizationLagSeconds` multiply the block count by the tracker's EMA-estimated block time (alpha 0.1, minimum 3 samples before publishing). Until the EMA is warm, both seconds fields are `0` — pair `blockSecondsLagAbove` with `blockNumberLagAbove` for cold-start coverage.

**The default policy chain.** Omitting `evalFunc` applies the production-hardened default embedded at compile time:

```js
(upstreams, ctx) => upstreams
  .removeCordoned()
  .excludeIf(all(samplesAbove(10), errorRateAbove(0.7)))
  .excludeIf(all(samplesAbove(10), throttleRateAbove(0.4)))
  .excludeIf(any(
    all(samplesAbove(20), latencyAbove(3000), latencyDeviationAbove(3, { mode: 'majority' })),
    latencyAbove(10_000)
  ))
  .excludeIf(any(blockNumberLagAbove(16), blockSecondsLagAbove(30)))
  .whenEmpty(() => upstreams)
  .preferTag('!tier:fallback', { minHealthy: 1, fallback: 'tier:fallback' })
  .sortByScore(PREFER_FASTEST)
  .stickyPrimary({ hysteresis: 0.30, minSwitchInterval: '30s' })
  .probeExcluded({ sampleRate: 0.1, minSamples: 10, minSamplesWindow: '60s', maxConcurrent: 4, timeout: '10s' })
```

Each `excludeIf` step records per-leaf attribution into `erpc_selection_exclusion_total{reason}`. The `whenEmpty` safety net prevents fail-closed behavior during a widespread outage. `stickyPrimary` prevents flapping: the challenger must score more than 30% above the incumbent AND the last switch must be at least 30 seconds ago. `probeExcluded` shadow-mirrors 10% of real traffic to excluded upstreams so re-admission happens the moment their metrics recover.

**Scoring formula and presets.** `sortByScore` ranks surviving upstreams by:

```
score(u) = overall / (1 + errorRate×w₁ + respLatency×w₂ + throttledRate×w₃
                        + blockHeadLag×w₄ + finalizationLag×w₅ + misbehaviors×w₆)
```

A clean upstream with no penalties scores `1.0` (with `overall` left at 1); the weighted penalties add to the divisor, so a total penalty of 1 halves the score.
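As a worked illustration of the formula above, here is a minimal TypeScript sketch. The weights are the `PREFER_FASTEST` row of the preset table; the `Penalties` type, the `score` function name, and the sample metric values are hypothetical, and the source does not say how each metric is normalized before weighting, so the inputs below are assumed to be already on comparable scales.

```typescript
// Hypothetical metric shape; units and normalization are assumptions.
type Penalties = {
  errorRate: number;
  respLatency: number;
  throttledRate: number;
  blockHeadLag: number;
  finalizationLag: number;
  misbehaviors: number;
};

// PREFER_FASTEST weights as listed in the preset table.
const PREFER_FASTEST_WEIGHTS: Penalties = {
  errorRate: 4,
  respLatency: 15,
  throttledRate: 4,
  blockHeadLag: 1,
  finalizationLag: 0,
  misbehaviors: 2,
};

// score(u) = overall / (1 + sum of metric × weight)
function score(metrics: Penalties, weights: Penalties, overall = 1): number {
  const keys = Object.keys(weights) as (keyof Penalties)[];
  const penalty = keys.reduce((sum, k) => sum + metrics[k] * weights[k], 0);
  return overall / (1 + penalty);
}

const clean: Penalties = {
  errorRate: 0, respLatency: 0, throttledRate: 0,
  blockHeadLag: 0, finalizationLag: 0, misbehaviors: 0,
};
// Assumed sample: 10% errors and a latency term of 0.2.
const slow: Penalties = { ...clean, errorRate: 0.1, respLatency: 0.2 };

console.log(score(clean, PREFER_FASTEST_WEIGHTS)); // 1
console.log(score(slow, PREFER_FASTEST_WEIGHTS)); // about 0.227 = 1 / (1 + 0.4 + 3)
console.log(score(slow, PREFER_FASTEST_WEIGHTS, 0.2)); // about 0.045: overall scales the result
```

Because `overall` sits in the numerator, it rescales an upstream's score without changing how the individual penalties trade off against each other.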
Three built-in weight presets:

| Preset | errorRate | respLatency | throttledRate | blockHeadLag | finalizationLag | misbehaviors |
|---|---|---|---|---|---|---|
| `PREFER_FASTEST` | 4 | 15 | 4 | 1 | 0 | 2 |
| `PREFER_FRESHEST` | 4 | 2 | 2 | 15 | 8 | 3 |
| `PREFER_LEAST_ERRORS` | 15 | 2 | 6 | 2 | 1 | 12 |

Per-upstream `routing.scoreMultipliers` on the upstream config merge with the chosen preset. The default merge mode (`"merge"`) lets individual multiplier keys override the base weight; `overall` multiplies the final score and is the simplest way to strongly prefer or demote a specific upstream without touching weight shapes.

**Circuit-breaker interplay.** The selection policy and the [circuit breaker](/config/failsafe/circuit-breaker.llms.txt) are complementary layers that act at different points and timescales. The policy runs per eval tick and excludes upstreams from the ranked list via `excludeIf` predicates over rolling-window metrics. The circuit breaker is a per-upstream state machine *inside the upstream executor*: it is checked only after the sweep has already dispatched to that upstream (selection runs first). When the breaker is open, `TryAcquirePermit()` fails the attempt immediately with `ErrFailsafeCircuitBreakerOpen` — no transport call is made — and the sweep advances to the next upstream in the policy's ordered list. So an upstream can sit at position 0 in the policy list with an open breaker: each request burns one cheap fast-fail attempt there before moving on, until the policy's own metrics demote it on a later tick. Neither mechanism reads the other's state (breaker-open does not cordon or exclude); they only interact through outcomes. If the split bothers you, align breaker thresholds with the policy's `errorRateAbove` so both layers agree on what "broken" means. Source: [`upstream/upstream_executor.go:L345-L366`](https://github.com/erpc/erpc/blob/main/upstream/upstream_executor.go#L345-L366).
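The interplay described above can be pictured with a small simulation. This is a hedged TypeScript sketch, not eRPC's executor: `sweep`, `Attempt`, the `breakerOpen` set, and the synchronous `send` callback are all hypothetical stand-ins for the real sweep, `TryAcquirePermit()`, and the transport call.

```typescript
type Outcome = "breaker_open" | "ok" | "error";
type Attempt = { upstream: string; outcome: Outcome };

// Walk the policy's ordered list. An open breaker fast-fails the attempt
// without any transport call, and the sweep moves on to the next upstream.
function sweep(
  ordered: readonly string[],
  breakerOpen: ReadonlySet<string>,
  send: (upstream: string) => boolean,
): Attempt[] {
  const attempts: Attempt[] = [];
  for (const upstream of ordered) {
    if (breakerOpen.has(upstream)) {
      attempts.push({ upstream, outcome: "breaker_open" }); // cheap fast-fail
      continue;
    }
    const ok = send(upstream);
    attempts.push({ upstream, outcome: ok ? "ok" : "error" });
    if (ok) break; // first success ends the sweep
  }
  return attempts;
}

// "a" is still at position 0 in the policy list, but its breaker is open.
const sent: string[] = [];
const attempts = sweep(["a", "b", "c"], new Set(["a"]), (id) => {
  sent.push(id);
  return true;
});
console.log(attempts); // "a" fast-fails, "b" serves the request
console.log(sent); // [ 'b' ]
```

The sketch shows why the two layers only meet through outcomes: the ordered list is never edited by the breaker, so "a" keeps costing one fast-fail per request until a later policy tick demotes it.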
**`shadowExcludeIf` — dry-run exclusion rules.** `shadowExcludeIf(predicate)` is the non-destructive counterpart of `excludeIf`. The predicate runs every tick; would-have-been-excluded upstreams stay in the ordered list but each trip increments `erpc_selection_shadow_exclusion_total{upstream, reason}` with the same leaf-slug attribution. Use it to measure a new rule's blast radius before promoting it to `excludeIf`.

**`includeIf` — conditional re-admission (pool-level quantifiers).** `includeIf(target, condition)` is the dual of `excludeIf`: it ADDS upstreams from the full universe back into the chain when an aggregate condition over the *surviving pool* holds — it never removes anyone.

`target` comes first ("include *these* if *condition*"): a tag pattern (`'tier:reserve'`, `!` negation supported) or a selector object `{id, tag, vendor, type, position}` whose facets AND together; an object with no concrete facet (`{}`, `{position:'head'}`) is a deliberate no-op so it can never accidentally admit the entire universe. Admitted upstreams land at the tail by default (`position: 'head'` available in object form), are never duplicated by id, and are ranked by the subsequent `sortByScore` like everyone else.

`condition` is a boolean or `(pool, ctx) => boolean` receiving the current chain — which is what enables pool-level ("network-level") questions via native array methods over the per-upstream predicate factories: `p.every(blockSecondsLagAbove(30))`, `p.length < 2`. Mind that native `every` is `true` on an empty array — guard with `p.length > 0 &&` when the empty pool should be handled by a separate explicit `p.length < 1` line. Malformed arguments and throwing conditions degrade to a no-op rather than sinking the eval.
Source: [`internal/policy/stdlib/stdlib.js:L1023-L1116`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L1023-L1116); tests: [`internal/policy/stdlib/include_if_test.go`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/include_if_test.go).

**Additional stdlib chain methods.** Beyond `excludeIf` / `includeIf`, the JS stdlib installs the following chainable `Array.prototype` methods (source: [`internal/policy/stdlib/stdlib.js`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js)):

- **Identity/label filters**: `byId(id)`, `excludeId(id)`, `byTag(pat)`, `excludeTag(pat)`, `byVendor(name)`, `excludeVendor(name)`, `byType(t)`, `where(fn)`, `whereNot(fn)`.
- **Health filters** (higher-level wrappers): `removeCordoned()`, `removeByErrorRate(threshold)`, `removeByThrottling(threshold)`, `removeByMisbehavior(threshold)`, `removeByLag(blocks)`, `removeByLatency(ms)`, `keepHealthy()`, `removeByMinRequests(n)`.
- **Tier selection**: `preferTag(pat, opts?)`, `preferVendor(name, opts?)`.
- **Diversity**: `spreadAcrossTags(prefix)` — interleaves the sorted list so adjacent positions come from different tag-partitions (e.g. `"region:"`).
- **Safety net**: `whenEmpty(fn)` — falls back to `fn()` if the array is empty.
- **Conditional/finality branching**: `when(mask, fn)`, `byFinality(handlers)`, `if(cond, fn)`, `unless(cond, fn)`, `fallbackTo(fn)`, `ensureMin(n, fn)`.

**`hasMatchingTag` semantics** (used by `byTag`, `excludeTag`, `preferTag`, `includeIf` string targets — [`internal/policy/stdlib/stdlib.js:L267-293`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L267-L293)):

- Positive pattern (`"tier:main"`, `"region:us-*"`): matches if ANY tag on the upstream matches.
- Negated pattern (`"!tier:fallback"`): matches if NO tag matches the un-negated pattern.
- Array of patterns: positives are OR'd; negations are AND'd. Upstream matches iff (no positives OR at least one positive matches) AND (all negations hold).

**`preferTag` step semantics** ([`internal/policy/stdlib/stdlib.js:L854-865`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L854-L865)):

1. Filter to upstreams matching the primary pattern.
2. If ≥ `minHealthy` match → return that subset.
3. Else if `fallback` pattern is set → filter to the fallback pattern.
4. Else → return the input unchanged.

**Upstream JS object shape** (built by `buildJSUpstreams` at [`internal/policy/eval.go:L312`](https://github.com/erpc/erpc/blob/main/internal/policy/eval.go#L312)):

| Field | Type | Source |
|---|---|---|
| `id` | `string` | `u.Id()` |
| `vendor` | `string` | `u.VendorName()` |
| `type` | `string` | `cfg.Type` |
| `tags` | `string[]` | `cfg.Tags` |
| `hasTag(tag)` / `is(tag)` | `(tag: string) => boolean` | Shared VM singleton |
| `metrics` | `UpstreamMetrics` | Slot-local metrics snapshot |
| `metricsAcrossMethods` | `UpstreamMetrics` | Wildcard aggregate `("*", All)` |
| `metricsByMethod` | `{[method]: {requestsTotal, p50ms, p70ms, p90ms, p95ms, p99ms}}` | Per-method, all-finalities |
| `scoreMultipliers` | `{[key]: float64}` or absent | From `resolveScoreMultipliers` |
| `score` | `float64` (set by `sortByScore`) | `overall / (1 + penalty)`; higher = better |

**`UpstreamMetrics` object shape** (`u.metrics` and `u.metricsAcrossMethods` — [`internal/policy/eval.go:L283`](https://github.com/erpc/erpc/blob/main/internal/policy/eval.go#L283)):

| Field | Type | Notes |
|---|---|---|
| `errorRate` | `float64` | EMA rolling error rate |
| `errorsTotal` | `int64` | Cumulative error count |
| `requestsTotal` | `int64` | Cumulative request count (window-bounded) |
| `throttledRate` | `float64` | EMA rolling throttle (429) rate |
| `misbehaviorRate` | `float64` | EMA rolling misbehavior rate |
| `blockHeadLag` | `int64` | Blocks behind network tip |
| `finalizationLag` | `int64` | Finalization blocks behind tip |
| `blockHeadLagSeconds` | `float64` | `blockHeadLag × networkBlockTime`; `0` until EMA warms |
| `finalizationLagSeconds` | `float64` | `finalizationLag × networkBlockTime`; `0` until EMA warms |
| `p50ResponseSeconds` | `float64` | p50 latency in seconds |
| `p70ResponseSeconds` | `float64` | p70 latency in seconds |
| `p90ResponseSeconds` | `float64` | p90 latency in seconds |
| `p95ResponseSeconds` | `float64` | p95 latency in seconds |
| `p99ResponseSeconds` | `float64` | p99 latency in seconds |
| `cordonedReason` | `string` or absent | Set when upstream is cordoned |
| `latencyP(q)` | `(q: number) => ms` | Snaps to nearest pre-computed bucket; input 0–1 or 0–100 |

The block-time EMA used to compute `blockHeadLagSeconds`/`finalizationLagSeconds` has sanity bounds: values outside `[10ms, 120s]` are rejected and the internal EMA is reset to the last published value to prevent post-halt spikes.

**`EvalContext` (`ctx`) shape** ([`internal/policy/eval.go:L17`](https://github.com/erpc/erpc/blob/main/internal/policy/eval.go#L17)):

| Field | Notes |
|---|---|
| `network` | Network ID string |
| `method` | Slot method (`"*"` for wildcard slots) |
| `finality` | Slot finality (`"unknown"` default for wildcard; real value for per-finality slots) |
| `now` | Unix milliseconds |
| `previousOrder` | `string[]` upstream IDs from the last tick |
| `lastSwitchAt` | `int64` unix-ms or `null` |
| `tickCount` | `uint64` monotonic tick counter |

Global JS helpers available inside every `evalFunc`:

- `methodMatches(pat)` — tests `ctx.method` against a pattern
- `isFinalityRequest()` — `ctx.finality === 'finalized'`
- Preset globals: `PREFER_FASTEST`, `PREFER_FRESHEST`, `PREFER_LEAST_ERRORS`
- Finality bit-flag constants: `REALTIME` (1), `UNFINALIZED` (2), `FINALIZED` (4), `UNKNOWN` (8) — used with `.when(mask, fn)` inside eval; NOT for `scoreMultipliers.finality` config fields

**Exclusion attribution leaf slugs.** `excludeIf` records which leaf predicate actually tripped per upstream. Bounded-cardinality slugs emitted by `erpc_selection_exclusion_total{reason}`: `error_rate_above`, `error_rate_below`, `throttle_rate_above`, `throttle_rate_below`, `misbehavior_rate_above`, `latency_p_above`, `latency_p_deviation_above`, `block_head_lag_above`, `finalization_lag_above`, `block_head_lag_seconds_above`, `finalization_lag_seconds_above`, `samples_below`, `samples_above` (guard — not a failure slug), `not_`, `custom`. Note: `samplesAbove`/`samplesBelow` predicates are marked `isGuard=true` so they do not appear as primary exclusion reasons when they do NOT trip; they only appear when they are the leaf that fires.

**Log messages emitted by the selection engine:**

- `WARN "selection policy eval failed; retaining previous cache"` — tick error; fields: `network`, `method`, `tick_id`, `error`.
- `DEBUG "policy step"` — one line per stdlib step when step-logging is enabled; fields: `network`, `method`, `tick_id`, `idx`, `step`, `in`, `out`, `dropped`, `added`, `reordered`, `args`. Disabled by default in production (zero overhead: no allocation when disabled).
- `DEBUG "policy excluded upstream"` — one per excluded upstream when step-logging is enabled; fields: `network`, `method`, `tick_id`, `upstream`, `step`, `reason`, `leaf_reasons`.

**Legacy upstream tag keys.** Upstream-level `group: ` and `cohort: ` YAML keys are rewritten as `tier:` and `cohort:` tags at config-load time by `mergeLegacyLabelKeysIntoTags`. They are not valid in new configs.
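The `hasMatchingTag` rules and the four `preferTag` steps described earlier in this reference can be sketched together. This is an illustrative TypeScript reimplementation under stated assumptions, not the code in `stdlib.js`: `TaggedUpstream` and `globMatches` are hypothetical, only `*` is treated as a wildcard, and the `minHealthy` default of 1 is assumed because the source does not state one.

```typescript
type TaggedUpstream = { id: string; tags: string[] };

// Assumption: only `*` is a wildcard; every other character is literal.
function globMatches(pattern: string, value: string): boolean {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`).test(value);
}

// Positives are OR'd, negations are AND'd:
// (no positives OR at least one positive matches) AND (all negations hold).
function hasMatchingTag(u: TaggedUpstream, patterns: string | string[]): boolean {
  const list = Array.isArray(patterns) ? patterns : [patterns];
  const positives = list.filter((p) => !p.startsWith("!"));
  const negatives = list.filter((p) => p.startsWith("!")).map((p) => p.slice(1));
  const anyTag = (p: string) => u.tags.some((t) => globMatches(p, t));
  const positiveOk = positives.length === 0 || positives.some(anyTag);
  const negativeOk = negatives.every((p) => !anyTag(p));
  return positiveOk && negativeOk;
}

function preferTag(
  upstreams: TaggedUpstream[],
  pattern: string,
  opts: { minHealthy?: number; fallback?: string } = {},
): TaggedUpstream[] {
  const minHealthy = opts.minHealthy ?? 1; // assumed default
  const primary = upstreams.filter((u) => hasMatchingTag(u, pattern)); // step 1
  if (primary.length >= minHealthy) return primary; // step 2
  const fallback = opts.fallback;
  if (fallback !== undefined) {
    return upstreams.filter((u) => hasMatchingTag(u, fallback)); // step 3
  }
  return upstreams; // step 4
}

const p1: TaggedUpstream = { id: "p1", tags: ["tier:main", "region:us-east"] };
const f1: TaggedUpstream = { id: "f1", tags: ["tier:fallback"] };
const tierOpts = { minHealthy: 1, fallback: "tier:fallback" };

console.log(preferTag([p1, f1], "!tier:fallback", tierOpts).map((u) => u.id)); // [ 'p1' ]
console.log(preferTag([f1], "!tier:fallback", tierOpts).map((u) => u.id)); // [ 'f1' ]
```

The second call shows the standby behaviour of the default policy's tier split: the fallback upstream only surfaces once no non-fallback upstream is left in the pool.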
### Config schema

#### `selectionPolicy` (`networks[].selectionPolicy`)

Full YAML path: `projects[].networks[].selectionPolicy`

Config struct: `SelectionPolicyConfig` at [`common/config.go:L2352-2430`](https://github.com/erpc/erpc/blob/main/common/config.go#L2352-L2430)

Defaults: [`common/defaults.go:L2512-2609`](https://github.com/erpc/erpc/blob/main/common/defaults.go#L2512-L2609)

| Field | Type | Default | Footguns |
|---|---|---|---|
| `evalInterval` | `Duration` | `15s` ([`common/defaults.go:L2532`](https://github.com/erpc/erpc/blob/main/common/defaults.go#L2532)) | `0` disables the ticker (frozen cache); tests only. Should be ≥ 3× `scoreMetricsWindowSize / 10` for stable quantile reads. |
| `evalTimeout` | `Duration` | `100ms` ([`common/defaults.go:L2534-2535`](https://github.com/erpc/erpc/blob/main/common/defaults.go#L2534-2535)) | Exceeded → tick error logged, previous cache retained unchanged. Must be `< evalInterval`. |
| `evalScope` | `EvalScope` enum | `"network"` ([`common/defaults.go:L2556-2583`](https://github.com/erpc/erpc/blob/main/common/defaults.go#L2556-2583)) | `"network"` \| `"network-method"` \| `"network-finality"` \| `"network-method-finality"`. Narrower scopes multiply goroutines and JS evals. `scoreMultipliers.method`-specific entries **never match** in `"network"` scope because `ctx.method == "*"`. |
| `evalPerMethod` | `*bool` | `nil` — **deprecated** | Pointer-bool alias for the method axis. `SetDefaults` nils it unconditionally and derives `evalScope` instead. Excluded from TS surface (`tstype:"-"`). |
| `evalPerFinality` | `*bool` | `nil` — **deprecated** | Same mechanics as `evalPerMethod` for the finality axis. Niled by `SetDefaults`. |
| `evalFunc` | `string` (JS source) | `DefaultSelectionPolicySource` placeholder → upgraded to `default_policy.js` at engine register | Full JS function `(upstreams, ctx) => Upstream[]`. In TS configs, a real arrow function is stringified via `Function.prototype.toString()` at load time. |

**EvalScope slot cardinality:**

| `evalScope` | Slots per network | Goroutines | Notes |
|---|---|---|---|
| `"network"` (default) | 1 | 1 | `ctx.method` always `"*"`; method-specific `scoreMultipliers` never match. |
| `"network-method"` | 1 per observed method, lazy-created | N methods | Slots evicted after 1h idle. `scoreMultipliers.method` works correctly here. |
| `"network-finality"` | Up to 4 (one per finality bucket) | Up to 4 | `ctx.method` still `"*"`; method-specific multipliers still never match. |
| `"network-method-finality"` | Up to `methods × 4` | N × 4 | Highest granularity. Both `scoreMultipliers.method` and `.finality` match. |

**Legacy fields (accepted, never survive to runtime):**

| YAML key | Behavior |
|---|---|
| `evalFunction` | Old `(upstreams, method) => Upstream[]` shape. Wrapped into the new signature if no canonical `evalFunc` is also set. Emits `WarnLegacySelectionPolicy()`. |
| `resampleExcluded` | When `true` AND `resampleInterval > 0`, appends `.probeExcluded({ sampleRate: 1.0, maxConcurrent: 1, timeout: '10s' })` to the synthesized eval. |
| `resampleInterval` | Legacy polling cadence. The actual duration value is **ignored**; only its presence gates `probeExcluded` synthesis. |
| `resampleCount` | No behavioral mapping. Checked in `hasSemanticLegacy` but never used in synthesized output. |

#### `upstream.routing` (`upstreams[].routing`)

| Field | YAML key | Type | Default | Notes |
|---|---|---|---|---|
| `ScoreMultipliers` | `scoreMultipliers` | `[]*ScoreMultiplierConfig` | `nil` | First matching entry wins. |
| `ScoreLatencyQuantile` | `scoreLatencyQuantile` | `float64` | `0` (inherits policy default p70) | Declared; not yet wired into the scoring engine. |
| `Probe` | `probe` | `ProbeMode` | `"on"` ([`common/config.go:L773`](https://github.com/erpc/erpc/blob/main/common/config.go#L773)) | `"off"` = never mirror probe traffic; upstream stays excluded until metrics naturally recover or operator uncordons. |

#### `ScoreMultiplierConfig` fields

| Field | YAML key | Type | Notes |
|---|---|---|---|
| `Network` | `network` | `string` | Glob; empty = any. |
| `Method` | `method` | `string` | Glob; empty = any. **Specific methods never match in `evalScope: "network"` wildcard slots.** |
| `Finality` | `finality` | `[]DataFinalityState` | Integer enum: `"finalized"` (0), `"unfinalized"` (1), `"realtime"` (2), `"unknown"` (3). **Not the same as JS bit-flag constants** `FINALIZED=4`, `REALTIME=1`, `UNFINALIZED=2` used in `evalFunc` bodies. |
| `Overall` | `overall` | `*float64` | Multiplies final score. `>1` = prefer, `<1` = demote. `nil` = unset (inherit base). |
| `ErrorRate` | `errorRate` | `*float64` | Weight override for error-rate term. `0.0` = removes contribution. |
| `RespLatency` | `respLatency` | `*float64` | Weight override for latency term. |
| `ThrottledRate` | `throttledRate` | `*float64` | Weight override for throttle-rate term. |
| `BlockHeadLag` | `blockHeadLag` | `*float64` | Weight override for block-head-lag term. |
| `FinalizationLag` | `finalizationLag` | `*float64` | Weight override for finalization-lag term. |
| `Misbehaviors` | `misbehaviors` | `*float64` | Weight override for misbehavior-rate term. |
| `TotalRequests` | `totalRequests` | `*float64` | **Accepted silently; has zero runtime effect.** Never read by any scoring code. Kept for backward YAML compatibility. |

Merge modes (set via `sortByScore({ multipliers: 'merge' | 'override' | 'off' })`):

- `"merge"` (default): per-upstream keys override matching base-weight keys; `overall` lifts final score multiplicatively.
- `"override"`: configured upstreams rank by their weights only; unconfigured fall back to base.
- `"off"`: ignore `u.scoreMultipliers` entirely.
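One way to picture the three merge modes above is as a weight-resolution step that runs before scoring. The TypeScript below is a sketch under explicit assumptions, not the engine's code: `resolveWeights`, `Weights`, and `Multipliers` are hypothetical names; in `"override"` mode unset keys are assumed to weigh 0 and `overall` is assumed to still apply, neither of which the source confirms.

```typescript
type WeightKey =
  | "errorRate" | "respLatency" | "throttledRate"
  | "blockHeadLag" | "finalizationLag" | "misbehaviors";
type Weights = Record<WeightKey, number>;
type Multipliers = Partial<Weights> & { overall?: number };
type MergeMode = "merge" | "override" | "off";

const ZERO_WEIGHTS: Weights = {
  errorRate: 0, respLatency: 0, throttledRate: 0,
  blockHeadLag: 0, finalizationLag: 0, misbehaviors: 0,
};

// Base weights: the PREFER_FASTEST row of the preset table.
const BASE: Weights = {
  errorRate: 4, respLatency: 15, throttledRate: 4,
  blockHeadLag: 1, finalizationLag: 0, misbehaviors: 2,
};

function resolveWeights(
  base: Weights,
  own: Multipliers | undefined,
  mode: MergeMode,
): { weights: Weights; overall: number } {
  // "off" ignores per-upstream config; unconfigured upstreams always use base.
  if (mode === "off" || own === undefined) return { weights: base, overall: 1 };
  const { overall = 1, ...keys } = own;
  // ASSUMPTION: in "override" mode, keys the upstream did not set weigh 0.
  const start = mode === "merge" ? base : ZERO_WEIGHTS;
  return { weights: { ...start, ...keys }, overall };
}

const own: Multipliers = { respLatency: 10, overall: 0.2 };
console.log(resolveWeights(BASE, own, "merge")); // respLatency 10, other weights from BASE, overall 0.2
console.log(resolveWeights(BASE, own, "override")); // respLatency 10, other weights 0 (assumed)
console.log(resolveWeights(BASE, own, "off")); // BASE untouched, overall 1
```

Under these assumptions, `"merge"` is the mode that keeps an upstream comparable with its peers, since only the keys it names diverge from the shared preset.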
#### `probeExcluded` options (JS)

Defaults from [`internal/policy/eval.go:L793-799`](https://github.com/erpc/erpc/blob/main/internal/policy/eval.go#L793-L799) and [`internal/policy/stdlib/stdlib.js:L999-1009`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L999-L1009).

| Option | Default | Notes |
|---|---|---|
| `sampleRate` | `0.1` | 0.0–1.0 probability per `(request, excluded-upstream)`. Bypassed when `count < minSamples`. |
| `minSamples` | `10` | Per-upstream floor within `minSamplesWindow`. Below this, every request is a probe candidate. Should be ≥ `samplesAbove(N)` in the chain. |
| `minSamplesWindow` | `"60s"` | Rolling window for `minSamples`. |
| `maxConcurrent` | `4` | In-flight probe cap per excluded upstream. |
| `timeout` | `"10s"` | Per-probe deadline. Overrun records a failure in the tracker. |

#### `stickyPrimary` options (JS)

Defaults from [`internal/policy/stdlib/stdlib.js:L725-727`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L725-L727).

| Option | Default | Notes |
|---|---|---|
| `hysteresis` | `0.10` (default policy uses `0.30`) | Challenger must exceed incumbent score by this fraction to trigger switch. |
| `minSwitchInterval` | `"30s"` | Cooldown after a switch. No switching during cooldown regardless of score gap. |
| `scope` | `"network"` | `"network"` \| `"network-method"` \| `"network-finality"` \| `"network-method-finality"`. Coarser scopes share the sticky register across slots. |

#### `latencyDeviationAbove` options (JS)

Defaults from [`internal/policy/stdlib/stdlib.js:L1343-1358`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L1343-L1358).

| Option | Default | Notes |
|---|---|---|
| `quantile` | `70` | p50/p70/p90/p95/p99. |
| `mode` | `"geomean"` | `"geomean"` \| `"majority"` \| `"veto"`. |
| `minMethodSamples` | `50` | Per-method sample floor; methods below this are skipped. |
| `dampingMs` | `30` | Exponential damping scale. Ratio is multiplied by `1 - exp(-myLatency / dampingMs)`. Set to `0` to disable. |

### Worked examples

All patterns below are distilled from real production fleets; comments explain the non-obvious choices.

**1. Production default: the full default policy chain with scoreMetricsWindowSize.** In production every project sets `scoreMetricsWindowSize` explicitly — this controls how fast exclusion decisions react to upstream changes. Shorter windows react faster but are noisier; `4m` is the production-proven sweet spot. The default `evalFunc` is omitted here (eRPC loads `default_policy.js` automatically), so only the project-level window and `upstreamDefaults.routing` matter:

**Config path:** `projects[]`

**YAML — `erpc.yaml`:**

```yaml
projects:
  - id: main
    # 4-minute rolling window for health tracker metrics — fast enough to react
    # to a flapping upstream without being too jittery on cold-start pods.
    scoreMetricsWindowSize: 4m
    scoreMetricsMode: compact
    networkDefaults:
      # selectionPolicy omitted → eRPC loads the embedded default_policy.js
      failsafe:
        - matchMethod: "*"
          hedge:
            quantile: 0.95
            maxCount: 1
            minDelay: 500ms
            maxDelay: 10s
    upstreamDefaults:
      routing:
        scoreMultipliers:
          # Realtime/unfinalized: heavily weight latency AND block-head lag —
          # a lagging upstream on tip-of-chain traffic is worse than a slow one.
          - finality: [realtime, unfinalized]
            respLatency: 10
            errorRate: 2
            misbehaviors: 5
            blockHeadLag: 15
            finalizationLag: 2
          # Finalized/unknown: latency matters more than lag for archive queries.
          - finality: [unknown, finalized]
            respLatency: 15
            errorRate: 2
            misbehaviors: 10
            blockHeadLag: 10
            finalizationLag: 2
```

**TypeScript — `erpc.ts`:**

```typescript
projects: [{
  id: "main",
  // 4-minute rolling window for health tracker metrics.
  scoreMetricsWindowSize: "4m",
  scoreMetricsMode: "compact",
  upstreamDefaults: {
    routing: {
      scoreMultipliers: [
        {
          // Realtime/unfinalized: heavily weight latency AND block-head lag.
          finality: [DataFinalityStateRealtime, DataFinalityStateUnfinalized],
          respLatency: 10,
          errorRate: 2,
          misbehaviors: 5,
          blockHeadLag: 15,
          finalizationLag: 2,
        },
        {
          // Finalized/unknown: latency matters most for archive queries.
          finality: [DataFinalityStateUnknown, DataFinalityStateFinalized],
          respLatency: 15,
          errorRate: 2,
          misbehaviors: 10,
          blockHeadLag: 10,
          finalizationLag: 2,
        },
      ],
    },
  },
}]
```

**2. Vendor demotion with `overall` multiplier (real provider pattern).** Some vendors are less reliable or more expensive — instead of hard-excluding them, apply a fractional `overall` multiplier so they serve as implicit fallbacks. Production uses `0.2` for Alchemy and `0.3` for QuickNode across all networks:

**Config path:** `upstreams[]`

**YAML — `erpc.yaml`:**

```yaml
upstreams:
  - id: my-alchemy-node
    endpoint: https://eth-mainnet.g.alchemy.com/v2/${ALCHEMY_API_KEY}
    evm: { chainId: 1 }
    routing:
      scoreMultipliers:
        # overall: 0.2 pushes this upstream to position 3–4 behind free-tier
        # nodes. It still serves traffic when primaries are excluded; a hard
        # exclude would waste the paid capacity entirely.
        - overall: 0.2
  - id: my-quicknode-node
    endpoint: https://example.quiknode.pro/${QUICKNODE_API_KEY}/
    evm: { chainId: 1 }
    routing:
      scoreMultipliers:
        # 0.3 keeps QuickNode as a mid-tier — ahead of Alchemy (0.2) but
        # behind free-tier primaries (no multiplier = score 1.0 baseline).
        - overall: 0.3
```

**TypeScript — `erpc.ts`:**

```typescript
upstreams: [
  {
    id: "my-alchemy-node",
    endpoint: "https://eth-mainnet.g.alchemy.com/v2/${ALCHEMY_API_KEY}",
    evm: { chainId: 1 },
    routing: {
      scoreMultipliers: [
        // 0.2 puts Alchemy at the back of the pack, preserving capacity
        // without removing it from rotation when primaries are down.
        { overall: 0.2 },
      ],
    },
  },
  {
    id: "my-quicknode-node",
    endpoint: "https://example.quiknode.pro/${QUICKNODE_API_KEY}/",
    evm: { chainId: 1 },
    routing: {
      scoreMultipliers: [{ overall: 0.3 }],
    },
  },
]
```

**3. Latency-sensitive API with preferred upstream and tier:fallback.** A latency-sensitive use case (MEV, trading) uses one high-quality primary with an `overall: 1.5` multiplier to keep it at position 0, and a paid archival node tagged `tier:fallback` that only activates when no primary survives:

**Config path:** `projects[].networks[].selectionPolicy`

**YAML — `erpc.yaml`:**

```yaml
selectionPolicy:
  evalScope: network
  evalFunc: |
    (upstreams, ctx) => upstreams
      .removeCordoned()
      .excludeIf(all(samplesAbove(10), errorRateAbove(0.5)))
      // Drop upstreams >8 blocks behind tip — stale state is dangerous for
      // trading. Pair with blockNumberLagAbove for cold-start coverage since
      // blockSecondsLagAbove is 0 until the block-time EMA warms (>=3 pairs).
      .excludeIf(any(blockNumberLagAbove(8), blockSecondsLagAbove(15)))
      .whenEmpty(() => upstreams)
      .preferTag('!tier:fallback', { minHealthy: 1, fallback: 'tier:fallback' })
      .sortByScore(PREFER_FASTEST)
      // Lower hysteresis (0.20) than the default policy's 0.30: switch
      // primaries faster when the challenger is 20%+ better — acceptable
      // for low-latency routing.
      .stickyPrimary({ hysteresis: 0.20, minSwitchInterval: '15s' })
      .probeExcluded({ sampleRate: 0.1, minSamples: 10, timeout: '5s' })
```

**TypeScript — `erpc.ts`:**

```typescript
selectionPolicy: {
  evalScope: "network",
  evalFunc: (upstreams, ctx) => upstreams
    .removeCordoned()
    .excludeIf(all(samplesAbove(10), errorRateAbove(0.5)))
    .excludeIf(any(blockNumberLagAbove(8), blockSecondsLagAbove(15)))
    .whenEmpty(() => upstreams)
    .preferTag("!tier:fallback", { minHealthy: 1, fallback: "tier:fallback" })
    .sortByScore(PREFER_FASTEST)
    .stickyPrimary({ hysteresis: 0.20, minSwitchInterval: "15s" })
    .probeExcluded({ sampleRate: 0.1, minSamples: 10, timeout: "5s" }),
}
```

**4. Freshness-critical workload (indexer / subgraph node).** A subgraph node or NFT indexer queries finalized blocks and needs the freshest data. `PREFER_FRESHEST` ranks by block-head lag and finalization lag instead of raw latency.
Tighter lag guards (`blockNumberLagAbove(3)`) drop a node the moment it falls behind, and `probeExcluded` ensures re-admission is immediate when it catches up:

**Config path:** `projects[].networks[].selectionPolicy`

**YAML — `erpc.yaml`:**

```yaml
selectionPolicy:
  evalScope: network
  evalFunc: |
    (upstreams, ctx) => upstreams
      .removeCordoned()
      .excludeIf(all(samplesAbove(10), errorRateAbove(0.7)))
      // Finalization lag guard: drop upstreams that haven't finalized recent
      // blocks — crucial for indexers reading finalized state only.
      .excludeIf(any(blockNumberLagAbove(3), finalizationLagAbove(2)))
      .whenEmpty(() => upstreams)
      // PREFER_FRESHEST: weights blockHeadLag×15 and finalizationLag×8 vs
      // PREFER_FASTEST's respLatency×15. Use this when stale data is worse
      // than slow data.
      .sortByScore(PREFER_FRESHEST)
      .stickyPrimary({ hysteresis: 0.30, minSwitchInterval: '30s' })
      .probeExcluded({ sampleRate: 0.1, minSamples: 10, timeout: '10s' })
```

**TypeScript — `erpc.ts`:**

```typescript
selectionPolicy: {
  evalScope: "network",
  evalFunc: (upstreams, ctx) => upstreams
    .removeCordoned()
    .excludeIf(all(samplesAbove(10), errorRateAbove(0.7)))
    .excludeIf(any(blockNumberLagAbove(3), finalizationLagAbove(2)))
    .whenEmpty(() => upstreams)
    // PREFER_FRESHEST weights blockHeadLag and finalizationLag over latency.
    .sortByScore(PREFER_FRESHEST)
    .stickyPrimary({ hysteresis: 0.30, minSwitchInterval: "30s" })
    .probeExcluded({ sampleRate: 0.1, minSamples: 10, timeout: "10s" }),
}
```

**5. Auditing a new exclusion rule without risk (`shadowExcludeIf`).** Before tightening a throttle-rate guard on production, run it in shadow mode: the predicate fires and increments a metric, but no upstream is actually dropped. Watch `erpc_selection_shadow_exclusion_total{reason="throttle_rate_above"}` for a few hours, then promote to `excludeIf` when the fire rate looks acceptable:

```js
(upstreams, ctx) => upstreams
  .removeCordoned()
  .excludeIf(all(samplesAbove(10), errorRateAbove(0.7)))
  // Shadow dry-run: fires the metric without excluding — safe to deploy first.
  // Check erpc_selection_shadow_exclusion_total{reason="throttle_rate_above"}.
  .shadowExcludeIf(all(samplesAbove(10), throttleRateAbove(0.2)))
  .whenEmpty(() => upstreams)
  .sortByScore(PREFER_FASTEST)
  .stickyPrimary({ hysteresis: 0.30, minSwitchInterval: '30s' })
  .probeExcluded({ sampleRate: 0.1, minSamples: 10, timeout: '10s' })
```

**6. Reserve tier — break-glass inclusion (`includeIf`).** Keep a pricier or rate-limited provider (tagged `tags: ["tier:reserve"]`) out of normal rotation, and bring it in only when the serving pool is *collectively* unfit — every survivor lagging or slow, or none left. Because `includeIf` adds rather than evicts, a single degraded primary never pulls the reserve in by itself; once admitted, `sortByScore` ranks reserve and survivors together. Keep both halves of the pattern: `excludeTag` declares the tier is out of normal rotation, `includeIf` declares the only conditions under which it returns — reading the pair top-to-bottom documents why the reserve exists. Pair with `routing.probe: off` on the reserve upstreams to also keep shadow-probe traffic off them while benched:

```js
(upstreams, ctx) => upstreams
  .removeCordoned()
  .excludeIf(all(samplesAbove(10), errorRateAbove(0.7)))
  // Reserve stays out of normal rotation...
  .excludeTag('tier:reserve')
  // ...and returns ONLY when survivors are collectively lagging or slow.
  // The `length > 0 &&` guard matters: native `every` is true on an empty
  // pool, so the empty case is handled by its own explicit line below.
  .includeIf('tier:reserve', (p) => p.length > 0 && p.every(blockSecondsLagAbove(30)))
  .includeIf('tier:reserve', (p) => p.length > 0 && p.every(latencyAbove(2000, 99)))
  .includeIf('tier:reserve', (p) => p.length < 1)
  .sortByScore(PREFER_FASTEST)
  .stickyPrimary({ hysteresis: 0.30, minSwitchInterval: '30s' })
```

### Request/response behavior

- The ordered upstream list is set on `req` via `req.SetUpstreams(upsList)`; the failsafe layer (hedge/retry) walks through it sequentially, never selecting the same upstream twice per execution. The `req.ConsumedUpstreams` and `req.ErrorsByUpstream` `sync.Map`s track what has been tried. [`erpc/networks.go:L1023-L1074`](https://github.com/erpc/erpc/blob/main/erpc/networks.go#L1023-L1074)
- If `policyEngine == nil` or `GetOrdered` returns an empty list (engine not yet ticked), the network falls back to raw upstream registry order. [`erpc/networks.go:L1032-1038`](https://github.com/erpc/erpc/blob/main/erpc/networks.go#L1032-L1038)
- `RegisterNetwork` fires a **synchronous initial tick** before returning, so in practice the first request always sees a populated cache. [`internal/policy/engine.go:RegisterNetwork:L305`](https://github.com/erpc/erpc/blob/main/internal/policy/engine.go#L305)
- On any tick error (`timeout`, `throw`, `invalid_return`), the **previous cache is retained unchanged** — routing continues on the last-good order. [`internal/policy/slot.go:L250-L263`](https://github.com/erpc/erpc/blob/main/internal/policy/slot.go#L250-L263)
- An OpenTelemetry span `"PolicyEngine.GetOrdered"` wraps the `GetOrdered` call in the request path. Attributes: `upstreams.count`, `upstreams.sorted` (detailed tracing only).

### Best practices

- **Pair `samplesAbove(N)` with every rate predicate.** `errorRateAbove(0.7)` on an upstream with a single sample (1 failure → 100% rate) will exclude it spuriously. `all(samplesAbove(10), errorRateAbove(0.7))` waits for a statistically meaningful window. The default policy uses `samplesAbove(10)` for error/throttle and `samplesAbove(20)` before the latency-deviation check.
- **Use `blockNumberLagAbove(N)` as the primary lag guard, not `blockSecondsLagAbove`.** The seconds variant is `0` until the block-time EMA warms (requires ≥3 block pairs). On cold start, `blockNumberLagAbove(16)` fires correctly; `blockSecondsLagAbove(30)` silently no-ops.
- **Keep `evalScope: "network"` unless you genuinely need per-method ranking.** Each additional scope dimension multiplies ticker goroutines and JS eval cycles. If you need method-specific `scoreMultipliers`, set `evalScope: "network-method"` — without it, method-specific entries silently no-op.
- **Don't set `routing.probe: "off"` on upstreams you care about recovering.** With probing off, an excluded upstream accumulates no shadow samples and only recovers when structural signals (lag clearing, etc.) naturally drive re-admission. Probing is what makes re-admission fast.
- **Keep `whenEmpty(() => upstreams)` in every policy.** Without it, a simultaneous spike in error rates across all upstreams returns `ErrUpstreamsExhausted` rather than degraded-but-working traffic. Fail-open is almost always better than fail-closed for RPC routing.
- **Use `overall` multipliers rather than custom weight shapes for simple up/down-ranking.** A `scoreMultipliers: [{overall: 0.1}]` on a backup upstream cleanly demotes it without changing the scoring formula for everyone else.
- **`stickyPrimary` hysteresis of `0.30` is conservative; lower it for faster recovery.** At `0.30`, a challenger must score 30% better than the incumbent. For setups with very similar upstreams, `0.15` may be more responsive without causing flapping.

### Edge cases & gotchas

1. **`scoreMultipliers.method` silently no-ops in `evalScope: "network"`.** `WildcardMatch("eth_getLogs", "*")` returns `false`. Entries with a specific `Method` field are only applied in per-method or per-method-finality slots.
[`internal/policy/eval.go:resolveScoreMultipliers:L442`](https://github.com/erpc/erpc/blob/main/internal/policy/eval.go#L442)
2. **`blockHeadLagSeconds` / `finalizationLagSeconds` are `0` until block-time EMA warms.** The EMA requires ≥3 `(blockNumber, blockTimestamp)` pairs with differing timestamps. Predicates like `blockSecondsLagAbove(30)` silently no-op until then. Use `blockNumberLagAbove(16)` as the primary guard. [`internal/policy/eval.go:L89-96`](https://github.com/erpc/erpc/blob/main/internal/policy/eval.go#L89-L96)
3. **`whenEmpty(() => upstreams)` prevents fail-closed.** If every `excludeIf` rule fires simultaneously (project-wide outage), the safety net restores the full unfiltered set. The network sends traffic to degraded upstreams rather than returning `ErrUpstreamsExhausted`. [`internal/policy/default_policy.js:L46`](https://github.com/erpc/erpc/blob/main/internal/policy/default_policy.js#L46)
4. **No time-based re-admission.** An excluded upstream stays excluded until its tracker metrics actually improve. `probeExcluded` accelerates re-admission by feeding fresh samples; arbitrary time passage does not. [`erpc/networks_selection_policy_test.go:TestNetworkPolicy_NoTimeBasedReadmit_OnlyMetricsHeal`](https://github.com/erpc/erpc/blob/main/erpc/networks_selection_policy_test.go)
5. **`routing.probe: "off"` permanently excludes from probing.** The upstream accumulates no shadow samples and stays excluded until structural signals (e.g. lag clearing) naturally drive re-admission, or an operator manually uncordons. [`common/config.go:L784-788`](https://github.com/erpc/erpc/blob/main/common/config.go#L784-L788)
6. **`stickyPrimary` with method-disagreement escape hatch.** If the previous primary is excluded in THIS method's slot but not others, the slot returns the current sort order WITHOUT updating the shared sticky register. Other slots that still have the upstream in their survivor set keep it as primary. [`internal/policy/stdlib/stdlib.js:L795-802`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L795-L802)
7. **`scoreMultipliers.finality` YAML integer vs JS bit-flag confusion.** The config field uses the Go `DataFinalityState` enum (`"finalized"=0`, `"unfinalized"=1`, `"realtime"=2`, `"unknown"=3`). The JS globals `FINALIZED=4`, `REALTIME=1`, `UNFINALIZED=2`, `UNKNOWN=8` are bitmasks for `.when(mask, fn)` inside `evalFunc` bodies — they are not valid in the YAML config field. Setting `finality: [4]` in YAML fails to match any request. [`common/data.go:L10-L27`](https://github.com/erpc/erpc/blob/main/common/data.go#L10-L27), [`internal/policy/stdlib/install.go:L74-77`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/install.go#L74-L77)
8. **`latencyDeviationAbove` is constructed fresh per tick.** The closure captures `topByMethod` at factory time. Do NOT put `latencyDeviationAbove` in a variable and reuse across ticks — it would use stale peer data. The default policy is safe because the chain is a stateless expression evaluated fresh each tick. [`internal/policy/stdlib/stdlib.js:L1228-1324`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L1228-L1324)
9. **Bug #907 regression.** With per-finality tracking enabled, block-head-lag writes must fan out to both the specific-finality bucket AND the `{*, All}` rollup. Pre-fix, `blockNumberLagAbove` read the starved rollup and silently no-oped. Fixed and covered by `TestNetworkPolicy_RealPoll_LaggingUpstreamExcluded`. [`erpc/networks_selection_policy_realpoll_test.go`](https://github.com/erpc/erpc/blob/main/erpc/networks_selection_policy_realpoll_test.go)
10. **Alphabetical tiebreak in `sortByScore`.** When two upstreams have identical scores (e.g. both clean on cold start), the lexicographically earlier ID wins. Test assertions are order-sensitive until metrics diverge.
[`internal/policy/stdlib/stdlib.js:L656`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L656)
11. **`upstreamDefaults.routing` is all-or-nothing.** If `upstream.routing` is non-nil (even empty), the default routing block is NOT applied. [`erpc/eval_multipliers_test.go:TestApplyDefaults_InheritsRouting`](https://github.com/erpc/erpc/blob/main/erpc/eval_multipliers_test.go)
12. **`p70 = 0` means unknown quality, not ultra-fast.** Fast-failing upstreams that reject connections in 1ms have empty quantile sketches. A predicate like `latencyAbove(500)` evaluates `0 > 500` as false and does NOT exclude such an upstream. Pair latency predicates with `samplesAbove(N)` or treat `p70 == 0` as a warning signal. [`internal/policy/eval.go:L89-96`](https://github.com/erpc/erpc/blob/main/internal/policy/eval.go#L89-L96)
13. **`probeExcluded` is a no-op array transform.** It does NOT modify the upstream list. Its only effect is depositing `__probeConfig` for the Go-side Prober. Probe traffic never blocks the request path; channel overflow increments `erpc_selection_probe_dropped_total` and drops the probe silently. The probe feed channel size is 256 (`probeFeedBufferSize = 256` in [`internal/policy/prober.go:L102`](https://github.com/erpc/erpc/blob/main/internal/policy/prober.go#L102)). [`internal/policy/stdlib/stdlib.js:L999-1010`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js#L999-L1010)
14. **Step-log is disabled by default in production.** Per-step chain trail logging (`DEBUG "policy step"`) is only active when `Engine.SetStepLogEnabled(true)` is called — by the simulator or when log level is DEBUG. In production there is zero overhead beyond one function-call indirection per stdlib step; no log entries and no per-step allocations are created. [`internal/policy/engine.go:L679-695`](https://github.com/erpc/erpc/blob/main/internal/policy/engine.go#L679-L695)
15. **`ErrUpstreamExcludedByPolicy` error code exists but has no production call site.** Defined at [`common/errors.go:L1823-L1836`](https://github.com/erpc/erpc/blob/main/common/errors.go#L1823-L1836); error code string: `"ErrUpstreamExcludedByPolicy"`. The health tracker silently ignores it (does NOT penalize the upstream's error rate). The policy engine achieves exclusion structurally — upstreams absent from the ordered list are never attempted by `NextUpstream` — rather than by returning this error. The error type is forward-compatible scaffolding.
16. **Idle slot eviction sweep runs every 1 minute.** The engine's `sweepIdleSlots` fires every `engineSweepInterval = 1 minute`. A narrow slot with `idleEvictionAfter = 1 h` may persist for up to 1 h + 1 min after its last access. The health tracker's own idle eviction threshold is separate and shorter: `DefaultIdleEvictionAfter = 30 minutes` for per-`(upstream, method, finality)` metric entries. [`internal/policy/engine.go:engineSweepInterval:L208`](https://github.com/erpc/erpc/blob/main/internal/policy/engine.go#L208)

### Observability

| Metric | Type | Labels | When it fires |
|---|---|---|---|
| `erpc_selection_position` | Gauge | `project, network, method, upstream` | Every tick; `0`=primary, `1+`=runner-up, `-1`=excluded. |
| `erpc_selection_score` | Gauge | `project, network, method, upstream` | Every tick; value is `u.score` from `sortByScore`. |
| `erpc_selection_eligible_upstreams` | Gauge | `project, network, method` | Every successful tick; count of in-rotation upstreams. |
| `erpc_selection_eval_duration_seconds` | Histogram | `project, network, method` | Every tick; JS eval wall-clock time. |
| `erpc_selection_eval_errors_total` | Counter | `project, network, method, kind` | Tick failure; `kind` ∈ `{timeout, throw, invalid_return}`. |
| `erpc_selection_rejection_total` | Counter | `project, network, method, upstream, step` | Per tick × excluded upstream; `step` = first stdlib primitive that dropped the upstream. |
| `erpc_selection_exclusion_total` | Counter | `project, network, method, upstream, reason` | Per tick × excluded upstream × leaf predicate slug. |
| `erpc_selection_shadow_exclusion_total` | Counter | `project, network, method, upstream, reason` | `shadowExcludeIf` would-have-excluded; upstream stays in rotation. |
| `erpc_selection_excluded_seconds` | Gauge | `project, network, method, upstream` | Seconds continuously excluded; reset to `0` on readmit. |
| `erpc_selection_readmit_total` | Counter | `project, network, method, upstream` | Excluded → in-rotation transition. |
| `erpc_selection_readmit_age_seconds` | Histogram | `project, network, method` | Distribution of exclusion age at readmit. |
| `erpc_selection_primary_switch_total` | Counter | `project, network, method, from, to` | Primary upstream ID changes. |
| `erpc_selection_sticky_hold_total` | Counter | `project, network, method, upstream` | Ticks where `stickyPrimary` held the primary against a challenger. |
| `erpc_selection_probe_requests_total` | Counter | `network, upstream, method` | Shadow-probe request fired to excluded upstream. |
| `erpc_selection_probe_errors_total` | Counter | `network, upstream, method, reason` | Probe errored; `reason` ∈ `{timeout, throttled, auth, skipped, error}`. |
| `erpc_selection_probe_skipped_total` | Counter | `network, reason` | Probe skipped; `reason` ∈ `{write_method, opt_out, sampled_out, max_concurrent, no_method}`. |
| `erpc_selection_probe_dropped_total` | Counter | `network, reason` | Probe publish dropped (feed channel full); request path never blocks. |

**`erpc_selection_position` value semantics** (assigned at [`internal/policy/slot.go:L477`](https://github.com/erpc/erpc/blob/main/internal/policy/slot.go#L477)):

- `0` — primary upstream (first element of the JS result array); serves the next request when no failover occurs.
- `1` — first runner-up; tried if the primary fails or is consumed by a concurrent hedge.
- `2`, `3`, … — third, fourth choices in ranked order.
- `-1` — excluded from rotation; all excluded upstreams share `-1` regardless of how many or which predicate dropped them.

An upstream can have `position=0` in the wildcard slot and `position=1` in a per-method slot simultaneously (positions are per-slot).

**`erpc_selection_eval_errors_total` `kind` values** (classified at [`internal/policy/slot.go:L460-L467`](https://github.com/erpc/erpc/blob/main/internal/policy/slot.go#L460-L467)):

- `"timeout"` — JS eval exceeded `evalTimeout` (default 100 ms); previous cache retained.
- `"throw"` — uncaught JS exception (runtime error, user-thrown error).
- `"invalid_return"` — eval completed but returned a non-array, null/undefined, entries without `id`, or unknown upstream IDs.

Note: `"fallback_default"` appears in the metric Help string and in simulator types as a comment, but **no production code path currently emits this value**. Treat it as forward-compatible documentation only.

**`erpc_selection_shadow_exclusion_total` vs `erpc_selection_exclusion_total` semantics:**

- `erpc_selection_exclusion_total` — upstream **was removed** from the ordered list; `erpc_selection_position` becomes `-1`.
- `erpc_selection_shadow_exclusion_total` — upstream **was NOT removed** and remains in rotation; `shadowExcludeIf` predicate fired for monitoring only. Same leaf-slug attribution as `excludeIf`.
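
The `erpc_selection_position` value semantics above can be expressed as a small classifier, which may help when post-processing scraped samples. This is a minimal sketch: `roleOf`, `stuckExcluded`, and the sample arrays are hypothetical names and data introduced for illustration, not part of eRPC or its metrics API.

```typescript
// Hypothetical helpers for interpreting `erpc_selection_position` samples.
type Role = "primary" | "runner-up" | "excluded";

function roleOf(position: number): Role {
  if (position === -1) return "excluded"; // shared by all excluded upstreams
  if (position === 0) return "primary";   // first element of the ordered list
  return "runner-up";                     // 1, 2, 3, ... in ranked order
}

// True when every sample in the window is -1, i.e. the upstream never
// re-entered rotation during that window. The length guard matters because
// `every` returns true for an empty array.
function stuckExcluded(samples: number[]): boolean {
  return samples.length > 0 && samples.every((p) => p === -1);
}

// Made-up sample windows for one upstream in one slot.
const flapping = [0, 1, -1, -1, 1];
const stuck = [-1, -1, -1, -1];

const flappingIsStuck = stuckExcluded(flapping);
const stuckIsStuck = stuckExcluded(stuck);
```

Because positions are per-slot, a check like this would need to be run separately for each `(network, method)` label combination rather than across slots.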
Key dashboarding tip: `erpc_selection_position{upstream="X"} == -1` sustained over time means `X` is stuck excluded — check `erpc_selection_exclusion_total{upstream="X"}` for the leaf reason slug, then `erpc_selection_probe_requests_total{upstream="X"}` to confirm probing is running.

### Source code entry points

- [`internal/policy/engine.go`](https://github.com/erpc/erpc/blob/main/internal/policy/engine.go) — `Engine` struct; `RegisterNetwork`, `GetOrdered`, `GetExcluded`, `PublishRequest`; idle slot eviction; probe config reconciliation.
- [`internal/policy/slot.go`](https://github.com/erpc/erpc/blob/main/internal/policy/slot.go) — `tickOnce` full eval cycle; `snapshotMetrics`; `materializeOrder`; atomic cache swap at [L308-311](https://github.com/erpc/erpc/blob/main/internal/policy/slot.go#L308-L311).
- [`internal/policy/eval.go`](https://github.com/erpc/erpc/blob/main/internal/policy/eval.go) — `runEval`; `buildJSUpstreams`; `readUpstreamMetrics`; `resolveScoreMultipliers`; `extractOrderedResult`.
- [`internal/policy/default_policy.js`](https://github.com/erpc/erpc/blob/main/internal/policy/default_policy.js) — embedded default policy chain; upgraded from placeholder at register time.
- [`internal/policy/stdlib/stdlib.js`](https://github.com/erpc/erpc/blob/main/internal/policy/stdlib/stdlib.js) — all chainable `Array.prototype` methods; predicate factories; `sortByScore`; `stickyPrimary`; `probeExcluded`; `latencyDeviationAbove` algorithm.
- [`internal/policy/prober.go`](https://github.com/erpc/erpc/blob/main/internal/policy/prober.go) — `Prober`; shadow-mirror request dispatch; `windowCounter` for `minSamples` floor; per-upstream in-flight cap.
- [`erpc/networks.go:L1023-L1074`](https://github.com/erpc/erpc/blob/main/erpc/networks.go#L1023-L1074) — request path: `GetOrdered` call, raw-registry fallback, `PublishRequest` to probe bus.
- [`common/defaults.go:L2512-2609`](https://github.com/erpc/erpc/blob/main/common/defaults.go#L2512-L2609) — `SelectionPolicyConfig.SetDefaults`; all field defaults.
- [`telemetry/metrics.go:L174-291`](https://github.com/erpc/erpc/blob/main/telemetry/metrics.go#L174-L291) — all `MetricSelection*` Prometheus vars.
- [`erpc/networks_selection_policy_test.go`](https://github.com/erpc/erpc/blob/main/erpc/networks_selection_policy_test.go) — integration tests: slow/erroring/lagging/throttled exclusion, fallback tier, safety-net, sticky primary, no-time-based readmit.

### Related pages

- [Hedge](/config/failsafe/hedge.llms.txt) — uses the ordered upstream list; each hedge leg gets the next upstream in selection order.
- [Retry](/config/failsafe/retry.llms.txt) — walks the same ordered list sequentially on each retry attempt.
- [Circuit breaker](/config/failsafe/circuit-breaker.llms.txt) — complementary per-upstream state machine checked after selection dispatches to an upstream; open breakers fail fast and the sweep advances down the ranked list.
- [Rate limiters](/config/rate-limiters.llms.txt) — caps traffic per upstream independent of selection scoring.
- [Upstream config](/config/projects/upstreams.llms.txt) — where `routing.scoreMultipliers` and `routing.probe` are set.
- [Survive provider outages](/use-cases/survive-provider-outages.llms.txt) — the outcome this feature primarily enables.