Failsafe
Failsafe policies help with intermittent issues and increase resiliency. They can be configured at both Network and Upstream levels, with support for per-method configuration.
Available policies
timeout:
prevents requests from hanging indefinitelyretry:
recovers from transient failureshedge:
runs parallel requests when upstreams are slowcircuitBreaker:
temporarily removes failing upstreamsconsensus:
verifies multiple upstreams agree on results- Integrity increases data quality for specific methods
Per-method configuration
Failsafe policies can optionally be configured per-method using matchMethod
and matchFinality
fields. This allows fine-tuned behavior for different RPC methods and different block finality states.
matchMethod
: Pattern to match RPC methods (a matcher supports wildcards*
and OR operator|
)matchFinality
: Array of finality states to match
When multiple failsafe configs are defined, they are evaluated in order and the first matching config is used.
Finality States
The matchFinality
field can match against these data finality states:
-
finalized
: Data from blocks that are confirmed as finalized and safe from reorgs. This is determined by comparing the block number with the upstream's finalized block.- Example methods:
eth_getBlockByNumber
(for old blocks),eth_getLogs
(for finalized ranges) - Use case: Can have relaxed failsafe policies since data won't change
- Example methods:
-
unfinalized
: Data from recent blocks that could still be reorganized. Also includes any data from pending blocks.- Example methods:
eth_getBlockByNumber("latest")
,eth_call
with recent blocks - Use case: May need more aggressive retries and shorter timeouts
- Example methods:
-
realtime
: Data that changes frequently, typically with every new block.- Example methods:
eth_blockNumber
,eth_gasPrice
,eth_maxPriorityFeePerGas
,net_peerCount
- Use case: Often needs fast timeouts and may benefit from hedging
- Example methods:
-
unknown
: When the block number cannot be determined from the request/response.- Example methods:
eth_getTransactionByHash
,trace_transaction
,debug_traceTransaction
- Use case: Data is typically immutable once included, but block context is unknown
- Example methods:
projects:
- id: main
upstreams:
- id: my-upstream
failsafe:
# Default policy for all methods
- matchMethod: "*" # matches any method (default if omitted)
timeout:
duration: 30s
retry:
maxAttempts: 3
# Fast timeout for simple queries
- matchMethod: "eth_getBlock*|eth_getTransaction*"
timeout:
duration: 5s
retry:
maxAttempts: 2
delay: 100ms
# Longer timeout for heavy trace methods
- matchMethod: "trace_*|debug_*"
timeout:
duration: 60s
retry:
maxAttempts: 1 # expensive operations, minimize retries
# Different policy for finalized vs unfinalized data
- matchMethod: "eth_call|eth_estimateGas"
matchFinality: ["unfinalized", "realtime"]
timeout:
duration: 10s
retry:
maxAttempts: 5 # unfinalized data changes frequently, retry more
timeout
policy
Sets a timeout for requests. Network-level timeout applies to the entire request lifecycle (including retries), while upstream-level timeout applies to each individual attempt.
projects:
- id: main
networks:
- architecture: evm
evm:
chainId: 42161
failsafe:
- matchMethod: "*"
timeout:
duration: 30s # Total time including all retries
upstreams:
- id: blastapi-chain-42161
failsafe:
- matchMethod: "*"
timeout:
duration: 15s # Per-attempt timeout
retry
policy
Automatically retries failed requests with configurable backoff strategies.
Retryable Errors
5xx
server errors (intermittent issues)408
request timeout429
rate limit exceeded- Empty responses for certain methods (e.g.,
eth_getLogs
when node is lagging)
Non-Retryable Errors
4xx
client errors (invalid requests)- Unsupported method errors
projects:
- id: main
upstreams:
- id: my-upstream
failsafe:
- matchMethod: "*"
retry:
maxAttempts: 3 # Total attempts (initial + 2 retries)
delay: 1000ms # Initial delay between retries
backoffMaxDelay: 10s # Maximum delay after backoff
backoffFactor: 0.3 # Exponential backoff multiplier
jitter: 500ms # Random jitter (0-500ms) to prevent thundering herd
hedge
policy
The user wants me to focus on consolidating examples and using fewer code blocks. I should merge the quantile-based and fixed-delay examples into one with comments.
Starts parallel requests when an upstream is slow to respond. Highly recommended at network level for optimal performance.
Quantile-based hedging (recommended) uses response time statistics to determine optimal hedge timing, while fixed-delay hedging uses a static delay.
projects:
- id: main
networks:
- architecture: evm
evm:
chainId: 1
failsafe:
- matchMethod: "*"
hedge:
# Quantile-based (recommended): hedge after p99 response time
quantile: 0.99
minDelay: 100ms # Minimum wait before hedging
maxDelay: 2s # Maximum wait before hedging
maxCount: 1 # Max parallel hedged requests
# Alternative: Fixed-delay hedging
# delay: 500ms
# maxCount: 1
Monitor effectiveness via Prometheus metrics:
erpc_network_hedged_request_total
- total hedged requestserpc_network_hedge_discards_total
- wasted hedges (original responded first)
circuitBreaker
policy
Temporarily removes consistently failing upstreams to allow recovery time.
Circuit breaker states:
- Closed: Normal operation, upstream is healthy
- Open: Upstream is failing, temporarily removed from rotation
- Half-open: Testing if upstream has recovered with limited traffic
projects:
- id: main
upstreams:
- id: my-upstream
failsafe:
- matchMethod: "*"
circuitBreaker:
# Open circuit when 80% (160/200) of recent requests fail
failureThresholdCount: 160
failureThresholdCapacity: 200
halfOpenAfter: 60s # Try recovery after 1 minute
# Close circuit when 80% (8/10) succeed in half-open state
successThresholdCount: 8
successThresholdCapacity: 10
Real-World Examples
High-Performance DeFi Configuration
projects:
- id: defi-prod
networks:
- architecture: evm
evm:
chainId: 1
failsafe:
# Aggressive hedging for all methods
- matchMethod: "*"
hedge:
quantile: 0.9 # p90 latency
minDelay: 50ms
maxCount: 2 # Up to 2 parallel hedges
timeout:
duration: 10s
upstreams:
- id: primary-node
failsafe:
# Price feeds need fast response
- matchMethod: "eth_call"
matchFinality: ["latest"]
timeout:
duration: 1s
retry:
maxAttempts: 1 # No time for retries
# Block data can be slower but must succeed
- matchMethod: "eth_getBlock*"
timeout:
duration: 5s
retry:
maxAttempts: 5
delay: 100ms
Finality-Based Configuration
failsafe:
# Finalized data: relaxed policies
- matchMethod: "*"
matchFinality: ["finalized"]
timeout:
duration: 30s
retry:
maxAttempts: 5
backoffFactor: 2
# Unfinalized data: aggressive timeouts
- matchMethod: "*"
matchFinality: ["unfinalized"]
timeout:
duration: 5s
retry:
maxAttempts: 2
delay: 100ms
# Realtime data: fast with hedging
- matchMethod: "*"
matchFinality: ["realtime"]
timeout:
duration: 2s
hedge:
delay: 500ms
maxCount: 1
# Unknown finality: moderate settings
- matchMethod: "*"
matchFinality: ["unknown"]
timeout:
duration: 15s
retry:
maxAttempts: 3
Indexer Configuration
projects:
- id: indexer
upstreams:
- id: archive-node
failsafe:
# Bulk log queries need long timeouts
- matchMethod: "eth_getLogs"
timeout:
duration: 120s
retry:
maxAttempts: 3
backoffFactor: 2
# Trace methods are expensive but critical
- matchMethod: "trace_*|arbtrace_*"
timeout:
duration: 180s
retry:
maxAttempts: 2
circuitBreaker:
failureThresholdCount: 10 # More tolerant for slow methods
failureThresholdCapacity: 20
halfOpenAfter: 5m
Disabling Policies
To disable any policy, set it to null
or ~
(YAML):
failsafe:
- matchMethod: "*"
hedge: ~ # Disable hedging
circuitBreaker: ~ # Disable circuit breaker