Failsafe

Failsafe policies help with intermittent issues and increase resiliency. They can be configured at both Network and Upstream levels, with support for per-method configuration.

Available policies

  • timeout: prevents requests from hanging indefinitely
  • retry: recovers from transient failures
  • hedge: runs parallel requests when upstreams are slow
  • circuitBreaker: temporarily removes failing upstreams
  • consensus: verifies multiple upstreams agree on results
  • Integrity increases data quality for specific methods

Per-method configuration

Failsafe policies can optionally be configured per-method using matchMethod and matchFinality fields. This allows fine-tuned behavior for different RPC methods and different block finality states.

  • matchMethod: Pattern to match RPC methods (supports the * wildcard and the | OR operator)
  • matchFinality: Array of finality states to match

When multiple failsafe configs are defined, they are evaluated in order and the first matching config is used.
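The first-match rule above can be sketched in a few lines of Python. This is only an illustration, not eRPC's actual matcher: the function names are hypothetical, `fnmatchcase` stands in for the wildcard logic, and treating an omitted matchFinality as "match any state" is an assumption.

```python
from fnmatch import fnmatchcase
from typing import List, Optional

def method_matches(pattern: str, method: str) -> bool:
    """True if `method` matches any '|'-separated alternative; '*' is a wildcard.

    Note: fnmatchcase also gives '?' and '[...]' special meaning, which is
    harmless here because RPC method names do not contain those characters.
    """
    return any(fnmatchcase(method, alt.strip()) for alt in pattern.split("|"))

def select_failsafe(configs: List[dict], method: str, finality: str) -> Optional[dict]:
    """Return the first config whose matchMethod and matchFinality both match."""
    for cfg in configs:
        pattern = cfg.get("matchMethod", "*")   # omitted matchMethod matches any method
        finalities = cfg.get("matchFinality")   # assumption: omitted means any finality
        if not method_matches(pattern, method):
            continue
        if finalities and finality not in finalities:
            continue
        return cfg                              # first match wins, later entries are ignored
    return None

configs = [
    {"matchMethod": "trace_*|debug_*", "timeout": "60s"},
    {"matchMethod": "eth_call", "matchFinality": ["unfinalized", "realtime"], "timeout": "10s"},
    {"matchMethod": "*", "timeout": "30s"},     # catch-all goes last
]
```

Because evaluation stops at the first match, a catch-all "*" entry placed at the top of the list would shadow every entry below it.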

Finality States

The matchFinality field can match against these data finality states:

  • finalized: Data from blocks that are confirmed as finalized and safe from reorgs. This is determined by comparing the block number with the upstream's finalized block.

    • Example methods: eth_getBlockByNumber (for old blocks), eth_getLogs (for finalized ranges)
    • Use case: Can have relaxed failsafe policies since data won't change
  • unfinalized: Data from recent blocks that could still be reorganized. Also includes any data from pending blocks.

    • Example methods: eth_getBlockByNumber("latest"), eth_call with recent blocks
    • Use case: May need more aggressive retries and shorter timeouts
  • realtime: Data that changes frequently, typically with every new block.

    • Example methods: eth_blockNumber, eth_gasPrice, eth_maxPriorityFeePerGas, net_peerCount
    • Use case: Often needs fast timeouts and may benefit from hedging
  • unknown: When the block number cannot be determined from the request/response.

    • Example methods: eth_getTransactionByHash, trace_transaction, debug_traceTransaction
    • Use case: Data is typically immutable once included, but block context is unknown
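As a rough mental model of the four states, the classification can be pictured like this. It is a simplified sketch, not eRPC's implementation: `classify_finality` and `REALTIME_METHODS` are hypothetical names, and the realtime set is simply the example methods listed above.

```python
from typing import Optional

# Assumption: hypothetical set built from the realtime example methods above.
REALTIME_METHODS = {
    "eth_blockNumber",
    "eth_gasPrice",
    "eth_maxPriorityFeePerGas",
    "net_peerCount",
}

def classify_finality(method: str, block_number: Optional[int], finalized_block: int) -> str:
    """Map a request to one of: realtime, unknown, finalized, unfinalized."""
    if method in REALTIME_METHODS:
        return "realtime"       # value changes with every new block
    if block_number is None:
        return "unknown"        # no block number in the request/response
    if block_number <= finalized_block:
        return "finalized"      # at or below the upstream's finalized block
    return "unfinalized"        # recent block that could still be reorganized
```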
erpc.yaml
projects:
  - id: main
    upstreams:
      - id: my-upstream
        failsafe:
          # Fast timeout for simple queries
          - matchMethod: "eth_getBlock*|eth_getTransaction*"
            timeout:
              duration: 5s
            retry:
              maxAttempts: 2
              delay: 100ms

          # Longer timeout for heavy trace methods
          - matchMethod: "trace_*|debug_*"
            timeout:
              duration: 60s
            retry:
              maxAttempts: 1    # expensive operations, no retries

          # Separate policy for unfinalized and realtime data
          - matchMethod: "eth_call|eth_estimateGas"
            matchFinality: ["unfinalized", "realtime"]
            timeout:
              duration: 10s
            retry:
              maxAttempts: 5    # unfinalized data changes frequently, retry more

          # Default policy for everything else. Listed last because
          # the first matching config wins.
          - matchMethod: "*"        # matches any method (default if omitted)
            timeout:
              duration: 30s
            retry:
              maxAttempts: 3

timeout policy

Sets a timeout for requests. Network-level timeout applies to the entire request lifecycle (including retries), while upstream-level timeout applies to each individual attempt.

Timeout supports two modes: fixed (static duration) and dynamic (quantile-based, adapts to real latency).

Fixed timeout

The simplest configuration — a static duration that applies to all requests matching the policy.

erpc.yaml
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 42161
        failsafe:
          - matchMethod: "*"
            timeout:
              duration: 30s        # Total time including all retries
    
    upstreams:
      - id: blastapi-chain-42161
        failsafe:
          - matchMethod: "*"
            timeout:
              duration: 15s        # Per-attempt timeout
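One way to picture how the two levels interact: each attempt is limited by the upstream timeout, but can never outlive what remains of the network budget. The helper below is a hypothetical sketch of that arithmetic under this reading of the two scopes, not eRPC code.

```python
def attempt_timeout(network_timeout: float, upstream_timeout: float, elapsed: float) -> float:
    """Seconds the next attempt may run.

    network_timeout:  total budget for the request, including retries
    upstream_timeout: limit for a single attempt
    elapsed:          time already spent on earlier attempts and delays
    """
    remaining = max(0.0, network_timeout - elapsed)  # what is left of the total budget
    return min(upstream_timeout, remaining)          # the tighter of the two limits applies
```

With the values above (30s network, 15s upstream), a first attempt gets the full 15s, while an attempt that starts 20s into the request gets only the remaining 10s.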

Dynamic quantile-based timeout

Quantile-based timeout (recommended) computes the timeout from real latency percentiles per method, so it automatically adapts to your traffic. It works similarly to quantile-based hedging.

When quantile is set, the timeout is computed dynamically from the DDSketch latency distribution for each RPC method. For example, quantile: 0.99 means "set the timeout at the p99 of observed latencies" — only the slowest 1% of requests would be timed out.

Field       | Type     | Description
------------|----------|------------
duration    | Duration | Cold-start fallback used until enough latency data is collected. Also serves as the fallback when maxDuration is not set.
quantile    | float    | Percentile of latency distribution to use as timeout (e.g., 0.99 for p99). Must be between 0 and 1.
minDuration | Duration | (Optional) Floor for the computed timeout. Prevents false timeouts when latencies are very low.
maxDuration | Duration | (Optional) Ceiling for the computed timeout. Can be used as the cold-start fallback when duration is omitted.

For most use cases, duration plus quantile is sufficient. Use quantile: 0.99 to time out only truly stuck requests while letting the system self-tune. The minDuration and maxDuration fields are optional guard rails.
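The way these four fields combine can be sketched as follows. This is a simplified model, not eRPC's implementation: it uses a sorted list with a nearest-rank quantile instead of a DDSketch, and `MIN_SAMPLES` is a hypothetical warm-up threshold rather than a real setting.

```python
import math
from typing import Optional, Sequence

MIN_SAMPLES = 10  # Assumption: hypothetical cold-start threshold, not an eRPC option.

def nearest_rank_quantile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank quantile of the samples (stand-in for a DDSketch query)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def dynamic_timeout(
    latencies: Sequence[float],
    quantile: float,
    duration: Optional[float] = None,
    min_duration: Optional[float] = None,
    max_duration: Optional[float] = None,
) -> float:
    """Timeout in seconds for one method, given its observed latencies."""
    if len(latencies) < MIN_SAMPLES:
        # Cold start: use duration, or maxDuration when duration is omitted.
        fallback = duration if duration is not None else max_duration
        if fallback is None:
            raise ValueError("either duration or max_duration is required")
        return fallback
    timeout = nearest_rank_quantile(latencies, quantile)
    if min_duration is not None:
        timeout = max(timeout, min_duration)  # floor: avoid false timeouts
    if max_duration is not None:
        timeout = min(timeout, max_duration)  # ceiling: never wait longer
    return timeout
```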

erpc.yaml
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          # Recommended: pure dynamic timeout with p99
          - matchMethod: "eth_call|eth_getLogs"
            timeout:
              duration: 30s        # Cold-start fallback
              quantile: 0.99       # Timeout at p99 of observed latencies
          
          # With optional guard rails
          - matchMethod: "*"
            timeout:
              duration: 60s        # Cold-start fallback
              quantile: 0.99
              minDuration: 200ms   # Never timeout faster than this
              maxDuration: 60s     # Never wait longer than this

Monitor the computed timeout values via the following Prometheus metric:

  • erpc_network_timeout_duration_seconds — histogram of dynamically computed timeout durations per method

retry policy

Automatically retries failed requests with configurable backoff strategies.

Retryable Errors

  • 5xx server errors (intermittent issues)
  • 408 request timeout
  • 429 rate limit exceeded
  • Empty responses for certain methods (e.g., eth_getLogs when node is lagging)

Non-Retryable Errors

  • 4xx client errors (invalid requests)
  • Unsupported method errors
erpc.yaml
projects:
  - id: main
    upstreams:
      - id: my-upstream
        failsafe:
          - matchMethod: "*"
            retry:
              maxAttempts: 3        # Total attempts (initial + 2 retries)
              delay: 1000ms         # Initial delay between retries
              backoffMaxDelay: 10s  # Maximum delay after backoff
              backoffFactor: 2      # Exponential backoff multiplier (each retry waits 2x longer, up to backoffMaxDelay)
              jitter: 500ms         # Random jitter (0-500ms) to prevent thundering herd
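To make the retry fields concrete, here is a small sketch of how a retry decision and the delay before each retry could be computed. It is an illustration under assumptions: the function names are hypothetical, the factor of 2.0 in the usage note is an illustrative value, and the exact order in which eRPC applies the cap and the jitter may differ.

```python
import random

def is_retryable_status(status: int) -> bool:
    """5xx, 408 and 429 are retried; other 4xx client errors are not."""
    if status in (408, 429):
        return True
    return 500 <= status <= 599

def retry_delay(
    retry_index: int,          # 0 for the first retry, 1 for the second, ...
    delay: float,              # initial delay in seconds
    backoff_factor: float,     # multiplier applied on each further retry
    backoff_max_delay: float,  # upper bound for the backed-off delay
    jitter: float,             # maximum random extra delay in seconds
) -> float:
    """Seconds to wait before the given retry."""
    base = min(delay * (backoff_factor ** retry_index), backoff_max_delay)
    return base + random.uniform(0.0, jitter)  # jitter spreads out simultaneous retries
```

For example, with delay 1s, a factor of 2.0, a 10s cap, and no jitter, successive retries wait 1s, 2s, 4s, 8s, and then 10s.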

Empty responses

For comprehensive documentation on empty results, missing data errors, and block unavailability — including all config fields and production guidelines — see Empty or missing data handling.

The retry policy can also handle empty responses, for example when a node is lagging behind or returns an unexpected empty result for some other reason.

  • What counts as empty: Results like null, [], "", {}, 0x, "0x", hex strings that are all zeros (e.g., 0x000...0), and method-specific empties (e.g., empty logs). Internally we detect these directly from response bytes.
  • Where retries apply: Only at the network level when the request has retryEmpty enabled (via directive defaults or request headers/params). Upstream-level retry does not retry on empties.
  • Default ignore list (retry.emptyResultAccept): Methods to NEVER retry when the response is empty (e.g., eth_getLogs, eth_call). Configure to override defaults.
  • Block availability check: For EVM, when empty and the upstream is not syncing, we try to extract the block number and check upstream availability. If the upstream can serve that block but still returned empty, we do not retry.
  • Availability confidence (retry.emptyResultConfidence):
    • finalizedBlock: If the target block is finalized (at or below the upstream's finalized block), empty responses are treated as valid (no retry). If the target block is after the finalized block, we will retry.
    • blockHead: If the target block is at or below the node's latest head, empty responses are treated as valid (no retry). If the target block is ahead of the head, we will retry.
  • Syncing nodes: If an upstream is syncing and returns empty, it is treated unfavorably and skipped for the remainder of the request.
  • Per-request de-dup: Upstreams that returned empty for a request are skipped on subsequent rotations for that same request.
  • Cap empty retries: retry.emptyResultMaxAttempts caps total attempts specifically for empty-result retries (default equals retry.maxAttempts).
  • Non-empty wins: If any non-empty response was seen, it is preserved and can be returned even if later attempts fail. In consensus, when configured, non-empty results are preferred.
  • Writes are never retried: Write methods (e.g., eth_send*) are not retried.
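Two of the rules above, empty detection and availability confidence, can be sketched like this. The function names are hypothetical and the checks are a simplification of the listed cases. Method-specific empties, syncing state, and the per-upstream availability check are left out.

```python
def is_empty_result(raw: bytes) -> bool:
    """Detect an empty JSON-RPC result directly from its raw bytes."""
    text = raw.strip().decode("utf-8")
    if text in ("", "null", "[]", "{}", '""', "0x", '"0x"'):
        return True
    value = text.strip('"')
    if value.startswith("0x") and len(value) > 2:
        return set(value[2:]) == {"0"}  # hex string made only of zeros
    return False

def should_retry_empty(target_block: int, finalized_block: int, head_block: int, confidence: str) -> bool:
    """Decide whether an empty result for `target_block` is worth retrying."""
    if confidence == "finalizedBlock":
        # Empty is trusted only for finalized blocks.
        return target_block > finalized_block
    if confidence == "blockHead":
        # Empty is trusted for anything the node has already seen.
        return target_block > head_block
    raise ValueError(f"unknown confidence: {confidence}")
```

With finalizedBlock, an empty result for a block between the finalized block and the head is retried. With blockHead, the same result is accepted.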
erpc.yaml
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 1
        directiveDefaults:
          retryEmpty: true # enable empty-result retries at network level
        failsafe:
          - matchMethod: "*"
            retry:
              maxAttempts: 4 # total attempts (initial + retries)
              emptyResultAccept: ["eth_getLogs", "eth_call"] # Never retry these methods when result is empty
              emptyResultConfidence: finalizedBlock # treat finalized empties as valid
              emptyResultMaxAttempts: 2 # cap attempts for empty-result retries only

hedge policy

Starts parallel requests when an upstream is slow to respond. Highly recommended at the network level for optimal performance.

Quantile-based hedging (recommended) uses response time statistics to determine optimal hedge timing, while fixed-delay hedging uses a static delay.

erpc.yaml
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          - matchMethod: "*"
            hedge:
              # Quantile-based (recommended): hedge after p99 response time
              quantile: 0.99
              minDelay: 100ms     # Minimum wait before hedging
              maxDelay: 2s        # Maximum wait before hedging
              maxCount: 1         # Max parallel hedged requests
              
              # Alternative: Fixed-delay hedging
              # delay: 500ms
              # maxCount: 1

Monitor effectiveness via Prometheus metrics:

  • erpc_network_hedged_request_total - total hedged requests
  • erpc_network_hedge_discards_total - wasted hedges (original responded first)
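The hedging idea itself can be sketched with asyncio. This is a conceptual illustration, not eRPC's implementation: `hedged_call`, `send`, and the upstream names are hypothetical, and the delay is passed in as a fixed value (with quantile-based hedging it would be derived from observed latencies and kept between minDelay and maxDelay).

```python
import asyncio

async def hedged_call(send, upstreams, delay, max_count=1):
    """Send to upstreams[0]; each time `delay` seconds pass with no response,
    start one more request on the next upstream (at most `max_count` hedges).
    Returns the result of whichever request finishes first."""
    tasks = [asyncio.create_task(send(upstreams[0]))]
    try:
        for upstream in upstreams[1 : 1 + max_count]:
            done, _ = await asyncio.wait(
                tasks, timeout=delay, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                break  # a response arrived before the hedge delay elapsed
            tasks.append(asyncio.create_task(send(upstream)))
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return next(iter(done)).result()  # re-raises if the first finisher failed
    finally:
        for task in tasks:
            task.cancel()  # discard the losers; no-op for finished tasks

async def demo():
    latency = {"slow-node": 0.5, "fast-node": 0.01}  # seconds, made-up values
    calls = []

    async def send(upstream):
        calls.append(upstream)
        await asyncio.sleep(latency[upstream])
        return upstream

    winner = await hedged_call(send, ["slow-node", "fast-node"], delay=0.05)
    return winner, calls
```

Running `asyncio.run(demo())` starts a hedge to the second upstream after 50ms because the first one has not answered yet. The cancelled request corresponds to what the discard metric counts when the original wins instead.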

circuitBreaker policy

Temporarily removes consistently failing upstreams to allow recovery time.

Circuit breaker states:

  • Closed: Normal operation, upstream is healthy
  • Open: Upstream is failing, temporarily removed from rotation
  • Half-open: Testing if upstream has recovered with limited traffic
erpc.yaml
projects:
  - id: main
    upstreams:
      - id: my-upstream
        failsafe:
          - matchMethod: "*"
            circuitBreaker:
              # Open circuit when 80% (160/200) of recent requests fail
              failureThresholdCount: 160
              failureThresholdCapacity: 200
              halfOpenAfter: 60s          # Try recovery after 1 minute
              # Close circuit when 80% (8/10) succeed in half-open state
              successThresholdCount: 8
              successThresholdCapacity: 10
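The three states and the count/capacity thresholds can be modelled with a small state machine. This is a sketch under assumptions rather than eRPC's implementation: the class and its fields are hypothetical, and it assumes the failure threshold is evaluated over a rolling window of the most recent `failure_capacity` outcomes.

```python
import time
from collections import deque

class CircuitBreaker:
    def __init__(self, failure_count, failure_capacity, half_open_after,
                 success_count, success_capacity, clock=time.monotonic):
        self.failure_count = failure_count        # failureThresholdCount
        self.half_open_after = half_open_after    # seconds to stay open
        self.success_count = success_count        # successThresholdCount
        self.success_capacity = success_capacity  # successThresholdCapacity
        self.clock = clock
        self.state = "closed"
        self.window = deque(maxlen=failure_capacity)  # recent outcomes, True = failure
        self.trial = []                               # half-open outcomes, True = success
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Whether a request may be sent to this upstream right now."""
        if self.state == "open" and self.clock() - self.opened_at >= self.half_open_after:
            self.state, self.trial = "half-open", []
        return self.state != "open"

    def record(self, success: bool) -> None:
        """Record the outcome of a request and update the state."""
        if self.state == "half-open":
            self.trial.append(success)
            failures = len(self.trial) - sum(self.trial)
            if sum(self.trial) >= self.success_count:
                self.state = "closed"
                self.window.clear()
            elif failures > self.success_capacity - self.success_count:
                self._open()  # success threshold can no longer be reached
        elif self.state == "closed":
            self.window.append(not success)
            if sum(self.window) >= self.failure_count:
                self._open()

    def _open(self) -> None:
        self.state, self.opened_at = "open", self.clock()
        self.window.clear()
```

The configuration above would correspond to `CircuitBreaker(160, 200, 60.0, 8, 10)` in this model.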

Real-World Examples

High-Performance DeFi Configuration

erpc.yaml
projects:
  - id: defi-prod
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          # Aggressive hedging for all methods
          - matchMethod: "*"
            hedge:
              quantile: 0.9      # p90 latency
              minDelay: 50ms
              maxCount: 2        # Up to 2 parallel hedges
            timeout:
              duration: 10s
    
    upstreams:
      - id: primary-node
        failsafe:
          # Price feeds need fast response
          - matchMethod: "eth_call"
            matchFinality: ["unfinalized", "realtime"]
            timeout:
              duration: 1s
            retry:
              maxAttempts: 1     # No time for retries
          
          # Block data can be slower but must succeed
          - matchMethod: "eth_getBlock*"
            timeout:
              duration: 5s
            retry:
              maxAttempts: 5
              delay: 100ms

Finality-Based Configuration

erpc.yaml
failsafe:
  # Finalized data: relaxed policies
  - matchMethod: "*"
    matchFinality: ["finalized"]
    timeout:
      duration: 30s
    retry:
      maxAttempts: 5
      backoffFactor: 2
  
  # Unfinalized data: aggressive timeouts
  - matchMethod: "*"
    matchFinality: ["unfinalized"]
    timeout:
      duration: 5s
    retry:
      maxAttempts: 2
      delay: 100ms
  
  # Realtime data: fast with hedging
  - matchMethod: "*"
    matchFinality: ["realtime"]
    timeout:
      duration: 2s
    hedge:
      delay: 500ms
      maxCount: 1
  
  # Unknown finality: moderate settings
  - matchMethod: "*"
    matchFinality: ["unknown"]
    timeout:
      duration: 15s
    retry:
      maxAttempts: 3

Indexer Configuration

erpc.yaml
projects:
  - id: indexer
    upstreams:
      - id: archive-node
        failsafe:
          # Bulk log queries need long timeouts
          - matchMethod: "eth_getLogs"
            timeout:
              duration: 120s
            retry:
              maxAttempts: 3
              backoffFactor: 2
          
          # Trace methods are expensive but critical
          - matchMethod: "trace_*|arbtrace_*"
            timeout:
              duration: 180s
            retry:
              maxAttempts: 2
            circuitBreaker:
              failureThresholdCount: 10    # More tolerant for slow methods
              failureThresholdCapacity: 20
              halfOpenAfter: 5m

Disabling Policies

To disable any policy, set it to null or ~ (YAML):

erpc.yaml
failsafe:
  - matchMethod: "*"
    hedge: ~              # Disable hedging
    circuitBreaker: ~     # Disable circuit breaker