Config
Failsafe

Failsafe

Failsafe policies help with intermittent issues and increase resiliency. They can be configured at both Network and Upstream levels, with support for per-method configuration.

Available policies

  • timeout: prevents requests from hanging indefinitely
  • retry: recovers from transient failures
  • hedge: runs parallel requests when upstreams are slow
  • circuitBreaker: temporarily removes failing upstreams
  • consensus: verifies multiple upstreams agree on results
  • Integrity increases data quality for specific methods

Per-method configuration

Failsafe policies can optionally be configured per-method using matchMethod and matchFinality fields. This allows fine-tuned behavior for different RPC methods and different block finality states.

  • matchMethod: Pattern to match RPC methods (a matcher supports wildcards * and OR operator |)
  • matchFinality: Array of finality states to match

When multiple failsafe configs are defined, they are evaluated in order and the first matching config is used.

Finality States

The matchFinality field can match against these data finality states:

  • finalized: Data from blocks that are confirmed as finalized and safe from reorgs. This is determined by comparing the block number with the upstream's finalized block.

    • Example methods: eth_getBlockByNumber (for old blocks), eth_getLogs (for finalized ranges)
    • Use case: Can have relaxed failsafe policies since data won't change
  • unfinalized: Data from recent blocks that could still be reorganized. Also includes any data from pending blocks.

    • Example methods: eth_getBlockByNumber("latest"), eth_call with recent blocks
    • Use case: May need more aggressive retries and shorter timeouts
  • realtime: Data that changes frequently, typically with every new block.

    • Example methods: eth_blockNumber, eth_gasPrice, eth_maxPriorityFeePerGas, net_peerCount
    • Use case: Often needs fast timeouts and may benefit from hedging
  • unknown: When the block number cannot be determined from the request/response.

    • Example methods: eth_getTransactionByHash, trace_transaction, debug_traceTransaction
    • Use case: Data is typically immutable once included, but block context is unknown
erpc.yaml
projects:
  - id: main
    upstreams:
      - id: my-upstream
        failsafe:
          # Default policy for all methods
          - matchMethod: "*"        # matches any method (default if omitted)
            timeout:
              duration: 30s
            retry:
              maxAttempts: 3
          
          # Fast timeout for simple queries
          - matchMethod: "eth_getBlock*|eth_getTransaction*"
            timeout:
              duration: 5s
            retry:
              maxAttempts: 2
              delay: 100ms
          
          # Longer timeout for heavy trace methods
          - matchMethod: "trace_*|debug_*"
            timeout:
              duration: 60s
            retry:
              maxAttempts: 1    # expensive operations, minimize retries
          
          # Different policy for finalized vs unfinalized data
          - matchMethod: "eth_call|eth_estimateGas"
            matchFinality: ["unfinalized", "realtime"]
            timeout:
              duration: 10s
            retry:
              maxAttempts: 5    # unfinalized data changes frequently, retry more

timeout policy

Sets a timeout for requests. Network-level timeout applies to the entire request lifecycle (including retries), while upstream-level timeout applies to each individual attempt.

erpc.yaml
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 42161
        failsafe:
          - matchMethod: "*"
            timeout:
              duration: 30s        # Total time including all retries
    
    upstreams:
      - id: blastapi-chain-42161
        failsafe:
          - matchMethod: "*"
            timeout:
              duration: 15s        # Per-attempt timeout

retry policy

Automatically retries failed requests with configurable backoff strategies.

Retryable Errors

  • 5xx server errors (intermittent issues)
  • 408 request timeout
  • 429 rate limit exceeded
  • Empty responses for certain methods (e.g., eth_getLogs when node is lagging)

Non-Retryable Errors

  • 4xx client errors (invalid requests)
  • Unsupported method errors
erpc.yaml
projects:
  - id: main
    upstreams:
      - id: my-upstream
        failsafe:
          - matchMethod: "*"
            retry:
              maxAttempts: 3        # Total attempts (initial + 2 retries)
              delay: 1000ms         # Initial delay between retries
              backoffMaxDelay: 10s  # Maximum delay after backoff
              backoffFactor: 0.3    # Exponential backoff multiplier
              jitter: 500ms         # Random jitter (0-500ms) to prevent thundering herd

hedge policy

The user wants me to focus on consolidating examples and using fewer code blocks. I should merge the quantile-based and fixed-delay examples into one with comments.

Starts parallel requests when an upstream is slow to respond. Highly recommended at network level for optimal performance.

Quantile-based hedging (recommended) uses response time statistics to determine optimal hedge timing, while fixed-delay hedging uses a static delay.

erpc.yaml
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          - matchMethod: "*"
            hedge:
              # Quantile-based (recommended): hedge after p99 response time
              quantile: 0.99
              minDelay: 100ms     # Minimum wait before hedging
              maxDelay: 2s        # Maximum wait before hedging
              maxCount: 1         # Max parallel hedged requests
              
              # Alternative: Fixed-delay hedging
              # delay: 500ms
              # maxCount: 1

Monitor effectiveness via Prometheus metrics:

  • erpc_network_hedged_request_total - total hedged requests
  • erpc_network_hedge_discards_total - wasted hedges (original responded first)

circuitBreaker policy

Temporarily removes consistently failing upstreams to allow recovery time.

Circuit breaker states:

  • Closed: Normal operation, upstream is healthy
  • Open: Upstream is failing, temporarily removed from rotation
  • Half-open: Testing if upstream has recovered with limited traffic
erpc.yaml
projects:
  - id: main
    upstreams:
      - id: my-upstream
        failsafe:
          - matchMethod: "*"
            circuitBreaker:
              # Open circuit when 80% (160/200) of recent requests fail
              failureThresholdCount: 160
              failureThresholdCapacity: 200
              halfOpenAfter: 60s          # Try recovery after 1 minute
              # Close circuit when 80% (8/10) succeed in half-open state
              successThresholdCount: 8
              successThresholdCapacity: 10

Real-World Examples

High-Performance DeFi Configuration

erpc.yaml
projects:
  - id: defi-prod
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          # Aggressive hedging for all methods
          - matchMethod: "*"
            hedge:
              quantile: 0.9      # p90 latency
              minDelay: 50ms
              maxCount: 2        # Up to 2 parallel hedges
            timeout:
              duration: 10s
    
    upstreams:
      - id: primary-node
        failsafe:
          # Price feeds need fast response
          - matchMethod: "eth_call"
            matchFinality: ["latest"]
            timeout:
              duration: 1s
            retry:
              maxAttempts: 1     # No time for retries
          
          # Block data can be slower but must succeed
          - matchMethod: "eth_getBlock*"
            timeout:
              duration: 5s
            retry:
              maxAttempts: 5
              delay: 100ms

Finality-Based Configuration

erpc.yaml
failsafe:
  # Finalized data: relaxed policies
  - matchMethod: "*"
    matchFinality: ["finalized"]
    timeout:
      duration: 30s
    retry:
      maxAttempts: 5
      backoffFactor: 2
  
  # Unfinalized data: aggressive timeouts
  - matchMethod: "*"
    matchFinality: ["unfinalized"]
    timeout:
      duration: 5s
    retry:
      maxAttempts: 2
      delay: 100ms
  
  # Realtime data: fast with hedging
  - matchMethod: "*"
    matchFinality: ["realtime"]
    timeout:
      duration: 2s
    hedge:
      delay: 500ms
      maxCount: 1
  
  # Unknown finality: moderate settings
  - matchMethod: "*"
    matchFinality: ["unknown"]
    timeout:
      duration: 15s
    retry:
      maxAttempts: 3

Indexer Configuration

erpc.yaml
projects:
  - id: indexer
    upstreams:
      - id: archive-node
        failsafe:
          # Bulk log queries need long timeouts
          - matchMethod: "eth_getLogs"
            timeout:
              duration: 120s
            retry:
              maxAttempts: 3
              backoffFactor: 2
          
          # Trace methods are expensive but critical
          - matchMethod: "trace_*|arbtrace_*"
            timeout:
              duration: 180s
            retry:
              maxAttempts: 2
            circuitBreaker:
              failureThresholdCount: 10    # More tolerant for slow methods
              failureThresholdCapacity: 20
              halfOpenAfter: 5m

Disabling Policies

To disable any policy, set it to null or ~ (YAML):

erpc.yaml
failsafe:
  - matchMethod: "*"
    hedge: ~              # Disable hedging
    circuitBreaker: ~     # Disable circuit breaker