Failsafe

You can apply several failsafe policies at the upstream level or the network level to handle intermittent issues and improve overall resiliency.

  • timeout: prevents requests from hanging indefinitely.
  • retry: recovers from transient issues.
  • hedge: starts a simultaneous request towards another upstream when one upstream is too slow to respond.
  • circuitBreaker: temporarily removes a down upstream until it recovers.

Config source code: common/config.go

timeout policy

This policy sets a timeout for the request, either at the network level (for requests sent to eRPC) or at the upstream level (for requests eRPC sends to a specific upstream).

erpc.yaml
# ...
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 42161
        failsafe:
          #...
          timeout:
            # Network-level timeout applies to the whole lifecycle of the request,
            # including retries on network and/or upstream level.
            duration: 30s
 
    upstreams:
      - id: blastapi-chain-42161
        #...
        failsafe:
          timeout:
            # Upstream-level timeout applies to each request sent towards the upstream,
            # e.g. if retry policy is set to 2 retries total time will be 30s for:
            duration: 15s
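
One way to picture how the two levels interact is as a shared time budget. The sketch below is an illustration only, not eRPC's implementation: it assumes each upstream attempt is bounded by whichever is smaller, the upstream-level duration or the time left in the network-level budget. The function name `attempt_timeout` is hypothetical.

```python
def attempt_timeout(network_timeout_s, upstream_timeout_s, elapsed_s):
    """Return how long the next upstream attempt may run, in seconds.

    Assumption: the network-level timeout covers the whole request lifecycle,
    so a single attempt can never outlive the remaining network-level budget.
    """
    remaining = max(0.0, network_timeout_s - elapsed_s)
    return min(upstream_timeout_s, remaining)


# Values from the config above: 30s on the network, 15s on the upstream.
first_attempt = attempt_timeout(30, 15, elapsed_s=0)   # limited by the upstream timeout
late_retry = attempt_timeout(30, 15, elapsed_s=20)     # limited by the remaining budget
print(first_attempt, late_retry)
```

Under this assumption, a retry that starts 20s into the request only gets the 10s that remain of the network-level budget, even though the upstream-level timeout is 15s.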

retry policy

This policy retries certain retryable failures at the network level, the upstream level, or both.

erpc.yaml
# ...
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          # ...
          # On network-level, the retry policy applies to the incoming request to eRPC;
          # this is in addition to the retry policy set on upstream level.
          retry:
            # Total retries besides the initial request:
            maxCount: 3
            # Min delay between retries:
            delay: 500ms
            # Maximum delay between retries:
            backoffMaxDelay: 10s
            # Multiplier for each retry for exponential backoff:
            backoffFactor: 0.3
            # Random jitter to avoid thundering herd,
            # e.g. add between 0 to 500ms to each retry delay:
            jitter: 500ms
 
    upstreams:
      - id: blastapi-chain-42161
        # ...
        failsafe:
          # Upstream-level retry policy applies to each request sent towards the upstream;
          # this is in addition to the retry policy set on network level.
          # For example if network has 2 retries and upstream has 2 retries,
          # total retries will be 4.
          retry:
            maxCount: 2
            delay: 1000ms
            backoffMaxDelay: 10s
            backoffFactor: 0.3
            jitter: 500ms
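
To get a feel for how delay, backoffFactor, backoffMaxDelay and jitter combine, here is a rough sketch of a delay schedule. It rests on assumptions that are not confirmed by this page: the first retry waits `delay`, each following delay is the previous one multiplied by the factor, the result is capped at the maximum delay, and the jitter is added on top as a uniform random value. The function name `retry_delays` and the factor of 2.0 are illustrative and not taken from the config above.

```python
import random


def retry_delays(max_count, delay_s, backoff_factor, backoff_max_delay_s, jitter_s, rng=random):
    """Return the wait (in seconds) before each retry under the stated assumptions."""
    delays = []
    base = delay_s
    for _ in range(max_count):
        # Cap the backoff first, then add jitter between 0 and jitter_s,
        # so a jittered delay may exceed the cap by at most jitter_s.
        capped = min(base, backoff_max_delay_s)
        delays.append(capped + rng.uniform(0.0, jitter_s))
        base *= backoff_factor  # assumed: multiply the delay for the next retry
    return delays


# maxCount: 3, delay: 500ms, backoffMaxDelay: 10s, jitter: 500ms (factor is hypothetical)
print(retry_delays(max_count=3, delay_s=0.5, backoff_factor=2.0,
                   backoff_max_delay_s=10.0, jitter_s=0.5))
```

The jitter makes each run print slightly different values, which is the point: clients that failed at the same moment do not all retry at the same moment.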

These errors will be retried:

  • 5xx and generally any error that indicates server-side (intermittent) issues.
  • 408, which means request timeout.
  • 429, which means the rate limit was exceeded, so the request is retried after a few moments.
  • EmptyResponse for certain methods (e.g. eth_getLogs): if upstream A returns an empty array, it could be due to lag in node syncing, so the request is retried on upstream B.

These errors will not be retried:

  • 4xx and generally any error that indicates client-side issues (invalid request, invalid parameters, etc).
  • UnsupportedMethods, which means the upstream does not support certain methods (e.g. eth_traceTransaction).
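
The HTTP status code part of these two lists can be summarized as a small classifier. This is a sketch of the rules as written on this page, not eRPC's actual error handling; the function name `is_retryable` is hypothetical, and EmptyResponse and UnsupportedMethods are left out because they are not plain status codes.

```python
def is_retryable(status_code):
    """Classify an HTTP status code using the rules listed above."""
    if status_code in (408, 429):
        return True   # request timeout or rate limit exceeded
    if 500 <= status_code <= 599:
        return True   # server-side, likely intermittent
    return False      # other 4xx codes point to client-side issues


for code in (400, 408, 429, 503):
    print(code, is_retryable(code))
```

Note that 408 and 429 have to be checked before the general 4xx rule, since they are the exceptions to it.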

hedge policy

When a request towards an upstream is slow, the hedge policy will start a new simultaneous request towards the next upstream.

Setting this policy at least on network-level is highly recommended. It ensures that if upstream A is slow, a new request towards upstream B is started, and whichever responds first is returned to the client.

erpc.yaml
# ...
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          # ...
          hedge:
            # Delay means how long to wait before starting a simultaneous hedged request.
            # e.g. if upstream A did not respond within 500ms, a new request towards upstream B will be started,
            # and whichever responds faster will be returned to the client.
            delay: 500ms
            # In total how many hedges to start.
            # e.g. if maxCount is 2, and upstream A did not respond within 500ms,
            # a new request towards upstream B will be started. If B also did not respond,
            # a new request towards upstream C will be started.
            maxCount: 1
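
The hedging behavior described in the comments above can be sketched with asyncio. This is a simplified illustration, not eRPC's implementation: `hedged_call`, `upstream_a` and `upstream_b` are hypothetical names, and the upstreams are simulated with sleeps instead of real RPC calls.

```python
import asyncio


async def hedged_call(upstreams, delay_s, max_count):
    """Call upstreams[0]; start one more upstream each time delay_s passes with no response.

    upstreams: list of zero-argument coroutine functions (stand-ins for real upstreams).
    max_count: how many extra (hedged) requests may be started.
    """
    pending = {asyncio.create_task(upstreams[0]())}
    next_index = 1
    try:
        while True:
            can_hedge = next_index <= max_count and next_index < len(upstreams)
            done, pending = await asyncio.wait(
                pending,
                timeout=delay_s if can_hedge else None,
                return_when=asyncio.FIRST_COMPLETED,
            )
            if done:
                return done.pop().result()
            # Nobody answered within delay_s: hedge towards the next upstream.
            pending.add(asyncio.create_task(upstreams[next_index]()))
            next_index += 1
    finally:
        for task in pending:
            task.cancel()  # the slower requests are no longer needed


async def upstream_a():
    await asyncio.sleep(0.3)   # simulated slow upstream
    return "A"


async def upstream_b():
    await asyncio.sleep(0.01)  # simulated fast upstream
    return "B"


print(asyncio.run(hedged_call([upstream_a, upstream_b], delay_s=0.05, max_count=1)))
```

Here upstream A has not answered after 50ms, so a request to upstream B is started and its response wins. With `max_count=0` no hedge is started and the caller waits for A.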

circuitBreaker policy

When upstreams are constantly failing, the circuitBreaker policy temporarily removes them from the list of available upstreams.

Setting this policy on upstream-level is recommended. It makes sure temporarily broken upstreams are not used and gives them time to recover.

erpc.yaml
# ...
projects:
  - id: main
    upstreams:
      - id: blastapi-chain-42161
        # ...
        failsafe:
          # ...
          circuitBreaker:
            # These two variables set how many failures (count) out of how many recent requests (capacity)
            # are tolerated before opening the circuit, e.g. 30 failures within the last 100 requests.
            failureThresholdCount: 30
            failureThresholdCapacity: 100
            # How long to wait before trying to re-enable the upstream after circuit breaker was opened.
            halfOpenAfter: 60s
            # These two variables indicate how many successes are required in half-open state before closing the circuit,
            # and putting the upstream back in available upstreams.
            successThresholdCount: 8
            successThresholdCapacity: 10
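
The closed, open and half-open states can be sketched as a small state machine. This is a minimal illustration under assumptions, not eRPC's implementation: it treats each count/capacity pair as "N outcomes within the last M requests", and the class and method names are hypothetical. Time is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_count, failure_capacity,
                 success_count, success_capacity, half_open_after_s):
        self.failure_count = failure_count
        self.failure_capacity = failure_capacity
        self.success_count = success_count
        self.success_capacity = success_capacity
        self.half_open_after_s = half_open_after_s
        self.state = "closed"
        self.opened_at_s = 0.0
        self.window = deque(maxlen=failure_capacity)  # most recent outcomes only

    def allow(self, now_s):
        """Return True if the upstream may receive a request at time now_s."""
        if self.state == "open" and now_s - self.opened_at_s >= self.half_open_after_s:
            self.state = "half-open"
            self.window = deque(maxlen=self.success_capacity)
        return self.state != "open"

    def record(self, success, now_s):
        """Record the outcome of a request and update the state."""
        if self.state == "open":
            return
        self.window.append(success)
        failures = self.window.count(False)
        successes = len(self.window) - failures
        if self.state == "closed":
            if failures >= self.failure_count:
                self._open(now_s)
        elif successes >= self.success_count:
            self.state = "closed"
            self.window = deque(maxlen=self.failure_capacity)
        elif failures > self.success_capacity - self.success_count:
            self._open(now_s)  # the success threshold can no longer be reached

    def _open(self, now_s):
        self.state = "open"
        self.opened_at_s = now_s


# Values from the config above.
breaker = CircuitBreaker(30, 100, 8, 10, half_open_after_s=60)
for _ in range(30):
    breaker.record(success=False, now_s=0)
print(breaker.state, breaker.allow(now_s=30), breaker.allow(now_s=60))
```

In this sketch the upstream is skipped for 60 seconds after the 30th failure, then receives trial traffic in the half-open state until either 8 successes close the circuit or too many failures open it again.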

Roadmap

On some doc pages we like to share ideas for related future implementations. Feel free to open a PR if you're up for a challenge:


  • Allow defining failsafe policies on a per-method basis (e.g. different behavior for eth_getLogs vs other methods).