Failsafe
There are various policies you can use either on Upstream-level or Network-level, to help with intermittent issues and increase general resiliency.
timeout:
helps prevent requests from hanging indefinitely.retry:
is used to recover transient issues.hedge:
might run simultaneous requests when one upstream is too slow to respond.circuitBreaker:
temporarily removes a down upstream until it recovers.
Config source code: common/config.go (opens in a new tab)
timeout
policy
This policy simply sets a timeout for the request, either on Network-level (when requests are sent to eRPC), or on Upstream-level (when requests are sent to a specific upstream).
# ...
projects:
- id: main
networks:
- architecture: evm
evm:
chainId: 42161
failsafe:
#...
timeout:
# Network-level timeout applies to the whole lifecycle of the request,
# this includes retries on network and/or upstream level.
duration: 30s
upstreams:
- id: blastapi-chain-42161
#...
failsafe:
timeout:
# Upstream-level timeout applies each request sent towards the upstream,
# e.g. if retry policy is set to 2 retries total time will be 30s for:
duration: 15s
retry
policy
This policies will retry certain retriable failures, either on network-level and/or upstream-level.
# ...
projects:
- id: main
networks:
- architecture: evm
evm:
chainId: 1
failsafe:
# ...
# On network-level retry policy applies to the incoming request to eRPC,
# this is additional to the retry policy set on upstream level.
retry:
# Total retries besides the initial request:
maxCount: 3
# Min delay between retries:
delay: 500ms
# Maximum delay between retries:
backoffMaxDelay: 10s
# Multiplier for each retry for exponential backoff:
backoffFactor: 0.3
# Random jitter to avoid thundering herd,
# e.g. add between 0 to 500ms to each retry delay:
jitter: 500ms
upstreams:
- id: blastapi-chain-42161
# ...
failsafe:
# Upstream-level retry policy applies each request sent towards the upstream,
# this is additional to the retry policy set on network level.
# For example if network has 2 retries and upstream has 2 retries,
# total retries will be 4.
retry:
maxCount: 2
delay: 1000ms
backoffMaxDelay: 10s
backoffFactor: 0.3
jitter: 500ms
These errors will be retried:
5xx
and generally any error that indicate server-side (intermittent) issues.408
which means request timeout.429
which means rate limit exceeded, therefore retrying after few moments.EmptyResponse
for certain methods (e.g. eth_getLogs) if upstream A returns empty array, it could be due to lag in node syncing, so upstream B will be retried.
These errors will not be retried:
4xx
and generally any error that indicate client-side issues (invalid request, invalid parameters, etc).UnsupportedMethods
which means upstream does not support certain methods (e.g. eth_traceTransaction)
hedge
policy
When a request towards an upstream is slow, the hedge
policy will start a new simultaneous request towards the next upstream.
This policy is highly recommended to be set at least on network-level. It will ensure if upstream A is slow, a new request towards upstream B will be started, whichever responds faster will be returned to the client.
# ...
projects:
- id: main
networks:
- architecture: evm
evm:
chainId: 1
failsafe:
# ...
hedge:
# Delay means how long to wait before starting a simultaneous hedged request.
# e.g. if upstream A did not respond within 500ms, a new request towards upstream B will be started,
# and whichever responds faster will be returned to the client.
delay: 500ms
# In total how many hedges to start.
# e.g. if maxCount is 2, and upstream A did not respond within 500ms,
# a new request towards upstream B will be started. If B also did not respond,
# a new request towards upstream C will be started.
maxCount: 1
To disable hedge policy, set it to null:
# ...
projects:
- id: main
networks:
- architecture: evm
evm:
chainId: 1
failsafe:
hedge: ~
circuitBreaker
policy
When upstreams are constantly failing, the circuitBreaker
policy will temporarily remove them from list of available upstreams.
This policy is recommended to be set on upstream-level. This will make sure temporarily broken upstreams are not used, and will give them time to recover.
# ...
projects:
- id: main
upstreams:
- id: blastapi-chain-42161
# ...
failsafe:
# ...
circuitBreaker:
# These two variables indicate how many failures and capacity to tolerate before opening the circuit.
# e.g. if 30 requests have failed out of last 100 requests, circuit breaker will be opened:
failureThresholdCount: 30
failureThresholdCapacity: 100
# How long to wait before trying to re-enable the upstream after circuit breaker was opened.
# e.g. after 60s give the upstream another chance:
halfOpenAfter: 60s
# These two variables indicate how many successes are required in half-open state before closing the circuit,
# and putting the upstream back in available upstreams.
# e.g. after 8 requests have succeeded out of last 10 requests, circuit breaker will be closed:
successThresholdCount: 8
successThresholdCapacity: 10
To disable circuit breaker policy, set it to null:
# ...
projects:
- id: main
upstreams:
- id: blastapi-chain-42161
failsafe:
circuitBreaker: ~
- Circuit breaker "open" means that upstream is temporarily removed from the list of available upstreams.
- Circuit breaker "half-open" means that upstream is tentatively put back into the list of available upstreams, but with reduced capacity (e.g. only 10 requests are allowed).
- Circuit breaker "closed" means that upstream is fully recovered and put back into the list of available upstreams.
Roadmap
On some doc pages we like to share our ideas for related future implementations, feel free to open a PR if you're up for a challenge:
- Allow defining failsafe policies on a per-method basis (e.g. different behavior for eth_getLogs vs other methods).