Failsafe
Failsafe policies can be set at the upstream level or the network level to handle intermittent issues and improve overall resiliency.
- timeout: helps prevent requests from hanging indefinitely.
- retry: is used to recover from transient issues.
- hedge: might run simultaneous requests when one upstream is too slow to respond.
- circuitBreaker: temporarily removes a down upstream until it recovers.
Config source code: common/config.go
timeout policy
This policy simply sets a timeout for the request, either on Network-level (when requests are sent to eRPC), or on Upstream-level (when requests are sent to a specific upstream).
# ...
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 42161
        failsafe:
          # ...
          timeout:
            # Network-level timeout applies to the whole lifecycle of the request,
            # this includes retries on network and/or upstream level.
            duration: 30s
    upstreams:
      - id: blastapi-chain-42161
        # ...
        failsafe:
          timeout:
            # Upstream-level timeout applies to each request sent towards the upstream,
            # e.g. if the retry policy is set to 2 retries, total time will be 30s for:
            duration: 15s
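To make the interaction between the two timeouts concrete, here is a minimal Go sketch. It is not eRPC's code: `worstCaseLatency` is a hypothetical helper. It assumes every upstream attempt runs for the full upstream-level duration, ignores retry delays, and treats the network-level duration as a hard cap on the whole request.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseLatency estimates the longest a request can take when every
// upstream attempt runs into the upstream-level timeout.
// Hypothetical helper for illustration; not part of eRPC.
func worstCaseLatency(networkTimeout, upstreamTimeout time.Duration, attempts int) time.Duration {
	total := time.Duration(attempts) * upstreamTimeout
	if total > networkTimeout {
		// The network-level timeout covers the whole lifecycle,
		// so it cuts off any remaining attempts.
		return networkTimeout
	}
	return total
}

func main() {
	// Values from the config above: 30s network-level, 15s upstream-level.
	fmt.Println(worstCaseLatency(30*time.Second, 15*time.Second, 1)) // 15s
	fmt.Println(worstCaseLatency(30*time.Second, 15*time.Second, 2)) // 30s
	fmt.Println(worstCaseLatency(30*time.Second, 15*time.Second, 3)) // 30s (capped)
}
```

In this model the network-level duration is the figure a client can rely on, while the upstream-level duration only bounds a single attempt.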
retry policy
This policy retries certain retriable failures at the network level, the upstream level, or both.
# ...
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          # ...
          # On network-level, the retry policy applies to the incoming request to eRPC,
          # this is additional to the retry policy set on upstream level.
          retry:
            # Total retries besides the initial request:
            maxCount: 3
            # Minimum delay between retries:
            delay: 500ms
            # Maximum delay between retries:
            backoffMaxDelay: 10s
            # Multiplier for each retry for exponential backoff:
            backoffFactor: 0.3
            # Random jitter to avoid thundering herd,
            # e.g. add between 0 to 500ms to each retry delay:
            jitter: 500ms
    upstreams:
      - id: blastapi-chain-42161
        # ...
        failsafe:
          # Upstream-level retry policy applies to each request sent towards the upstream,
          # this is additional to the retry policy set on network level.
          # For example if network has 2 retries and upstream has 2 retries,
          # total retries will be 4.
          retry:
            maxCount: 2
            delay: 1000ms
            backoffMaxDelay: 10s
            backoffFactor: 0.3
            jitter: 500ms
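The sketch below shows one common way that a delay, a backoff factor, a maximum delay, and a jitter can combine into a wait time: capped exponential backoff with additive jitter. The formula, the function names, and the factor of 2 used in `main` are assumptions for illustration. eRPC's exact calculation may differ.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// backoffDelay returns the base wait before retry number n (1-based):
// the initial delay multiplied by factor for each earlier retry,
// capped at maxDelay. Assumed formula; not taken from eRPC.
func backoffDelay(delay, maxDelay time.Duration, factor float64, n int) time.Duration {
	d := float64(delay) * math.Pow(factor, float64(n-1))
	if d > float64(maxDelay) {
		return maxDelay
	}
	return time.Duration(d)
}

// withJitter adds a random extra wait in [0, jitter) so that many clients
// retrying at once do not all hit the upstream at the same moment.
func withJitter(d, jitter time.Duration, rng *rand.Rand) time.Duration {
	if jitter <= 0 {
		return d
	}
	return d + time.Duration(rng.Int63n(int64(jitter)))
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// A factor of 2 is a hypothetical value chosen to make the growth visible.
	for n := 1; n <= 6; n++ {
		base := backoffDelay(500*time.Millisecond, 10*time.Second, 2, n)
		fmt.Printf("retry %d: base=%v, with jitter=%v\n",
			n, base, withJitter(base, 500*time.Millisecond, rng))
	}
}
```

With these inputs the base delay doubles from 500ms until it reaches the 10s cap. Under this particular formula a factor below 1 would shrink the delay on each retry instead.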
These errors will be retried:
- 5xx: and generally any error that indicates server-side (intermittent) issues.
- 408: which means request timeout.
- 429: which means rate limit exceeded, so the request is retried after a few moments.
- EmptyResponse: for certain methods (e.g. eth_getLogs), if upstream A returns an empty array, it could be due to lag in node syncing, so upstream B will be tried.
These errors will not be retried:
- 4xx (other than 408 and 429 above): and generally any error that indicates client-side issues (invalid request, invalid parameters, etc).
- UnsupportedMethods: which means the upstream does not support certain methods (e.g. eth_traceTransaction).
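For HTTP status codes, the two lists above can be summarized in a few lines of Go. This is a sketch only: `isRetryableStatus` is a hypothetical name, and it leaves out the non-HTTP cases listed above (EmptyResponse and UnsupportedMethods).

```go
package main

import "fmt"

// isRetryableStatus mirrors the lists above for HTTP status codes:
// 5xx, 408 and 429 are retried, all other 4xx codes are not.
// Hypothetical helper for illustration; not part of eRPC.
func isRetryableStatus(code int) bool {
	switch {
	case code == 408, code == 429:
		// Request timeout and rate limiting are treated as transient.
		return true
	case code >= 500 && code <= 599:
		// Server-side errors are assumed to be intermittent.
		return true
	default:
		return false
	}
}

func main() {
	for _, code := range []int{400, 404, 408, 429, 500, 503} {
		fmt.Printf("%d retryable=%v\n", code, isRetryableStatus(code))
	}
}
```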
hedge policy
When a request towards an upstream is slow, the hedge policy will start a new simultaneous request towards the next upstream.
This policy is highly recommended to be set at least on network-level. It ensures that if upstream A is slow, a new request towards upstream B is started, and whichever responds faster is returned to the client.
# ...
projects:
  - id: main
    networks:
      - architecture: evm
        evm:
          chainId: 1
        failsafe:
          # ...
          hedge:
            # Delay means how long to wait before starting a simultaneous hedged request.
            # e.g. if upstream A did not respond within 500ms, a new request towards upstream B will be started,
            # and whichever responds faster will be returned to the client.
            delay: 500ms
            # Maximum number of hedged requests to start.
            # e.g. if maxCount is 2, and upstream A did not respond within 500ms,
            # a new request towards upstream B will be started. If B also did not respond,
            # a new request towards upstream C will be started.
            maxCount: 1
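The hedging behavior described above can be sketched in Go with goroutines and a timer. This is an illustration under stated assumptions, not eRPC's implementation: `upstreamCall`, `hedged`, `fakeUpstream`, and `runExample` are hypothetical names, the number of hedges is simply the number of extra calls passed in, and the first response wins even if it is an error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// upstreamCall stands in for a request to one upstream (hypothetical type).
type upstreamCall func(ctx context.Context) (string, error)

type result struct {
	val string
	err error
}

// hedged starts calls[0] right away. Each time delay elapses without any
// response, it starts the next call in parallel. The first response wins,
// even if it is an error (a simplification for this sketch).
func hedged(ctx context.Context, delay time.Duration, calls []upstreamCall) (string, error) {
	if len(calls) == 0 {
		return "", errors.New("no upstreams to call")
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel the slower requests once a winner returns

	results := make(chan result, len(calls)) // buffered so losers never block
	launch := func(call upstreamCall) {
		go func() {
			val, err := call(ctx)
			results <- result{val, err}
		}()
	}

	launch(calls[0])
	next := 1
	timer := time.NewTimer(delay)
	defer timer.Stop()
	for {
		select {
		case r := <-results:
			return r.val, r.err
		case <-timer.C:
			if next < len(calls) {
				launch(calls[next]) // start one more hedged request
				next++
				timer.Reset(delay)
			}
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// fakeUpstream simulates an upstream that answers with its name after latency.
func fakeUpstream(name string, latency time.Duration) upstreamCall {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(latency):
			return name, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// runExample races upstream A against a hedge to upstream B after 50ms.
func runExample(aLatency, bLatency time.Duration) (string, error) {
	return hedged(context.Background(), 50*time.Millisecond, []upstreamCall{
		fakeUpstream("A", aLatency),
		fakeUpstream("B", bLatency),
	})
}

func main() {
	// A needs 300ms. B is started after 50ms and answers 20ms later, so B wins.
	fmt.Println(runExample(300*time.Millisecond, 20*time.Millisecond)) // B <nil>
}
```

Cancelling the shared context when the winner returns is what stops the slower request from continuing to use the upstream after its response is no longer needed.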
circuitBreaker policy
When upstreams are constantly failing, the circuitBreaker policy will temporarily remove them from the list of available upstreams.
This policy is recommended to be set on upstream-level. It makes sure temporarily broken upstreams are not used, and gives them time to recover.
# ...
projects:
  - id: main
    upstreams:
      - id: blastapi-chain-42161
        # ...
        failsafe:
          # ...
          circuitBreaker:
            # These two variables set how many failures to tolerate, out of how many recent requests,
            # before opening the circuit (here 30 failures out of the last 100 requests).
            failureThresholdCount: 30
            failureThresholdCapacity: 100
            # How long to wait before trying to re-enable the upstream after the circuit breaker was opened.
            halfOpenAfter: 60s
            # These two variables set how many successes are required, out of how many requests,
            # in half-open state before closing the circuit and putting the upstream back
            # in available upstreams (here 8 successes out of 10 requests).
            successThresholdCount: 8
            successThresholdCapacity: 10
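The closed, open, and half-open states implied by this config can be sketched as a small count-based state machine in Go. This is a simplified illustration, not eRPC's implementation: the `breaker` type and its methods are hypothetical, and the rule for re-opening from half-open state is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// breaker is a simplified count-based circuit breaker (hypothetical type).
type breaker struct {
	failureCount, failureCapacity int
	successCount, successCapacity int
	halfOpenAfter                 time.Duration

	st       state
	window   []bool // most recent outcomes, true = success
	openedAt time.Time
}

// allow reports whether a request may be sent to the upstream at time now.
func (b *breaker) allow(now time.Time) bool {
	if b.st == open && now.Sub(b.openedAt) >= b.halfOpenAfter {
		b.st, b.window = halfOpen, nil // start probing the upstream again
	}
	return b.st != open
}

// record stores the outcome of a request and updates the state.
func (b *breaker) record(success bool, now time.Time) {
	if b.st == open {
		return // no requests are expected while the circuit is open
	}
	capacity := b.failureCapacity
	if b.st == halfOpen {
		capacity = b.successCapacity
	}
	b.window = append(b.window, success)
	if len(b.window) > capacity {
		b.window = b.window[1:] // keep only the most recent outcomes
	}
	successes, failures := 0, 0
	for _, ok := range b.window {
		if ok {
			successes++
		} else {
			failures++
		}
	}
	switch b.st {
	case closed:
		if failures >= b.failureCount {
			b.st, b.openedAt, b.window = open, now, nil
		}
	case halfOpen:
		if successes >= b.successCount {
			b.st, b.window = closed, nil
		} else if failures > b.successCapacity-b.successCount {
			// Too many probes failed to still reach the success threshold.
			b.st, b.openedAt, b.window = open, now, nil
		}
	}
}

// runExample replays the config above: 30 failures, a 61s wait, 8 successes.
func runExample() (allowedWhileOpen, allowedAfterWait, closedAgain bool) {
	b := &breaker{failureCount: 30, failureCapacity: 100,
		successCount: 8, successCapacity: 10, halfOpenAfter: 60 * time.Second}
	now := time.Now()
	for i := 0; i < 30; i++ {
		b.record(false, now)
	}
	allowedWhileOpen = b.allow(now)
	later := now.Add(61 * time.Second)
	allowedAfterWait = b.allow(later)
	for i := 0; i < 8; i++ {
		b.record(true, later)
	}
	return allowedWhileOpen, allowedAfterWait, b.st == closed
}

func main() {
	fmt.Println(runExample()) // false true true
}
```

Passing the current time in as an argument keeps the sketch deterministic, so the 60s wait can be simulated without sleeping.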
Roadmap
On some doc pages we like to share our ideas for related future implementations. Feel free to open a PR if you're up for a challenge:
- Allow defining failsafe policies on a per-method basis (e.g. different behavior for eth_getLogs vs other methods).