Production guidelines

Here are some recommendations for running eRPC in production.

Memory usage

The biggest contributor to eRPC's memory usage is the size of the responses to your requests. For example, responses to common requests such as eth_getBlockByNumber or eth_getTransactionReceipt are relatively small (<1MB), while debug_traceTransaction responses can be as large as 50MB. When running eRPC in Kubernetes, for example, you might see occasional OOMKilled errors, most often caused by a high RPS of large requests/responses.

In the majority of use cases eRPC uses around 256MB of memory (and 1 vCPU). To find the ideal memory limit for your use case, start with a high limit (e.g. 16GB) and route your production traffic (either shadow or real) through eRPC to observe the actual usage under your request patterns.
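Before measuring, a back-of-the-envelope estimate can help pick a starting limit. The sketch below is an assumption, not eRPC's own accounting: it treats the memory held by responses as roughly "concurrent requests × average response size" and derives concurrency from Little's law (RPS × average latency). The function name and the traffic numbers are hypothetical.

```go
package main

import "fmt"

// estimateInFlightMiB returns a rough figure, in MiB, for the memory held by
// responses that are in flight at the same time. It applies Little's law:
// concurrent requests = requests per second * average latency in seconds.
func estimateInFlightMiB(rps, avgLatencySeconds, avgResponseMiB float64) float64 {
	return rps * avgLatencySeconds * avgResponseMiB
}

func main() {
	// Hypothetical trace-heavy traffic: 20 RPS, 2s latency, 50 MiB responses.
	fmt.Printf("trace-heavy: ~%.0f MiB\n", estimateInFlightMiB(20, 2, 50))
	// Hypothetical light traffic: 500 RPS, 0.2s latency, 0.5 MiB responses.
	fmt.Printf("light: ~%.0f MiB\n", estimateInFlightMiB(500, 0.2, 0.5))
}
```

Treat the result only as a first guess; buffers, caching and GC overhead are not modeled, so measuring under real or shadow traffic remains the reliable method.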

For more control, you can configure Go's garbage collection with the following environment variables (e.g. when facing OOMKilled errors on Kubernetes):

# GOGC controls when GC kicks in. With 30, a GC cycle starts once the heap has grown 30% since the previous collection:
export GOGC=30

# GOMEMLIMIT sets a soft memory limit: Go collects more aggressively as total memory approaches 2GiB.
# IMPORTANT: if this value is too low, it might cause high GC frequency,
# which in turn might hurt performance without giving much memory benefit.
export GOMEMLIMIT=2GiB
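For readers who want to see what these variables map to inside a Go process, the standard library exposes the same knobs through runtime/debug. The sketch below is only an illustration of those two settings; eRPC itself is configured through the environment variables above, and the helper names applyGCSettings and currentMemoryLimit are hypothetical.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyGCSettings sets the in-process equivalents of GOGC and GOMEMLIMIT and
// returns the values that were in effect before the change.
func applyGCSettings(gcPercent int, memLimitBytes int64) (prevPercent int, prevLimit int64) {
	prevPercent = debug.SetGCPercent(gcPercent)
	prevLimit = debug.SetMemoryLimit(memLimitBytes)
	return prevPercent, prevLimit
}

// currentMemoryLimit reads the soft memory limit without changing it
// (a negative argument to SetMemoryLimit only queries the value).
func currentMemoryLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	const twoGiB = 2 << 30
	prevPercent, prevLimit := applyGCSettings(30, twoGiB)
	fmt.Println("previous GC percent:", prevPercent)
	fmt.Println("previous memory limit (bytes):", prevLimit)
	fmt.Println("current memory limit (bytes):", currentMemoryLimit())
}
```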

Failsafe policies

Make sure to configure a retry policy at both the network level and the upstream level.

Network-level retry configuration is useful for trying other upstreams when one has an issue. Even when you have only 1 upstream, network-level retry is still useful. The recommendation is to set maxCount equal to the number of upstreams.
Upstream-level retry configuration covers intermittent issues with a specific upstream. It is recommended to set maxCount to at least 2 and at most 5.
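The interplay of the two retry levels can be sketched as two nested loops. This is a simplified illustration under stated assumptions, not eRPC's implementation: the upstream type, callWithRetries and failingUntil are hypothetical, upstreams are tried in a fixed order, and no backoff or delay is modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// upstream is a hypothetical stand-in for one RPC provider.
type upstream struct {
	name string
	call func() (string, error)
}

// callWithRetries makes up to networkMaxCount network-level attempts, each on
// the next upstream in turn, and repeats a failing upstream up to
// upstreamMaxCount times before moving on.
func callWithRetries(upstreams []upstream, networkMaxCount, upstreamMaxCount int) (string, error) {
	if len(upstreams) == 0 {
		return "", errors.New("no upstreams configured")
	}
	lastErr := errors.New("no attempts made")
	for n := 0; n < networkMaxCount; n++ {
		u := upstreams[n%len(upstreams)] // network level: move to the next upstream
		for attempt := 1; attempt <= upstreamMaxCount; attempt++ {
			res, err := u.call() // upstream level: retry the same upstream
			if err == nil {
				return res, nil
			}
			lastErr = fmt.Errorf("%s attempt %d: %w", u.name, attempt, err)
		}
	}
	return "", lastErr
}

// failingUntil builds an upstream that fails its first `failures` calls and
// succeeds afterwards.
func failingUntil(name string, failures int) upstream {
	calls := 0
	return upstream{name: name, call: func() (string, error) {
		calls++
		if calls <= failures {
			return "", errors.New("temporary error")
		}
		return "ok from " + name, nil
	}}
}

func main() {
	// A is down, B has one intermittent failure.
	ups := []upstream{failingUntil("A", 100), failingUntil("B", 1)}
	// Network maxCount equals the number of upstreams; upstream maxCount is 3.
	res, err := callWithRetries(ups, len(ups), 3)
	fmt.Println(res, err)
}
```

In this run, upstream A exhausts its 3 attempts, the network-level retry moves on to B, and B's upstream-level retry absorbs its single intermittent failure.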

The timeout policy depends on the expected response time for your use case. For example, when using "trace" methods on EVM chains, providers might take up to 10 seconds to respond, so a low timeout might cause these requests to always fail. If you are not using heavy methods such as trace or large getLogs calls, you can use 3s as a default timeout.
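The effect of a timeout that is shorter than the provider's latency can be demonstrated with a context deadline. This is a minimal sketch with a simulated provider, not eRPC code: callWithTimeout, simulatedUpstream and demo are hypothetical names, and the durations are scaled down to milliseconds so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callWithTimeout runs fn under a deadline and reports whether a failure was
// caused by that deadline.
func callWithTimeout(timeout time.Duration, fn func(context.Context) (string, error)) (string, bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	res, err := fn(ctx)
	return res, errors.Is(err, context.DeadlineExceeded), err
}

// simulatedUpstream answers after `latency`, unless the context ends first.
func simulatedUpstream(latency time.Duration) func(context.Context) (string, error) {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(latency):
			return "result", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// demo takes milliseconds; scale the values up to seconds for real trace calls.
func demo(timeoutMs, latencyMs int) (string, bool) {
	res, timedOut, _ := callWithTimeout(
		time.Duration(timeoutMs)*time.Millisecond,
		simulatedUpstream(time.Duration(latencyMs)*time.Millisecond),
	)
	return res, timedOut
}

func main() {
	// Timeout shorter than the provider's latency: the call can never succeed.
	_, timedOut := demo(30, 100)
	fmt.Println("tight timeout exceeded:", timedOut)
	// Timeout with headroom: the call completes.
	res, timedOut := demo(300, 100)
	fmt.Println("generous timeout:", res, timedOut)
}
```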

The hedge policy is highly recommended if you prefer the fastest possible response. For example, setting "delay" to 500ms means that if upstream A has not responded within 500 milliseconds, a parallel request is sent to upstream B, and eRPC responds as soon as either of them returns a result. Note: since more requests are sent, this might incur higher costs to achieve the "fast response" goal.
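The hedging behavior described above can be sketched with goroutines and a timer. This is an illustration under stated assumptions rather than eRPC's implementation: hedgedCall, simulatedUpstream and demo are hypothetical, only two upstreams are modeled, and an error from the primary before the delay elapses is returned as-is (a retry policy would normally handle that case).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type callFn func(ctx context.Context) (string, error)

// hedgedCall sends the request to primary; if no response arrives within
// delay it also sends it to secondary and returns the first success.
func hedgedCall(ctx context.Context, delay time.Duration, primary, secondary callFn) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever request is still running on return

	type outcome struct {
		val string
		err error
	}
	results := make(chan outcome, 2) // buffered so a late goroutine never blocks
	launch := func(fn callFn) {
		go func() {
			val, err := fn(ctx)
			results <- outcome{val, err}
		}()
	}

	launch(primary)
	timer := time.NewTimer(delay)
	defer timer.Stop()

	pending := 1
	var lastErr error
	for pending > 0 {
		select {
		case r := <-results:
			pending--
			if r.err == nil {
				return r.val, nil
			}
			lastErr = r.err
		case <-timer.C: // fires at most once: time to hedge
			launch(secondary)
			pending++
		}
	}
	return "", lastErr
}

// simulatedUpstream answers after `latency`, unless the context ends first.
func simulatedUpstream(name string, latency time.Duration) callFn {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(latency):
			return "response from " + name, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// demo takes latencies and the hedge delay in milliseconds.
func demo(primaryMs, secondaryMs, delayMs int) (string, error) {
	ms := func(n int) time.Duration { return time.Duration(n) * time.Millisecond }
	return hedgedCall(context.Background(), ms(delayMs),
		simulatedUpstream("A", ms(primaryMs)), simulatedUpstream("B", ms(secondaryMs)))
}

func main() {
	fmt.Println(demo(200, 20, 50)) // A is slow, so the hedged request to B wins
	fmt.Println(demo(10, 20, 50))  // A answers before the delay, B is never called
}
```

The deferred cancel is what keeps the extra cost bounded in this sketch: once one upstream answers, the request that is still in flight is cancelled.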

Caching database

Storing cached RPC responses requires a lot of storage for read-heavy use cases such as indexing 100M blocks on Arbitrum. eRPC is designed to be robust against cache database issues, so even if the database is completely down, RPC availability is not affected.

As described in the Database section, choose the database type that fits your requirements. You can start with Redis, which is the easiest to set up, and switch to PostgreSQL if the amount of cached data grows larger than the available memory.

Using the eRPC cloud solution will be the most cost-efficient option in terms of caching storage costs, as we'll be able to spread the costs across many projects.

Horizontal scaling

When running multiple eRPC instances (e.g., in a Kubernetes deployment with multiple replicas), it's recommended to enable shared state with Redis to ensure proper synchronization between instances.

The shared state feature allows your eRPC instances to share critical blockchain information such as latest and finalized block numbers, which reduces redundant upstream requests and improves integrity checks.

Even if Redis becomes temporarily unavailable, eRPC will continue serving requests by falling back to local state tracking. This might cause a slight increase in upstream requests as each instance will need to poll for latest/finalized blocks independently, but the impact is minimal and service availability is maintained.

The shared state feature requires minimal storage (less than 1MB per upstream) while significantly improving coordination between instances. For high-traffic deployments with multiple replicas, this pattern is strongly recommended.

Explicitly configure Chain ID

Even though eRPC can automatically detect the chain ID, it's recommended to explicitly configure the chain ID in the project configuration. This ensures faster startup time and more resilient rollouts.

There are two main places to configure the chain ID:

networks.*.evm.chainId under Networks section
upstreams.*.evm.chainId under Upstreams section

Healthcheck

For a smooth zero-downtime rollout, configure Healthcheck in your orchestration platform (e.g. Kubernetes).

Example: Cilium/Envoy and zero-downtime deployments

When using Cilium with Envoy (either Ingress or Gateway-API), we observed that keeping waitBeforeShutdown and waitAfterShutdown both at 30s (together with a readiness probe that fails in ≤ 10s) eliminates connection reset / refused errors during rolling updates:

server:
  waitBeforeShutdown: 30s # pod is in draining mode
  waitAfterShutdown:  30s # process stays alive until Envoy finishes

Shorter values let Envoy reuse a connection after the listener is gone or try to reach a pod that has already exited. Use these numbers as a safe starting point and adjust to match your own probe intervals.
