Monitoring & metrics
eRPC exposes a Prometheus metrics endpoint you can scrape with any compatible backend, such as Grafana, Datadog, or VictoriaMetrics. Metrics cover every layer: inbound network requests, upstream forwarding, cache hits/misses, rate limiting, block-head lag, and upstream health scores.
You can configure:
- Where to listen: IPv4/IPv6 host and port for the /metrics endpoint
- Error label detail: `errorLabelMode: verbose` (full error message) or `compact` (error type only)
- Histogram buckets: custom latency bucket boundaries
- Cardinality reduction: drop high-cardinality labels globally (histogramDropLabels) and selectively restore them per-metric (histogramLabelOverrides)
Minimum useful config
server:
  # ...
projects:
  # ...
metrics:
  enabled: true
  listenV4: true
  hostV4: "0.0.0.0"
  port: 4001

Prometheus can then scrape http://<host>:4001/metrics.
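Once the endpoint is up, it can help to see how many series each metric family contributes before pointing a backend at it. The following is a minimal Python sketch, not part of eRPC, that counts sample lines per metric name in a Prometheus text-format payload. The function name `count_series_by_family` and the sample payload (including its HELP text and label values) are illustrative assumptions; in practice you would pass in the body fetched from http://<host>:4001/metrics.

```python
from collections import Counter

def count_series_by_family(exposition_text: str) -> Counter:
    """Count sample lines per metric name in Prometheus text exposition format."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name_end = len(line)
        for sep in ("{", " "):
            idx = line.find(sep)
            if idx != -1:
                name_end = min(name_end, idx)
        counts[line[:name_end]] += 1
    return counts

# Hypothetical payload for illustration only; real label sets will differ.
SAMPLE = """\
# HELP erpc_network_request_received_total Total inbound requests.
# TYPE erpc_network_request_received_total counter
erpc_network_request_received_total{network="evm:1"} 42
erpc_network_request_received_total{network="evm:10"} 7
erpc_upstream_block_head_lag{upstream="a"} 0
"""

series = count_series_by_family(SAMPLE)
print(series.most_common())
```

Families with unexpectedly high counts are the first candidates for the cardinality settings described in this page.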
Cardinality reduction
Error label mode
metrics:
  errorLabelMode: "compact" # default is "verbose"; "compact" is recommended for production

compact uses only the error type as the label value instead of the full error message. It is recommended in production because it prevents a misconfigured upstream from generating thousands of unique label values.
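To make the difference concrete, here is a small Python sketch of why message-valued labels grow without bound while type-valued labels do not. It does not use eRPC code; the error type name `ErrUpstreamTimeout` and the message format are hypothetical.

```python
# Hypothetical errors from one misbehaving upstream: same type, varying messages.
errors = [
    ("ErrUpstreamTimeout", f"request {i} timed out after {100 + i}ms")
    for i in range(1000)
]

# "verbose"-style labeling: the full message becomes the label value.
verbose_values = {message for _error_type, message in errors}
# "compact"-style labeling: only the error type becomes the label value.
compact_values = {error_type for error_type, _message in errors}

# Each distinct label value is a separate time series for that metric.
print(len(verbose_values), len(compact_values))
```

In this sketch the verbose set has one entry per message, while the compact set collapses to a single entry.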
Drop high-cardinality labels globally
metrics:
  # Remove user/agent/category from every histogram series.
  # Useful when a managed scraper caps /metrics body size.
  histogramDropLabels:
    - user
    - agent
    - category

Restore a label for one specific histogram
metrics:
  histogramDropLabels:
    - user
    - category
  # Re-add 'category' to upstream_request_duration_seconds only.
  histogramLabelOverrides:
    upstream_request_duration_seconds:
      - category

Keys are metric names without the erpc_ prefix. The override adds the label back for that metric family only.
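The interaction between the global drop list and the per-metric override can be sketched as follows. This is an illustration of the semantics described above, not eRPC's implementation; the function `effective_labels` and the base label set are assumptions made for the example.

```python
def effective_labels(metric, base_labels, drop_labels, overrides):
    """Return the labels kept for one histogram after drop + override.

    `metric` is the name without the erpc_ prefix, matching the override keys.
    """
    restored = set(overrides.get(metric, []))
    return [
        label for label in base_labels
        if label not in drop_labels or label in restored
    ]

# Hypothetical base label set; the real labels per metric may differ.
base = ["project", "network", "upstream", "category", "user"]
drop = ["user", "category"]
overrides = {"upstream_request_duration_seconds": ["category"]}

kept_upstream = effective_labels("upstream_request_duration_seconds", base, drop, overrides)
kept_network = effective_labels("network_request_duration_seconds", base, drop, overrides)
print(kept_upstream)
print(kept_network)
```

Only the histogram named in the override keeps `category`; every other histogram loses both dropped labels.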
Custom histogram buckets
metrics:
  histogramBuckets: "0.01,0.1,0.5,1,5,10,60,300"

Fewer buckets mean fewer time series stored in your monitoring backend, so trim the boundaries to the latency range you care about. The default buckets cover 10ms–300s.
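As a rough way to reason about the cost of a bucket list, the sketch below parses the comma-separated string and estimates series per label combination. It relies on standard Prometheus histogram behavior (one `_bucket` series per boundary, an implicit `+Inf` bucket, plus `_sum` and `_count`). The helper names, the increasing-order check, and the figure of 200 label combinations are assumptions for illustration.

```python
def parse_buckets(spec: str) -> list[float]:
    """Parse a comma-separated boundary string into floats (seconds)."""
    bounds = [float(part) for part in spec.split(",") if part.strip()]
    # Assumption: boundaries should be unique and strictly increasing.
    if bounds != sorted(set(bounds)):
        raise ValueError("bucket boundaries must be strictly increasing")
    return bounds

def series_per_label_set(bounds: list[float]) -> int:
    # One _bucket series per boundary, plus the implicit +Inf bucket,
    # plus the _sum and _count series.
    return len(bounds) + 3

bounds = parse_buckets("0.01,0.1,0.5,1,5,10,60,300")
per_set = series_per_label_set(bounds)
# Hypothetical: 200 distinct label combinations on one histogram.
total = 200 * per_set
print(len(bounds), per_set, total)
```

Removing a boundary saves one series for every label combination on every histogram, which is why trimming buckets and dropping labels compound.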
Grafana dashboard
The repo includes ready-made templates. See erpc/monitoring and docker-compose.yml for a local stack.

MetricsConfig — every field
| Field | Type | Default | Notes |
|---|---|---|---|
enabled | bool | false | Master switch. When false, no /metrics endpoint is started. |
listenV4 | bool | true | Bind an IPv4 listener. |
hostV4 | string | "0.0.0.0" | IPv4 bind address. Use "127.0.0.1" to restrict to loopback. |
listenV6 | bool | false | Bind an IPv6 listener. |
hostV6 | string | "[::]" | IPv6 bind address. |
port | int | 4001 | Port for both IPv4 and IPv6 listeners. |
errorLabelMode | "verbose" or "compact" | "verbose" | Controls error label detail. verbose = full error message (backward compatible); compact = error type only (strongly recommended for production to prevent label explosion). |
histogramBuckets | string | built-in defaults | Comma-separated float list of bucket boundaries in seconds. Applies to all histograms. Example: "0.01,0.1,0.5,1,5,10,60,300". |
histogramDropLabels | string[] | none | Label names to remove from every histogram metric globally. Common candidates: user, agent, category. |
histogramLabelOverrides | map[string]string[] | none | Per-metric label restoration. Keys are metric names without the erpc_ prefix (e.g. upstream_request_duration_seconds). Values are the labels to re-add for that metric only, overriding histogramDropLabels for it. |
Resilience policy metrics
Emitted by the failsafe executor. Use these to size retry budgets, tune hedge delays, watch breaker churn, and detect tail-latency laggards inside consensus.
| Metric | Type | Description |
|---|---|---|
erpc_upstream_selection_total{upstream,reason} | Counter | Why each upstream attempt was picked. reason ∈ primary / retry / hedge / consensus_slot / sweep. |
erpc_upstream_attempt_outcome_total{upstream,outcome,is_hedge,is_retry} | Counter | Per-attempt classification. outcome ∈ success / empty / transport_error / server_error / client_error / rate_limited / missing_data / exec_revert / block_unavailable / breaker_open / cancelled / timeout / skipped. |
erpc_network_retry_attempt_total{reason} | Counter | Retry pressure by trigger: retryable_error / block_unavailable / missing_data / empty_result / pending_tx / execution_exception_retryable. |
erpc_network_hedged_request_total{upstream} | Counter | Hedge fires per upstream. |
erpc_network_hedge_winner_total{upstream} | Counter | Hedge race winners — consistently winning = promote to primary; consistently losing = drop. |
erpc_network_hedge_discards_total{upstream} | Counter | Hedge attempts cancelled because a sibling won — wasted work signal. |
erpc_network_hedge_delay_seconds | Histogram | Computed hedge fire delay (from quantile-driven config). |
erpc_network_timeout_fired_total{scope} | Counter | Timeouts firing per scope (network / upstream). |
erpc_network_timeout_duration_seconds | Histogram | Quantile-derived timeout values actually used. |
erpc_upstream_breaker_state_change_total{upstream,transition} | Counter | Circuit-breaker state churn (closed_to_open, open_to_half_open, half_open_to_closed, ...). |
erpc_consensus_short_circuit_total{reason} | Counter | Consensus rounds resolved before all participants returned. |
erpc_consensus_wait_capped_total{trigger} | Counter | Consensus maxWaitOnResult / maxWaitOnEmpty firings — high rates flag a slow upstream that should be tightened or dropped. |
Per-request execution trace
Every response carries the full attempt log as X-ERPC-* headers (X-ERPC-Upstreams-Tried, X-ERPC-Upstreams-Outcomes, X-ERPC-Upstreams-Reasons, X-ERPC-Upstreams-Durations-Ms, X-ERPC-Upstreams-Flags) — clients can debug retry/hedge/consensus decisions without server-side traces. Toggle verbosity with server.executionHeaders: all|summary|off.
Complete metrics table
| Metric | Type | Description |
|---|---|---|
erpc_upstream_request_total | Counter | Total requests sent to upstreams. |
erpc_upstream_request_duration_seconds | Histogram | Duration of upstream requests. |
erpc_upstream_request_errors_total | Counter | Total upstream request errors. |
erpc_upstream_request_self_rate_limited_total | Counter | Requests self-rate-limited before sending to upstream. |
erpc_upstream_request_remote_rate_limited_total | Counter | Requests rate-limited by the upstream itself. |
erpc_upstream_request_skipped_total | Counter | Requests skipped by an upstream (e.g. not applicable). |
erpc_upstream_request_missing_data_error_total | Counter | Requests where upstream is missing data or not yet synced. |
erpc_upstream_request_empty_response_total | Counter | Empty responses from upstreams. |
erpc_upstream_block_head_lag | Gauge | Blocks behind the most up-to-date upstream (head). |
erpc_upstream_finalization_lag | Gauge | Finalized blocks behind the most up-to-date upstream. |
erpc_upstream_score_overall | Gauge | Composite health/performance score for an upstream. |
erpc_upstream_latest_block_number | Gauge | Latest block number seen from an upstream. |
erpc_upstream_finalized_block_number | Gauge | Finalized block number seen from an upstream. |
erpc_network_latest_block_timestamp_distance_seconds | Gauge | Seconds between the network's latest block timestamp and now. Labeled by origin (evm_state_poller or network_response). |
erpc_upstream_cordoned | Gauge | Whether the upstream is excluded from routing by selection policy. 0 = active, 1 = cordoned. |
erpc_upstream_stale_latest_block_total | Counter | Times an upstream returned a stale latest block vs peers. |
erpc_upstream_stale_finalized_block_total | Counter | Times an upstream returned a stale finalized block vs peers. |
erpc_upstream_latest_block_polled_total | Counter | Times the latest block was pro-actively polled from an upstream. |
erpc_upstream_finalized_block_polled_total | Counter | Times the finalized block was pro-actively polled from an upstream. |
erpc_network_request_received_total | Counter | Total inbound requests received by the network. |
erpc_network_multiplexed_request_total | Counter | Multiplexed (de-duplicated) requests received by the network. |
erpc_network_failed_request_total | Counter | Total failed requests at the network level. |
erpc_network_request_self_rate_limited_total | Counter | Inbound requests self-rate-limited at the network level. |
erpc_network_successful_request_total | Counter | Total successful requests at the network level. |
erpc_network_cache_hits_total | Counter | Cache hits for network requests. |
erpc_network_cache_misses_total | Counter | Cache misses for network requests. |
erpc_network_request_duration_seconds | Histogram | End-to-end request duration at the network level. |
erpc_project_request_self_rate_limited_total | Counter | Requests self-rate-limited at the project level. |
erpc_rate_limits_total | Counter | Unified rate-limiting events (remote limits and budget decisions). Replaces deprecated erpc_budget_decision_total. |
erpc_rate_limiter_budget_max_count | Gauge | Maximum requests/sec for a rate limiter budget. |
erpc_rate_limiter_failopen_total | Counter | Rate-limiter fail-open events (requests allowed due to errors/timeouts). |
erpc_rate_limiter_remote_inflight | Gauge | In-flight remote rate-limit checks (e.g. Redis) per budget. Rising without bound signals Redis overload. |
erpc_rate_limiter_remote_admission_shedded_total | Counter | Fail-open events from the admission semaphore being full (never attempted the remote call). |
erpc_rate_limiter_remote_duration_seconds | Histogram | Duration of remote rate-limit checks; fine-grained sub-second buckets. |
erpc_auth_request_self_rate_limited_total | Counter | Requests rate-limited by an auth strategy. |
erpc_auth_failed_total | Counter | Failed authentication attempts (labeled by strategy, reason, agent_name). |
erpc_cache_set_success_total | Counter | Successful cache set operations. |
erpc_cache_set_error_total | Counter | Failed cache set operations. |
erpc_cache_set_skipped_total | Counter | Skipped cache set operations. |
erpc_cache_get_success_hit_total | Counter | Cache get hits. |
erpc_cache_get_success_miss_total | Counter | Cache get misses. |
erpc_cache_get_error_total | Counter | Cache get errors. |
erpc_cache_get_skipped_total | Counter | Cache get skips (no matching policy). |
erpc_shadow_response_identical_total | Counter | Shadow upstream responses identical to the primary response. |
erpc_shadow_response_mismatch_total | Counter | Shadow upstream responses that differ from the primary response. |
erpc_shadow_response_error_total | Counter | Shadow upstream requests that resulted in an error. |
erpc_network_hedged_request_total | Counter | Hedged requests towards a network (labeled by upstream, attempt). |
erpc_network_hedge_discards_total | Counter | Hedged responses discarded (attempt > 1 = wasted requests; labeled by hedge). |
erpc_network_hedge_delay_seconds | Histogram | Hedge delay actually applied per request; reveals effective hedge aggressiveness. |
erpc_ristretto_cache_current_cost | Gauge | Current total memory cost of the Ristretto in-memory cache per connector. Primary saturation signal. |
erpc_ristretto_cache_sets_failed_total | Counter | Ristretto set operations dropped or rejected (capacity exceeded). |
erpc_cors_requests_total | Counter | Total CORS requests received. |
erpc_cors_preflight_requests_total | Counter | CORS preflight requests received. |
erpc_cors_disallowed_origin_total | Counter | CORS requests from disallowed origins. |
getLogs and trace_filter split metrics
These metrics track eRPC's automatic request-splitting for eth_getLogs and trace_filter / arbtrace_filter. Upstream-scoped metrics fire per-upstream attempt; network-scoped metrics fire once per logical split at the network layer.
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
erpc_upstream_evm_get_logs_stale_upper_bound_total | Counter | project, vendor, network, upstream, category, confidence | eth_getLogs skipped because upstream's latest block < requested toBlock. |
erpc_upstream_evm_get_logs_stale_lower_bound_total | Counter | project, vendor, network, upstream, category, confidence | eth_getLogs skipped because fromBlock is below the upstream's available range. |
erpc_upstream_evm_get_logs_range_exceeded_auto_splitting_threshold_total | Counter | project, vendor, network, upstream | Requests that exceeded getLogsAutoSplittingRangeThreshold and were auto-split. |
erpc_upstream_evm_get_logs_forced_splits_total | Counter | project, vendor, network, upstream, dimension | Upstream-level splits forced by an upstream error (dimension: block_range, addresses, topics). |
erpc_upstream_evm_get_logs_split_success_total | Counter | project, vendor, network, upstream | Successful eth_getLogs sub-requests after an upstream-level split. |
erpc_upstream_evm_get_logs_split_failure_total | Counter | project, vendor, network, upstream | Failed eth_getLogs sub-requests after an upstream-level split. |
erpc_network_evm_get_logs_forced_splits_total | Counter | project, network, dimension, user, agent_name | Network-level eth_getLogs splits by dimension; complements upstream-scoped variant. |
erpc_network_evm_get_logs_split_success_total | Counter | project, network, user, agent_name | Successful eth_getLogs sub-requests at the network layer. |
erpc_network_evm_get_logs_split_failure_total | Counter | project, network, user, agent_name | Failed eth_getLogs sub-requests at the network layer. |
erpc_network_evm_trace_filter_range_requested | Histogram | project, network, method, user, finality | Requested block-range sizes for trace_filter / arbtrace_filter. |
erpc_network_evm_trace_filter_forced_splits_total | Counter | project, network, method, dimension, user, agent_name | Splits for trace_filter / arbtrace_filter (labeled by method and dimension). |
erpc_network_evm_trace_filter_split_success_total | Counter | project, network, method, user, agent_name | Successful sub-requests after a trace_filter split. |
erpc_network_evm_trace_filter_split_failure_total | Counter | project, network, method, user, agent_name | Failed sub-requests after a trace_filter split. |
Consensus monitoring
When the consensus selection policy is active, eRPC emits a dedicated family of metrics. erpc_consensus_misbehavior_detected_total is the primary alert target — a non-zero rate means an upstream is returning data that diverges from the majority.
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
erpc_consensus_total | Counter | project, network, category, outcome, finality | Consensus rounds attempted; outcome distinguishes success, no_consensus, timeout, etc. |
erpc_consensus_misbehavior_detected_total | Counter | project, network, upstream, category, finality, response_type, larger_than_consensus | Upstream returned data different from consensus; non-zero rate is the primary alert signal. |
erpc_consensus_upstream_punished_total | Counter | project, network, upstream | Times an upstream was scored down for misbehavior. |
erpc_consensus_short_circuit_total | Counter | project, network, category, reason, finality | Rounds that short-circuited (early exit before all upstreams responded). |
erpc_consensus_errors_total | Counter | project, network, category, error, finality | Consensus-level errors by type (distinct from upstream errors). |
erpc_consensus_upstream_errors_total | Counter | project, network, upstream, category, finality, response_type, error_code | Per-upstream errors observed during a consensus round. |
erpc_consensus_panics_total | Counter | project, network, category, finality | Panic recoveries inside the consensus engine. |
erpc_consensus_cancellations_total | Counter | project, network, category, phase, finality | Context cancellations by phase (collect, decide). |
erpc_consensus_responses_collected | Histogram | project, network, category, vendors, short_circuited, finality | Responses gathered before a decision; reveals how often quorum is reached early. |
erpc_consensus_agreement_count | Histogram | project, network, category, finality | Upstreams agreeing on the winning result; low values indicate frequent split votes. |
erpc_consensus_duration_seconds | Histogram | project, network, category, outcome, finality | End-to-end duration of a consensus round. |
# Alert: misbehaving upstream detected
rate(erpc_consensus_misbehavior_detected_total[5m]) > 0.1
# Consensus success rate by network
sum(rate(erpc_consensus_total{outcome="success"}[5m])) by (network) /
sum(rate(erpc_consensus_total[5m])) by (network)
# Median upstreams agreeing per round (low = fragile quorum)
histogram_quantile(0.5, sum(rate(erpc_consensus_agreement_count_bucket[5m])) by (le, network))

x402 payment metrics
When the x402 payment middleware is enabled, eRPC emits additional counters for payment attempts, successes, and failures. Check health/metrics.go in the repo for the current list — they follow the erpc_x402_* naming convention.
Rate-limit monitoring
erpc_rate_limits_total is the unified counter for all rate-limiting events. It replaces the deprecated erpc_budget_decision_total (which is still emitted for backward compatibility but should not be used for new dashboards).
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
erpc_rate_limits_total | Counter | project, network, vendor, upstream, category, finality, user, agent_name, budget, scope, auth, origin | All rate-limit decisions; scope distinguishes network/upstream/auth, origin distinguishes local/remote. |
erpc_rate_limiter_budget_max_count | Gauge | budget, method, scope | Effective req/s cap for a budget (updated by auto-tuner). |
erpc_rate_limiter_failopen_total | Counter | project, network, user, agent_name, budget, category, reason | Fail-open events; reason = limit_timeout means the remote call was too slow. |
# Unified rate-limit event rate by network, scope, and origin (local vs remote)
sum(rate(erpc_rate_limits_total[5m])) by (network, scope, origin)
# Alert: fail-open events rising (remote rate-limiter degraded)
sum(rate(erpc_rate_limiter_failopen_total[5m])) by (budget, reason) > 0.1

Remote rate-limiter monitoring (Redis-backed)
When a Redis-backed rate limiter is configured, watch erpc_rate_limiter_remote_inflight — a climbing gauge without a matching drop indicates Redis is saturated or unreachable.
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
erpc_rate_limiter_remote_inflight | Gauge | budget | In-flight Redis DoLimit calls per budget. Climbs without bound if Redis is overwhelmed. |
erpc_rate_limiter_remote_admission_shedded_total | Counter | budget | Admission semaphore full — remote call was never attempted; request was fail-opened instead. |
erpc_rate_limiter_remote_duration_seconds | Histogram | budget, result | Round-trip latency of remote rate-limit calls; buckets go from 1ms to 5s. |
# Alert: admission shedding active (semaphore full)
rate(erpc_rate_limiter_remote_admission_shedded_total[1m]) > 0
# p99 Redis round-trip latency
histogram_quantile(0.99,
  sum(rate(erpc_rate_limiter_remote_duration_seconds_bucket[5m])) by (le, budget)
)

Shadow upstream monitoring
Shadow upstreams receive a copy of every request after the primary response is returned. Use these metrics to validate a candidate upstream's correctness before promoting it.
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
erpc_shadow_response_identical_total | Counter | project, vendor, network, upstream, category | Shadow matched the primary response exactly. |
erpc_shadow_response_mismatch_total | Counter | project, vendor, network, upstream, category, finality, emptyish, larger | Shadow differed; larger indicates the shadow returned more data than primary. |
erpc_shadow_response_error_total | Counter | project, vendor, network, upstream, category, error | Shadow request errored; does not affect client response. |
# Mismatch rate for a shadow upstream
rate(erpc_shadow_response_mismatch_total{upstream="my-candidate"}[5m])
# Shadow match rate (closer to 1 = safer to promote)
sum(rate(erpc_shadow_response_identical_total[5m])) by (upstream) /
(
  sum(rate(erpc_shadow_response_identical_total[5m])) by (upstream) +
  sum(rate(erpc_shadow_response_mismatch_total[5m])) by (upstream)
)

Hedge policy monitoring
Hedge requests fire a second upstream call after a configurable delay if the first has not yet responded. See Hedge policy for configuration.
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
erpc_network_hedged_request_total | Counter | project, network, upstream, category, attempt, finality, user, agent_name | Total hedge attempts; attempt = 2 means second leg fired. |
erpc_network_hedge_discards_total | Counter | project, network, upstream, category, attempt, hedge, finality, user, agent_name | Hedge responses discarded because they lost the race; each discard = one wasted upstream call. |
erpc_network_hedge_delay_seconds | Histogram | project, network, category, finality | Actual hedge delay applied; compare against configured delay to detect quantile-based adaptation. |
# Hedge fire rate by network (how often second leg launches)
sum(rate(erpc_network_hedged_request_total{attempt="2"}[5m])) by (network)
# Wasted-request ratio from hedging
sum(rate(erpc_network_hedge_discards_total[5m])) by (network) /
sum(rate(erpc_network_hedged_request_total[5m])) by (network)

Ristretto in-memory cache monitoring
The Ristretto cache is eRPC's built-in in-process cache layer. erpc_ristretto_cache_current_cost is the primary saturation signal — when it approaches the configured maxCost, items are evicted and the effective hit rate degrades.
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
erpc_ristretto_cache_current_cost | Gauge | connector | Current total cost (bytes) held in the cache per connector. Compare against maxCost config. |
erpc_ristretto_cache_sets_failed_total | Counter | connector | Set operations dropped by Ristretto (over capacity or rejected by policy). |
# Cache fill level (requires knowing maxCost from config)
erpc_ristretto_cache_current_cost{connector="my-memory-connector"}
# Alert: high Ristretto rejection rate (cache under pressure)
rate(erpc_ristretto_cache_sets_failed_total[5m]) > 10

Auth failure monitoring
| Metric | Type | Labels | What it tells you |
|---|---|---|---|
erpc_auth_failed_total | Counter | project, network, strategy, reason, agent_name | Failed auth attempts; strategy (e.g. jwt, secret) and reason identify the failure mode. |
# Alert: auth failure spike by strategy
sum(rate(erpc_auth_failed_total[5m])) by (project, strategy, reason) > 1

PromQL examples
# Request rate per second by network over last 5 minutes
sum(rate(erpc_network_request_received_total{}[5m])) by (network)
# Total daily requests by project and network
sum(increase(erpc_network_request_received_total{}[24h])) by (project, network)
# Top 5 project+network combos by request volume
topk(5, sum(rate(erpc_network_request_received_total{}[5m])) by (project, network))
# Error rate percentage by network and upstream
100 * sum(rate(erpc_upstream_request_errors_total{}[5m])) by (network, upstream) /
sum(rate(erpc_upstream_request_total{}[5m])) by (network, upstream)
# Top error types in the last hour
topk(10, sum(increase(erpc_upstream_request_errors_total{}[1h])) by (error))
# Missing data errors by network and upstream
sum(rate(erpc_upstream_request_missing_data_error_total{}[5m])) by (network, upstream)
# 95th percentile request duration by network
histogram_quantile(0.95, sum(rate(erpc_network_request_duration_seconds_bucket{}[5m])) by (le, network))
# Average upstream latency for eth_call
sum(rate(erpc_upstream_request_duration_seconds_sum{category="eth_call"}[5m])) by (network, upstream) /
sum(rate(erpc_upstream_request_duration_seconds_count{category="eth_call"}[5m])) by (network, upstream)
# Identify slow upstreams (avg duration > 500ms)
sum(rate(erpc_upstream_request_duration_seconds_sum{}[5m])) by (network, upstream) /
sum(rate(erpc_upstream_request_duration_seconds_count{}[5m])) by (network, upstream) > 0.5
# Cache hit ratio by network
sum(rate(erpc_network_cache_hits_total{}[5m])) by (network) /
(
sum(rate(erpc_network_cache_hits_total{}[5m])) by (network) +
sum(rate(erpc_network_cache_misses_total{}[5m])) by (network)
)
# Cache miss rate for eth_getBlockByNumber
rate(erpc_network_cache_misses_total{category="eth_getBlockByNumber"}[5m])
# Self rate-limited requests by project and network
sum(rate(erpc_network_request_self_rate_limited_total{}[5m])) by (project, network)
# Auth rate limiting by strategy
sum(rate(erpc_auth_request_self_rate_limited_total{strategy="jwt"}[5m])) by (project)
# Remote rate limiting by upstream
sum(rate(erpc_upstream_request_remote_rate_limited_total{}[5m])) by (upstream)
# Block head lag by network and upstream
max(erpc_upstream_block_head_lag) by (network, upstream)
# Alert: finalization lag > 5 blocks
max(erpc_upstream_finalization_lag) by (network) > 5
# Block height spread across upstreams on a network
max(erpc_upstream_latest_block_number) by (network) -
min(erpc_upstream_latest_block_number) by (network)
# Overall upstream health scores
avg(erpc_upstream_score_overall) by (network, upstream)
# CORS disallowed origins
sum(rate(erpc_cors_disallowed_origin_total{}[5m])) by (project, origin)
# How far behind is the latest block (all origins)
erpc_network_latest_block_timestamp_distance_seconds
# From internal EVM state poller only
erpc_network_latest_block_timestamp_distance_seconds{origin="evm_state_poller"}
# From what clients receive (including cached responses)
erpc_network_latest_block_timestamp_distance_seconds{origin="network_response"}
# Alert: block timestamp > 30s behind wall clock
erpc_network_latest_block_timestamp_distance_seconds > 30
# Retry pressure by reason — spikes on `block_unavailable` usually mean a slow upstream
sum(rate(erpc_network_retry_attempt_total[5m])) by (network, reason)
# Hedge effectiveness: which upstream usually wins the race (promote candidates)
topk(5, sum(rate(erpc_network_hedge_winner_total[10m])) by (network, upstream))
# Wasted hedge work — high rate means hedge `delay` is too short
sum(rate(erpc_network_hedge_discards_total[5m])) by (network, upstream)
# Per-attempt outcome distribution — separates real vs speculative traffic
sum(rate(erpc_upstream_attempt_outcome_total[5m])) by (upstream, outcome, is_hedge)
# Circuit-breaker churn — frequent open/half_open transitions = bad upstream
sum(increase(erpc_upstream_breaker_state_change_total{transition="closed_to_open"}[1h])) by (upstream)
# Consensus wait-cap firings — a hot signal for laggard upstreams in a consensus group
sum(rate(erpc_consensus_wait_capped_total[5m])) by (network, trigger)

Grafana dashboard
The erpc/monitoring directory contains a ready-made Grafana dashboard JSON and a Prometheus config. The docker-compose.yml at the repo root brings up both with docker compose up grafana prometheus.
Cardinality reduction strategies
High metric cardinality is the most common production issue with eRPC metrics. Strategies from lowest to highest impact:
- `errorLabelMode: compact` prevents one misbehaving upstream from exploding the error label cardinality. Always set this in production.
- `histogramDropLabels: [user, agent]` removes the user and agent labels, which are per-API-key and per-client-agent respectively; each unique value multiplies every histogram's bucket count. Drop them unless you specifically need per-user or per-agent latency histograms.
- `histogramDropLabels: [category]` removes category, the JSON-RPC method category (e.g. eth_call, eth_getLogs). Dropping it collapses all method categories into one histogram series per (network, upstream). Use histogramLabelOverrides to keep it for selected histograms.
- Custom histogramBuckets: fewer buckets = fewer series. Default buckets cover a wide range; trim to the p50–p99 range you actually care about.
Common pitfall: managed Prometheus scrapers (Grafana Cloud, Prometheus-managed) have a default body-size limit on /metrics responses. With default settings and many upstreams + high cardinality labels, the response can exceed several MB. If you see scrape errors mentioning body size, start with histogramDropLabels: [user, agent, category].
Common pitfalls
- `errorLabelMode: verbose` with a broken upstream: one upstream returning varying error messages creates a unique label value per message, causing label cardinality to grow without bound. Switch to compact in production.
- user and agent labels on histograms: each unique API key or User-Agent value multiplies the number of time series for every histogram. Drop them with histogramDropLabels unless you have a specific need.
- Scraper body-size limits: managed scrapers often cap the response at 10–64 MB. Large deployments with many upstreams and high-cardinality labels can hit this. Reduce cardinality before hitting the limit (the scrape silently fails rather than returning partial data).
- IPv6 and `listenV6: true`: ensure the host's Docker/network config supports IPv6 and the port is correctly mapped. The IPv6 listener binds to hostV6 independently; both IPv4 and IPv6 listeners use the same port.
- histogramLabelOverrides key format: use the metric name without the erpc_ prefix and without _bucket/_sum/_count suffixes. Example: upstream_request_duration_seconds, not erpc_upstream_request_duration_seconds_bucket.
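For the key-format pitfall, a small helper can derive the override key from a series name as it appears in a scrape. This is a sketch based on the rule stated above; the function name `override_key` is an assumption and is not part of eRPC.

```python
def override_key(metric_name: str) -> str:
    """Turn a scraped histogram series name into a histogramLabelOverrides key."""
    key = metric_name
    # Strip the erpc_ prefix if present.
    if key.startswith("erpc_"):
        key = key[len("erpc_"):]
    # Strip at most one histogram series suffix.
    for suffix in ("_bucket", "_sum", "_count"):
        if key.endswith(suffix):
            key = key[: -len(suffix)]
            break
    return key

key = override_key("erpc_upstream_request_duration_seconds_bucket")
print(key)
```

Names that are already in key form pass through unchanged, so the helper is safe to apply to either spelling.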