Monitoring & metrics

eRPC exposes a Prometheus metrics endpoint you can scrape with any compatible backend, such as Grafana, Datadog, or VictoriaMetrics. Metrics cover every layer: inbound network requests, upstream forwarding, cache hits/misses, rate limiting, block-head lag, and upstream health scores.

You can configure:

  • Where to listen: IPv4/IPv6 host and port for the /metrics endpoint
  • Error label detail: errorLabelMode, either verbose (full error message) or compact (error type only)
  • Histogram buckets: custom latency bucket boundaries
  • Cardinality reduction: drop high-cardinality labels globally (histogramDropLabels) and selectively restore them per-metric (histogramLabelOverrides)

Minimum useful config

# erpc.yaml
server: # ...
projects: # ...
metrics:
  enabled: true
  listenV4: true
  hostV4: "0.0.0.0"
  port: 4001

Prometheus can then scrape http://<host>:4001/metrics.
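On the Prometheus side, a minimal scrape job for this endpoint might look like the sketch below. The job name, the target host erpc-host, and the 15s interval are assumptions for illustration; only the port and path come from the config above.

```yaml
# prometheus.yml (illustrative; job name and target host are assumptions)
scrape_configs:
  - job_name: "erpc"
    scrape_interval: 15s
    metrics_path: /metrics              # the Prometheus default, shown for clarity
    static_configs:
      - targets: ["erpc-host:4001"]     # hypothetical host; port matches metrics.port
```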

Cardinality reduction

Error label mode

# erpc.yaml
metrics:
  errorLabelMode: "compact" # default is "verbose"; "compact" is recommended for production

compact uses only the error type as the label value instead of the full error message. Recommended in production — it prevents a misconfigured upstream from generating thousands of unique label values.

Drop high-cardinality labels globally

# erpc.yaml
metrics:
  # Remove user/agent/category from every histogram series.
  # Useful when a managed scraper caps /metrics body size.
  histogramDropLabels:
    - user
    - agent
    - category

Restore a label for one specific histogram

metrics:
  histogramDropLabels:
    - user
    - category
  # Re-add 'category' to upstream_request_duration_seconds only.
  histogramLabelOverrides:
    upstream_request_duration_seconds:
      - category

Keys are metric names without the erpc_ prefix. The override adds the label back for that metric family only.

Custom histogram buckets

metrics:
  histogramBuckets: "0.01,0.1,0.5,1,5,10,60,300"

Each bucket boundary adds one time series per label combination, so fewer buckets mean fewer time series stored in your monitoring backend. The default buckets cover 10ms–300s.
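As a rough guide to the saving: in the Prometheus exposition format, a histogram with N boundaries exposes N+1 _bucket series (the extra one is the implicit +Inf bucket) plus _sum and _count, for every label combination. The trimmed list below is a hypothetical example, not a recommended default.

```yaml
# erpc.yaml (bucket values are illustrative)
metrics:
  # 5 boundaries -> 6 _bucket series (including +Inf) + _sum + _count
  # = 8 series per label combination, versus 11 for the 8-boundary list above.
  histogramBuckets: "0.05,0.25,1,5,30"
```

Since histogramBuckets applies to all histograms, pick boundaries that still bracket the slowest latencies you need to alert on.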

Grafana dashboard

The repo includes ready-made templates. See erpc/monitoring and docker-compose.yml for a local stack.

eRPC Grafana Dashboard

MetricsConfig — every field

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| enabled | bool | false | Master switch. When false, no /metrics endpoint is started. |
| listenV4 | bool | true | Bind an IPv4 listener. |
| hostV4 | string | "0.0.0.0" | IPv4 bind address. Use "127.0.0.1" to restrict to loopback. |
| listenV6 | bool | false | Bind an IPv6 listener. |
| hostV6 | string | "[::]" | IPv6 bind address. |
| port | int | 4001 | Port for both IPv4 and IPv6 listeners. |
| errorLabelMode | "verbose" or "compact" | "verbose" | Controls error label detail. verbose = full error message (backward compatible); compact = error type only (strongly recommended for production to prevent label explosion). |
| histogramBuckets | string | built-in defaults | Comma-separated float list of bucket boundaries in seconds. Applies to all histograms. Example: "0.01,0.1,0.5,1,5,10,60,300". |
| histogramDropLabels | string[] | none | Label names to remove from every histogram metric globally. Common candidates: user, agent, category. |
| histogramLabelOverrides | map[string]string[] | none | Per-metric label restoration. Keys are metric names without the erpc_ prefix (e.g. upstream_request_duration_seconds). Values are the labels to re-add for that metric only, overriding histogramDropLabels for it. |
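Put together, the fields above might appear in a config like the following sketch. Values are the documented defaults unless a comment says otherwise; the commented-out lines reuse the examples from this page and are optional.

```yaml
# erpc.yaml: every MetricsConfig field in one place
metrics:
  enabled: true              # default is false; nothing is served until this is set
  listenV4: true
  hostV4: "0.0.0.0"          # "127.0.0.1" restricts the listener to loopback
  listenV6: false
  hostV6: "[::]"
  port: 4001                 # shared by the IPv4 and IPv6 listeners
  errorLabelMode: "verbose"  # "compact" is recommended for production
  # Unset by default; uncomment to override.
  # histogramBuckets: "0.01,0.1,0.5,1,5,10,60,300"
  # histogramDropLabels: [user, agent, category]
  # histogramLabelOverrides:
  #   upstream_request_duration_seconds: [category]
```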

Resilience policy metrics

Emitted by the failsafe executor. Use these to size retry budgets, tune hedge delays, watch breaker churn, and detect tail-latency laggards inside consensus.

| Metric | Type | Description |
| --- | --- | --- |
| erpc_upstream_selection_total{upstream,reason} | Counter | Why each upstream attempt was picked. reason: primary / retry / hedge / consensus_slot / sweep. |
| erpc_upstream_attempt_outcome_total{upstream,outcome,is_hedge,is_retry} | Counter | Per-attempt classification. outcome: success / empty / transport_error / server_error / client_error / rate_limited / missing_data / exec_revert / block_unavailable / breaker_open / cancelled / timeout / skipped. |
| erpc_network_retry_attempt_total{reason} | Counter | Retry pressure by trigger: retryable_error / block_unavailable / missing_data / empty_result / pending_tx / execution_exception_retryable. |
| erpc_network_hedged_request_total{upstream} | Counter | Hedge fires per upstream. |
| erpc_network_hedge_winner_total{upstream} | Counter | Hedge race winners. Consistently winning = promote to primary; consistently losing = drop. |
| erpc_network_hedge_discards_total{upstream} | Counter | Hedge attempts cancelled because a sibling won; a wasted-work signal. |
| erpc_network_hedge_delay_seconds | Histogram | Computed hedge fire delay (from quantile-driven config). |
| erpc_network_timeout_fired_total{scope} | Counter | Timeouts firing per scope (network / upstream). |
| erpc_network_timeout_duration_seconds | Histogram | Quantile-derived timeout values actually used. |
| erpc_upstream_breaker_state_change_total{upstream,transition} | Counter | Circuit-breaker state churn (closed_to_open, open_to_half_open, half_open_to_closed, ...). |
| erpc_consensus_short_circuit_total{reason} | Counter | Consensus rounds resolved before all participants returned. |
| erpc_consensus_wait_capped_total{trigger} | Counter | Consensus maxWaitOnResult / maxWaitOnEmpty firings. High rates flag a slow upstream that should be tightened or dropped. |

Per-request execution trace

Every response carries the full attempt log as X-ERPC-* headers (X-ERPC-Upstreams-Tried, X-ERPC-Upstreams-Outcomes, X-ERPC-Upstreams-Reasons, X-ERPC-Upstreams-Durations-Ms, X-ERPC-Upstreams-Flags) — clients can debug retry/hedge/consensus decisions without server-side traces. Toggle verbosity with server.executionHeaders: all|summary|off.

Complete metrics table

| Metric | Type | Description |
| --- | --- | --- |
| erpc_upstream_request_total | Counter | Total requests sent to upstreams. |
| erpc_upstream_request_duration_seconds | Histogram | Duration of upstream requests. |
| erpc_upstream_request_errors_total | Counter | Total upstream request errors. |
| erpc_upstream_request_self_rate_limited_total | Counter | Requests self-rate-limited before sending to upstream. |
| erpc_upstream_request_remote_rate_limited_total | Counter | Requests rate-limited by the upstream itself. |
| erpc_upstream_request_skipped_total | Counter | Requests skipped by an upstream (e.g. not applicable). |
| erpc_upstream_request_missing_data_error_total | Counter | Requests where upstream is missing data or not yet synced. |
| erpc_upstream_request_empty_response_total | Counter | Empty responses from upstreams. |
| erpc_upstream_block_head_lag | Gauge | Blocks behind the most up-to-date upstream (head). |
| erpc_upstream_finalization_lag | Gauge | Finalized blocks behind the most up-to-date upstream. |
| erpc_upstream_score_overall | Gauge | Composite health/performance score for an upstream. |
| erpc_upstream_latest_block_number | Gauge | Latest block number seen from an upstream. |
| erpc_upstream_finalized_block_number | Gauge | Finalized block number seen from an upstream. |
| erpc_network_latest_block_timestamp_distance_seconds | Gauge | Seconds between the network's latest block timestamp and now. Labeled by origin (evm_state_poller or network_response). |
| erpc_upstream_cordoned | Gauge | Whether the upstream is excluded from routing by selection policy. 0 = active, 1 = cordoned. |
| erpc_upstream_stale_latest_block_total | Counter | Times an upstream returned a stale latest block vs peers. |
| erpc_upstream_stale_finalized_block_total | Counter | Times an upstream returned a stale finalized block vs peers. |
| erpc_upstream_latest_block_polled_total | Counter | Times the latest block was pro-actively polled from an upstream. |
| erpc_upstream_finalized_block_polled_total | Counter | Times the finalized block was pro-actively polled from an upstream. |
| erpc_network_request_received_total | Counter | Total inbound requests received by the network. |
| erpc_network_multiplexed_request_total | Counter | Multiplexed (de-duplicated) requests received by the network. |
| erpc_network_failed_request_total | Counter | Total failed requests at the network level. |
| erpc_network_request_self_rate_limited_total | Counter | Inbound requests self-rate-limited at the network level. |
| erpc_network_successful_request_total | Counter | Total successful requests at the network level. |
| erpc_network_cache_hits_total | Counter | Cache hits for network requests. |
| erpc_network_cache_misses_total | Counter | Cache misses for network requests. |
| erpc_network_request_duration_seconds | Histogram | End-to-end request duration at the network level. |
| erpc_project_request_self_rate_limited_total | Counter | Requests self-rate-limited at the project level. |
| erpc_rate_limits_total | Counter | Unified rate-limiting events (remote limits and budget decisions). Replaces deprecated erpc_budget_decision_total. |
| erpc_rate_limiter_budget_max_count | Gauge | Maximum requests/sec for a rate limiter budget. |
| erpc_rate_limiter_failopen_total | Counter | Rate-limiter fail-open events (requests allowed due to errors/timeouts). |
| erpc_rate_limiter_remote_inflight | Gauge | In-flight remote rate-limit checks (e.g. Redis) per budget. Rising without bound signals Redis overload. |
| erpc_rate_limiter_remote_admission_shedded_total | Counter | Fail-open events from the admission semaphore being full (never attempted the remote call). |
| erpc_rate_limiter_remote_duration_seconds | Histogram | Duration of remote rate-limit checks; fine-grained sub-second buckets. |
| erpc_auth_request_self_rate_limited_total | Counter | Requests rate-limited by an auth strategy. |
| erpc_auth_failed_total | Counter | Failed authentication attempts (labeled by strategy, reason, agent_name). |
| erpc_cache_set_success_total | Counter | Successful cache set operations. |
| erpc_cache_set_error_total | Counter | Failed cache set operations. |
| erpc_cache_set_skipped_total | Counter | Skipped cache set operations. |
| erpc_cache_get_success_hit_total | Counter | Cache get hits. |
| erpc_cache_get_success_miss_total | Counter | Cache get misses. |
| erpc_cache_get_error_total | Counter | Cache get errors. |
| erpc_cache_get_skipped_total | Counter | Cache get skips (no matching policy). |
| erpc_shadow_response_identical_total | Counter | Shadow upstream responses identical to the primary response. |
| erpc_shadow_response_mismatch_total | Counter | Shadow upstream responses that differ from the primary response. |
| erpc_shadow_response_error_total | Counter | Shadow upstream requests that resulted in an error. |
| erpc_network_hedged_request_total | Counter | Hedged requests towards a network (labeled by upstream, attempt). |
| erpc_network_hedge_discards_total | Counter | Hedged responses discarded (attempt > 1 = wasted requests; labeled by hedge). |
| erpc_network_hedge_delay_seconds | Histogram | Hedge delay actually applied per request; reveals effective hedge aggressiveness. |
| erpc_ristretto_cache_current_cost | Gauge | Current total memory cost of the Ristretto in-memory cache per connector. Primary saturation signal. |
| erpc_ristretto_cache_sets_failed_total | Counter | Ristretto set operations dropped or rejected (capacity exceeded). |
| erpc_cors_requests_total | Counter | Total CORS requests received. |
| erpc_cors_preflight_requests_total | Counter | CORS preflight requests received. |
| erpc_cors_disallowed_origin_total | Counter | CORS requests from disallowed origins. |

getLogs and trace_filter split metrics

These metrics track eRPC's automatic request-splitting for eth_getLogs and trace_filter / arbtrace_filter. Upstream-scoped metrics fire per-upstream attempt; network-scoped metrics fire once per logical split at the network layer.

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| erpc_upstream_evm_get_logs_stale_upper_bound_total | Counter | project, vendor, network, upstream, category, confidence | eth_getLogs skipped because upstream's latest block < requested toBlock. |
| erpc_upstream_evm_get_logs_stale_lower_bound_total | Counter | project, vendor, network, upstream, category, confidence | eth_getLogs skipped because fromBlock is below the upstream's available range. |
| erpc_upstream_evm_get_logs_range_exceeded_auto_splitting_threshold_total | Counter | project, vendor, network, upstream | Requests that exceeded getLogsAutoSplittingRangeThreshold and were auto-split. |
| erpc_upstream_evm_get_logs_forced_splits_total | Counter | project, vendor, network, upstream, dimension | Upstream-level splits forced by an upstream error (dimension: block_range, addresses, topics). |
| erpc_upstream_evm_get_logs_split_success_total | Counter | project, vendor, network, upstream | Successful eth_getLogs sub-requests after an upstream-level split. |
| erpc_upstream_evm_get_logs_split_failure_total | Counter | project, vendor, network, upstream | Failed eth_getLogs sub-requests after an upstream-level split. |
| erpc_network_evm_get_logs_forced_splits_total | Counter | project, network, dimension, user, agent_name | Network-level eth_getLogs splits by dimension; complements upstream-scoped variant. |
| erpc_network_evm_get_logs_split_success_total | Counter | project, network, user, agent_name | Successful eth_getLogs sub-requests at the network layer. |
| erpc_network_evm_get_logs_split_failure_total | Counter | project, network, user, agent_name | Failed eth_getLogs sub-requests at the network layer. |
| erpc_network_evm_trace_filter_range_requested | Histogram | project, network, method, user, finality | Requested block-range sizes for trace_filter / arbtrace_filter. |
| erpc_network_evm_trace_filter_forced_splits_total | Counter | project, network, method, dimension, user, agent_name | Splits for trace_filter / arbtrace_filter (labeled by method and dimension). |
| erpc_network_evm_trace_filter_split_success_total | Counter | project, network, method, user, agent_name | Successful sub-requests after a trace_filter split. |
| erpc_network_evm_trace_filter_split_failure_total | Counter | project, network, method, user, agent_name | Failed sub-requests after a trace_filter split. |

Consensus monitoring

When the consensus selection policy is active, eRPC emits a dedicated family of metrics. erpc_consensus_misbehavior_detected_total is the primary alert target — a non-zero rate means an upstream is returning data that diverges from the majority.

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| erpc_consensus_total | Counter | project, network, category, outcome, finality | Consensus rounds attempted; outcome distinguishes success, no_consensus, timeout, etc. |
| erpc_consensus_misbehavior_detected_total | Counter | project, network, upstream, category, finality, response_type, larger_than_consensus | Upstream returned data different from consensus; non-zero rate is the primary alert signal. |
| erpc_consensus_upstream_punished_total | Counter | project, network, upstream | Times an upstream was scored down for misbehavior. |
| erpc_consensus_short_circuit_total | Counter | project, network, category, reason, finality | Rounds that short-circuited (early exit before all upstreams responded). |
| erpc_consensus_errors_total | Counter | project, network, category, error, finality | Consensus-level errors by type (distinct from upstream errors). |
| erpc_consensus_upstream_errors_total | Counter | project, network, upstream, category, finality, response_type, error_code | Per-upstream errors observed during a consensus round. |
| erpc_consensus_panics_total | Counter | project, network, category, finality | Panic recoveries inside the consensus engine. |
| erpc_consensus_cancellations_total | Counter | project, network, category, phase, finality | Context cancellations by phase (collect, decide). |
| erpc_consensus_responses_collected | Histogram | project, network, category, vendors, short_circuited, finality | Responses gathered before a decision; reveals how often quorum is reached early. |
| erpc_consensus_agreement_count | Histogram | project, network, category, finality | Upstreams agreeing on the winning result; low values indicate frequent split votes. |
| erpc_consensus_duration_seconds | Histogram | project, network, category, outcome, finality | End-to-end duration of a consensus round. |

# Alert: misbehaving upstream detected
rate(erpc_consensus_misbehavior_detected_total[5m]) > 0.1

# Consensus success rate by network
sum(rate(erpc_consensus_total{outcome="success"}[5m])) by (network) /
sum(rate(erpc_consensus_total[5m])) by (network)

# Median upstreams agreeing per round (low = fragile quorum)
histogram_quantile(0.5, sum(rate(erpc_consensus_agreement_count_bucket[5m])) by (le, network))
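To turn the misbehavior query into a standing alert, it can be wrapped in a Prometheus rule file. The sketch below is one way to do that; the group name, alert name, severity label, and the 10m hold time are all assumptions, and the expression is aggregated per upstream so each firing alert names a single offender.

```yaml
# erpc-consensus.rules.yml (illustrative Prometheus rule file)
groups:
  - name: erpc-consensus                   # hypothetical group name
    rules:
      - alert: ErpcConsensusMisbehavior    # hypothetical alert name
        # Same 0.1/s threshold as the query above, summed per network and upstream.
        expr: sum by (network, upstream) (rate(erpc_consensus_misbehavior_detected_total[5m])) > 0.1
        for: 10m                           # assumed hold time to ignore short blips
        labels:
          severity: warning                # assumed routing label
        annotations:
          summary: "Upstream {{ $labels.upstream }} diverges from consensus on {{ $labels.network }}"
```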

x402 payment metrics

When the x402 payment middleware is enabled, eRPC emits additional counters for payment attempts, successes, and failures. Check health/metrics.go in the repo for the current list — they follow the erpc_x402_* naming convention.

Rate-limit monitoring

erpc_rate_limits_total is the unified counter for all rate-limiting events. It replaces the deprecated erpc_budget_decision_total (which is still emitted for backward compatibility but should not be used for new dashboards).

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| erpc_rate_limits_total | Counter | project, network, vendor, upstream, category, finality, user, agent_name, budget, scope, auth, origin | All rate-limit decisions; scope distinguishes network/upstream/auth, origin distinguishes local/remote. |
| erpc_rate_limiter_budget_max_count | Gauge | budget, method, scope | Effective req/s cap for a budget (updated by auto-tuner). |
| erpc_rate_limiter_failopen_total | Counter | project, network, user, agent_name, budget, category, reason | Fail-open events; reason = limit_timeout means the remote call was too slow. |

# Unified rate-limit event rate by network, scope, and origin (local vs remote)
sum(rate(erpc_rate_limits_total[5m])) by (network, scope, origin)

# Alert: fail-open events rising (remote rate-limiter degraded)
sum(rate(erpc_rate_limiter_failopen_total[5m])) by (budget, reason) > 0.1

Remote rate-limiter monitoring (Redis-backed)

When a Redis-backed rate limiter is configured, watch erpc_rate_limiter_remote_inflight — a climbing gauge without a matching drop indicates Redis is saturated or unreachable.

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| erpc_rate_limiter_remote_inflight | Gauge | budget | In-flight Redis DoLimit calls per budget. Climbs without bound if Redis is overwhelmed. |
| erpc_rate_limiter_remote_admission_shedded_total | Counter | budget | Admission semaphore full: the remote call was never attempted and the request was fail-opened instead. |
| erpc_rate_limiter_remote_duration_seconds | Histogram | budget, result | Round-trip latency of remote rate-limit calls; buckets go from 1ms to 5s. |

# Alert: admission shedding active (semaphore full)
rate(erpc_rate_limiter_remote_admission_shedded_total[1m]) > 0

# p99 Redis round-trip latency
histogram_quantile(0.99,
  sum(rate(erpc_rate_limiter_remote_duration_seconds_bucket[5m])) by (le, budget)
)
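The two queries above could be paired in a rule file so that shedding and slow round-trips page separately. This is a sketch under assumptions: the alert names, hold times, and the 50ms latency threshold are hypothetical and should be tuned to your Redis deployment.

```yaml
# erpc-rate-limiter.rules.yml (illustrative Prometheus rule file)
groups:
  - name: erpc-remote-rate-limiter          # hypothetical group name
    rules:
      - alert: ErpcRateLimiterAdmissionShedding
        # Any sustained shedding means requests are being fail-opened.
        expr: sum by (budget) (rate(erpc_rate_limiter_remote_admission_shedded_total[1m])) > 0
        for: 2m                             # assumed hold time
        labels:
          severity: critical                # assumed routing label
      - alert: ErpcRateLimiterRemoteSlow
        # p99 round-trip above 50ms (assumed threshold, in seconds).
        expr: |
          histogram_quantile(0.99,
            sum by (le, budget) (rate(erpc_rate_limiter_remote_duration_seconds_bucket[5m]))
          ) > 0.05
        for: 5m                             # assumed hold time
        labels:
          severity: warning
```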

Shadow upstream monitoring

Shadow upstreams receive a copy of every request after the primary response is returned. Use these metrics to validate a candidate upstream's correctness before promoting it.

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| erpc_shadow_response_identical_total | Counter | project, vendor, network, upstream, category | Shadow matched the primary response exactly. |
| erpc_shadow_response_mismatch_total | Counter | project, vendor, network, upstream, category, finality, emptyish, larger | Shadow differed; larger indicates the shadow returned more data than primary. |
| erpc_shadow_response_error_total | Counter | project, vendor, network, upstream, category, error | Shadow request errored; does not affect client response. |

# Mismatch rate for a shadow upstream
rate(erpc_shadow_response_mismatch_total{upstream="my-candidate"}[5m])

# Shadow match rate (closer to 1 = safer to promote)
sum(rate(erpc_shadow_response_identical_total[5m])) by (upstream) /
(
  sum(rate(erpc_shadow_response_identical_total[5m])) by (upstream) +
  sum(rate(erpc_shadow_response_mismatch_total[5m])) by (upstream)
)
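Because the match-rate expression is long, it may be convenient to precompute it with a Prometheus recording rule and graph the resulting series directly. The rule below is a sketch; the group name and the recorded series name are assumptions that follow the common level:metric:operations naming pattern.

```yaml
# erpc-shadow.rules.yml (illustrative Prometheus rule file)
groups:
  - name: erpc-shadow                        # hypothetical group name
    rules:
      - record: upstream:erpc_shadow_response_match:ratio_rate5m   # hypothetical series name
        expr: |
          sum by (upstream) (rate(erpc_shadow_response_identical_total[5m]))
          /
          (
            sum by (upstream) (rate(erpc_shadow_response_identical_total[5m]))
            +
            sum by (upstream) (rate(erpc_shadow_response_mismatch_total[5m]))
          )
```

One caveat applies to both the raw query and the rule: PromQL binary operators only return results for label sets present on both sides, so an upstream that has not yet exported a mismatch series may produce no ratio at all rather than 1.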

Hedge policy monitoring

Hedge requests fire a second upstream call after a configurable delay if the first has not yet responded. See Hedge policy for configuration.

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| erpc_network_hedged_request_total | Counter | project, network, upstream, category, attempt, finality, user, agent_name | Total hedge attempts; attempt = 2 means second leg fired. |
| erpc_network_hedge_discards_total | Counter | project, network, upstream, category, attempt, hedge, finality, user, agent_name | Hedge responses discarded (lost the race); each discard = one wasted upstream call. |
| erpc_network_hedge_delay_seconds | Histogram | project, network, category, finality | Actual hedge delay applied; compare against configured delay to detect quantile-based adaptation. |

# Hedge fire rate by network (how often second leg launches)
sum(rate(erpc_network_hedged_request_total{attempt="2"}[5m])) by (network)

# Wasted-request ratio from hedging
sum(rate(erpc_network_hedge_discards_total[5m])) by (network) /
sum(rate(erpc_network_hedged_request_total[5m])) by (network)

Ristretto in-memory cache monitoring

The Ristretto cache is eRPC's built-in in-process cache layer. erpc_ristretto_cache_current_cost is the primary saturation signal — when it approaches the configured maxCost, items are evicted and the effective hit rate degrades.

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| erpc_ristretto_cache_current_cost | Gauge | connector | Current total cost (bytes) held in the cache per connector. Compare against maxCost config. |
| erpc_ristretto_cache_sets_failed_total | Counter | connector | Set operations dropped by Ristretto (over capacity or rejected by policy). |

# Cache fill level (requires knowing maxCost from config)
erpc_ristretto_cache_current_cost{connector="my-memory-connector"}

# Alert: high Ristretto rejection rate (cache under pressure)
rate(erpc_ristretto_cache_sets_failed_total[5m]) > 10
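Since maxCost is not exported as a metric in the tables above, a fill-level alert has to hard-code it. The sketch below assumes a maxCost of 1 GiB (1073741824 bytes) purely for illustration; the alert name, the 90% threshold, and the hold time are likewise assumptions.

```yaml
# erpc-ristretto.rules.yml (illustrative Prometheus rule file)
groups:
  - name: erpc-ristretto                    # hypothetical group name
    rules:
      - alert: ErpcRistrettoCacheNearCapacity
        # 1073741824 = 1 GiB, an assumed maxCost; replace it with the value
        # from your own connector config.
        expr: erpc_ristretto_cache_current_cost / 1073741824 > 0.9
        for: 15m                            # assumed hold time
        labels:
          severity: warning                 # assumed routing label
        annotations:
          summary: "Ristretto cache on {{ $labels.connector }} is above 90% of maxCost"
```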

Auth failure monitoring

| Metric | Type | Labels | What it tells you |
| --- | --- | --- | --- |
| erpc_auth_failed_total | Counter | project, network, strategy, reason, agent_name | Failed auth attempts; strategy (e.g. jwt, secret) and reason identify the failure mode. |

# Alert: auth failure spike by strategy
sum(rate(erpc_auth_failed_total[5m])) by (project, strategy, reason) > 1

PromQL examples

# Request rate per second by network over last 5 minutes
sum(rate(erpc_network_request_received_total{}[5m])) by (network)

# Total daily requests by project and network
sum(increase(erpc_network_request_received_total{}[24h])) by (project, network)

# Top 5 project+network combos by request volume
topk(5, sum(rate(erpc_network_request_received_total{}[5m])) by (project, network))

# Error rate percentage by network and upstream
100 * sum(rate(erpc_upstream_request_errors_total{}[5m])) by (network, upstream) /
sum(rate(erpc_upstream_request_total{}[5m])) by (network, upstream)

# Top error types in the last hour
topk(10, sum(increase(erpc_upstream_request_errors_total{}[1h])) by (error))

# Missing data errors by network and upstream
sum(rate(erpc_upstream_request_missing_data_error_total{}[5m])) by (network, upstream)

# 95th percentile request duration by network
histogram_quantile(0.95, sum(rate(erpc_network_request_duration_seconds_bucket{}[5m])) by (le, network))

# Average upstream latency for eth_call
sum(rate(erpc_upstream_request_duration_seconds_sum{category="eth_call"}[5m])) by (network, upstream) /
sum(rate(erpc_upstream_request_duration_seconds_count{category="eth_call"}[5m])) by (network, upstream)

# Identify slow upstreams (avg duration > 500ms)
sum(rate(erpc_upstream_request_duration_seconds_sum{}[5m])) by (network, upstream) /
sum(rate(erpc_upstream_request_duration_seconds_count{}[5m])) by (network, upstream) > 0.5

# Cache hit ratio by network
sum(rate(erpc_network_cache_hits_total{}[5m])) by (network) /
(
  sum(rate(erpc_network_cache_hits_total{}[5m])) by (network) +
  sum(rate(erpc_network_cache_misses_total{}[5m])) by (network)
)

# Cache miss rate for eth_getBlockByNumber
rate(erpc_network_cache_misses_total{category="eth_getBlockByNumber"}[5m])

# Self rate-limited requests by project and network
sum(rate(erpc_network_request_self_rate_limited_total{}[5m])) by (project, network)

# Auth rate limiting by strategy
sum(rate(erpc_auth_request_self_rate_limited_total{strategy="jwt"}[5m])) by (project)

# Remote rate limiting by upstream
sum(rate(erpc_upstream_request_remote_rate_limited_total{}[5m])) by (upstream)

# Block head lag by network and upstream
max(erpc_upstream_block_head_lag) by (network, upstream)

# Alert: finalization lag > 5 blocks
max(erpc_upstream_finalization_lag) by (network) > 5

# Block height spread across upstreams on a network
max(erpc_upstream_latest_block_number) by (network) -
min(erpc_upstream_latest_block_number) by (network)

# Overall upstream health scores
avg(erpc_upstream_score_overall) by (network, upstream)

# CORS disallowed origins
sum(rate(erpc_cors_disallowed_origin_total{}[5m])) by (project, origin)

# How far behind is the latest block (all origins)
erpc_network_latest_block_timestamp_distance_seconds

# From internal EVM state poller only
erpc_network_latest_block_timestamp_distance_seconds{origin="evm_state_poller"}

# From what clients receive (including cached responses)
erpc_network_latest_block_timestamp_distance_seconds{origin="network_response"}

# Alert: block timestamp > 30s behind wall clock
erpc_network_latest_block_timestamp_distance_seconds > 30

# Retry pressure by reason — spikes on `block_unavailable` usually mean a slow upstream
sum(rate(erpc_network_retry_attempt_total[5m])) by (network, reason)

# Hedge effectiveness: which upstream usually wins the race (promote candidates)
topk(5, sum(rate(erpc_network_hedge_winner_total[10m])) by (network, upstream))

# Wasted hedge work — high rate means hedge `delay` is too short
sum(rate(erpc_network_hedge_discards_total[5m])) by (network, upstream)

# Per-attempt outcome distribution — separates real vs speculative traffic
sum(rate(erpc_upstream_attempt_outcome_total[5m])) by (upstream, outcome, is_hedge)

# Circuit-breaker churn — frequent open/half_open transitions = bad upstream
sum(increase(erpc_upstream_breaker_state_change_total{transition="closed_to_open"}[1h])) by (upstream)

# Consensus wait-cap firings — a hot signal for laggard upstreams in a consensus group
sum(rate(erpc_consensus_wait_capped_total[5m])) by (network, trigger)
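The alert-style queries in this list (finalization lag, block timestamp distance) can be loaded into Prometheus as alerting rules. The following is a minimal sketch reusing the thresholds shown above; alert names, hold times, and severity labels are assumptions.

```yaml
# erpc-freshness.rules.yml (illustrative Prometheus rule file)
groups:
  - name: erpc-freshness                    # hypothetical group name
    rules:
      - alert: ErpcFinalizationLagHigh
        expr: max by (network) (erpc_upstream_finalization_lag) > 5
        for: 5m                             # assumed hold time
        labels:
          severity: warning
      - alert: ErpcLatestBlockStale
        # Restricted to the internal poller so cached client responses
        # do not trigger the alert.
        expr: 'erpc_network_latest_block_timestamp_distance_seconds{origin="evm_state_poller"} > 30'
        for: 2m                             # assumed hold time
        labels:
          severity: critical
```

The 30s threshold suits chains with short block times; slower chains would need a larger value.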

Grafana dashboard

The erpc/monitoring directory contains a ready-made Grafana dashboard JSON and a Prometheus config. The docker-compose.yml at the repo root brings up both with docker compose up grafana prometheus.

Cardinality reduction strategies

High metric cardinality is the most common production issue with eRPC metrics. Strategies from lowest to highest impact:

  1. errorLabelMode: compact. Prevents one misbehaving upstream from exploding the error label cardinality. Always set this in production.
  2. histogramDropLabels: [user, agent]. The user and agent labels are per-API-key and per-client-agent respectively; each unique value multiplies the number of series for every histogram. Drop them unless you specifically need per-user or per-agent latency histograms.
  3. histogramDropLabels: [category]. category is the JSON-RPC method category (e.g. eth_call, eth_getLogs). Dropping it collapses all method categories into one histogram series per (network, upstream). Use histogramLabelOverrides to keep it for selected histograms.
  4. Custom histogramBuckets. Fewer buckets mean fewer series. The default buckets cover a wide range; trim to the p50–p99 range you actually care about.

Common pitfall: managed Prometheus scrapers (Grafana Cloud and similar managed services) enforce a default body-size limit on /metrics responses. With default settings, many upstreams, and high-cardinality labels, the response can grow to several MB or more. If you see scrape errors mentioning body size, start with histogramDropLabels: [user, agent, category].
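Applied together, the four strategies might look like the sketch below. Every field name comes from the MetricsConfig reference; the bucket list is a hypothetical trimmed set rather than a recommended default.

```yaml
# erpc.yaml: cardinality-reduction strategies combined (bucket values are illustrative)
metrics:
  enabled: true
  errorLabelMode: "compact"            # 1. label errors by type only
  histogramDropLabels:                 # 2 and 3. drop per-user, per-agent, per-method labels
    - user
    - agent
    - category
  histogramLabelOverrides:             # keep per-method latency for upstream requests only
    upstream_request_duration_seconds:
      - category
  histogramBuckets: "0.05,0.25,1,5,30" # 4. fewer buckets, fewer series
```

Rolling these out one at a time and watching the /metrics response size after each change makes it easier to see which step bought the most.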

Common pitfalls

  • errorLabelMode: verbose with a broken upstream — one upstream returning varying error messages creates a unique label value per message, causing label cardinality to grow unboundedly. Switch to compact in production.
  • user and agent labels on histograms — each unique API key or User-Agent value multiplies the number of time series for every histogram. Drop with histogramDropLabels unless you have a specific need.
  • Scraper body-size limits — managed scrapers often cap the response at 10–64 MB. Large deployments with many upstreams and high-cardinality labels can hit this. Reduce cardinality before hitting the limit (the scrape silently fails rather than returning partial data).
  • IPv6 and listenV6: true — ensure the host's Docker/network config supports IPv6 and the port is correctly mapped. The IPv6 listener binds to hostV6 independently; both IPv4 and IPv6 listeners use the same port.
  • histogramLabelOverrides key format — use the metric name without the erpc_ prefix and without _bucket/_sum/_count suffixes. Example: upstream_request_duration_seconds, not erpc_upstream_request_duration_seconds_bucket.