Monitoring & Observability

A guide to monitoring Prism with Prometheus metrics, the health endpoint, and structured logging.

Overview

Prism exposes four observability surfaces:

Prometheus Metrics: 50+ metrics covering requests, caching, upstream health, and routing
Health Endpoint: Real-time system status and upstream health
Structured Logging: JSON or pretty-printed logs with configurable levels
Request Tracing: Cache status headers for debugging

Enabling Metrics

[metrics]
enabled = true
prometheus_port = 9090  # Optional; metrics are also served on the main port

Fetch metrics from the main port (3030 in this example):

curl http://localhost:3030/metrics
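
The endpoint returns the Prometheus text exposition format, so a few lines of Python are enough to inspect it outside Prometheus. The following is a minimal sketch, not part of Prism: the `parse_metrics` helper and the `SAMPLE` text are illustrative, and the parser handles only plain `name{labels} value` lines, not every corner of the format.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative input; a live scrape would come from the /metrics endpoint.
SAMPLE = """\
# TYPE rpc_requests_total counter
rpc_requests_total{method="eth_getLogs",upstream="alchemy"} 125000
rpc_healthy_upstreams 2
"""

parsed = parse_metrics(SAMPLE)
for name, labels, value in parsed:
    print(name, labels, value)
```

To run it against a live instance, the text could be read with `urllib.request.urlopen("http://localhost:3030/metrics").read().decode()` and passed to `parse_metrics`.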

Prometheus Metrics

Core RPC Metrics

Request Counters

Total requests by method and upstream:

rpc_requests_total{method="eth_getLogs",upstream="alchemy"} 125000

Successful requests:

rpc_requests_success_total{method="eth_getLogs",upstream="alchemy"} 124500

Failed requests:

rpc_requests_error_total{method="eth_getLogs",upstream="alchemy"} 500

Request Latency

Latency histogram by method and upstream:

rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.05"} 80000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.1"} 115000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.5"} 123000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="+Inf"} 125000
rpc_request_duration_seconds_sum{method="eth_getLogs",upstream="alchemy"} 6250.5
rpc_request_duration_seconds_count{method="eth_getLogs",upstream="alchemy"} 125000

Calculate percentiles in Prometheus (without aggregation, each query returns one result per method and upstream):

# P50 latency
histogram_quantile(0.5, rate(rpc_request_duration_seconds_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(rpc_request_duration_seconds_bucket[5m]))

# P99 latency
histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))
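
To see what these queries estimate, the sketch below reproduces the linear interpolation that `histogram_quantile` performs, applied to the cumulative sample buckets shown earlier. It is an illustration under simplifying assumptions: it works on raw cumulative counts rather than on `rate()` output, and the function name simply mirrors the PromQL one.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # the best available answer is the previous upper bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative buckets from the eth_getLogs / alchemy sample above.
BUCKETS = [(0.05, 80000), (0.1, 115000), (0.5, 123000), (math.inf, 125000)]

p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
print(p50, p95, p99)
```

With the sample data, P50 lands at roughly 39 ms and P95 at roughly 288 ms. P99 falls in the `+Inf` bucket, so the estimate is capped at 0.5 s, the highest finite bound in the sample. A quantile that sits above the last finite bucket cannot be resolved any further.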

Cache Metrics

Cache Hits and Misses

rpc_cache_hits_total{method="eth_getBlockByNumber"} 95000
rpc_cache_misses_total{method="eth_getBlockByNumber"} 5000

Cache hit rate:

rate(rpc_cache_hits_total[5m]) /
  (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m]))
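
The same ratio can be computed by hand from two scrapes of the counters. This is a minimal sketch with a hypothetical `hit_rate` helper; unlike `rate()`, it does not handle counter resets.

```python
def hit_rate(hits_then, misses_then, hits_now, misses_now):
    """Cache hit rate over the interval between two counter scrapes."""
    hits = hits_now - hits_then
    misses = misses_now - misses_then
    lookups = hits + misses
    if lookups == 0:
        return None  # no cache lookups in the window
    return hits / lookups


# Sample counter values from above as the latest scrape, zero as the baseline.
rate_since_start = hit_rate(0, 0, 95000, 5000)
print(rate_since_start)  # 0.95
```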

Cache Statistics

Block cache:

rpc_block_cache_hot_window_size 200
rpc_block_cache_lru_entries 8500
rpc_block_cache_bytes 52428800  # 50 MiB

Log cache:

rpc_log_cache_chunks 50
rpc_log_cache_indexed_blocks 50000
rpc_log_cache_bitmap_entries 25000
rpc_log_cache_bytes 104857600  # 100 MiB

Transaction cache:

rpc_transaction_cache_entries 25000
rpc_receipt_cache_entries 25000
rpc_transaction_cache_bytes 41943040  # 40 MiB

Partial Cache Fulfillment

rpc_partial_cache_fulfillments_total{method="eth_getLogs"} 15000

rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="0.5"} 5000
rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="0.75"} 10000
rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="1.0"} 15000
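
Histogram buckets are cumulative: each `le` bucket counts every observation at or below its bound. The sketch below converts the sample buckets into per-interval counts. The `per_bucket_counts` helper is hypothetical, and it assumes the first interval starts at 0.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative (le, count) pairs into ((low, high), count) intervals."""
    result = []
    prev_bound, prev_count = 0.0, 0
    for bound, count in sorted(cumulative):
        result.append(((prev_bound, bound), count - prev_count))
        prev_bound, prev_count = bound, count
    return result


# Cumulative sample buckets for eth_getLogs from above.
BUCKETS = [(0.5, 5000), (0.75, 10000), (1.0, 15000)]
intervals = per_bucket_counts(BUCKETS)
for (low, high), count in intervals:
    print(f"{low:.2f}-{high:.2f}: {count}")
```

In the sample, each of the three intervals holds 5000 partial fulfillments.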

Cache Evictions

rpc_cache_evictions_total{cache_type="block"} 1500
rpc_cache_evictions_total{cache_type="log"} 3200
rpc_cache_evictions_total{cache_type="transaction"} 2100

Upstream Health Metrics

Health Status

Upstream health gauge (1=healthy, 0=unhealthy):

rpc_upstream_health{upstream="alchemy"} 1
rpc_upstream_health{upstream="infura"} 1
rpc_upstream_health{upstream="quicknode"} 0

Number of healthy upstreams:

rpc_healthy_upstreams 2

Health Check Results

rpc_health_check_success_total{upstream="alchemy"} 2880
rpc_health_check_failure_total{upstream="alchemy"} 5

rpc_health_check_duration_seconds_bucket{upstream="alchemy",le="0.05"} 2800
rpc_health_check_duration_seconds_bucket{upstream="alchemy",le="0.1"} 2880

Upstream Latency

Precomputed latency percentiles and average, in milliseconds:

rpc_upstream_latency_p50_ms{upstream="alchemy"} 48
rpc_upstream_latency_p95_ms{upstream="alchemy"} 120
rpc_upstream_latency_p99_ms{upstream="alchemy"} 350
rpc_upstream_latency_avg_ms{upstream="alchemy"} 62

Upstream Error Metrics

Error Counts

rpc_upstream_errors_total{upstream="alchemy",error_type="timeout"} 25
rpc_upstream_errors_total{upstream="alchemy",error_type="connection_failed"} 3
rpc_upstream_errors_total{upstream="infura",error_type="rpc_rate_limit"} 145

Error Details

rpc_jsonrpc_errors_total{upstream="alchemy",code="-32005",category="provider_error"} 15

Error Rate

rpc_upstream_error_rate{upstream="alchemy"} 0.004  # 0.4% error rate

Circuit Breaker Metrics

Circuit breaker state (0=closed, 0.5=half-open, 1=open):

rpc_circuit_breaker_state{upstream="alchemy"} 0

State transitions:

rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="open"} 3
rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="closed"} 3
rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="half_open"} 3

Failure count:

rpc_circuit_breaker_failure_count{upstream="alchemy"} 0

Open duration:

rpc_circuit_breaker_open_duration_seconds_bucket{upstream="alchemy",le="60"} 2
rpc_circuit_breaker_open_duration_seconds_bucket{upstream="alchemy",le="300"} 3
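
Because the state gauge encodes half-open as 0.5, the sum of all gauges equals the number of upstreams only when every breaker is fully open. The sketch below decodes the gauge and applies that check. The helper names are illustrative and not part of Prism.

```python
# Gauge encoding as documented above: 0=closed, 0.5=half-open, 1=open.
STATE_BY_VALUE = {0.0: "closed", 0.5: "half_open", 1.0: "open"}


def decode_state(value):
    """Translate a circuit breaker gauge value into a state name."""
    return STATE_BY_VALUE.get(float(value), "unknown")


def all_breakers_open(gauges):
    """True only when every upstream's breaker gauge reads 1 (open)."""
    return bool(gauges) and sum(gauges.values()) == len(gauges)


# Hypothetical gauge readings keyed by upstream name.
gauges = {"alchemy": 0, "infura": 0.5, "quicknode": 1}
for upstream, value in gauges.items():
    print(upstream, decode_state(value))
print(all_breakers_open(gauges))
```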

Routing Metrics

Upstream Selection

rpc_upstream_selections_total{upstream="alchemy",reason="best_score"} 85000
rpc_upstream_selections_total{upstream="infura",reason="fallback"} 15000

Scoring Metrics

Composite scores:

rpc_upstream_composite_score{upstream="alchemy"} 0.875
rpc_upstream_composite_score{upstream="infura"} 0.720

Score factors:

rpc_upstream_latency_factor{upstream="alchemy"} 0.92
rpc_upstream_error_rate_factor{upstream="alchemy"} 0.996
rpc_upstream_throttle_factor{upstream="alchemy"} 1.0
rpc_upstream_block_lag_factor{upstream="alchemy"} 1.0

Block lag:

rpc_upstream_block_head_lag{upstream="alchemy"} 0
rpc_upstream_block_head_lag{upstream="infura"} 2

Hedging Metrics

Hedged requests:

rpc_hedged_requests_total{primary="alchemy",hedged="infura"} 1500

Hedge wins (which request finished first):

rpc_hedge_wins_total{upstream="alchemy",type="primary"} 850
rpc_hedge_wins_total{upstream="infura",type="hedged"} 650

Hedge delay:

rpc_hedge_delay_ms_bucket{upstream="alchemy",le="50"} 800
rpc_hedge_delay_ms_bucket{upstream="alchemy",le="100"} 1400
rpc_hedge_delay_ms_bucket{upstream="alchemy",le="200"} 1500

Hedge skips:

rpc_hedge_skipped_total{reason="insufficient_data"} 250

Consensus Metrics

rpc_consensus_requests_total{result="success"} 25000
rpc_consensus_requests_total{result="failure"} 15

rpc_consensus_agreement_rate 0.9994

rpc_consensus_duration_seconds_bucket{le="0.1"} 5000
rpc_consensus_duration_seconds_bucket{le="0.5"} 24000

Authentication Metrics

Auth attempts:

rpc_auth_success_total{key_id="production-api"} 125000
rpc_auth_failure_total{key_id="unknown"} 45

Auth cache:

rpc_auth_cache_hits_total 120000
rpc_auth_cache_misses_total 5045
rpc_auth_cache_entries 50

Rate limiting:

rpc_rate_limit_allowed_total{key="production-api"} 124500
rpc_rate_limit_rejected_total{key="production-api"} 500

Quota management:

rpc_auth_quota_exceeded_total{key_id="production-api"} 3

Method permissions:

rpc_auth_method_denied_total{key_id="logs-only",method="eth_getBlockByNumber"} 12

Chain State Metrics

Current chain tip:

rpc_chain_tip_block 18500000

Chain tip update latency:

rpc_chain_tip_update_latency_seconds_bucket{le="0.1"} 2800

Reorg detection:

rpc_reorgs_detected_total 5

rpc_reorg_depth_bucket{le="2"} 3
rpc_reorg_depth_bucket{le="5"} 4
rpc_reorg_depth_bucket{le="10"} 5

rpc_last_reorg_block 18499995

System Metrics

Active connections:

rpc_active_connections 150

WebSocket connections:

rpc_websocket_active_connections 4

rpc_websocket_connections_total 125
rpc_websocket_disconnections_total 121

Batch requests:

rpc_batch_requests_total 2500

rpc_batch_request_size_bucket{le="5"} 1500
rpc_batch_request_size_bucket{le="10"} 2200
rpc_batch_request_size_bucket{le="25"} 2500

rpc_batch_request_duration_seconds_bucket{le="0.5"} 2000
rpc_batch_request_duration_seconds_bucket{le="1.0"} 2400

Health Endpoint

Get real-time system status at /health.

Request

GET /health HTTP/1.1
Host: localhost:3030

Response

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "upstreams": [
    {
      "name": "alchemy-mainnet",
      "healthy": true,
      "chain_id": 1,
      "latest_block": 18500000,
      "finalized_block": 18499900,
      "response_time_ms": 45,
      "error_count": 0,
      "circuit_breaker": "closed",
      "latency": {
        "p50_ms": 48,
        "p95_ms": 120,
        "p99_ms": 350
      },
      "requests": {
        "total": 125000,
        "success": 124500,
        "errors": 500
      }
    },
    {
      "name": "infura-mainnet",
      "healthy": true,
      "chain_id": 1,
      "latest_block": 18500001,
      "finalized_block": 18499901,
      "response_time_ms": 52,
      "error_count": 0,
      "circuit_breaker": "closed",
      "latency": {
        "p50_ms": 55,
        "p95_ms": 140,
        "p99_ms": 420
      },
      "requests": {
        "total": 75000,
        "success": 74200,
        "errors": 800
      }
    }
  ],
  "cache": {
    "enabled": true,
    "blocks": {
      "hot_window_size": 200,
      "cached_headers": 8500,
      "cached_bodies": 6200,
      "memory_bytes": 52428800
    },
    "logs": {
      "chunks": 50,
      "indexed_blocks": 50000,
      "memory_bytes": 104857600
    },
    "transactions": {
      "cached_transactions": 25000,
      "cached_receipts": 25000,
      "memory_bytes": 41943040
    },
    "hit_rate": 0.87
  },
  "metrics": {
    "total_requests": 200000,
    "total_errors": 1300,
    "average_latency_ms": 48.5,
    "cache_hit_rate": 0.87
  }
}
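
A script can reduce this response to the fields an operator usually checks first. The sketch below is one way to do that, assuming the response shape shown above. `summarize_health` is a hypothetical helper, and `SAMPLE` is a trimmed copy of the example response.

```python
def summarize_health(payload):
    """Return one summary row per upstream from a /health response body."""
    rows = []
    for upstream in payload["upstreams"]:
        requests = upstream["requests"]
        total = requests["total"]
        rows.append({
            "name": upstream["name"],
            "healthy": upstream["healthy"],
            "circuit_breaker": upstream["circuit_breaker"],
            # Share of requests that failed; 0.0 when no requests were made.
            "error_rate": requests["errors"] / total if total else 0.0,
        })
    return rows


# Trimmed copy of the example response, keeping only the fields used here.
SAMPLE = {
    "status": "healthy",
    "upstreams": [
        {"name": "alchemy-mainnet", "healthy": True, "circuit_breaker": "closed",
         "requests": {"total": 125000, "success": 124500, "errors": 500}},
        {"name": "infura-mainnet", "healthy": True, "circuit_breaker": "closed",
         "requests": {"total": 75000, "success": 74200, "errors": 800}},
    ],
}

rows = summarize_health(SAMPLE)
for row in rows:
    print(row)
```

Against a running instance, the payload could be loaded with `json.load(urllib.request.urlopen("http://localhost:3030/health"))` instead of the inline sample.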

Health Status Values

Status      Description
healthy     All systems operational
degraded    Some upstreams unhealthy
unhealthy   No healthy upstreams
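
The status values above can be read as a simple rule over per-upstream health flags. The sketch below shows that reading. It is an interpretation of the table, not Prism's actual implementation, and `overall_status` is a hypothetical name.

```python
def overall_status(upstream_health):
    """Map per-upstream healthy flags to an overall status string."""
    healthy = sum(1 for ok in upstream_health if ok)
    if healthy == 0:
        return "unhealthy"  # no healthy upstreams
    if healthy < len(upstream_health):
        return "degraded"   # some upstreams unhealthy
    return "healthy"        # all systems operational


# Example: two healthy upstreams and one unhealthy upstream.
status = overall_status([True, True, False])
print(status)
```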

Logging

Configuration

[logging]
level = "info"         # trace, debug, info, warn, error
format = "pretty"      # pretty or json

Log Levels

Level   Use Case
trace   Extreme verbosity, development only
debug   Development and troubleshooting
info    Normal production logging
warn    Potential issues, non-critical
error   Errors requiring attention

Log Format

Pretty (Development):

2024-12-03T10:30:45.123Z  INFO prism::server: Server starting on 127.0.0.1:3030
2024-12-03T10:30:45.456Z  INFO prism::upstream: Added upstream: alchemy-mainnet
2024-12-03T10:30:45.789Z  INFO prism::health: Health checker started (interval: 60s)
2024-12-03T10:31:15.234Z  WARN prism::upstream: Upstream latency high upstream=infura latency_ms=850
2024-12-03T10:32:00.567Z ERROR prism::upstream: Circuit breaker opened upstream=quicknode failures=5

JSON (Production):

{"timestamp":"2024-12-03T10:30:45.123Z","level":"INFO","target":"prism::server","message":"Server starting on 127.0.0.1:3030"}
{"timestamp":"2024-12-03T10:30:45.456Z","level":"INFO","target":"prism::upstream","message":"Added upstream: alchemy-mainnet"}
{"timestamp":"2024-12-03T10:31:15.234Z","level":"WARN","target":"prism::upstream","message":"Upstream latency high","upstream":"infura","latency_ms":850}
{"timestamp":"2024-12-03T10:32:00.567Z","level":"ERROR","target":"prism::upstream","message":"Circuit breaker opened","upstream":"quicknode","failures":5}
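
With one JSON object per line, the logs are easy to filter in a script. The sketch below keeps only records at or above a chosen level, using the field names from the sample lines above. `filter_logs` is a hypothetical helper, and the level ordering follows the Log Levels table.

```python
import json

# Severity order, lowest to highest, matching the Log Levels table.
LEVELS = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4}


def filter_logs(lines, min_level="WARN"):
    """Return parsed log records whose level is at least min_level."""
    threshold = LEVELS[min_level]
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        if LEVELS.get(record.get("level", ""), -1) >= threshold:
            selected.append(record)
    return selected


# Three of the sample log lines from above.
LOG_LINES = [
    '{"timestamp":"2024-12-03T10:30:45.123Z","level":"INFO","target":"prism::server","message":"Server starting on 127.0.0.1:3030"}',
    '{"timestamp":"2024-12-03T10:31:15.234Z","level":"WARN","target":"prism::upstream","message":"Upstream latency high","upstream":"infura","latency_ms":850}',
    '{"timestamp":"2024-12-03T10:32:00.567Z","level":"ERROR","target":"prism::upstream","message":"Circuit breaker opened","upstream":"quicknode","failures":5}',
]

warnings_and_errors = filter_logs(LOG_LINES)
for record in warnings_and_errors:
    print(record["level"], record["message"], record.get("upstream"))
```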

Important Log Events

Startup:

INFO Server starting on 127.0.0.1:3030
INFO Added upstream: alchemy-mainnet
INFO Health checker started (interval: 60s)
INFO Cache manager initialized

Upstream Issues:

WARN Upstream latency high upstream=infura latency_ms=850
WARN Health check failed upstream=quicknode error="connection timeout"
ERROR Circuit breaker opened upstream=quicknode failures=5

Cache Events:

INFO Hot window advanced new_start=18499800
DEBUG Cache hit method=eth_getBlockByNumber block=18500000
DEBUG Partial cache fulfillment method=eth_getLogs cached_pct=0.75

Reorg Detection:

WARN Reorg detected depth=2 block=18499995
INFO Cache invalidated from_block=18499995 to_block=18500000

Alerting Recommendations

Critical Alerts

No Healthy Upstreams:

rpc_healthy_upstreams == 0

High Error Rate:

rate(rpc_requests_error_total[5m]) / rate(rpc_requests_total[5m]) > 0.05

All Circuit Breakers Open:

sum(rpc_circuit_breaker_state) == count(rpc_circuit_breaker_state)

Warning Alerts

Low Cache Hit Rate:

rate(rpc_cache_hits_total[5m]) /
  (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m])) < 0.7

High P99 Latency:

histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m])) > 1.0

Upstream Block Lag:

rpc_upstream_block_head_lag > 10

Frequent Reorgs:

increase(rpc_reorgs_detected_total[1h]) > 2

Sample Alert Rules

groups:
  - name: prism
    interval: 30s
    rules:
      - alert: NoHealthyUpstreams
        expr: rpc_healthy_upstreams == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy upstreams available"
          description: "All upstreams are unhealthy; service status is unhealthy"

      - alert: HighErrorRate
        expr: |
          rate(rpc_requests_error_total[5m]) /
          rate(rpc_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: LowCacheHitRate
        expr: |
          rate(rpc_cache_hits_total[5m]) /
          (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m])) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

      - alert: CircuitBreakerOpen
        expr: rpc_circuit_breaker_state == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.upstream }}"
          description: "Upstream {{ $labels.upstream }} is isolated"

Dashboards

Grafana Dashboard Example

Request Throughput Panel:

sum(rate(rpc_requests_total[5m])) by (method)

Cache Hit Rate Panel:

sum(rate(rpc_cache_hits_total[5m])) /
  (sum(rate(rpc_cache_hits_total[5m])) + sum(rate(rpc_cache_misses_total[5m])))

Latency Percentiles Panel:

histogram_quantile(0.50, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))

Upstream Health Panel:

rpc_upstream_health

Error Rate Panel:

sum(rate(rpc_requests_error_total[5m])) by (upstream)

Next: Explore the Routing Strategies or Caching Guide for more details.
