Monitoring & Observability

Comprehensive guide to monitoring Prism with Prometheus metrics, health checks, and logging.


Overview

Prism exposes its internal state through four channels:

  • Prometheus Metrics: 50+ metrics covering requests, caching, upstream health, routing

  • Health Endpoint: Real-time system status and upstream health

  • Structured Logging: JSON or pretty-printed logs with configurable levels

  • Request Tracing: Cache status headers for debugging

Enabling Metrics

[metrics]
enabled = true
prometheus_port = 9090  # Optional, metrics also on main port

Access metrics on the main port (3030 in these examples):

curl http://localhost:3030/metrics
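For scripted checks, the same endpoint can be read programmatically. The sketch below is a minimal parser for the Prometheus text exposition format, not part of Prism; the function names `parse_metrics` and `fetch_metrics` are hypothetical, and the URL is simply the one from the curl example above.

```python
import re
from urllib.request import urlopen

# Matches sample lines such as: rpc_requests_total{method="eth_getLogs"} 125000
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:3030/metrics"):
    """Fetch and parse the metrics endpoint (URL assumed from the curl example)."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `fetch_metrics()` against a running instance returns one tuple per sample line, which is enough for quick assertions in smoke tests. For anything long-lived, scraping with Prometheus itself is the better fit.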

Prometheus Metrics

Core RPC Metrics

Request Counters

Total requests by method and upstream:

rpc_requests_total{method="eth_getLogs",upstream="alchemy"} 125000

Successful requests:

rpc_requests_success_total{method="eth_getLogs",upstream="alchemy"} 124500

Failed requests:

rpc_requests_error_total{method="eth_getLogs",upstream="alchemy"} 500

Request Latency

Latency histogram by method and upstream:

rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.05"} 80000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.1"} 115000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.5"} 123000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="+Inf"} 125000
rpc_request_duration_seconds_sum{method="eth_getLogs",upstream="alchemy"} 6250.5
rpc_request_duration_seconds_count{method="eth_getLogs",upstream="alchemy"} 125000

Calculate percentiles in Prometheus:

# P50 latency
histogram_quantile(0.5, rate(rpc_request_duration_seconds_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(rpc_request_duration_seconds_bucket[5m]))

# P99 latency
histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))
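To see what these queries compute, the sketch below reproduces the linear interpolation that `histogram_quantile` applies within a bucket. It is an illustration only: it works on the raw cumulative counts from the sample output above, whereas the PromQL queries operate on per-second rates of those counters. The function is a simplified re-implementation in Python, not Prometheus code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            fraction = (rank - lower_count) / in_bucket if in_bucket else 0.0
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative counts copied from the eth_getLogs / alchemy sample above.
sample = [(0.05, 80000), (0.1, 115000), (0.5, 123000), (math.inf, 125000)]
p50 = histogram_quantile(0.50, sample)  # about 0.039 s
p95 = histogram_quantile(0.95, sample)  # about 0.2875 s
p99 = histogram_quantile(0.99, sample)  # 0.5, capped at the last finite bucket
```

The P99 result shows a practical limit: once a quantile falls into the `+Inf` bucket, the estimate is capped at the largest finite bound, so the reported value is only as precise as the bucket layout.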

Cache Metrics

Cache Hits and Misses

rpc_cache_hits_total{method="eth_getBlockByNumber"} 95000
rpc_cache_misses_total{method="eth_getBlockByNumber"} 5000

Cache hit rate:

rate(rpc_cache_hits_total[5m]) /
  (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m]))
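The same ratio can be checked by hand from two scrapes of the counters. The helper below is a minimal sketch (the name `hit_rate` is hypothetical) that takes the increase in hits and misses over a window and returns the hit rate.

```python
def hit_rate(hits_delta, misses_delta):
    """Return hits / (hits + misses) for one window, or None with no traffic."""
    total = hits_delta + misses_delta
    if total == 0:
        # PromQL evaluates 0/0 to NaN; None is this sketch's stand-in.
        return None
    return hits_delta / total


# Using the eth_getBlockByNumber sample counters above as the window's increase:
rate = hit_rate(95000, 5000)  # 0.95
```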

Cache Statistics

Block cache:

rpc_block_cache_hot_window_size 200
rpc_block_cache_lru_entries 8500
rpc_block_cache_bytes 52428800  # 50MB

Log cache:

rpc_log_cache_chunks 50
rpc_log_cache_indexed_blocks 50000
rpc_log_cache_bitmap_entries 25000
rpc_log_cache_bytes 104857600  # 100MB

Transaction cache:

rpc_transaction_cache_entries 25000
rpc_receipt_cache_entries 25000
rpc_transaction_cache_bytes 41943040  # 40MB

Partial Cache Fulfillment

rpc_partial_cache_fulfillments_total{method="eth_getLogs"} 15000

rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="0.5"} 5000
rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="0.75"} 10000
rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="1.0"} 15000

Cache Evictions

rpc_cache_evictions_total{cache_type="block"} 1500
rpc_cache_evictions_total{cache_type="log"} 3200
rpc_cache_evictions_total{cache_type="transaction"} 2100

Upstream Health Metrics

Health Status

Upstream health gauge (1=healthy, 0=unhealthy):

rpc_upstream_health{upstream="alchemy"} 1
rpc_upstream_health{upstream="infura"} 1
rpc_upstream_health{upstream="quicknode"} 0

Number of healthy upstreams:

rpc_healthy_upstreams 2

Health Check Results

rpc_health_check_success_total{upstream="alchemy"} 2880
rpc_health_check_failure_total{upstream="alchemy"} 5

rpc_health_check_duration_seconds_bucket{upstream="alchemy",le="0.05"} 2800
rpc_health_check_duration_seconds_bucket{upstream="alchemy",le="0.1"} 2880

Upstream Latency

Latency percentiles:

rpc_upstream_latency_p50_ms{upstream="alchemy"} 48
rpc_upstream_latency_p95_ms{upstream="alchemy"} 120
rpc_upstream_latency_p99_ms{upstream="alchemy"} 350
rpc_upstream_latency_avg_ms{upstream="alchemy"} 62

Upstream Error Metrics

Error Counts

rpc_upstream_errors_total{upstream="alchemy",error_type="timeout"} 25
rpc_upstream_errors_total{upstream="alchemy",error_type="connection_failed"} 3
rpc_upstream_errors_total{upstream="infura",error_type="rpc_rate_limit"} 145

Error Details

rpc_jsonrpc_errors_total{upstream="alchemy",code="-32005",category="provider_error"} 15

Error Rate

rpc_upstream_error_rate{upstream="alchemy"} 0.004  # 0.4% error rate

Circuit Breaker Metrics

Circuit breaker state (0=closed, 0.5=half-open, 1=open):

rpc_circuit_breaker_state{upstream="alchemy"} 0
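When consuming this gauge in scripts, it helps to translate the numeric encoding back into state names. The mapping below is a small sketch based on the encoding stated above; the `half_open` spelling is borrowed from the `to_state` label of the transitions counter, and the function name is hypothetical.

```python
# Gauge encoding documented above: 0=closed, 0.5=half-open, 1=open.
STATE_NAMES = {0.0: "closed", 0.5: "half_open", 1.0: "open"}


def breaker_state(value):
    """Translate a circuit breaker gauge value into its state name."""
    try:
        return STATE_NAMES[float(value)]
    except KeyError:
        raise ValueError(f"unexpected circuit breaker gauge value: {value!r}") from None
```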

State transitions:

rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="open"} 3
rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="closed"} 3
rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="half_open"} 3

Failure count:

rpc_circuit_breaker_failure_count{upstream="alchemy"} 0

Open duration:

rpc_circuit_breaker_open_duration_seconds_bucket{upstream="alchemy",le="60"} 2
rpc_circuit_breaker_open_duration_seconds_bucket{upstream="alchemy",le="300"} 3

Routing Metrics

Upstream Selection

rpc_upstream_selections_total{upstream="alchemy",reason="best_score"} 85000
rpc_upstream_selections_total{upstream="infura",reason="fallback"} 15000

Scoring Metrics

Composite scores:

rpc_upstream_composite_score{upstream="alchemy"} 0.875
rpc_upstream_composite_score{upstream="infura"} 0.720

Score factors:

rpc_upstream_latency_factor{upstream="alchemy"} 0.92
rpc_upstream_error_rate_factor{upstream="alchemy"} 0.996
rpc_upstream_throttle_factor{upstream="alchemy"} 1.0
rpc_upstream_block_lag_factor{upstream="alchemy"} 1.0

Block lag:

rpc_upstream_block_head_lag{upstream="alchemy"} 0
rpc_upstream_block_head_lag{upstream="infura"} 2

Hedging Metrics

Hedged requests:

rpc_hedged_requests_total{primary="alchemy",hedged="infura"} 1500

Hedge wins (which request finished first):

rpc_hedge_wins_total{upstream="alchemy",type="primary"} 850
rpc_hedge_wins_total{upstream="infura",type="hedged"} 650

Hedge delay:

rpc_hedge_delay_ms_bucket{upstream="alchemy",le="50"} 800
rpc_hedge_delay_ms_bucket{upstream="alchemy",le="100"} 1400
rpc_hedge_delay_ms_bucket{upstream="alchemy",le="200"} 1500

Hedge skips:

rpc_hedge_skipped_total{reason="insufficient_data"} 250

Consensus Metrics

rpc_consensus_requests_total{result="success"} 25000
rpc_consensus_requests_total{result="failure"} 15

rpc_consensus_agreement_rate 0.9994

rpc_consensus_duration_seconds_bucket{le="0.1"} 5000
rpc_consensus_duration_seconds_bucket{le="0.5"} 24000

Authentication Metrics

Auth attempts:

rpc_auth_success_total{key_id="production-api"} 125000
rpc_auth_failure_total{key_id="unknown"} 45

Auth cache:

rpc_auth_cache_hits_total 120000
rpc_auth_cache_misses_total 5045
rpc_auth_cache_entries 50

Rate limiting:

rpc_rate_limit_allowed_total{key="production-api"} 124500
rpc_rate_limit_rejected_total{key="production-api"} 500

Quota management:

rpc_auth_quota_exceeded_total{key_id="production-api"} 3

Method permissions:

rpc_auth_method_denied_total{key_id="logs-only",method="eth_getBlockByNumber"} 12

Chain State Metrics

Current chain tip:

rpc_chain_tip_block 18500000

Chain tip update latency:

rpc_chain_tip_update_latency_seconds_bucket{le="0.1"} 2800

Reorg detection:

rpc_reorgs_detected_total 5

rpc_reorg_depth_bucket{le="2"} 3
rpc_reorg_depth_bucket{le="5"} 4
rpc_reorg_depth_bucket{le="10"} 5

rpc_last_reorg_block 18499995
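Prometheus histogram buckets are cumulative, so `le="5"` counts every reorg of depth 5 or less, including those already counted under `le="2"`. The sketch below (hypothetical helper name) converts the cumulative sample above into a count per depth range.

```python
def per_bucket_counts(cumulative):
    """Convert cumulative (le, count) pairs into ((low, high), count) pairs."""
    counts = []
    previous_bound, previous_count = None, 0
    for bound, count in sorted(cumulative):
        # Observations in this range = cumulative count minus the bucket below.
        counts.append(((previous_bound, bound), count - previous_count))
        previous_bound, previous_count = bound, count
    return counts


# Sample values from rpc_reorg_depth_bucket above.
depths = per_bucket_counts([(2, 3), (5, 4), (10, 5)])
# Three reorgs of depth <= 2, one in (2, 5], one in (5, 10].
```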

System Metrics

Active connections:

rpc_active_connections 150

WebSocket connections:

rpc_websocket_active_connections 4

rpc_websocket_connections_total 125
rpc_websocket_disconnections_total 121

Batch requests:

rpc_batch_requests_total 2500

rpc_batch_request_size_bucket{le="5"} 1500
rpc_batch_request_size_bucket{le="10"} 2200
rpc_batch_request_size_bucket{le="25"} 2500

rpc_batch_request_duration_seconds_bucket{le="0.5"} 2000
rpc_batch_request_duration_seconds_bucket{le="1.0"} 2400

Health Endpoint

Get real-time system status at /health.

Request

GET /health HTTP/1.1
Host: localhost:3030

Response

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "upstreams": [
    {
      "name": "alchemy-mainnet",
      "healthy": true,
      "chain_id": 1,
      "latest_block": 18500000,
      "finalized_block": 18499900,
      "response_time_ms": 45,
      "error_count": 0,
      "circuit_breaker": "closed",
      "latency": {
        "p50_ms": 48,
        "p95_ms": 120,
        "p99_ms": 350
      },
      "requests": {
        "total": 125000,
        "success": 124500,
        "errors": 500
      }
    },
    {
      "name": "infura-mainnet",
      "healthy": true,
      "chain_id": 1,
      "latest_block": 18500001,
      "finalized_block": 18499901,
      "response_time_ms": 52,
      "error_count": 0,
      "circuit_breaker": "closed",
      "latency": {
        "p50_ms": 55,
        "p95_ms": 140,
        "p99_ms": 420
      },
      "requests": {
        "total": 75000,
        "success": 74200,
        "errors": 800
      }
    }
  ],
  "cache": {
    "enabled": true,
    "blocks": {
      "hot_window_size": 200,
      "cached_headers": 8500,
      "cached_bodies": 6200,
      "memory_bytes": 52428800
    },
    "logs": {
      "chunks": 50,
      "indexed_blocks": 50000,
      "memory_bytes": 104857600
    },
    "transactions": {
      "cached_transactions": 25000,
      "cached_receipts": 25000,
      "memory_bytes": 41943040
    },
    "hit_rate": 0.87
  },
  "metrics": {
    "total_requests": 200000,
    "total_errors": 1300,
    "average_latency_ms": 48.5,
    "cache_hit_rate": 0.87
  }
}
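A monitoring script rarely needs the whole payload. The sketch below condenses a response into a few fields worth alerting on, using only field names that appear in the sample above. The function names and the URL default are assumptions for illustration, and the block lag here is measured against the highest `latest_block` in the response, which may differ from how Prism computes `rpc_upstream_block_head_lag`.

```python
import json
from urllib.request import urlopen


def summarize_health(health):
    """Condense a /health payload into a small summary dict."""
    upstreams = health.get("upstreams", [])
    tip = max((u["latest_block"] for u in upstreams), default=None)
    return {
        "status": health.get("status"),
        "unhealthy": [u["name"] for u in upstreams if not u["healthy"]],
        # Any breaker that is not closed (open or half-open).
        "tripped_breakers": [
            u["name"] for u in upstreams if u["circuit_breaker"] != "closed"
        ],
        # Blocks each upstream trails the highest head seen in this response.
        "block_lag": {u["name"]: tip - u["latest_block"] for u in upstreams},
    }


def fetch_health(url="http://localhost:3030/health"):
    """Fetch /health and summarize it (URL assumed from the request example)."""
    with urlopen(url, timeout=5) as response:
        return summarize_health(json.load(response))
```

Applied to the sample response, the summary reports no unhealthy upstreams and shows `alchemy-mainnet` one block behind `infura-mainnet`.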

Health Status Values

Status      Description
healthy     All systems operational
degraded    Some upstreams unhealthy
unhealthy   No healthy upstreams
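The three status values can be read as a function of per-upstream health. The sketch below encodes that reading; it is an interpretation of the descriptions above, not Prism's actual logic, which may consider more than upstream health when reporting `healthy`.

```python
def overall_status(upstream_health):
    """Map a list of per-upstream health flags to a status value."""
    healthy = sum(1 for ok in upstream_health if ok)
    if healthy == 0:
        return "unhealthy"  # no healthy upstreams (also covers an empty list)
    if healthy == len(upstream_health):
        return "healthy"
    return "degraded"  # at least one upstream is down, at least one is up
```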


Logging

Configuration

[logging]
level = "info"         # trace, debug, info, warn, error
format = "pretty"      # pretty or json

Log Levels

Level   Use Case
trace   Extreme verbosity, development only
debug   Development and troubleshooting
info    Normal production logging
warn    Potential issues, non-critical
error   Errors requiring attention

Log Format

Pretty (Development):

2024-12-03T10:30:45.123Z  INFO prism::server: Server starting on 127.0.0.1:3030
2024-12-03T10:30:45.456Z  INFO prism::upstream: Added upstream: alchemy-mainnet
2024-12-03T10:30:45.789Z  INFO prism::health: Health checker started (interval: 60s)
2024-12-03T10:31:15.234Z  WARN prism::upstream: Upstream latency high upstream=infura latency_ms=850
2024-12-03T10:32:00.567Z ERROR prism::upstream: Circuit breaker opened upstream=quicknode failures=5

JSON (Production):

{"timestamp":"2024-12-03T10:30:45.123Z","level":"INFO","target":"prism::server","message":"Server starting on 127.0.0.1:3030"}
{"timestamp":"2024-12-03T10:30:45.456Z","level":"INFO","target":"prism::upstream","message":"Added upstream: alchemy-mainnet"}
{"timestamp":"2024-12-03T10:31:15.234Z","level":"WARN","target":"prism::upstream","message":"Upstream latency high","upstream":"infura","latency_ms":850}
{"timestamp":"2024-12-03T10:32:00.567Z","level":"ERROR","target":"prism::upstream","message":"Circuit breaker opened","upstream":"quicknode","failures":5}
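One reason to prefer the JSON format in production is that each line can be parsed without custom regular expressions. The sketch below filters a stream of JSON log lines by severity, assuming the `level` field shown above; the function name `filter_logs` is hypothetical.

```python
import json

# Severity order, lowest to highest, matching the configurable log levels.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR"]


def filter_logs(lines, min_level="WARN"):
    """Return parsed log events at or above min_level."""
    threshold = LEVELS.index(min_level)
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(event, dict):
            continue
        level = str(event.get("level", "")).upper()
        if level in LEVELS and LEVELS.index(level) >= threshold:
            events.append(event)
    return events
```

Run against the four sample lines above with the default threshold, this keeps the `WARN` latency event and the `ERROR` circuit breaker event and drops the two `INFO` lines.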

Important Log Events

Startup:

INFO Server starting on 127.0.0.1:3030
INFO Added upstream: alchemy-mainnet
INFO Health checker started (interval: 60s)
INFO Cache manager initialized

Upstream Issues:

WARN Upstream latency high upstream=infura latency_ms=850
WARN Health check failed upstream=quicknode error="connection timeout"
ERROR Circuit breaker opened upstream=quicknode failures=5

Cache Events:

INFO Hot window advanced new_start=18499800
DEBUG Cache hit method=eth_getBlockByNumber block=18500000
DEBUG Partial cache fulfillment method=eth_getLogs cached_pct=0.75

Reorg Detection:

WARN Reorg detected depth=2 block=18499995
INFO Cache invalidated from_block=18499995 to_block=18500000

Alerting Recommendations

Critical Alerts

No Healthy Upstreams:

rpc_healthy_upstreams == 0

High Error Rate:

rate(rpc_requests_error_total[5m]) / rate(rpc_requests_total[5m]) > 0.05

All Circuit Breakers Open:

sum(rpc_circuit_breaker_state) == count(rpc_circuit_breaker_state)
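This expression relies on the gauge encoding (0=closed, 0.5=half-open, 1=open): the sum can only equal the number of upstreams when every breaker reports 1. The sketch below restates the check in Python to make that explicit; the function name is hypothetical.

```python
def all_breakers_open(states):
    """states: one gauge value per upstream (0, 0.5, or 1)."""
    if not states:
        # With no series, the PromQL comparison returns nothing and no alert fires.
        return False
    return sum(states) == len(states)
```

A single half-open breaker keeps the sum below the count, so the alert clears as soon as any upstream starts probing for recovery.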

Warning Alerts

Low Cache Hit Rate:

rate(rpc_cache_hits_total[5m]) /
  (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m])) < 0.7

High P99 Latency:

histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m])) > 1.0

Upstream Block Lag:

rpc_upstream_block_head_lag > 10

Frequent Reorgs:

increase(rpc_reorgs_detected_total[1h]) > 2

Sample Alert Rules

groups:
  - name: prism
    interval: 30s
    rules:
      - alert: NoHealthyUpstreams
        expr: rpc_healthy_upstreams == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy upstreams available"
          description: "All upstreams are unhealthy; service status is unhealthy"

      - alert: HighErrorRate
        expr: |
          rate(rpc_requests_error_total[5m]) /
          rate(rpc_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: LowCacheHitRate
        expr: |
          rate(rpc_cache_hits_total[5m]) /
          (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m])) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

      - alert: CircuitBreakerOpen
        expr: rpc_circuit_breaker_state == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.upstream }}"
          description: "Upstream {{ $labels.upstream }} is isolated"

Dashboards

Grafana Dashboard Example

Request Throughput Panel:

sum(rate(rpc_requests_total[5m])) by (method)

Cache Hit Rate Panel:

sum(rate(rpc_cache_hits_total[5m])) /
  (sum(rate(rpc_cache_hits_total[5m])) + sum(rate(rpc_cache_misses_total[5m])))

Latency Percentiles Panel:

histogram_quantile(0.50, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))

Upstream Health Panel:

rpc_upstream_health

Error Rate Panel:

sum(rate(rpc_requests_error_total[5m])) by (upstream)

Next: Explore the Routing Strategies or Caching Guide for more details.
