Monitoring & Observability
Comprehensive guide to monitoring Prism with Prometheus metrics, health checks, and logging.
Overview
Prism provides comprehensive observability through:
Prometheus Metrics: 50+ metrics covering requests, caching, upstream health, routing
Health Endpoint: Real-time system status and upstream health
Structured Logging: JSON or pretty-printed logs with configurable levels
Request Tracing: Cache status headers for debugging
Enabling Metrics
[metrics]
enabled = true
prometheus_port = 9090 # Optional, metrics also on main port
Access metrics:
curl http://localhost:3030/metrics
Prometheus Metrics
Core RPC Metrics
Request Counters
Total requests by method and upstream:
rpc_requests_total{method="eth_getLogs",upstream="alchemy"} 125000
Successful requests:
rpc_requests_success_total{method="eth_getLogs",upstream="alchemy"} 124500
Failed requests:
rpc_requests_error_total{method="eth_getLogs",upstream="alchemy"} 500
Request Latency
Latency histogram by method and upstream:
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.05"} 80000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.1"} 115000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="0.5"} 123000
rpc_request_duration_seconds_bucket{method="eth_getLogs",upstream="alchemy",le="+Inf"} 125000
rpc_request_duration_seconds_sum{method="eth_getLogs",upstream="alchemy"} 6250.5
rpc_request_duration_seconds_count{method="eth_getLogs",upstream="alchemy"} 125000
Calculate percentiles in Prometheus:
# P50 latency
histogram_quantile(0.5, rate(rpc_request_duration_seconds_bucket[5m]))
# P95 latency
histogram_quantile(0.95, rate(rpc_request_duration_seconds_bucket[5m]))
# P99 latency
histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))
Cache Metrics
Cache Hits and Misses
rpc_cache_hits_total{method="eth_getBlockByNumber"} 95000
rpc_cache_misses_total{method="eth_getBlockByNumber"} 5000
Cache hit rate:
rate(rpc_cache_hits_total[5m]) /
(rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m]))
Cache Statistics
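As a quick arithmetic check on the cache hit-rate expression, the same ratio can be computed by hand from the increase in the hit and miss counters over one window. The sketch below is illustrative only: `cache_hit_rate` is a hypothetical helper, not part of Prism, and it treats the sample counter values as the increase over a single window.

```python
from typing import Optional

def cache_hit_rate(hits_delta: float, misses_delta: float) -> Optional[float]:
    """Fraction of cache lookups served from cache over one window."""
    total = hits_delta + misses_delta
    if total == 0:
        # No lookups in the window: the ratio is undefined.
        return None
    return hits_delta / total

# Sample values for eth_getBlockByNumber, treated as one window's increase.
print(cache_hit_rate(95000, 5000))
```

The PromQL expression evaluates to NaN rather than `None` when both rates are zero, so a threshold alert built on it does not fire while there is no traffic.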
Block cache:
rpc_block_cache_hot_window_size 200
rpc_block_cache_lru_entries 8500
rpc_block_cache_bytes 52428800 # 50MB
Log cache:
rpc_log_cache_chunks 50
rpc_log_cache_indexed_blocks 50000
rpc_log_cache_bitmap_entries 25000
rpc_log_cache_bytes 104857600 # 100MB
Transaction cache:
rpc_transaction_cache_entries 25000
rpc_receipt_cache_entries 25000
rpc_transaction_cache_bytes 41943040 # 40MB
Partial Cache Fulfillment
rpc_partial_cache_fulfillments_total{method="eth_getLogs"} 15000
rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="0.5"} 5000
rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="0.75"} 10000
rpc_partial_cache_fulfillment_percentage_bucket{method="eth_getLogs",le="1.0"} 15000
Cache Evictions
rpc_cache_evictions_total{cache_type="block"} 1500
rpc_cache_evictions_total{cache_type="log"} 3200
rpc_cache_evictions_total{cache_type="transaction"} 2100
Upstream Health Metrics
Health Status
Upstream health gauge (1=healthy, 0=unhealthy):
rpc_upstream_health{upstream="alchemy"} 1
rpc_upstream_health{upstream="infura"} 1
rpc_upstream_health{upstream="quicknode"} 0
Number of healthy upstreams:
rpc_healthy_upstreams 2
Health Check Results
rpc_health_check_success_total{upstream="alchemy"} 2880
rpc_health_check_failure_total{upstream="alchemy"} 5
rpc_health_check_duration_seconds_bucket{upstream="alchemy",le="0.05"} 2800
rpc_health_check_duration_seconds_bucket{upstream="alchemy",le="0.1"} 2880
Upstream Latency
Latency percentiles:
rpc_upstream_latency_p50_ms{upstream="alchemy"} 48
rpc_upstream_latency_p95_ms{upstream="alchemy"} 120
rpc_upstream_latency_p99_ms{upstream="alchemy"} 350
rpc_upstream_latency_avg_ms{upstream="alchemy"} 62
Upstream Error Metrics
Error Counts
rpc_upstream_errors_total{upstream="alchemy",error_type="timeout"} 25
rpc_upstream_errors_total{upstream="alchemy",error_type="connection_failed"} 3
rpc_upstream_errors_total{upstream="infura",error_type="rpc_rate_limit"} 145
Error Details
rpc_jsonrpc_errors_total{upstream="alchemy",code="-32005",category="provider_error"} 15
Error Rate
rpc_upstream_error_rate{upstream="alchemy"} 0.004 # 0.4% error rate
Circuit Breaker Metrics
Circuit breaker state (0=closed, 0.5=half-open, 1=open):
rpc_circuit_breaker_state{upstream="alchemy"} 0
State transitions:
rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="open"} 3
rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="closed"} 3
rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="half_open"} 3
Failure count:
rpc_circuit_breaker_failure_count{upstream="alchemy"} 0
Open duration:
rpc_circuit_breaker_open_duration_seconds_bucket{upstream="alchemy",le="60"} 2
rpc_circuit_breaker_open_duration_seconds_bucket{upstream="alchemy",le="300"} 3
Routing Metrics
Upstream Selection
rpc_upstream_selections_total{upstream="alchemy",reason="best_score"} 85000
rpc_upstream_selections_total{upstream="infura",reason="fallback"} 15000
Scoring Metrics
Composite scores:
rpc_upstream_composite_score{upstream="alchemy"} 0.875
rpc_upstream_composite_score{upstream="infura"} 0.720
Score factors:
rpc_upstream_latency_factor{upstream="alchemy"} 0.92
rpc_upstream_error_rate_factor{upstream="alchemy"} 0.996
rpc_upstream_throttle_factor{upstream="alchemy"} 1.0
rpc_upstream_block_lag_factor{upstream="alchemy"} 1.0
Block lag:
rpc_upstream_block_head_lag{upstream="alchemy"} 0
rpc_upstream_block_head_lag{upstream="infura"} 2
Hedging Metrics
Hedged requests:
rpc_hedged_requests_total{primary="alchemy",hedged="infura"} 1500
Hedge wins (which request finished first):
rpc_hedge_wins_total{upstream="alchemy",type="primary"} 850
rpc_hedge_wins_total{upstream="infura",type="hedged"} 650
Hedge delay:
rpc_hedge_delay_ms_bucket{upstream="alchemy",le="50"} 800
rpc_hedge_delay_ms_bucket{upstream="alchemy",le="100"} 1400
rpc_hedge_delay_ms_bucket{upstream="alchemy",le="200"} 1500
Hedge skips:
rpc_hedge_skipped_total{reason="insufficient_data"} 250
Consensus Metrics
rpc_consensus_requests_total{result="success"} 25000
rpc_consensus_requests_total{result="failure"} 15
rpc_consensus_agreement_rate 0.9994
rpc_consensus_duration_seconds_bucket{le="0.1"} 5000
rpc_consensus_duration_seconds_bucket{le="0.5"} 24000
Authentication Metrics
Auth attempts:
rpc_auth_success_total{key_id="production-api"} 125000
rpc_auth_failure_total{key_id="unknown"} 45
Auth cache:
rpc_auth_cache_hits_total 120000
rpc_auth_cache_misses_total 5045
rpc_auth_cache_entries 50
Rate limiting:
rpc_rate_limit_allowed_total{key="production-api"} 124500
rpc_rate_limit_rejected_total{key="production-api"} 500
Quota management:
rpc_auth_quota_exceeded_total{key_id="production-api"} 3
Method permissions:
rpc_auth_method_denied_total{key_id="logs-only",method="eth_getBlockByNumber"} 12
Chain State Metrics
Current chain tip:
rpc_chain_tip_block 18500000
Chain tip update latency:
rpc_chain_tip_update_latency_seconds_bucket{le="0.1"} 2800
Reorg detection:
rpc_reorgs_detected_total 5
rpc_reorg_depth_bucket{le="2"} 3
rpc_reorg_depth_bucket{le="5"} 4
rpc_reorg_depth_bucket{le="10"} 5
rpc_last_reorg_block 18499995
System Metrics
Active connections:
rpc_active_connections 150
WebSocket connections:
rpc_websocket_active_connections 4
rpc_websocket_connections_total 125
rpc_websocket_disconnections_total 121
Batch requests:
rpc_batch_requests_total 2500
rpc_batch_request_size_bucket{le="5"} 1500
rpc_batch_request_size_bucket{le="10"} 2200
rpc_batch_request_size_bucket{le="25"} 2500
rpc_batch_request_duration_seconds_bucket{le="0.5"} 2000
rpc_batch_request_duration_seconds_bucket{le="1.0"} 2400
Health Endpoint
Get real-time system status at /health.
Request
GET /health HTTP/1.1
Host: localhost:3030
Response
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "upstreams": [
    {
      "name": "alchemy-mainnet",
      "healthy": true,
      "chain_id": 1,
      "latest_block": 18500000,
      "finalized_block": 18499900,
      "response_time_ms": 45,
      "error_count": 0,
      "circuit_breaker": "closed",
      "latency": {
        "p50_ms": 48,
        "p95_ms": 120,
        "p99_ms": 350
      },
      "requests": {
        "total": 125000,
        "success": 124500,
        "errors": 500
      }
    },
    {
      "name": "infura-mainnet",
      "healthy": true,
      "chain_id": 1,
      "latest_block": 18500001,
      "finalized_block": 18499901,
      "response_time_ms": 52,
      "error_count": 0,
      "circuit_breaker": "closed",
      "latency": {
        "p50_ms": 55,
        "p95_ms": 140,
        "p99_ms": 420
      },
      "requests": {
        "total": 75000,
        "success": 74200,
        "errors": 800
      }
    }
  ],
  "cache": {
    "enabled": true,
    "blocks": {
      "hot_window_size": 200,
      "cached_headers": 8500,
      "cached_bodies": 6200,
      "memory_bytes": 52428800
    },
    "logs": {
      "chunks": 50,
      "indexed_blocks": 50000,
      "memory_bytes": 104857600
    },
    "transactions": {
      "cached_transactions": 25000,
      "cached_receipts": 25000,
      "memory_bytes": 41943040
    },
    "hit_rate": 0.87
  },
  "metrics": {
    "total_requests": 200000,
    "total_errors": 1300,
    "average_latency_ms": 48.5,
    "cache_hit_rate": 0.87
  }
}
Health Status Values
healthy: All systems operational
degraded: Some upstreams unhealthy
unhealthy: No healthy upstreams
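The three status values follow from the per-upstream `healthy` flags in the response. The sketch below shows one way a client could recompute the status from the `upstreams` array, for example to cross-check the reported `status` field. `derive_status` is a hypothetical helper, not Prism's implementation, and treating an empty upstream list as `unhealthy` is an assumption.

```python
from typing import Iterable, Mapping

def derive_status(upstreams: Iterable[Mapping]) -> str:
    """Map per-upstream health flags to healthy / degraded / unhealthy."""
    flags = [bool(u.get("healthy")) for u in upstreams]
    if not any(flags):
        # No healthy upstreams. Also covers an empty list (assumption).
        return "unhealthy"
    if all(flags):
        return "healthy"
    # At least one upstream is healthy and at least one is not.
    return "degraded"

# Upstream entries trimmed to the fields this check needs.
upstreams = [
    {"name": "alchemy-mainnet", "healthy": True},
    {"name": "infura-mainnet", "healthy": True},
]
print(derive_status(upstreams))
```

A load balancer probe would typically treat `degraded` as still routable and only pull the instance on `unhealthy`, but that policy is a deployment choice rather than something the endpoint prescribes.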
Logging
Configuration
[logging]
level = "info" # trace, debug, info, warn, error
format = "pretty" # pretty or json
Log Levels
trace: Extreme verbosity, development only
debug: Development and troubleshooting
info: Normal production logging
warn: Potential issues, non-critical
error: Errors requiring attention
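To illustrate how a level threshold behaves, the sketch below filters JSON-formatted log records by their `level` field. It assumes the usual semantics, where a configured level emits events at that level and every more severe one; the helper names are hypothetical and the sample records are trimmed to two fields.

```python
import json

# Least to most severe, in the order the levels are listed above.
LEVELS = ["trace", "debug", "info", "warn", "error"]

def is_emitted(event_level: str, configured_level: str) -> bool:
    """True if an event passes the configured threshold (assumed semantics)."""
    return LEVELS.index(event_level.lower()) >= LEVELS.index(configured_level.lower())

def filter_json_logs(lines, minimum_level):
    """Keep JSON log records whose "level" field is at or above minimum_level."""
    records = (json.loads(line) for line in lines if line.strip())
    return [r for r in records if is_emitted(r["level"], minimum_level)]

# Trimmed sample records for illustration only.
sample = [
    '{"level":"DEBUG","message":"Cache hit"}',
    '{"level":"INFO","message":"Hot window advanced"}',
    '{"level":"WARN","message":"Reorg detected"}',
]
for record in filter_json_logs(sample, "info"):
    print(record["level"], record["message"])
```

The same function works for post-hoc triage: calling it with `"warn"` on a production log keeps only warnings and errors.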
Log Format
Pretty (Development):
2024-12-03T10:30:45.123Z INFO prism::server: Server starting on 127.0.0.1:3030
2024-12-03T10:30:45.456Z INFO prism::upstream: Added upstream: alchemy-mainnet
2024-12-03T10:30:45.789Z INFO prism::health: Health checker started (interval: 60s)
2024-12-03T10:31:15.234Z WARN prism::upstream: Upstream latency high upstream=infura latency_ms=850
2024-12-03T10:32:00.567Z ERROR prism::upstream: Circuit breaker opened upstream=quicknode failures=5
JSON (Production):
{"timestamp":"2024-12-03T10:30:45.123Z","level":"INFO","target":"prism::server","message":"Server starting on 127.0.0.1:3030"}
{"timestamp":"2024-12-03T10:30:45.456Z","level":"INFO","target":"prism::upstream","message":"Added upstream: alchemy-mainnet"}
{"timestamp":"2024-12-03T10:31:15.234Z","level":"WARN","target":"prism::upstream","message":"Upstream latency high","upstream":"infura","latency_ms":850}
{"timestamp":"2024-12-03T10:32:00.567Z","level":"ERROR","target":"prism::upstream","message":"Circuit breaker opened","upstream":"quicknode","failures":5}
Important Log Events
Startup:
INFO Server starting on 127.0.0.1:3030
INFO Added upstream: alchemy-mainnet
INFO Health checker started (interval: 60s)
INFO Cache manager initialized
Upstream Issues:
WARN Upstream latency high upstream=infura latency_ms=850
WARN Health check failed upstream=quicknode error="connection timeout"
ERROR Circuit breaker opened upstream=quicknode failures=5
Cache Events:
INFO Hot window advanced new_start=18499800
DEBUG Cache hit method=eth_getBlockByNumber block=18500000
DEBUG Partial cache fulfillment method=eth_getLogs cached_pct=0.75
Reorg Detection:
WARN Reorg detected depth=2 block=18499995
INFO Cache invalidated from_block=18499995 to_block=18500000
Alerting Recommendations
Critical Alerts
No Healthy Upstreams:
rpc_healthy_upstreams == 0
High Error Rate:
rate(rpc_requests_error_total[5m]) / rate(rpc_requests_total[5m]) > 0.05
All Circuit Breakers Open:
sum(rpc_circuit_breaker_state) == count(rpc_circuit_breaker_state)
Warning Alerts
Low Cache Hit Rate:
rate(rpc_cache_hits_total[5m]) /
(rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m])) < 0.7
High P99 Latency:
histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m])) > 1.0
Upstream Block Lag:
rpc_upstream_block_head_lag > 10
Frequent Reorgs:
increase(rpc_reorgs_detected_total[1h]) > 2
Sample Alert Rules
groups:
  - name: prism
    interval: 30s
    rules:
      - alert: NoHealthyUpstreams
        expr: rpc_healthy_upstreams == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy upstreams available"
          description: "All upstreams are unhealthy; /health reports status unhealthy"
      - alert: HighErrorRate
        expr: |
          rate(rpc_requests_error_total[5m]) /
          rate(rpc_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: LowCacheHitRate
        expr: |
          rate(rpc_cache_hits_total[5m]) /
          (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m])) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"
      - alert: CircuitBreakerOpen
        expr: rpc_circuit_breaker_state == 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for {{ $labels.upstream }}"
          description: "Upstream {{ $labels.upstream }} is isolated"
Dashboards
Grafana Dashboard Example
Request Throughput Panel:
sum(rate(rpc_requests_total[5m])) by (method)
Cache Hit Rate Panel:
sum(rate(rpc_cache_hits_total[5m])) /
(sum(rate(rpc_cache_hits_total[5m])) + sum(rate(rpc_cache_misses_total[5m])))
Latency Percentiles Panel:
histogram_quantile(0.50, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(rpc_request_duration_seconds_bucket[5m])) by (le))
Upstream Health Panel:
rpc_upstream_health
Error Rate Panel:
sum(rate(rpc_requests_error_total[5m])) by (upstream)
Next: Explore the Routing Strategies or Caching Guide for more details.