Monitoring & Observability

Comprehensive guide to monitoring Prism with Prometheus metrics, health checks, and logging.

Overview

Prism exposes observability data through four channels:

  • Prometheus Metrics: 50+ metrics covering requests, caching, upstream health, routing

  • Health Endpoint: Real-time system status and upstream health

  • Structured Logging: JSON or pretty-printed logs with configurable levels

  • Request Tracing: Cache status headers for debugging

Enabling Metrics

Access metrics:
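
Prometheus-compatible endpoints serve metrics in a plain-text exposition format. As a minimal sketch for ad-hoc inspection, the Python snippet below parses that format into (name, labels, value) tuples. The metric names in `SAMPLE` are hypothetical placeholders, not actual Prism metric names; substitute the text returned by your own metrics endpoint.

```python
import re

# Matches sample lines such as: metric_name{label="value"} 12.5
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical exposition text; these are NOT real Prism metric names.
SAMPLE = """# HELP example_requests_total Hypothetical counter.
# TYPE example_requests_total counter
example_requests_total{method="eth_call",upstream="a"} 42
example_up 1
"""


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples.

    Simplified: skips comment lines (# HELP / # TYPE), ignores optional
    timestamps, and does not unescape label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples
```

For production use, an established client library is a safer choice than a hand-written parser; this sketch only illustrates the shape of the data.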


Prometheus Metrics

Core RPC Metrics

Request Counters

Total requests by method and upstream:

Successful requests:

Failed requests:

Request Latency

Latency histogram by method and upstream:

Calculate percentiles in Prometheus:
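
In PromQL, percentiles are derived from histogram buckets with the built-in `histogram_quantile` function. The Python sketch below approximates the linear interpolation that function performs, to show why the result is an estimate bounded by the bucket layout. It assumes positive bucket bounds (as for latencies) and is a simplification, not a reimplementation of Prometheus.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, like the
    `le`-labelled series of a Prometheus histogram. The highest bound must
    be +Inf. Assumes all finite bounds are positive.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    upper, count = buckets[index]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

Because of the interpolation, a reported P99 can never be more precise than the width of the bucket it lands in.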

Cache Metrics

Cache Hits and Misses

Cache hit rate:
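
As a sketch of the arithmetic behind a hit-rate figure, the helper below divides hits by total lookups. The function and parameter names are illustrative, and how Prism counts partial cache hits is not specified here, so they are left out of the calculation.

```python
def cache_hit_rate(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        # Avoid dividing by zero before any traffic has been served.
        return None
    return hits / total
```

When computing this from counters, use the increase over a time window rather than lifetime totals, so the rate reflects recent behaviour.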

Cache Statistics

Block cache:

Log cache:

Transaction cache:

Partial Cache Fulfillment

Cache Evictions

Upstream Health Metrics

Health Status

Upstream health gauge (1=healthy, 0=unhealthy):

Number of healthy upstreams:

Health Check Results

Upstream Latency

Latency percentiles:

Upstream Error Metrics

Error Counts

Error Details

Error Rate

Circuit Breaker Metrics

Circuit breaker state (0=closed, 0.5=half-open, 1=open):
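
When consuming this gauge in scripts, it helps to map the numeric value back to a named state. A minimal sketch using the three values listed above; the enum and function names are illustrative.

```python
from enum import Enum


class BreakerState(Enum):
    """Gauge values as documented: 0=closed, 0.5=half-open, 1=open."""
    CLOSED = 0.0
    HALF_OPEN = 0.5
    OPEN = 1.0


def decode_breaker_state(value):
    """Convert a gauge sample to a BreakerState; raises ValueError otherwise."""
    return BreakerState(float(value))
```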

State transitions:

Failure count:

Open duration:

Routing Metrics

Upstream Selection

Scoring Metrics

Composite scores:

Score factors:

Block lag:

Hedging Metrics

Hedged requests:

Hedge wins (which request finished first):

Hedge delay:

Hedge skips:
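
To make the hedging metrics above easier to interpret, the sketch below shows the general hedged-request pattern: send a primary request, start a backup if the primary has not answered within a delay, and use whichever finishes first. This illustrates the technique only and is not Prism's implementation; `hedged_call` and its parameters are hypothetical.

```python
import asyncio


async def hedged_call(primary, backup, hedge_delay):
    """Run `primary`; if it is still pending after `hedge_delay` seconds,
    also run `backup` and return whichever finishes first.

    Returns (winner, result), where winner is "primary" or "hedge".
    """
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay)
    if done:
        # The primary answered in time, so no hedge request is sent.
        return "primary", first.result()

    second = asyncio.ensure_future(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # Drop the slower request.
    winner = first if first in done else second
    return ("primary" if winner is first else "hedge"), winner.result()
```

In this model, a "hedge win" corresponds to the `"hedge"` branch and a skipped hedge to the early return.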

Consensus Metrics

Authentication Metrics

Auth attempts:

Auth cache:

Rate limiting:

Quota management:

Method permissions:

Chain State Metrics

Current chain tip:

Chain tip update latency:

Reorg detection:

System Metrics

Active connections:

WebSocket connections:

Batch requests:


Health Endpoint

Get real-time system status at /health.
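
A minimal polling sketch using only the Python standard library. It assumes the endpoint returns a JSON body with a top-level `status` field; that shape, and the base URL you pass in, are assumptions to adjust to your deployment.

```python
import json
import urllib.request


def parse_health(raw):
    """Extract the overall status from a health response body.

    ASSUMPTION: the body is a JSON object with a top-level "status" field.
    """
    payload = json.loads(raw)
    return payload.get("status", "unknown")


def fetch_health(base_url, timeout=5.0):
    """GET <base_url>/health and return the parsed status string.

    Note: urlopen raises urllib.error.HTTPError for non-2xx responses.
    """
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return parse_health(resp.read())
```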

Request

Response

Health Status Values

Status      Description
healthy     All systems operational
degraded    Some upstreams unhealthy
unhealthy   No healthy upstreams
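
The three status values can be read as a function of per-upstream health. The sketch below encodes that reading; it is an interpretation of the descriptions above, and Prism may take further signals into account.

```python
def overall_status(upstream_health):
    """Derive an overall status from a mapping of upstream name -> healthy flag."""
    healthy = sum(1 for ok in upstream_health.values() if ok)
    if healthy == 0:
        return "unhealthy"  # No healthy upstreams (also covers an empty mapping).
    if healthy < len(upstream_health):
        return "degraded"   # Some upstreams unhealthy.
    return "healthy"        # Every upstream is healthy.
```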


Logging

Configuration

Log Levels

Level   Use Case
trace   Extreme verbosity, development only
debug   Development and troubleshooting
info    Normal production logging
warn    Potential issues, non-critical
error   Errors requiring attention
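
Log levels conventionally form an ordered scale: setting a level emits messages at that level and every more severe one. The sketch below assumes Prism follows this convention, with the levels ordered as listed above; the helper name is illustrative.

```python
# Ordered from most verbose to most severe, as in the table above.
LEVELS = ["trace", "debug", "info", "warn", "error"]


def is_enabled(message_level, configured_level):
    """Return True if a message at `message_level` passes the configured level."""
    return LEVELS.index(message_level.lower()) >= LEVELS.index(configured_level.lower())
```

For example, with the level set to `info`, `warn` and `error` messages are emitted while `debug` and `trace` are suppressed.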

Log Format

Pretty (Development):

JSON (Production):

Important Log Events

Startup:

Upstream Issues:

Cache Events:

Reorg Detection:


Alerting Recommendations

Critical Alerts

No Healthy Upstreams:

High Error Rate:

All Circuit Breakers Open:

Warning Alerts

Low Cache Hit Rate:

High P99 Latency:

Upstream Block Lag:

Frequent Reorgs:

Sample Alert Rules


Dashboards

Grafana Dashboard Example

Request Throughput Panel:

Cache Hit Rate Panel:

Latency Percentiles Panel:

Upstream Health Panel:

Error Rate Panel:


Next: Explore the Routing Strategies or Caching Guide for more details.
