Troubleshooting & FAQ

Comprehensive troubleshooting guide for diagnosing and resolving common Prism issues.

Connection & Upstream Issues

No Healthy Upstreams Available

Error: "No healthy upstreams available" (Error code: -32050)

Symptoms:

  • All requests fail with -32050 error

  • /health endpoint returns "status": "unhealthy"

  • Metrics show rpc_healthy_upstreams == 0

Causes & Solutions:

1. All Upstreams Failing Health Checks

Check health status:

curl http://localhost:3030/health | jq '.upstreams'

Check metrics:

curl -s http://localhost:3030/metrics | grep rpc_upstream_health
# rpc_upstream_health{upstream="alchemy"} 0
# rpc_upstream_health{upstream="infura"} 0

Solutions:

  • Verify upstream URLs are correct in configuration

  • Check network connectivity to upstream providers

  • Verify API keys are valid and not rate-limited

  • Check upstream provider status pages

2. Incorrect Configuration

Check configuration:

[[upstreams]]
name = "alchemy-mainnet"
url = "https://eth-mainnet.g.alchemy.com/v2/YOUR_API_KEY"  # Verify this URL
chain_id = 1

Common mistakes:

  • Missing or invalid API key in URL

  • Wrong chain ID (e.g., mainnet vs testnet mismatch)

  • HTTP instead of HTTPS

  • Incorrect endpoint path

3. Circuit Breakers All Open

Check circuit breaker state:

curl -s http://localhost:3030/metrics | grep rpc_circuit_breaker_state
# rpc_circuit_breaker_state{upstream="alchemy"} 1  # 1 = open (bad)

Solution: Wait for circuit breakers to reset, or restart Prism to reset state:

# Circuit breakers will automatically transition to half-open after timeout
# Default timeout: 60 seconds

Reduce sensitivity:

[upstreams.circuit_breaker]
failure_threshold = 10    # Increase from default 5
timeout_seconds = 30      # Decrease recovery time

Upstream Connection Timeouts

Error: "Request timeout" or "Connection failed: connection timeout"

Symptoms:

  • Requests take 30+ seconds and then timeout

  • High P99 latency in metrics

  • Upstream error metrics show error_type="timeout"

Check timeout metrics:

curl -s http://localhost:3030/metrics | grep timeout
# rpc_upstream_errors_total{upstream="alchemy",error_type="timeout"} 125

Solutions:

1. Increase Timeout Values

[upstreams]
timeout_seconds = 60      # Increase from default 30
max_retries = 3           # Increase retry attempts

2. Check Network Latency

# Test direct connection to upstream
time curl -s -o /dev/null -w "%{time_total}\n" \
  https://eth-mainnet.g.alchemy.com/v2/YOUR_KEY \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

If latency > 5 seconds:

  • Network issue between your server and upstream provider

  • Try different upstream providers

  • Use providers geographically closer to your server

3. Reduce Concurrent Load

[upstreams]
max_concurrent_requests = 50  # Reduce from default 100

HTTP 429 Rate Limiting

Error: "HTTP error: 429" or RPC error -32005: "Limit exceeded"

Symptoms:

  • Intermittent failures during high traffic

  • Error rate spikes in metrics

  • Upstream provider returns "Too Many Requests"

Check rate limit errors:

curl -s http://localhost:3030/metrics | grep rate_limit
# rpc_upstream_errors_total{upstream="alchemy",error_type="rpc_rate_limit"} 450
# rpc_jsonrpc_errors_total{upstream="alchemy",code="-32005"} 450

Solutions:

1. Add More Upstreams

# Distribute load across multiple providers
[[upstreams]]
name = "alchemy-mainnet"
url = "https://eth-mainnet.g.alchemy.com/v2/KEY1"

[[upstreams]]
name = "infura-mainnet"
url = "https://mainnet.infura.io/v3/KEY2"

[[upstreams]]
name = "quicknode-mainnet"
url = "https://your-endpoint.quiknode.pro/KEY3/"

2. Enable Caching to Reduce Upstream Calls

[cache]
enabled = true

[cache.block_cache]
hot_window_size = 500      # Cache more recent blocks

[cache.log_cache]
chunk_size = 100           # Larger chunks cover wider block ranges per upstream fetch
max_chunks = 200

Verify cache effectiveness:

# Check cache hit rate
curl -s http://localhost:3030/metrics | grep cache_hits

3. Implement Rate Limiting at Prism Level

[rate_limiting]
enabled = true
requests_per_second = 50   # Limit client request rate

Cache Problems

Low Cache Hit Rate

Symptoms:

  • Cache hit rate < 70% consistently

  • Most requests show X-Cache-Status: MISS

  • High upstream request counts

Check cache hit rate:

curl -s http://localhost:3030/metrics | grep -E 'cache_(hits|misses)'
# Calculate: hits / (hits + misses)

PromQL query:

rate(rpc_cache_hits_total[5m]) /
  (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m]))
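
If you prefer to compute the ratio in one step, the following Node sketch fetches /metrics and sums the hit and miss counters shown above (it assumes Node 18+ for the global fetch and the default port 3030; adjust for your deployment):

// cache-hit-rate.js - sum rpc_cache_hits_total / rpc_cache_misses_total across labels
const METRICS_URL = process.env.PRISM_METRICS_URL || 'http://localhost:3030/metrics';

function sumMetric(text, name) {
  // Matches `name{labels} value` or `name value` lines in the Prometheus text format
  const re = new RegExp(`^${name}(?:\\{[^}]*\\})? (\\S+)`, 'gm');
  let total = 0;
  for (const m of text.matchAll(re)) total += Number(m[1]);
  return total;
}

async function main() {
  const body = await (await fetch(METRICS_URL)).text();
  const hits = sumMetric(body, 'rpc_cache_hits_total');
  const misses = sumMetric(body, 'rpc_cache_misses_total');
  const rate = hits + misses === 0 ? 0 : hits / (hits + misses);
  console.log(`hits=${hits} misses=${misses} hit_rate=${(rate * 100).toFixed(1)}%`);
}

main().catch(console.error);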

Solutions:

1. Increase Cache Sizes

[cache.block_cache]
hot_window_size = 1000     # Increase from default 200
max_headers = 50000        # Increase from default 10000
max_bodies = 20000         # Increase from default 5000

[cache.log_cache]
max_chunks = 500           # Increase from default 100

[cache.transaction_cache]
max_transactions = 100000  # Increase from default 50000
max_receipts = 100000

2. Adjust Chunk Size for Log Cache

[cache.log_cache]
chunk_size = 100           # Smaller = more granular, better hit rate
                           # Larger = fewer chunks, less memory overhead

Trade-off:

  • Smaller chunk_size (10-50): Better partial cache hits, more memory

  • Larger chunk_size (100-500): Less memory, may miss partial ranges

3. Enable Cache Warming

Warm cache with recent data on startup:

[cache]
warm_on_startup = true
warm_blocks_count = 1000   # Warm last 1000 blocks

4. Check Request Patterns

Problem: Random historical queries bypass cache

Example: Querying random old blocks

# These all miss cache if not in hot window
eth_getBlockByNumber("0x500000", false)
eth_getBlockByNumber("0x600000", false)
eth_getBlockByNumber("0x750000", false)

Solution:

  • Use recent blocks when possible (last 200 blocks have highest hit rate)

  • For historical queries, batch sequential ranges to benefit from cache (see the sketch below)

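As a rough sketch of the batching advice above, the snippet below requests a contiguous block range in a single JSON-RPC batch so adjacent blocks land in, and are served from, the same cached window. The endpoint URL and block range are placeholders:

// fetch-block-range.js - one batched request for a sequential block range
const PRISM_URL = 'http://localhost:3030/';   // adjust to your deployment

async function getBlockRange(start, count) {
  const batch = Array.from({ length: count }, (_, i) => ({
    jsonrpc: '2.0',
    id: i + 1,
    method: 'eth_getBlockByNumber',
    params: ['0x' + (start + i).toString(16), false],
  }));

  const res = await fetch(PRISM_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(batch),
  });
  return res.json();   // array of responses; match them to requests by id
}

getBlockRange(18500000, 20).then(blocks => console.log(blocks.length, 'responses'));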

High Cache Eviction Rate

Symptoms:

  • rpc_cache_evictions_total increasing rapidly

  • Cache hit rate declining over time

  • Memory pressure on system

Check eviction metrics:

curl -s http://localhost:3030/metrics | grep evictions
# rpc_cache_evictions_total{cache_type="block"} 15000
# rpc_cache_evictions_total{cache_type="log"} 32000

Solutions:

1. Increase Cache Memory Limits

[cache.block_cache]
max_headers = 100000       # Allow more entries before eviction
max_bodies = 50000

[cache.log_cache]
max_chunks = 1000          # More chunks = less eviction

2. Adjust Hot Window Size

[cache.block_cache]
hot_window_size = 500      # Keep more recent blocks in fast cache

3. Check Memory Usage

# Monitor Prism process memory
ps aux | grep prism

# Check system memory
free -h

If memory constrained:

  • Reduce cache sizes

  • Add more RAM to server

  • Enable cache compression (if available)


Cache Invalidation After Reorgs

Symptoms:

  • Cache hit rate drops suddenly

  • Logs show "reorg detected" messages

  • Blocks being refetched repeatedly

Check reorg metrics:

curl -s http://localhost:3030/metrics | grep reorg
# rpc_reorgs_detected_total 8
# rpc_last_reorg_block 18499995

Expected behavior: Cache invalidates blocks during reorgs to maintain consistency

Solutions:

1. Adjust Safety Depth

[reorg_manager]
safety_depth = 20          # Increase from default 12
                           # Blocks beyond this depth are "safe" from reorgs

Trade-off:

  • Larger safety_depth: Less cache invalidation, but stale data during deep reorgs

  • Smaller safety_depth: More aggressive invalidation, always fresh data

2. Check for Reorg Storms

Frequent reorgs indicate:

  • Network issues with upstreams

  • Misconfigured chain_id (wrong network)

  • Upstream providers disagreeing on chain state

Verify chain consistency:

# Check all upstreams report same chain tip
curl http://localhost:3030/health | jq '.upstreams[] | {name, latest_block}'

If upstreams disagree by > 10 blocks:

  • One or more upstreams may be syncing

  • Consider removing slow/stale upstreams
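
A small Node sketch of the same consistency check, assuming the /health response shape shown above (an upstreams array with name and latest_block fields):

// check-tip-consistency.js - flag upstreams lagging the best-known tip
const HEALTH_URL = 'http://localhost:3030/health';
const MAX_LAG = 10;   // blocks of disagreement to tolerate

async function main() {
  const health = await (await fetch(HEALTH_URL)).json();
  const tips = health.upstreams.map(u => ({ name: u.name, block: Number(u.latest_block) }));
  const best = Math.max(...tips.map(t => t.block));

  for (const { name, block } of tips) {
    const lag = best - block;
    const flag = lag > MAX_LAG ? '  <-- possibly syncing or stale' : '';
    console.log(`${name}: ${block} (lag ${lag})${flag}`);
  }
}

main().catch(console.error);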

3. Enable Reorg Coalescing

[reorg_manager]
coalesce_window_ms = 100   # Batch rapid reorgs within 100ms window

Authentication Errors

Invalid or Missing API Key

Error: -32054: "Authentication failed" or -32600: "Invalid Request"

Symptoms:

  • All requests rejected with authentication error

  • No X-API-Key header in requests

Solutions:

1. Include API Key in Request

Header method:

curl -X POST http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key-here" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

Query parameter method:

curl -X POST "http://localhost:3030/?api_key=your-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

2. Verify API Key Configuration

[authentication]
enabled = true

[[authentication.keys]]
key = "prod-key-12345"
name = "production-api"
allowed_methods = ["*"]  # Or specific methods
rate_limit_per_second = 100

3. Check Authentication Metrics

curl -s http://localhost:3030/metrics | grep auth
# rpc_auth_success_total{key_id="production-api"} 125000
# rpc_auth_failure_total{key_id="unknown"} 45

Method Not Allowed for API Key

Error: -32055: "Method not allowed"

Symptoms:

  • Some methods work, others return permission error

  • Authentication succeeds but specific RPC calls fail

Check key permissions:

[[authentication.keys]]
key = "logs-only-key"
allowed_methods = [
  "eth_getLogs",
  "eth_blockNumber"
]
# Calling eth_getBlockByNumber with this key will fail

Solution: Update key permissions or use appropriate key:

[[authentication.keys]]
key = "full-access-key"
allowed_methods = ["*"]  # Allow all methods

Rate Limit Exceeded

Error: -32053: "Rate limit exceeded"

Symptoms:

  • Requests succeed initially, then fail during high traffic

  • Rate limit metrics show rejections

Check rate limit metrics:

curl -s http://localhost:3030/metrics | grep rate_limit
# rpc_rate_limit_rejected_total{key="production-api"} 500

Solutions:

1. Increase Rate Limit

[[authentication.keys]]
key = "production-api"
rate_limit_per_second = 200  # Increase from 100

2. Implement Client-Side Rate Limiting

// Use rate limiting library on client side
const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({
  minTime: 20,  // 50 requests per second
  maxConcurrent: 10
});

limiter.schedule(() => makeRpcRequest());

3. Use Multiple API Keys

Distribute load across multiple API keys:

const keys = ['key1', 'key2', 'key3'];
const keyIndex = requestCount % keys.length;
const apiKey = keys[keyIndex];

Performance Problems

High Request Latency (P99 > 1s)

Symptoms:

  • Slow response times for clients

  • High P99 latency in metrics

  • Timeouts during peak traffic

Check latency metrics:

curl -s http://localhost:3030/metrics | grep duration_seconds

PromQL query:

histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))

Solutions:

1. Enable Caching

[cache]
enabled = true

Verify cache is working:

# Check cache status header
curl -D - http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' | grep -i cache
# X-Cache-Status: FULL

2. Add Faster Upstreams

[[upstreams]]
name = "quicknode-mainnet"
url = "https://your-endpoint.quiknode.pro/KEY/"  # Often lower latency
priority = 10  # Higher priority for faster upstream

3. Enable Request Hedging

[routing]
strategy = "hedge"

[routing.hedge]
enabled = true
hedge_delay_ms = 100       # Send backup request after 100ms

How hedging helps:

  • Sends request to primary upstream

  • If no response in 100ms, sends to backup upstream

  • Returns whichever responds first

  • Reduces tail latency
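
The hedging above happens inside Prism, but the same pattern can be sketched on the client against two endpoints, which may help when comparing providers. The URLs below are placeholders and the delay mirrors hedge_delay_ms:

// hedged-request.js - fire a backup request if the primary is slow
const PRIMARY = 'http://prism-a.internal:3030/';   // placeholder endpoints
const BACKUP = 'http://prism-b.internal:3030/';

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function rpc(url, body) {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  return res.json();
}

async function hedged(body, hedgeDelayMs = 100) {
  const primary = rpc(PRIMARY, body);
  // The backup request only starts after the hedge delay has elapsed
  const backup = sleep(hedgeDelayMs).then(() => rpc(BACKUP, body));
  // Promise.any resolves with the first successful response
  return Promise.any([primary, backup]);
}

hedged({ jsonrpc: '2.0', method: 'eth_blockNumber', params: [], id: 1 }).then(console.log);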

4. Optimize Connection Pooling

[upstreams]
max_concurrent_requests = 200  # Allow more concurrent requests
timeout_seconds = 30           # Reduce from 60 if upstreams are fast

Low Throughput (RPS < Expected)

Symptoms:

  • Cannot achieve desired requests per second

  • Concurrent requests queue up

  • CPU or network not saturated

Check throughput metrics:

curl -s http://localhost:3030/metrics | grep rpc_requests_total

PromQL query:

rate(rpc_requests_total[1m])  # Requests per second

Solutions:

1. Increase Concurrency Limits

[upstreams]
max_concurrent_requests = 500  # Increase from default 100

[server]
max_connections = 10000        # Increase connection pool

2. Add More Upstreams

# Distribute load across more providers
[[upstreams]]
name = "alchemy"
# ... config

[[upstreams]]
name = "infura"
# ... config

[[upstreams]]
name = "quicknode"
# ... config

3. Enable Load Balancing

[routing]
strategy = "least_loaded"  # Distribute evenly across upstreams

4. Use Batch Requests

Instead of:

// 100 separate HTTP requests
for (let i = 0; i < 100; i++) {
  await fetch('/rpc', { method: 'POST', body: JSON.stringify(requests[i]) });
}

Do this:

// 1 batched HTTP request
const batch = requests.map((req, i) => ({ ...req, id: i + 1 }));
await fetch('/rpc', { method: 'POST', body: JSON.stringify(batch) });

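Responses within a batch may be returned in any order, so match each response back to its request by the id field rather than by array position.
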
Circuit Breaker Issues

Circuit Breaker Stuck Open

Symptoms:

  • Upstream marked healthy but circuit breaker stays open

  • Requests continue to fail with "Circuit breaker is open"

  • Circuit breaker state metric shows 1 (open) for extended period

Check circuit breaker state:

curl -s http://localhost:3030/metrics | grep circuit_breaker_state
# rpc_circuit_breaker_state{upstream="alchemy"} 1  # 1 = open

Check transition history:

curl -s http://localhost:3030/metrics | grep circuit_breaker_transitions
# rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="open"} 3
# rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="half_open"} 0

Solutions:

1. Wait for Automatic Recovery

Circuit breaker will transition to half-open after timeout:

[upstreams.circuit_breaker]
timeout_seconds = 60  # Default timeout

Check when circuit will recover:

# Look for last failure time in logs
# Circuit opens at: T
# Will try recovery at: T + 60 seconds

2. Reduce Circuit Breaker Sensitivity

[upstreams.circuit_breaker]
failure_threshold = 10     # Increase from default 5
timeout_seconds = 30       # Decrease recovery time from 60

3. Restart Prism (Last Resort)

# Restarting resets all circuit breaker state
systemctl restart prism
# or
docker restart prism-container

Circuit Breaker Opens Too Easily

Symptoms:

  • Circuit breaker opens during normal operation

  • Temporary errors trigger circuit breaker

  • Frequent open/close cycles

Check failure threshold:

curl -s http://localhost:3030/metrics | grep circuit_breaker_failure_count
# rpc_circuit_breaker_failure_count{upstream="alchemy"} 5  # At threshold

Solutions:

1. Increase Failure Threshold

[upstreams.circuit_breaker]
failure_threshold = 10     # Increase from default 5
# Requires 10 consecutive failures before opening

2. Add Retry Logic

[upstreams]
max_retries = 3            # Retry failed requests
retry_delay_ms = 100       # Wait 100ms between retries

3. Adjust Error Classification

Review logs to identify error types:

journalctl -u prism | grep "circuit breaker"

Error types that trigger circuit breaker:

  • Provider errors (-32603: Internal error)

  • Parse errors (malformed responses)

  • Timeouts (upstream unresponsive)

Error types that DON'T trigger circuit breaker:

  • Client errors (-32600: Invalid Request)

  • Rate limits (-32005: Limit exceeded)

  • Execution errors (transaction reverts)


Reorg Handling Issues

Cache Contains Stale Data After Reorg

Symptoms:

  • Queries return different data than upstream

  • Block hashes don't match expected values

  • Transactions show incorrect status

Check reorg detection:

curl -s http://localhost:3030/metrics | grep reorg
# rpc_reorgs_detected_total 5
# rpc_last_reorg_block 18499995

Verify cache invalidation:

# Check logs for invalidation messages
journalctl -u prism | grep "invalidating block"

Solutions:

1. Verify Reorg Detection Is Working

Check health endpoint:

curl http://localhost:3030/health | jq '.upstreams[] | {name, latest_block, finalized_block}'

If all upstreams show same tip: Reorg detection is working

If upstreams disagree: May indicate reorg in progress

2. Reduce Safety Depth

[reorg_manager]
safety_depth = 6           # Reduce from default 12
# More aggressive cache invalidation

3. Clear Cache Manually

Restart Prism to clear all caches:

systemctl restart prism

Or use cache clear endpoint (if implemented):

curl -X POST http://localhost:3030/admin/clear-cache \
  -H "X-Admin-Key: your-admin-key"

Too Many Reorg Detections

Symptoms:

  • rpc_reorgs_detected_total increasing rapidly

  • Frequent cache invalidation

  • Low cache hit rate due to constant invalidation

Check reorg frequency:

curl -s http://localhost:3030/metrics | grep reorgs_detected
# rpc_reorgs_detected_total 150  # Very high

PromQL query:

rate(rpc_reorgs_detected_total[1h])  # Reorgs per second

Causes & Solutions:

1. Upstreams Reporting Different Chain States

Symptom: Upstreams disagree on current tip or block hashes

Check upstream consistency:

# Query all upstreams directly
for upstream in alchemy infura quicknode; do
  echo "=== $upstream ==="
  curl -s https://$upstream.url/... \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    | jq .result
done

If blocks differ by > 5: One or more upstreams may be syncing or stale

Solution: Remove stale upstreams from configuration

2. WebSocket Reconnections Causing False Reorgs

Check WebSocket metrics:

curl -s http://localhost:3030/metrics | grep websocket
# rpc_websocket_disconnections_total 45

Solution: Improve WebSocket stability

[upstreams.websocket]
enabled = true
reconnect_delay_ms = 5000  # Wait longer before reconnect
max_reconnect_attempts = 10

3. Network Issues Between Prism and Upstreams

Symptom: Intermittent connectivity causes missed block notifications

Solution:

  • Move Prism closer to upstream providers (same region)

  • Use more reliable network connection

  • Enable WebSocket fallback to HTTP polling


WebSocket Connection Failures

WebSocket Disconnects Repeatedly

Symptoms:

  • Frequent "WebSocket disconnected" in logs

  • High reconnection count in metrics

  • Missing block notifications

Check WebSocket status:

curl -s http://localhost:3030/metrics | grep websocket
# rpc_websocket_active_connections 0  # Should be > 0
# rpc_websocket_disconnections_total 125

Solutions:

1. Increase Reconnection Delay

[upstreams.websocket]
reconnect_delay_ms = 5000  # Increase from default 1000
max_reconnect_attempts = 20

2. Check Upstream WebSocket Support

Test WebSocket connection manually:

# Install wscat: npm install -g wscat
wscat -c wss://eth-mainnet.g.alchemy.com/v2/YOUR_KEY

# Subscribe to new heads
> {"jsonrpc":"2.0","method":"eth_subscribe","params":["newHeads"],"id":1}

# Should receive: {"jsonrpc":"2.0","result":"0x...","id":1}
# Then block notifications every ~12 seconds

If connection fails:

  • Upstream may not support WebSocket

  • API key may not have WebSocket access

  • Firewall blocking WebSocket connections
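
If you would rather script this check than use wscat, a minimal Node sketch with the ws package (npm install ws) looks roughly like the following; the URL and key are placeholders:

// check-ws-subscription.js - subscribe to newHeads and print incoming blocks
const WebSocket = require('ws');

const WS_URL = 'wss://eth-mainnet.g.alchemy.com/v2/YOUR_KEY';   // placeholder
const ws = new WebSocket(WS_URL);

ws.on('open', () => {
  ws.send(JSON.stringify({
    jsonrpc: '2.0', id: 1, method: 'eth_subscribe', params: ['newHeads'],
  }));
});

ws.on('message', data => {
  const msg = JSON.parse(data);
  if (msg.id === 1) console.log('subscription id:', msg.result);
  if (msg.method === 'eth_subscription') {
    console.log('new head:', msg.params.result.number);
  }
});

ws.on('error', err => console.error('WebSocket error:', err.message));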

3. Disable WebSocket (Fallback to HTTP)

[upstreams.websocket]
enabled = false  # Disable WebSocket, use HTTP polling only

Note: HTTP polling is less efficient but more reliable


Missing Block Notifications

Symptoms:

  • Chain tip not updating in Prism

  • /health shows stale latest_block

  • Reorg detection not working

Check chain tip updates:

# Monitor health endpoint
watch -n 5 'curl -s http://localhost:3030/health | jq .upstreams[0].latest_block'

Solutions:

1. Verify WebSocket Subscription

Check logs for subscription confirmation:

journalctl -u prism | grep "subscribed to newHeads"

If no subscription message: WebSocket not connected

2. Enable HTTP Polling Fallback

[upstreams.health_check]
enabled = true
interval_seconds = 30      # Poll every 30 seconds
timeout_seconds = 10

How it helps:

  • Health checker polls eth_blockNumber periodically

  • Updates chain tip even if WebSocket fails

  • Detects rollbacks and reorgs

3. Check Firewall Rules

WebSocket requires outbound connections:

# Allow outbound HTTPS/WSS (port 443)
sudo ufw allow out 443/tcp

# Test connectivity
curl -v https://eth-mainnet.g.alchemy.com/v2/YOUR_KEY
# Should see "Connected to eth-mainnet.g.alchemy.com"

Error Codes Reference

JSON-RPC Standard Errors

Code      Message                  Description                            Troubleshooting
-32700    Parse error              Invalid JSON received                  Check request syntax; ensure valid JSON
-32600    Invalid Request          Request object malformed               Verify jsonrpc: "2.0", method, and id fields
-32601    Method not found         Method doesn't exist or unsupported    Check method name spelling; see supported methods
-32602    Invalid params           Invalid method parameters              Verify parameter types and count
-32603    Internal error           Internal server error                  Check Prism logs; may be an upstream issue

Ethereum JSON-RPC Errors

Code      Message                  Description                            Troubleshooting
-32000    Server error             Generic server error                   Check error message for details
-32001    Resource not found       Requested resource doesn't exist       Block/transaction may not exist yet
-32002    Resource unavailable     Resource temporarily unavailable       Retry after delay; upstream may be syncing
-32003    Transaction rejected     Transaction wouldn't be accepted       Check transaction parameters
-32004    Method not supported     Method not implemented                 Method not supported by upstream
-32005    Limit exceeded           Request exceeds defined limit          Reduce query range; add more upstreams

Prism-Specific Errors

Code      Message                  Description                            Troubleshooting
-32050    No healthy upstreams     All upstreams unavailable              Check upstream configuration; verify API keys
-32051    Circuit breaker open     Upstream circuit breaker open          Wait for recovery or restart; check upstream health
-32052    Consensus failure        Upstreams disagree on response         Check upstream consistency; remove stale upstreams
-32053    Rate limit exceeded      Request rate limit exceeded            Implement client rate limiting; increase limits
-32054    Authentication failed    Invalid or missing API key             Include X-API-Key header; verify key is valid
-32055    Method not allowed       API key lacks method permission        Update key permissions or use a different key

Upstream Provider Errors

Execution Errors (Client's fault, NOT penalized):

  • "execution reverted" - Smart contract reverted

  • "out of gas" - Transaction ran out of gas

  • "insufficient funds" - Account balance too low

  • "nonce too low" - Transaction nonce already used

  • "gas too low" - Gas limit too low for transaction

Provider Errors (Upstream's fault, triggers circuit breaker):

  • "Internal error" (-32603) - Upstream server error

  • "server error" (-32000) - Generic upstream error

Rate Limit Errors (Transient, retry on different upstream):

  • "Limit exceeded" (-32005) - Upstream rate limit

  • HTTP 429 - Too many requests
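
To put this classification to work on the client, a request wrapper can retry only the transient codes and surface everything else immediately. The sketch below is illustrative; the retryable set mirrors the tables above and should be adjusted to your needs:

// rpc-retry.js - retry only transient JSON-RPC errors with simple backoff
const PRISM_URL = 'http://localhost:3030/';
const RETRYABLE = new Set([-32005, -32050, -32051, -32053]);   // rate limits, no upstreams, breaker open

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function rpcWithRetry(body, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    const res = await fetch(PRISM_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
    const json = await res.json();

    if (!json.error) return json.result;
    if (!RETRYABLE.has(json.error.code) || attempt === attempts) {
      throw new Error(`RPC error ${json.error.code}: ${json.error.message}`);
    }
    await sleep(250 * 2 ** (attempt - 1));   // 250ms, 500ms, 1s ...
  }
}

rpcWithRetry({ jsonrpc: '2.0', method: 'eth_blockNumber', params: [], id: 1 })
  .then(console.log)
  .catch(console.error);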


Debug Strategies

General Debugging Workflow

  1. Check Health Endpoint

    curl http://localhost:3030/health | jq .

    • Verify status: "healthy"

    • Check upstream health and response times

    • Verify cache statistics

  2. Check Metrics

    curl http://localhost:3030/metrics | grep -E '(error|unhealthy|circuit|reorg)'

    • Look for error rate spikes

    • Check circuit breaker states

    • Review reorg activity

  3. Enable Debug Logging

    [logging]
    level = "debug"  # or "trace" for extreme verbosity
    format = "pretty"

  4. Monitor Real-Time Logs

    # Systemd
    journalctl -u prism -f
    
    # Docker
    docker logs -f prism-container
    
    # Direct binary
    tail -f /var/log/prism/prism.log

Debugging Specific Issues

Debug Cache Misses

1. Enable verbose logging:

[logging]
level = "debug"

2. Look for cache decision logs:

journalctl -u prism | grep "cache"
# DEBUG cache hit method=eth_getBlockByNumber block=18500000
# DEBUG cache miss method=eth_getLogs range=18400000-18500000

3. Check cache status headers:

curl -D - http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x11a6e3b",false],"id":1}' \
  | grep -i cache

Possible values:

  • FULL: Complete cache hit

  • PARTIAL: Some data from cache, rest from upstream

  • MISS: No cached data, all from upstream

  • EMPTY: Cached empty result (e.g., no logs in range)
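
To script the header check instead of eyeballing curl output, a small sketch (assuming the X-Cache-Status header shown above and Node 18+):

// cache-status.js - print the cache status header for a request
const PRISM_URL = 'http://localhost:3030/';

async function cacheStatus(body) {
  const res = await fetch(PRISM_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  console.log('X-Cache-Status:', res.headers.get('x-cache-status'));
  return res.json();
}

cacheStatus({
  jsonrpc: '2.0', id: 1,
  method: 'eth_getBlockByNumber', params: ['0x11a6e3b', false],
}).then(r => console.log('result present:', Boolean(r.result)));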

Debug Upstream Selection

1. Enable routing logs:

journalctl -u prism | grep "upstream selected"

2. Check upstream scoring:

curl -s http://localhost:3030/metrics | grep composite_score
# rpc_upstream_composite_score{upstream="alchemy"} 0.875
# rpc_upstream_composite_score{upstream="infura"} 0.720

3. Review selection reasons:

curl -s http://localhost:3030/metrics | grep upstream_selections
# rpc_upstream_selections_total{upstream="alchemy",reason="best_score"} 85000
# rpc_upstream_selections_total{upstream="infura",reason="fallback"} 15000

Debug Authentication Issues

1. Check authentication metrics:

curl -s http://localhost:3030/metrics | grep auth
# rpc_auth_success_total{key_id="production-api"} 125000
# rpc_auth_failure_total{key_id="unknown"} 45

2. Test with known valid key:

curl -X POST http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -H "X-API-Key: test-key-12345" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

3. Check key permissions:

# Review configuration
cat /etc/prism/config.toml | grep -A 10 "authentication.keys"

Debug Performance Issues

1. Profile request latency:

# Single request timing
time curl -X POST http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'
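
A single timed curl gives only one sample. For a rough client-observed distribution, the sketch below times a number of sequential requests and reports percentiles (illustrative only; it measures end-to-end latency, including the network between the client and Prism):

// profile-latency.js - measure client-observed latency for N requests
const PRISM_URL = 'http://localhost:3030/';
const N = 200;

async function timeOne() {
  const start = performance.now();
  await fetch(PRISM_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ jsonrpc: '2.0', method: 'eth_blockNumber', params: [], id: 1 }),
  });
  return performance.now() - start;
}

async function main() {
  const samples = [];
  for (let i = 0; i < N; i++) samples.push(await timeOne());
  samples.sort((a, b) => a - b);
  const pct = p => samples[Math.min(samples.length - 1, Math.floor(p * samples.length))];
  console.log(`p50=${pct(0.5).toFixed(1)}ms p95=${pct(0.95).toFixed(1)}ms p99=${pct(0.99).toFixed(1)}ms`);
}

main().catch(console.error);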

2. Check P50/P95/P99 latency:

histogram_quantile(0.50, rate(rpc_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(rpc_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))

3. Identify slow methods:

curl -s http://localhost:3030/metrics | grep duration_seconds_sum
# Calculate average: sum / count for each method

4. Check upstream latency:

curl -s http://localhost:3030/metrics | grep upstream_latency
# rpc_upstream_latency_p99_ms{upstream="alchemy"} 350
# rpc_upstream_latency_p99_ms{upstream="infura"} 420

FAQ

General Questions

Q: What methods does Prism cache?

A: Prism caches the following methods:

  • eth_getBlockByHash - Block cache

  • eth_getBlockByNumber - Block cache

  • eth_getLogs - Log cache (with partial-range support)

  • eth_getTransactionByHash - Transaction cache

  • eth_getTransactionReceipt - Transaction cache

Not cached (forwarded to upstream):

  • eth_blockNumber - Always latest value

  • eth_chainId - Static value

  • eth_gasPrice - Changes frequently

  • eth_getBalance - Account-specific, changes with every transaction

  • eth_call - Depends on state, not safely cacheable

Q: How long does cached data stay valid?

A: Cache validity depends on block finality:

  • Finalized blocks (past finalized checkpoint): Cached forever

  • Safe blocks (beyond safety_depth from tip): Cached until reorg detected

  • Unsafe blocks (within safety_depth of tip): May be invalidated during reorgs

  • Default safety_depth: 12 blocks (~2.4 minutes)

Configuration:

[reorg_manager]
safety_depth = 12  # Blocks beyond this are "safe"
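
As an illustrative model of how these tiers interact (not Prism's internal code), the classification can be sketched as:

// finality.js - classify a block the way the cache treats it (illustrative only)
function classifyBlock(blockNumber, { tip, finalized, safetyDepth = 12 }) {
  if (blockNumber <= finalized) return 'finalized';        // cached forever
  if (blockNumber <= tip - safetyDepth) return 'safe';     // cached until a reorg is detected
  return 'unsafe';                                         // may be invalidated during reorgs
}

// Example: tip at 18,500,000 with a finalized checkpoint 64 blocks back
const view = { tip: 18500000, finalized: 18499936, safetyDepth: 12 };
console.log(classifyBlock(18499000, view));   // finalized
console.log(classifyBlock(18499980, view));   // safe
console.log(classifyBlock(18499995, view));   // unsafe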

Q: Can I disable caching for specific methods?

A: Currently, caching is method-specific and cannot be selectively disabled. However, you can:

  1. Disable all caching:

[cache]
enabled = false

  2. Reduce cache sizes (effectively disables caching):

[cache.block_cache]
max_headers = 0
max_bodies = 0

[cache.log_cache]
max_chunks = 0

  3. Use separate Prism instances (cached vs non-cached)

Q: How many upstreams should I configure?

A: Recommended:

  • Minimum: 2 upstreams for redundancy

  • Optimal: 3-4 upstreams for best reliability and performance

  • Maximum: No hard limit, but 5+ may have diminishing returns

Why multiple upstreams:

  • Redundancy: If one fails, others handle requests

  • Load distribution: Spread requests across providers

  • Rate limit mitigation: Switch to different upstream when one is rate-limited

  • Latency optimization: Route requests to fastest upstream

Q: What happens during a blockchain reorg?

A: Prism handles reorgs automatically:

  1. Detection: WebSocket or health check detects chain tip changed

  2. Invalidation: Affected blocks removed from cache

  3. Refetch: Next request fetches fresh data from upstream

  4. Consistency: Clients always receive canonical chain data

Example:

Before:  Block 1000 (hash 0xAAA) cached
Reorg:   Block 1000 now has hash 0xBBB
Action:  Block 1000 invalidated from cache
Result:  Next request fetches new block 1000 (0xBBB) from upstream

Configuration Questions

Q: What's the difference between safety_depth and max_reorg_depth?

A:

  • safety_depth: Blocks beyond this from tip are considered "safe" from reorgs

    • Default: 12 blocks

    • Used for: Cache retention decisions, finality classification

    • Example: With safety_depth=12 at tip 1000, blocks <= 988 are "safe"

  • max_reorg_depth: Maximum blocks to search backwards for reorg divergence

    • Default: 100 blocks

    • Used for: Limit reorg detection scope, prevent excessive cache invalidation

    • Example: Reorg affecting 150 blocks will only invalidate last 100

Configuration:

[reorg_manager]
safety_depth = 12        # Finality threshold
max_reorg_depth = 100    # Reorg search limit

Q: How do I optimize for lowest latency?

A: Latency optimization checklist:

  1. Enable caching:

[cache]
enabled = true

[cache.block_cache]
hot_window_size = 500  # Cache recent blocks

  2. Use fast upstreams:

[[upstreams]]
name = "quicknode"
url = "https://your-endpoint.quiknode.pro/KEY/"
priority = 10  # Higher priority

  3. Enable hedging:

[routing]
strategy = "hedge"

[routing.hedge]
hedge_delay_ms = 50  # Send backup request after 50ms

  4. Geographic proximity: Deploy Prism in same region as upstreams

  5. Connection pooling:

[upstreams]
max_concurrent_requests = 200

  6. Monitor latency:

histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))

Q: How do I optimize for highest throughput?

A: Throughput optimization checklist:

  1. Add more upstreams:

# 4-5 upstreams for load distribution
[[upstreams]]
name = "alchemy"
# ...

[[upstreams]]
name = "infura"
# ...

  2. Increase concurrency:

[upstreams]
max_concurrent_requests = 500  # Per upstream

[server]
max_connections = 10000

  3. Use load balancing:

[routing]
strategy = "least_loaded"

  4. Enable batch requests: Use JSON-RPC batch format

  5. Maximize caching:

[cache.block_cache]
hot_window_size = 1000
max_headers = 100000

Troubleshooting Questions

Q: Why is my cache hit rate so low?

A: Common causes:

  1. Random historical queries: Queries to random old blocks bypass hot window cache

    • Solution: Focus queries on recent blocks (last 200 blocks)

  2. Cache size too small: Frequent evictions reduce hit rate

    • Solution: Increase cache sizes in configuration

  3. Wide log query ranges: Large block ranges cause partial misses

    • Solution: Use smaller ranges or increase chunk_size

  4. Reorgs invalidating cache: Frequent reorgs clear cached blocks

    • Solution: Increase safety_depth, check for reorg storms

  5. Cache disabled: Check cache.enabled = true

Debug:

# Check cache hit rate
curl -s http://localhost:3030/metrics | grep -E 'cache_(hits|misses)_total'

# PromQL
rate(rpc_cache_hits_total[5m]) /
  (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m]))

Q: How do I know if an upstream is slow or failing?

A: Check these metrics:

  1. Health status:

curl http://localhost:3030/health | jq '.upstreams[] | select(.name=="alchemy")'

  2. Latency metrics:

curl -s http://localhost:3030/metrics | grep 'upstream_latency.*alchemy'
# rpc_upstream_latency_p99_ms{upstream="alchemy"} 850  # High latency

  3. Error counts:

curl -s http://localhost:3030/metrics | grep 'upstream_errors.*alchemy'
# rpc_upstream_errors_total{upstream="alchemy",error_type="timeout"} 125

  4. Circuit breaker state:

curl -s http://localhost:3030/metrics | grep 'circuit_breaker_state.*alchemy'
# rpc_circuit_breaker_state{upstream="alchemy"} 1  # Open = failing

Warning signs:

  • P99 latency > 1 second

  • Error rate > 5%

  • Circuit breaker open

  • Block lag > 10 blocks behind tip

Q: Why do I see "Circuit breaker is open" errors?

A: Circuit breaker opens when upstream has too many consecutive failures.

Immediate fix: Wait 60 seconds for automatic recovery to half-open state

Long-term solutions:

  1. Fix upstream issues:

    • Check API key validity

    • Verify network connectivity

    • Check provider status page

  2. Adjust circuit breaker settings:

[upstreams.circuit_breaker]
failure_threshold = 10    # Increase from 5
timeout_seconds = 30      # Decrease from 60

  3. Add redundant upstreams: More upstreams reduce impact of one failing

  4. Monitor recovery:

# Watch for half-open transition
journalctl -u prism -f | grep "circuit breaker"

Q: How can I reduce API costs?

A: Cost reduction strategies:

  1. Maximize caching:

[cache]
enabled = true

[cache.block_cache]
hot_window_size = 1000
max_headers = 100000

[cache.log_cache]
max_chunks = 500

  2. Use tiered upstreams:

# Free tier for cacheable queries
[[upstreams]]
name = "public-rpc"
url = "https://eth.public-rpc.com"
priority = 1  # Low priority

# Paid tier for non-cacheable queries
[[upstreams]]
name = "alchemy-paid"
url = "https://eth-mainnet.g.alchemy.com/v2/KEY"
priority = 10  # High priority

  3. Implement client-side batching: Batch requests to reduce HTTP overhead

  4. Rate limit clients: Prevent abuse

[[authentication.keys]]
rate_limit_per_second = 50

  5. Monitor request distribution:

# Check which methods use most requests
curl -s http://localhost:3030/metrics | grep 'rpc_requests_total' | sort -k2 -n

Still having issues? Join our community:

  • GitHub Issues: https://github.com/your-org/prism/issues

  • Discord: https://discord.gg/prism

  • Documentation: https://docs.prism.sh

Next: Explore Monitoring & Observability for proactive issue detection.
