Troubleshooting & FAQ

Comprehensive troubleshooting guide for diagnosing and resolving common Prism issues.

Connection & Upstream Issues

No Healthy Upstreams Available

Error: "No healthy upstreams available" (Error code: -32050)

Symptoms:

All requests fail with -32050 error
/health endpoint returns "status": "unhealthy"
Metrics show rpc_healthy_upstreams == 0

Causes & Solutions:

1. All Upstreams Failing Health Checks

Check health status:

curl http://localhost:3030/health | jq '.upstreams'

Check metrics:

curl -s http://localhost:3030/metrics | grep rpc_upstream_health
# rpc_upstream_health{upstream="alchemy"} 0
# rpc_upstream_health{upstream="infura"} 0

Solutions:

Verify upstream URLs are correct in configuration
Check network connectivity to upstream providers
Verify API keys are valid and not rate-limited
Check upstream provider status pages

2. Incorrect Configuration

Check configuration:

[[upstreams]]
name = "alchemy-mainnet"
url = "https://eth-mainnet.g.alchemy.com/v2/YOUR_API_KEY"  # Verify this URL
chain_id = 1

Common mistakes:

Missing or invalid API key in URL
Wrong chain ID (e.g., mainnet vs testnet mismatch)
HTTP instead of HTTPS
Incorrect endpoint path

3. Circuit Breakers All Open

Check circuit breaker state:

curl -s http://localhost:3030/metrics | grep rpc_circuit_breaker_state
# rpc_circuit_breaker_state{upstream="alchemy"} 1  # 1 = open (bad)

Solution: Wait for the circuit breakers to move to half-open on their own, or restart Prism to reset their state:

# Circuit breakers will automatically transition to half-open after timeout
# Default timeout: 60 seconds

Reduce sensitivity:

[upstreams.circuit_breaker]
failure_threshold = 10    # Increase from default 5
timeout_seconds = 30      # Decrease recovery time

Upstream Connection Timeouts

Error: "Request timeout" or "Connection failed: connection timeout"

Symptoms:

Requests take 30+ seconds and then time out
High P99 latency in metrics
Upstream error metrics show error_type="timeout"

Check timeout metrics:

curl -s http://localhost:3030/metrics | grep timeout
# rpc_upstream_errors_total{upstream="alchemy",error_type="timeout"} 125

Solutions:

1. Increase Timeout Values

[upstreams]
timeout_seconds = 60      # Increase from default 30
max_retries = 3           # Increase retry attempts

2. Check Network Latency

# Test direct connection to upstream
time curl -s -o /dev/null -w "%{time_total}\n" \
  https://eth-mainnet.g.alchemy.com/v2/YOUR_KEY \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

If latency > 5 seconds:

Network issue between your server and upstream provider
Try different upstream providers
Use providers geographically closer to your server

3. Reduce Concurrent Load

[upstreams]
max_concurrent_requests = 50  # Reduce from default 100

HTTP 429 Rate Limiting

Error: "HTTP error: 429" or RPC error -32005: "Limit exceeded"

Symptoms:

Intermittent failures during high traffic
Error rate spikes in metrics
Upstream provider returns "Too Many Requests"

Check rate limit errors:

curl -s http://localhost:3030/metrics | grep rate_limit
# rpc_upstream_errors_total{upstream="alchemy",error_type="rpc_rate_limit"} 450
# rpc_jsonrpc_errors_total{upstream="alchemy",code="-32005"} 450

Solutions:

1. Add More Upstreams

# Distribute load across multiple providers
[[upstreams]]
name = "alchemy-mainnet"
url = "https://eth-mainnet.g.alchemy.com/v2/KEY1"

[[upstreams]]
name = "infura-mainnet"
url = "https://mainnet.infura.io/v3/KEY2"

[[upstreams]]
name = "quicknode-mainnet"
url = "https://your-endpoint.quiknode.pro/KEY3/"

2. Enable Caching to Reduce Upstream Calls

[cache]
enabled = true

[cache.block_cache]
hot_window_size = 500      # Cache more recent blocks

[cache.log_cache]
chunk_size = 100           # Blocks per cached chunk; see the chunk_size trade-off below
max_chunks = 200

Verify cache effectiveness:

# Check cache hit rate
curl -s http://localhost:3030/metrics | grep cache_hits

3. Implement Rate Limiting at Prism Level

[rate_limiting]
enabled = true
requests_per_second = 50   # Limit client request rate

Cache Problems

Low Cache Hit Rate

Symptoms:

Cache hit rate < 70% consistently
Most requests show X-Cache-Status: MISS
High upstream request counts

Check cache hit rate:

curl -s http://localhost:3030/metrics | grep -E 'cache_(hits|misses)'
# Calculate: hits / (hits + misses)

PromQL query:

rate(rpc_cache_hits_total[5m]) /
  (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m]))
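Without Prometheus, you can compute the same ratio from a single scrape of /metrics. The sketch below assumes the counters are named rpc_cache_hits_total and rpc_cache_misses_total, as in the query above, and it sums every labeled series of each. The function name is illustrative. Because it reads raw counters, it gives the hit rate since the counters last reset, not over a 5-minute window.

```javascript
// Sketch: cache hit rate from one scrape of Prism's /metrics (Prometheus text format).
// Assumes counters named rpc_cache_hits_total and rpc_cache_misses_total.
function cacheHitRate(metricsText) {
  let hits = 0;
  let misses = 0;
  for (const line of metricsText.split("\n")) {
    if (line.startsWith("#")) continue; // skip HELP/TYPE comment lines
    // Sample lines look like: metric_name{optional="labels"} value
    const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)/);
    if (!match) continue;
    if (match[1] === "rpc_cache_hits_total") hits += Number(match[3]);
    if (match[1] === "rpc_cache_misses_total") misses += Number(match[3]);
  }
  const total = hits + misses;
  return total === 0 ? null : hits / total; // null when no cacheable requests yet
}
```

For example, save a scrape with `curl -s http://localhost:3030/metrics > metrics.txt` and pass `fs.readFileSync("metrics.txt", "utf8")` to `cacheHitRate`. A result below 0.7 matches the symptom above.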

Solutions:

1. Increase Cache Sizes

[cache.block_cache]
hot_window_size = 1000     # Increase from default 200
max_headers = 50000        # Increase from default 10000
max_bodies = 20000         # Increase from default 5000

[cache.log_cache]
max_chunks = 500           # Increase from default 100

[cache.transaction_cache]
max_transactions = 100000  # Increase from default 50000
max_receipts = 100000

2. Adjust Chunk Size for Log Cache

[cache.log_cache]
chunk_size = 100           # Smaller = more granular, better hit rate
                           # Larger = fewer chunks, less memory overhead

Trade-off:

Smaller chunk_size (10-50): Better partial cache hits, more memory
Larger chunk_size (100-500): Less memory, may miss partial ranges
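To make the trade-off concrete, the sketch below maps an eth_getLogs block range onto log-cache chunks and marks which chunks the range covers completely. It assumes chunks are aligned to multiples of chunk_size, so chunk k spans blocks k * chunk_size through (k + 1) * chunk_size - 1. Prism's exact chunk layout may differ.

```javascript
// Sketch: map an inclusive block range onto fixed-size log-cache chunks.
// Assumption: chunk k covers blocks [k * chunkSize, (k + 1) * chunkSize - 1].
function chunksForRange(fromBlock, toBlock, chunkSize) {
  const chunks = [];
  const first = Math.floor(fromBlock / chunkSize);
  const last = Math.floor(toBlock / chunkSize);
  for (let k = first; k <= last; k++) {
    const start = k * chunkSize;
    const end = start + chunkSize - 1;
    chunks.push({
      index: k,
      start,
      end,
      // true when the request covers the whole chunk, false for partial edge chunks
      full: fromBlock <= start && toBlock >= end,
    });
  }
  return chunks;
}

// Example: blocks 18500050-18500349 with chunk_size = 100
const chunks = chunksForRange(18500050, 18500349, 100);
console.log(chunks.map((c) => `${c.start}-${c.end} ${c.full ? "full" : "partial"}`));
```

With chunk_size = 100, the range 18500050-18500349 touches four chunks, and only the middle two are fully covered. With chunk_size = 50, the same range aligns to six fully covered chunks.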

3. Enable Cache Warming

Warm cache with recent data on startup:

[cache]
warm_on_startup = true
warm_blocks_count = 1000   # Warm last 1000 blocks

4. Check Request Patterns

Problem: Random historical queries miss the cache

Example: Querying random old blocks

# These all miss cache if not in hot window
eth_getBlockByNumber("0x500000", false)
eth_getBlockByNumber("0x600000", false)
eth_getBlockByNumber("0x750000", false)

Solution:

Use recent blocks when possible (last 200 blocks have highest hit rate)
For historical queries, batch sequential ranges to benefit from cache

High Cache Eviction Rate

Symptoms:

rpc_cache_evictions_total increasing rapidly
Cache hit rate declining over time
Memory pressure on system

Check eviction metrics:

curl -s http://localhost:3030/metrics | grep evictions
# rpc_cache_evictions_total{cache_type="block"} 15000
# rpc_cache_evictions_total{cache_type="log"} 32000

Solutions:

1. Increase Cache Memory Limits

[cache.block_cache]
max_headers = 100000       # Allow more entries before eviction
max_bodies = 50000

[cache.log_cache]
max_chunks = 1000          # More chunks = less eviction

2. Adjust Hot Window Size

[cache.block_cache]
hot_window_size = 500      # Keep more recent blocks in fast cache

3. Check Memory Usage

# Monitor Prism process memory
ps aux | grep prism

# Check system memory
free -h

If memory constrained:

Reduce cache sizes
Add more RAM to server
Enable cache compression (if available)

Cache Invalidation After Reorgs

Symptoms:

Cache hit rate drops suddenly
Logs show "reorg detected" messages
Blocks being refetched repeatedly

Check reorg metrics:

curl -s http://localhost:3030/metrics | grep reorg
# rpc_reorgs_detected_total 8
# rpc_last_reorg_block 18499995

Expected behavior: Cache invalidates blocks during reorgs to maintain consistency

Solutions:

1. Adjust Safety Depth

[reorg_manager]
safety_depth = 20          # Increase from default 12
                           # Blocks beyond this depth are "safe" from reorgs

Trade-off:

Larger safety_depth: Less cache invalidation, but stale data during deep reorgs
Smaller safety_depth: More aggressive invalidation, always fresh data

2. Check for Reorg Storms

Frequent reorgs may indicate:

Network issues with upstreams
Misconfigured chain_id (wrong network)
Upstream providers disagreeing on chain state

Verify chain consistency:

# Check all upstreams report same chain tip
curl http://localhost:3030/health | jq '.upstreams[] | {name, latest_block}'

If upstreams disagree by > 10 blocks:

One or more upstreams may be syncing
Consider removing slow/stale upstreams
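You can automate this comparison. The sketch assumes the /health response has an upstreams array whose entries carry name and latest_block, as the jq query above implies. The function name and the 10-block default (taken from the threshold above) are illustrative.

```javascript
// Sketch: flag upstreams lagging behind the highest reported chain tip.
// Assumes /health returns { upstreams: [{ name, latest_block }, ...] }.
function findLaggingUpstreams(health, maxLag = 10) {
  // Number() accepts both decimal numbers and "0x..." hex strings
  const tips = health.upstreams.map((u) => ({ name: u.name, block: Number(u.latest_block) }));
  const highest = Math.max(...tips.map((t) => t.block));
  return tips
    .filter((t) => highest - t.block > maxLag)
    .map((t) => ({ name: t.name, behindBy: highest - t.block }));
}

// Usage with Node 18+ (global fetch), inside an async function:
// const health = await (await fetch("http://localhost:3030/health")).json();
// console.log(findLaggingUpstreams(health));
```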

3. Enable Reorg Coalescing

[reorg_manager]
coalesce_window_ms = 100   # Batch rapid reorgs within 100ms window

Authentication Errors

Invalid or Missing API Key

Error: -32054: "Authentication failed" or -32600: "Invalid Request"

Symptoms:

All requests rejected with authentication error
No X-API-Key header in requests

Solutions:

1. Include API Key in Request

Header method:

curl -X POST http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key-here" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

Query parameter method:

curl -X POST "http://localhost:3030/?api_key=your-api-key-here" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

2. Verify API Key Configuration

[authentication]
enabled = true

[[authentication.keys]]
key = "prod-key-12345"
name = "production-api"
allowed_methods = ["*"]  # Or specific methods
rate_limit_per_second = 100

3. Check Authentication Metrics

curl -s http://localhost:3030/metrics | grep auth
# rpc_auth_success_total{key_id="production-api"} 125000
# rpc_auth_failure_total{key_id="unknown"} 45

Method Not Allowed for API Key

Error: -32055: "Method not allowed"

Symptoms:

Some methods work, others return permission error
Authentication succeeds but specific RPC calls fail

Check key permissions:

[[authentication.keys]]
key = "logs-only-key"
allowed_methods = [
  "eth_getLogs",
  "eth_blockNumber"
]
# Calling eth_getBlockByNumber with this key will fail

Solution: Update the key's permissions or use an appropriate key:

[[authentication.keys]]
key = "full-access-key"
allowed_methods = ["*"]  # Allow all methods

Rate Limit Exceeded

Error: -32053: "Rate limit exceeded"

Symptoms:

Requests succeed initially, then fail during high traffic
Rate limit metrics show rejections

Check rate limit metrics:

curl -s http://localhost:3030/metrics | grep rate_limit
# rpc_rate_limit_rejected_total{key="production-api"} 500

Solutions:

1. Increase Rate Limit

[[authentication.keys]]
key = "production-api"
rate_limit_per_second = 200  # Increase from 100

2. Implement Client-Side Rate Limiting

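const Bottleneck = require('bottleneck'); // Client-side rate limiting library (npm install bottleneck)

const limiter = new Bottleneck({
  minTime: 20,      // At least 20 ms between request starts = at most 50 requests per second
  maxConcurrent: 10 // At most 10 requests in flight at once
});

// Example request function (Node 18+ global fetch); adjust the URL and API key
async function makeRpcRequest() {
  const res = await fetch('http://localhost:3030/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-API-Key': 'your-api-key-here' },
    body: JSON.stringify({ jsonrpc: '2.0', method: 'eth_blockNumber', params: [], id: 1 })
  });
  return res.json();
}

limiter.schedule(() => makeRpcRequest()).then(console.log);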

3. Use Multiple API Keys

Distribute load across multiple API keys:

const keys = ['key1', 'key2', 'key3'];
let requestCount = 0;
// Round-robin: each call returns the next key in turn
const nextApiKey = () => keys[requestCount++ % keys.length];

Performance Problems

High Request Latency (P99 > 1s)

Symptoms:

Slow response times for clients
High P99 latency in metrics
Timeouts during peak traffic

Check latency metrics:

curl -s http://localhost:3030/metrics | grep duration_seconds

PromQL query:

histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))

Solutions:

1. Enable Caching

[cache]
enabled = true

Verify cache is working:

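# Check the cache status header on a cacheable method (eth_blockNumber is never cached)
curl -s -D - -o /dev/null http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x11a6e3b",false],"id":1}' | grep -i cache
# X-Cache-Status: FULL  (on a repeat request, once the block is cached)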

2. Add Faster Upstreams

[[upstreams]]
name = "quicknode-mainnet"
url = "https://your-endpoint.quiknode.pro/KEY/"  # An endpoint you have measured as fast
priority = 10  # Higher priority for faster upstream

3. Enable Request Hedging

[routing]
strategy = "hedge"

[routing.hedge]
enabled = true
hedge_delay_ms = 100       # Send backup request after 100ms

How hedging helps:

Sends request to primary upstream
If no response in 100ms, sends to backup upstream
Returns whichever responds first
Reduces tail latency
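The steps above can be sketched as a generic hedging helper. This illustrates the technique and is not Prism's implementation. Here, primary and backup are hypothetical functions that each return a promise for an upstream response. As an added assumption, the sketch starts the backup immediately if the primary fails before the hedge delay. It also leaves the slower request running rather than cancelling it.

```javascript
// Sketch of request hedging: run `primary`; if it has not settled within
// hedgeDelayMs (or fails earlier), also run `backup`; resolve with the first success.
// `primary` and `backup` are hypothetical functions returning promises.
function hedgedRequest(primary, backup, hedgeDelayMs = 100) {
  return new Promise((resolve, reject) => {
    let done = false;
    let pending = 0; // attempts started but not yet failed
    let backupStarted = false;
    let timer = null;

    const run = (fn) => {
      pending++;
      Promise.resolve()
        .then(fn) // turns a synchronous throw into a rejection
        .then(
          (value) => {
            if (done) return; // a faster attempt already won
            done = true;
            clearTimeout(timer);
            resolve(value);
          },
          (err) => {
            pending--;
            if (done) return;
            if (!backupStarted) startBackup(); // primary failed early: hedge now
            else if (pending === 0) {
              done = true;
              reject(err); // every attempt failed
            }
          }
        );
    };

    const startBackup = () => {
      if (backupStarted || done) return;
      backupStarted = true;
      clearTimeout(timer);
      run(backup);
    };

    timer = setTimeout(startBackup, hedgeDelayMs);
    run(primary);
  });
}
```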

4. Optimize Connection Pooling

[upstreams]
max_concurrent_requests = 200  # Allow more concurrent requests
timeout_seconds = 30           # Revert to the default 30 if you raised it to 60 earlier

Low Throughput (RPS < Expected)

Symptoms:

Cannot achieve desired requests per second
Concurrent requests queue up
CPU or network not saturated

Check throughput metrics:

curl -s http://localhost:3030/metrics | grep rpc_requests_total

PromQL query:

rate(rpc_requests_total[1m])  # Requests per second

Solutions:

1. Increase Concurrency Limits

[upstreams]
max_concurrent_requests = 500  # Increase from default 100

[server]
max_connections = 10000        # Allow more simultaneous client connections

2. Add More Upstreams

# Distribute load across more providers
[[upstreams]]
name = "alchemy"
# ... config

[[upstreams]]
name = "infura"
# ... config

[[upstreams]]
name = "quicknode"
# ... config

3. Enable Load Balancing

[routing]
strategy = "least_loaded"  # Distribute evenly across upstreams

4. Use Batch Requests

Instead of:

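// 100 separate HTTP requests: one round trip per call
for (const req of requests) {
  await fetch('http://localhost:3030/', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(req) });
}

Do this:

// 1 batched HTTP request: a JSON array of calls, each with a unique id
const batch = requests.map((req, i) => ({ ...req, id: i }));
const res = await fetch('http://localhost:3030/', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(batch) });
const responses = await res.json(); // Array of results; match each to its request by id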
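The JSON-RPC 2.0 specification allows a server to return batch responses in any order. Pair each response with its request by id rather than by array position. A minimal helper (the name is illustrative):

```javascript
// Reorder a JSON-RPC batch response to match the order of the original requests.
// JSON-RPC 2.0 allows batch responses in any order, so pair them up by `id`.
function orderBatchResponses(batch, responses) {
  const byId = new Map(responses.map((r) => [r.id, r]));
  return batch.map((req) => {
    const res = byId.get(req.id);
    if (res === undefined) throw new Error(`No response for request id ${req.id}`);
    // Each entry holds either `result` or `error`; one failed call doesn't fail the batch
    return res;
  });
}
```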

Circuit Breaker Issues

Circuit Breaker Stuck Open

Symptoms:

Upstream marked healthy but circuit breaker stays open
Requests continue to fail with "Circuit breaker is open"
Circuit breaker state metric shows 1 (open) for extended period

Check circuit breaker state:

curl -s http://localhost:3030/metrics | grep circuit_breaker_state
# rpc_circuit_breaker_state{upstream="alchemy"} 1  # 1 = open

Check transition history:

curl -s http://localhost:3030/metrics | grep circuit_breaker_transitions
# rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="open"} 3
# rpc_circuit_breaker_transitions_total{upstream="alchemy",to_state="half_open"} 0

Solutions:

1. Wait for Automatic Recovery

Circuit breaker will transition to half-open after timeout:

[upstreams.circuit_breaker]
timeout_seconds = 60  # Default timeout

Check when circuit will recover:

# Look for last failure time in logs
# Circuit opens at: T
# Will try recovery at: T + 60 seconds
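The recovery cycle can be modeled as a small state machine. This is a sketch under assumptions, not Prism's code. The breaker opens after failure_threshold consecutive failures and rejects requests until timeout_seconds have elapsed. It then moves to half-open and allows requests again: the next success closes it, and the next failure reopens it. Time is passed in explicitly (in milliseconds) to keep the example deterministic.

```javascript
// Sketch of a consecutive-failure circuit breaker (closed -> open -> half_open -> closed).
// Parameter names mirror the config keys failure_threshold and timeout_seconds.
class CircuitBreaker {
  constructor({ failureThreshold = 5, timeoutSeconds = 60 } = {}) {
    this.failureThreshold = failureThreshold;
    this.timeoutMs = timeoutSeconds * 1000;
    this.state = "closed";
    this.failures = 0; // consecutive failures
    this.openedAt = 0;
  }

  // Should a request be sent to this upstream at time `now` (ms)?
  allowRequest(now = Date.now()) {
    if (this.state === "open" && now - this.openedAt >= this.timeoutMs) {
      this.state = "half_open"; // timeout elapsed: let requests test recovery
    }
    return this.state !== "open";
  }

  recordSuccess() {
    this.state = "closed";
    this.failures = 0; // a success breaks the run of consecutive failures
  }

  recordFailure(now = Date.now()) {
    this.failures++;
    // A failure while half-open reopens immediately; otherwise wait for the threshold
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
      this.failures = 0;
    }
  }
}
```

With the defaults, a breaker that opens at T allows traffic again from T + 60 seconds, matching the timeline above.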

2. Reduce Circuit Breaker Sensitivity

[upstreams.circuit_breaker]
failure_threshold = 10     # Increase from default 5
timeout_seconds = 30       # Decrease recovery time from 60

3. Restart Prism (Last Resort)

# Restarting resets all circuit breaker state
systemctl restart prism
# or
docker restart prism-container

Circuit Breaker Opens Too Easily

Symptoms:

Circuit breaker opens during normal operation
Temporary errors trigger circuit breaker
Frequent open/close cycles

Check failure threshold:

curl -s http://localhost:3030/metrics | grep circuit_breaker_failure_count
# rpc_circuit_breaker_failure_count{upstream="alchemy"} 5  # At threshold

Solutions:

1. Increase Failure Threshold

[upstreams.circuit_breaker]
failure_threshold = 10     # Increase from default 5
# Requires 10 consecutive failures before opening

2. Add Retry Logic

[upstreams]
max_retries = 3            # Retry failed requests
retry_delay_ms = 100       # Wait 100ms between retries

3. Adjust Error Classification

Review logs to identify error types:

journalctl -u prism | grep "circuit breaker"

Error types that trigger circuit breaker:

Provider errors (-32603: Internal error)
Parse errors (malformed responses)
Timeouts (upstream unresponsive)

Error types that DON'T trigger circuit breaker:

Client errors (-32600: Invalid Request)
Rate limits (-32005: Limit exceeded)
Execution errors (transaction reverts)

Reorg Handling Issues

Cache Contains Stale Data After Reorg

Symptoms:

Queries return different data than upstream
Block hashes don't match expected values
Transactions show incorrect status

Check reorg detection:

curl -s http://localhost:3030/metrics | grep reorg
# rpc_reorgs_detected_total 5
# rpc_last_reorg_block 18499995

Verify cache invalidation:

# Check logs for invalidation messages
journalctl -u prism | grep "invalidating block"

Solutions:

1. Verify Reorg Detection Is Working

Check health endpoint:

curl http://localhost:3030/health | jq '.upstreams[] | {name, latest_block, finalized_block}'

If all upstreams show the same tip: Upstreams agree on the chain state

If upstreams disagree: A reorg may be in progress

2. Reduce Safety Depth

[reorg_manager]
safety_depth = 6           # Reduce from default 12
# More aggressive cache invalidation

3. Clear Cache Manually

Restart Prism to clear all caches:

systemctl restart prism

Or use cache clear endpoint (if implemented):

curl -X POST http://localhost:3030/admin/clear-cache \
  -H "X-Admin-Key: your-admin-key"

Too Many Reorg Detections

Symptoms:

rpc_reorgs_detected_total increasing rapidly
Frequent cache invalidation
Low cache hit rate due to constant invalidation

Check reorg frequency:

curl -s http://localhost:3030/metrics | grep reorgs_detected
# rpc_reorgs_detected_total 150  # Very high

PromQL query:

rate(rpc_reorgs_detected_total[1h])  # Average reorgs per second over the last hour

Causes & Solutions:

1. Upstreams Reporting Different Chain States

Symptom: Upstreams disagree on current tip or block hashes

Check upstream consistency:

# Query all upstreams directly
for upstream in alchemy infura quicknode; do
  echo "=== $upstream ==="
  curl -s https://$upstream.url/... \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    | jq .result
done

If blocks differ by > 5: One or more upstreams may be syncing or stale

Solution: Remove stale upstreams from configuration

2. WebSocket Reconnections Causing False Reorgs

Check WebSocket metrics:

curl -s http://localhost:3030/metrics | grep websocket
# rpc_websocket_disconnections_total 45

Solution: Improve WebSocket stability

[upstreams.websocket]
enabled = true
reconnect_delay_ms = 5000  # Wait longer before reconnect
max_reconnect_attempts = 10

3. Network Issues Between Prism and Upstreams

Symptom: Intermittent connectivity causes missed block notifications

Solution:

Move Prism closer to upstream providers (same region)
Use more reliable network connection
Enable WebSocket fallback to HTTP polling

WebSocket Connection Failures

WebSocket Disconnects Repeatedly

Symptoms:

Frequent "WebSocket disconnected" in logs
High reconnection count in metrics
Missing block notifications

Check WebSocket status:

curl -s http://localhost:3030/metrics | grep websocket
# rpc_websocket_active_connections 0  # Should be > 0
# rpc_websocket_disconnections_total 125

Solutions:

1. Increase Reconnection Delay

[upstreams.websocket]
reconnect_delay_ms = 5000  # Increase from default 1000
max_reconnect_attempts = 20

2. Check Upstream WebSocket Support

Test WebSocket connection manually:

# Install wscat: npm install -g wscat
wscat -c wss://eth-mainnet.g.alchemy.com/v2/YOUR_KEY

# Subscribe to new heads
> {"jsonrpc":"2.0","method":"eth_subscribe","params":["newHeads"],"id":1}

# Should receive: {"jsonrpc":"2.0","result":"0x...","id":1}
# Then block notifications every ~12 seconds

If connection fails:

Upstream may not support WebSocket
API key may not have WebSocket access
Firewall blocking WebSocket connections

3. Disable WebSocket (Fallback to HTTP)

[upstreams.websocket]
enabled = false  # Disable WebSocket, use HTTP polling only

Note: HTTP polling is less efficient but more reliable

Missing Block Notifications

Symptoms:

Chain tip not updating in Prism
/health shows stale latest_block
Reorg detection not working

Check chain tip updates:

# Monitor health endpoint
watch -n 5 "curl -s http://localhost:3030/health | jq '.upstreams[0].latest_block'"

Solutions:

1. Verify WebSocket Subscription

Check logs for subscription confirmation:

journalctl -u prism | grep "subscribed to newHeads"

If no subscription message: WebSocket not connected

2. Enable HTTP Polling Fallback

[upstreams.health_check]
enabled = true
interval_seconds = 30      # Poll every 30 seconds
timeout_seconds = 10

How it helps:

Health checker polls eth_blockNumber periodically
Updates chain tip even if WebSocket fails
Detects rollbacks and reorgs

3. Check Firewall Rules

WebSocket requires outbound connections:

# Allow outbound HTTPS/WSS (port 443)
sudo ufw allow out 443/tcp

# Test connectivity
curl -v https://eth-mainnet.g.alchemy.com/v2/YOUR_KEY
# Should see "Connected to eth-mainnet.g.alchemy.com"

Error Codes Reference

JSON-RPC Standard Errors

| Code | Message | Description | Troubleshooting |
| --- | --- | --- | --- |
| -32700 | Parse error | Invalid JSON received | Check request syntax; ensure valid JSON |
| -32600 | Invalid Request | Request object malformed | Verify jsonrpc: "2.0", method, id fields |
| -32601 | Method not found | Method doesn't exist or unsupported | Check method name spelling; see supported methods |
| -32602 | Invalid params | Invalid method parameters | Verify parameter types and count |
| -32603 | Internal error | Internal server error | Check Prism logs; may be upstream issue |

Ethereum JSON-RPC Errors

| Code | Message | Description | Troubleshooting |
| --- | --- | --- | --- |
| -32000 | Server error | Generic server error | Check error message for details |
| -32001 | Resource not found | Requested resource doesn't exist | Block/transaction may not exist yet |
| -32002 | Resource unavailable | Resource temporarily unavailable | Retry after delay; upstream may be syncing |
| -32003 | Transaction rejected | Transaction wouldn't be accepted | Check transaction parameters |
| -32004 | Method not supported | Method not implemented | Method not supported by upstream |
| -32005 | Limit exceeded | Request exceeds defined limit | Reduce query range; add more upstreams |

Prism-Specific Errors

| Code | Message | Description | Troubleshooting |
| --- | --- | --- | --- |
| -32050 | No healthy upstreams | All upstreams unavailable | Check upstream configuration; verify API keys |
| -32051 | Circuit breaker open | Upstream circuit breaker open | Wait for recovery or restart; check upstream health |
| -32052 | Consensus failure | Upstreams disagree on response | Check upstream consistency; remove stale upstreams |
| -32053 | Rate limit exceeded | Request rate limit exceeded | Implement client rate limiting; increase limits |
| -32054 | Authentication failed | Invalid or missing API key | Include X-API-Key header; verify key is valid |
| -32055 | Method not allowed | API key lacks method permission | Update key permissions or use different key |

Upstream Provider Errors

Execution Errors (Client's fault, NOT penalized):

"execution reverted" - Smart contract reverted
"out of gas" - Transaction ran out of gas
"insufficient funds" - Account balance too low
"nonce too low" - Transaction nonce already used
"gas too low" - Gas limit too low for transaction

Provider Errors (Upstream's fault, triggers circuit breaker):

"Internal error" (-32603) - Upstream server error
"server error" (-32000) - Generic upstream error

Rate Limit Errors (Transient, retry on different upstream):

"Limit exceeded" (-32005) - Upstream rate limit
HTTP 429 - Too many requests
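These categories, together with the circuit breaker triggers listed earlier, can be sketched as a classifier. The codes and message fragments come from the lists above. The function name, its input fields (httpStatus, code, message, timedOut, malformed), and its return labels are hypothetical, and Prism's actual matching rules may differ. The sketch matches execution errors by message before it checks codes, because some clients report them with the generic -32000 code.

```javascript
// Sketch: sort an upstream failure into the categories listed above.
// Input fields and return labels are hypothetical, not Prism's API.
const EXECUTION_MESSAGES = [
  "execution reverted",
  "out of gas",
  "insufficient funds",
  "nonce too low",
  "gas too low",
];

function classifyUpstreamError({ httpStatus, code, message = "", timedOut = false, malformed = false }) {
  if (httpStatus === 429 || code === -32005) return "rate_limit"; // transient: retry on another upstream
  if (timedOut || malformed) return "provider"; // unresponsive upstream or unparseable response
  const msg = message.toLowerCase();
  // Match execution errors by message first: some clients send them with code -32000
  if (EXECUTION_MESSAGES.some((m) => msg.includes(m))) return "execution"; // client's fault
  if (code === -32600) return "client"; // malformed client request
  if (code === -32603 || code === -32000) return "provider"; // upstream's fault
  return "other";
}

// In this sketch only provider errors count toward opening the circuit breaker
const tripsCircuitBreaker = (err) => classifyUpstreamError(err) === "provider";
```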

Debug Strategies

General Debugging Workflow

1. Check Health Endpoint

```
curl http://localhost:3030/health | jq .
```

- Verify status: "healthy"
- Check upstream health and response times
- Verify cache statistics

2. Check Metrics

```
curl http://localhost:3030/metrics | grep -E '(error|unhealthy|circuit|reorg)'
```

- Look for error rate spikes
- Check circuit breaker states
- Review reorg activity

3. Enable Debug Logging

[logging]
level = "debug"  # or "trace" for extreme verbosity
format = "pretty"

4. Monitor Real-Time Logs

# Systemd
journalctl -u prism -f

# Docker
docker logs -f prism-container

# Direct binary
tail -f /var/log/prism/prism.log

Debugging Specific Issues

Debug Cache Misses

1. Enable verbose logging:

[logging]
level = "debug"

2. Look for cache decision logs:

journalctl -u prism | grep "cache"
# DEBUG cache hit method=eth_getBlockByNumber block=18500000
# DEBUG cache miss method=eth_getLogs range=18400000-18500000

3. Check cache status headers:

curl -D - http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x11a6e3b",false],"id":1}' \
  | grep -i cache

Possible values:

FULL: Complete cache hit
PARTIAL: Some data from cache, rest from upstream
MISS: No cached data, all from upstream
EMPTY: Cached empty result (e.g., no logs in range)

Debug Upstream Selection

1. Enable routing logs:

journalctl -u prism | grep "upstream selected"

2. Check upstream scoring:

curl -s http://localhost:3030/metrics | grep composite_score
# rpc_upstream_composite_score{upstream="alchemy"} 0.875
# rpc_upstream_composite_score{upstream="infura"} 0.720

3. Review selection reasons:

curl -s http://localhost:3030/metrics | grep upstream_selections
# rpc_upstream_selections_total{upstream="alchemy",reason="best_score"} 85000
# rpc_upstream_selections_total{upstream="infura",reason="fallback"} 15000

Debug Authentication Issues

1. Check authentication metrics:

curl -s http://localhost:3030/metrics | grep auth
# rpc_auth_success_total{key_id="production-api"} 125000
# rpc_auth_failure_total{key_id="unknown"} 45

2. Test with known valid key:

curl -X POST http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -H "X-API-Key: test-key-12345" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

3. Check key permissions:

# Review configuration
cat /etc/prism/config.toml | grep -A 10 "authentication.keys"

Debug Performance Issues

1. Profile request latency:

# Single request timing
time curl -X POST http://localhost:3030/ \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'

2. Check P50/P95/P99 latency:

histogram_quantile(0.50, rate(rpc_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(rpc_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))

3. Identify slow methods:

curl -s http://localhost:3030/metrics | grep duration_seconds_sum
# Calculate average: sum / count for each method
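You can script the sum/count division against a raw scrape. This sketch assumes the histogram is exported as rpc_request_duration_seconds with a method label (the label name is an assumption), and it adds up series that share a method. As with any raw counter, the result is an average since the counters last reset.

```javascript
// Sketch: average request duration per method from Prometheus histogram _sum/_count lines.
// Assumes series like: rpc_request_duration_seconds_sum{method="eth_getLogs"} 12.5
function averageDurationByMethod(metricsText) {
  const sums = new Map();
  const counts = new Map();
  const re = /^rpc_request_duration_seconds_(sum|count)\{[^}]*method="([^"]+)"[^}]*\}\s+(\S+)/;
  for (const line of metricsText.split("\n")) {
    const m = line.match(re);
    if (!m) continue; // skips _bucket lines, comments, and other metrics
    const target = m[1] === "sum" ? sums : counts;
    // Accumulate in case a method appears in several series (e.g. per upstream)
    target.set(m[2], (target.get(m[2]) || 0) + Number(m[3]));
  }
  const averages = {};
  for (const [method, sum] of sums) {
    const count = counts.get(method);
    if (count) averages[method] = sum / count; // seconds per request
  }
  return averages;
}
```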

4. Check upstream latency:

curl -s http://localhost:3030/metrics | grep upstream_latency
# rpc_upstream_latency_p99_ms{upstream="alchemy"} 350
# rpc_upstream_latency_p99_ms{upstream="infura"} 420

FAQ

General Questions

Q: What methods does Prism cache?

A: Prism caches the following methods:

eth_getBlockByHash - Block cache
eth_getBlockByNumber - Block cache
eth_getLogs - Log cache (with partial-range support)
eth_getTransactionByHash - Transaction cache
eth_getTransactionReceipt - Transaction cache

Not cached (forwarded to upstream):

eth_blockNumber - Always latest value
eth_chainId - Static value
eth_gasPrice - Changes frequently
eth_getBalance - Account-specific, changes with every transaction
eth_call - Depends on state, not safely cacheable

Q: How long does cached data stay valid?

A: Cache validity depends on block finality:

Finalized blocks (past finalized checkpoint): Cached forever
Safe blocks (beyond safety_depth from tip): Cached until reorg detected
Unsafe blocks (within safety_depth of tip): May be invalidated during reorgs
Default safety_depth: 12 blocks (~2.4 minutes)

Configuration:

[reorg_manager]
safety_depth = 12  # Blocks beyond this are "safe"
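The three tiers can be written as a classification function. This sketch treats a block as safe when it is at least safety_depth blocks below the tip. With safety_depth = 12 at tip 1000, block 988 is safe, consistent with the example later in this FAQ. It treats a block as finalized when it is at or below the finalized checkpoint reported by upstreams. The function itself is illustrative.

```javascript
// Sketch: classify a cached block by finality (illustrative, not Prism's code).
// finalizedBlock is the latest finalized checkpoint height reported by upstreams.
function classifyBlock(blockNumber, tip, finalizedBlock, safetyDepth = 12) {
  if (blockNumber <= finalizedBlock) return "finalized"; // cached indefinitely
  if (blockNumber <= tip - safetyDepth) return "safe"; // kept unless a reorg is detected
  return "unsafe"; // near the tip: may be invalidated by a reorg
}
```

At roughly 12 seconds per Ethereum mainnet block, 12 × 12 s = 144 s, which is the ~2.4 minutes quoted above.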

Q: Can I disable caching for specific methods?

A: Currently, caching cannot be disabled for individual methods. However, you can:

Disable all caching:

[cache]
enabled = false

Set cache sizes to zero (effectively disables those caches):

[cache.block_cache]
max_headers = 0
max_bodies = 0

[cache.log_cache]
max_chunks = 0

Use separate Prism instances (cached vs non-cached)

Q: How many upstreams should I configure?

A: Recommended:

Minimum: 2 upstreams for redundancy
Optimal: 3-4 upstreams for best reliability and performance
Maximum: No hard limit, but 5+ may have diminishing returns

Why multiple upstreams:

Redundancy: If one fails, others handle requests
Load distribution: Spread requests across providers
Rate limit mitigation: Switch to different upstream when one is rate-limited
Latency optimization: Route requests to fastest upstream

Q: What happens during a blockchain reorg?

A: Prism handles reorgs automatically:

Detection: WebSocket or health check detects that the chain has reorganized (a known block height now has a different hash)
Invalidation: Affected blocks removed from cache
Refetch: Next request fetches fresh data from upstream
Consistency: Clients always receive canonical chain data

Example:

Before:  Block 1000 (hash 0xAAA) cached
Reorg:   Block 1000 now has hash 0xBBB
Action:  Block 1000 invalidated from cache
Result:  Next request fetches new block 1000 (0xBBB) from upstream

Configuration Questions

Q: What's the difference between `safety_depth` and `max_reorg_depth`?

safety_depth: Blocks beyond this from tip are considered "safe" from reorgs
- Default: 12 blocks
- Used for: Cache retention decisions, finality classification
- Example: With safety_depth=12 at tip 1000, blocks <= 988 are "safe"
max_reorg_depth: Maximum blocks to search backwards for reorg divergence
- Default: 100 blocks
- Used for: Limit reorg detection scope, prevent excessive cache invalidation
- Example: Reorg affecting 150 blocks will only invalidate last 100

Configuration:

[reorg_manager]
safety_depth = 12        # Finality threshold
max_reorg_depth = 100    # Reorg search limit
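A divergence search bounded by max_reorg_depth might look like the following sketch. It walks backwards from the tip and compares each cached hash with the canonical hash. It stops at the first match (the common ancestor) and never looks further back than max_reorg_depth blocks. Here, cachedHashes and canonicalHash are hypothetical stand-ins for Prism's block cache and an upstream lookup such as eth_getBlockByNumber.

```javascript
// Sketch: find cached blocks to invalidate after a reorg, searching back at most
// maxReorgDepth blocks from the tip. Returns block numbers, newest first.
// cachedHashes: Map<blockNumber, hash>; canonicalHash: (n) => hash on the new chain.
function blocksToInvalidate(cachedHashes, canonicalHash, tip, maxReorgDepth = 100) {
  const stale = [];
  const lowest = Math.max(0, tip - maxReorgDepth + 1);
  for (let n = tip; n >= lowest; n--) {
    const cached = cachedHashes.get(n);
    if (cached === undefined) continue; // nothing cached at this height
    if (cached === canonicalHash(n)) break; // common ancestor: older blocks are intact
    stale.push(n);
  }
  return stale;
}
```

In this sketch, a 150-block reorg with max_reorg_depth = 100 yields only the newest 100 blocks, matching the example above.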

Q: How do I optimize for lowest latency?

A: Latency optimization checklist:

Enable caching:

[cache]
enabled = true

[cache.block_cache]
hot_window_size = 500  # Cache recent blocks

Use fast upstreams:

[[upstreams]]
name = "quicknode"
url = "https://your-endpoint.quiknode.pro/KEY/"
priority = 10  # Higher priority

Enable hedging:

[routing]
strategy = "hedge"

[routing.hedge]
hedge_delay_ms = 50  # Send backup request after 50ms

Geographic proximity: Deploy Prism in the same region as your upstreams
Connection pooling:

[upstreams]
max_concurrent_requests = 200

Monitor latency:

histogram_quantile(0.99, rate(rpc_request_duration_seconds_bucket[5m]))

Q: How do I optimize for highest throughput?

A: Throughput optimization checklist:

Add more upstreams:

# 4-5 upstreams for load distribution
[[upstreams]]
name = "alchemy"
# ...

[[upstreams]]
name = "infura"
# ...

Increase concurrency:

[upstreams]
max_concurrent_requests = 500  # Per upstream

[server]
max_connections = 10000

Use load balancing:

[routing]
strategy = "least_loaded"

Enable batch requests: Use JSON-RPC batch format
Maximize caching:

[cache.block_cache]
hot_window_size = 1000
max_headers = 100000

Troubleshooting Questions

Q: Why is my cache hit rate so low?

A: Common causes:

Random historical queries: Queries to random old blocks bypass hot window cache
- Solution: Focus queries on recent blocks (last 200 blocks)
Cache size too small: Frequent evictions reduce hit rate
- Solution: Increase cache sizes in configuration
Wide log query ranges: Large block ranges cause partial misses
- Solution: Use smaller ranges or increase chunk_size
Reorgs invalidating cache: Frequent reorgs clear cached blocks
- Solution: Increase safety_depth, check for reorg storms
Cache disabled: Check cache.enabled = true

Debug:

# Check cache hit rate
curl -s http://localhost:3030/metrics | grep -E 'cache_(hits|misses)_total'

# PromQL
rate(rpc_cache_hits_total[5m]) /
  (rate(rpc_cache_hits_total[5m]) + rate(rpc_cache_misses_total[5m]))

Q: How do I know if an upstream is slow or failing?

A: Check these metrics:

Health status:

curl http://localhost:3030/health | jq '.upstreams[] | select(.name=="alchemy")'

Latency metrics:

curl -s http://localhost:3030/metrics | grep 'upstream_latency.*alchemy'
# rpc_upstream_latency_p99_ms{upstream="alchemy"} 850  # High latency

Error counts:

curl -s http://localhost:3030/metrics | grep 'upstream_errors.*alchemy'
# rpc_upstream_errors_total{upstream="alchemy",error_type="timeout"} 125

Circuit breaker state:

curl -s http://localhost:3030/metrics | grep 'circuit_breaker_state.*alchemy'
# rpc_circuit_breaker_state{upstream="alchemy"} 1  # Open = failing

Warning signs:

P99 latency > 1 second
Error rate > 5%
Circuit breaker open
Block lag > 10 blocks behind tip
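The warning signs translate directly into threshold checks. The sketch below assumes you have already gathered per-upstream figures from /health and /metrics. The input field names are hypothetical.

```javascript
// Sketch: check one upstream's figures against the warning-sign thresholds above.
// Input field names are hypothetical; gather the values from /health and /metrics.
function upstreamWarnings({ p99LatencyMs, errors, requests, circuitBreakerState, blocksBehindTip }) {
  const warnings = [];
  if (p99LatencyMs > 1000) warnings.push("P99 latency above 1 second");
  if (requests > 0 && errors / requests > 0.05) warnings.push("error rate above 5%");
  if (circuitBreakerState === 1) warnings.push("circuit breaker open"); // 1 = open, as in the metric
  if (blocksBehindTip > 10) warnings.push("more than 10 blocks behind tip");
  return warnings;
}
```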

Q: Why do I see "Circuit breaker is open" errors?

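A: A circuit breaker opens when its upstream reaches the configured number of consecutive failures (failure_threshold, default 5).

Immediate fix: Wait for the circuit breaker timeout (default 60 seconds), after which it moves to half-open and lets requests through to test recovery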

Long-term solutions:

Fix upstream issues:
- Check API key validity
- Verify network connectivity
- Check provider status page
Adjust circuit breaker settings:

[upstreams.circuit_breaker]
failure_threshold = 10    # Increase from 5
timeout_seconds = 30      # Decrease from 60

Add redundant upstreams: More upstreams reduce the impact of any single upstream failing
Monitor recovery:

# Watch for half-open transition
journalctl -u prism -f | grep -i "circuit breaker"
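
As an alternative to reading logs, you can poll the rpc_circuit_breaker_state metric until the breaker is no longer open. The sketch below treats any value other than 1 as "not open". This section documents only 1 = open, so how Prism encodes the closed and half-open states is an assumption. The helper names and the PRISM_METRICS_URL override are hypothetical. The 5-second poll interval and 90-second default limit are arbitrary choices.

```shell
# breaker_state NAME
# Prints the rpc_circuit_breaker_state value for one upstream from
# Prometheus-format text on stdin; exits 1 if the upstream is not listed.
breaker_state() {
  awk -v up="$1" '
    /^#/ { next }
    {
      name = $1
      sub(/[{].*/, "", name)
      if (name == "rpc_circuit_breaker_state" && index($1, "upstream=\"" up "\"") > 0) {
        print $2
        found = 1
      }
    }
    END { exit (found ? 0 : 1) }
  '
}

# wait_for_breaker NAME [MAX_SECONDS]
# Polls /metrics every 5 seconds until the breaker for NAME is no longer
# reported as 1 (open), or until MAX_SECONDS have passed.
# The 90-second default leaves headroom over the 60-second default timeout.
wait_for_breaker() {
  local up="$1" max="${2:-90}" waited=0 state
  local url="${PRISM_METRICS_URL:-http://localhost:3030/metrics}"
  while [ "$waited" -lt "$max" ]; do
    state=$(curl -s "$url" | breaker_state "$up")
    if [ -n "$state" ] && [ "$state" != "1" ]; then
      echo "$up: breaker state $state (no longer open) after ${waited}s"
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "$up: breaker still open after ${max}s" >&2
  return 1
}

# Usage:
#   wait_for_breaker alchemy 120
```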

Q: How can I reduce API costs?

A: Cost reduction strategies:

Maximize caching:

[cache]
enabled = true

[cache.block_cache]
hot_window_size = 1000
max_headers = 100000

[cache.log_cache]
max_chunks = 500

Use tiered upstreams:

# Free tier for cacheable queries
[[upstreams]]
name = "public-rpc"
url = "https://eth.public-rpc.com"
priority = 1  # Low priority

# Paid tier for non-cacheable queries
[[upstreams]]
name = "alchemy-paid"
url = "https://eth-mainnet.g.alchemy.com/v2/KEY"
priority = 10  # High priority

Implement client-side batching: Combine multiple JSON-RPC calls into a single HTTP request to reduce per-request overhead
Rate limit clients: Cap per-key request rates to prevent abuse

[authentication.keys]
rate_limit_per_second = 50

Monitor request distribution:

# Check which series receive the most requests (highest count first)
curl -s http://localhost:3030/metrics | grep '^rpc_requests_total' | sort -k2 -rn
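
Sorting raw lines ranks each label combination separately (for example, one line per method and upstream). Summing per method gives a clearer picture of where requests go. This sketch assumes the method is exposed as a label named method on rpc_requests_total. Adjust the pattern if your build uses a different label. requests_by_method is a hypothetical helper.

```shell
# requests_by_method
# Sums rpc_requests_total per method across all other labels (for example,
# per-upstream series) and prints "COUNT METHOD", busiest first.
# ASSUMPTION: the method is exposed as a label named "method"; series
# without it are grouped under "(none)".
requests_by_method() {
  awk '
    /^#/ { next }
    {
      name = $1
      sub(/[{].*/, "", name)
      if (name != "rpc_requests_total") next
      m = "(none)"
      # Match the method label at a label boundary ({ or ,) and pull out
      # the quoted value: 9 chars of prefix, 1 closing quote.
      if (match($1, /[{,]method="[^"]*"/))
        m = substr($1, RSTART + 9, RLENGTH - 10)
      count[m] += $2
    }
    END { for (m in count) printf "%d %s\n", count[m], m }
  ' | sort -rn
}

# Usage:
#   curl -s http://localhost:3030/metrics | requests_by_method | head
```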

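For the client-side batching item above, a JSON-RPC 2.0 batch is a JSON array of ordinary request objects sent in one HTTP POST. The sketch below builds such a batch for methods that take no parameters. It assumes Prism serves JSON-RPC at the root path on port 3030. The helper names and the PRISM_RPC_URL override are hypothetical. If API key authentication is enabled, add your key the same way you do for single requests.

```shell
# batch_payload METHOD...
# Builds one JSON-RPC 2.0 batch: a JSON array holding one request object
# per method, with ids numbered from 1. Only suitable for methods that take
# no parameters (eth_blockNumber, eth_chainId, eth_gasPrice, ...).
batch_payload() {
  local id=0 sep="" body="[" method
  for method in "$@"; do
    id=$((id + 1))
    body="${body}${sep}{\"jsonrpc\":\"2.0\",\"id\":${id},\"method\":\"${method}\",\"params\":[]}"
    sep=","
  done
  printf '%s]\n' "$body"
}

# send_batch METHOD...
# Sends the whole batch to Prism in a single HTTP POST.
# ASSUMPTION: Prism serves JSON-RPC at the root path of port 3030;
# PRISM_RPC_URL is a hypothetical override, not a Prism setting.
send_batch() {
  curl -s "${PRISM_RPC_URL:-http://localhost:3030/}" \
    -H 'Content-Type: application/json' \
    --data "$(batch_payload "$@")"
}

# Usage (three calls, one round trip):
#   send_batch eth_blockNumber eth_chainId eth_gasPrice
```

JSON-RPC 2.0 allows the responses in a batch to come back in any order, so match results by id rather than by position. Whether batching also lowers your provider bill depends on how the provider counts calls inside a batch.
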
Still having issues? Join our community:

GitHub Issues: https://github.com/your-org/prism/issues
Discord: https://discord.gg/prism
Documentation: https://docs.prism.sh

Next: Explore Monitoring & Observability for proactive issue detection.

Last updated 3 months ago
