Troubleshooting & FAQ

Comprehensive troubleshooting guide for diagnosing and resolving common Prism issues.


Connection & Upstream Issues

No Healthy Upstreams Available

Error: "No healthy upstreams available" (Error code: -32050)

Symptoms:

  • All requests fail with -32050 error

  • /health endpoint returns "status": "unhealthy"

  • Metrics show rpc_healthy_upstreams == 0

Causes & Solutions:

1. All Upstreams Failing Health Checks

Check health status:

Check metrics:
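Both checks can be scripted. The sketch below is a minimal example, not Prism's own tooling: it parses Prometheus text-format output and reads the rpc_healthy_upstreams gauge named in the symptoms above. The helper name read_metric and the sample text are hypothetical; in practice the text would come from whatever metrics endpoint your deployment exposes.

```python
def read_metric(metrics_text: str, name: str) -> float:
    """Sum all samples of `name` found in Prometheus text exposition output."""
    total = 0.0
    seen = False
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labeled sample: name{label="value"} <value> [timestamp]
            head, _, rest = line.rpartition("}")
            metric = head.split("{", 1)[0]
        else:
            # Unlabeled sample: name <value> [timestamp]
            metric, _, rest = line.partition(" ")
        fields = rest.split()
        if metric == name and fields:
            total += float(fields[0])
            seen = True
    if not seen:
        raise KeyError(f"metric {name!r} not found")
    return total


# Hypothetical sample of what a metrics scrape might contain.
SAMPLE_METRICS = """\
# HELP rpc_healthy_upstreams Number of healthy upstreams
# TYPE rpc_healthy_upstreams gauge
rpc_healthy_upstreams 0
"""

healthy = read_metric(SAMPLE_METRICS, "rpc_healthy_upstreams")
if healthy == 0:
    print("No healthy upstreams: expect -32050 on every request")
```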

Solutions:

  • Verify upstream URLs are correct in configuration

  • Check network connectivity to upstream providers

  • Verify API keys are valid and not rate-limited

  • Check upstream provider status pages

2. Incorrect Configuration

Check configuration:

Common mistakes:

  • Missing or invalid API key in URL

  • Wrong chain ID (e.g., mainnet vs testnet mismatch)

  • HTTP instead of HTTPS

  • Incorrect endpoint path

3. Circuit Breakers All Open

Check circuit breaker state:

Solution: Wait for circuit breakers to reset, or restart Prism to reset state:

Reduce sensitivity:


Upstream Connection Timeouts

Error: "Request timeout" or "Connection failed: connection timeout"

Symptoms:

  • Requests take 30+ seconds and then time out

  • High P99 latency in metrics

  • Upstream error metrics show error_type="timeout"

Check timeout metrics:

Solutions:

1. Increase Timeout Values

2. Check Network Latency

If latency > 5 seconds:

  • Network issue between your server and upstream provider

  • Try different upstream providers

  • Use providers geographically closer to your server

3. Reduce Concurrent Load


HTTP 429 Rate Limiting

Error: "HTTP error: 429" or RPC error -32005: "Limit exceeded"

Symptoms:

  • Intermittent failures during high traffic

  • Error rate spikes in metrics

  • Upstream provider returns "Too Many Requests"

Check rate limit errors:

Solutions:

1. Add More Upstreams

2. Enable Caching to Reduce Upstream Calls

Verify cache effectiveness:

3. Implement Rate Limiting at Prism Level


Cache Problems

Low Cache Hit Rate

Symptoms:

  • Cache hit rate < 70% consistently

  • Most requests show X-Cache-Status: MISS

  • High upstream request counts

Check cache hit rate:

PromQL query:

Solutions:

1. Increase Cache Sizes

2. Adjust Chunk Size for Log Cache

Trade-off:

  • Smaller chunk_size (10-50): Better partial cache hits, more memory

  • Larger chunk_size (100-500): Less memory, may miss partial ranges
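To make the trade-off concrete, the sketch below lists the chunks a log query would touch for a given chunk_size. It assumes chunks are aligned to multiples of chunk_size, which is a common design but an assumption about Prism's log cache; chunks_for_range is a hypothetical helper.

```python
def chunks_for_range(from_block: int, to_block: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return the aligned (start, end) chunks covering an inclusive block range."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if from_block > to_block:
        raise ValueError("from_block must not exceed to_block")
    first = from_block // chunk_size
    last = to_block // chunk_size
    return [(i * chunk_size, (i + 1) * chunk_size - 1) for i in range(first, last + 1)]


# The same 250-block query under two chunk sizes.
small = chunks_for_range(1000, 1249, 50)   # five chunks, ending exactly at 1249
large = chunks_for_range(1000, 1249, 100)  # three chunks, the last one runs to 1299
print(len(small), len(large))
```

With the smaller size, more chunk entries are tracked, but each one can be reused independently by overlapping queries. With the larger size, the last chunk extends past the requested range, so it can only be served from cache once the whole chunk has been fetched.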

3. Enable Cache Warming

Warm cache with recent data on startup:

4. Check Request Patterns

Problem: Random historical queries bypass cache

Example: Querying random old blocks

Solution:

  • Use recent blocks when possible (last 200 blocks have highest hit rate)

  • For historical queries, batch sequential ranges to benefit from cache


High Cache Eviction Rate

Symptoms:

  • rpc_cache_evictions_total increasing rapidly

  • Cache hit rate declining over time

  • Memory pressure on system

Check eviction metrics:

Solutions:

1. Increase Cache Memory Limits

2. Adjust Hot Window Size

3. Check Memory Usage

If memory constrained:

  • Reduce cache sizes

  • Add more RAM to server

  • Enable cache compression (if available)


Cache Invalidation After Reorgs

Symptoms:

  • Cache hit rate drops suddenly

  • Logs show "reorg detected" messages

  • Blocks being refetched repeatedly

Check reorg metrics:

Expected behavior: Cache invalidates blocks during reorgs to maintain consistency

Solutions:

1. Adjust Safety Depth

Trade-off:

  • Larger safety_depth: Less cache invalidation, but stale data during deep reorgs

  • Smaller safety_depth: More aggressive invalidation, always fresh data

2. Check for Reorg Storms

Frequent reorgs indicate:

  • Network issues with upstreams

  • Misconfigured chain_id (wrong network)

  • Upstream providers disagreeing on chain state

Verify chain consistency:

If upstreams disagree by > 10 blocks:

  • One or more upstreams may be syncing

  • Consider removing slow/stale upstreams
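The comparison above can be automated once you have each upstream's latest block number. A minimal sketch, assuming you have already collected the heights (for example via eth_blockNumber); the upstream names and the helper find_stale_upstreams are hypothetical, and the 10-block threshold comes from the guidance above.

```python
def find_stale_upstreams(tips: dict[str, int], max_lag: int = 10) -> list[str]:
    """Return upstreams whose latest block trails the highest tip by more than max_lag."""
    if not tips:
        return []
    best = max(tips.values())
    return sorted(name for name, height in tips.items() if best - height > max_lag)


# Hypothetical heights reported by three upstreams.
tips = {"provider-a": 1000, "provider-b": 998, "provider-c": 950}
stale = find_stale_upstreams(tips)
print("stale upstreams:", stale)
```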

3. Enable Reorg Coalescing


Authentication Errors

Invalid or Missing API Key

Error: -32054: "Authentication failed" or -32600: "Invalid Request"

Symptoms:

  • All requests rejected with authentication error

  • No X-API-Key header in requests

Solutions:

1. Include API Key in Request

Header method:
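A minimal sketch of the header method using only the Python standard library. The X-API-Key header name comes from this guide; the URL, the placeholder key, and the helper name build_request are assumptions to adapt to your deployment.

```python
import json
import urllib.request


def build_request(url: str, api_key: str, method: str, params: list) -> urllib.request.Request:
    """Build a JSON-RPC POST request that carries the API key in the X-API-Key header."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


# Hypothetical address and key; replace with your own.
req = build_request("http://localhost:3030/", "YOUR_API_KEY", "eth_blockNumber", [])
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```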

Query parameter method:

2. Verify API Key Configuration

3. Check Authentication Metrics


Method Not Allowed for API Key

Error: -32055: "Method not allowed"

Symptoms:

  • Some methods work, others return permission error

  • Authentication succeeds but specific RPC calls fail

Check key permissions:

Solution: Update key permissions or use appropriate key:


Rate Limit Exceeded

Error: -32053: "Rate limit exceeded"

Symptoms:

  • Requests succeed initially, then fail during high traffic

  • Rate limit metrics show rejections

Check rate limit metrics:

Solutions:

1. Increase Rate Limit

2. Implement Client-Side Rate Limiting
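One common way to do this on the client is a token bucket. The sketch below is a generic implementation, not part of Prism; the class name, the rate, and the capacity are illustrative assumptions. The clock is injectable so the behavior can be tested without sleeping.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second on average, with bursts of `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False when the caller should back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Demonstration with a fake clock: 2 requests/second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # third call exceeds the burst
fake_now[0] = 0.5                                 # half a second refills one token
after_wait = bucket.try_acquire()
```

Call try_acquire() before each request and delay or drop the request when it returns False, so the client stays under the limit instead of receiving -32053.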

3. Use Multiple API Keys

Distribute load across multiple API keys:


Performance Problems

High Request Latency (P99 > 1s)

Symptoms:

  • Slow response times for clients

  • High P99 latency in metrics

  • Timeouts during peak traffic

Check latency metrics:

PromQL query:

Solutions:

1. Enable Caching

Verify cache is working:

2. Add Faster Upstreams

3. Enable Request Hedging

How hedging helps:

  • Sends request to primary upstream

  • If no response in 100ms, sends to backup upstream

  • Returns whichever responds first

  • Reduces tail latency
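The steps above can be sketched with asyncio. This illustrates the hedging technique itself, not Prism's internal implementation; hedged_call and fake_upstream are hypothetical names, and the 100 ms default mirrors the delay mentioned above.

```python
import asyncio


async def hedged_call(primary, backup, hedge_delay: float = 0.1):
    """Call `primary`; if it has not answered within hedge_delay seconds, also call
    `backup`. Return the first result to arrive and cancel the slower call."""
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay)
    if done:
        return first.result()  # primary answered before the hedge fired
    second = asyncio.create_task(backup())
    done, pending = await asyncio.wait({first, second}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # stop waiting on the slower upstream
    # If the first call to finish failed, its exception is raised here.
    return done.pop().result()


def fake_upstream(name: str, latency: float):
    """Stand-in for an upstream request that answers after `latency` seconds."""
    async def call():
        await asyncio.sleep(latency)
        return name
    return call


winner = asyncio.run(
    hedged_call(fake_upstream("primary", 0.3), fake_upstream("backup", 0.01), hedge_delay=0.05)
)
print("answered by:", winner)
```

Hedging trades extra upstream requests for lower tail latency, so it can increase usage against providers that bill per request.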

4. Optimize Connection Pooling


Low Throughput (RPS < Expected)

Symptoms:

  • Cannot achieve desired requests per second

  • Concurrent requests queue up

  • CPU or network not saturated

Check throughput metrics:

PromQL query:

Solutions:

1. Increase Concurrency Limits

2. Add More Upstreams

3. Enable Load Balancing

4. Use Batch Requests

Instead of:

Do this:
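A minimal sketch of the difference, using the standard JSON-RPC 2.0 batch format (a JSON array of request objects). The block numbers and the helper names rpc_call and match_responses are illustrative.

```python
import json


def rpc_call(request_id: int, method: str, params: list) -> dict:
    """Build one JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


blocks = ["0x10", "0x11", "0x12"]

# Instead of: three bodies, each sent as its own HTTP request.
individual = [
    json.dumps(rpc_call(i, "eth_getBlockByNumber", [block, False]))
    for i, block in enumerate(blocks, start=1)
]

# Do this: one body containing an array, sent as a single HTTP request.
batch = json.dumps(
    [rpc_call(i, "eth_getBlockByNumber", [block, False]) for i, block in enumerate(blocks, start=1)]
)


def match_responses(batch_response: list[dict]) -> dict:
    """Index batch responses by id, since the spec allows them in any order."""
    return {item["id"]: item for item in batch_response}
```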


Circuit Breaker Issues

Circuit Breaker Stuck Open

Symptoms:

  • Upstream marked healthy but circuit breaker stays open

  • Requests continue to fail with "Circuit breaker is open"

  • Circuit breaker state metric shows 1 (open) for extended period

Check circuit breaker state:

Check transition history:

Solutions:

1. Wait for Automatic Recovery

Circuit breaker will transition to half-open after timeout:

Check when circuit will recover:
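The recovery timing can be modeled as a small state machine. This is a sketch of the general circuit breaker pattern under stated assumptions, not Prism's code: the 60-second timeout matches the recovery time quoted in the FAQ, while failure_threshold=5 and every class and method name are hypothetical.

```python
import time


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a timeout."""

    def __init__(self, failure_threshold: int = 5, open_timeout: float = 60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.open_timeout:
            return "half-open"
        return "open"

    def seconds_until_half_open(self) -> float:
        """Time left before a trial request is allowed (0 when not open)."""
        if self.opened_at is None:
            return 0.0
        return max(0.0, self.open_timeout - (self.clock() - self.opened_at))

    def record_failure(self) -> None:
        state = self.state()
        self.failures += 1
        # A failed trial in half-open reopens the breaker and restarts the timer.
        if state == "half-open" or (state == "closed" and self.failures >= self.failure_threshold):
            self.opened_at = self.clock()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```

If an upstream keeps failing its trial request in the half-open state, the breaker reopens each time, which looks like a breaker that is "stuck" open even though the timeout is working as designed.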

2. Reduce Circuit Breaker Sensitivity

3. Restart Prism (Last Resort)


Circuit Breaker Opens Too Easily

Symptoms:

  • Circuit breaker opens during normal operation

  • Temporary errors trigger circuit breaker

  • Frequent open/close cycles

Check failure threshold:

Solutions:

1. Increase Failure Threshold

2. Add Retry Logic

3. Adjust Error Classification

Review logs to identify error types:

Error types that trigger circuit breaker:

  • Provider errors (-32603: Internal error)

  • Parse errors (malformed responses)

  • Timeouts (upstream unresponsive)

Error types that DON'T trigger circuit breaker:

  • Client errors (-32600: Invalid Request)

  • Rate limits (-32005: Limit exceeded)

  • Execution errors (transaction reverts)


Reorg Handling Issues

Cache Contains Stale Data After Reorg

Symptoms:

  • Queries return different data than upstream

  • Block hashes don't match expected values

  • Transactions show incorrect status

Check reorg detection:

Verify cache invalidation:

Solutions:

1. Verify Reorg Detection Is Working

Check health endpoint:

If all upstreams show same tip: Reorg detection is working

If upstreams disagree: May indicate reorg in progress

2. Reduce Safety Depth

3. Clear Cache Manually

Restart Prism to clear all caches:

Or use cache clear endpoint (if implemented):


Too Many Reorg Detections

Symptoms:

  • rpc_reorgs_detected_total increasing rapidly

  • Frequent cache invalidation

  • Low cache hit rate due to constant invalidation

Check reorg frequency:

PromQL query:

Causes & Solutions:

1. Upstreams Reporting Different Chain States

Symptom: Upstreams disagree on current tip or block hashes

Check upstream consistency:

If blocks differ by > 5: One or more upstreams may be syncing or stale

Solution: Remove stale upstreams from configuration

2. WebSocket Reconnections Causing False Reorgs

Check WebSocket metrics:

Solution: Improve WebSocket stability

3. Network Issues Between Prism and Upstreams

Symptom: Intermittent connectivity causes missed block notifications

Solution:

  • Move Prism closer to upstream providers (same region)

  • Use more reliable network connection

  • Enable WebSocket fallback to HTTP polling


WebSocket Connection Failures

WebSocket Disconnects Repeatedly

Symptoms:

  • Frequent "WebSocket disconnected" in logs

  • High reconnection count in metrics

  • Missing block notifications

Check WebSocket status:

Solutions:

1. Increase Reconnection Delay

2. Check Upstream WebSocket Support

Test WebSocket connection manually:

If connection fails:

  • Upstream may not support WebSocket

  • API key may not have WebSocket access

  • Firewall blocking WebSocket connections

3. Disable WebSocket (Fallback to HTTP)

Note: HTTP polling is less efficient but more reliable


Missing Block Notifications

Symptoms:

  • Chain tip not updating in Prism

  • /health shows stale latest_block

  • Reorg detection not working

Check chain tip updates:

Solutions:

1. Verify WebSocket Subscription

Check logs for subscription confirmation:

If no subscription message: WebSocket not connected

2. Enable HTTP Polling Fallback

How it helps:

  • Health checker polls eth_blockNumber periodically

  • Updates chain tip even if WebSocket fails

  • Detects rollbacks and reorgs
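The polling behavior described above can be sketched as follows. This is an illustration, not Prism's health checker: TipTracker and parse_quantity are hypothetical names, and the fetch function is injected so the example runs without a network. Comparing heights alone only catches rollbacks; detecting a reorg at the same height additionally requires comparing block hashes.

```python
from typing import Callable, Optional


def parse_quantity(hex_value: str) -> int:
    """Convert a JSON-RPC hex quantity such as "0x64" to an integer."""
    return int(hex_value, 16)


class TipTracker:
    """Track the chain tip from periodic eth_blockNumber polls."""

    def __init__(self, fetch_block_number: Callable[[], int]):
        self.fetch = fetch_block_number
        self.tip: Optional[int] = None

    def poll_once(self) -> str:
        latest = self.fetch()
        previous, self.tip = self.tip, latest
        if previous is None:
            return "initial"
        if latest < previous:
            return "rollback"   # tip moved backwards: cached blocks above it are suspect
        if latest == previous:
            return "unchanged"
        return "advanced"


# Simulated poll results, as hex quantities an upstream might return.
responses = iter(["0x64", "0x65", "0x65", "0x63"])
tracker = TipTracker(lambda: parse_quantity(next(responses)))
events = [tracker.poll_once() for _ in range(4)]
print(events)
```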

3. Check Firewall Rules

WebSocket requires outbound connections:


Error Codes Reference

JSON-RPC Standard Errors

| Code | Message | Description | Troubleshooting |
| --- | --- | --- | --- |
| -32700 | Parse error | Invalid JSON received | Check request syntax; ensure valid JSON |
| -32600 | Invalid Request | Request object malformed | Verify jsonrpc: "2.0", method, id fields |
| -32601 | Method not found | Method doesn't exist or unsupported | Check method name spelling; see supported methods |
| -32602 | Invalid params | Invalid method parameters | Verify parameter types and count |
| -32603 | Internal error | Internal server error | Check Prism logs; may be upstream issue |

Ethereum JSON-RPC Errors

| Code | Message | Description | Troubleshooting |
| --- | --- | --- | --- |
| -32000 | Server error | Generic server error | Check error message for details |
| -32001 | Resource not found | Requested resource doesn't exist | Block/transaction may not exist yet |
| -32002 | Resource unavailable | Resource temporarily unavailable | Retry after delay; upstream may be syncing |
| -32003 | Transaction rejected | Transaction wouldn't be accepted | Check transaction parameters |
| -32004 | Method not supported | Method not implemented | Method not supported by upstream |
| -32005 | Limit exceeded | Request exceeds defined limit | Reduce query range; add more upstreams |

Prism-Specific Errors

| Code | Message | Description | Troubleshooting |
| --- | --- | --- | --- |
| -32050 | No healthy upstreams | All upstreams unavailable | Check upstream configuration; verify API keys |
| -32051 | Circuit breaker open | Upstream circuit breaker open | Wait for recovery or restart; check upstream health |
| -32052 | Consensus failure | Upstreams disagree on response | Check upstream consistency; remove stale upstreams |
| -32053 | Rate limit exceeded | Request rate limit exceeded | Implement client rate limiting; increase limits |
| -32054 | Authentication failed | Invalid or missing API key | Include X-API-Key header; verify key is valid |
| -32055 | Method not allowed | API key lacks method permission | Update key permissions or use different key |

Upstream Provider Errors

Execution Errors (Client's fault, NOT penalized):

  • "execution reverted" - Smart contract reverted

  • "out of gas" - Transaction ran out of gas

  • "insufficient funds" - Account balance too low

  • "nonce too low" - Transaction nonce already used

  • "gas too low" - Gas limit too low for transaction

Provider Errors (Upstream's fault, triggers circuit breaker):

  • "Internal error" (-32603) - Upstream server error

  • "server error" (-32000) - Generic upstream error

Rate Limit Errors (Transient, retry on different upstream):

  • "Limit exceeded" (-32005) - Upstream rate limit

  • HTTP 429 - Too many requests
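The three categories above can be expressed as a small classifier, which is handy when reading logs or writing client-side retry logic. This is a sketch based only on the codes and messages listed in this section, not Prism's actual classification code; checking the message before the code is an assumption, made because nodes often report execution errors under the generic -32000 code.

```python
EXECUTION_MARKERS = (
    "execution reverted",
    "out of gas",
    "insufficient funds",
    "nonce too low",
    "gas too low",
)
RATE_LIMIT_CODES = {-32005}
PROVIDER_CODES = {-32603, -32000}


def classify_upstream_error(code=None, message: str = "", http_status=None) -> str:
    """Return "execution", "rate_limit", "provider", or "other"."""
    text = message.lower()
    if any(marker in text for marker in EXECUTION_MARKERS):
        return "execution"    # client's fault: upstream is not penalized
    if http_status == 429 or code in RATE_LIMIT_CODES:
        return "rate_limit"   # transient: retry on a different upstream
    if code in PROVIDER_CODES:
        return "provider"     # upstream's fault: counts toward the circuit breaker
    return "other"


def trips_circuit_breaker(category: str) -> bool:
    return category == "provider"


print(classify_upstream_error(-32000, "execution reverted: not owner"))
print(classify_upstream_error(-32603, "Internal error"))
```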


Debug Strategies

General Debugging Workflow

  1. Check Health Endpoint

    • Verify status: "healthy"

    • Check upstream health and response times

    • Verify cache statistics

  2. Check Metrics

    • Look for error rate spikes

    • Check circuit breaker states

    • Review reorg activity

  3. Enable Debug Logging

  4. Monitor Real-Time Logs

Debugging Specific Issues

Debug Cache Misses

1. Enable verbose logging:

2. Look for cache decision logs:

3. Check cache status headers:

Possible values:

  • FULL: Complete cache hit

  • PARTIAL: Some data from cache, rest from upstream

  • MISS: No cached data, all from upstream

  • EMPTY: Cached empty result (e.g., no logs in range)
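When sampling many responses, it helps to tally these header values. A minimal sketch, assuming you have already collected the X-Cache-Status values from your responses; counting FULL and EMPTY as hits and reporting PARTIAL separately is an interpretation of the definitions above, and summarize_cache_status is a hypothetical helper.

```python
from collections import Counter


def summarize_cache_status(statuses: list[str]) -> dict[str, float]:
    """Return the share of full hits, partial hits, and misses among header values."""
    counts = Counter(status.strip().upper() for status in statuses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no statuses to summarize")
    return {
        "hit": (counts["FULL"] + counts["EMPTY"]) / total,  # served entirely from cache
        "partial": counts["PARTIAL"] / total,
        "miss": counts["MISS"] / total,
    }


# Hypothetical header values collected from five responses.
summary = summarize_cache_status(["FULL", "FULL", "MISS", "PARTIAL", "EMPTY"])
print(summary)
```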

Debug Upstream Selection

1. Enable routing logs:

2. Check upstream scoring:

3. Review selection reasons:

Debug Authentication Issues

1. Check authentication metrics:

2. Test with known valid key:

3. Check key permissions:

Debug Performance Issues

1. Profile request latency:

2. Check P50/P95/P99 latency:

3. Identify slow methods:

4. Check upstream latency:
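If you only have raw latency samples (for example from client-side timing) rather than a metrics backend, the percentiles in step 2 can be computed directly. The sketch below uses the nearest-rank method; the sample data is synthetic, and the 1-second P99 threshold is the one used elsewhere in this guide.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
if p99 > 1000:
    print("P99 above 1 s: investigate upstream latency and cache hit rate")
```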


FAQ

General Questions

Q: What methods does Prism cache?

A: Prism caches the following methods:

  • eth_getBlockByHash - Block cache

  • eth_getBlockByNumber - Block cache

  • eth_getLogs - Log cache (with partial-range support)

  • eth_getTransactionByHash - Transaction cache

  • eth_getTransactionReceipt - Transaction cache

Not cached (forwarded to upstream):

  • eth_blockNumber - Always latest value

  • eth_chainId - Static value

  • eth_gasPrice - Changes frequently

  • eth_getBalance - Account-specific, changes with every transaction

  • eth_call - Depends on state, not safely cacheable

Q: How long does cached data stay valid?

A: Cache validity depends on block finality:

  • Finalized blocks (past finalized checkpoint): Cached forever

  • Safe blocks (beyond safety_depth from tip): Cached until reorg detected

  • Unsafe blocks (within safety_depth of tip): May be invalidated during reorgs

  • Default safety_depth: 12 blocks (~2.4 minutes at Ethereum's 12-second block time)

Configuration:

Q: Can I disable caching for specific methods?

A: Caching behavior is currently fixed per method and cannot be turned off for individual methods. However, you can:

  1. Disable all caching:

  2. Reduce cache sizes (effectively disables caching):

  3. Use separate Prism instances (cached vs non-cached)

Q: How many upstreams should I configure?

A: Recommended:

  • Minimum: 2 upstreams for redundancy

  • Optimal: 3-4 upstreams for best reliability and performance

  • Maximum: No hard limit, but 5+ may have diminishing returns

Why multiple upstreams:

  • Redundancy: If one fails, others handle requests

  • Load distribution: Spread requests across providers

  • Rate limit mitigation: Switch to different upstream when one is rate-limited

  • Latency optimization: Route requests to fastest upstream

Q: What happens during a blockchain reorg?

A: Prism handles reorgs automatically:

  1. Detection: WebSocket or health check detects chain tip changed

  2. Invalidation: Affected blocks removed from cache

  3. Refetch: Next request fetches fresh data from upstream

  4. Consistency: Clients always receive canonical chain data

Example:
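A toy model of the detection and invalidation steps. It is deliberately simplified and is not Prism's implementation: a real detector walks parent hashes back to the fork point, whereas this sketch treats a changed hash at an already-cached height as the fork. BlockCache and the hash values are hypothetical.

```python
class BlockCache:
    """Minimal block-number -> block-hash cache with reorg invalidation."""

    def __init__(self):
        self.by_number: dict[int, str] = {}

    def handle_new_tip(self, number: int, block_hash: str) -> list[int]:
        """Record a tip notification and return the block numbers invalidated by it."""
        invalidated: list[int] = []
        cached = self.by_number.get(number)
        if cached is not None and cached != block_hash:
            # Same height, different hash: everything from here up is non-canonical.
            invalidated = sorted(n for n in self.by_number if n >= number)
            for n in invalidated:
                del self.by_number[n]
        self.by_number[number] = block_hash
        return invalidated


cache = BlockCache()
for number, block_hash in [(100, "0xaaa"), (101, "0xbbb"), (102, "0xccc")]:
    cache.handle_new_tip(number, block_hash)

# The upstream now reports a different block at height 101: a 2-block reorg.
dropped = cache.handle_new_tip(101, "0xbbb2")
print("invalidated:", dropped)
```

After invalidation, the next request for block 102 misses the cache and is fetched fresh from an upstream, which is the refetch step described above.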

Configuration Questions

Q: What's the difference between safety_depth and max_reorg_depth?

A:

  • safety_depth: Blocks beyond this from tip are considered "safe" from reorgs

    • Default: 12 blocks

    • Used for: Cache retention decisions, finality classification

    • Example: With safety_depth=12 at tip 1000, blocks <= 988 are "safe"

  • max_reorg_depth: Maximum blocks to search backwards for reorg divergence

    • Default: 100 blocks

    • Used for: Limit reorg detection scope, prevent excessive cache invalidation

    • Example: Reorg affecting 150 blocks will only invalidate last 100
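The two examples above reduce to simple arithmetic, sketched here. The function names are hypothetical, and treating the invalidated range as inclusive of the tip is an assumption; the defaults are the ones listed above.

```python
def is_safe(block: int, tip: int, safety_depth: int = 12) -> bool:
    """A block is "safe" once it is at least safety_depth blocks behind the tip."""
    return block <= tip - safety_depth


def invalidation_range(tip: int, reorg_depth: int, max_reorg_depth: int = 100) -> tuple[int, int]:
    """Inclusive (first, last) block range invalidated by a reorg, capped at max_reorg_depth."""
    depth = min(reorg_depth, max_reorg_depth)
    return (tip - depth + 1, tip)


# safety_depth=12 at tip 1000: block 988 is safe, block 989 is not.
print(is_safe(988, 1000), is_safe(989, 1000))

# A 150-block reorg at tip 1000 only invalidates the last 100 blocks.
first, last = invalidation_range(1000, 150)
print(first, last, last - first + 1)
```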

Configuration:

Q: How do I optimize for lowest latency?

A: Latency optimization checklist:

  1. Enable caching:

  2. Use fast upstreams:

  3. Enable hedging:

  4. Geographic proximity: Deploy Prism in same region as upstreams

  5. Connection pooling:

  6. Monitor latency:

Q: How do I optimize for highest throughput?

A: Throughput optimization checklist:

  1. Add more upstreams:

  2. Increase concurrency:

  3. Use load balancing:

  4. Enable batch requests: Use JSON-RPC batch format

  5. Maximize caching:

Troubleshooting Questions

Q: Why is my cache hit rate so low?

A: Common causes:

  1. Random historical queries: Queries to random old blocks bypass hot window cache

    • Solution: Focus queries on recent blocks (last 200 blocks)

  2. Cache size too small: Frequent evictions reduce hit rate

    • Solution: Increase cache sizes in configuration

  3. Wide log query ranges: Large block ranges cause partial misses

    • Solution: Use smaller ranges or increase chunk_size

  4. Reorgs invalidating cache: Frequent reorgs clear cached blocks

    • Solution: Increase safety_depth, check for reorg storms

  5. Cache disabled: Check cache.enabled = true

Debug:

Q: How do I know if an upstream is slow or failing?

A: Check these metrics:

  1. Health status:

  2. Latency metrics:

  3. Error counts:

  4. Circuit breaker state:

Warning signs:

  • P99 latency > 1 second

  • Error rate > 5%

  • Circuit breaker open

  • Block lag > 10 blocks behind tip

Q: Why do I see "Circuit breaker is open" errors?

A: Circuit breaker opens when upstream has too many consecutive failures.

Immediate fix: Wait 60 seconds for automatic recovery to half-open state

Long-term solutions:

  1. Fix upstream issues:

    • Check API key validity

    • Verify network connectivity

    • Check provider status page

  2. Adjust circuit breaker settings:

  3. Add redundant upstreams: More upstreams reduce impact of one failing

  4. Monitor recovery:

Q: How can I reduce API costs?

A: Cost reduction strategies:

  1. Maximize caching:

  2. Use tiered upstreams:

  3. Implement client-side batching: Batch requests to reduce HTTP overhead

  4. Rate limit clients: Prevent abuse

  5. Monitor request distribution:


Still having issues? Join our community:

  • GitHub Issues: https://github.com/your-org/prism/issues

  • Discord: https://discord.gg/prism

  • Documentation: https://docs.prism.sh

Next: Explore Monitoring & Observability for proactive issue detection.
