bug: rpc timeout issue due to daser and share api session conflict #4562

@gupadhyaya

RPC Timeout Issue: DASer vs Share API Session Conflict

Problem Description

Light nodes experience intermittent RPC timeouts when Share API calls (such as GetSamples and SharesAvailable) contend with background DASer sampling. Both use the same session key (the header height) in utils.Sessions, so an RPC request can time out while it waits for a DASer session at that height to complete.

Root Cause

The issue occurs in share/availability/light/availability.go where both DASer background sampling and on-demand RPC calls use the same session management:

// share/availability/light/availability.go:92-96
// Prevent multiple sampling and pruning sessions for the same header height
release, err := la.activeHeights.StartSession(ctx, header.Height())
if err != nil {
    return err  // ← RPC timeout occurs here when DASer holds the session
}
defer release()

Timing Conflict

  • DASer timeout: 240 seconds (15s × 16 concurrency limit)
  • RPC timeout: 30 seconds (typical client timeout)
  • Session key: Both use header.Height() as the session identifier

Code Evidence

Session Management Conflict

// utils/sessions.go - Both operations use same key
func (s *Sessions) StartSession(ctx context.Context, key interface{}) (func(), error) {
    // DASer and RPC both call this with header.Height() as key
    // When DASer holds session for 240s, RPC times out after 30s
}
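
To make the blocking behavior concrete, here is a minimal, self-contained sketch. keyedSessions is a hypothetical stand-in for utils.Sessions, assuming only what is described above: one active session per key, with later callers waiting until release or context expiry. The real implementation may differ. Durations are scaled down from 240s/30s to milliseconds.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// keyedSessions is a hypothetical stand-in for utils.Sessions: at most one
// active session per key; other callers wait for release or ctx expiry.
type keyedSessions struct {
	mu     sync.Mutex
	active map[any]chan struct{}
}

func newKeyedSessions() *keyedSessions {
	return &keyedSessions{active: make(map[any]chan struct{})}
}

// StartSession claims key or blocks until it is free. The returned release
// function must be called exactly once.
func (s *keyedSessions) StartSession(ctx context.Context, key any) (func(), error) {
	for {
		s.mu.Lock()
		done, busy := s.active[key]
		if !busy {
			done = make(chan struct{})
			s.active[key] = done
			s.mu.Unlock()
			return func() {
				s.mu.Lock()
				delete(s.active, key)
				s.mu.Unlock()
				close(done) // wake up waiters so they can retry
			}, nil
		}
		s.mu.Unlock()

		select {
		case <-done:
			// Previous holder released the key; loop and try again.
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}

// rpcBlockedByDASer reports whether a short-deadline caller times out while
// another caller holds the session for the same height.
func rpcBlockedByDASer() bool {
	s := newKeyedSessions()
	height := uint64(1000)

	// "DASer" acquires the session first and holds it.
	release, err := s.StartSession(context.Background(), height)
	if err != nil {
		return false
	}
	defer release()

	// "RPC" waits on the same key with a much shorter deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	_, err = s.StartSession(ctx, height)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("RPC timed out behind DASer session:", rpcBlockedByDASer())
}
```

The second caller never gets a chance to run: its only exits are the holder releasing the key or its own deadline expiring, whichever comes first.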

DASer Background Sampling

// das/daser.go:201-202
func (d *DASer) sample(ctx context.Context, h *header.ExtendedHeader) error {
    err := d.da.SharesAvailable(ctx, h)  // Calls same session management
    // This can hold session for up to 240 seconds
}

RPC Share API Calls

// API calls like GetSamples, SharesAvailable
func (la *ShareAvailability) SharesAvailable(ctx context.Context, header *header.ExtendedHeader) error {
    // Same session management - conflicts with DASer
    release, err := la.activeHeights.StartSession(ctx, header.Height())
    if err != nil {
        return err  // "context deadline exceeded" after 30s
    }
    defer release()
    // ... sampling logic elided
    return nil
}

Impact

  • User Experience: Intermittent Share API timeouts appear as "network issues"
  • Error Attribution: Generic timeout errors mask the root cause
  • Retry Logic: Automatic retries eventually succeed, hiding the issue
  • Production Visibility: Low conflict probability masks the problem in production
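
On the error-attribution point, one possible mitigation (not part of this issue's proposal, and independent of the session-key fix) is to wrap the error returned while waiting for a session so that the message names the cause. The sketch below is illustrative only: startFn, startAttributed, and alwaysBusy are hypothetical names, and the function signature is assumed from how StartSession is called in this issue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// startFn is a hypothetical signature matching how StartSession is called
// in this issue.
type startFn func(ctx context.Context, key any) (func(), error)

// startAttributed wraps a session-start failure with the height it waited
// on, so the caller sees more than a bare "context deadline exceeded".
func startAttributed(ctx context.Context, start startFn, height uint64) (func(), error) {
	release, err := start(ctx, height)
	if err != nil {
		return nil, fmt.Errorf("waiting for active session at height %d: %w", height, err)
	}
	return release, nil
}

// alwaysBusy simulates a session held by another caller: it blocks until
// the context ends and then returns the context's error.
func alwaysBusy(ctx context.Context, _ any) (func(), error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// attributedError runs one blocked attempt and returns the wrapped error.
func attributedError() error {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	_, err := startAttributed(ctx, alwaysBusy, 1000)
	return err
}

func main() {
	err := attributedError()
	fmt.Println(err)
	// Wrapping with %w keeps the original error detectable by callers.
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

Because the wrap uses %w, existing callers that check for context.DeadlineExceeded with errors.Is keep working, while logs show which height the request was waiting on.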

Why This Issue Is Rarely Seen in Real Deployments

In production, several factors mask this issue. The probability that DASer and an RPC call target the same header height at the same moment is low, because workloads are spread across many nodes with different timing patterns. Production clients typically use longer timeouts (60s or more, versus 30s in tests), and applications usually retry or fall back when a call occasionally times out. Masking is not a fix, however: the underlying session conflict remains and should still be resolved for system reliability and future scalability.

Proposed Solution

Use operation-specific session keys instead of just header height:

// Proposed fix: Operation-specific session keys
type SessionKey struct {
    Height    uint64
    Operation string  // "das", "rpc", "prune"
}

// DASer sessions
release, err := la.activeHeights.StartSession(ctx, SessionKey{
    Height: header.Height(),
    Operation: "das",
})

// RPC sessions  
release, err := la.activeHeights.StartSession(ctx, SessionKey{
    Height: header.Height(),
    Operation: "rpc",
})
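
As a quick check that a struct key behaves as intended, the sketch below uses the SessionKey struct proposed above as a map key. sessionSet is a hypothetical, non-blocking stand-in for utils.Sessions, used only to show key equality. It assumes nothing about the real implementation beyond keys being compared by value.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionKey mirrors the struct proposed above. Both fields are comparable,
// so values can be used directly as map keys.
type SessionKey struct {
	Height    uint64
	Operation string // "das", "rpc", "prune"
}

// sessionSet is a hypothetical, non-blocking stand-in for utils.Sessions.
type sessionSet struct {
	mu     sync.Mutex
	active map[any]struct{}
}

func newSessionSet() *sessionSet {
	return &sessionSet{active: make(map[any]struct{})}
}

// tryStart claims key if it is free and reports whether it succeeded.
func (s *sessionSet) tryStart(key any) (release func(), ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, busy := s.active[key]; busy {
		return nil, false
	}
	s.active[key] = struct{}{}
	return func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		delete(s.active, key)
	}, true
}

// demo returns the outcome of three claims at the same height.
func demo() (das, rpc, dasAgain bool) {
	s := newSessionSet()
	_, das = s.tryStart(SessionKey{Height: 1000, Operation: "das"})
	_, rpc = s.tryStart(SessionKey{Height: 1000, Operation: "rpc"})
	_, dasAgain = s.tryStart(SessionKey{Height: 1000, Operation: "das"})
	return
}

func main() {
	das, rpc, dasAgain := demo()
	fmt.Println(das, rpc, dasAgain) // true true false
}
```

One trade-off to weigh: with operation-specific keys, sessions for different operations at the same height no longer exclude each other. The existing comment in availability.go says the session is meant to prevent concurrent sampling and pruning for a height, so a "prune" key that is distinct from "das" may need separate coordination.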

Test Evidence

Created test confirming the session conflict:

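func TestSessionConflict(t *testing.T) {
    sessions := utils.NewSessions()
    height := uint64(1000)

    // Simulate the DASer: acquire the session first and hold it for the rest
    // of the test (stands in for the 240s worst case). Acquiring it here,
    // rather than in a goroutine, guarantees it is held before the RPC attempt.
    release, err := sessions.StartSession(context.Background(), height)
    if err != nil {
        t.Fatalf("first StartSession failed: %v", err)
    }
    defer release()

    // RPC tries to acquire the same session with a short timeout
    // (30s in practice, scaled down here to keep the test fast).
    ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer cancel()

    _, err = sessions.StartSession(ctx, height)
    // Result: "context deadline exceeded" - confirms the conflict
    if err == nil {
        t.Fatal("expected StartSession to time out while the session is held")
    }
}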

Environment

  • Celestia Node Version: v0.26.0-arabica-27-g3189a732
  • Test Framework: Tastora (Docker-based)
  • Network: Test environment with bridge + light nodes
  • Reproduction: Intermittent in production, consistent in test environment

Priority

Medium-High: Affects user experience but is masked by retry logic in production. The underlying session management conflict should be resolved to prevent future issues and improve system reliability.

Related Files

  • share/availability/light/availability.go - Main session management
  • das/daser.go - Background DASer sampling
  • utils/sessions.go - Session management implementation
  • nodebuilder/tests/tastora/api_test.go - Test reproduction
