RPC Timeout Issue: DASer vs Share API Session Conflict
Problem Description
Light nodes experience intermittent RPC timeouts when Share API calls (such as GetSamples and SharesAvailable) conflict with background DASer sampling operations. Both operations use the same session key (the header height) in utils.Sessions, so an RPC request can time out while waiting for a DASer session on the same height to complete.
Root Cause
The issue occurs in share/availability/light/availability.go where both DASer background sampling and on-demand RPC calls use the same session management:
// share/availability/light/availability.go:92-96
// Prevent multiple sampling and pruning sessions for the same header height
release, err := la.activeHeights.StartSession(ctx, header.Height())
if err != nil {
	return err // ← RPC timeout occurs here when DASer holds the session
}
defer release()
Timing Conflict
- DASer timeout: 240 seconds (15s × 16 concurrency limit)
- RPC timeout: 30 seconds (typical client timeout)
- Session key: Both use header.Height() as the session identifier
Code Evidence
Session Management Conflict
// utils/sessions.go - Both operations use same key
func (s *Sessions) StartSession(ctx context.Context, key interface{}) (func(), error) {
	// DASer and RPC both call this with header.Height() as key
	// When DASer holds session for 240s, RPC times out after 30s
}
DASer Background Sampling
// das/daser.go:201-202
func (d *DASer) sample(ctx context.Context, h *header.ExtendedHeader) error {
	err := d.da.SharesAvailable(ctx, h) // Calls same session management
	// This can hold session for up to 240 seconds
}
RPC Share API Calls
// API calls like GetSamples, SharesAvailable
func (la *ShareAvailability) SharesAvailable(ctx context.Context, header *header.ExtendedHeader) error {
	// Same session management - conflicts with DASer
	release, err := la.activeHeights.StartSession(ctx, header.Height())
	if err != nil {
		return err // "context deadline exceeded" after 30s
	}
}
Impact
- User Experience: Intermittent Share API timeouts appear as "network issues"
- Error Attribution: Generic timeout errors mask the root cause
- Retry Logic: Automatic retries eventually succeed, hiding the issue
- Production Visibility: Low conflict probability masks the problem in production
Why This Issue Is Rarely Seen in Real Deployments
In production environments, this issue is effectively masked by several factors. The probability of DASer and RPC operations targeting the exact same header height at the same time is very low, because workloads are spread across many nodes with different timing patterns, which also reduces per-node conflict probability. Production clients typically use longer timeouts (60s+, versus 30s in tests), and applications implement retry mechanisms and fallback strategies that handle occasional timeouts gracefully. While this masks the issue in production, fixing the underlying session management is still important for system reliability and future scalability.
Proposed Solution
Use operation-specific session keys instead of just header height:
// Proposed fix: Operation-specific session keys
type SessionKey struct {
	Height    uint64
	Operation string // "das", "rpc", "prune"
}

// DASer sessions
release, err := la.activeHeights.StartSession(ctx, SessionKey{
	Height:    header.Height(),
	Operation: "das",
})

// RPC sessions
release, err := la.activeHeights.StartSession(ctx, SessionKey{
	Height:    header.Height(),
	Operation: "rpc",
})
Test Evidence
Created test confirming the session conflict:
func TestSessionConflict(t *testing.T) {
	sessions := utils.NewSessions()
	height := uint64(1000)
	// Simulate DASer holding the session for 240s. The session is acquired
	// before the RPC attempt so the two cannot race for the key.
	release, err := sessions.StartSession(context.Background(), height)
	if err != nil {
		t.Fatal(err)
	}
	go func() {
		time.Sleep(240 * time.Second)
		release()
	}()
	// RPC tries to acquire same session with 30s timeout
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	_, err = sessions.StartSession(ctx, height)
	// Result: "context deadline exceeded" - confirms the conflict
	if err == nil {
		t.Fatal("expected the RPC session to time out while the DASer session is held")
	}
}
Environment
- Celestia Node Version: v0.26.0-arabica-27-g3189a732
- Test Framework: Tastora (Docker-based)
- Network: Test environment with bridge + light nodes
- Reproduction: Intermittent in production, consistent in test environment
Priority
Medium-High: Affects user experience but is masked by retry logic in production. The underlying session management conflict should be resolved to prevent future issues and improve system reliability.
Related Files
- share/availability/light/availability.go - Main session management
- das/daser.go - Background DASer sampling
- utils/sessions.go - Session management implementation
- nodebuilder/tests/tastora/api_test.go - Test reproduction