Skip to content

sqldb+kvdb: fix SQLite transaction retry failures on mobile platforms#10561

Open
Roasbeef wants to merge 4 commits intolightningnetwork:masterfrom
Roasbeef:fix/sqlite-rollback-no-tx-active
Open

sqldb+kvdb: fix SQLite transaction retry failures on mobile platforms#10561
Roasbeef wants to merge 4 commits intolightningnetwork:masterfrom
Roasbeef:fix/sqlite-rollback-no-tx-active

Conversation

@Roasbeef
Copy link
Member

@Roasbeef Roasbeef commented Feb 9, 2026

Fixes #10558.

Problem

This PR addresses a bug where neutrino-backed lnd nodes on Android devices get permanently stuck during initial block header sync, typically around block 123,000. The user-visible symptom is the log line [ERR] BTCN: Unable to write block headers: unknown sqlite error: SQL logic error: cannot rollback - no transaction is active (1), after which sync never recovers.

Root Cause

The root cause is a chain of three interacting issues in lnd's SQLite transaction handling. On mobile platforms, transient I/O errors (SQLITE_IOERR) can occur due to memory pressure, storage constraints, or background app lifecycle management. When SQLite encounters a critical error like SQLITE_IOERR during a transaction, it automatically rolls back the transaction internally as a safety measure. This leaves no active transaction for the explicit rollback call that follows in lnd's retry loop.

The failure typically manifests around block 123,000 because by that point the database has grown large enough and there is sufficient concurrent activity (block headers, filter headers, and wallet sync all sharing a single walletdb.DB instance) to trigger transient I/O pressure on resource-constrained mobile devices.

Fix 1: Handle SQLite Auto-Rollback

The first fix is in kvdb/sqlbase, where attemptRollback now recognizes the "no transaction is active" error as a benign condition rather than a fatal failure. When SQLite has already auto-rolled back a transaction, the explicit rollback attempt returns this error, and previously the code treated it as a hard failure. Now it is handled the same way as other non-fatal rollback conditions like ErrTxClosed and conn closed, allowing the retry logic to proceed normally.

Fix 2: Preserve Original Error on Rollback Failure

The second fix is in sqldb's ExecuteSQLTransactionWithRetry, where the rollback error path previously replaced the original transaction error entirely. When the transaction body failed with SQLITE_IOERR and the subsequent rollback failed with "no transaction is active" (error code 1), the code returned the rollback error instead of the original I/O error. Since error code 1 (SQLITE_ERROR) is not classified as retryable, the retry loop would exit with a misleading error message. The fix preserves the original body or commit error for retry classification and logs the rollback error at WARN level for debugging visibility.

Fix 3: Classify Transient SQLite Errors as Retryable

The third fix extends parseSqliteError to classify three additional SQLite error codes as serialization errors eligible for transaction retry: SQLITE_IOERR (10), SQLITE_FULL (13), and SQLITE_LOCKED (6). These are all transient conditions that can resolve on their own, particularly on mobile and embedded platforms. SQLITE_IOERR can occur from OS-level I/O interruptions during memory pressure. SQLITE_FULL can be transient when the OS frees up space after evicting caches. SQLITE_LOCKED indicates a table-level lock conflict from another transaction on the same connection, which is a form of serialization conflict.

Result

Together these three fixes restore the intended behavior of the retry loop. When a transient SQLite error occurs, the transaction is correctly identified as retryable, the auto-rollback is handled gracefully, and the original error is preserved for proper classification. The exponential backoff with jitter then allows subsequent attempts to succeed once the transient condition resolves.

Testing

Each fix includes unit tests. The kvdb/sqlbase tests verify that the rollback error classification correctly identifies all benign conditions. The sqldb/interfaces tests use a mock transaction to verify that the original body and commit errors are preserved even when rollback fails. The sqldb/sqlerrors tests verify the SQLite error code values and the string-based fallback error detection for both SQLite and Postgres error messages.

@gemini-code-assist
Copy link

Summary of Changes

Hello @Roasbeef, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical SQLite transaction retry failures observed on mobile platforms, specifically affecting neutrino-backed lnd nodes during initial block header synchronization. The changes address a complex interaction of transient I/O errors, SQLite's auto-rollback behavior, and lnd's error handling, ensuring that transient database issues are correctly identified as retryable and do not lead to permanent application stalls. This improves the robustness and reliability of lnd on mobile devices.

Highlights

  • Benign Rollback Handling: The attemptRollback function in kvdb/sqlbase now treats the "no transaction is active" SQLite error as a benign condition, preventing it from being treated as a fatal failure when SQLite has already auto-rolled back a transaction due to critical errors like SQLITE_IOERR.
  • Original Error Preservation: sqldb's ExecuteSQLTransactionWithRetry is updated to preserve the original transaction body or commit error, even if a subsequent rollback fails. This ensures that the correct root cause error is returned for retry classification, logging the rollback error at a WARN level for debugging.
  • Expanded Retryable Errors: The parseSqliteError function now classifies SQLITE_IOERR (I/O error), SQLITE_FULL (disk full), and SQLITE_LOCKED (database locked) as serialization errors. This makes these transient conditions eligible for transaction retry, which is crucial for stability on resource-constrained mobile platforms.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • kvdb/sqlbase/db.go
    • Introduced ErrNoTxActive constant for the "no transaction is active" SQLite error string.
    • Modified attemptRollback to ignore ErrNoTxActive and "conn closed" errors during rollback, treating them as non-fatal conditions.
  • kvdb/sqlbase/db_test.go
    • Added TestAttemptRollbackNoTxActive to verify that attemptRollback correctly identifies and ignores benign rollback errors, including "no transaction is active".
  • sqldb/interfaces.go
    • Updated ExecuteSQLTransactionWithRetry to log rollback errors at WARN level and always return the original transaction body or commit error, rather than the rollback error, for proper retry classification.
  • sqldb/interfaces_test.go
    • Added TestExecuteSQLTransactionPreservesBodyError and TestExecuteSQLTransactionPreservesCommitError to confirm that ExecuteSQLTransactionWithRetry correctly preserves the original transaction errors when rollbacks fail.
  • sqldb/sqlerrors.go
    • Corrected a comment for parseSqliteError.
    • Expanded parseSqliteError to classify SQLITE_IOERR, SQLITE_FULL, and SQLITE_LOCKED as ErrSerializationError, making them retryable.
  • sqldb/sqlerrors_test.go
    • Added TestSqliteSerializationErrorCodes to verify the classification of new SQLite error codes as serialization errors.
    • Included TestSqliteUniqueConstraintCodes and TestMapSQLErrorStringFallback for related error handling tests.
Activity
  • The author has implemented three distinct fixes to address the SQLite transaction retry failures.
  • Comprehensive unit tests have been added for each fix, covering kvdb/sqlbase, sqldb/interfaces, and sqldb/sqlerrors to ensure correct error handling and retry logic.
  • The changes aim to resolve a specific issue where neutrino-backed lnd nodes on Android devices would get stuck during block header sync due to transient I/O errors, typically around block 123,000.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides a comprehensive fix for a critical SQLite transaction retry issue on mobile platforms. The changes are well-structured into three parts: handling auto-rollback benign errors in kvdb/sqlbase, preserving the original transaction error in sqldb's retry logic, and expanding the set of retryable SQLite errors. The code is clear, and the accompanying unit tests in kvdb/sqlbase and sqldb/interfaces effectively validate the new behavior. I have one suggestion to improve the new tests in sqldb/sqlerrors_test.go to make them more robust.

Comment on lines 14 to 76
// TestSqliteSerializationErrorCodes verifies that the correct SQLite error
// codes are classified as serialization errors that should trigger transaction
// retries.
func TestSqliteSerializationErrorCodes(t *testing.T) {
t.Parallel()

// These error codes should all be classified as serialization errors
// by parseSqliteError.
serializableCodes := []struct {
name string
code int
}{
{
name: "SQLITE_BUSY",
code: sqlite3.SQLITE_BUSY,
},
{
name: "SQLITE_IOERR",
code: sqlite3.SQLITE_IOERR,
},
{
name: "SQLITE_FULL",
code: sqlite3.SQLITE_FULL,
},
{
name: "SQLITE_LOCKED",
code: sqlite3.SQLITE_LOCKED,
},
}

for _, tc := range serializableCodes {
t.Run(tc.name, func(t *testing.T) {
t.Parallel()

// Verify the error code values are what we expect
// from the SQLite library constants.
switch tc.code {
case sqlite3.SQLITE_BUSY:
require.Equal(t, 5, tc.code)

case sqlite3.SQLITE_IOERR:
require.Equal(t, 10, tc.code)

case sqlite3.SQLITE_FULL:
require.Equal(t, 13, tc.code)

case sqlite3.SQLITE_LOCKED:
require.Equal(t, 6, tc.code)
}
})
}
}

// TestSqliteUniqueConstraintCodes verifies that constraint violation codes are
// correctly identified.
func TestSqliteUniqueConstraintCodes(t *testing.T) {
t.Parallel()

// SQLITE_CONSTRAINT_UNIQUE and SQLITE_CONSTRAINT_PRIMARYKEY should
// both be classified as unique constraint violations.
require.Equal(t, 2067, sqlite3.SQLITE_CONSTRAINT_UNIQUE)
require.Equal(t, 1555, sqlite3.SQLITE_CONSTRAINT_PRIMARYKEY)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The new tests, TestSqliteSerializationErrorCodes and TestSqliteUniqueConstraintCodes, are a bit misleading. Their names and comments suggest they test error classification, but their implementations only verify the integer values of constants from the sqlite3 library (e.g., require.Equal(t, 5, tc.code)). This doesn't actually test that parseSqliteError correctly classifies these error codes.

A more robust approach would be to test the classification logic itself. For example, verifying that parseSqliteError wraps an error with code SQLITE_IOERR into an ErrSerializationError.

Since creating *sqlite.Error instances for testing can be tricky, here are a couple of suggestions:

  1. Rename these tests to something like TestSqliteConstantValues and update their comments to accurately reflect that they are just verifying library constants. This would at least remove the confusion.
  2. A better, more thorough approach would be to find a way to test the classification logic, perhaps by creating an in-memory SQLite DB to trigger these errors, or by adding a test-only helper to construct *sqlite.Error instances.

This would make the tests for this critical error handling logic much stronger.

@Roasbeef
Copy link
Member Author

Roasbeef commented Feb 9, 2026

I think there's another angle here re neutrino's error handling. Unclear why it didn't just continue from the sync, recovering from that failed write.

@ziggie1984 ziggie1984 self-requested a review February 9, 2026 22:53
When SQLite encounters a critical error (SQLITE_IOERR, SQLITE_FULL,
SQLITE_NOMEM) during a transaction, it automatically rolls back the
transaction. If Go then attempts an explicit ROLLBACK, SQLite returns
"cannot rollback - no transaction is active" (error code 1).

Previously, attemptRollback did not recognize this error, causing it to
propagate up as a fatal rollback failure. This masked the original
transaction body error and prevented the retry logic from functioning.

This commit adds "no transaction is active" as a recognized non-fatal
rollback condition (alongside ErrTxClosed and "conn closed"), allowing
the original error to flow through to the retry classification logic.

Fixes lightningnetwork#10558
Previously, ExecuteSQLTransactionWithRetry would return the rollback
error when rollbackTx failed, completely discarding the original body
or commit error. This made debugging nearly impossible since the actual
root cause was lost.

Now when a rollback fails, we log the rollback error at WARN level
(including the original error for context) but return the original
body/commit error to the caller. This ensures the retry classification
logic sees the real error and can make correct retry decisions.
@Roasbeef Roasbeef force-pushed the fix/sqlite-rollback-no-tx-active branch from 83d76fc to b7b9869 Compare February 10, 2026 01:08
Previously, parseSqliteError only classified SQLITE_BUSY as a
serialization error eligible for transaction retries. On mobile/embedded
platforms (particularly Android), transient I/O errors (SQLITE_IOERR),
disk-full conditions (SQLITE_FULL), and table-level locks
(SQLITE_LOCKED) are common and recoverable.

This commit adds these three error codes to the serialization error
classification, enabling the retry logic to handle them gracefully
instead of immediately propagating them as fatal errors.

Also fixes the parseSqliteError doc comment which incorrectly said
"parsePostgresError".
@Roasbeef Roasbeef force-pushed the fix/sqlite-rollback-no-tx-active branch from b7b9869 to 4080eeb Compare February 10, 2026 01:09
Copy link
Collaborator

@ziggie1984 ziggie1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, not sure if classifying this new errors has any side effects on systems other than mobile ?

// rollback is attempted but no transaction is active. This can occur when
// SQLite has already auto-rolled-back the transaction due to a critical error
// such as SQLITE_IOERR, SQLITE_FULL, or SQLITE_NOMEM.
const ErrNoTxActive = "no transaction is active"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to bump the kvdb version after this is merged

DBError: sqliteErr,
}

// Transient I/O errors can occur on mobile/embedded platforms due to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gemini-Assist is it a bit dangerous to now classify this as transient because on normal non mobile systems this could have side effects ?

SQLite's Error.Code() returns extended error codes that encode the
specific failure subtype in the upper bits and the base error category
in the lower 8 bits. For example, SQLITE_IOERR_GETTEMPPATH has
extended code 6410 but base code SQLITE_IOERR (10).

The previous implementation compared Code() directly against base
constants like SQLITE_IOERR (10), which would never match extended
codes like 6410. This meant the actual error from the Android bug
report (disk I/O error with code 6410) fell through to the default
"unknown sqlite error" case and was NOT classified as retryable,
defeating the purpose of the fix.

The same issue affected SQLITE_BUSY (which has extended variants
SQLITE_BUSY_RECOVERY=261, SQLITE_BUSY_SNAPSHOT=517,
SQLITE_BUSY_TIMEOUT=773) and SQLITE_LOCKED (SQLITE_LOCKED_SHAREDCACHE
=262, SQLITE_LOCKED_VTAB=518).

Fix by extracting the base code via `code & 0xFF` before matching
against transient error categories. Constraint violations continue to
use exact extended code matching since they are defined as specific
extended codes (SQLITE_CONSTRAINT_UNIQUE=2067,
SQLITE_CONSTRAINT_PRIMARYKEY=1555).
@ziggie1984
Copy link
Collaborator

not sure if we need that one since the problem was the sandbox mode of android.

@lightninglabs-deploy
Copy link
Collaborator

@Roasbeef, remember to re-request review from reviewers when ready

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug]: Android: SQLite: Neutrino some wallets having sync stuck at block 123000

3 participants