sqldb+kvdb: fix SQLite transaction retry failures on mobile platforms#10561
sqldb+kvdb: fix SQLite transaction retry failures on mobile platforms#10561Roasbeef wants to merge 4 commits intolightningnetwork:masterfrom
Conversation
Summary of ChangesHello @Roasbeef, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves critical SQLite transaction retry failures observed on mobile platforms, specifically affecting neutrino-backed lnd nodes during initial block header synchronization. The changes address a complex interaction of transient I/O errors, SQLite's auto-rollback behavior, and lnd's error handling, ensuring that transient database issues are correctly identified as retryable and do not lead to permanent application stalls. This improves the robustness and reliability of lnd on mobile devices. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request provides a comprehensive fix for a critical SQLite transaction retry issue on mobile platforms. The changes are well-structured into three parts: handling auto-rollback benign errors in kvdb/sqlbase, preserving the original transaction error in sqldb's retry logic, and expanding the set of retryable SQLite errors. The code is clear, and the accompanying unit tests in kvdb/sqlbase and sqldb/interfaces effectively validate the new behavior. I have one suggestion to improve the new tests in sqldb/sqlerrors_test.go to make them more robust.
sqldb/sqlerrors_test.go
Outdated
| // TestSqliteSerializationErrorCodes verifies that the correct SQLite error | ||
| // codes are classified as serialization errors that should trigger transaction | ||
| // retries. | ||
| func TestSqliteSerializationErrorCodes(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| // These error codes should all be classified as serialization errors | ||
| // by parseSqliteError. | ||
| serializableCodes := []struct { | ||
| name string | ||
| code int | ||
| }{ | ||
| { | ||
| name: "SQLITE_BUSY", | ||
| code: sqlite3.SQLITE_BUSY, | ||
| }, | ||
| { | ||
| name: "SQLITE_IOERR", | ||
| code: sqlite3.SQLITE_IOERR, | ||
| }, | ||
| { | ||
| name: "SQLITE_FULL", | ||
| code: sqlite3.SQLITE_FULL, | ||
| }, | ||
| { | ||
| name: "SQLITE_LOCKED", | ||
| code: sqlite3.SQLITE_LOCKED, | ||
| }, | ||
| } | ||
|
|
||
| for _, tc := range serializableCodes { | ||
| t.Run(tc.name, func(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| // Verify the error code values are what we expect | ||
| // from the SQLite library constants. | ||
| switch tc.code { | ||
| case sqlite3.SQLITE_BUSY: | ||
| require.Equal(t, 5, tc.code) | ||
|
|
||
| case sqlite3.SQLITE_IOERR: | ||
| require.Equal(t, 10, tc.code) | ||
|
|
||
| case sqlite3.SQLITE_FULL: | ||
| require.Equal(t, 13, tc.code) | ||
|
|
||
| case sqlite3.SQLITE_LOCKED: | ||
| require.Equal(t, 6, tc.code) | ||
| } | ||
| }) | ||
| } | ||
| } | ||
|
|
||
| // TestSqliteUniqueConstraintCodes verifies that constraint violation codes are | ||
| // correctly identified. | ||
| func TestSqliteUniqueConstraintCodes(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| // SQLITE_CONSTRAINT_UNIQUE and SQLITE_CONSTRAINT_PRIMARYKEY should | ||
| // both be classified as unique constraint violations. | ||
| require.Equal(t, 2067, sqlite3.SQLITE_CONSTRAINT_UNIQUE) | ||
| require.Equal(t, 1555, sqlite3.SQLITE_CONSTRAINT_PRIMARYKEY) | ||
| } |
There was a problem hiding this comment.
The new tests, TestSqliteSerializationErrorCodes and TestSqliteUniqueConstraintCodes, are a bit misleading. Their names and comments suggest they test error classification, but their implementations only verify the integer values of constants from the sqlite3 library (e.g., require.Equal(t, 5, tc.code)). This doesn't actually test that parseSqliteError correctly classifies these error codes.
A more robust approach would be to test the classification logic itself. For example, verifying that parseSqliteError wraps an error with code SQLITE_IOERR into an ErrSerializationError.
Since creating *sqlite.Error instances for testing can be tricky, here are a couple of suggestions:
- Rename these tests to something like
TestSqliteConstantValuesand update their comments to accurately reflect that they are just verifying library constants. This would at least remove the confusion. - A better, more thorough approach would be to find a way to test the classification logic, perhaps by creating an in-memory SQLite DB to trigger these errors, or by adding a test-only helper to construct
*sqlite.Errorinstances.
This would make the tests for this critical error handling logic much stronger.
|
I think there's another angle here re neutrino's error handling. Unclear why it didn't just continue from the sync, recovering from that failed write. |
When SQLite encounters a critical error (SQLITE_IOERR, SQLITE_FULL, SQLITE_NOMEM) during a transaction, it automatically rolls back the transaction. If Go then attempts an explicit ROLLBACK, SQLite returns "cannot rollback - no transaction is active" (error code 1). Previously, attemptRollback did not recognize this error, causing it to propagate up as a fatal rollback failure. This masked the original transaction body error and prevented the retry logic from functioning. This commit adds "no transaction is active" as a recognized non-fatal rollback condition (alongside ErrTxClosed and "conn closed"), allowing the original error to flow through to the retry classification logic. Fixes lightningnetwork#10558
Previously, ExecuteSQLTransactionWithRetry would return the rollback error when rollbackTx failed, completely discarding the original body or commit error. This made debugging nearly impossible since the actual root cause was lost. Now when a rollback fails, we log the rollback error at WARN level (including the original error for context) but return the original body/commit error to the caller. This ensures the retry classification logic sees the real error and can make correct retry decisions.
83d76fc to
b7b9869
Compare
Previously, parseSqliteError only classified SQLITE_BUSY as a serialization error eligible for transaction retries. On mobile/embedded platforms (particularly Android), transient I/O errors (SQLITE_IOERR), disk-full conditions (SQLITE_FULL), and table-level locks (SQLITE_LOCKED) are common and recoverable. This commit adds these three error codes to the serialization error classification, enabling the retry logic to handle them gracefully instead of immediately propagating them as fatal errors. Also fixes the parseSqliteError doc comment which incorrectly said "parsePostgresError".
b7b9869 to
4080eeb
Compare
ziggie1984
left a comment
There was a problem hiding this comment.
Looks good, not sure if classifying this new errors has any side effects on systems other than mobile ?
| // rollback is attempted but no transaction is active. This can occur when | ||
| // SQLite has already auto-rolled-back the transaction due to a critical error | ||
| // such as SQLITE_IOERR, SQLITE_FULL, or SQLITE_NOMEM. | ||
| const ErrNoTxActive = "no transaction is active" |
There was a problem hiding this comment.
we need to bump the kvdb version after this is merged
sqldb/sqlerrors.go
Outdated
| DBError: sqliteErr, | ||
| } | ||
|
|
||
| // Transient I/O errors can occur on mobile/embedded platforms due to |
There was a problem hiding this comment.
@Gemini-Assist is it a bit dangerous to now classify this as transient because on normal non mobile systems this could have side effects ?
SQLite's Error.Code() returns extended error codes that encode the specific failure subtype in the upper bits and the base error category in the lower 8 bits. For example, SQLITE_IOERR_GETTEMPPATH has extended code 6410 but base code SQLITE_IOERR (10). The previous implementation compared Code() directly against base constants like SQLITE_IOERR (10), which would never match extended codes like 6410. This meant the actual error from the Android bug report (disk I/O error with code 6410) fell through to the default "unknown sqlite error" case and was NOT classified as retryable, defeating the purpose of the fix. The same issue affected SQLITE_BUSY (which has extended variants SQLITE_BUSY_RECOVERY=261, SQLITE_BUSY_SNAPSHOT=517, SQLITE_BUSY_TIMEOUT=773) and SQLITE_LOCKED (SQLITE_LOCKED_SHAREDCACHE =262, SQLITE_LOCKED_VTAB=518). Fix by extracting the base code via `code & 0xFF` before matching against transient error categories. Constraint violations continue to use exact extended code matching since they are defined as specific extended codes (SQLITE_CONSTRAINT_UNIQUE=2067, SQLITE_CONSTRAINT_PRIMARYKEY=1555).
|
not sure if we need that one since the problem was the sandbox mode of android. |
|
@Roasbeef, remember to re-request review from reviewers when ready |
Fixes #10558.
Problem
This PR addresses a bug where neutrino-backed lnd nodes on Android devices get permanently stuck during initial block header sync, typically around block 123,000. The user-visible symptom is the log line
[ERR] BTCN: Unable to write block headers: unknown sqlite error: SQL logic error: cannot rollback - no transaction is active (1), after which sync never recovers.Root Cause
The root cause is a chain of three interacting issues in lnd's SQLite transaction handling. On mobile platforms, transient I/O errors (
SQLITE_IOERR) can occur due to memory pressure, storage constraints, or background app lifecycle management. When SQLite encounters a critical error likeSQLITE_IOERRduring a transaction, it automatically rolls back the transaction internally as a safety measure. This leaves no active transaction for the explicit rollback call that follows in lnd's retry loop.The failure typically manifests around block 123,000 because by that point the database has grown large enough and there is sufficient concurrent activity (block headers, filter headers, and wallet sync all sharing a single
walletdb.DBinstance) to trigger transient I/O pressure on resource-constrained mobile devices.Fix 1: Handle SQLite Auto-Rollback
The first fix is in
kvdb/sqlbase, whereattemptRollbacknow recognizes the "no transaction is active" error as a benign condition rather than a fatal failure. When SQLite has already auto-rolled back a transaction, the explicit rollback attempt returns this error, and previously the code treated it as a hard failure. Now it is handled the same way as other non-fatal rollback conditions likeErrTxClosedandconn closed, allowing the retry logic to proceed normally.Fix 2: Preserve Original Error on Rollback Failure
The second fix is in
sqldb'sExecuteSQLTransactionWithRetry, where the rollback error path previously replaced the original transaction error entirely. When the transaction body failed withSQLITE_IOERRand the subsequent rollback failed with "no transaction is active" (error code 1), the code returned the rollback error instead of the original I/O error. Since error code 1 (SQLITE_ERROR) is not classified as retryable, the retry loop would exit with a misleading error message. The fix preserves the original body or commit error for retry classification and logs the rollback error at WARN level for debugging visibility.Fix 3: Classify Transient SQLite Errors as Retryable
The third fix extends
parseSqliteErrorto classify three additional SQLite error codes as serialization errors eligible for transaction retry:SQLITE_IOERR(10),SQLITE_FULL(13), andSQLITE_LOCKED(6). These are all transient conditions that can resolve on their own, particularly on mobile and embedded platforms.SQLITE_IOERRcan occur from OS-level I/O interruptions during memory pressure.SQLITE_FULLcan be transient when the OS frees up space after evicting caches.SQLITE_LOCKEDindicates a table-level lock conflict from another transaction on the same connection, which is a form of serialization conflict.Result
Together these three fixes restore the intended behavior of the retry loop. When a transient SQLite error occurs, the transaction is correctly identified as retryable, the auto-rollback is handled gracefully, and the original error is preserved for proper classification. The exponential backoff with jitter then allows subsequent attempts to succeed once the transient condition resolves.
Testing
Each fix includes unit tests. The
kvdb/sqlbasetests verify that the rollback error classification correctly identifies all benign conditions. Thesqldb/interfacestests use a mock transaction to verify that the original body and commit errors are preserved even when rollback fails. Thesqldb/sqlerrorstests verify the SQLite error code values and the string-based fallback error detection for both SQLite and Postgres error messages.