HDFS-17722. DataNode stuck decommissioning on standby NameNode due to excess replica timing race. #8295
Open
balodesecurity wants to merge 3 commits into apache:trunk from
Conversation
… excess replica timing race.

In HA mode, a timing race can cause the standby NN to incorrectly mark a replica as excess before it learns that a DataNode is decommissioning. This leaves the standby's isSufficient() check permanently returning false (live=1 < RF=2), so the decommission monitor never calls setDecommissioned() and logs under-replication warnings indefinitely.

Fix: in isSufficient(), count excess replicas (physically present block copies) alongside live replicas when checking decommission sufficiency for non-UC blocks. A hasMinStorage guard ensures at least dfs.replication.min live copies exist for durability. If the excess replica is later deleted, the block manager detects under-replication and schedules re-replication.
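The check this commit message describes can be sketched as a minimal standalone model. The class and parameter names here are illustrative, not the actual Hadoop patch (which modifies `DatanodeAdminManager.isSufficient()`): excess replicas count toward the replication factor, but a minimum number of live copies must remain.

```java
// Sketch of the POST-FIX sufficiency check described above.
// Names are illustrative; the real change is in
// DatanodeAdminManager.isSufficient().
public class PostFixSufficiency {

    // Excess replicas (physically present copies) count toward RF,
    // but at least minReplication live copies must exist for durability.
    static boolean isSufficient(int numLive, int numExcess,
                                int replicationFactor, int minReplication) {
        boolean hasMinStorage = numLive >= minReplication;
        return hasMinStorage && (numLive + numExcess) >= replicationFactor;
    }

    public static void main(String[] args) {
        System.out.println(isSufficient(1, 1, 2, 1)); // bug scenario: true
        System.out.println(isSufficient(0, 2, 2, 1)); // no live copy: false
    }
}
```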
…cess replica fix.

Tests cover:
- Bug scenario: live=1 + excess=1 >= RF=2 → decommission allowed (HDFS-17722 fix)
- Normal case: live=2, excess=0 → decommission allowed (not broken by fix)
- Safety guard: live=0, excess=2 → decommission blocked (no durable copy)
- Insufficient even with excess: live=0 + excess=1 < RF=2 → blocked
- Excess above RF with min live: live=1 + excess=2 >= RF=2, live >= min → allowed
Author
Docker Integration Test Results

Tested on a 3-DataNode Docker cluster (1 NameNode + 3 DataNodes, RF=3, balodesecurity/hadoop HDFS-17722 branch).

Note on replicating the bug naturally: in a single-NameNode setup the race does not occur naturally (the block manager processes setrep deletions before the decommission check runs in the same thread). The bug is specific to the standby NameNode path. The unit tests in
Author
CI failed due to Jenkins OOM kill (exit code 137) — unrelated to the patch. Requesting retest. /retest
Problem
On a standby NameNode, a DataNode can get stuck in the `DECOMMISSION_INPROGRESS` state indefinitely when a timing race causes a replica to be flagged as excess instead of live during decommissioning.

Sequence: `isSufficient()` sees `numLive=2` (DN-B, DN-C), which does not satisfy RF=3. With only 2 live copies counted, decommission stalls and `isSufficient()` never returns true. The excess replica on DN-D is a physically present block copy and contributes to durability; ignoring it causes the deadlock.
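The stalling condition can be illustrated with a minimal standalone model of the pre-fix check (hypothetical class and parameter names; the real logic lives in `DatanodeAdminManager.isSufficient()`):

```java
// Minimal model of the PRE-FIX sufficiency check: only live replicas
// are counted, so a replica misclassified as excess is invisible here.
// Class and parameter names are illustrative, not the Hadoop originals.
public class PreFixSufficiency {

    // Pre-fix logic: decommission may complete only when live copies
    // alone reach the replication factor.
    static boolean isSufficient(int numLive, int replicationFactor) {
        return numLive >= replicationFactor;
    }

    public static void main(String[] args) {
        // Standby NN view from the sequence above: DN-B and DN-C live,
        // DN-D's copy flagged as excess, RF = 3.
        System.out.println(isSufficient(2, 3)); // false, indefinitely
    }
}
```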
Fix
In `DatanodeAdminManager.isSufficient()`, count excess replicas alongside live replicas for the sufficiency check on non-under-construction blocks.

The `hasMinStorage` guard (which checks `dfs.replication.min`, default 1) ensures decommission does not proceed if zero live replicas exist, since excess-only replicas are not guaranteed durable. If the excess replica on DN-D is subsequently deleted after decommission completes, the block manager's normal under-replication detection will schedule re-replication.

Testing
Unit tests — `TestDatanodeAdminManagerIsSufficient` (5 tests, no cluster required):

- `testExcessReplicaCountsTowardSufficiency` → true
- `testNormalDecommissionStillSufficient` → true
- `testNoLiveReplicaBlocksDecommission` → false
- `testInsufficientEvenWithExcess` → false
- `testExcessAboveRFWithMinLive` → true

Docker integration — 3-DataNode cluster with 1 NameNode and RF=3, 5 scenarios:
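The five unit-test scenarios can be replayed against a simplified stand-in for the patched check. This is plain Java with illustrative names, not the real `TestDatanodeAdminManagerIsSufficient` class:

```java
// The five scenarios from TestDatanodeAdminManagerIsSufficient,
// replayed on a simplified model of the patched check.
// isSufficient() here is an illustrative stand-in, not Hadoop code.
public class SufficiencyScenarios {

    static boolean isSufficient(int live, int excess, int rf, int minRep) {
        return live >= minRep && (live + excess) >= rf;
    }

    public static void main(String[] args) {
        int minRep = 1; // dfs.replication.min default

        // Bug scenario: live=1 + excess=1 >= RF=2, decommission allowed.
        System.out.println(isSufficient(1, 1, 2, minRep)); // true
        // Normal case: live=2, excess=0, still allowed.
        System.out.println(isSufficient(2, 0, 2, minRep)); // true
        // Safety guard: live=0, excess=2, blocked (no durable copy).
        System.out.println(isSufficient(0, 2, 2, minRep)); // false
        // Still short: live=0 + excess=1 < RF=2, blocked.
        System.out.println(isSufficient(0, 1, 2, minRep)); // false
        // Excess above RF with min live: live=1 + excess=2, allowed.
        System.out.println(isSufficient(1, 2, 2, minRep)); // true
    }
}
```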
Related