
HDFS-17722. DataNode stuck decommissioning on standby NameNode due to excess replica timing race.#8295

Open
balodesecurity wants to merge 3 commits into apache:trunk from balodesecurity:HDFS-17722

Conversation

@balodesecurity

Problem

On a standby NameNode, a DataNode can get stuck in the DECOMMISSION_INPROGRESS state indefinitely when a timing race causes a replica to be flagged as excess instead of live during decommissioning.

Sequence:

  1. File is written to DN-A, DN-B, DN-C (RF=3).
  2. DN-A is marked for decommission.
  3. The block manager schedules re-replication → copies a new replica to DN-D.
  4. On the standby NN, the block report for DN-D arrives before the decommission state for DN-A is propagated. The standby marks DN-D's replica as excess (it looks like an over-replicated block).
  5. The decommission monitor on the standby calls isSufficient(): numLive=2 (DN-B, DN-C) does not satisfy RF=3, so the check fails and decommissioning stalls.
  6. Meanwhile DN-A is never fully decommissioned because isSufficient() never returns true.

The excess replica on DN-D is a physically present block copy and contributes to durability — ignoring it causes the deadlock.

Fix

In DatanodeAdminManager.isSufficient(), count excess replicas alongside live replicas for the sufficiency check on non-under-construction blocks:

// Count physically present excess replicas toward the replication
// factor, but still require the minimum number of live copies.
final int numLiveAndExcess = numLive + numberReplicas.excessReplicas();
if (numLiveAndExcess >= blockManager.getDefaultStorageNum(block)
    && blockManager.hasMinStorage(block, numLive)) {
  return true;
}

The hasMinStorage guard (checks dfs.replication.min, default 1) ensures decommission does not proceed if zero live replicas exist — excess-only replicas are not guaranteed durable. After decommission completes, if the excess replica on DN-D is subsequently deleted, the block manager's normal under-replication detection will schedule re-replication.
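The combined check can be modeled in isolation. The following is a minimal, hypothetical sketch of the patched logic: the real method operates on BlockInfo and NumberReplicas inside DatanodeAdminManager, and the names and parameters here are illustrative only.

```java
// Standalone model of the patched sufficiency check. In the real code,
// numLive and numExcess come from NumberReplicas, replicationFactor from
// BlockManager#getDefaultStorageNum, and minReplication models
// dfs.replication.min (default 1) as checked by hasMinStorage.
public class SufficiencyCheck {

  static boolean isSufficient(int numLive, int numExcess,
                              int replicationFactor, int minReplication) {
    // Excess replicas are physically present copies, so they count
    // toward the replication factor...
    final int numLiveAndExcess = numLive + numExcess;
    // ...but at least minReplication live replicas must exist, because
    // excess replicas are not guaranteed durable and may be deleted.
    return numLiveAndExcess >= replicationFactor
        && numLive >= minReplication;
  }

  public static void main(String[] args) {
    // HDFS-17722 standby scenario: live=2 (DN-B, DN-C), excess=1 (DN-D),
    // RF=3. The live-only check stalls; the combined check passes.
    System.out.println(isSufficient(2, 1, 3, 1)); // true
    // Safety guard: zero live replicas blocks decommission.
    System.out.println(isSufficient(0, 3, 3, 1)); // false
  }
}
```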

Testing

Unit tests: TestDatanodeAdminManagerIsSufficient (5 tests, no cluster required):

| Test | Scenario | Expected |
| --- | --- | --- |
| testExcessReplicaCountsTowardSufficiency | HDFS-17722 bug: live=1, excess=1, RF=2 | true |
| testNormalDecommissionStillSufficient | Baseline: live=2, excess=0, RF=2 | true |
| testNoLiveReplicaBlocksDecommission | Safety guard: live=0, excess=2, RF=2 | false |
| testInsufficientEvenWithExcess | live=0, excess=1, RF=2 (not enough either way) | false |
| testExcessAboveRFWithMinLive | live=1, excess=2, RF=2 (excess over-covers RF) | true |
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
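The five test rows above all reduce to the same predicate. The following self-contained sketch replays them against a simplified stand-in for the patched check (minimum live replicas fixed at 1, matching the default dfs.replication.min); it is illustrative, not the actual test class.

```java
// Replays the five unit-test rows against a simplified stand-in for
// the patched sufficiency predicate (min live replicas = 1).
public class IsSufficientTable {

  static boolean sufficient(int live, int excess, int rf) {
    return live + excess >= rf && live >= 1;
  }

  public static void main(String[] args) {
    // Each row: {live, excess, rf, expected (1 = sufficient)}
    int[][] rows = {
      {1, 1, 2, 1}, // excess counts toward sufficiency (the bug fix)
      {2, 0, 2, 1}, // normal decommission still sufficient
      {0, 2, 2, 0}, // safety guard: no live replica, blocked
      {0, 1, 2, 0}, // insufficient either way
      {1, 2, 2, 1}, // excess over-covers RF with min live present
    };
    for (int[] r : rows) {
      boolean got = sufficient(r[0], r[1], r[2]);
      if (got != (r[3] == 1)) {
        throw new AssertionError("row failed: live=" + r[0]
            + " excess=" + r[1] + " rf=" + r[2]);
      }
    }
    System.out.println("5 rows OK");
  }
}
```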

Docker integration — 3-DataNode cluster with 1 NameNode and RF=3, 5 scenarios:

  • Scenario 1: Clean decommission (RF=2) — PASS
  • Scenario 2: RF=3→2 creates excess replicas, then decommission DN2 — PASS
  • Scenario 3: Same scenario on DN3 — PASS
  • Scenario 4: Repeated decommission + recommission cycles (3 rounds) — PASS
  • Scenario 5: Data integrity check after decommission — PASS
Results: 0 failure(s) — ALL TESTS PASSED
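For context, decommission in these scenarios is triggered the standard way: the target DataNode is added to an exclude file and the NameNode is told to re-read it with `hdfs dfsadmin -refreshNodes`. A sketch of the relevant hdfs-site.xml wiring (the file path is illustrative):

```xml
<!-- hdfs-site.xml: DataNodes listed in this file enter
     DECOMMISSION_INPROGRESS after `hdfs dfsadmin -refreshNodes`. -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/opt/hadoop/etc/hadoop/dfs.exclude</value>
</property>
```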

Related

… excess replica timing race.

In HA mode, a timing race can cause the standby NN to incorrectly mark a
replica as excess before it learns that a DataNode is decommissioning. This
leaves the standby's isSufficient() check permanently returning false
(live=1 < RF=2), so the decommission monitor never calls setDecommissioned()
and logs under-replication warnings indefinitely.

Fix: in isSufficient(), count excess replicas (physically-present block
copies) alongside live replicas when checking decommission sufficiency for
non-UC blocks. A hasMinStorage guard ensures at least dfs.replication.min
live copies exist for durability. If the excess replica is later deleted,
the block manager detects under-replication and schedules re-replication.
…cess replica fix.

Tests cover:
- Bug scenario: live=1 + excess=1 >= RF=2 → decommission allowed (HDFS-17722 fix)
- Normal case: live=2, excess=0 → decommission allowed (not broken by fix)
- Safety guard: live=0, excess=2 → decommission blocked (no durable copy)
- Insufficient even with excess: live=0 + excess=1 < RF=2 → blocked
- Excess above RF with min live: live=1 + excess=2 >= RF=2, live >= min → allowed
@balodesecurity
Author

Docker Integration Test Results

Tested on a 3-DataNode Docker cluster (1 NameNode + 3 DataNodes, RF=3, balodesecurity/hadoop HDFS-17722 branch):

--- Scenario 1: Clean decommission (RF=2, decom DN2) ---
  [PASS] DN2 decommissioned cleanly (RF=2)

--- Scenario 2: HDFS-17722 — RF=3→2 creates EXCESS, then decom DN2 ---
  [PASS] DN2 decommissioned with EXCESS replicas present (HDFS-17722 FIX VERIFIED!)
  [PASS] All 3 files accessible after decommission

--- Scenario 3: HDFS-17722 on DN3 (variant) ---
  [PASS] DN3 decommissioned with EXCESS replicas (HDFS-17722 fix verified on DN3)

--- Scenario 4: Repeated decom/recommission cycles (3 rounds) ---
  [PASS] Round 1: DN2 decommissioned + recommissioned (Normal)
  [PASS] Round 2: DN2 decommissioned + recommissioned (Normal)
  [PASS] Round 3: DN2 decommissioned + recommissioned (Normal)

--- Scenario 5: Data integrity after decommission ---
  [PASS] DN2 decommissioned
  [PASS] Data integrity OK: content matches

Results: 0 failure(s) — ALL TESTS PASSED

Note on replicating the bug naturally: In a single-NameNode setup the race does not occur naturally (the block manager processes setrep deletions before the decommission check runs in the same thread). The bug is specific to the standby NameNode path. The unit tests in TestDatanodeAdminManagerIsSufficient directly exercise isSufficient() with the exact replica counts that trigger the deadlock. The Docker tests verify no regression in normal decommission behavior.

@balodesecurity
Author

CI failed due to Jenkins OOM kill (exit code 137) — unrelated to the patch. Requesting retest.

/retest
