perf: eliminate mutex contention in parallel trie commit by deferring NodeSet merges until after goroutine synchronization #33156
The parallel trie commit implementation suffers from significant mutex contention that makes it slower than the sequential path for medium and large tries. This relates directly to the performance issue documented in `core/state/statedb.go:1236-1241` by @karalabe, where account trie commits were observed to take 5-6ms at chain heads despite only shuffling data (no hashing). In `commitChildren()`, each of the 16 child goroutines has to acquire the same mutex to merge its result via `MergeDisjoint()`, so the goroutines merge one at a time and the parallel work is effectively serialized at the merge point; a simplified sketch of that pattern follows.
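To illustrate the contention, here is a minimal, self-contained sketch of that shape of code. The `NodeSet`, `mergeDisjoint`, and `commitChild` names below are simplified stand-ins for this description only, not the actual go-ethereum types or the real `commitChildren()` implementation:

```go
package commitdemo

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for illustration only; not the actual
// go-ethereum trienode.NodeSet or trie committer types.
type NodeSet struct {
	nodes map[string][]byte
}

func newNodeSet() *NodeSet {
	return &NodeSet{nodes: make(map[string][]byte)}
}

// mergeDisjoint copies another set's entries into this one; like the real
// MergeDisjoint, it assumes the two sets never contain the same path.
func (s *NodeSet) mergeDisjoint(other *NodeSet) {
	for path, blob := range other.nodes {
		s.nodes[path] = blob
	}
}

// commitChild stands in for committing one of the 16 child subtries and
// collecting its dirty nodes.
func commitChild(i int) *NodeSet {
	set := newNodeSet()
	set.nodes[fmt.Sprintf("child-%x", i)] = []byte{byte(i)}
	return set
}

// commitChildrenContended shows the contended pattern: all 16 goroutines
// queue on a single mutex to merge their result, so the merge step runs
// strictly one goroutine at a time.
func commitChildrenContended() *NodeSet {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		merged = newNodeSet()
	)
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			set := commitChild(i)

			mu.Lock() // every goroutine contends here
			merged.mergeDisjoint(set)
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return merged
}
```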
I think the results of the benchmarks I ran are noteworthy (`go test -bench=BenchmarkCommit -benchmem -run=^$ ./trie -benchtime=3s`): counterintuitively, the "parallel" version is actually slower than the sequential one for medium and large tries.
So relative to sequential execution, that means:
500 nodes: Parallel is 61% slower
2000 nodes: Parallel is 36% slower
5000 nodes: Parallel is 61% slower
To me that confirmed my suspicion of a mutex-contention bottleneck. Below are the same benchmarks run with the optimization made in this PR. To summarize the results:
500 nodes: from 61% slower to 38% faster (99% improvement)
2000 nodes: from 36% slower to 44% faster (80% improvement)
5000 nodes: from 61% slower to 20% slower (41% improvement)
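For reference, this is roughly the shape of the change, continuing the simplified sketch above (again illustrative stand-in code, not the exact diff): each goroutine writes its NodeSet into its own slot of a preallocated slice, and the merges happen on the parent goroutine only after `wg.Wait()`, so the hot path needs no mutex at all.

```go
// commitChildrenDeferred shows the deferred-merge pattern: each goroutine
// writes its NodeSet into its own slice slot, and the parent merges them
// only after wg.Wait(), eliminating the shared lock from the hot path.
func commitChildrenDeferred() *NodeSet {
	var (
		wg      sync.WaitGroup
		results = make([]*NodeSet, 16) // one slot per child, no sharing
	)
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = commitChild(i) // distinct indices, no lock needed
		}(i)
	}
	wg.Wait() // synchronize first...

	merged := newNodeSet()
	for _, set := range results {
		if set != nil {
			merged.mergeDisjoint(set) // ...then merge without contention
		}
	}
	return merged
}
```

Since the child node sets cover disjoint sub-paths of the trie, deferring the merges changes neither the result nor the disjointness assumption; it only moves the sequential part out of the goroutines and off the contended lock.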