perf: eliminate mutex contention in parallel trie commit by deferring NodeSet merges until after goroutine synchronization #33156
The parallel trie commit implementation suffers from significant mutex contention that makes it slower than the sequential path for medium and large tries. This relates directly to the performance issue documented in `core/state/statedb.go:1236-1241` by @karalabe, where account trie commits were observed to take 5-6ms at chain heads despite only shuffling data (no hashing). In `commitChildren()`, each of the 16 child goroutines has to acquire the same mutex to merge its result via `MergeDisjoint()`, so the goroutines merge one at a time and the parallel work is effectively serialized at the merge point; a simplified sketch of that pattern follows.
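To illustrate the contention, here is a minimal, self-contained sketch of that shape of code. The `NodeSet`, `mergeDisjoint`, and `commitChild` names below are simplified stand-ins for this description only, not the actual go-ethereum types or the real `commitChildren()` implementation:

```go
package commitdemo

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for illustration only; not the actual
// go-ethereum trienode.NodeSet or trie committer types.
type NodeSet struct {
	nodes map[string][]byte
}

func newNodeSet() *NodeSet {
	return &NodeSet{nodes: make(map[string][]byte)}
}

// mergeDisjoint copies another set's entries into this one; like the real
// MergeDisjoint, it assumes the two sets never contain the same path.
func (s *NodeSet) mergeDisjoint(other *NodeSet) {
	for path, blob := range other.nodes {
		s.nodes[path] = blob
	}
}

// commitChild stands in for committing one of the 16 child subtries and
// collecting its dirty nodes.
func commitChild(i int) *NodeSet {
	set := newNodeSet()
	set.nodes[fmt.Sprintf("child-%x", i)] = []byte{byte(i)}
	return set
}

// commitChildrenContended shows the contended pattern: all 16 goroutines
// queue on a single mutex to merge their result, so the merge step runs
// strictly one goroutine at a time.
func commitChildrenContended() *NodeSet {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		merged = newNodeSet()
	)
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			set := commitChild(i)

			mu.Lock() // every goroutine contends here
			merged.mergeDisjoint(set)
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return merged
}
```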
I think the results of the benchmarks I ran are noteworthy (`go test -bench=BenchmarkCommit -benchmem -run=^$ ./trie -benchtime=3s`): counterintuitively, the "parallel" version is actually slower than the sequential one for medium and large tries.
So relative to sequential execution, that means:
500 nodes: Parallel is 61% slower
2000 nodes: Parallel is 36% slower
5000 nodes: Parallel is 61% slower
To me that confirmed my suspicion of a mutex-contention bottleneck. Below are the same benchmarks run with the optimization made in this PR. To summarize the results:
500 nodes: from 61% slower to 38% faster (99% improvement)
2000 nodes: from 36% slower to 44% faster (80% improvement)
5000 nodes: from 61% slower to 20% slower (41% improvement)
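For reference, this is roughly the shape of the change, continuing the simplified sketch above (again illustrative stand-in code, not the exact diff): each goroutine writes its NodeSet into its own slot of a preallocated slice, and the merges happen on the parent goroutine only after `wg.Wait()`, so the hot path needs no mutex at all.

```go
// commitChildrenDeferred shows the deferred-merge pattern: each goroutine
// writes its NodeSet into its own slice slot, and the parent merges them
// only after wg.Wait(), eliminating the shared lock from the hot path.
func commitChildrenDeferred() *NodeSet {
	var (
		wg      sync.WaitGroup
		results = make([]*NodeSet, 16) // one slot per child, no sharing
	)
	for i := 0; i < 16; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = commitChild(i) // distinct indices, no lock needed
		}(i)
	}
	wg.Wait() // synchronize first...

	merged := newNodeSet()
	for _, set := range results {
		if set != nil {
			merged.mergeDisjoint(set) // ...then merge without contention
		}
	}
	return merged
}
```

Since the child node sets cover disjoint sub-paths of the trie, deferring the merges changes neither the result nor the disjointness assumption; it only moves the sequential part out of the goroutines and off the contended lock.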