Change client IO pipeline to be inline with Java client design: IO thread pool, response threads, lock-free writes [HZ-5387]#1412

Merged
ihsandemir merged 59 commits into hazelcast:master from ihsandemir:rearchitecture
Apr 10, 2026

Conversation

@ihsandemir
Collaborator

@ihsandemir ihsandemir commented Mar 12, 2026

Summary

We would like to improve the throughput and performance of the client, following the IO thread PR #1410. We hope to gain improvements in throughput and latency by making the IO pipeline design as similar to the current Java client design as possible.

  • Dedicated response thread pool — offloads invocation completion from IO threads to dedicated ClientResponseHandler threads (RESPONSE_THREAD_COUNT property, default 2), preventing IO thread stalls
  • Lock-free write queue with batch flushing — replaces per-write dispatch with boost::lockfree::queue and scatter-gather async_write, coalescing multiple outbound messages into single syscalls
  • Global concurrent invocation registry — uses boost::concurrent_flat_map for O(1) correlation ID lookup, replacing per-connection maps
  • CallIdSequence for correlation IDs — mirrors Java client pattern with optional backpressure support
  • Fix race condition in backup ack handling — adds double-check in ClientInvocation::notify() after storing pending_response_ to prevent invocations from getting stuck when backup acks arrive concurrently with primary responses (mirrors Java BaseInvocation.notifyResponse pattern)
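The CallIdSequence bullet above can be sketched as follows. This is a minimal, illustrative stand-in (the class name, method names, and the plain `std::runtime_error` are assumptions, not the actual client API; the real Java-style implementation throws an overload exception and uses a padded sequence):

```cpp
#include <atomic>
#include <cstdint>
#include <stdexcept>

// Hypothetical simplified CallIdSequence: hands out unique correlation
// ids and, when a concurrency cap is set, applies backpressure by
// rejecting new invocations past that cap.
class call_id_sequence {
public:
    // max_concurrent == 0 disables backpressure entirely.
    explicit call_id_sequence(std::int64_t max_concurrent = 0)
        : max_concurrent_(max_concurrent) {}

    std::int64_t next() {
        if (max_concurrent_ > 0 &&
            in_flight_.fetch_add(1) >= max_concurrent_) {
            in_flight_.fetch_sub(1); // undo the optimistic reservation
            throw std::runtime_error("too many concurrent invocations");
        }
        return head_.fetch_add(1) + 1; // unique, monotonically increasing
    }

    // complete() must be called once per successful next(), whether the
    // invocation succeeds or fails, so the in-flight count drains.
    void complete() {
        if (max_concurrent_ > 0) in_flight_.fetch_sub(1);
    }

private:
    std::atomic<std::int64_t> head_{0};
    std::atomic<std::int64_t> in_flight_{0};
    const std::int64_t max_concurrent_;
};
```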

Test plan

  • Run ClientMapTest.testGet with 1000 entries — previously hung due to backup ack race condition
  • Run unit tests against 2-node cluster to validate backup-aware operations
  • Verify no regression in existing tests (all HazelcastTests1-8)
  • Benchmark sequential put/get throughput to confirm pipeline improvements

@ihsandemir ihsandemir self-assigned this Mar 12, 2026
@ihsandemir ihsandemir added this to the 5.7.0 milestone Mar 12, 2026
@ihsandemir ihsandemir requested a review from JackPGreen March 12, 2026 21:03
@ihsandemir ihsandemir changed the title Rearchitect client pipeline: IO thread pool, response threads, lock-free writes Rearchitect client pipeline: IO thread pool, response threads, lock-free writes [HZ-5387] Mar 12, 2026
Contributor

@JackPGreen JackPGreen left a comment


I've done a very limited review to the best of my abilities and added some comments.

Comment thread docs/plans/2026-03-05-cpp-pipeline-rearchitecture-design.md Outdated
Comment thread docs/plans/2026-03-05-cpp-pipeline-rearchitecture-design.md Outdated
Comment thread docs/plans/2026-03-05-cpp-pipeline-rearchitecture-design.md Outdated
Comment thread docs/plans/2026-03-05-cpp-pipeline-rearchitecture-design.md Outdated
Comment thread asan_symbolize.py Outdated
Comment thread hazelcast/include/hazelcast/client/internal/socket/BaseSocket.h Outdated
Comment thread hazelcast/src/hazelcast/client/response_handler.cpp Outdated
Comment thread hazelcast/include/hazelcast/client/internal/socket/BaseSocket.h Outdated
Comment thread hazelcast/include/hazelcast/client/spi/impl/ClientResponseHandler.h
Comment thread hazelcast/src/hazelcast/client/client_impl.cpp Outdated
Comment thread hazelcast/src/hazelcast/client/spi.cpp Outdated
Comment thread hazelcast/src/hazelcast/client/spi.cpp Outdated
@ihsandemir ihsandemir requested a review from JackPGreen March 26, 2026 13:31
@ihsandemir ihsandemir enabled auto-merge (squash) March 26, 2026 14:21
JackPGreen
JackPGreen previously approved these changes Mar 26, 2026
@ihsandemir ihsandemir changed the title Rearchitect client pipeline: IO thread pool, response threads, lock-free writes [HZ-5387] Change client IO pipeline to be inline with Java client design: IO thread pool, response threads, lock-free writes [HZ-5387] Mar 27, 2026
Add two new client property constants to control threading:
- IO_THREAD_COUNT (default 3): number of IO threads for networking
- RESPONSE_THREAD_COUNT (default 2): number of response processing threads

Includes static string definitions, constructor initializers, and
accessor methods following the existing client_properties pattern.
When no explicit executor_pool_size is set in client_config, the
user_executor_ thread pool now reads its size from the
RESPONSE_THREAD_COUNT client property (default: 2), matching the
Java client behavior. Previously it created an unbounded pool.
Instead of storing a single io_context& and resolver& as member fields,
SocketFactory now receives them as parameters to create(). This enables
callers to pass different io_context instances per connection, which is
needed for multi-threaded IO support.
…ignment

Replace single io_context/io_thread with vectors sized by IO_THREAD_COUNT
property. New connections are assigned to IO contexts via atomic round-robin
index, distributing network load across multiple threads.
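The atomic round-robin assignment described above can be illustrated with a small sketch. The class and method names here are hypothetical, and a plain `int` vector stands in for the `boost::asio::io_context` slots:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Illustrative pool that distributes new connections across IO
// contexts via a lock-free round-robin counter.
class io_context_pool {
public:
    explicit io_context_pool(std::size_t thread_count)
        : contexts_(thread_count) {}

    // Each new connection calls this once; the relaxed fetch_add is
    // enough because only the distribution, not ordering, matters.
    std::size_t next_io_context() {
        return next_.fetch_add(1, std::memory_order_relaxed)
               % contexts_.size();
    }

private:
    std::vector<int> contexts_; // stand-in for io_context instances
    std::atomic<std::size_t> next_{0};
};
```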
Required for boost::concurrent_flat_map (global invocation registry)
and boost::lockfree::queue (write queue MPSC).
Property: hazelcast.client.response.thread.count
Default: 2 threads
Follows IO_THREAD_COUNT pattern for configuration.
…rviceImpl

Uses boost::concurrent_flat_map for thread-safe invocation tracking.
Registers invocations in global map before writing to connection,
mirroring Java client's registerInvocation pattern.
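The register-before-write pattern can be sketched as below. A `std::unordered_map` guarded by a mutex stands in for `boost::concurrent_flat_map` (which needs no external locking); the type and method names are illustrative, not the actual client API:

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

struct invocation {
    std::int64_t correlation_id;
};

// Global registry keyed by correlation id. Invocations are registered
// here *before* the message is written to the connection, so a fast
// response can never arrive for an unknown id.
class invocation_registry {
public:
    void register_invocation(std::shared_ptr<invocation> inv) {
        std::lock_guard<std::mutex> g(mtx_);
        map_[inv->correlation_id] = std::move(inv);
    }

    std::shared_ptr<invocation> get(std::int64_t id) {
        std::lock_guard<std::mutex> g(mtx_);
        auto it = map_.find(id);
        return it == map_.end() ? nullptr : it->second;
    }

    void erase(std::int64_t id) {
        std::lock_guard<std::mutex> g(mtx_);
        map_.erase(id);
    }

private:
    std::mutex mtx_;
    std::unordered_map<std::int64_t, std::shared_ptr<invocation>> map_;
};
```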
Introduce ClientResponseHandler with configurable thread count
(hazelcast.client.response.thread.count, default 2) that processes
responses on dedicated threads instead of IO threads. Uses per-thread
MPSC queues with condition_variable for efficient distribution by
correlation_id. Mirrors Java client's ClientResponseHandlerSupplier
pattern.
Route regular responses through ClientResponseHandler.enqueue() instead
of processing them directly on the IO thread. Backup events and listener
events remain on the IO thread as they are lightweight. This offloads
response deserialization and invocation completion to dedicated response
threads.
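The distribution-by-correlation-id scheme above can be sketched as follows. This is a simplified illustration: `std::queue` plus mutex/condition_variable stands in for the per-thread MPSC queues, the callback only receives a correlation id, and the class name is assumed:

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Routes each response to a dedicated thread chosen by correlation id,
// so responses for one invocation are always processed in order and
// IO threads never block on deserialization or completion.
class response_handler {
    struct slot {
        std::mutex mtx;
        std::condition_variable cv;
        std::queue<std::int64_t> items;
    };

public:
    response_handler(std::size_t thread_count,
                     std::function<void(std::int64_t)> process)
        : queues_(thread_count), process_(std::move(process)) {
        for (std::size_t i = 0; i < thread_count; ++i)
            workers_.emplace_back([this, i] { run(i); });
    }

    // Called from IO threads: push and wake the owning response thread.
    void enqueue(std::int64_t correlation_id) {
        auto& q = queues_[static_cast<std::size_t>(correlation_id)
                          % queues_.size()];
        {
            std::lock_guard<std::mutex> g(q.mtx);
            q.items.push(correlation_id);
        }
        q.cv.notify_one();
    }

    ~response_handler() {
        stop_ = true;
        for (auto& q : queues_) q.cv.notify_all();
        for (auto& t : workers_) t.join(); // drains remaining items first
    }

private:
    void run(std::size_t i) {
        auto& q = queues_[i];
        for (;;) {
            std::unique_lock<std::mutex> lk(q.mtx);
            q.cv.wait(lk, [&] { return stop_.load() || !q.items.empty(); });
            if (q.items.empty()) return; // stopped and fully drained
            std::int64_t id = q.items.front();
            q.items.pop();
            lk.unlock();
            process_(id); // deserialization/completion happens here
        }
    }

    std::vector<slot> queues_;
    std::function<void(std::int64_t)> process_;
    std::atomic<bool> stop_{false};
    std::vector<std::thread> workers_;
};
```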
Match the socket buffer size (DEFAULT_BUFFER_SIZE_BYTE = 128KB) for
the ReadHandler to reduce the number of read syscalls and improve
throughput for larger responses.
Replace per-message strand posting and individual async_write calls with
a lock-free MPSC queue (boost::lockfree::queue). User threads push
write entries without strand involvement. A single flush on the IO
strand drains all queued messages into scatter-gather buffers for one
batched async_write syscall.

Also moves correlation ID generation to send() using an atomic counter
on Connection, ensuring the global invocation map is populated with the
correct ID before writing.

Key changes:
- BaseSocket: lock-free write_queue_, flush_write_queue() batch drain
- Connection: allocate_call_id() with atomic counter
- ClientInvocationServiceImpl::send(): generate + register correlation
  ID before write
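The batch-drain idea behind `flush_write_queue()` can be sketched as below. A `std::deque` stands in for `boost::lockfree::queue`, and the single gathered `async_write` syscall is only simulated; names are illustrative:

```cpp
#include <cstddef>
#include <deque>
#include <numeric>
#include <string>
#include <vector>

// Simplified write batcher: user threads would push messages into the
// (lock-free) queue; one flush on the IO strand drains everything into
// scatter-gather buffers for a single batched write.
struct write_batcher {
    std::deque<std::string> queue; // stand-in for the lock-free queue

    // Drains the whole queue into `gathered` and returns the total
    // byte count that the single async_write would send.
    std::size_t flush(std::vector<std::string>& gathered) {
        while (!queue.empty()) {
            gathered.push_back(std::move(queue.front()));
            queue.pop_front();
        }
        // one scatter-gather async_write(socket, buffers) would go here
        return std::accumulate(
            gathered.begin(), gathered.end(), std::size_t{0},
            [](std::size_t n, const std::string& s) { return n + s.size(); });
    }
};
```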
ihsandemir and others added 19 commits April 7, 2026 15:02
Co-authored-by: Jack Green <JackPGreen@Gmail.com>
Add connection liveness check in ClientInvocationServiceImpl::send()
to prevent invocations from being registered on already-closed
connections. Without this check, notify_connection_closed() runs
before the invocation is registered, and the subsequent async_write
failure cannot re-trigger close (idempotent), leaving the invocation
promise unfulfilled forever.

A double-check after set_send_connection closes the remaining race
window where the connection closes between the initial alive check
and invocation registration.
BaseSocket contains boost::lockfree::queue which requires 64-byte
alignment. In C++14 builds without -faligned-new, the default
operator new cannot satisfy over-aligned allocations. Add class-level
operator new/delete using posix_memalign/_aligned_malloc to handle
the alignment requirement portably.
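The class-level over-aligned allocation can be illustrated with a minimal POSIX-only sketch (the `_aligned_malloc` branch for Windows is omitted; the struct name is hypothetical):

```cpp
#include <cstdlib>
#include <new>

// In C++14 without -faligned-new, plain `new` cannot guarantee
// extended alignment, so the class provides its own operator new/delete
// backed by posix_memalign (POSIX only).
struct alignas(64) over_aligned {
    char data[64];

    static void* operator new(std::size_t size) {
        void* p = nullptr;
        if (::posix_memalign(&p, 64, size) != 0)
            throw std::bad_alloc{};
        return p;
    }

    static void operator delete(void* p) noexcept { std::free(p); }
};
```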
Enable C++17 over-aligned new support in C++14 mode instead of
disabling it. This fixes the compile error on GCC where
boost::lockfree::queue requires 64-byte alignment and -Werror
promotes the aligned-new warning to an error.
**Bug 1 – synchronous future continuation inside catch handlers**

`invoke()` and `invoke_urgent()` had a fast-path when the lifecycle was
no longer running that attached a future continuation without an
executor.  On Boost.Thread, a no-executor continuation runs synchronously
on the thread that resolves the promise.  During shutdown
`notify_exception_with_owned_permission` resolves the promise while
inside two nested catch handlers; the continuation called `f.get()`,
which re-threw, while those handlers were still active.  On Windows
(SEH) re-throwing inside active catch handlers corrupts the exception
chain, producing an ACCESS_VIOLATION.  On POSIX (Itanium ABI) the same
pattern is safe.

Fix: remove the no-executor shortcut entirely.  Both paths now always
schedule the continuation on the user executor, so `f.get()` runs on a
separate thread, outside any catch handler.  `id_seq->complete()` is
always called; this is safe for both backpressure-enabled and
no-backpressure CallIdSequence implementations.

**Bug 2 – listener invocations erased after initial registration response**

`ClientResponseHandler::process_response` called
`invocation->notify(msg)` with the default `erase=true` for every
successful response, including the initial registration confirmation for
listener invocations.  After that erase, subsequent server-pushed events
(IS_EVENT_FLAG) arrived with the same correlation ID, `get_invocation()`
returned null, and `process_event_message` dereferenced the null pointer
(crash at `invocation->get_event_handler()`).

This is a regression introduced when response processing was moved to
`ClientResponseHandler`: the previous inline path preserved the
distinction between ordinary invocations (erase after response) and
listener invocations (keep in map for event routing).

Fix: in `process_response`, check whether the invocation carries an
event handler; if it does, pass `erase=false` to `notify()` so the
invocation remains in the map.  Listener invocations are still removed
when explicitly deregistered (`deregister_invocation`) or when their
connection closes (`notify_exception` with `erase=true`).

Also add a defensive null check at the top of `process_event_message`
to silently drop events whose invocation has already been removed (the
normal race between deregistration and an in-flight event).
…tion so that we use user executor only when available, otherwise complete using boost::launch::sync in the same calling thread.
## Summary
- The `enforce-code-formatting` CI step installs the default
`clang-format` package on Ubuntu 24.04, which is version 18
- I locally use clang-format 22, which produces different formatting for
the same `.clang-format` config
- This causes false failures on PRs (e.g., hazelcast#1412) where code is
correctly formatted per v22 but differs under v18
- Fix: install `clang-format-22` from the official LLVM apt repository
to match the developer version
Comment thread hazelcast/include/hazelcast/client/connection/Connection.h
Comment thread hazelcast/include/hazelcast/client/spi/impl/ClientInvocation.h
Align listener event dispatch with the Java client. Previously, listener
registration invocations lived in the global invocations_ map forever
because erase_invocation() skipped deregistration whenever an event
handler was present. Every incoming event looked up the full
ClientInvocation from that global map just to reach its event handler.

Now, event handlers are stored in a per-Connection
boost::concurrent_flat_map<int64_t, EventHandler> keyed by correlation
id. Listener invocations are removed from the global map as soon as
the registration response arrives, and connection close automatically
scopes event handler cleanup. This mirrors TcpClientConnection's
eventHandlerMap in the Java client.

Changes:

- Connection: add event_handler_map_ with add/get/remove_event_handler
  methods. Clear the map in close() before inner_close(), matching the
  Java ordering.
- ClientInvocationServiceImpl::send(): after register_invocation(),
  install the invocation's event handler on the target connection.
- ClientInvocation::erase_invocation(): drop the
  'if (!this->event_handler_)' guard so every invocation is
  deregistered on completion.
- Connection::handle_client_message(): for IS_EVENT_FLAG messages,
  look up the handler from this connection's event_handler_map_ and
  forward to listener_service_impl.
- listener_service_impl: replace the two old handle_client_message
  overloads (which looked up a ClientInvocation by correlation id)
  with a single overload taking the EventHandler directly. Matching
  process_event_message overload follows. remove_event_handler() now
  erases from connection->remove_event_handler() instead of the
  global invocations map.
- ClientInvocation: drop the 'bool erase' parameter from the entire
  notify / complete / notify_exception /
  notify_exception_with_owned_permission / complete_with_pending_response
  / notify_response chain. With listener invocations no longer pinned
  in the global map, there is no caller that needs to keep an entry
  alive past notification. complete() and
  notify_exception_with_owned_permission() now unconditionally call
  erase_invocation(), matching ClientInvocation.complete() in Java.
- ClientResponseHandler::process_response: call invocation->notify()
  without the erase flag.

Thread safety: boost::concurrent_flat_map supports concurrent
insert_or_assign / erase / cvisit / clear without external
synchronization, the same pattern already used by the invocations_
map. IO threads read via cvisit, send() writes via insert_or_assign,
close() clears.

Edge cases:
- Connection closes before registration response: clear() drops the
  handler; re-registration via connection_added_internal re-invokes
  on the new connection and re-installs the handler.
- Backup event handler: dispatched via BACKUP_EVENT_FLAG through
  response_handler_.accept(), not through event_handler_map_, so it
  is unaffected.
- connection_removed_internal only touches listener-level tracking
  (registrations_); connection close handles event handler cleanup.
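The per-connection event handler map described above can be sketched as follows. A `std::unordered_map` plus mutex stands in for `boost::concurrent_flat_map`, a plain callback stands in for the real EventHandler type, and the method names mirror (but are not) the actual Connection API:

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

// Per-connection map from correlation id to event handler, installed
// when a listener registration is sent and cleared when the
// connection closes.
class connection_handlers {
public:
    using event_handler = std::function<void(std::int64_t)>;

    void add_event_handler(std::int64_t correlation_id, event_handler h) {
        std::lock_guard<std::mutex> g(mtx_);
        handlers_[correlation_id] = std::move(h);
    }

    // Called from the IO thread for IS_EVENT_FLAG messages; returns
    // false when the handler was already removed (the normal race
    // between deregistration and an in-flight event).
    bool dispatch(std::int64_t correlation_id) {
        event_handler h;
        {
            std::lock_guard<std::mutex> g(mtx_);
            auto it = handlers_.find(correlation_id);
            if (it == handlers_.end()) return false;
            h = it->second;
        }
        h(correlation_id); // run outside the lock
        return true;
    }

    void remove_event_handler(std::int64_t correlation_id) {
        std::lock_guard<std::mutex> g(mtx_);
        handlers_.erase(correlation_id);
    }

    // close() clears handlers before the rest of connection teardown,
    // matching the ordering described in the commit message.
    void close() {
        std::lock_guard<std::mutex> g(mtx_);
        handlers_.clear();
    }

private:
    std::mutex mtx_;
    std::unordered_map<std::int64_t, event_handler> handlers_;
};
```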
@ihsandemir ihsandemir requested a review from emreyigit April 10, 2026 07:16
@ihsandemir ihsandemir merged commit 59a6c87 into hazelcast:master Apr 10, 2026
57 of 61 checks passed
@ihsandemir ihsandemir deleted the rearchitecture branch April 10, 2026 11:46
JackPGreen added a commit to JackPGreen/hazelcast-cpp-client that referenced this pull request Apr 10, 2026
hazelcast#1412 inadvertently reverted hazelcast#1431, which caused the [PR builder to stop working again](https://github.com/hazelcast/hazelcast-cpp-client/actions/runs/24265361516/job/70859003057#step:2:32):
> gh: User does not exist or is not a public member of the organization (HTTP 404)

This was extra confusing because hazelcast#1431 was itself (partially) reverting hazelcast#1427.

Because of `pull_request_target`, this will require a manual force-merge.
ihsandemir pushed a commit that referenced this pull request Apr 13, 2026