Is there an already existing issue for this?
Expected behavior
When a late-joining VOLATILE DataReader matches an already running DataWriter through Data-sharing, reader initialization should complete quickly by skipping old samples and moving to the writer's current position. Discovery must not block, and no thread should busy-loop while holding the PDP mutex.
Current behavior
Under Data-sharing, a late-joining VOLATILE reader can spin forever inside ReaderPool::init_shared_segment() while trying to fast-forward to the writer position. One CPU core goes to 100%, the reader never finishes enabling, and discovery/transport threads block behind the PDP mutex, effectively hanging the process.
The problematic code is the is_volatile_ branch in src/cpp/rtps/DataSharing/ReaderPool.hpp:
if (is_volatile_)
{
    CacheChange_t ch;
    SequenceNumber_t last_sequence = c_SequenceNumber_Unknown;
    uint64_t current_end = end();
    get_next_unread_payload(ch, last_sequence, current_end);
    while (ch.sequenceNumber != SequenceNumber_t::unknown() || next_payload_ != current_end)
    {
        current_end = end(); // re-reads the writer's live position every iteration
        advance(next_payload_);
        get_next_unread_payload(ch, last_sequence, current_end);
    }
}
Because current_end is refreshed from the writer's live end() on every iteration, the loop can chase a moving target forever when the writer is active and the reader gets preempted at the wrong time.
Steps to reproduce
- Run a publisher and subscriber on the same machine so Data-sharing is selected automatically.
- Start a DataWriter publishing at high rate (for example 1000 Hz) and let it run for at least 2 seconds before the reader joins.
- Use writer QoS equivalent to:
  - BEST_EFFORT
  - TRANSIENT_LOCAL
  - KEEP_LAST(depth=10)
- Create a late-joining DataReader with QoS equivalent to:
  - BEST_EFFORT
  - VOLATILE
  - KEEP_LAST(depth=1)
- Data-sharing left as AUTO / default
- During endpoint matching, ReaderPool::init_shared_segment() enters the is_volatile_ fast-forward path.
- If the reader thread is preempted while that loop is running, the reader can fall behind the active writer, keep re-reading a newer end(), and never terminate.
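For reference, the QoS combination above corresponds roughly to the following DDS-layer configuration (a sketch based on the standard Fast DDS `DataWriterQos`/`DataReaderQos` accessors; not verified against a specific release):

```cpp
#include <fastdds/dds/publisher/qos/DataWriterQos.hpp>
#include <fastdds/dds/subscriber/qos/DataReaderQos.hpp>

using namespace eprosima::fastdds::dds;

void configure_repro_qos(DataWriterQos& writer_qos, DataReaderQos& reader_qos)
{
    // Writer: BEST_EFFORT + TRANSIENT_LOCAL + KEEP_LAST(10)
    writer_qos.reliability().kind = BEST_EFFORT_RELIABILITY_QOS;
    writer_qos.durability().kind = TRANSIENT_LOCAL_DURABILITY_QOS;
    writer_qos.history().kind = KEEP_LAST_HISTORY_QOS;
    writer_qos.history().depth = 10;
    writer_qos.data_sharing().automatic();  // AUTO is the default; shown explicitly

    // Reader: BEST_EFFORT + VOLATILE + KEEP_LAST(1)
    reader_qos.reliability().kind = BEST_EFFORT_RELIABILITY_QOS;
    reader_qos.durability().kind = VOLATILE_DURABILITY_QOS;
    reader_qos.history().kind = KEEP_LAST_HISTORY_QOS;
    reader_qos.history().depth = 1;
    reader_qos.data_sharing().automatic();
}
```

With both endpoints on the same host and compatible Data-sharing domains, this selection lets Fast DDS pick Data-sharing delivery automatically.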
Observed result:
- main thread at 100% CPU
- stuck in ReaderPool::get_next_unread_payload() / ReaderPool::init_shared_segment()
- PDP mutex remains held
- discovery / UDP / SHM threads block waiting for that mutex
Fast DDS version/commit
Confirmed on Fast DDS v3.0.1 (local build).
I also checked the same logic in the current 3.4.x branch and the relevant ReaderPool.hpp code path appears unchanged, so newer versions may also be affected.
Platform/Architecture
Linux amd64 (same-host publisher/subscriber setup).
Transport layer
- Default configuration, UDPv4 & SHM
- Shared Memory Transport (SHM)
- Data-sharing delivery
Additional context
A representative blocked stack looked like this:
ReaderPool::get_next_unread_payload()
ReaderPool::init_shared_segment<SharedMemSegment>()
DataSharingListener::add_datasharing_writer()
StatelessReader::matched_writer_add_edp()
EDP::pairingReader()
EDP::pairingReader() was holding the PDP mutex while entering the Data-sharing setup path, so once the loop became non-terminating, the rest of discovery stalled as well.
The issue seems easiest to trigger when all of these are true:
- same machine
- Data-sharing active
- late-joining reader
- VOLATILE reader durability
- active high-rate writer
- startup or CPU load high enough to increase the chance of reader preemption
A simple root-cause fix seems to be to avoid the moving target: for VOLATILE readers, take a single snapshot of end() before the loop and fast-forward to that snapshot, instead of re-reading end() on every iteration.
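The snapshot idea can be illustrated with a small self-contained model (illustrative names only, not a patch against the Fast DDS sources): `end()` is read exactly once, so the loop has a fixed, reachable target even while the writer keeps advancing.

```cpp
#include <cstdint>

// Toy model of the suggested fix: fast-forward a late-joining VOLATILE
// reader to a one-time snapshot of the writer's end position. writer_pos
// models the writer's live position, which keeps advancing on each read.
uint64_t fast_forward_to_snapshot(uint64_t& writer_pos)
{
    auto live_end = [&writer_pos]() { return ++writer_pos; };

    uint64_t next_payload = 0;
    const uint64_t snapshot_end = live_end();  // read end() exactly once

    while (next_payload != snapshot_end)       // fixed target: always terminates
    {
        ++next_payload;                        // skip one old payload
    }
    return next_payload;                       // == snapshot_end
}
```

Samples written after the snapshot are simply delivered through the normal read path afterwards, which is acceptable for a VOLATILE reader since it makes no guarantee about historical data anyway.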