Data-sharing VOLATILE DataReader can loop forever in ReaderPool::init_shared_segment() #6338

@jameszhangpr

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

When a late-joining VOLATILE DataReader matches an already running DataWriter through Data-sharing, reader initialization should complete quickly by skipping old samples and moving to the writer's current position. Discovery must not block, and no thread should busy-loop while holding the PDP mutex.

Current behavior

Under Data-sharing, a late-joining VOLATILE reader can spin forever inside ReaderPool::init_shared_segment() while trying to fast-forward to the writer position. One CPU core goes to 100%, the reader never finishes enabling, and discovery/transport threads block behind the PDP mutex, effectively hanging the process.

The problematic code is the is_volatile_ branch in src/cpp/rtps/DataSharing/ReaderPool.hpp:

if (is_volatile_)
{
    CacheChange_t ch;
    SequenceNumber_t last_sequence = c_SequenceNumber_Unknown;
    uint64_t current_end = end();
    get_next_unread_payload(ch, last_sequence, current_end);
    while (ch.sequenceNumber != SequenceNumber_t::unknown() || next_payload_ != current_end)
    {
        current_end = end();   // re-reads writer live position every iteration
        advance(next_payload_);
        get_next_unread_payload(ch, last_sequence, current_end);
    }
}

Because current_end is refreshed from the writer's live end() on every iteration, the loop can chase a moving target forever when the writer is active and the reader gets preempted at the wrong time.
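To make the moving-target argument concrete, here is a small deterministic toy model. It is not Fast DDS code: ToyWriter, step_per_read, fast_forward_live, fast_forward_snapshot and the max_iterations cap are all hypothetical names introduced for illustration, and the model ignores ring-buffer wrap-around and sequence-number checks. It only assumes that the writer publishes step_per_read payloads each time the reader polls end().

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: these names are hypothetical and are not Fast DDS APIs.
struct ToyWriter
{
    std::uint64_t end_pos;        // index one past the newest payload
    std::uint64_t step_per_read;  // payloads published each time the reader polls

    // Models a live end(): returns the current position, then the writer moves on.
    std::uint64_t end()
    {
        const std::uint64_t current = end_pos;
        end_pos += step_per_read;
        return current;
    }
};

// Same shape as the reported loop: end() is re-read on every iteration.
// Returns the number of iterations, or max_iterations if it gave up.
std::uint64_t fast_forward_live(ToyWriter& writer, std::uint64_t next_payload,
        std::uint64_t max_iterations)
{
    assert(next_payload <= writer.end_pos);
    std::uint64_t iterations = 0;
    std::uint64_t current_end = writer.end();
    while (next_payload != current_end && iterations < max_iterations)
    {
        current_end = writer.end();  // moving target
        ++next_payload;
        ++iterations;
    }
    return iterations;
}

// Variant that walks towards a single snapshot of end(): the bound is fixed up front.
std::uint64_t fast_forward_snapshot(ToyWriter& writer, std::uint64_t next_payload,
        std::uint64_t max_iterations)
{
    assert(next_payload <= writer.end_pos);
    std::uint64_t iterations = 0;
    const std::uint64_t snapshot_end = writer.end();
    while (next_payload != snapshot_end && iterations < max_iterations)
    {
        ++next_payload;
        ++iterations;
    }
    return iterations;
}
```

With ToyWriter{100, 1} and a reader starting at position 90, fast_forward_live never closes the gap of 10 and only stops at the artificial max_iterations cap, while fast_forward_snapshot finishes in 10 iterations. With an idle writer (step_per_read of 0) both variants finish in 10 iterations, which matches the observation that the hang only shows up when the writer is active.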

Steps to reproduce

  1. Run a publisher and subscriber on the same machine so Data-sharing is selected automatically.
  2. Start a DataWriter publishing at a high rate (for example 1000 Hz) and let it run for at least 2 seconds before the reader joins.
  3. Use writer QoS equivalent to:
    • BEST_EFFORT
    • TRANSIENT_LOCAL
    • KEEP_LAST(depth=10)
  4. Create a late-joining DataReader with QoS equivalent to:
    • BEST_EFFORT
    • VOLATILE
    • KEEP_LAST(depth=1)
    • Data-sharing left as AUTO / default
  5. During endpoint matching, ReaderPool::init_shared_segment() enters the is_volatile_ fast-forward path.
  6. If the reader thread is preempted while that loop is running, the reader can fall behind the active writer, keep re-reading a newer end(), and never terminate.

Observed result:

  • main thread at 100% CPU
  • stuck in ReaderPool::get_next_unread_payload() / ReaderPool::init_shared_segment()
  • PDP mutex remains held
  • discovery / UDP / SHM threads block waiting for that mutex

Fast DDS version/commit

Confirmed on Fast DDS v3.0.1 (local build).

I also checked the same logic in the current 3.4.x branch and the relevant ReaderPool.hpp code path appears unchanged, so newer versions may also be affected.

Platform/Architecture

Linux amd64 (same-host publisher/subscriber setup).

Transport layer

  • Default configuration, UDPv4 & SHM
  • Shared Memory Transport (SHM)
  • Data-sharing delivery

Additional context

A representative blocked stack looked like this:

ReaderPool::get_next_unread_payload()
ReaderPool::init_shared_segment<SharedMemSegment>()
DataSharingListener::add_datasharing_writer()
StatelessReader::matched_writer_add_edp()
EDP::pairingReader()

EDP::pairingReader() was holding the PDP mutex while entering the Data-sharing setup path, so once the loop became non-terminating, the rest of discovery stalled as well.

The issue seems easiest to trigger when all of these are true:

  • same machine
  • Data-sharing active
  • late-joining reader
  • VOLATILE reader durability
  • active high-rate writer
  • startup or CPU load high enough to increase the chance of reader preemption

A simple root-cause fix seems to be to avoid the moving target altogether: for VOLATILE readers, take a single snapshot of end() and jump directly to it, instead of re-reading end() on every loop iteration.
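A rough sketch of that idea, under stated assumptions: ToySegment and ToyReaderPool below are hypothetical stand-ins, not the real Fast DDS classes, and the real init_shared_segment() has a different signature. A real patch would also need to preserve whatever bookkeeping the current loop performs (for example the last_sequence tracking visible in the snippet above), which this sketch does not model.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the shared segment; not the Fast DDS type.
struct ToySegment
{
    std::atomic<std::uint64_t> write_pos{0};  // advanced by the writer

    // One atomic read of the writer's current position.
    std::uint64_t end() const
    {
        return write_pos.load(std::memory_order_acquire);
    }
};

// Hypothetical stand-in for the reader-side pool.
struct ToyReaderPool
{
    const ToySegment* segment = nullptr;
    std::uint64_t next_payload = 0;
    bool is_volatile = false;

    // A VOLATILE reader does not need old samples, so it takes one snapshot
    // of end() and starts there instead of walking the history in a loop.
    void init_shared_segment(const ToySegment& seg, bool volatile_reader)
    {
        segment = &seg;
        is_volatile = volatile_reader;
        if (is_volatile)
        {
            next_payload = segment->end();  // single read, constant time
        }
    }
};
```

In this sketch the VOLATILE path does a constant amount of work regardless of how fast the writer publishes, so there is nothing left to spin on while the PDP mutex is held. Samples published after the snapshot are simply picked up by normal reading from next_payload onwards.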
