Skip to content

[Bug] Multiple self-deadlocks in CycloneDDS caused by recursive locking of non-recursive mutexes #2374

Description

@hanqing2025

Summary

Dear CycloneDDS maintainers,

Thank you for your work on the project. During our analysis, we found 8 self-deadlock cases in CycloneDDS that appear to share the same root cause.

In the observed cases, a thread acquires a non-recursive mutex and later re-enters a path that attempts to acquire the same mutex again before the first acquisition is released, resulting in self-deadlock.

Concrete example

One concrete case is gv->sendq_running_lock.

CycloneDDS acquires this lock in:

  • src/core/ddsc/src/dds_writer.c:459
  • src/core/ddsi/src/ddsi_init.c:1921

In the self-loop we observed, the path is rooted at dds_delete(). During deletion and cleanup, CycloneDDS enters ddsi_fini(), which acquires gv->sendq_running_lock at src/core/ddsi/src/ddsi_init.c:1921. We observed nested deletion/finalization paths in which the same thread re-enters cleanup before the first acquisition is released, and then tries to acquire gv->sendq_running_lock again.

A representative captured stack for this re-entry is:

dds_delete -> dds_delete_impl -> dds_delete_impl_pinned -> really_delete_pinned_closed_locked -> dds_entity_deriver_delete -> dds_domain_free -> ddsi_fini -> joinleave_spdp_defmcip -> ddsi_addrset_forall -> ddsi_leave_mc -> ddsrt_mutex_lock

Because gv->sendq_running_lock is non-recursive, this second acquisition attempt deadlocks the current thread on its own lock.

Affected revision

Latest upstream revision analyzed:

  • be3ecb81f4d6f0664d92fbbc0a2f8edb553bb4bf

Earlier releases also appear to be affected, but we have not yet determined the exact introduction point.

Affected sites

  1. entity->qos_lock

    • src/core/ddsc/src/dds_serdata_builtintopic.c:219
  2. gv->sendq_running_lock

    • src/core/ddsc/src/dds_writer.c:459
    • src/core/ddsi/src/ddsi_init.c:1921
  3. lowr->wr.e.lock

    • src/core/ddsi/src/ddsi_endpoint.c:1320
  4. observed->m_observers_lock

    • src/core/ddsc/src/dds_entity.c:1394
    • src/core/ddsc/src/dds_entity.c:1418
  5. parent->m_mutex

    • src/core/ddsc/src/dds_entity.c:335
    • src/core/ddsc/src/dds_entity.c:346
  6. pp->e.lock

    • src/core/ddsi/src/ddsi_participant.c:754
    • src/core/ddsi/src/ddsi_participant.c:991
  7. rd->m_entity.m_observers_lock

    • src/core/ddsc/src/dds_reader.c:225
  8. thread_states.lock

    • src/core/ddsi/src/ddsi_thread.c:257
    • src/core/ddsi/src/ddsi_thread.c:261

Impact

Any affected path can hang the calling thread and potentially the whole CycloneDDS process, resulting in denial of service.

Suggested direction

The fix likely needs to be the same at all sites:

  • avoid calling re-entrant helpers while holding these mutexes, or
  • restructure the call paths so the same mutex cannot be acquired twice by the same thread, or
  • only use recursive mutexes where re-entrancy is explicitly intended and proven safe

Reference

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions