Summary
Dear CycloneDDS maintainers,
Thank you for your work on the project. During our analysis, we found 8 self-deadlock cases in CycloneDDS that appear to share the same root cause.
In the observed cases, a thread acquires a non-recursive mutex and later re-enters a path that attempts to acquire the same mutex again before the first acquisition is released, resulting in self-deadlock.
Concrete example
One concrete case is gv->sendq_running_lock.
CycloneDDS acquires this lock in:
src/core/ddsc/src/dds_writer.c:459
src/core/ddsi/src/ddsi_init.c:1921
In the self-loop we observed, the path is rooted at dds_delete(). During deletion and cleanup, CycloneDDS enters ddsi_fini(), which acquires gv->sendq_running_lock at src/core/ddsi/src/ddsi_init.c:1921. We observed nested deletion/finalization paths in which the same thread re-enters cleanup before the first acquisition is released, and then tries to acquire gv->sendq_running_lock again.
A representative captured stack for this re-entry is:
dds_delete -> dds_delete_impl -> dds_delete_impl_pinned -> really_delete_pinned_closed_locked -> dds_entity_deriver_delete -> dds_domain_free -> ddsi_fini -> joinleave_spdp_defmcip -> ddsi_addrset_forall -> ddsi_leave_mc -> ddsrt_mutex_lock
Because gv->sendq_running_lock is non-recursive, this second acquisition attempt deadlocks the current thread on its own lock.
Affected revision
Latest upstream revision analyzed:
be3ecb81f4d6f0664d92fbbc0a2f8edb553bb4bf
Earlier releases also appear to be affected, but we have not yet determined the exact introduction point.
Affected sites
-
entity->qos_lock
src/core/ddsc/src/dds_serdata_builtintopic.c:219
-
gv->sendq_running_lock
src/core/ddsc/src/dds_writer.c:459
src/core/ddsi/src/ddsi_init.c:1921
-
lowr->wr.e.lock
src/core/ddsi/src/ddsi_endpoint.c:1320
-
observed->m_observers_lock
src/core/ddsc/src/dds_entity.c:1394
src/core/ddsc/src/dds_entity.c:1418
-
parent->m_mutex
src/core/ddsc/src/dds_entity.c:335
src/core/ddsc/src/dds_entity.c:346
-
pp->e.lock
src/core/ddsi/src/ddsi_participant.c:754
src/core/ddsi/src/ddsi_participant.c:991
-
rd->m_entity.m_observers_lock
src/core/ddsc/src/dds_reader.c:225
-
thread_states.lock
src/core/ddsi/src/ddsi_thread.c:257
src/core/ddsi/src/ddsi_thread.c:261
Impact
Any affected path can hang the calling thread and potentially the whole CycloneDDS process, resulting in denial of service.
Suggested direction
The fix likely needs to be the same at all sites:
- avoid calling re-entrant helpers while holding these mutexes, or
- restructure the call paths so the same mutex cannot be acquired twice by the same thread, or
- only use recursive mutexes where re-entrancy is explicitly intended and proven safe
Reference
Summary
Dear CycloneDDS maintainers,
Thank you for your work on the project. During our analysis, we found 8 self-deadlock cases in CycloneDDS that appear to share the same root cause.
In the observed cases, a thread acquires a non-recursive mutex and later re-enters a path that attempts to acquire the same mutex again before the first acquisition is released, resulting in self-deadlock.
Concrete example
One concrete case is
gv->sendq_running_lock.CycloneDDS acquires this lock in:
src/core/ddsc/src/dds_writer.c:459src/core/ddsi/src/ddsi_init.c:1921In the self-loop we observed, the path is rooted at
dds_delete(). During deletion and cleanup, CycloneDDS entersddsi_fini(), which acquiresgv->sendq_running_lockatsrc/core/ddsi/src/ddsi_init.c:1921. We observed nested deletion/finalization paths in which the same thread re-enters cleanup before the first acquisition is released, and then tries to acquiregv->sendq_running_lockagain.A representative captured stack for this re-entry is:
dds_delete -> dds_delete_impl -> dds_delete_impl_pinned -> really_delete_pinned_closed_locked -> dds_entity_deriver_delete -> dds_domain_free -> ddsi_fini -> joinleave_spdp_defmcip -> ddsi_addrset_forall -> ddsi_leave_mc -> ddsrt_mutex_lockBecause
gv->sendq_running_lockis non-recursive, this second acquisition attempt deadlocks the current thread on its own lock.Affected revision
Latest upstream revision analyzed:
be3ecb81f4d6f0664d92fbbc0a2f8edb553bb4bfEarlier releases also appear to be affected, but we have not yet determined the exact introduction point.
Affected sites
entity->qos_locksrc/core/ddsc/src/dds_serdata_builtintopic.c:219gv->sendq_running_locksrc/core/ddsc/src/dds_writer.c:459src/core/ddsi/src/ddsi_init.c:1921lowr->wr.e.locksrc/core/ddsi/src/ddsi_endpoint.c:1320observed->m_observers_locksrc/core/ddsc/src/dds_entity.c:1394src/core/ddsc/src/dds_entity.c:1418parent->m_mutexsrc/core/ddsc/src/dds_entity.c:335src/core/ddsc/src/dds_entity.c:346pp->e.locksrc/core/ddsi/src/ddsi_participant.c:754src/core/ddsi/src/ddsi_participant.c:991rd->m_entity.m_observers_locksrc/core/ddsc/src/dds_reader.c:225thread_states.locksrc/core/ddsi/src/ddsi_thread.c:257src/core/ddsi/src/ddsi_thread.c:261Impact
Any affected path can hang the calling thread and potentially the whole CycloneDDS process, resulting in denial of service.
Suggested direction
The fix likely needs to be the same at all sites:
Reference