Avoid std::exit() from a worker thread in fastq_mergepairs#635
Merged
Conversation
fastq_mergepairs runs a pool of worker threads. On an out-of-range FASTQ quality value, get_qual() (reached from process() via the worker pool) and the "More forward reads than reverse reads" check in read_pair() called std::exit() directly from a worker. exit() flushes and closes the shared output streams and runs static destructors while sibling workers are still writing to those streams, a data race that intermittently corrupts libc state and crashes (observed as an "Illegal instruction" core dump on FreeBSD CI; benign on glibc/Linux, hence intermittent and platform- dependent). It can also drop the fatal message before it reaches the --log file. Make the error path cooperative instead: a worker records the first error and sets an atomic abort flag; every worker then unwinds its loop, and pair_all() reports the error and exits from the main thread after all workers have joined. Teardown is then single-threaded and the message reliably reaches stderr and the log file. Verified with a multi-threaded stress loop of the offending inputs and a ThreadSanitizer build (no races on the abort or normal paths).
The comments describing why the chunk mutex/condition variable are locals still referenced the old hazard of a worker calling std::exit() while siblings wait on the condition variable. That can no longer happen: with the cooperative abort, exit() occurs only on the main thread after the worker pool has joined. Reword the comments to reflect the current design.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
fastq_mergepairsprocesses read pairs with a pool of worker threads. Twoerror paths reachable from a worker called
std::exit()directly:get_qual()(reached fromprocess()in the worker pool) on a FASTQquality value outside
[fastq_qmin, fastq_qmax], andread_pair().Calling
std::exit()from a worker is unsafe while sibling workers are stillrunning:
exit()flushes and closes the shared output streams and runs staticdestructors concurrently with threads that are still writing to those streams.
That is a data race on stdio/global teardown. It can corrupt libc state and
crash, and it can also drop the fatal message before it reaches the
--logfile.
The crash is timing-dependent, so it is intermittent and platform-dependent: it
was observed as an intermittent
Illegal instruction (core dumped)on FreeBSD(clang libc), while the same race happens to be benign on glibc/Linux.
Fix
Make the worker error path cooperative instead of exiting in place:
value) and sets an
std::atomic<bool>abort flag.process(), the chunk-processing loop, andpair_worker()observe the flagand unwind;
finished_allis set under the lock with anotify_all()so anyparked worker wakes and exits.
pair_all()reports the recorded error andcalls
exit()on the main thread. Teardown is then single-threaded, and themessage reliably reaches both
stderrand the--logfile.Output messages and exit status are unchanged.
Testing
exit non-zero; the
--logmessage is now reliably present; the forward/reverseread-count mismatch is still reported; normal merging is unchanged.
value injected mid-stream, and a concurrent read-count-mismatch case — all
clean exits, no crash or hang.