Issues/5477 cpu usage docker cwl by avnig05 · Pull Request #5526 · DataBiosphere/toil

avnig05 · 2026-05-28T19:46:42Z

CWL and WDL have a shared runtime injection module that samples in-container CPU/memory usage and reports it back to Toil via runtime message files.

Changelog Entry

To be copied to the draft changelog by merger:

PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- If it is coming from an external repo, make sure to pull it in for CI with:
```
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
```
- If there is no associated issue, create one.
Read through the code changes. Make sure that it doesn't have:
- Addition of trailing whitespace.
- New variable or member names in camelCase that want to be in snake_case.
- New functions without type hints.
- New functions or classes without informative docstrings.
- Changes to semantics not reflected in the relevant docstrings.
- New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
- New features without tests.
Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
Finish the review with an overall description of your opinion.

Merger Checklist

Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
Merge with the Github "Squash and merge" feature.
- If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
Copy its recommended changelog entry to the Draft Changelog.
Append the issue number in parentheses to the changelog entry.

adamnovak · 2026-05-28T20:37:26Z

@avnig05 It looks like this isn't passing type checking; if you run make mypy you should be able to reproduce the problems:
https://ucsc-ci.com/databiosphere/toil/-/jobs/110094#L1195

It looks like mostly you need casts or # type: ignore comments at some places where you're using a fake object or looking in one of these Expando objects that are fundamentally dynamically-typed. But it looks like you might not be importing ResourceMonitor from the place where it's actually defined?

adamnovak

This looks like it should work for the Docker container case.

I think for the Singularity/CWL case, it might be double-counting the resource usage, because it will be counted via having a child process and counted again via the command injection.

I think it might be possible to simplify the hooking into the CWL code by overriding another method in ToilCommandLineTool that gets to run after the container finishes, which would let us avoid needing ToilContainerCommandLineJob and friends and the machinery to attach them.

I think the PR should be rebased to not be on top of the pre-squash commits from #5512.

adamnovak · 2026-05-28T20:41:08Z

+    def _file_mounts_from_pathmapper(
+        job: ContainerCommandLineJob,
+    ) -> list[tuple[str, str]]:


This needs a docstring to at least explain which item in the result tuples is which.
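A docstring along these lines would settle which tuple item is which. This is only a sketch: the (host path, container path) order is an assumption for illustration, and `MapperEntry` and `file_mounts_from_entries` are hypothetical stand-ins, not the PR's code or cwltool's types.

```python
from typing import NamedTuple


class MapperEntry(NamedTuple):
    """Hypothetical stand-in for a path mapper entry."""

    resolved: str  # where the file lives on the host
    target: str  # where the file should appear inside the container


def file_mounts_from_entries(entries: list[MapperEntry]) -> list[tuple[str, str]]:
    """
    Get the file mounts described by a collection of path mapper entries.

    :param entries: Path mapper entries for the job's input files.
    :returns: A list of (host_path, container_path) tuples. The first item
        is the path on the host; the second is the path at which that file
        is visible inside the container. (This order is an assumption.)
    """
    return [(entry.resolved, entry.target) for entry in entries]
```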

adamnovak · 2026-05-28T21:04:12Z

+class ToilContainerCommandLineJob(ContainerCommandLineJob):
+    """Container job that collects resource stats from injected in-container code."""
+
+    def _execute(
+        self,
+        runtime: list[str],
+        env: MutableMapping[str, str],
+        runtimeContext: cwltool.context.RuntimeContext,
+        monitor_function: Callable[["subprocess.Popen[str]"], None] | None = None,
+    ) -> None:
+        super()._execute(runtime, env, runtimeContext, monitor_function)
+        handle_injection_messages_from_outdir(self.outdir)
+
+
+class ToilDockerCommandLineJob(ToilContainerCommandLineJob, DockerCommandLineJob):
+    """Docker container job with Toil runtime injection support."""
+
+
+class ToilPodmanCommandLineJob(ToilContainerCommandLineJob, PodmanCommandLineJob):
+    """Podman container job with Toil runtime injection support."""
+
+
+class ToilSingularityCommandLineJob(ToilContainerCommandLineJob, SingularityCommandLineJob):
+    """Singularity container job with Toil runtime injection support."""
+


This approach of hooking the ContainerCommandLineJob in addition to the CommandLineTool is a little awkward. We have to have all these extra classes for each container system, and we already don't have one for UDockerCommandLineJob and it's hard to tell whether that's because we don't need one. And the hook here only works in concert with the hook that ToilCommandLineTool applies, so we end up with one feature spread out over several classes.

Instead of hooking ContainerCommandLineJob._execute(), did you consider hooking CommandLineTool.collect_output_ports() instead? It looks like that has access to the outdir, and it gets sent into _execute() and called inside there after the container has run. And if we did it that way we could keep all the hook logic together inside ToilCommandLineTool and not need to worry as much about the container type.
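The shape of that alternative could look roughly like the sketch below. The `CommandLineTool` base class here is a simplified stand-in with an assumed, abbreviated signature, and `handle_injection_messages_from_outdir` is replaced by a stub that only records the call, so this shows the structure of the hook rather than working cwltool integration.

```python
from typing import Any

# Records which output directories the stub helper was called with.
handled_outdirs: list[str] = []


def handle_injection_messages_from_outdir(outdir: str) -> None:
    """Stub for the PR's helper; the real one would read the message files."""
    handled_outdirs.append(outdir)


class CommandLineTool:
    """Simplified stand-in for cwltool's class (signature is an assumption)."""

    def collect_output_ports(
        self, ports: Any, builder: Any, outdir: str, rcode: int, **kwargs: Any
    ) -> dict[str, Any]:
        return {}


class ToilCommandLineTool(CommandLineTool):
    """Keeps the whole hook in one class, independent of container engine."""

    def collect_output_ports(
        self, ports: Any, builder: Any, outdir: str, rcode: int, **kwargs: Any
    ) -> dict[str, Any]:
        # The container has finished by the time outputs are collected, so
        # any runtime message files should already be in the outdir.
        handle_injection_messages_from_outdir(outdir)
        return super().collect_output_ports(ports, builder, outdir, rcode, **kwargs)
```

With this layout no per-engine job subclasses are needed, since the hook does not care which job class ran the container.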

adamnovak · 2026-05-28T21:06:10Z

+        for job in super().job(job_order, output_callbacks, runtimeContext):
+            if isinstance(job, ContainerCommandLineJob) and self._uses_container(
+                runtimeContext
+            ):
+                file_mounts = self._file_mounts_from_pathmapper(job)
+                script = command_line_to_shell_script(job.command_line)
+                script = add_injections(script, file_mounts)
+                job.command_line = shell_script_to_command_line(script)


Is this going to inject the monitoring logic even when we're using Singularity (or I think UDocker?) where we should already see CPU and memory usage in the stats because it happens under a child process of Toil? Will that lead to double-counting of CPU usage?
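One way to avoid the double-counting would be to gate the injection on the container engine. The sketch below is hypothetical: `should_inject_monitoring` and the engine names are made up for illustration, and which engines actually run as a child process of the worker would need to be verified (the comment above is only sure about Singularity and unsure about UDocker).

```python
# Engines assumed to run the tool under a child process of the Toil worker,
# so that ordinary process-tree accounting already sees their usage.
# Only Singularity is listed; UDocker may belong here too but is unconfirmed.
CHILD_PROCESS_ENGINES = frozenset({"singularity"})


def should_inject_monitoring(engine: str) -> bool:
    """
    Decide whether to inject in-container resource sampling.

    :param engine: Lowercase name of the container engine in use.
    :returns: True if usage would otherwise be invisible to the worker,
        False if injecting would count the same usage twice.
    """
    return engine not in CHILD_PROCESS_ENGINES
```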

adamnovak · 2026-05-28T21:10:00Z

+###
+# Runtime code injection system
+# When a workflow steps runs inside a container, the Toil worker process on the host
+# often cannot see how much CPU and RAM that step actually used. This system allows


The word "often" is doing some vigorous handwaving here, and trying to hide that this text does not convey an understanding of how or why (or, more importantly for avoiding double-counting of resource usage, when) this will be the case.

adamnovak · 2026-05-28T21:11:47Z

+    present, and a flat argv list otherwise.
+    """
+    if (
+        len(command_line) >= 3


This might want to be == or we'll be able to throw away arguments.
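To illustrate the concern, here is a minimal sketch of an exact-length check. It assumes the wrapper being detected has the shape `["sh", "-c", script]`, which is a guess since the diff excerpt does not show the rest of the condition; the function body is not the PR's implementation.

```python
import shlex


def command_line_to_shell_script(command_line: list[str]) -> str:
    """
    Turn an argv list into a shell script string.

    If the command is exactly ["sh", "-c", script] (an assumed wrapper
    shape), return the script itself. Otherwise, quote and join the
    arguments as a flat argv list.
    """
    # Using == rather than >= means a command like
    # ["sh", "-c", script, "extra"] is not mistaken for a bare wrapper,
    # so the trailing argument is preserved instead of being dropped.
    if len(command_line) == 3 and command_line[:2] == ["sh", "-c"]:
        return command_line[2]
    return shlex.join(command_line)
```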

adamnovak · 2026-05-28T21:32:25Z

+                    "--outdir",
+                    str(tmp_path / "output_dir"),
+                    str(cwl_file),
+                    str(inputs_file),
+                    "--jobStore",
+                    str(job_store),
+                    "--stats",


There's nothing in here to make Docker be used, but I looked at the toil-cwl-runner options and it has --singularity, --podman, and --no-container, but not a --docker. So I guess Docker is the default container engine. Maybe we need a comment here to remind us that the test depends on that?

adamnovak · 2026-05-28T21:35:37Z

+    # Include malformed and partial lines to exercise parser robustness.
+    message_file.write_text(
+        "CPU\t1000000\n"
+        "Memory\t1024\n"
+        "CPU\t4000000\n"
+        "Memory\t2048\n"
+        "Odd\tline\n"
+        "CPU\t5000000",
+        encoding="utf-8",
+    )


One thing that's noticeably absent here is a line with the wrong number of fields, and in particular too few fields.
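For illustration, test input covering that case might look like the sketch below. `parse_resource_messages` is a hypothetical parser written for this example, not the function under test in the PR; in particular, taking the maximum per name and counting an unterminated final line are assumptions that may differ from the real behavior.

```python
def parse_resource_messages(text: str) -> dict[str, int]:
    """
    Parse lines of the form name<TAB>value and return the largest value
    seen for each recognized name.

    Lines that do not have exactly two fields, that have a non-integer
    value, or that use an unrecognized name are skipped.
    """
    largest: dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            # Too few or too many fields: skip rather than crash.
            continue
        name, raw_value = fields
        if name not in ("CPU", "Memory"):
            continue
        try:
            value = int(raw_value)
        except ValueError:
            continue
        largest[name] = max(largest.get(name, 0), value)
    return largest


sample = (
    "CPU\t1000000\n"
    "CPU\n"  # too few fields
    "Memory\t2048\t9\n"  # too many fields
    "\n"  # empty line
    "Memory\t1024\n"
    "CPU\tabc\n"  # value is not an integer
)
parsed = parse_resource_messages(sample)
```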

adamnovak · 2026-05-28T21:42:24Z

+    from toil.lib import interpreter
+
+    message_file = tmp_path / "resources.tsv"
+    # Include malformed and partial lines to exercise parser robustness.


Some of what's being tested here (like the fact that final lines without a terminating newline are ignored) seems to be based on the implementation code, rather than anything that the docstring actually promises will be true about the function under test.

If it's an important enough point about the function's behavior that it's worth testing, it's probably worth documenting in the docstring. Otherwise, the next person to touch the function is going to tinker with the internals in a way that doesn't appear to change anything about what the function promises, and then be hit with test failures because there are other secret things that need to be true about the function.

adamnovak · 2026-05-28T21:43:16Z

+    job = object.__new__(cwltoil.ToilDockerCommandLineJob)
+    job.command_line = ["echo", "hello"]
+    job.pathmapper = FakePathMapper(FakeMapperEntry(str(host_input), "/work/input.txt"))
+


There aren't any actual assertions in this test, so I don't think it's genuinely testing what its docstring claims it is testing.

adamnovak · 2026-05-28T21:50:56Z

I think these are leftover changes from #5512. It looks like your new commit c4c69a4 is on top of the commits from the branch for that PR, and not on top of the squashed commit that was created when the PR was merged.

I would recommend rebasing onto the current mainline Toil, and keeping just the commit about this feature in the branch for this PR:

- Be on the branch for this PR.
- Do git fetch origin (assuming your origin remote points to this Github project).
- Do git rebase -i origin/master.
- In the editor, delete all the lines for the commits from the old feature, and only keep the line for the commit that's about the new feature.
- Save and quit.
- Git should rewrite your history so now your issues/5477-cpu-usage-docker-cwl branch just has the new commit for this feature.
- Do a git push origin issues/5477-cpu-usage-docker-cwl -f to force-push your rewritten history to Github and update the PR.

adamnovak · 2026-06-11T20:52:02Z

@avnig05 Since you're gone for the summer, I'm going to take on landing this myself.

Avni Gandhi added 4 commits May 5, 2026 12:10

Harden CPU overuse warning checks and add unit tests

7cdac00

address CPU usage warning test review feedback

f37f102

address CPU usage warning test review feedback

e5fda4c

add injected container cpu/memory accounting for cwl and wdl

c4c69a4

avnig05 requested a review from adamnovak May 28, 2026 19:46

adamnovak requested changes May 28, 2026
