Skip to content

Issues/5477 cpu usage docker cwl#5526

Open
avnig05 wants to merge 4 commits into
masterfrom
issues/5477-cpu-usage-docker-cwl
Open

Issues/5477 cpu usage docker cwl#5526
avnig05 wants to merge 4 commits into
masterfrom
issues/5477-cpu-usage-docker-cwl

Conversation

@avnig05

@avnig05 avnig05 commented May 28, 2026

Copy link
Copy Markdown
Collaborator

CWL and WDL have a shared runtime injection module that samples in-container CPU/memory usage and reports it back to Toil via runtime message files.

Changelog Entry

To be copied to the draft changelog by merger:

  • PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@avnig05 avnig05 requested a review from adamnovak May 28, 2026 19:46
@adamnovak

Copy link
Copy Markdown
Member

@avnig05 It looks like this isn't passing type checking; if you make mypy you should be able to reproduce the problems:
https://ucsc-ci.com/databiosphere/toil/-/jobs/110094#L1195

It looks like mostly you need casts or # type: ignore comments at some places where you're using a fake object or looking in one of these Expando objects that are fundamentally dynamically-typed. But it looks like you might not be importing ResourceMonitor from the place where it's actually defined?

@adamnovak adamnovak left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should work for the Docker container case.

I think for the Singularity/CWL case, it might be double-counting the resource usage, because it will be counted via having a child process and counted again via the command injection.

I think it might be possible to simplify the hooking into the CWL code by overriding another method in ToilCommandLineTool that gets to run after the container finishes, which would let us avoid needing ToilContainerCommandLineJob and friends and the machinery to attach them.

I think the PR should be rebased to not be on top of the pre-squash commits from #5512.

Comment thread src/toil/cwl/cwltoil.py
Comment on lines +1213 to +1215
def _file_mounts_from_pathmapper(
job: ContainerCommandLineJob,
) -> list[tuple[str, str]]:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a docstring to at least explain which item in the result tuples is which.

Comment thread src/toil/cwl/cwltoil.py
Comment on lines +1105 to +1129
class ToilContainerCommandLineJob(ContainerCommandLineJob):
"""Container job that collects resource stats from injected in-container code."""

def _execute(
self,
runtime: list[str],
env: MutableMapping[str, str],
runtimeContext: cwltool.context.RuntimeContext,
monitor_function: Callable[["subprocess.Popen[str]"], None] | None = None,
) -> None:
super()._execute(runtime, env, runtimeContext, monitor_function)
handle_injection_messages_from_outdir(self.outdir)


class ToilDockerCommandLineJob(ToilContainerCommandLineJob, DockerCommandLineJob):
"""Docker container job with Toil runtime injection support."""


class ToilPodmanCommandLineJob(ToilContainerCommandLineJob, PodmanCommandLineJob):
"""Podman container job with Toil runtime injection support."""


class ToilSingularityCommandLineJob(ToilContainerCommandLineJob, SingularityCommandLineJob):
"""Singularity container job with Toil runtime injection support."""

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach of hooking the ContainerCommandLineJob in addition to the CommandLineTool is a little awkward. We have to have all these extra classes for each container system, and we already don't have one for UDockerCommandLineJob and it's hard to tell whether that's because we don't need one. And the hook here only works in concert with the hook that ToilCommandLineTool applies, so we end up with one feature spread out over several classes.

Instead of hooking ContainerCommandLineJob._execute(), did you consider hooking CommandLineTool.collect_output_ports() instead? It looks like that has access to the outdir, and it gets sent into _execute() and called inside there after the container has run. And if we did it that way we could keep all the hook logic together inside ToilCommandLineTool and not need to worry as much about the container type.

Comment thread src/toil/cwl/cwltoil.py
Comment on lines +1235 to +1242
for job in super().job(job_order, output_callbacks, runtimeContext):
if isinstance(job, ContainerCommandLineJob) and self._uses_container(
runtimeContext
):
file_mounts = self._file_mounts_from_pathmapper(job)
script = command_line_to_shell_script(job.command_line)
script = add_injections(script, file_mounts)
job.command_line = shell_script_to_command_line(script)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to inject the monitoring logic even when we're using Singularity (or I think UDocker?) where we should already see CPU and memory usage in the stats because it happens under a child process of Toil? Will that lead to double-counting of CPU usage?

###
# Runtime code injection system
# When a workflow steps runs inside a container, the Toil worker process on the host
# often cannot see how much CPU and RAM that step actually used. This system allows

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The word often is doing some vigorous handwaving here, and trying to hide that this text does not convey an understanding of how or why (or, more importantly for avoiding double-counting of resource usage, when) this will be the case.

present, and a flat argv list otherwise.
"""
if (
len(command_line) >= 3

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might want to be == or we'll be able to throw away arguments.

Comment on lines +499 to +505
"--outdir",
str(tmp_path / "output_dir"),
str(cwl_file),
str(inputs_file),
"--jobStore",
str(job_store),
"--stats",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing in here to make Docker be used, but I looked at the toil-cwl-runner options and it has --singularity, --podman, and --no-container, but not a --docker. So I guess Docker is the default container engine. Maybe we need a comment here to remind us that the test depends on that?

Comment on lines +2297 to +2306
# Include malformed and partial lines to exercise parser robustness.
message_file.write_text(
"CPU\t1000000\n"
"Memory\t1024\n"
"CPU\t4000000\n"
"Memory\t2048\n"
"Odd\tline\n"
"CPU\t5000000",
encoding="utf-8",
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that's noticeably absent here is a line with the wrong number of fields, and in particular too few fields.

from toil.lib import interpreter

message_file = tmp_path / "resources.tsv"
# Include malformed and partial lines to exercise parser robustness.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A some of what's being tested here (like the fact that final lines without a terminating newline are ignored) seems to be based on the implementation code, rather than anything that the docstring actually promises will be true about the function under test.

If it's an important enough point about the function's behavior that it's worth testing, it's probably worth documenting in the docstring. Otherwise, the next person to touch the function is going to tinker with the internals in a way that doesn't appear to change anything about what the function promises, and then be hit with test failures because there are other secret things that need to be true about the function.

job = object.__new__(cwltoil.ToilDockerCommandLineJob)
job.command_line = ["echo", "hello"]
job.pathmapper = FakePathMapper(FakeMapperEntry(str(host_input), "/work/input.txt"))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There aren't any actual assertions in this test, so I don't think it's genuinely testing what its docstring claims it is testing.

Comment on lines 821 to 875

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are leftover changes from #5512. It looks like your new commit c4c69a4 is on top of the commits from the branch for that PR, and not on top of the squashed commit that was created when the PR was merged.

I would recommend rebasing onto the current mainline Toil, and keeping just the commit about this feature in the branch for this PR:

  1. Be on the branch for this PR.
  2. Do git fetch origin (assuming your origin remote points to this Github project)
  3. Do git rebase -i origin/master
  4. In the editor, delete all the lines for the commits from the old feature, and only keep the line for the commit that's about the new feature.
  5. Save and quit
  6. Git should rewrite your history so now your issues/5477-cpu-usage-docker-cwl branch just has the new commit for this feature.
  7. Do a git push origin issues/5477-cpu-usage-docker-cwl -f to force-push your rewritten history to Github and update the PR.

@adamnovak

Copy link
Copy Markdown
Member

@avnig05 Since you're gone for the summer, I'm going to take on landing this myself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants