Issues/5477 cpu usage docker cwl#5526
Conversation
|
@avnig05 It looks like this isn't passing type checking; if you It looks like mostly you need casts or |
adamnovak
left a comment
There was a problem hiding this comment.
This looks like it should work for the Docker container case.
I think for the Singularity/CWL case, it might be double-counting the resource usage, because it will be counted via having a child process and counted again via the command injection.
I think it might be possible to simplify the hooking into the CWL code by overriding another method in ToilCommandLineTool that gets to run after the container finishes, which would let us avoid needing ToilContainerCommandLineJob and friends and the machinery to attach them.
I think the PR should be rebased to not be on top of the pre-squash commits from #5512.
| def _file_mounts_from_pathmapper( | ||
| job: ContainerCommandLineJob, | ||
| ) -> list[tuple[str, str]]: |
There was a problem hiding this comment.
This needs a docstring to at least explain which item in the result tuples is which.
| class ToilContainerCommandLineJob(ContainerCommandLineJob): | ||
| """Container job that collects resource stats from injected in-container code.""" | ||
|
|
||
| def _execute( | ||
| self, | ||
| runtime: list[str], | ||
| env: MutableMapping[str, str], | ||
| runtimeContext: cwltool.context.RuntimeContext, | ||
| monitor_function: Callable[["subprocess.Popen[str]"], None] | None = None, | ||
| ) -> None: | ||
| super()._execute(runtime, env, runtimeContext, monitor_function) | ||
| handle_injection_messages_from_outdir(self.outdir) | ||
|
|
||
|
|
||
| class ToilDockerCommandLineJob(ToilContainerCommandLineJob, DockerCommandLineJob): | ||
| """Docker container job with Toil runtime injection support.""" | ||
|
|
||
|
|
||
| class ToilPodmanCommandLineJob(ToilContainerCommandLineJob, PodmanCommandLineJob): | ||
| """Podman container job with Toil runtime injection support.""" | ||
|
|
||
|
|
||
| class ToilSingularityCommandLineJob(ToilContainerCommandLineJob, SingularityCommandLineJob): | ||
| """Singularity container job with Toil runtime injection support.""" | ||
|
|
There was a problem hiding this comment.
This approach of hooking the ContainerCommandLineJob in addition to the CommandLineTool is a little awkward. We have to have all these extra classes for each container system, and we already don't have one for UDockerCommandLineJob and it's hard to tell whether that's because we don't need one. And the hook here only works in concert with the hook that ToilCommandLineTool applies, so we end up with one feature spread out over several classes.
Instead of hooking ContainerCommandLineJob._execute(), did you consider hooking CommandLineTool.collect_output_ports() instead? It looks like that has access to the outdir, and it gets sent into _execute() and called inside there after the container has run. And if we did it that way we could keep all the hook logic together inside ToilCommandLineTool and not need to worry as much about the container type.
| for job in super().job(job_order, output_callbacks, runtimeContext): | ||
| if isinstance(job, ContainerCommandLineJob) and self._uses_container( | ||
| runtimeContext | ||
| ): | ||
| file_mounts = self._file_mounts_from_pathmapper(job) | ||
| script = command_line_to_shell_script(job.command_line) | ||
| script = add_injections(script, file_mounts) | ||
| job.command_line = shell_script_to_command_line(script) |
There was a problem hiding this comment.
Is this going to inject the monitoring logic even when we're using Singularity (or I think UDocker?) where we should already see CPU and memory usage in the stats because it happens under a child process of Toil? Will that lead to double-counting of CPU usage?
| ### | ||
| # Runtime code injection system | ||
| # When a workflow steps runs inside a container, the Toil worker process on the host | ||
| # often cannot see how much CPU and RAM that step actually used. This system allows |
There was a problem hiding this comment.
The word often is doing some vigorous handwaving here, and trying to hide that this text does not convey an understanding of how or why (or, more importantly for avoiding double-counting of resource usage, when) this will be the case.
| present, and a flat argv list otherwise. | ||
| """ | ||
| if ( | ||
| len(command_line) >= 3 |
There was a problem hiding this comment.
This might want to be == or we'll be able to throw away arguments.
| "--outdir", | ||
| str(tmp_path / "output_dir"), | ||
| str(cwl_file), | ||
| str(inputs_file), | ||
| "--jobStore", | ||
| str(job_store), | ||
| "--stats", |
There was a problem hiding this comment.
There's nothing in here to make Docker be used, but I looked at the toil-cwl-runner options and it has --singularity, --podman, and --no-container, but not a --docker. So I guess Docker is the default container engine. Maybe we need a comment here to remind us that the test depends on that?
| # Include malformed and partial lines to exercise parser robustness. | ||
| message_file.write_text( | ||
| "CPU\t1000000\n" | ||
| "Memory\t1024\n" | ||
| "CPU\t4000000\n" | ||
| "Memory\t2048\n" | ||
| "Odd\tline\n" | ||
| "CPU\t5000000", | ||
| encoding="utf-8", | ||
| ) |
There was a problem hiding this comment.
One thing that's noticeably absent here is a line with the wrong number of fields, and in particular too few fields.
| from toil.lib import interpreter | ||
|
|
||
| message_file = tmp_path / "resources.tsv" | ||
| # Include malformed and partial lines to exercise parser robustness. |
There was a problem hiding this comment.
A some of what's being tested here (like the fact that final lines without a terminating newline are ignored) seems to be based on the implementation code, rather than anything that the docstring actually promises will be true about the function under test.
If it's an important enough point about the function's behavior that it's worth testing, it's probably worth documenting in the docstring. Otherwise, the next person to touch the function is going to tinker with the internals in a way that doesn't appear to change anything about what the function promises, and then be hit with test failures because there are other secret things that need to be true about the function.
| job = object.__new__(cwltoil.ToilDockerCommandLineJob) | ||
| job.command_line = ["echo", "hello"] | ||
| job.pathmapper = FakePathMapper(FakeMapperEntry(str(host_input), "/work/input.txt")) | ||
|
|
There was a problem hiding this comment.
There aren't any actual assertions in this test, so I don't think it's genuinely testing what its docstring claims it is testing.
There was a problem hiding this comment.
I think these are leftover changes from #5512. It looks like your new commit c4c69a4 is on top of the commits from the branch for that PR, and not on top of the squashed commit that was created when the PR was merged.
I would recommend rebasing onto the current mainline Toil, and keeping just the commit about this feature in the branch for this PR:
- Be on the branch for this PR.
- Do
git fetch origin(assuming youroriginremote points to this Github project) - Do
git rebase -i origin/master - In the editor, delete all the lines for the commits from the old feature, and only keep the line for the commit that's about the new feature.
- Save and quit
- Git should rewrite your history so now your
issues/5477-cpu-usage-docker-cwlbranch just has the new commit for this feature. - Do a
git push origin issues/5477-cpu-usage-docker-cwl -fto force-push your rewritten history to Github and update the PR.
|
@avnig05 Since you're gone for the summer, I'm going to take on landing this myself. |
CWL and WDL have a shared runtime injection module that samples in-container CPU/memory usage and reports it back to Toil via runtime message files.
Changelog Entry
To be copied to the draft changelog by merger:
Reviewer Checklist
issues/XXXX-fix-the-thingin the Toil repo, or from an external repo.camelCasethat want to be insnake_case.docs/running/{cliOptions,cwl,wdl}.rstMerger Checklist