WIP: tests: (scale and servicing) use framework context for update-loop goroutine lifecycle#647
WIP: tests: (scale and servicing) use framework context for update-loop goroutine lifecycle#647bfjelds wants to merge 3 commits into
Conversation
|
/azp run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
d55615e to
c735db0
Compare
|
/azp run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Improves Storm scale test reliability by tying long-running goroutines to the framework test-case context, tightening per-iteration SSH tunnel cleanup, and making initramfs diagnostics less prone to false positives.
Changes:
- Switch servicing test execution to pass
storm.TestCasethrough so tests can usetc.Context()/tc.BackgroundWaitGroup(). - Add per-iteration SSH proxy cancellation + wait to avoid accumulating proxy goroutines/processes across update iterations.
- Refine serial log parsing to separate transient dracut warnings from true initramfs boot failures.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tools/storm/utils/ssh/ssh.go | Makes SSH reverse port forward startup more strict and adjusts behavior on context-cancelled shutdown. |
| tools/storm/servicing/trident.go | Adds a context-aware runner that passes storm.TestCase into servicing test functions. |
| tools/storm/servicing/tests/update.go | Uses framework context/WaitGroup for background servers, adds per-iteration SSH proxy lifecycle, and improves dracut/initramfs diagnostics. |
Comments suppressed due to low confidence (1)
tools/storm/servicing/tests/update.go:279
- The temp file handle (
stageLogLocalTmpFile) is never closed, and the temp path is not removed on several early-return paths after creation (e.g., SCP download failure, later stage/finalize error returns before the explicitos.Remove). Close the file promptly after getting the name, and ensure the temp path is removed on all exits from the iteration (a common approach is to scope per-iteration cleanup via an inner function so defers don’t accumulate across iterations).
stageLogLocalTmpFile, err := os.CreateTemp("", "staged-trident-full")
if err != nil {
cleanupProxies()
return fmt.Errorf("failed to create temp staging log file: %w", err)
}
stageLogLocalTmpPath := stageLogLocalTmpFile.Name()
err = stormssh.ScpDownloadFile(vmConfig.VMConfig, vmIP, "/var/log/trident-full.log", stageLogLocalTmpPath)
if err != nil {
return fmt.Errorf("failed to download staged trident log: %w", err)
}
c735db0 to
7277612
Compare
|
/azp run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
tools/storm/servicing/tests/update.go:279
- The temp file handle is never closed (
stageLogLocalTmpFile.Close()), and the temp path is only removed on the “success” path. Any early return after creation (e.g., SCP failure, copy failure, stage errors) will leak temp files; at scale this can exhaust disk/inodes and file descriptors. A concrete fix is to close the temp file immediately after getting its name, and ensure the temp path is removed on all paths (e.g., via a per-iteration-scoped defer in an inner function, or explicit cleanup before each return after creation).
stageLogLocalTmpFile, err := os.CreateTemp("", "staged-trident-full")
if err != nil {
cleanupProxies()
return fmt.Errorf("failed to create temp staging log file: %w", err)
}
stageLogLocalTmpPath := stageLogLocalTmpFile.Name()
err = stormssh.ScpDownloadFile(vmConfig.VMConfig, vmIP, "/var/log/trident-full.log", stageLogLocalTmpPath)
if err != nil {
return fmt.Errorf("failed to download staged trident log: %w", err)
}
7277612 to
ee47505
Compare
|
/azp run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Three fixes for scale test reliability:
1. Use storm TestCase.Context() instead of context.Background()
- The servicing update-loop created its own orphaned context, bypassing
the storm framework's goroutine lifecycle management
- Thread tc (storm.TestCase) through to UpdateLoop/Rollback via new
runTestCaseWithContext method, matching the pattern used in e2e tests
- Register netlisten goroutines with tc.BackgroundWaitGroup()
2. Per-iteration SSH proxy lifecycle
- SSH reverse tunnel goroutines were started inside the loop using the
function-level context, accumulating 2 goroutines per iteration
- Now use per-iteration child context with explicit cleanup before
finalize (tunnels are only needed during staging)
- Add ExitOnForwardFailure=yes to detect remote port binding failures
- Handle cmd.Start() failure to prevent deadlock on startedChannel
3. Improve initramfs diagnostics (checkSerialLogForDracutIssues)
- Distinguish between genuine initramfs failures (emergency shell,
device timeout) and transient warnings where the VM booted past
initramfs into systemd but is unreachable for other reasons
- Log transient warnings as WARN instead of ERRO to avoid
misattribution of SSH/network failures as initramfs issues
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ee47505 to
3792d15
Compare
| // SSH reverse tunnels are no longer needed — finalize re-reads the COSI | ||
| // file, so tunnels must stay alive through finalize. Stop them now before | ||
| // the reboot to avoid orphaned SSH processes. | ||
| cleanupProxies() | ||
|
|
| // Wait for both update servers to start | ||
| <-aStartedChannel | ||
| <-bStartedChannel |
| if stageErr != nil { | ||
| os.Remove(stageLogLocalTmpPath) | ||
| cleanupProxies() | ||
| if egrepOut, err := exec.Command("/bin/sh", "-c", fmt.Sprintf("grep 'target is busy' %s | grep umount", stageLogLocalTmpPath)).CombinedOutput(); err == nil { | ||
| // Check for known unmount failure and signal |
When a VM boots (DHCP lease + virsh running) but SSH port 22 is unreachable, capture diagnostics to identify the root cause: 1. virsh domifaddr - verify VM network interfaces 2. ARP table entry - check L2 reachability 3. Ping test - check L3 reachability 4. TCP port probes - check key services 5. Full serial log save - preserve boot output 6. virsh console commands - query systemd/sshd/network state (cherry-picked from scale-failure-diagnostics branch) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three improvements based on analysis of pipeline 1121313 (5/525 failures): 1. Fix serial log truncation race: Move TruncateLog AFTER VM shutdown instead of before. Previously, shutdown messages written after truncation polluted the next boot's serial log, making it appear the VM produced no output (seen in qemu-3-5 at iteration 64). 2. Fix bootedToSystemd false positive: The dracut emergency classifier was misidentifying genuine initramfs failures as transient because systemd messages from prior boots leaked into the serial log. Now requires kernel timestamp markers to confirm messages are from the current boot. Also promote 'Entering emergency mode' to definite failure (seen in uki-3-17 at iteration 90, bug 15086). 3. Add host resource diagnostics: Log host memory, load average, and QEMU process count when a VM fails to boot. Resource exhaustion on the QEMU host can cause boot failures that look like VM issues. 4. Add domain state verification: After DomainCreate, verify the VM is actually in 'running' state before waiting for login. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three fixes for scale test reliability:
Addresses goroutine leak warnings and false-positive initramfs attribution in scale test runs.