WIP: tests: (scale and servicing) use framework context for update-loop goroutine lifecycle#647

Draft
bfjelds wants to merge 3 commits into main from user/bfjelds/mjolnir/fix-scale-test-failures

@bfjelds bfjelds commented May 14, 2026

Three fixes for scale test reliability:

  1. Storm framework integration — use tc.Context() instead of context.Background() for proper goroutine lifecycle management
  2. Per-iteration SSH proxy lifecycle — cancel tunnels before finalize to prevent goroutine accumulation
  3. Improved initramfs diagnostics — distinguish genuine initramfs failures from transient warnings where VM booted to systemd

Addresses goroutine leak warnings and false-positive initramfs attribution in scale test runs.
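The first fix can be illustrated with a minimal, self-contained sketch. It assumes only that the test case exposes `Context()` and `BackgroundWaitGroup()`, the two method names mentioned in this PR; the `lifecycle` interface and `fakeTestCase` type are hypothetical stand-ins, not the real storm types.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// lifecycle is a hypothetical stand-in for the two storm.TestCase methods
// named in this PR; the real interface may look different.
type lifecycle interface {
	Context() context.Context
	BackgroundWaitGroup() *sync.WaitGroup
}

// fakeTestCase is a toy implementation used only for this sketch.
type fakeTestCase struct {
	ctx context.Context
	wg  sync.WaitGroup
}

func (f *fakeTestCase) Context() context.Context             { return f.ctx }
func (f *fakeTestCase) BackgroundWaitGroup() *sync.WaitGroup { return &f.wg }

// startBackgroundServer launches a goroutine that exits when the test-case
// context is cancelled, and registers it with the background WaitGroup.
func startBackgroundServer(tc lifecycle, stopped chan<- struct{}) {
	wg := tc.BackgroundWaitGroup()
	wg.Add(1)
	go func() {
		defer wg.Done()
		// With context.Background() this receive would block forever.
		<-tc.Context().Done()
		close(stopped)
	}()
}

// runDemo reports whether the goroutine stopped after cancellation.
func runDemo() bool {
	ctx, cancel := context.WithCancel(context.Background())
	tc := &fakeTestCase{ctx: ctx}
	stopped := make(chan struct{})
	startBackgroundServer(tc, stopped)
	cancel()
	tc.BackgroundWaitGroup().Wait()
	select {
	case <-stopped:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("goroutine stopped:", runDemo())
}
```

Because the goroutine watches a context that the framework owns, the framework can cancel it and then wait on the WaitGroup, rather than reporting a leaked goroutine.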

Copilot AI review requested due to automatic review settings May 14, 2026 18:43
@bfjelds bfjelds requested a review from a team as a code owner May 14, 2026 18:43

bfjelds commented May 14, 2026

/azp run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds bfjelds force-pushed the user/bfjelds/mjolnir/fix-scale-test-failures branch from d55615e to c735db0 May 14, 2026 18:48

bfjelds commented May 14, 2026

/azp run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improves Storm scale test reliability by tying long-running goroutines to the framework test-case context, tightening per-iteration SSH tunnel cleanup, and making initramfs diagnostics less prone to false positives.

Changes:

  • Switch servicing test execution to pass storm.TestCase through so tests can use tc.Context() / tc.BackgroundWaitGroup().
  • Add per-iteration SSH proxy cancellation + wait to avoid accumulating proxy goroutines/processes across update iterations.
  • Refine serial log parsing to separate transient dracut warnings from true initramfs boot failures.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tools/storm/utils/ssh/ssh.go | Makes SSH reverse port forward startup more strict and adjusts behavior on context-cancelled shutdown. |
| tools/storm/servicing/trident.go | Adds a context-aware runner that passes storm.TestCase into servicing test functions. |
| tools/storm/servicing/tests/update.go | Uses framework context/WaitGroup for background servers, adds per-iteration SSH proxy lifecycle, and improves dracut/initramfs diagnostics. |

Comments suppressed due to low confidence (1)

tools/storm/servicing/tests/update.go:279

  • The temp file handle (stageLogLocalTmpFile) is never closed, and the temp path is not removed on several early-return paths after creation (e.g., SCP download failure, later stage/finalize error returns before the explicit os.Remove). Close the file promptly after getting the name, and ensure the temp path is removed on all exits from the iteration (a common approach is to scope per-iteration cleanup via an inner function so defers don’t accumulate across iterations).
		stageLogLocalTmpFile, err := os.CreateTemp("", "staged-trident-full")
		if err != nil {
			cleanupProxies()
			return fmt.Errorf("failed to create temp staging log file: %w", err)
		}
		stageLogLocalTmpPath := stageLogLocalTmpFile.Name()

		err = stormssh.ScpDownloadFile(vmConfig.VMConfig, vmIP, "/var/log/trident-full.log", stageLogLocalTmpPath)
		if err != nil {
			return fmt.Errorf("failed to download staged trident log: %w", err)
		}

Comment thread tools/storm/utils/ssh/ssh.go
Comment thread tools/storm/utils/ssh/ssh.go Outdated
Comment thread tools/storm/servicing/tests/update.go Outdated
Comment thread tools/storm/servicing/tests/update.go
@bfjelds bfjelds marked this pull request as draft May 14, 2026 19:38
@bfjelds bfjelds force-pushed the user/bfjelds/mjolnir/fix-scale-test-failures branch from c735db0 to 7277612 May 14, 2026 19:43

bfjelds commented May 14, 2026

/azp run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds bfjelds changed the title from "fix(storm): use framework context for update-loop goroutine lifecycle" to "tests: (scale and servicing) use framework context for update-loop goroutine lifecycle" May 14, 2026
@bfjelds bfjelds marked this pull request as ready for review May 14, 2026 22:01
Copilot AI review requested due to automatic review settings May 14, 2026 22:01

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

tools/storm/servicing/tests/update.go:279

  • The temp file handle is never closed (stageLogLocalTmpFile.Close()), and the temp path is only removed on the “success” path. Any early return after creation (e.g., SCP failure, copy failure, stage errors) will leak temp files; at scale this can exhaust disk/inodes and file descriptors. A concrete fix is to close the temp file immediately after getting its name, and ensure the temp path is removed on all paths (e.g., via a per-iteration-scoped defer in an inner function, or explicit cleanup before each return after creation).
		stageLogLocalTmpFile, err := os.CreateTemp("", "staged-trident-full")
		if err != nil {
			cleanupProxies()
			return fmt.Errorf("failed to create temp staging log file: %w", err)
		}
		stageLogLocalTmpPath := stageLogLocalTmpFile.Name()

		err = stormssh.ScpDownloadFile(vmConfig.VMConfig, vmIP, "/var/log/trident-full.log", stageLogLocalTmpPath)
		if err != nil {
			return fmt.Errorf("failed to download staged trident log: %w", err)
		}

Comment thread tools/storm/utils/ssh/ssh.go Outdated
Comment thread tools/storm/servicing/tests/update.go Outdated
Comment thread tools/storm/servicing/tests/update.go
@bfjelds bfjelds force-pushed the user/bfjelds/mjolnir/fix-scale-test-failures branch from 7277612 to ee47505 May 14, 2026 22:48

bfjelds commented May 14, 2026

/azp run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Three fixes for scale test reliability:

1. Use storm TestCase.Context() instead of context.Background()
   - The servicing update-loop created its own orphaned context, bypassing
     the storm framework's goroutine lifecycle management
   - Thread tc (storm.TestCase) through to UpdateLoop/Rollback via new
     runTestCaseWithContext method, matching the pattern used in e2e tests
   - Register netlisten goroutines with tc.BackgroundWaitGroup()

2. Per-iteration SSH proxy lifecycle
   - SSH reverse tunnel goroutines were started inside the loop using the
     function-level context, accumulating 2 goroutines per iteration
   - Now use per-iteration child context with explicit cleanup before
     finalize (tunnels are only needed during staging)
   - Add ExitOnForwardFailure=yes to detect remote port binding failures
   - Handle cmd.Start() failure to prevent deadlock on startedChannel

3. Improve initramfs diagnostics (checkSerialLogForDracutIssues)
   - Distinguish between genuine initramfs failures (emergency shell,
     device timeout) and transient warnings where the VM booted past
     initramfs into systemd but is unreachable for other reasons
   - Log transient warnings as WARN instead of ERRO to avoid
     misattribution of SSH/network failures as initramfs issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 15, 2026 16:01
@bfjelds bfjelds force-pushed the user/bfjelds/mjolnir/fix-scale-test-failures branch from ee47505 to 3792d15 May 15, 2026 16:01

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment on lines +353 to +357
// SSH reverse tunnels are no longer needed — finalize re-reads the COSI
// file, so tunnels must stay alive through finalize. Stop them now before
// the reboot to avoid orphaned SSH processes.
cleanupProxies()

Comment on lines 82 to 84
// Wait for both update servers to start
<-aStartedChannel
<-bStartedChannel
Comment on lines 301 to 305
if stageErr != nil {
	os.Remove(stageLogLocalTmpPath)
	cleanupProxies()
	if egrepOut, err := exec.Command("/bin/sh", "-c", fmt.Sprintf("grep 'target is busy' %s | grep umount", stageLogLocalTmpPath)).CombinedOutput(); err == nil {
		// Check for known unmount failure and signal
@bfjelds bfjelds marked this pull request as draft May 18, 2026 20:54
@bfjelds bfjelds changed the title from "tests: (scale and servicing) use framework context for update-loop goroutine lifecycle" to "WIP: tests: (scale and servicing) use framework context for update-loop goroutine lifecycle" May 18, 2026
bfjelds and others added 2 commits May 18, 2026 17:18
When a VM boots (DHCP lease + virsh running) but SSH port 22 is
unreachable, capture diagnostics to identify the root cause:
1. virsh domifaddr - verify VM network interfaces
2. ARP table entry - check L2 reachability
3. Ping test - check L3 reachability
4. TCP port probes - check key services
5. Full serial log save - preserve boot output
6. virsh console commands - query systemd/sshd/network state

(cherry-picked from scale-failure-diagnostics branch)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Four improvements based on analysis of pipeline 1121313 (5/525 failures):

1. Fix serial log truncation race: Move TruncateLog AFTER VM shutdown
   instead of before. Previously, shutdown messages written after truncation
   polluted the next boot's serial log, making it appear the VM produced
   no output (seen in qemu-3-5 at iteration 64).

2. Fix bootedToSystemd false positive: The dracut emergency classifier
   was misidentifying genuine initramfs failures as transient because
   systemd messages from prior boots leaked into the serial log. Now
   requires kernel timestamp markers to confirm messages are from the
   current boot. Also promote 'Entering emergency mode' to definite
   failure (seen in uki-3-17 at iteration 90, bug 15086).

3. Add host resource diagnostics: Log host memory, load average, and
   QEMU process count when a VM fails to boot. Resource exhaustion on
   the QEMU host can cause boot failures that look like VM issues.

4. Add domain state verification: After DomainCreate, verify the VM is
   actually in 'running' state before waiting for login.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
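
The idea behind item 2 of the commit message above, trusting only messages that belong to the current boot, might be sketched like this. It is not the PR's actual classifier. It assumes the current boot begins where the printk-style kernel timestamp last resets, and it uses "Reached target" as a placeholder systemd marker; only "Entering emergency mode" is taken from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

type verdict string

const (
	initramfsFailure verdict = "initramfs-failure"
	bootedToSystemd  verdict = "booted-to-systemd"
	unknown          verdict = "unknown"
)

// kernelTS matches printk-style timestamps such as "[    0.000000]".
var kernelTS = regexp.MustCompile(`^\[\s*(\d+\.\d+)\]`)

// currentBoot returns the lines starting at the last point where the kernel
// timestamp goes backwards, taken here as the start of the latest boot.
// It returns nil when no timestamped line exists.
func currentBoot(lines []string) []string {
	start, last := -1, -1.0
	for i, line := range lines {
		m := kernelTS.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		ts, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		if start == -1 || ts < last {
			start = i
		}
		last = ts
	}
	if start == -1 {
		return nil
	}
	return lines[start:]
}

// classify looks only at the current boot, so messages left over from an
// earlier boot cannot make a failed boot look like a transient warning.
func classify(serialLog string) verdict {
	sawSystemd := false
	for _, line := range currentBoot(strings.Split(serialLog, "\n")) {
		if strings.Contains(line, "Entering emergency mode") {
			return initramfsFailure
		}
		if strings.Contains(line, "Reached target") { // placeholder marker
			sawSystemd = true
		}
	}
	if sawSystemd {
		return bootedToSystemd
	}
	return unknown
}

func main() {
	log := "[   12.500000] systemd[1]: Reached target Multi-User System.\n" +
		"[    0.000000] Linux version\n" +
		"[    3.100000] dracut: Entering emergency mode"
	fmt.Println(classify(log))
}
```

In the sample log, the systemd line from the earlier boot is ignored because a later line resets the timestamp to zero, so the emergency-mode message decides the verdict.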