WIP: tests: (scale and servicing) use framework context for update-loop goroutine lifecycle by bfjelds · Pull Request #647 · microsoft/trident

bfjelds · 2026-05-14T18:43:14Z

Three fixes for scale test reliability:

1. Storm framework integration: use tc.Context() instead of context.Background() for proper goroutine lifecycle management
2. Per-iteration SSH proxy lifecycle: cancel tunnels before finalize to prevent goroutine accumulation
3. Improved initramfs diagnostics: distinguish genuine initramfs failures from transient warnings where the VM booted to systemd

Addresses goroutine leak warnings and false-positive initramfs attribution in scale test runs.

bfjelds · 2026-05-14T18:43:22Z

/azp run [GITHUB]-trident-pr-e2e

azure-pipelines · 2026-05-14T18:43:43Z

Azure Pipelines successfully started running 1 pipeline(s).

bfjelds · 2026-05-14T18:50:38Z

/azp run [GITHUB]-trident-pr-e2e

azure-pipelines · 2026-05-14T18:50:50Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improves Storm scale test reliability by tying long-running goroutines to the framework test-case context, tightening per-iteration SSH tunnel cleanup, and making initramfs diagnostics less prone to false positives.

Changes:

Switch servicing test execution to pass storm.TestCase through so tests can use tc.Context() / tc.BackgroundWaitGroup().
Add per-iteration SSH proxy cancellation + wait to avoid accumulating proxy goroutines/processes across update iterations.
Refine serial log parsing to separate transient dracut warnings from true initramfs boot failures.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
tools/storm/utils/ssh/ssh.go	Makes SSH reverse port forward startup more strict and adjusts behavior on context-cancelled shutdown.
tools/storm/servicing/trident.go	Adds a context-aware runner that passes `storm.TestCase` into servicing test functions.
tools/storm/servicing/tests/update.go	Uses framework context/WaitGroup for background servers, adds per-iteration SSH proxy lifecycle, and improves dracut/initramfs diagnostics.

Comments suppressed due to low confidence (1)

tools/storm/servicing/tests/update.go:279

The temp file handle (stageLogLocalTmpFile) is never closed, and the temp path is not removed on several early-return paths after creation (e.g., SCP download failure, later stage/finalize error returns before the explicit os.Remove). Close the file promptly after getting the name, and ensure the temp path is removed on all exits from the iteration (a common approach is to scope per-iteration cleanup via an inner function so defers don’t accumulate across iterations).

		stageLogLocalTmpFile, err := os.CreateTemp("", "staged-trident-full")
		if err != nil {
			cleanupProxies()
			return fmt.Errorf("failed to create temp staging log file: %w", err)
		}
		stageLogLocalTmpPath := stageLogLocalTmpFile.Name()

		err = stormssh.ScpDownloadFile(vmConfig.VMConfig, vmIP, "/var/log/trident-full.log", stageLogLocalTmpPath)
		if err != nil {
			return fmt.Errorf("failed to download staged trident log: %w", err)
		}

bfjelds · 2026-05-14T19:43:57Z

/azp run [GITHUB]-trident-pr-e2e

azure-pipelines · 2026-05-14T19:44:16Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

tools/storm/servicing/tests/update.go:279

The temp file handle is never closed (stageLogLocalTmpFile.Close()), and the temp path is only removed on the “success” path. Any early return after creation (e.g., SCP failure, copy failure, stage errors) will leak temp files; at scale this can exhaust disk/inodes and file descriptors. A concrete fix is to close the temp file immediately after getting its name, and ensure the temp path is removed on all paths (e.g., via a per-iteration-scoped defer in an inner function, or explicit cleanup before each return after creation).

		stageLogLocalTmpFile, err := os.CreateTemp("", "staged-trident-full")
		if err != nil {
			cleanupProxies()
			return fmt.Errorf("failed to create temp staging log file: %w", err)
		}
		stageLogLocalTmpPath := stageLogLocalTmpFile.Name()

		err = stormssh.ScpDownloadFile(vmConfig.VMConfig, vmIP, "/var/log/trident-full.log", stageLogLocalTmpPath)
		if err != nil {
			return fmt.Errorf("failed to download staged trident log: %w", err)
		}

bfjelds · 2026-05-14T22:48:59Z

/azp run [GITHUB]-trident-pr-e2e

azure-pipelines · 2026-05-14T22:49:17Z

Azure Pipelines successfully started running 1 pipeline(s).

Three fixes for scale test reliability:

1. Use storm TestCase.Context() instead of context.Background()
   - The servicing update-loop created its own orphaned context, bypassing the storm framework's goroutine lifecycle management
   - Thread tc (storm.TestCase) through to UpdateLoop/Rollback via new runTestCaseWithContext method, matching the pattern used in e2e tests
   - Register netlisten goroutines with tc.BackgroundWaitGroup()

2. Per-iteration SSH proxy lifecycle
   - SSH reverse tunnel goroutines were started inside the loop using the function-level context, accumulating 2 goroutines per iteration
   - Now use per-iteration child context with explicit cleanup before finalize (tunnels are only needed during staging)
   - Add ExitOnForwardFailure=yes to detect remote port binding failures
   - Handle cmd.Start() failure to prevent deadlock on startedChannel

3. Improve initramfs diagnostics (checkSerialLogForDracutIssues)
   - Distinguish between genuine initramfs failures (emergency shell, device timeout) and transient warnings where the VM booted past initramfs into systemd but is unreachable for other reasons
   - Log transient warnings as WARN instead of ERRO to avoid misattribution of SSH/network failures as initramfs issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

+		// SSH reverse tunnels are no longer needed — finalize re-reads the COSI
+		// file, so tunnels must stay alive through finalize. Stop them now before
+		// the reboot to avoid orphaned SSH processes.
+		cleanupProxies()
+


 	// Wait for both update servers to start
 	<-aStartedChannel
 	<-bStartedChannel


 		if stageErr != nil {
+			os.Remove(stageLogLocalTmpPath)
+			cleanupProxies()
 			if egrepOut, err := exec.Command("/bin/sh", "-c", fmt.Sprintf("grep 'target is busy' %s | grep umount", stageLogLocalTmpPath)).CombinedOutput(); err == nil {
 				// Check for known unmount failure and signal


When a VM boots (DHCP lease + virsh running) but SSH port 22 is unreachable, capture diagnostics to identify the root cause:

1. virsh domifaddr - verify VM network interfaces
2. ARP table entry - check L2 reachability
3. Ping test - check L3 reachability
4. TCP port probes - check key services
5. Full serial log save - preserve boot output
6. virsh console commands - query systemd/sshd/network state

(cherry-picked from scale-failure-diagnostics branch)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Four improvements based on analysis of pipeline 1121313 (5/525 failures):

1. Fix serial log truncation race: Move TruncateLog AFTER VM shutdown instead of before. Previously, shutdown messages written after truncation polluted the next boot's serial log, making it appear the VM produced no output (seen in qemu-3-5 at iteration 64).

2. Fix bootedToSystemd false positive: The dracut emergency classifier was misidentifying genuine initramfs failures as transient because systemd messages from prior boots leaked into the serial log. Now requires kernel timestamp markers to confirm messages are from the current boot. Also promote 'Entering emergency mode' to definite failure (seen in uki-3-17 at iteration 90, bug 15086).

3. Add host resource diagnostics: Log host memory, load average, and QEMU process count when a VM fails to boot. Resource exhaustion on the QEMU host can cause boot failures that look like VM issues.

4. Add domain state verification: After DomainCreate, verify the VM is actually in 'running' state before waiting for login.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
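
The bootedToSystemd fix can be pictured roughly as below. This is a sketch under stated assumptions, not the classifier from the commit: the regular expression, the "Multi-User System" marker, and every name here are illustrative guesses; only the 'Entering emergency mode' string comes from the commit message.

```go
package main

import (
	"regexp"
	"strings"
)

// kernelTS matches a kernel log prefix such as "[    1.234567]".
var kernelTS = regexp.MustCompile(`^\[\s*\d+\.\d+\]`)

type bootVerdict int

const (
	verdictUnknown bootVerdict = iota
	verdictInitramfsFailure
	verdictBootedToSystemd
)

// classifySerialLog (hypothetical) treats systemd evidence as belonging to the
// current boot only if a kernel timestamp line was seen before it.
func classifySerialLog(log string) bootVerdict {
	sawKernel, booted := false, false
	for _, line := range strings.Split(log, "\n") {
		line = strings.TrimSpace(line)
		if kernelTS.MatchString(line) {
			sawKernel = true
		}
		if strings.Contains(line, "Entering emergency mode") {
			return verdictInitramfsFailure // definite failure, never transient
		}
		// Assumed marker; the real check may look for something else.
		if sawKernel && strings.Contains(line, "Multi-User System") {
			booted = true
		}
	}
	if booted {
		return verdictBootedToSystemd
	}
	return verdictUnknown
}
```

A systemd line with no preceding kernel timestamp is ignored here, which is the property that keeps leftover output from a prior boot from masking a real initramfs failure.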

Copilot AI review requested due to automatic review settings May 14, 2026 18:43

bfjelds requested a review from a team as a code owner May 14, 2026 18:43

bfjelds force-pushed the user/bfjelds/mjolnir/fix-scale-test-failures branch from d55615e to c735db0 Compare May 14, 2026 18:48

Copilot AI reviewed May 14, 2026

Comment thread tools/storm/utils/ssh/ssh.go

Comment thread tools/storm/utils/ssh/ssh.go Outdated

Comment thread tools/storm/servicing/tests/update.go Outdated

Comment thread tools/storm/servicing/tests/update.go

Copilot started reviewing on behalf of bfjelds May 14, 2026 19:13

bfjelds marked this pull request as draft May 14, 2026 19:38

bfjelds force-pushed the user/bfjelds/mjolnir/fix-scale-test-failures branch from c735db0 to 7277612 Compare May 14, 2026 19:43

bfjelds changed the title ~~fix(storm): use framework context for update-loop goroutine lifecycle~~ tests: (scale and servicing) use framework context for update-loop goroutine lifecycle May 14, 2026

bfjelds marked this pull request as ready for review May 14, 2026 22:01

Copilot AI review requested due to automatic review settings May 14, 2026 22:01

Copilot AI reviewed May 14, 2026

Comment thread tools/storm/utils/ssh/ssh.go Outdated

Comment thread tools/storm/servicing/tests/update.go Outdated

Comment thread tools/storm/servicing/tests/update.go

Copilot started reviewing on behalf of bfjelds May 14, 2026 22:16

bfjelds force-pushed the user/bfjelds/mjolnir/fix-scale-test-failures branch from 7277612 to ee47505 Compare May 14, 2026 22:48

Copilot AI review requested due to automatic review settings May 15, 2026 16:01

bfjelds force-pushed the user/bfjelds/mjolnir/fix-scale-test-failures branch from ee47505 to 3792d15 Compare May 15, 2026 16:01

Copilot started reviewing on behalf of bfjelds May 15, 2026 16:04

Copilot AI reviewed May 15, 2026

bfjelds marked this pull request as draft May 18, 2026 20:54

bfjelds changed the title ~~tests: (scale and servicing) use framework context for update-loop goroutine lifecycle~~ WIP: tests: (scale and servicing) use framework context for update-loop goroutine lifecycle May 18, 2026

bfjelds and others added 2 commits May 18, 2026 17:18

2 participants