What problem are you trying to solve?
Context for this feedback
I'm validating the win-dev-skills agent and skills with Claude Code (Opus 4.7, 1M context) rather than GitHub Copilot, to see how well they generalize across agentic harnesses. The skills work: the final app builds, launches, and is functionally correct. The session also exposed several rough edges that are worth fixing while they're fresh. This issue is a single tracking item for those findings; happy to split into separate issues during triage.
Repro session
Harness: Claude Code, model claude-opus-4-7 (Opus 4.7, 1M context window)
Plugin/skills version: winui plugin v0.2.3 (installed via /plugin, then /reload-plugins)
Invocation: @agent-winui:winui-dev Can you create a simple todo app
Skills: winui-setup, winui-design, winui-dev-workflow, winui-ui-testing
Outcome: working WinUI 3 app at ~150 LoC across Models/, ViewModels/, Services/, with MVVM via CommunityToolkit.Mvvm, x:Bind, ThemeResource, JSON persistence under LocalAppData. App launches cleanly, all features work, persistence verified across restart.
Cost: ~62 min wall time, ~112 tool uses across two agent invocations (the first invocation timed out mid-build and had to be resumed manually).
The cost-to-output ratio is what motivated this review — an LLM-driven scaffold of a todo app should not need 112 tool uses, and identifying why it did points directly at fixable issues in the skills and tooling.
Proposed solution
1. Lint/analyzer rule for the DataContext.X binding footgun
The agent generated a binding of the form {Binding DataContext.X, ElementName=Y} in MainPage.xaml, but the page only exposes a ViewModel property for x:Bind and never assigns DataContext. The binding failed silently: clicking Delete did nothing. The XAML compiler doesn't catch this, and it only surfaced because the agent ran UI automation afterwards; without UI testing it would have shipped broken. This is the most concrete and important item in the issue: it's a pattern the design skill or its templates could plausibly emit again, and there is no compiler-level guardrail today.
Proposal: add a Roslyn analyzer rule (or, as a stopgap, a winui-code-review lint check) that flags {Binding DataContext.X, ElementName=Y} when the named element's parent page/control never assigns DataContext in code-behind or XAML. Suggested fix-it: switch to Click handler with Tag="{x:Bind <item>}", which is the pattern the agent ultimately landed on after the smoke test caught it.
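To make the stopgap concrete, here is a rough sketch of the kind of check I have in mind, written as a standalone Python script rather than a real Roslyn analyzer. It is a file-level heuristic only: it assumes one XAML file plus its code-behind, uses regexes instead of a XAML parser, and does not resolve which element ElementName actually points at. The function and constant names are made up for illustration.

```python
import re

# Element-to-element binding that reaches through another element's DataContext,
# e.g. {Binding DataContext.X, ElementName=Y}.
ELEMENT_DATACONTEXT_BINDING = re.compile(
    r"\{Binding\s+(?:Path=)?DataContext\.[\w.]+\s*,\s*ElementName=(\w+)[^}]*\}"
)
# DataContext set in XAML as an attribute (ignoring design-time d:DataContext)
# or as a property element such as <Page.DataContext>.
XAML_DATACONTEXT_SET = re.compile(
    r"(?<![:\w])DataContext\s*=\s*[\"']|<[\w:]+\.DataContext\b"
)
# DataContext assigned in C# code-behind (a single '=', not '==').
CODE_DATACONTEXT_SET = re.compile(r"\bDataContext\s*=(?!=)")


def find_unbacked_datacontext_bindings(xaml: str, code_behind: str) -> list[str]:
    """Return the DataContext.X/ElementName bindings that look unbacked.

    A binding is reported only when neither the XAML nor the code-behind
    appears to assign DataContext anywhere (a coarse, file-level signal).
    """
    if XAML_DATACONTEXT_SET.search(xaml) or CODE_DATACONTEXT_SET.search(code_behind):
        return []
    return [match.group(0) for match in ELEMENT_DATACONTEXT_BINDING.finditer(xaml)]
```

A real analyzer would need to walk the element tree and check the specific element named by ElementName; this version errs on the side of staying quiet whenever any DataContext assignment exists in the file.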
2. winui-setup should be fallback-only, not preconditioned
The skill description says it's for "after a Windows reset, or when another winui skill reports a missing prerequisite" — i.e., a fallback. But when the parent prompt mentioned "use winui-setup first if prerequisites aren't ready", the agent ran setup checks pre-emptively because that phrasing matched what the skill does, not when it should run. On a machine with a working toolchain this is wasted turns.
Proposal: either (a) strengthen the description with explicit "do NOT run speculatively — only invoke when another skill reports a specific missing-prerequisite error string," or (b) add a fast-path entry point in winui-dev-workflow that just attempts the build and dispatches to setup only on the specific error patterns it can recognize. Option (a) is cheaper and likely sufficient.
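For option (b), the dispatch logic could be as small as the sketch below. The trigger table is an assumption on my part: the two patterns are examples of .NET SDK messages that I believe indicate a toolchain problem, and they would need to be checked against what the build actually prints. The step names returned by next_step are hypothetical labels, not existing skill entry points.

```python
import re
from typing import Optional

# Illustrative trigger table: build-output patterns that suggest a missing or
# outdated prerequisite. These entries are assumptions and need verifying.
SETUP_TRIGGERS = {
    "dotnet-sdk-missing": re.compile(r"No \.NET SDKs were found", re.IGNORECASE),
    "dotnet-sdk-too-old": re.compile(r"\bNETSDK1045\b"),
}


def setup_reason(build_output: str) -> Optional[str]:
    """Return the first matching prerequisite problem, or None if none match."""
    for reason, pattern in SETUP_TRIGGERS.items():
        if pattern.search(build_output):
            return reason
    return None


def next_step(build_exit_code: int, build_output: str) -> str:
    """Attempt-first dispatch: only route to setup on a recognized pattern."""
    if build_exit_code == 0:
        return "continue"
    reason = setup_reason(build_output)
    if reason is not None:
        return f"run-winui-setup:{reason}"
    # Ordinary compile errors stay with the dev workflow, not setup.
    return "fix-build-error"
```

The point of the design is that setup is never reached on a healthy machine: a successful build short-circuits, and an ordinary compile error is not mistaken for a missing prerequisite.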
3. winapp run exit-code semantics
winapp run --debug-output exited with 255 when the app was killed by the agent at the end of the smoke test. That return code is indistinguishable from a real failure without parsing stderr — the agent had to write extra logic to second-guess whether its own session had actually succeeded.
Proposal: distinguish between outcomes with separate exit codes (or, better, a structured JSON status line on stdout that agents can parse without ambiguity):
0 — app exited cleanly on its own
1 — build failed
2 — launch failed (built ok, but did not start)
130 (or similar dedicated code) — app killed by signal/Ctrl-C
explicit handling for "process still running, output detached"
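On the agent side, consuming that contract could look roughly like this. It is a sketch under assumptions: the exit-code table mirrors the proposal above, and the JSON status line (a single object with an "outcome" field on stdout) is a hypothetical format, not something winapp emits today.

```python
import json
from enum import Enum


class RunOutcome(Enum):
    CLEAN_EXIT = "clean_exit"
    BUILD_FAILED = "build_failed"
    LAUNCH_FAILED = "launch_failed"
    KILLED = "killed"
    UNKNOWN = "unknown"


# Mirrors the proposed exit codes; anything else (such as today's 255) is UNKNOWN.
EXIT_CODE_OUTCOMES = {
    0: RunOutcome.CLEAN_EXIT,
    1: RunOutcome.BUILD_FAILED,
    2: RunOutcome.LAUNCH_FAILED,
    130: RunOutcome.KILLED,
}


def classify_run(exit_code: int, stdout: str = "") -> RunOutcome:
    """Prefer a structured status line if present, else fall back to the exit code."""
    for line in reversed(stdout.splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "outcome" in payload:
            try:
                return RunOutcome(payload["outcome"])
            except ValueError:
                return RunOutcome.UNKNOWN
    return EXIT_CODE_OUTCOMES.get(exit_code, RunOutcome.UNKNOWN)
```

With this in place the agent no longer has to parse stderr to decide whether its own kill at the end of a smoke test counts as a failure.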
4. winui-ui-testing scales tests to template, not feature surface
The skill generated 19 smoke-test cases for a 4-feature todo app (add, check, delete, persist). Most cases were template-driven assertions (e.g., element-existence checks for every named control) rather than feature-driven validations of the user-visible behavior.
This was partly induced by my prompt asking for a "smoke test" — but a skill named winui-ui-testing arguably should produce a smoke test by default and require an explicit flag for the exhaustive batch.
Proposal: scale the generated test count to the declared feature surface — e.g., 1–3 tests per feature by default, with explicit opt-in for exhaustive accessibility/element audits. A --scope=smoke|full flag would cover this cleanly.
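As a rough illustration of the scaling rule, assuming the skill is given a declared feature list and, for the full scope, a list of named controls (both inputs and all names here are hypothetical):

```python
from typing import Iterable, List


def plan_tests(
    features: Iterable[str],
    scope: str = "smoke",
    per_feature: int = 2,
    named_controls: Iterable[str] = (),
) -> List[str]:
    """Build a test plan sized to the feature surface, not to the template."""
    if scope not in ("smoke", "full"):
        raise ValueError(f"unknown scope: {scope!r}")
    if not 1 <= per_feature <= 3:
        raise ValueError("per_feature must be between 1 and 3")
    # Feature-driven checks are always included.
    plan = [
        f"{feature}: behavior check {index}"
        for feature in features
        for index in range(1, per_feature + 1)
    ]
    # Element-existence audits are opt-in via the full scope.
    if scope == "full":
        plan += [f"{control}: element exists" for control in named_controls]
    return plan
```

For the four-feature todo app this yields 8 cases in the default smoke scope instead of 19, and the per-control existence checks only appear when full is requested.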
5. Cold-build first-run experience needs progress milestones
The very first build (NuGet restore + WindowsAppSDK + .NET 10 SDK pull) is a multi-minute black box. The first agent invocation timed out mid-build because it couldn't distinguish "this is normal, still working" from "this is hung." It had to be resumed by spawning a second agent that picked up where the first left off; half the wall-time cost of the session is attributable to this one issue.
Proposal: have BuildAndRun.ps1 print explicit milestone lines that agents can grep for and treat as keep-alive signals.
Bonus: document expected cold-start timing in the skill so agents can pick a single appropriate sleep instead of polling. This mostly matters for the first run on a clean machine; on warm machines the existing flow is fine.
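A sketch of how an agent harness might consume such milestones follows. The "[milestone] <name>" line format and the 300-second stall budget are placeholders I picked for illustration; BuildAndRun.ps1 does not print anything like this today, and the real budget should come from the documented cold-start timing.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical milestone format: "[milestone] <name>" at the start of a line.
MILESTONE = re.compile(r"^\[milestone\]\s+(\S+)")


def build_status(
    events: Iterable[Tuple[float, str]],
    now: float,
    stall_after: float = 300.0,
) -> Tuple[str, Optional[str]]:
    """Classify a running build from (timestamp_seconds, output_line) pairs.

    Returns (state, last_milestone_name), where state is one of
    "no-milestone-yet", "working", or "stalled".
    """
    last_seen = None
    last_name = None
    for timestamp, line in events:
        match = MILESTONE.match(line.strip())
        if match:
            last_seen = timestamp
            last_name = match.group(1)
    if last_seen is None:
        return ("no-milestone-yet", None)
    if now - last_seen > stall_after:
        return ("stalled", last_name)
    return ("working", last_name)
```

The harness only resets its timeout on milestone lines, so ordinary MSBuild noise neither hides a hang nor triggers a premature kill.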
Alternatives considered
Adding everything to winui-code-review instead of a Roslyn analyzer. For item 1 specifically, a code-review skill catch happens late (after the file is written and possibly already running). A Roslyn analyzer surfaces it at build time, which agents notice immediately and fix on the next iteration. The analyzer is the better long-term home; the code-review skill could be a short-term stopgap.
Additional context
This was a Claude Code session, not Copilot, so the existing winui-session-report analyzer (which expects Copilot session events) didn't apply directly — observations here are from reviewing the agent's tool transcripts and final artifacts manually.
Some of the inefficiency in the session was caused by the parent prompt (asking for "use winui-setup first" and asking for a "smoke test"). That's a prompting issue on my end, but it points to a robustness question: skill descriptions are strong enough most of the time but get overridden by parent prompts. Item 2's stronger framing would help.