[Feature]: Skill and tooling improvements from a Claude Code (Opus 4.7) session building a TodoApp #68

@nmetulev

What problem are you trying to solve?

Context for this feedback

I'm validating the win-dev-skills agent and skills with Claude Code (Opus 4.7, 1M context) rather than GitHub Copilot, to see how well they generalize across agentic harnesses. The skills work — the final app builds, launches, and is functionally correct — but the session also exposed several rough edges that are worth fixing while they're fresh. This issue is a single tracking item for those findings; happy to split into separate issues during triage.

Repro session

  • Harness: Claude Code, model claude-opus-4-7 (Opus 4.7, 1M context window)
  • Plugin/skills version: winui plugin v0.2.3 (installed via /plugin, then /reload-plugins)
  • Invocation: @agent-winui:winui-dev Can you create a simple todo app
  • Skills exercised: winui-setup, winui-design, winui-dev-workflow, winui-ui-testing
  • Outcome: working WinUI 3 app at ~150 LoC across Models/, ViewModels/, Services/, with MVVM via CommunityToolkit.Mvvm, x:Bind, ThemeResource, JSON persistence under LocalAppData. App launches cleanly, all features work, persistence verified across restart.
  • Cost: ~62 min wall time, ~112 tool uses across two agent invocations (the first invocation timed out mid-build and had to be resumed manually).

The cost-to-output ratio is what motivated this review — an LLM-driven scaffold of a todo app should not need 112 tool uses, and identifying why it did points directly at fixable issues in the skills and tooling.

Proposed solution

1. Lint/analyzer rule for the DataContext.X binding footgun

The agent generated this in MainPage.xaml:

<Button Command="{Binding DataContext.DeleteTodoCommand, ElementName=TodoListView}" .../>

…but the page only exposes a ViewModel property for x:Bind and never assigns DataContext. The binding silently failed — clicking Delete did nothing. The XAML compiler doesn't catch this, and it only surfaced because the agent ran UI automation afterwards. Without UI testing this would have shipped silently broken. This is the most concrete and important item in the issue: it's a pattern the design skill or its templates could plausibly emit again, and there is no compiler-level guardrail today.

Proposal: add a Roslyn analyzer rule (or, as a stopgap, a winui-code-review lint check) that flags {Binding DataContext.X, ElementName=Y} when the page or control hosting the named element never assigns DataContext in code-behind or XAML. Suggested fix-it: switch to a Click handler with Tag="{x:Bind <item>}", which is the pattern the agent ultimately landed on after the smoke test caught the bug.
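To make the stopgap lint concrete, here is a rough sketch in Python of what such a check could look like. It is a text-level heuristic, not a real Roslyn analyzer: the function name `find_suspect_bindings` and all three regexes are my own assumptions, and it only looks at a single XAML file plus its code-behind.

```python
import re

# Matches {Binding DataContext.Something, ElementName=SomeName}
BINDING_RE = re.compile(
    r"\{Binding\s+(?:Path=)?DataContext\.(?P<path>[\w.]+)\s*,"
    r"\s*ElementName=(?P<element>\w+)\s*\}"
)
# Heuristic: DataContext="..." attribute or <Page.DataContext> property element.
XAML_ASSIGN_RE = re.compile(r"""\bDataContext\s*=\s*["']|<[\w.:]+\.DataContext[\s>]""")
# Heuristic: `DataContext = ...;` in code-behind (but not `DataContext == ...`).
CODE_ASSIGN_RE = re.compile(r"\bDataContext\s*=(?!=)")


def find_suspect_bindings(xaml: str, code_behind: str) -> list[tuple[int, str, str]]:
    """Return (line_number, bound_path, element_name) for each risky binding.

    A binding is reported only when neither the XAML nor the code-behind
    appears to assign DataContext anywhere.
    """
    if XAML_ASSIGN_RE.search(xaml) or CODE_ASSIGN_RE.search(code_behind):
        return []
    hits = []
    for line_no, line in enumerate(xaml.splitlines(), start=1):
        for match in BINDING_RE.finditer(line):
            hits.append((line_no, match.group("path"), match.group("element")))
    return hits
```

A real analyzer would need to resolve which element ElementName points at and follow DataContext inheritance; this sketch just treats any DataContext assignment in the file pair as "probably fine", which keeps false positives down at the cost of missing some cases.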

2. winui-setup should be fallback-only, not preconditioned

The skill description says it's for "after a Windows reset, or when another winui skill reports a missing prerequisite" — i.e., a fallback. But when the parent prompt mentioned "use winui-setup first if prerequisites aren't ready", the agent ran setup checks pre-emptively because that phrasing matched what the skill does, not when it should run. On a machine with a working toolchain this is wasted turns.

Proposal: either (a) strengthen the description with explicit "do NOT run speculatively — only invoke when another skill reports a specific missing-prerequisite error string," or (b) add a fast-path entry point in winui-dev-workflow that just attempts the build and dispatches to setup only on the specific error patterns it can recognize. Option (a) is cheaper and likely sufficient.
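For option (b), the dispatch logic could be as small as the sketch below. The function name `next_step` and the return labels are hypothetical, and the error patterns are illustrative placeholders only; they are not the strings the skills match on today, and the real list would need to come from the errors winui-setup can actually fix.

```python
import re

# ASSUMPTION: placeholder patterns for "toolchain is missing" style errors.
# The real list should be taken from what winui-setup knows how to repair.
MISSING_PREREQ_PATTERNS = [
    re.compile(r"No \.NET SDKs were found", re.IGNORECASE),
    re.compile(r"'dotnet' is not recognized", re.IGNORECASE),
    re.compile(r"\bNETSDK1045\b"),
]


def next_step(exit_code: int, build_output: str) -> str:
    """Decide what to do after a single speculative build attempt."""
    if exit_code == 0:
        return "done"  # toolchain works; never touch setup
    if any(p.search(build_output) for p in MISSING_PREREQ_PATTERNS):
        return "run-setup"  # recognized missing-prerequisite error
    return "report-build-error"  # ordinary compile error; setup would not help
```

The point of the design is that setup is reachable only through a recognized error string, so a parent prompt that mentions winui-setup cannot trigger it on a healthy machine.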

3. winapp run exit-code semantics

winapp run --debug-output exited with 255 when the app was killed by the agent at the end of the smoke test. That return code is indistinguishable from a real failure without parsing stderr — the agent had to write extra logic to second-guess whether its own session had actually succeeded.

Proposal: distinguish between outcomes with separate exit codes (or, better, a structured JSON status line on stdout that agents can parse without ambiguity):

  • 0 — app exited cleanly on its own
  • 1 — build failed
  • 2 — launch failed (built ok, but did not start)
  • 130 (or similar dedicated code) — app killed by signal/Ctrl-C
  • explicit handling for "process still running, output detached"
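On the agent side, consuming this would look roughly like the following. The exit codes are the ones proposed in the list above; the JSON status line and its `outcome` field are assumptions of mine, since winapp does not emit anything like this today.

```python
import json
from enum import Enum


class RunOutcome(Enum):
    CLEAN_EXIT = "clean_exit"
    BUILD_FAILED = "build_failed"
    LAUNCH_FAILED = "launch_failed"
    KILLED = "killed"
    UNKNOWN = "unknown"


# Exit codes as proposed in this issue (not current winapp behavior).
PROPOSED_EXIT_CODES = {
    0: RunOutcome.CLEAN_EXIT,
    1: RunOutcome.BUILD_FAILED,
    2: RunOutcome.LAUNCH_FAILED,
    130: RunOutcome.KILLED,
}


def classify_run(exit_code: int, stdout: str) -> RunOutcome:
    """Prefer a structured status line if one exists, else use the exit code."""
    for line in reversed(stdout.splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            status = json.loads(line)
        except json.JSONDecodeError:
            continue
        outcome = status.get("outcome") if isinstance(status, dict) else None
        if isinstance(outcome, str):
            try:
                return RunOutcome(outcome)  # hypothetical "outcome" field
            except ValueError:
                return RunOutcome.UNKNOWN
    return PROPOSED_EXIT_CODES.get(exit_code, RunOutcome.UNKNOWN)
```

With this, today's 255 maps to UNKNOWN, which is exactly the ambiguity described above; a dedicated code or a status line would let the agent treat "I killed it myself" as success without reading stderr.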

4. winui-ui-testing scales tests to template, not feature surface

The skill generated 19 smoke-test cases for a 4-feature todo app (add, check, delete, persist). Most cases were template-driven assertions (e.g., element-existence checks for every named control) rather than feature-driven validations of the user-visible behavior.

This was partly induced by my prompt asking for a "smoke test" — but a skill named winui-ui-testing arguably should produce a smoke test by default and require an explicit flag for the exhaustive batch.

Proposal: scale the generated test count to the declared feature surface — e.g., 1–3 tests per feature by default, with explicit opt-in for exhaustive accessibility/element audits. A --scope=smoke|full flag would cover this cleanly.
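As a sketch of what the scope flag could mean in practice: assume each candidate test is tagged with the feature it validates, or with None when it is a template-driven check such as element existence. The tagging scheme and the `select_tests` name are hypothetical; the default cap of 3 comes from the 1–3 range suggested above.

```python
from typing import Optional


def select_tests(
    candidates: list[tuple[str, Optional[str]]],
    scope: str = "smoke",
    per_feature: int = 3,
) -> list[tuple[str, Optional[str]]]:
    """Filter (test_name, feature) candidates down to the requested scope."""
    if scope == "full":
        return list(candidates)  # explicit opt-in: keep everything
    if scope != "smoke":
        raise ValueError(f"unknown scope: {scope!r}")
    kept = []
    counts: dict[str, int] = {}
    for name, feature in candidates:
        if feature is None:
            continue  # template-driven check, skipped in smoke scope
        if counts.get(feature, 0) < per_feature:
            kept.append((name, feature))
            counts[feature] = counts.get(feature, 0) + 1
    return kept
```

For the 4-feature todo app this caps a smoke run at 12 cases, compared with the 19 that were generated.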

5. Cold-build first-run experience needs progress milestones

The very first build (NuGet restore + WindowsAppSDK + .NET 10 SDK pull) is a multi-minute black box. The first agent invocation timed out mid-build because it couldn't distinguish "this is normal, still working" from "this is hung." It had to be resumed by spawning a second agent that picked up where the first left off — half the wall-time cost of the session is attributable to this one issue.

Proposal: have BuildAndRun.ps1 print explicit milestone lines that agents can grep for and treat as keep-alive signals, e.g.:

[winapp] RESTORE_START
[winapp] RESTORE_COMPLETE  duration=187s
[winapp] BUILD_START
[winapp] BUILD_COMPLETE    duration=42s  artifact=...\TodoApp.exe
[winapp] LAUNCH_COMPLETE   pid=1234

Bonus: document expected cold-start timing in the skill so agents can pick a single appropriate sleep instead of polling. This mostly matters for the first run on a clean machine; on warm machines the existing flow is fine.
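For illustration, this is roughly how an agent harness could consume those lines. It assumes the exact `[winapp] EVENT key=value` shape shown above, which is only a proposal, and the helper names are mine. Values containing spaces (for example, paths) would need quoting that this sketch does not handle.

```python
import re
from typing import Optional

# Matches the proposed "[winapp] EVENT_NAME key=value ..." milestone format.
MILESTONE_RE = re.compile(
    r"^\[winapp\]\s+(?P<event>[A-Z_]+)(?P<fields>(?:\s+\w+=\S+)*)\s*$"
)


def parse_milestone(line: str) -> Optional[tuple[str, dict[str, str]]]:
    """Return (event, fields) for a milestone line, or None for other output."""
    match = MILESTONE_RE.match(line.strip())
    if match is None:
        return None
    fields = dict(pair.split("=", 1) for pair in match.group("fields").split())
    return match.group("event"), fields


def last_milestone(output: str) -> Optional[str]:
    """Name of the most recent milestone seen in the build output, if any."""
    event = None
    for line in output.splitlines():
        parsed = parse_milestone(line)
        if parsed is not None:
            event = parsed[0]
    return event
```

An agent that sees RESTORE_START but no RESTORE_COMPLETE yet can keep waiting instead of declaring a hang, and can reset its idle timer every time a new milestone appears.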

Alternatives considered

Additional context

  • This was a Claude Code session, not Copilot, so the existing winui-session-report analyzer (which expects Copilot session events) didn't apply directly — observations here are from reviewing the agent's tool transcripts and final artifacts manually.
  • Some of the inefficiency in the session was caused by the parent prompt (asking for "use winui-setup first" and asking for a "smoke test"). That's a prompting issue on my end, but it points to a robustness question: skill descriptions are strong enough most of the time but get overridden by parent prompts. Item 2's stronger framing would help.
  • Happy to contribute PRs for any of these, particularly items 1, 3, and 5, if there's interest.


Labels: enhancement
