[Feat] Automated benchmark regression check in CI (PR vs main) #144

Spinning this out into its own issue (per the dev-meeting discussion) so the CI regression check has a dedicated home. It builds on the benchmark scripts in #117 and the JSON output in #140 (cc @yuvraajnarula), and follows up on the "Automated Performance Regression" part of #137.

Everything below runs entirely on GitHub-hosted runners - no external service or account required.

## Goal

On a PR, run the scaling benchmark on both the PR commit and current `main`, and post a comment showing whether the PR makes graph construction slower, faster, or the same - so regressions are visible before merge.

**Key design decision: compare on the same runner**

GitHub Actions runners are noisy: timings for identical code can vary 20–50% between runs on different VMs, so comparing a PR against a baseline captured earlier on a different runner would produce frequent false alarms.

Instead I propose running `main` and the PR back-to-back in the same job, on the same runner, and comparing relative % change rather than absolute seconds. Same hardware for both sides → the noise largely cancels. This also matches the workflow described in the meeting.
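To make the "noise largely cancels" argument concrete, here is a minimal sketch of the relative-change calculation. The function name and the timings are illustrative only; they are not taken from #117 or #140.

```python
def percent_change(main_seconds: float, pr_seconds: float) -> float:
    """Relative change of the PR timing versus main, in percent (positive = slower)."""
    if main_seconds <= 0:
        raise ValueError("baseline timing must be positive")
    return (pr_seconds - main_seconds) / main_seconds * 100.0


# Made-up timings: the same 10% slowdown measured on a fast and a slow runner.
fast_runner = percent_change(main_seconds=1.00, pr_seconds=1.10)
slow_runner = percent_change(main_seconds=1.30, pr_seconds=1.43)

# Absolute seconds differ between runners, but the relative change agrees.
print(round(fast_runner), round(slow_runner))  # 10 10
```

A runner that is uniformly 30% slower scales both sides of the ratio, so the percentage is unchanged; only noise that differs between the two back-to-back runs remains.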

**Correctness detail: hold the harness constant**

When we `git checkout main`, that reverts both the library and the benchmark script. To compare fairly we keep the benchmark script fixed (the PR's version) and only swap the library under test - i.e. run the same measurement against two installs of the code. Otherwise we'd be measuring with two different rulers.
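A rough sketch of the "same ruler" idea: one fixed copy of the benchmark script is run against each library install. Everything named here is hypothetical (the script path, the checkout directories, and the assumption that the script prints its JSON to stdout); the real invocation depends on what #117 and #140 end up exposing.

```python
import shlex


def ab_commands(bench_script: str, checkouts: dict[str, str]) -> list[str]:
    """Build shell commands that measure every checkout with one fixed script.

    bench_script: path to a saved copy of the PR's benchmark script (hypothetical).
    checkouts: label -> directory holding that version of the library (hypothetical).
    """
    script = shlex.quote(bench_script)
    commands = []
    for label, directory in checkouts.items():
        # Swap only the library under test ...
        commands.append(f"pip install --force-reinstall {shlex.quote(directory)}")
        # ... and measure it with the same script every time.
        # Assumption: the script writes its JSON result to stdout.
        commands.append(f"python {script} > {shlex.quote(label + '.json')}")
    return commands


for command in ab_commands("/tmp/bench_fixed.py", {"pr": "./pr-src", "main": "./main-src"}):
    print(command)
```

The point of the sketch is only that the script path is identical in both run commands while the installed library changes between them.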

## How it works

```text
on: pull_request (paths: src/.../create/**) + manual dispatch

1. checkout PR head → install lib → run scaling benchmark → pr.json
2. checkout main    → install lib → run same benchmark    → main.json
   (same runner, same job, same benchmark script)
3. compare.py main.json pr.json → markdown delta table
4. post / update a sticky PR comment with the result
```

Example comment:

| grid points | main | PR | Δ |
|-------------|------|----|----|
| 1,024 | 0.42s | 0.43s | +2% ✅ |
| 4,096 | 1.81s | 2.30s | +27% ⚠️ |

The only new code is a small `compare.py` (parse the two JSONs from #140, compute deltas, render the table, flag on threshold) plus the workflow YAML.
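A minimal sketch of what the core of `compare.py` could look like. The JSON layout used here (`results`, `grid_points`, `seconds`) is a placeholder assumption, since the actual schema is defined by #140; the function names are likewise illustrative.

```python
import json


def load_timings(json_text: str) -> dict[int, float]:
    """Map grid size -> seconds. The field names are assumed, not #140's real schema."""
    data = json.loads(json_text)
    return {int(r["grid_points"]): float(r["seconds"]) for r in data["results"]}


def render_table(main: dict[int, float], pr: dict[int, float],
                 threshold_pct: float = 25.0) -> str:
    """Render a markdown delta table for the grid sizes present in both runs."""
    lines = ["| grid points | main | PR | Δ |",
             "|-------------|------|----|----|"]
    for n in sorted(main.keys() & pr.keys()):
        delta = (pr[n] - main[n]) / main[n] * 100.0
        mark = "⚠️" if delta > threshold_pct else "✅"
        lines.append(f"| {n:,} | {main[n]:.2f}s | {pr[n]:.2f}s | {delta:+.0f}% {mark} |")
    return "\n".join(lines)


# Inline stand-ins for main.json and pr.json, using the numbers from the example comment.
main_json = '{"results": [{"grid_points": 1024, "seconds": 0.42}, {"grid_points": 4096, "seconds": 1.81}]}'
pr_json = '{"results": [{"grid_points": 1024, "seconds": 0.43}, {"grid_points": 4096, "seconds": 2.30}]}'

print(render_table(load_timings(main_json), load_timings(pr_json)))
```

With these inputs the sketch reproduces the two rows of the example comment. A real script would read the two file paths from the command line and would need to handle grid sizes that appear in only one of the files.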

## Guardrails against noise

- Compare % change, never absolute time.
- Tolerance threshold: only flag a regression when the slowdown exceeds a threshold (e.g. +25%) and is consistent across grid sizes.
- Cap grid sizes to fit the CI time budget (small/medium grids, not 10⁷ nodes).
- Informational comment first, non-blocking - so a noisy perf check doesn't train people to ignore CI. Can be promoted to a hard gate later once the threshold is proven.
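One possible reading of "above the threshold and consistent across grid sizes", sketched below: the overall verdict only fires when at least `min_sizes` grid sizes exceed the threshold. Both the rule and the parameter names are assumptions for illustration; the exact definition of "consistent" is still open.

```python
def is_regression(deltas_pct: dict[int, float],
                  threshold_pct: float = 25.0,
                  min_sizes: int = 2) -> bool:
    """Return True when enough grid sizes are slower than the threshold.

    deltas_pct maps grid size -> % change of PR versus main (positive = slower).
    """
    over = [n for n, delta in deltas_pct.items() if delta > threshold_pct]
    return len(over) >= min_sizes


# A single noisy grid size does not trigger the verdict ...
print(is_regression({1024: 2.0, 4096: 27.0}))   # False
# ... but a slowdown seen at several sizes does.
print(is_regression({1024: 30.0, 4096: 27.0}))  # True
```

Per-row ⚠️ markers can still be shown for individual sizes; this rule only governs the overall verdict.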

## Plan

- **Phase 0 (prereq):** land #117 + #140 - the action consumes their script + JSON. (Happy to review #140; `compare.py` depends on its JSON schema.)
- **Phase 1 (MVP):** same-runner A/B workflow + `compare.py` → sticky PR comment, informational only.
- **Phase 2:** add repetitions + median, tune the regression threshold, add a clear ✅/⚠️ verdict line; optionally include the memory delta (using `--track-memory` from #140).
- **Phase 3 (long-term):** add historical trend tracking to also catch slow drift across many PRs (each small enough to pass the per-PR check, but adding up over time). Can be done with `github-action-benchmark`, which stores history on a `gh-pages` branch in this repo — still no external service.
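Two small illustrations of the later phases, both with made-up numbers: why Phase 2 takes the median of repeated runs, and why Phase 3's trend tracking is needed on top of a per-PR threshold. The helper name `median_seconds` is hypothetical.

```python
import statistics
import time


def median_seconds(fn, repetitions: int = 5) -> float:
    """Time fn() several times and return the median wall-clock duration."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Phase 2: one outlier run (0.95s) barely moves the median, unlike the mean.
samples = [0.42, 0.41, 0.95, 0.43, 0.42]
print(statistics.median(samples))          # 0.42
print(round(statistics.mean(samples), 2))  # 0.53

# Phase 3: ten PRs that are each 3% slower all pass a +25% per-PR check,
# yet compound to roughly a 34% slowdown overall.
cumulative_pct = (1.03 ** 10 - 1) * 100
print(round(cumulative_pct))  # 34
```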

## Alternatives considered

- **Baseline stored from a separate run** (`github-action-benchmark` for the PR check): rejected for the per-PR check because the baseline runs on a different runner → cross-runner noise. Reused in Phase 3 only for trend tracking, where that's fine.

- **Instrumentation-based services** (e.g. CodSpeed): deterministic measurement, but adds an external vendor/account dependency, so I'd avoid it unless the team specifically wants it.

Does this direction look right? If so I'll start on Phase 1 once #117/#140 are in.
