Spinning this out into its own issue (per the dev-meeting discussion) so the CI regression check has a dedicated home. It builds on the benchmark scripts in #117 and the JSON output in #140 (cc @yuvraajnarula), and follows up on the "Automated Performance Regression" part of #137.
Everything below runs entirely on GitHub-hosted runners - no external service or account required.
Goal
On a PR, run the scaling benchmark on both the PR commit and current main, and post a comment showing whether the PR makes graph construction slower, faster, or the same - so regressions are visible before merge.
Key design decision: compare on the same runner
GitHub Actions runners are noisy - identical code can vary 20–50% between runs on different VMs. So comparing a PR against a baseline captured earlier on a different runner would produce frequent false alarms.
Instead I propose running main and the PR back-to-back in the same job, on the same runner, and comparing relative % change rather than absolute seconds. Same hardware for both sides → the noise largely cancels. This also matches the workflow described in the meeting.
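To make the "noise largely cancels" argument concrete, here is a minimal sketch with made-up timings. It assumes runner noise acts roughly as a common multiplier on both runs, which is only approximately true in practice (hence "largely"). The function name is illustrative, not part of the existing scripts.

```python
def relative_change(main_s: float, pr_s: float) -> float:
    """Percent change of the PR time relative to main (positive = slower)."""
    return (pr_s - main_s) / main_s * 100.0


# Illustrative numbers only: the same two builds timed on a fast and a slow runner.
fast_runner = relative_change(main_s=1.00, pr_s=1.10)
slow_runner = relative_change(main_s=1.40, pr_s=1.54)  # both sides 1.4x slower

# Absolute seconds differ a lot between the runners, but the relative change
# is the same (up to floating-point rounding) because the factor divides out.
```

Comparing 1.10s on the fast runner against 1.40s on the slow one would instead look like a speed-up that has nothing to do with the code.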
Correctness detail: hold the harness constant
When we git checkout main, that reverts both the library and the benchmark script. To compare fairly we keep the benchmark script fixed (the PR's version) and only swap the library under test - i.e. run the same measurement against two installs of the code. Otherwise we'd be measuring with two different rulers.
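A toy demonstration of the "same ruler, two installs" idea, not the project's actual layout: one harness file is written once and then run against two different copies of a library selected via `PYTHONPATH`. The module name `lib_under_test`, the file `bench.py`, and the directory names are all hypothetical stand-ins.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical harness: imports whatever `lib_under_test` is on the path.
HARNESS = "import lib_under_test\nprint(lib_under_test.VERSION)\n"


def run_harness(harness_path: Path, lib_dir: Path) -> str:
    """Run the same harness file against the library found in lib_dir."""
    env = dict(os.environ, PYTHONPATH=str(lib_dir))
    result = subprocess.run(
        [sys.executable, str(harness_path)],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def demo() -> tuple[str, str]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Two fake "installs" that differ only in the library code.
        for name in ("main", "pr"):
            (root / name).mkdir()
            (root / name / "lib_under_test.py").write_text(f"VERSION = {name!r}\n")
        harness = root / "bench.py"  # written once, never swapped
        harness.write_text(HARNESS)
        return run_harness(harness, root / "main"), run_harness(harness, root / "pr")
```

In the real workflow the equivalent would be to copy the PR's benchmark script somewhere outside the working tree before checking out main, so the checkout cannot replace it.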
How it works
on: pull_request (paths: src/.../create/**) + manual dispatch
1. checkout PR head → install lib → run scaling benchmark → pr.json
2. checkout main → install lib → run same benchmark → main.json
(same runner, same job, same benchmark script)
3. compare.py main.json pr.json → markdown delta table
4. post / update a sticky PR comment with the result
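One common way to do the "post / update" part of step 4 is to tag the comment body with a hidden marker and look for it on later runs. The sketch below covers only that lookup logic. It assumes `comments` is a list of dicts with `id` and `body` keys, as comment objects from the GitHub REST API have; the marker string and function names are hypothetical.

```python
from typing import Optional

# Hypothetical hidden marker that identifies the bot's own comment.
MARKER = "<!-- perf-regression-check -->"


def find_sticky_comment(comments: list[dict], marker: str = MARKER) -> Optional[int]:
    """Return the id of the first comment whose body contains the marker, if any."""
    for comment in comments:
        if marker in (comment.get("body") or ""):
            return comment["id"]
    return None


def build_body(table_markdown: str, marker: str = MARKER) -> str:
    """Prefix the rendered table with the marker so the next run can find it."""
    return f"{marker}\n{table_markdown}"
```

If `find_sticky_comment` returns an id, the workflow would update that comment; otherwise it would create a new one. An existing sticky-comment action could replace this entirely.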
Example comment:
| grid points | main | PR | Δ |
|---|---|---|---|
| 1,024 | 0.42s | 0.43s | +2% ✅ |
| 4,096 | 1.81s | 2.30s | +27% ⚠️ |
The only new code is a small compare.py (parse the two JSONs from #140, compute deltas, render the table, flag on threshold) plus the workflow YAML.
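A rough sketch of what compare.py could look like. The JSON schema here is an assumption (a list of objects with `grid_points` and `seconds` fields); the real field names depend on what #140 ends up emitting. The 25% default mirrors the example threshold used in this proposal.

```python
import json


def load_timings(path: str) -> dict[int, float]:
    """Map grid size -> seconds. The field names are assumed, not taken from #140."""
    with open(path) as fh:
        return {row["grid_points"]: row["seconds"] for row in json.load(fh)}


def render_table(main: dict[int, float], pr: dict[int, float],
                 threshold: float = 25.0) -> str:
    """Render a markdown delta table for the grid sizes present in both runs."""
    lines = ["| grid points | main | PR | Δ |", "|---|---|---|---|"]
    for n in sorted(main.keys() & pr.keys()):
        delta = (pr[n] - main[n]) / main[n] * 100.0
        flag = "⚠️" if delta > threshold else "✅"
        lines.append(
            f"| {n:,} | {main[n]:.2f}s | {pr[n]:.2f}s | {delta:+.0f}% {flag} |"
        )
    return "\n".join(lines)
```

The workflow would call something like `render_table(load_timings("main.json"), load_timings("pr.json"))` and pass the result to the comment step. Grid sizes missing from either file are skipped rather than treated as errors, which is a design choice worth revisiting.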
Guardrails against noise
- Compare % change, never absolute time.
- Tolerance threshold: only flag a regression when the slowdown exceeds a set threshold (e.g. +25%) and shows up consistently across grid sizes.
- Cap grid sizes to fit the CI time budget (small/medium grids, not 10⁷ nodes).
- Informational comment first, non-blocking - so a noisy perf check doesn't train people to ignore CI. Can be promoted to a hard gate later once the threshold is proven.
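The threshold guardrail could be expressed as a small decision rule. This sketch reads "consistent across grid sizes" as "every measured grid size exceeds the threshold", which is one possible interpretation; `min_fraction` is a hypothetical knob for loosening it.

```python
def is_regression(deltas_pct: dict[int, float], threshold: float = 25.0,
                  min_fraction: float = 1.0) -> bool:
    """Flag only if enough grid sizes exceed the threshold.

    min_fraction=1.0 reads "consistent" as "every grid size"; that reading
    is an assumption, and a looser fraction may suit noisy small grids.
    """
    if not deltas_pct:
        return False
    over = sum(1 for d in deltas_pct.values() if d > threshold)
    return over / len(deltas_pct) >= min_fraction
```

Under this rule a single slow row (for example +27% at one grid size and +2% at another) would still be marked in the table but would not produce an overall regression verdict.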
Plan
- […] (compare.py depends on its JSON schema.)
- […] compare.py → sticky PR comment, informational only.
- […] --track-memory from #140).
- […] github-action-benchmark, which stores history on a gh-pages branch in this repo; still no external service.
Alternatives considered
- Baseline stored from a separate run (github-action-benchmark for the PR check): rejected for the per-PR check because the baseline runs on a different runner → cross-runner noise. Reused in Phase 3 only for trend tracking, where that's fine.
- Instrumentation-based services (e.g. CodSpeed): deterministic measurement, but adds an external vendor/account dependency - avoiding unless the team specifically wants it.
Does this direction look right? If so I'll start on Phase 1 once #117/#140 are in.