
[Feat] Automated benchmark regression check in CI (PR vs main) #144

Description

@prajwal-tech07

Spinning this out into its own issue (per the dev-meeting discussion) so the CI regression check has a dedicated home. It builds on the benchmark scripts in #117 and the JSON output in #140 (cc @yuvraajnarula), and follows up on the "Automated Performance Regression" part of #137.

Everything below runs entirely on GitHub-hosted runners - no external service or account required.

Goal

On a PR, run the scaling benchmark on both the PR commit and current main, and post a comment showing whether the PR makes graph construction slower, faster, or the same - so regressions are visible before merge.

Key design decision: compare on the same runner

GitHub Actions runners are noisy - identical code can vary 20–50% between runs on different VMs. So comparing a PR against a baseline captured earlier on a different runner would produce frequent false alarms.

Instead I propose running main and the PR back-to-back in the same job, on the same runner, and comparing relative % change rather than absolute seconds. Same hardware for both sides → the noise largely cancels. This also matches the workflow described in the meeting.
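A rough illustration of why the same-runner comparison helps, assuming runner noise acts roughly like a multiplicative slowdown on everything that runs on that VM. The timings and the 1.4x factor are made up for the example, and `pct_change` is just a hypothetical helper name:

```python
def pct_change(baseline_s: float, candidate_s: float) -> float:
    """Relative change of candidate vs baseline, in percent."""
    return (candidate_s - baseline_s) / baseline_s * 100.0

# Made-up timings: on any given machine the PR is 10% slower than main.
main_s, pr_s = 2.0, 2.2

# Same runner for both sides: the runner's speed factor cancels out.
same_runner = [pct_change(main_s * k, pr_s * k) for k in (1.0, 1.4)]
print([f"{d:+.1f}%" for d in same_runner])   # both about +10%

# Baseline from a fast runner, PR on a 40% slower runner: the factor leaks in.
cross_runner = pct_change(main_s * 1.0, pr_s * 1.4)
print(f"{cross_runner:+.1f}%")               # about +54%, mostly runner noise
```

Real runner noise is not perfectly multiplicative or constant within a job, which is why the noise only "largely" cancels and a tolerance threshold is still needed.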

Correctness detail: hold the harness constant

When we git checkout main, that reverts both the library and the benchmark script. To compare fairly we keep the benchmark script fixed (the PR's version) and only swap the library under test - i.e. run the same measurement against two installs of the code. Otherwise we'd be measuring with two different rulers.

How it works

on: pull_request (paths: src/.../create/**) + manual dispatch

1. checkout PR head → install lib → run scaling benchmark → pr.json
2. checkout main    → install lib → run same benchmark    → main.json
   (same runner, same job, same benchmark script)
3. compare.py main.json pr.json → markdown delta table
4. post / update a sticky PR comment with the result
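One possible way to keep the benchmark script fixed while swapping the library, sketched as a dry-run command plan (nothing is executed). Everything concrete here is an assumption: the script path `benchmarks/scaling.py`, its `--output` flag, the `../base` worktree location, and that the package installs with `pip install -e`. The idea is that the working tree stays on the PR head, so the script never changes, while main is checked out into a separate git worktree and installed from there:

```python
def plan_commands(base_ref: str = "origin/main",
                  bench_script: str = "benchmarks/scaling.py") -> list[list[str]]:
    """Return the commands for steps 1-2 as argv lists, without running them.

    bench_script and its --output flag are hypothetical placeholders.
    """
    return [
        # Step 1: library from the PR checkout, benchmark script from the PR checkout.
        ["python", "-m", "pip", "install", "-e", "."],
        ["python", bench_script, "--output", "pr.json"],
        # Step 2: library from base_ref in a separate worktree;
        # the benchmark script is still the PR's copy in the current tree.
        ["git", "worktree", "add", "--detach", "../base", base_ref],
        ["python", "-m", "pip", "install", "-e", "../base"],
        ["python", bench_script, "--output", "main.json"],
    ]


for cmd in plan_commands():
    print(" ".join(cmd))
```

In the actual workflow these would more likely be plain `run:` steps; the list form is only used here so the order and the "same script, two installs" property are easy to check.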

Example comment:

grid points   main    PR      Δ
1,024         0.42s   0.43s   +2% ✅
4,096         1.81s   2.30s   +27% ⚠️

The only new code is a small compare.py (parse the two JSONs from #140, compute deltas, render the table, flag on threshold) plus the workflow YAML.
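A minimal sketch of what compare.py could look like. The JSON schema is an assumption (a `results` list of `grid_points` / `seconds` entries), since the real format is whatever #140 ends up emitting; `load_timings` and `render_table` are hypothetical names, and the 25% default is only the example threshold from this issue:

```python
import json


def load_timings(text: str) -> dict[int, float]:
    """Parse benchmark JSON into {grid_points: seconds}.

    ASSUMED schema: {"results": [{"grid_points": 1024, "seconds": 0.42}, ...]}
    """
    data = json.loads(text)
    return {int(r["grid_points"]): float(r["seconds"]) for r in data["results"]}


def render_table(main: dict[int, float], pr: dict[int, float],
                 threshold_pct: float = 25.0) -> str:
    """Render a markdown delta table for grid sizes present in both runs."""
    lines = ["| grid points | main | PR | Δ |", "|---|---|---|---|"]
    for n in sorted(main.keys() & pr.keys()):
        delta = (pr[n] - main[n]) / main[n] * 100.0  # relative change in percent
        mark = "⚠️" if delta > threshold_pct else "✅"
        lines.append(f"| {n:,} | {main[n]:.2f}s | {pr[n]:.2f}s | {delta:+.0f}% {mark} |")
    return "\n".join(lines)


main_json = '{"results": [{"grid_points": 1024, "seconds": 0.42}, {"grid_points": 4096, "seconds": 1.81}]}'
pr_json = '{"results": [{"grid_points": 1024, "seconds": 0.43}, {"grid_points": 4096, "seconds": 2.30}]}'
print(render_table(load_timings(main_json), load_timings(pr_json)))
```

With the numbers from the example comment this produces the same +2% and +27% rows. In CI the two strings would come from reading main.json and pr.json instead.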

Guardrails against noise

  • Compare % change, never absolute time.
  • Tolerance threshold: only flag a regression when the slowdown exceeds a threshold (e.g. +25%) and is consistent across grid sizes.
  • Cap grid sizes to fit the CI time budget (small/medium grids, not 10⁷ nodes).
  • Informational comment first, non-blocking - so a noisy perf check doesn't train people to ignore CI. Can be promoted to a hard gate later once the threshold is proven.
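One possible reading of the threshold guardrail, sketched below. "Consistent across grid sizes" is interpreted here as "at least `min_sizes` grid sizes exceed the threshold", which is my assumption and open to discussion; `regression_verdict` and the three verdict labels are hypothetical:

```python
def regression_verdict(deltas_pct: dict[int, float],
                       threshold_pct: float = 25.0,
                       min_sizes: int = 2) -> tuple[str, list[int]]:
    """Classify per-grid-size % changes as 'regression', 'inconclusive' or 'ok'.

    Returns the verdict plus the grid sizes that exceeded the threshold.
    The result is informational only; it never fails the job.
    """
    over = sorted(n for n, d in deltas_pct.items() if d > threshold_pct)
    if len(over) >= min_sizes:
        return "regression", over      # slowdown shows up at several sizes
    if over:
        return "inconclusive", over    # a single outlier, possibly noise
    return "ok", over


print(regression_verdict({1024: 2.0, 4096: 27.0}))
print(regression_verdict({1024: 30.0, 4096: 27.0}))
```

Under this reading, a single grid size over the threshold is reported but not called a regression, which keeps the comment informational until the threshold has been tuned on real runs.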

Plan

Alternatives considered

  • Baseline stored from a separate run (e.g. github-action-benchmark): rejected for the per-PR check because the baseline would come from a different runner, which reintroduces cross-runner noise. Reused in Phase 3 only for trend tracking, where that's fine.

  • Instrumentation-based services (e.g. CodSpeed): deterministic measurement, but adds an external vendor/account dependency - avoiding unless the team specifically wants it.

Does this direction look right? If so I'll start on Phase 1 once #117/#140 are in.
