Skip to content

cross-vendor parity tracking #59

@samtalki

Description

@samtalki

meta-issue. whenever an apple-specific optimization lands in
MargueriteMetalExt, open a tracking issue for the CUDA and AMDGPU
equivalents so apple doesn't pull ahead on functionality (only perf).

watch-list:

  • gpu_storage keyword (when added) — no-op pass-through on CUDA / AMDGPU
  • @autoreleasepool boundary — Metal-specific; not needed on CUDA / AMDGPU,
    but document why and how their equivalents work (or don't)
  • benchmark rows in docs/src/gpu_backends.md are currently "—" for CUDA
    and AMDGPU — fill in when measured
  • device tests under MARGUERITE_TEST_GROUP=gpu — currently auto-detects
    Metal first; verify CUDA and AMDGPU code paths exercise the same coverage
  • c.gap device-resident on non-Metal backends — current BatchCache.gap::Vector{T} is host-allocated and the GPU LMO kernels write to it directly. Works on Metal (unified memory). On CUDA the host-pointer write from a device kernel is technically UB. Make gap (and objective, obj_trial, step_gamma if any kernel writes them) device-resident on non-CPU backends, with a host shadow for the convergence check.
  • GPU↔CPU numerical agreement asserts in test_gpu_backend.jl — current smoke tests check feasibility (sum≈1, in-bounds). Add asserts that Array(X_gpu) ≈ X_cpu_reference to atol scaled by eltype (1e-5 for F64, 1e-3 for F32). Catches buggy reductions that land feasible-but-wrong.
  • rrule-on-real-GPU testset — one rrule(batch_solve, expr, ScalarBox, X0_gpu, θ) testset under MARGUERITE_TEST_GROUP=gpu with FD check vs CPU. The adapt(Array, ...) host-pull boundary in batch_diff.jl is end-to-end untested today.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions