You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
meta-issue. whenever an apple-specific optimization lands in MargueriteMetalExt, open a tracking issue for the CUDA and AMDGPU
equivalents so apple doesn't pull ahead on functionality (only perf).
watch-list:
gpu_storage keyword (when added) — no-op pass-through on CUDA / AMDGPU
@autoreleasepool boundary — Metal-specific; not needed on CUDA / AMDGPU,
but document why and how their equivalents work (or don't)
benchmark rows in docs/src/gpu_backends.md are currently "—" for CUDA
and AMDGPU — fill in when measured
device tests under MARGUERITE_TEST_GROUP=gpu — currently auto-detects
Metal first; verify CUDA and AMDGPU code paths exercise the same coverage
c.gap device-resident on non-Metal backends — current BatchCache.gap::Vector{T} is host-allocated and the GPU LMO kernels write to it directly. Works on Metal (unified memory). On CUDA the host-pointer write from a device kernel is technically UB. Make gap (and objective, obj_trial, step_gamma if any kernel writes them) device-resident on non-CPU backends, with a host shadow for the convergence check.
GPU↔CPU numerical agreement asserts in test_gpu_backend.jl — current smoke tests check feasibility (sum≈1, in-bounds). Add asserts that Array(X_gpu) ≈ X_cpu_reference to atol scaled by eltype (1e-5 for F64, 1e-3 for F32). Catches buggy reductions that land feasible-but-wrong.
rrule-on-real-GPU testset — one rrule(batch_solve, expr, ScalarBox, X0_gpu, θ) testset under MARGUERITE_TEST_GROUP=gpu with FD check vs CPU. The adapt(Array, ...) host-pull boundary in batch_diff.jl is end-to-end untested today.
meta-issue. whenever an apple-specific optimization lands in
MargueriteMetalExt, open a tracking issue for the CUDA and AMDGPUequivalents so apple doesn't pull ahead on functionality (only perf).
watch-list:
gpu_storagekeyword (when added) — no-op pass-through on CUDA / AMDGPU@autoreleasepoolboundary — Metal-specific; not needed on CUDA / AMDGPU,but document why and how their equivalents work (or don't)
docs/src/gpu_backends.mdare currently "—" for CUDAand AMDGPU — fill in when measured
MARGUERITE_TEST_GROUP=gpu— currently auto-detectsMetal first; verify CUDA and AMDGPU code paths exercise the same coverage
c.gapdevice-resident on non-Metal backends — currentBatchCache.gap::Vector{T}is host-allocated and the GPU LMO kernels write to it directly. Works on Metal (unified memory). On CUDA the host-pointer write from a device kernel is technically UB. Makegap(andobjective,obj_trial,step_gammaif any kernel writes them) device-resident on non-CPU backends, with a host shadow for the convergence check.test_gpu_backend.jl— current smoke tests check feasibility (sum≈1, in-bounds). Add asserts thatArray(X_gpu) ≈ X_cpu_referenceto atol scaled by eltype (1e-5for F64,1e-3for F32). Catches buggy reductions that land feasible-but-wrong.rrule(batch_solve, expr, ScalarBox, X0_gpu, θ)testset underMARGUERITE_TEST_GROUP=gpuwith FD check vs CPU. Theadapt(Array, ...)host-pull boundary inbatch_diff.jlis end-to-end untested today.