The high-speed sibling of mosaic-temporal. NVDEC/NVENC + torch-lap Hungarian + on-GPU torch kernels (Triton port queued for v0.2).
⚠️ Status: 0.1.0 release candidate. Public API (run_pipeline), kernels, solver, NVDEC/NVENC bridge, config schema, and CPU-host tests are in place. The remaining work toward 0.1.0 final is the parity-gate CI on a CUDA runner and the bench-spike sign-off on Kaggle T4 — see Roadmap. The Quickstart below is the supported API; the 3-stream CUDA-overlap optimization that motivated this repo lands in 0.2 without changing the signature.
This is the high-speed build of the video mosaic pipeline. The portable
sibling mosaic-temporal keeps a CPU fallback at every step for users without
a GPU; this repo drops every fallback so the hot path can be NVDEC →
torch kernels (Triton port queued for v0.2) → torch-lap → NVENC end-to-end.
The cost is a hard requirement: an NVIDIA GPU with CUDA ≥ 12.0. The benefit
is real throughput on long clips.
| Feature | mosaic-temporal | mosaic-temporal-gpu (high-speed) |
|---|---|---|
| Hungarian assignment | scipy CPU (default) | torch-linear-assignment (only) |
| Cost matrix | numpy CPU loop | torch.cdist on CUDA (Triton in v0.2) |
| Oklab grid mean | numpy | torch view+reduce on CUDA (Triton v0.2) |
| Video I/O | cv2 PNG round-trip | PyAV NVDEC → ndarray → NVENC |
| RAFT optical flow | CPU torch (slow) | not in v0.1.0 — queued for v0.3 |
| Bit-exact CPU output | yes (bit-exact-cpu) | no (parity gated at SSIM ≥ 0.98) |
| Runtime requirement | none | NVIDIA GPU with CUDA ≥ 12.0 |
If you need the CPU fallback, the bit-exact reference, or Windows/macOS support, use mosaic-temporal. If you have a CUDA GPU and want speed, you're in the right place.
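
A quick way to tell which side of that choice a host falls on is to check the CUDA version its torch build reports. The sketch below is not part of this package: `cuda_version_ok` and `preflight` are hypothetical helper names, and the only torch calls used are `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
from __future__ import annotations


def cuda_version_ok(version: str | None, minimum: tuple[int, int] = (12, 0)) -> bool:
    """Return True if a CUDA version string such as "12.1" meets the minimum."""
    if not version:  # None or "" is what a CPU-only torch build reports
        return False
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= minimum


def preflight() -> str:
    """Return a one-line description of whether this host looks usable."""
    try:
        import torch  # imported lazily so the version check works without torch
    except ImportError:
        return "torch is not installed"
    if not cuda_version_ok(torch.version.cuda):
        return "torch is a CPU-only build or was built against CUDA < 12.0"
    if not torch.cuda.is_available():
        return "CUDA build of torch found, but no usable GPU or driver"
    return f"OK: {torch.cuda.get_device_name(0)} (CUDA {torch.version.cuda})"


if __name__ == "__main__":
    print(preflight())
```

This only inspects the torch build and the visible device; it does not check the driver version or whether FFmpeg was built with cuvid support.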
mosaic-temporal-gpu requires a CUDA build of PyTorch. Install torch first
from the official CUDA wheel index, then install this package:
# 1. CUDA 12.1 wheels (adjust cu121 to your CUDA version)
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision
# 2. Pure compute kernels only (no video I/O — no PyAV)
pip install mosaic-temporal-gpu
# 2'. With NVDEC/NVENC video I/O (needs a cuvid-enabled FFmpeg + PyAV).
# The PyPI `av` wheel is software-only — see benchmarks/README.md for
# the FFmpeg+PyAV self-build recipe. The `[nvdec]` extra declares the
# `av>=12` dependency; it does NOT build FFmpeg for you.
pip install "mosaic-temporal-gpu[nvdec]"

If you skip step 1, pip will resolve torch to the CPU build from PyPI
and every CUDA-only call will fail at runtime; the missing CPU fallback is
deliberate. NVIDIA driver ≥ R535 and CUDA ≥ 12.0 are prerequisites. Until 0.1.0
ships to PyPI, install from source:
git clone https://github.com/hinanohart/mosaic-temporal-gpu
cd mosaic-temporal-gpu
pip install --index-url https://download.pytorch.org/whl/cu121 torch torchvision
pip install -e ".[dev]"

from pathlib import Path
from mosaic_temporal_gpu import run_pipeline

stats = run_pipeline(
    input_video=Path("input.mp4"),
    output_video=Path("output.mp4"),
    tile_dir=Path("tiles/"),  # keyword-only
    fps=30,                   # NVENC output frame rate (input fps
                              # auto-detection lands in 0.2)
    cq=19,                    # h264_nvenc constant-quality (lower = better)
)
print(stats)
# {"frames": 720, "width": 1920, "height": 1080,
#  "fps": 30, "active_codec": "h264_cuvid"}

Pass a D1Config to override the default vivid_b preset:
from mosaic_temporal_gpu import D1Config, run_pipeline
run_pipeline(..., config=D1Config.from_preset("vivid_b"))

For 0.1.0 we ship the vivid_b preset only (saturation_boost=2.10,
mkl_hybrid, neighbor_swap_rounds=5). Additional presets and a CLI
front-end are deferred to 0.2 to keep the launch surface narrow.
The active_codec field in the return value confirms that NVDEC engaged
on the decode side ("h264_cuvid" / "hevc_cuvid"). A software fallback
cannot go unnoticed: if the hardware decoder does not engage, the reader
raises before any frame is processed (see the R8 assertion in io/nvdec.py).
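
If you want to make that check explicit in your own scripts, a small guard over the returned dict is enough. This is a sketch under assumptions: `nvdec_engaged` is a hypothetical helper, not something the package exports, and the codec names are just the two quoted above.

```python
# Decoder names quoted in this README; treat the set as an assumption.
NVDEC_CODECS = frozenset({"h264_cuvid", "hevc_cuvid"})


def nvdec_engaged(stats: dict) -> bool:
    """Return True if the stats dict reports a cuvid (hardware) decoder."""
    return stats.get("active_codec") in NVDEC_CODECS


# Example dict in the shape shown in the Quickstart output.
stats = {"frames": 720, "width": 1920, "height": 1080,
         "fps": 30, "active_codec": "h264_cuvid"}

if not nvdec_engaged(stats):
    raise RuntimeError(f"unexpected decoder: {stats.get('active_codec')!r}")
```

Because the reader already raises on a software fallback, this guard is redundant for correctness; it mainly documents the expectation next to the call site.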
import torch
from mosaic_temporal_gpu import D1Config
from mosaic_temporal_gpu.kernels.cost_matrix import compute_cost_matrix_gpu
from mosaic_temporal_gpu.solvers.torch_lap import TorchLapSolver
cfg = D1Config.from_preset("vivid_b") # ✅ schema + preset
cost = compute_cost_matrix_gpu(cells, tiles) # ✅ GPU cost matrix (CUDA req'd)
assignment = TorchLapSolver().solve(cost)        # ✅ GPU Hungarian

NvdecReader / NvencWriter are likewise importable and tested on a CPU host
for their error paths; a full round-trip needs CUDA.
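
To make concrete what the cost-matrix and assignment steps compute, here is a CPU-only toy in plain Python. It is not the library's implementation: the squared Euclidean metric, the three-component color tuples, and the brute-force search are all assumptions chosen for readability, and `cost_matrix` / `brute_force_assignment` are hypothetical names. A Hungarian solver finds the same minimum-cost one-to-one matching without enumerating every permutation.

```python
from itertools import permutations


def cost_matrix(cells, tiles):
    """cost[i][j] = squared Euclidean distance between cell i and tile j."""
    return [[sum((c - t) ** 2 for c, t in zip(cell, tile)) for tile in tiles]
            for cell in cells]


def brute_force_assignment(cost):
    """Return, per cell, the tile index that minimizes the total cost.

    Each tile is used exactly once. Runtime is O(n!), so this is only
    usable for tiny square matrices.
    """
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))


# Three grid cells and three candidate tiles, as mean colors.
cells = [(0.9, 0.1, 0.0), (0.2, 0.0, 0.1), (0.5, 0.5, 0.5)]
tiles = [(0.5, 0.4, 0.5), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0)]

assignment = brute_force_assignment(cost_matrix(cells, tiles))
print(assignment)  # (1, 2, 0): cell 0 -> tile 1, cell 1 -> tile 2, cell 2 -> tile 0
```

The GPU path exists because real grids have thousands of cells, where both the pairwise distances and the assignment need to run as batched tensor operations.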
The release contract is: for each frame of a fixed 24-frame synthetic clip,
SSIM(mosaic_temporal_gpu candidate, mosaicraft CPU reference) ≥ 0.98.
The test exists (tests/test_parity_vs_mosaicraft.py, @pytest.mark.parity),
but GitHub's free runners have no CUDA, so the parity job is not in CI
today — it runs locally on a CUDA host with pytest -m parity. A scheduled
GPU runner (Modal / RunPod) is queued for 0.1.0 final. Output is not bit-exact
(GPU reductions are non-associative); the SSIM gate is the operative contract.
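
The shape of that contract can be sketched in plain Python. Two caveats: this uses a simplified global (single-window) SSIM over flat pixel lists rather than the windowed SSIM an image library would compute, and `global_ssim` / `parity_gate` are hypothetical names, not the functions used in tests/test_parity_vs_mosaicraft.py.

```python
from statistics import fmean


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized flat pixel sequences."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mx, my = fmean(x), fmean(y)
    vx = fmean([(a - mx) ** 2 for a in x])
    vy = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def parity_gate(candidate_frames, reference_frames, threshold=0.98):
    """Return (passed, per-frame scores); every frame must meet the threshold."""
    if len(candidate_frames) != len(reference_frames):
        raise ValueError("frame counts differ")
    scores = [global_ssim(c, r)
              for c, r in zip(candidate_frames, reference_frames)]
    return all(s >= threshold for s in scores), scores


# Why bit-exactness is not the contract: float addition is order-dependent.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

reference = [[10, 50, 90, 130, 170, 210]] * 3
candidate = [[p + 1 for p in frame] for frame in reference]   # tiny drift
inverted = [[255 - p for p in frame] for frame in reference]  # wrong output

passed_drift, _ = parity_gate(candidate, reference)
passed_inverted, _ = parity_gate(inverted, reference)
```

Small numerical drift passes the gate while a structurally wrong frame fails it, which is the behavior a per-frame threshold is meant to provide.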
src/mosaic_temporal_gpu/
    __init__.py        # version, public API (run_pipeline, D1Config, exceptions)
    _version.py        # single source of truth
    config.py          # D1Config schema (mirror of mosaic-temporal's GPU-valid subset)
    kernels/
        cost_matrix.py # GPU cost matrix (torch.cdist on CUDA; Triton port = v0.2)
        oklab_grid.py  # GPU Oklab grid mean (torch view+reduce; Triton port = v0.2)
    solvers/
        torch_lap.py   # torch-linear-assignment wrapper
    io/
        nvdec.py       # PyAV NVDEC reader
        nvenc.py       # PyAV NVENC writer
    pipeline.py        # end-to-end run_pipeline (single CUDA stream;
                       # 3-stream overlap is v0.2)
tests/
    test_parity_vs_mosaicraft.py  # SSIM ≥ 0.98 gate (xfail until CUDA CI)
    test_pipeline_smoke.py        # run_pipeline public-API contract
    test_kernel_shapes.py
    test_solver_torch_lap.py
    test_io_bridges.py
    test_config_schema.py
    test_version_smoke.py
- 0.1.0: `run_pipeline()` shipped (single-stream NVDEC → mosaic → NVENC); parity gate green on a CUDA runner (Modal / RunPod queued); bench-spike sign-off on Kaggle T4.
- 0.2: 3-stream CUDA overlap (`decode | compute | encode`); DLPack zero-copy on both ends of the video bridge; Triton kernels for cost matrix and Oklab grid (replace `torch.cdist` / `torch.view+mean` once we benchmark a real win); CLI front-end; additional presets.
- 0.3: RAFT optical flow on GPU for temporal coherence; `flow_warp` module.
- 1.0: Stable parity gate across two driver/CUDA upgrades; one breaking-change cycle behind us.
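
The idea behind the 0.2 overlap item is that decode, compute, and encode for different frames can run at the same time. The sketch below is a CPU analogy only: it uses Python threads and bounded queues rather than CUDA streams, and `run_overlapped` and the stage callables are hypothetical names, not the planned API.

```python
import queue
import threading

_DONE = object()  # sentinel that marks end-of-stream


def _stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, push results; forward the sentinel."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(item))


def run_overlapped(frames, decode, compute, encode, depth=2):
    """Run decode | compute | encode concurrently, preserving frame order.

    Error handling is omitted: an exception inside a stage would stall
    the pipeline in this sketch.
    """
    queues = [queue.Queue(maxsize=depth) for _ in range(4)]

    def feed():
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(_DONE)

    workers = [threading.Thread(target=feed)]
    for fn, inbox, outbox in zip((decode, compute, encode), queues, queues[1:]):
        workers.append(threading.Thread(target=_stage, args=(fn, inbox, outbox)))
    for worker in workers:
        worker.start()

    results = []
    while (item := queues[3].get()) is not _DONE:
        results.append(item)
    for worker in workers:
        worker.join()
    return results


packets = run_overlapped(
    range(5),
    decode=lambda i: i,              # stand-in for NVDEC
    compute=lambda x: x * 2,         # stand-in for the mosaic kernels
    encode=lambda x: f"pkt{x}",      # stand-in for NVENC
)
print(packets)  # ['pkt0', 'pkt2', 'pkt4', 'pkt6', 'pkt8']
```

Each stage is a single FIFO worker, so frame order is preserved, and the bounded queues cap how far one stage can run ahead of the next.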
- mosaicraft (image mosaic, pure numpy/cv2/scipy): used here as the CPU reference for the parity gate and for the Oklab / MKL OT / Laplacian primitives.
- mosaic-temporal (video mosaic, CPU/GPU dual path): the portable sibling. Same `D1Config` surface, so config files port between the two.
Releases published after 2026-05-16 include a sigstore keyless signature bundle
(one .sigstore file per artifact) attached to the GitHub Release.
pip download <pkg-name>==<version> --no-deps -d ./verify
python -m sigstore verify github \
--cert-identity 'https://github.com/hinanohart/mosaic-temporal-gpu/.github/workflows/release.yml@refs/tags/v<version>' \
--cert-oidc-issuer 'https://token.actions.githubusercontent.com' \
./verify/*.whl ./verify/*.tar.gz

The corresponding .sigstore bundles can be downloaded from the GitHub Release page.
Earlier releases were published without sigstore bundles. Re-installing those versions provides no cryptographic provenance — pin to a current release if assurance matters.
MIT. See LICENSE.