Auto-build missing OME-NGFF multiscale levels (foundation) by davidackerman · Pull Request #43 · janelia-cellmap/mesh-n-bone

davidackerman · 2026-06-12T18:53:19Z

Summary

When the input zarr exposes only s0 (no pre-built coarser levels), the downsample multires strategy now auto-builds the missing OME-NGFF multiscale levels into {output_directory}/_intermediate_scales.zarr in a single parallel super-chunk pass over s0, then reads each LOD from the pre-built pyramid. Replaces the slow per-LOD in-worker fallback that re-read the full s0 once per LOD.

This is the direct fix for the c-elegans cell run that was getting stuck at LOD 1 because hemibrain's s1 etc. didn't exist on the input.

Highlights

Per-axis factors for anisotropic voxels (per_lod_factors_for_anisotropy): downsamples finest axes first until ~isotropic, then 2× uniformly. Standard FIB-SEM practice.
ROI alignment with two modes:
- pyramid_alignment_mode: snap (default) — round unaligned ROIs INWARD; drops up to max_factor - 1 voxels per edge.
- pyramid_alignment_mode: halo — round OUTWARD, read beyond the ROI into the surrounding dataset to complete boundary cubes. No data loss.
OME-NGFF v0.4 metadata matching cellmap-analyze conventions: tr_k = tr_0 + 0.5 × vs_0 × (F_k − 1) per-axis.
Single-pass super-chunk worker: reads each s0 super-chunk ONCE, emits one output chunk at every LOD.
s0 symlinking for local sources so the pyramid group has a complete OME multiscale layout in one place; remote sources fall back to discovery-side merging.
keep_intermediate_scales config knob to retain the pyramid across runs.

Tests (20 new, 352 total in repo)

Module-level (tests/test_pyramid_builder.py, 18 tests):

Per-LOD factors: isotropic, 2.5× z anisotropy, 5× z extreme.
ROI alignment: snap drops fringe; halo rounds outward; anisotropic factor; aligned identity.
OME metadata: isotropic + anisotropic translation formulas.
Fence-post correctness: per-block super-chunk downsample = global single-pass downsample voxel-by-voxel.
E2E build: OME .zattrs + s_k arrays; existing factors skipped; local s0 symlinked.
Halo E2E: confirmed via reader log that halo mode reads from voxel 0 even when ROI starts at voxel 3; output matches halo-extended global downsample bit-exact.
Dataset-bound clip: halo extending past dataset bounds clips cleanly without exceptions.
Snap mode: unaligned ROI shrinks to factor multiples; downsample matches the snapped region exactly.

Integration (tests/test_integration_meshify.py, 2 tests):

test_auto_builds_pyramid_for_single_scale_input — single-scale input triggers auto-build; spy on _generate_meshes_at_scale calls confirms LODs 1+ read from the pyramid.
test_keep_intermediate_scales_false_cleans_up — default cleanup removes the pyramid after the pipeline.

New module ``mesh_n_bone/util/pyramid_builder.py`` that pre-computes missing s_k arrays for the downsample multires strategy when the input zarr exposes only s0. Replaces the slow per-LOD in-worker fallback that reads the entire s0 once per LOD. Key design: - Per-axis factors (per_lod_factors_for_anisotropy) — anisotropic voxels downsample 2x along the finest axes first until ~isotropic, then 2x uniformly. Standard FIB-SEM practice. - ROI alignment with two modes (align_roi_voxels): snap — round inward to max-factor multiples; drops up to max_factor-1 voxels per ROI edge. halo — round outward; reads beyond the ROI into surrounding dataset to complete cubes at unaligned edges. - OME-NGFF v0.4 metadata matching cellmap-analyze conventions: tr_k = tr_0 + 0.5 * vs_0 * (F_k - 1), per axis. - Single-pass super-chunk worker (downsample_super_chunk): reads each s0 super-chunk ONCE and emits one output chunk at every LOD. - Local s0 is symlinked into the pyramid group for a complete OME multiscale layout; remote sources fall back to discovery-side merging. Tests (16, all passing): - per_lod_factors_for_anisotropy: isotropic, 2.5x z anisotropy, 5x z extreme. - align_roi_voxels: aligned ROI identity, snap drops edges, halo rounds outward, anisotropic factor. - build_multiscales_metadata: isotropic + anisotropic translation formulas; axes block. - Fence-post correctness: per-block super-chunk downsample matches a global single-pass downsample voxel-by-voxel, isotropic + anisotropic. - End-to-end build: writes OME .zattrs + s_k arrays; existing factors are skipped; local s0 is symlinked. No wiring into Meshify yet — that's a follow-up commit on this branch.

When the input zarr exposes only s0, _generate_multires_downsample now calls _build_missing_pyramid_scales() upfront to pre-compute the missing s_k arrays into ``{output_directory}/_intermediate_scales.zarr`` via a single parallel super-chunk pass. Each LOD then reads from the pre-built pyramid instead of re-reading s0 once per LOD. New Meshify kwargs: - keep_intermediate_scales: bool = False — clean up the pyramid zarr after the pipeline finishes (default) or keep for reuse on subsequent runs over the same data. - pyramid_alignment_mode: str = "snap" — "snap" rounds unaligned ROIs INWARD (drops up to max_factor-1 voxels per edge); "halo" rounds OUTWARD and reads beyond the ROI to complete boundary cubes (no data loss). Implementation details: - _build_missing_pyramid_scales() opens a tensorstore handle for s0, builds an (origin_voxels, shape_voxels) → ndarray closure that uses to_ndarray_tensorstore, dispatches on downsample_method, computes per-axis factors via per_lod_factors_for_anisotropy, and calls build_missing_pyramid_levels. - ValueError from invalid downsample_method propagates so test expectations stay valid; transient I/O errors fall back to the per-LOD in-worker downsample path. - downsample_method="nearest" skips pyramid build (worker stride is free; cache adds no value). - Cleanup at end of get_multiscale_meshes removes the pyramid unless keep_intermediate_scales is set. Tests (4 new, 352 total): - test_auto_builds_pyramid_for_single_scale_input — exercises the end-to-end auto-build path; spy on _generate_meshes_at_scale calls confirms LODs 1+ read from `_intermediate_scales.zarr/s_k`. - test_keep_intermediate_scales_false_cleans_up — default cleanup removes the pyramid after the pipeline. - test_halo_mode_reads_beyond_roi — halo mode logs reads starting at voxel 0 even when ROI origin is voxel 3; output matches the halo-extended global downsample bit-exact. - test_halo_clipped_at_dataset_bounds — halo extending past dataset bounds clips cleanly without exceptions; output shape is correct rounded-outward size. README documents the new knobs.

…to halo Two fixes from review: 1. pyramid_alignment_mode default: snap -> halo. I shipped snap by mistake; the original design decision was halo (no data loss at unaligned ROI edges, even though reads extend beyond the ROI into surrounding dataset). 2. Symlink-safe cleanup. shutil.rmtree on its own does the right thing on modern Python (unlinks symlinks rather than following them into the target), VERIFIED by an in-test experiment, BUT given the pyramid's s0 is a symlink to the USER'S INPUT DATA, defensive coding matters here — worst case if shutil.rmtree ever regressed would be destroying user data. New _safely_remove_pyramid() walks the tree top-down, explicitly unlinks every symlink (including the s0 link) before descending, then rmdir's the empty real directories from the bottom up. Refuses to act when the pyramid root itself is somehow a symlink. Tests (2 new, 354 total): - test_safe_remove_pyramid_preserves_symlink_target: pyramid has s0 symlinked to a "user data" dir with critical content; after cleanup pyramid is gone, user data fully intact. - test_safe_remove_refuses_to_delete_when_root_is_symlink: pyramid root path itself is a symlink to user data; cleanup refuses and the data survives.

Changes from review: 1. Rename keep_intermediate_scales -> delete_intermediate_scales (default True). Matches existing delete_decimated_meshes naming + default direction. 2. _safely_remove_pyramid now parallelizes top-level subdir removal via ThreadPoolExecutor sized by self.num_workers. shutil.rmtree releases the GIL during os.unlink/os.rmdir syscalls, so threads scale on shared filesystems where the metadata server is the bottleneck. Sequential rmtree on Lustre/NFS for a pyramid with ~100k chunk files is 5-30 min (~100-1000 unlinks/sec single-thread). Parallel cleanup at num_workers=10+ brings that to minutes. Symlink safety preserved: top-level symlinks (the s0 link to user input) are unlinked BEFORE any parallel rmtree spawns, so worker threads never receive a path that points into user data. Tests + README updated.

Replaces top-level-subdir-only parallelism with file-level parallelism following cellmap-analyze's pattern (dask_util.delete_tmp_dir_blockwise). Walks the tree once collecting all real files, partitions across num_workers threads via ThreadPoolExecutor, each thread os.unlinks its batch. Empty real dirs are removed bottom-up afterward. Symlinks are unlinked inline during the walk before any worker threads spawn — the s0 link to user input is never handed to a parallel worker. Adds test_safe_remove_handles_many_chunk_files exercising 500+ chunk- like files in a flat-zarr-v2 layout.

pyramid_builder._create_zarr_v2_array previously did 'import zarr', a test-env-only dep in pixi.toml. In production (default pixi env) the import failed and the auto-build silently fell back to in-worker per-LOD downsampling — defeating the whole point of the feature. Switched to tensorstore (a core dep) for zarr v2 writes. Same on-disk layout (.zarray + flat chunk files), readable by the zarr library when needed. Added a numpy-dtype -> tensorstore-dtype map for the common label dtypes. Verified: pyramid build now succeeds in the default pixi env with no zarr installed (smoke test reads back .zarray + chunk files via plain os.listdir). Audited src/ for other test-env-only imports; only stranded one was zarr — mmh3 only appears as cloudvolume's vendored copy, available wherever cloudvolume is.

Two fixes from the c-elegans cell run staring at one log line for minutes: 1. Pyramid build was sequential. build_missing_pyramid_levels iterated the super-chunk grid one at a time, reading ~134 MB of s0 per super-chunk from Lustre. For 27 super-chunks on the c-elegans cell volume, that's serialised Lustre I/O — minutes per chunk. Switched to a ThreadPoolExecutor sized by self.num_workers (threaded through from Meshify, which gets it from -n on the CLI). tensorstore reads release the GIL so threads scale on Lustre metadata + bandwidth bottlenecks. Adds periodic progress logs ('K/N super-chunks done') every 5% so you can watch it move. 2. Logs were block-buffered when stderr was redirected to a file (LSF case). util/logging.py now: - reconfigures sys.stdout/stderr to line_buffering=True - installs a custom _FlushingStreamHandler that calls self.flush() + self.stream.flush() after each emit - removes any prior default handlers first so our handler wins tail -f on the LSF output.log now updates in real time instead of waiting for the OS buffer to fill. Same num_workers value drives the pyramid super-chunk parallelism AND the pyramid cleanup parallelism, so -n N gives you N-way parallelism across both ends of the pyramid lifecycle.

Previously the super-chunk size was hardcoded — out_chunk capped at 256 voxels per axis × max_factor (typically 8) = 2048-voxel super-chunk edges. At uint64 labels that's a 2048^3 * 8 byte = 68 GB read per worker per task. Easy OOM. Now sizes super_chunk from per-worker memory budget the same way assembly does: per_worker_budget_MB = _estimate_block_target_mb_from_dask_config( 'dask-config.yaml') per_thread_budget_bytes = per_worker_budget_MB * 1e6 / num_workers / amplification_factor amplification = 1 + sum(1/f^3 for factors > 1) * 2 ≈ 1.3 (cumulative LOD output buffers + scratch). Then super-chunk side = floor(cuberoot(per_thread_budget / itemsize)), rounded down to a multiple of max_factor per axis. For a worker with 15 GB / 24 threads = 625 MB per-thread, uint64 labels, max_factor=8: super_chunk_side ≈ floor((625e6/8)^(1/3)) = ~427 voxels → 424 (mult of 8), out_chunk = 53 voxels. Logs the chosen sizes + per-task read size so it's visible up-front whether the values look sensible for the run.

Refactored the pyramid build to mirror the assembly pipeline's memory-aware retry pattern: - _size_super_chunk(thread_count) helper computes out_chunk_shape so each thread's super-chunk read + scratch fits in per-thread budget. - Build wrapped in retry loop: on MemoryError, halves attempt_workers and recomputes a bigger super-chunk size (more memory per thread). Up to memory_retry_max retries (default 3). - Partial pyramid from a failed attempt is torn down before retry so the next attempt starts clean. Same effect as assembly's 'halve processes/job, restart wave' pattern, but adapted for the driver-threaded pyramid build (one process, N threads instead of N processes per LSF job). For the c-elegans cell volume on a typical Lustre worker (15 GB job memory, -n 24), first attempt uses ~625 MB/thread; if that OOMs we'd retry with 12 threads and ~1.25 GB/thread, etc. For genuine production scale-out across the LSF cluster (as opposed to within a single driver process), still a follow-up — but driver- threaded with adaptive sizing already covers up to ~100s of GB of input volume on a reasonably-spec'd workstation/login-node.

downsample_super_chunk now accepts a write_chunk callback. When provided, each LOD's output buffer is written to disk and dropped immediately, instead of being collected into a dict alongside every other LOD's output. Peak in-memory per thread drops from 'super-chunk + sum(all LOD outputs)' (~1.14x super-chunk for isotropic 2/4/8 factors) to 'super-chunk + one LOD output' (~1.125x super-chunk). Kept the direct s0 → s_k downsample (mode-of-8x8x8 of s0 for LOD 3, etc.) rather than switching to cascade. Cascade is the more common connectomics convention but the user prefers the current behavior for now. Memory amplification factor updated to reflect the lower peak.

Refactored _build_missing_pyramid_scales from driver-side ThreadPoolExecutor to genuine cluster dispatch via dask bag + run_with_oom_retry, mirroring how get_chunked_meshes and assemble_meshes work. - New module-level worker process_super_chunk_for_dask(sc_origin, config) in pyramid_builder.py. Picklable, opens its own tensorstore handles (cached per worker process), reads s0 super-chunk, downsamples to each missing LOD, writes each chunk via tensorstore, drops the buffer. - New driver-side helper prepare_pyramid_metadata_and_arrays() writes the OME-NGFF multiscales metadata and creates the empty output zarr arrays before workers fan out. - Proactive processes-per-job sizing using the assembly pipeline's helpers (_recommended_assembly_processes, _assembly_config_for_processes). If a super-chunk doesn't fit in (job_memory / base_processes), the dask config is adjusted upfront to halve processes-per-job (= double memory per worker), then run_with_oom_retry can still halve further on actual OOM. - Tolerant of missing dask-config.yaml: tests / local runs without LSF just use LocalCluster. - Diagnostic log of the driver process's RLIMIT_NOFILE so future OS-FD issues are visible up-front; also raises soft → hard if possible. Removed the driver-side num_workers cap (was a 32-thread cap meant for the in-process ThreadPoolExecutor; cluster dispatch doesn't need it). All 355 tests pass.

…y sizing Two correctness fixes from the c-elegans cell run: 1. On-disk output chunk shape was sized from per-thread memory budget, producing 18^3 chunks → 33k files per s_k array → subsequent meshing read of s1 hit tensorstore timeouts on Lustre metadata. Now: out_chunk_shape inherits the input's chunk shape clamped to [32, 256] and rounded to a multiple of max_factor per axis. SAME chunk shape used for s1, s2, s3 (uniform across scales). For the c-elegans cell with src chunks 132x134x166, out_chunk lands at 128x128x160 — s1 has ~72 chunks instead of 33k. super_chunk_shape (the s0 read region per task) is independent: a multiple of (out_chunk * max_factor) sized to fit per-worker memory. 2. Per-worker memory was computed as job_memory / num_workers — wrong for cluster dispatch where each dask worker is its own process in an LSF job. Correct: per-worker = job_memory / processes_per_job (each LSF job hosts `processes` workers sharing the job's memory). For c-elegans cell with 180 GB / 12 processes, that's 15 GB per worker — vs the previous (incorrect) 2.5 GB / 576 = 4 MB. The correct accounting lets super-chunks be sized properly. Proactive process-count adjustment via _recommended_assembly_processes mirrors how assembly's _plan_assembly_waves chooses processes/job per wave: if the minimum possible super-chunk wouldn't fit at base_processes, halve processes/job upfront (not just on OOM retry). Diagnostic FD-limit log is now silent unless the raise actually fails; explicit log of effective worker count after task-count capping ('dispatching N super-chunks across M dask workers'). All 355 tests pass.

Previous logic grew super-chunks to fill per-worker memory, which on small volumes (c-elegans cell, 1.5 G voxels) collapsed the task count to 1 — single worker for all -n 576, no parallelism. Now super_chunk_shape is FIXED at out_chunk * max_factor (the minimum that respects alignment). For c-elegans cell that's [1024,1024,1280] = 1.34 GB per task. With processes=12 / 180 GB job memory = 15 GB per worker, peak per task is 1.85 GB ≪ 9 GB (60% memory_fraction). 8 super-chunks total → 8 effective workers. Volumes too small to use the full -n stay parallelism-limited; that's expected and the rest of -n just doesn't get scaled up (dask only requests what it needs). The memory tuning is now ONLY via processes/job: if the minimum super-chunk wouldn't fit in (job_memory / base_processes), halve processes/job to give each worker more memory. Same as assembly's _recommended_assembly_processes pattern, but for a single uniform task estimate. Removed the cuberoot-multiplier sizing that was growing super-chunks into one giant single-task on small volumes.

The rewrite_index_with_empty_fragments reachability pass was top-down only — descending from REAL top-LOD positions through REAL parents. When two LODs are independently meshed (downsample multires strategy), mode-downsampling can drop a thin edge at LOD k+1 that still exists at LOD k. The real LOD-k leaf would then be classified as an orphan (no real parent chain) and dropped from the manifest entirely, producing a chunk-shaped hole at the finest LOD in NG. Add a bottom-up ancestor walk from every REAL position. Reachability is now the union of both passes, so real leaves stay listed; empty intermediates are written as 0-byte placeholders and NG's existing fall-back-to-parent rule lets it traverse past them down to the real leaf. Concrete case: seg 5363 in jrc_P3_E5_D1_N2 mito had real LOD-0 fragments (3,3,4) and (3,4,4) whose LOD-1 parents (1,1,2) and (1,2,2) were empty after mode-downsampling. Before this fix those LOD-0 fragments were dropped from the manifest; after, they're listed and the rendered hole goes away.

keep_chunked_meshes (default False) preserves the per-block PLY tree (tmp_chunked_s{lod}/{mesh_id}/block_*.ply) plus the per-segment assembled meshes after a run, so debugging a specific mesh can compare raw MC output against the post-decimation per-LOD versions. Gates all the cleanup rmtrees in _assemble_mesh, assemble_meshes, and _generate_meshes_at_scale. _existing_pyramid_scales reads OME-NGFF metadata at the auto-build output path and returns the per-LOD downsample factor for each scale whose .zarray exists on disk. _build_missing_pyramid_scales now checks this first and reuses the existing pyramid when it covers all the needed factors, instead of unconditionally rebuilding from s0. Cuts 4-5 minutes off debug re-runs in the same output dir.

Workers were OOM-killed even at processes/job=1 (180 GB/worker) on bw-1 mito. The actual per-task peak grew with each of the 8 super-chunks a worker processes before exit. Four root causes, four fixes: 1. SANE_CHUNK_MAX 256 → 128. The src zarr chunk shape inherits into out_chunk_shape, which multiplies by max_factor=8 for super_chunk. bw-1's 224³ source chunks produced a 1792³ super-chunk (11.5 GB at uint16). Capping at 128 keeps super_chunk_shape ≤ 1024³ (~2 GB at uint16) regardless of input chunking, and triples parallelism for large volumes (27 tasks vs 8). 2. amplification 1.375x → 2.5x. The original estimate accounted only for s0 + s1/s2/s3 output buffers, not for tensorstore decode/encode caches and glibc fragmentation across sequential super-chunks. The underestimate kept the proactive process-per-job reduction from triggering until OOMs forced the post-hoc retry loop. 3. Disable tensorstore cache_pool (total_bytes_limit=0) in the capped context spec. Decoded-chunk cache is unbounded by default and lives for the worker's lifetime, since _PYRAMID_WORKER_TS_CACHE holds the input + 3 output handles across all tasks. We don't benefit from caching here (each super-chunk reads/writes a non-overlapping region). Plumb the capped context into _ts_handle_for_output too. 4. Explicit `del s0_block` + gc.collect() + libc malloc_trim() at the end of process_super_chunk_for_dask. Free-but-mapped pages don't return to OS without an explicit trim — they compound across the 8+ sequential tasks each worker handles. Together these address the cross-task accumulation that no single estimate-based retry can catch.

Tests instantiate Meshify via object.__new__ + manual attribute assignment, bypassing __init__. Direct self.keep_chunked_meshes access in the rmtree gates raised AttributeError there. Switch to getattr(self, "keep_chunked_meshes", False) so tests don't need to set the attribute explicitly.

Without this, every mesh's manifest stores lod_scales=[1,2,4,8] and lod_scale_multiplier=1, so NG reads the per-LOD spatial resolutions as literally 1-8 nm. At any reasonable viewing zoom (pixel_size > 8 nm in world units) every LOD satisfies "scale <= pixel_size", so NG picks the coarsest LOD per its selection rule. Result: meshes always render at LOD-N-1, even when zoomed in close — the "low res at moderate zoom" bug. By setting lod_scale_multiplier to the source voxel size (e.g. 16 for 16 nm c-elegans data), NG sees effective lod_scales of [voxel_size, 2*voxel_size, 4*voxel_size, ...]. NG now picks LOD-0 at pixel_size ~ voxel_size, transitioning to coarser LODs as zoom-out makes the chunk size shrink in screen space. Matches hemibrain's setup (which uses transform=[16,16,16] for the same effect). Plumbed through: - write_sharded_info_file and write_info_file gain a lod_scale_multiplier parameter (default 1.0 = backward-compatible). - meshify.py passes self.base_voxel_size_funlib[0] from the source segmentation's voxel size to both call sites. - multires.py CLI reads a new voxel_size_nm config knob (defaults to 1.0) and passes it through. Users running the standalone multires CLI should set this in their run-config.yaml's optional_decimation_settings. Users wanting "load less" at zoom-out (e.g. many simultaneous meshes) can bump NG's "Resolution (mesh)" slider above 1.0 px to widen the LOD transition window.

davidackerman marked this pull request as ready for review June 12, 2026 19:09

davidackerman added 5 commits June 12, 2026 15:42

davidackerman force-pushed the auto-build-multiscale-pyramid branch from b7ad4ac to 182489e Compare June 12, 2026 19:44

davidackerman added 13 commits June 12, 2026 15:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto-build missing OME-NGFF multiscale levels (foundation)#43

Auto-build missing OME-NGFF multiscale levels (foundation)#43
davidackerman wants to merge 18 commits into
masterfrom
auto-build-multiscale-pyramid

davidackerman commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

davidackerman commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Highlights

Tests (20 new, 352 total in repo)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

davidackerman commented Jun 12, 2026 •

edited

Loading