Auto-build missing OME-NGFF multiscale levels (foundation)#43
Open
davidackerman wants to merge 18 commits into
Open
Auto-build missing OME-NGFF multiscale levels (foundation)#43davidackerman wants to merge 18 commits into
davidackerman wants to merge 18 commits into
Conversation
New module ``mesh_n_bone/util/pyramid_builder.py`` that pre-computes
missing s_k arrays for the downsample multires strategy when the
input zarr exposes only s0. Replaces the slow per-LOD in-worker
fallback that reads the entire s0 once per LOD.
Key design:
- Per-axis factors (per_lod_factors_for_anisotropy) — anisotropic voxels
downsample 2x along the finest axes first until ~isotropic, then 2x
uniformly. Standard FIB-SEM practice.
- ROI alignment with two modes (align_roi_voxels):
snap — round inward to max-factor multiples; drops up to
max_factor-1 voxels per ROI edge.
halo — round outward; reads beyond the ROI into surrounding
dataset to complete cubes at unaligned edges.
- OME-NGFF v0.4 metadata matching cellmap-analyze conventions:
tr_k = tr_0 + 0.5 * vs_0 * (F_k - 1), per axis.
- Single-pass super-chunk worker (downsample_super_chunk): reads each
s0 super-chunk ONCE and emits one output chunk at every LOD.
- Local s0 is symlinked into the pyramid group for a complete OME
multiscale layout; remote sources fall back to discovery-side merging.
Tests (16, all passing):
- per_lod_factors_for_anisotropy: isotropic, 2.5x z anisotropy, 5x z
extreme.
- align_roi_voxels: aligned ROI identity, snap drops edges, halo rounds
outward, anisotropic factor.
- build_multiscales_metadata: isotropic + anisotropic translation
formulas; axes block.
- Fence-post correctness: per-block super-chunk downsample matches a
global single-pass downsample voxel-by-voxel, isotropic + anisotropic.
- End-to-end build: writes OME .zattrs + s_k arrays; existing factors
are skipped; local s0 is symlinked.
No wiring into Meshify yet — that's a follow-up commit on this branch.
When the input zarr exposes only s0, _generate_multires_downsample now
calls _build_missing_pyramid_scales() upfront to pre-compute the
missing s_k arrays into ``{output_directory}/_intermediate_scales.zarr``
via a single parallel super-chunk pass. Each LOD then reads from the
pre-built pyramid instead of re-reading s0 once per LOD.
New Meshify kwargs:
- keep_intermediate_scales: bool = False — clean up the pyramid zarr
after the pipeline finishes (default) or keep for reuse on
subsequent runs over the same data.
- pyramid_alignment_mode: str = "snap" — "snap" rounds unaligned ROIs
INWARD (drops up to max_factor-1 voxels per edge); "halo" rounds
OUTWARD and reads beyond the ROI to complete boundary cubes (no
data loss).
Implementation details:
- _build_missing_pyramid_scales() opens a tensorstore handle for s0,
builds an (origin_voxels, shape_voxels) → ndarray closure that uses
to_ndarray_tensorstore, dispatches on downsample_method, computes
per-axis factors via per_lod_factors_for_anisotropy, and calls
build_missing_pyramid_levels.
- ValueError from invalid downsample_method propagates so test
expectations stay valid; transient I/O errors fall back to the
per-LOD in-worker downsample path.
- downsample_method="nearest" skips pyramid build (worker stride is
free; cache adds no value).
- Cleanup at end of get_multiscale_meshes removes the pyramid unless
keep_intermediate_scales is set.
Tests (4 new, 352 total):
- test_auto_builds_pyramid_for_single_scale_input — exercises the
end-to-end auto-build path; spy on _generate_meshes_at_scale calls
confirms LODs 1+ read from `_intermediate_scales.zarr/s_k`.
- test_keep_intermediate_scales_false_cleans_up — default cleanup
removes the pyramid after the pipeline.
- test_halo_mode_reads_beyond_roi — halo mode logs reads starting at
voxel 0 even when ROI origin is voxel 3; output matches the
halo-extended global downsample bit-exact.
- test_halo_clipped_at_dataset_bounds — halo extending past dataset
bounds clips cleanly without exceptions; output shape is correct
rounded-outward size.
README documents the new knobs.
…to halo Two fixes from review: 1. pyramid_alignment_mode default: snap -> halo. I shipped snap by mistake; the original design decision was halo (no data loss at unaligned ROI edges, even though reads extend beyond the ROI into surrounding dataset). 2. Symlink-safe cleanup. shutil.rmtree on its own does the right thing on modern Python (unlinks symlinks rather than following them into the target), VERIFIED by an in-test experiment, BUT given the pyramid's s0 is a symlink to the USER'S INPUT DATA, defensive coding matters here — worst case if shutil.rmtree ever regressed would be destroying user data. New _safely_remove_pyramid() walks the tree top-down, explicitly unlinks every symlink (including the s0 link) before descending, then rmdir's the empty real directories from the bottom up. Refuses to act when the pyramid root itself is somehow a symlink. Tests (2 new, 354 total): - test_safe_remove_pyramid_preserves_symlink_target: pyramid has s0 symlinked to a "user data" dir with critical content; after cleanup pyramid is gone, user data fully intact. - test_safe_remove_refuses_to_delete_when_root_is_symlink: pyramid root path itself is a symlink to user data; cleanup refuses and the data survives.
Changes from review: 1. Rename keep_intermediate_scales -> delete_intermediate_scales (default True). Matches existing delete_decimated_meshes naming + default direction. 2. _safely_remove_pyramid now parallelizes top-level subdir removal via ThreadPoolExecutor sized by self.num_workers. shutil.rmtree releases the GIL during os.unlink/os.rmdir syscalls, so threads scale on shared filesystems where the metadata server is the bottleneck. Sequential rmtree on Lustre/NFS for a pyramid with ~100k chunk files is 5-30 min (~100-1000 unlinks/sec single-thread). Parallel cleanup at num_workers=10+ brings that to minutes. Symlink safety preserved: top-level symlinks (the s0 link to user input) are unlinked BEFORE any parallel rmtree spawns, so worker threads never receive a path that points into user data. Tests + README updated.
Replaces top-level-subdir-only parallelism with file-level parallelism following cellmap-analyze's pattern (dask_util.delete_tmp_dir_blockwise). Walks the tree once collecting all real files, partitions across num_workers threads via ThreadPoolExecutor, each thread os.unlinks its batch. Empty real dirs are removed bottom-up afterward. Symlinks are unlinked inline during the walk before any worker threads spawn — the s0 link to user input is never handed to a parallel worker. Adds test_safe_remove_handles_many_chunk_files exercising 500+ chunk- like files in a flat-zarr-v2 layout.
b7ad4ac to
182489e
Compare
pyramid_builder._create_zarr_v2_array previously did 'import zarr', a test-env-only dep in pixi.toml. In production (default pixi env) the import failed and the auto-build silently fell back to in-worker per-LOD downsampling — defeating the whole point of the feature. Switched to tensorstore (a core dep) for zarr v2 writes. Same on-disk layout (.zarray + flat chunk files), readable by the zarr library when needed. Added a numpy-dtype -> tensorstore-dtype map for the common label dtypes. Verified: pyramid build now succeeds in the default pixi env with no zarr installed (smoke test reads back .zarray + chunk files via plain os.listdir). Audited src/ for other test-env-only imports; only stranded one was zarr — mmh3 only appears as cloudvolume's vendored copy, available wherever cloudvolume is.
Two fixes from the c-elegans cell run staring at one log line for
minutes:
1. Pyramid build was sequential. build_missing_pyramid_levels iterated
the super-chunk grid one at a time, reading ~134 MB of s0 per
super-chunk from Lustre. For 27 super-chunks on the c-elegans cell
volume, that's serialised Lustre I/O — minutes per chunk. Switched
to a ThreadPoolExecutor sized by self.num_workers (threaded
through from Meshify, which gets it from -n on the CLI).
tensorstore reads release the GIL so threads scale on Lustre
metadata + bandwidth bottlenecks. Adds periodic progress logs
('K/N super-chunks done') every 5% so you can watch it move.
2. Logs were block-buffered when stderr was redirected to a file
(LSF case). util/logging.py now:
- reconfigures sys.stdout/stderr to line_buffering=True
- installs a custom _FlushingStreamHandler that calls self.flush()
+ self.stream.flush() after each emit
- removes any prior default handlers first so our handler wins
tail -f on the LSF output.log now updates in real time instead of
waiting for the OS buffer to fill.
Same num_workers value drives the pyramid super-chunk parallelism AND
the pyramid cleanup parallelism, so -n N gives you N-way parallelism
across both ends of the pyramid lifecycle.
Previously the super-chunk size was hardcoded — out_chunk capped at
256 voxels per axis × max_factor (typically 8) = 2048-voxel super-chunk
edges. At uint64 labels that's a 2048^3 * 8 byte = 68 GB read per
worker per task. Easy OOM.
Now sizes super_chunk from per-worker memory budget the same way
assembly does:
per_worker_budget_MB = _estimate_block_target_mb_from_dask_config(
'dask-config.yaml')
per_thread_budget_bytes = per_worker_budget_MB * 1e6 / num_workers
/ amplification_factor
amplification = 1 + sum(1/f^3 for factors > 1) * 2 ≈ 1.3 (cumulative
LOD output buffers + scratch). Then super-chunk side =
floor(cuberoot(per_thread_budget / itemsize)), rounded down to a
multiple of max_factor per axis.
For a worker with 15 GB / 24 threads = 625 MB per-thread, uint64
labels, max_factor=8:
super_chunk_side ≈ floor((625e6/8)^(1/3)) = ~427 voxels → 424 (mult
of 8), out_chunk = 53 voxels.
Logs the chosen sizes + per-task read size so it's visible up-front
whether the values look sensible for the run.
Refactored the pyramid build to mirror the assembly pipeline's memory-aware retry pattern: - _size_super_chunk(thread_count) helper computes out_chunk_shape so each thread's super-chunk read + scratch fits in per-thread budget. - Build wrapped in retry loop: on MemoryError, halves attempt_workers and recomputes a bigger super-chunk size (more memory per thread). Up to memory_retry_max retries (default 3). - Partial pyramid from a failed attempt is torn down before retry so the next attempt starts clean. Same effect as assembly's 'halve processes/job, restart wave' pattern, but adapted for the driver-threaded pyramid build (one process, N threads instead of N processes per LSF job). For the c-elegans cell volume on a typical Lustre worker (15 GB job memory, -n 24), first attempt uses ~625 MB/thread; if that OOMs we'd retry with 12 threads and ~1.25 GB/thread, etc. For genuine production scale-out across the LSF cluster (as opposed to within a single driver process), still a follow-up — but driver- threaded with adaptive sizing already covers up to ~100s of GB of input volume on a reasonably-spec'd workstation/login-node.
downsample_super_chunk now accepts a write_chunk callback. When provided, each LOD's output buffer is written to disk and dropped immediately, instead of being collected into a dict alongside every other LOD's output. Peak in-memory per thread drops from 'super-chunk + sum(all LOD outputs)' (~1.14x super-chunk for isotropic 2/4/8 factors) to 'super-chunk + one LOD output' (~1.125x super-chunk). Kept the direct s0 → s_k downsample (mode-of-8x8x8 of s0 for LOD 3, etc.) rather than switching to cascade. Cascade is the more common connectomics convention but the user prefers the current behavior for now. Memory amplification factor updated to reflect the lower peak.
Refactored _build_missing_pyramid_scales from driver-side ThreadPoolExecutor to genuine cluster dispatch via dask bag + run_with_oom_retry, mirroring how get_chunked_meshes and assemble_meshes work. - New module-level worker process_super_chunk_for_dask(sc_origin, config) in pyramid_builder.py. Picklable, opens its own tensorstore handles (cached per worker process), reads s0 super-chunk, downsamples to each missing LOD, writes each chunk via tensorstore, drops the buffer. - New driver-side helper prepare_pyramid_metadata_and_arrays() writes the OME-NGFF multiscales metadata and creates the empty output zarr arrays before workers fan out. - Proactive processes-per-job sizing using the assembly pipeline's helpers (_recommended_assembly_processes, _assembly_config_for_processes). If a super-chunk doesn't fit in (job_memory / base_processes), the dask config is adjusted upfront to halve processes-per-job (= double memory per worker), then run_with_oom_retry can still halve further on actual OOM. - Tolerant of missing dask-config.yaml: tests / local runs without LSF just use LocalCluster. - Diagnostic log of the driver process's RLIMIT_NOFILE so future OS-FD issues are visible up-front; also raises soft → hard if possible. Removed the driver-side num_workers cap (was a 32-thread cap meant for the in-process ThreadPoolExecutor; cluster dispatch doesn't need it). All 355 tests pass.
…y sizing
Two correctness fixes from the c-elegans cell run:
1. On-disk output chunk shape was sized from per-thread memory budget,
producing 18^3 chunks → 33k files per s_k array → subsequent
meshing read of s1 hit tensorstore timeouts on Lustre metadata.
Now: out_chunk_shape inherits the input's chunk shape clamped to
[32, 256] and rounded to a multiple of max_factor per axis. SAME
chunk shape used for s1, s2, s3 (uniform across scales). For the
c-elegans cell with src chunks 132x134x166, out_chunk lands at
128x128x160 — s1 has ~72 chunks instead of 33k.
super_chunk_shape (the s0 read region per task) is independent: a
multiple of (out_chunk * max_factor) sized to fit per-worker memory.
2. Per-worker memory was computed as job_memory / num_workers — wrong
for cluster dispatch where each dask worker is its own process in
an LSF job. Correct: per-worker = job_memory / processes_per_job
(each LSF job hosts `processes` workers sharing the job's memory).
For c-elegans cell with 180 GB / 12 processes, that's 15 GB per
worker — vs the previous (incorrect) 2.5 GB / 576 = 4 MB. The
correct accounting lets super-chunks be sized properly.
Proactive process-count adjustment via _recommended_assembly_processes
mirrors how assembly's _plan_assembly_waves chooses processes/job per
wave: if the minimum possible super-chunk wouldn't fit at
base_processes, halve processes/job upfront (not just on OOM retry).
Diagnostic FD-limit log is now silent unless the raise actually fails;
explicit log of effective worker count after task-count capping
('dispatching N super-chunks across M dask workers'). All 355 tests pass.
Previous logic grew super-chunks to fill per-worker memory, which on small volumes (c-elegans cell, 1.5 G voxels) collapsed the task count to 1 — single worker for all -n 576, no parallelism. Now super_chunk_shape is FIXED at out_chunk * max_factor (the minimum that respects alignment). For c-elegans cell that's [1024,1024,1280] = 1.34 GB per task. With processes=12 / 180 GB job memory = 15 GB per worker, peak per task is 1.85 GB ≪ 9 GB (60% memory_fraction). 8 super-chunks total → 8 effective workers. Volumes too small to use the full -n stay parallelism-limited; that's expected and the rest of -n just doesn't get scaled up (dask only requests what it needs). The memory tuning is now ONLY via processes/job: if the minimum super-chunk wouldn't fit in (job_memory / base_processes), halve processes/job to give each worker more memory. Same as assembly's _recommended_assembly_processes pattern, but for a single uniform task estimate. Removed the cuberoot-multiplier sizing that was growing super-chunks into one giant single-task on small volumes.
The rewrite_index_with_empty_fragments reachability pass was top-down only — descending from REAL top-LOD positions through REAL parents. When two LODs are independently meshed (downsample multires strategy), mode-downsampling can drop a thin edge at LOD k+1 that still exists at LOD k. The real LOD-k leaf would then be classified as an orphan (no real parent chain) and dropped from the manifest entirely, producing a chunk-shaped hole at the finest LOD in NG. Add a bottom-up ancestor walk from every REAL position. Reachability is now the union of both passes, so real leaves stay listed; empty intermediates are written as 0-byte placeholders and NG's existing fall-back-to-parent rule lets it traverse past them down to the real leaf. Concrete case: seg 5363 in jrc_P3_E5_D1_N2 mito had real LOD-0 fragments (3,3,4) and (3,4,4) whose LOD-1 parents (1,1,2) and (1,2,2) were empty after mode-downsampling. Before this fix those LOD-0 fragments were dropped from the manifest; after, they're listed and the rendered hole goes away.
keep_chunked_meshes (default False) preserves the per-block PLY tree
(tmp_chunked_s{lod}/{mesh_id}/block_*.ply) plus the per-segment
assembled meshes after a run, so debugging a specific mesh can compare
raw MC output against the post-decimation per-LOD versions. Gates all
the cleanup rmtrees in _assemble_mesh, assemble_meshes, and
_generate_meshes_at_scale.
_existing_pyramid_scales reads OME-NGFF metadata at the auto-build
output path and returns the per-LOD downsample factor for each scale
whose .zarray exists on disk. _build_missing_pyramid_scales now checks
this first and reuses the existing pyramid when it covers all the
needed factors, instead of unconditionally rebuilding from s0. Cuts
4-5 minutes off debug re-runs in the same output dir.
Workers were OOM-killed even at processes/job=1 (180 GB/worker) on bw-1 mito. The actual per-task peak grew with each of the 8 super-chunks a worker processes before exit. Four root causes, four fixes: 1. SANE_CHUNK_MAX 256 → 128. The src zarr chunk shape inherits into out_chunk_shape, which multiplies by max_factor=8 for super_chunk. bw-1's 224³ source chunks produced a 1792³ super-chunk (11.5 GB at uint16). Capping at 128 keeps super_chunk_shape ≤ 1024³ (~2 GB at uint16) regardless of input chunking, and triples parallelism for large volumes (27 tasks vs 8). 2. amplification 1.375x → 2.5x. The original estimate accounted only for s0 + s1/s2/s3 output buffers, not for tensorstore decode/encode caches and glibc fragmentation across sequential super-chunks. The underestimate kept the proactive process-per-job reduction from triggering until OOMs forced the post-hoc retry loop. 3. Disable tensorstore cache_pool (total_bytes_limit=0) in the capped context spec. Decoded-chunk cache is unbounded by default and lives for the worker's lifetime, since _PYRAMID_WORKER_TS_CACHE holds the input + 3 output handles across all tasks. We don't benefit from caching here (each super-chunk reads/writes a non-overlapping region). Plumb the capped context into _ts_handle_for_output too. 4. Explicit `del s0_block` + gc.collect() + libc malloc_trim() at the end of process_super_chunk_for_dask. Free-but-mapped pages don't return to OS without an explicit trim — they compound across the 8+ sequential tasks each worker handles. Together these address the cross-task accumulation that no single estimate-based retry can catch.
Tests instantiate Meshify via object.__new__ + manual attribute assignment, bypassing __init__. Direct self.keep_chunked_meshes access in the rmtree gates raised AttributeError there. Switch to getattr(self, "keep_chunked_meshes", False) so tests don't need to set the attribute explicitly.
Without this, every mesh's manifest stores lod_scales=[1,2,4,8] and lod_scale_multiplier=1, so NG reads the per-LOD spatial resolutions as literally 1-8 nm. At any reasonable viewing zoom (pixel_size > 8 nm in world units) every LOD satisfies "scale <= pixel_size", so NG picks the coarsest LOD per its selection rule. Result: meshes always render at LOD-N-1, even when zoomed in close — the "low res at moderate zoom" bug. By setting lod_scale_multiplier to the source voxel size (e.g. 16 for 16 nm c-elegans data), NG sees effective lod_scales of [voxel_size, 2*voxel_size, 4*voxel_size, ...]. NG now picks LOD-0 at pixel_size ~ voxel_size, transitioning to coarser LODs as zoom-out makes the chunk size shrink in screen space. Matches hemibrain's setup (which uses transform=[16,16,16] for the same effect). Plumbed through: - write_sharded_info_file and write_info_file gain a lod_scale_multiplier parameter (default 1.0 = backward-compatible). - meshify.py passes self.base_voxel_size_funlib[0] from the source segmentation's voxel size to both call sites. - multires.py CLI reads a new voxel_size_nm config knob (defaults to 1.0) and passes it through. Users running the standalone multires CLI should set this in their run-config.yaml's optional_decimation_settings. Users wanting "load less" at zoom-out (e.g. many simultaneous meshes) can bump NG's "Resolution (mesh)" slider above 1.0 px to widen the LOD transition window.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When the input zarr exposes only s0 (no pre-built coarser levels), the downsample multires strategy now auto-builds the missing OME-NGFF multiscale levels into
{output_directory}/_intermediate_scales.zarrin a single parallel super-chunk pass over s0, then reads each LOD from the pre-built pyramid. Replaces the slow per-LOD in-worker fallback that re-read the full s0 once per LOD.This is the direct fix for the c-elegans cell run that was getting stuck at LOD 1 because hemibrain's
s1etc. didn't exist on the input.Highlights
per_lod_factors_for_anisotropy): downsamples finest axes first until ~isotropic, then 2× uniformly. Standard FIB-SEM practice.pyramid_alignment_mode: snap(default) — round unaligned ROIs INWARD; drops up tomax_factor - 1voxels per edge.pyramid_alignment_mode: halo— round OUTWARD, read beyond the ROI into the surrounding dataset to complete boundary cubes. No data loss.tr_k = tr_0 + 0.5 × vs_0 × (F_k − 1)per-axis.keep_intermediate_scalesconfig knob to retain the pyramid across runs.Tests (20 new, 352 total in repo)
Module-level (
tests/test_pyramid_builder.py, 18 tests):.zattrs+ s_k arrays; existing factors skipped; local s0 symlinked.Integration (
tests/test_integration_meshify.py, 2 tests):test_auto_builds_pyramid_for_single_scale_input— single-scale input triggers auto-build; spy on_generate_meshes_at_scalecalls confirms LODs 1+ read from the pyramid.test_keep_intermediate_scales_false_cleans_up— default cleanup removes the pyramid after the pipeline.