Skip to content

Auto-build missing OME-NGFF multiscale levels (foundation)#43

Open
davidackerman wants to merge 18 commits into
masterfrom
auto-build-multiscale-pyramid
Open

Auto-build missing OME-NGFF multiscale levels (foundation)#43
davidackerman wants to merge 18 commits into
masterfrom
auto-build-multiscale-pyramid

Conversation

@davidackerman

@davidackerman davidackerman commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

When the input zarr exposes only s0 (no pre-built coarser levels), the downsample multires strategy now auto-builds the missing OME-NGFF multiscale levels into {output_directory}/_intermediate_scales.zarr in a single parallel super-chunk pass over s0, then reads each LOD from the pre-built pyramid. Replaces the slow per-LOD in-worker fallback that re-read the full s0 once per LOD.

This is the direct fix for the c-elegans cell run that was getting stuck at LOD 1 because hemibrain's s1 etc. didn't exist on the input.

Highlights

  • Per-axis factors for anisotropic voxels (per_lod_factors_for_anisotropy): downsamples finest axes first until ~isotropic, then 2× uniformly. Standard FIB-SEM practice.
  • ROI alignment with two modes:
    • pyramid_alignment_mode: snap (default) — round unaligned ROIs INWARD; drops up to max_factor - 1 voxels per edge.
    • pyramid_alignment_mode: halo — round OUTWARD, read beyond the ROI into the surrounding dataset to complete boundary cubes. No data loss.
  • OME-NGFF v0.4 metadata matching cellmap-analyze conventions: tr_k = tr_0 + 0.5 × vs_0 × (F_k − 1) per-axis.
  • Single-pass super-chunk worker: reads each s0 super-chunk ONCE, emits one output chunk at every LOD.
  • s0 symlinking for local sources so the pyramid group has a complete OME multiscale layout in one place; remote sources fall back to discovery-side merging.
  • keep_intermediate_scales config knob to retain the pyramid across runs.

Tests (20 new, 352 total in repo)

Module-level (tests/test_pyramid_builder.py, 18 tests):

  • Per-LOD factors: isotropic, 2.5× z anisotropy, 5× z extreme.
  • ROI alignment: snap drops fringe; halo rounds outward; anisotropic factor; aligned identity.
  • OME metadata: isotropic + anisotropic translation formulas.
  • Fence-post correctness: per-block super-chunk downsample = global single-pass downsample voxel-by-voxel.
  • E2E build: OME .zattrs + s_k arrays; existing factors skipped; local s0 symlinked.
  • Halo E2E: confirmed via reader log that halo mode reads from voxel 0 even when ROI starts at voxel 3; output matches halo-extended global downsample bit-exact.
  • Dataset-bound clip: halo extending past dataset bounds clips cleanly without exceptions.
  • Snap mode: unaligned ROI shrinks to factor multiples; downsample matches the snapped region exactly.

Integration (tests/test_integration_meshify.py, 2 tests):

  • test_auto_builds_pyramid_for_single_scale_input — single-scale input triggers auto-build; spy on _generate_meshes_at_scale calls confirms LODs 1+ read from the pyramid.
  • test_keep_intermediate_scales_false_cleans_up — default cleanup removes the pyramid after the pipeline.

@davidackerman davidackerman marked this pull request as ready for review June 12, 2026 19:09
New module ``mesh_n_bone/util/pyramid_builder.py`` that pre-computes
missing s_k arrays for the downsample multires strategy when the
input zarr exposes only s0. Replaces the slow per-LOD in-worker
fallback that reads the entire s0 once per LOD.

Key design:
- Per-axis factors (per_lod_factors_for_anisotropy) — anisotropic voxels
  downsample 2x along the finest axes first until ~isotropic, then 2x
  uniformly. Standard FIB-SEM practice.
- ROI alignment with two modes (align_roi_voxels):
    snap   — round inward to max-factor multiples; drops up to
             max_factor-1 voxels per ROI edge.
    halo   — round outward; reads beyond the ROI into surrounding
             dataset to complete cubes at unaligned edges.
- OME-NGFF v0.4 metadata matching cellmap-analyze conventions:
  tr_k = tr_0 + 0.5 * vs_0 * (F_k - 1), per axis.
- Single-pass super-chunk worker (downsample_super_chunk): reads each
  s0 super-chunk ONCE and emits one output chunk at every LOD.
- Local s0 is symlinked into the pyramid group for a complete OME
  multiscale layout; remote sources fall back to discovery-side merging.

Tests (16, all passing):
- per_lod_factors_for_anisotropy: isotropic, 2.5x z anisotropy, 5x z
  extreme.
- align_roi_voxels: aligned ROI identity, snap drops edges, halo rounds
  outward, anisotropic factor.
- build_multiscales_metadata: isotropic + anisotropic translation
  formulas; axes block.
- Fence-post correctness: per-block super-chunk downsample matches a
  global single-pass downsample voxel-by-voxel, isotropic + anisotropic.
- End-to-end build: writes OME .zattrs + s_k arrays; existing factors
  are skipped; local s0 is symlinked.

No wiring into Meshify yet — that's a follow-up commit on this branch.
When the input zarr exposes only s0, _generate_multires_downsample now
calls _build_missing_pyramid_scales() upfront to pre-compute the
missing s_k arrays into ``{output_directory}/_intermediate_scales.zarr``
via a single parallel super-chunk pass. Each LOD then reads from the
pre-built pyramid instead of re-reading s0 once per LOD.

New Meshify kwargs:
- keep_intermediate_scales: bool = False — clean up the pyramid zarr
  after the pipeline finishes (default) or keep for reuse on
  subsequent runs over the same data.
- pyramid_alignment_mode: str = "snap" — "snap" rounds unaligned ROIs
  INWARD (drops up to max_factor-1 voxels per edge); "halo" rounds
  OUTWARD and reads beyond the ROI to complete boundary cubes (no
  data loss).

Implementation details:
- _build_missing_pyramid_scales() opens a tensorstore handle for s0,
  builds an (origin_voxels, shape_voxels) → ndarray closure that uses
  to_ndarray_tensorstore, dispatches on downsample_method, computes
  per-axis factors via per_lod_factors_for_anisotropy, and calls
  build_missing_pyramid_levels.
- ValueError from invalid downsample_method propagates so test
  expectations stay valid; transient I/O errors fall back to the
  per-LOD in-worker downsample path.
- downsample_method="nearest" skips pyramid build (worker stride is
  free; cache adds no value).
- Cleanup at end of get_multiscale_meshes removes the pyramid unless
  keep_intermediate_scales is set.

Tests (4 new, 352 total):
- test_auto_builds_pyramid_for_single_scale_input — exercises the
  end-to-end auto-build path; spy on _generate_meshes_at_scale calls
  confirms LODs 1+ read from `_intermediate_scales.zarr/s_k`.
- test_keep_intermediate_scales_false_cleans_up — default cleanup
  removes the pyramid after the pipeline.
- test_halo_mode_reads_beyond_roi — halo mode logs reads starting at
  voxel 0 even when ROI origin is voxel 3; output matches the
  halo-extended global downsample bit-exact.
- test_halo_clipped_at_dataset_bounds — halo extending past dataset
  bounds clips cleanly without exceptions; output shape is correct
  rounded-outward size.

README documents the new knobs.
…to halo

Two fixes from review:

1. pyramid_alignment_mode default: snap -> halo. I shipped snap by
   mistake; the original design decision was halo (no data loss at
   unaligned ROI edges, even though reads extend beyond the ROI into
   surrounding dataset).

2. Symlink-safe cleanup. shutil.rmtree on its own does the right thing
   on modern Python (unlinks symlinks rather than following them into
   the target), VERIFIED by an in-test experiment, BUT given the
   pyramid's s0 is a symlink to the USER'S INPUT DATA, defensive coding
   matters here — worst case if shutil.rmtree ever regressed would be
   destroying user data.

   New _safely_remove_pyramid() walks the tree top-down, explicitly
   unlinks every symlink (including the s0 link) before descending,
   then rmdir's the empty real directories from the bottom up.
   Refuses to act when the pyramid root itself is somehow a symlink.

Tests (2 new, 354 total):
- test_safe_remove_pyramid_preserves_symlink_target: pyramid has s0
  symlinked to a "user data" dir with critical content; after cleanup
  pyramid is gone, user data fully intact.
- test_safe_remove_refuses_to_delete_when_root_is_symlink: pyramid
  root path itself is a symlink to user data; cleanup refuses and
  the data survives.
Changes from review:

1. Rename keep_intermediate_scales -> delete_intermediate_scales (default
   True). Matches existing delete_decimated_meshes naming + default
   direction.

2. _safely_remove_pyramid now parallelizes top-level subdir removal via
   ThreadPoolExecutor sized by self.num_workers. shutil.rmtree releases
   the GIL during os.unlink/os.rmdir syscalls, so threads scale on
   shared filesystems where the metadata server is the bottleneck.

   Sequential rmtree on Lustre/NFS for a pyramid with ~100k chunk files
   is 5-30 min (~100-1000 unlinks/sec single-thread). Parallel cleanup
   at num_workers=10+ brings that to minutes.

   Symlink safety preserved: top-level symlinks (the s0 link to user
   input) are unlinked BEFORE any parallel rmtree spawns, so worker
   threads never receive a path that points into user data.

Tests + README updated.
Replaces top-level-subdir-only parallelism with file-level parallelism
following cellmap-analyze's pattern (dask_util.delete_tmp_dir_blockwise).
Walks the tree once collecting all real files, partitions across
num_workers threads via ThreadPoolExecutor, each thread os.unlinks its
batch. Empty real dirs are removed bottom-up afterward.

Symlinks are unlinked inline during the walk before any worker
threads spawn — the s0 link to user input is never handed to a
parallel worker.

Adds test_safe_remove_handles_many_chunk_files exercising 500+ chunk-
like files in a flat-zarr-v2 layout.
@davidackerman davidackerman force-pushed the auto-build-multiscale-pyramid branch from b7ad4ac to 182489e Compare June 12, 2026 19:44
pyramid_builder._create_zarr_v2_array previously did 'import zarr', a
test-env-only dep in pixi.toml. In production (default pixi env) the
import failed and the auto-build silently fell back to in-worker
per-LOD downsampling — defeating the whole point of the feature.

Switched to tensorstore (a core dep) for zarr v2 writes. Same on-disk
layout (.zarray + flat chunk files), readable by the zarr library
when needed. Added a numpy-dtype -> tensorstore-dtype map for the
common label dtypes.

Verified: pyramid build now succeeds in the default pixi env with no
zarr installed (smoke test reads back .zarray + chunk files via plain
os.listdir).

Audited src/ for other test-env-only imports; only stranded one was
zarr — mmh3 only appears as cloudvolume's vendored copy, available
wherever cloudvolume is.
Two fixes from the c-elegans cell run staring at one log line for
minutes:

1. Pyramid build was sequential. build_missing_pyramid_levels iterated
   the super-chunk grid one at a time, reading ~134 MB of s0 per
   super-chunk from Lustre. For 27 super-chunks on the c-elegans cell
   volume, that's serialised Lustre I/O — minutes per chunk. Switched
   to a ThreadPoolExecutor sized by self.num_workers (threaded
   through from Meshify, which gets it from -n on the CLI).
   tensorstore reads release the GIL so threads scale on Lustre
   metadata + bandwidth bottlenecks. Adds periodic progress logs
   ('K/N super-chunks done') every 5% so you can watch it move.

2. Logs were block-buffered when stderr was redirected to a file
   (LSF case). util/logging.py now:
   - reconfigures sys.stdout/stderr to line_buffering=True
   - installs a custom _FlushingStreamHandler that calls self.flush()
     + self.stream.flush() after each emit
   - removes any prior default handlers first so our handler wins

   tail -f on the LSF output.log now updates in real time instead of
   waiting for the OS buffer to fill.

Same num_workers value drives the pyramid super-chunk parallelism AND
the pyramid cleanup parallelism, so -n N gives you N-way parallelism
across both ends of the pyramid lifecycle.
Previously the super-chunk size was hardcoded — out_chunk capped at
256 voxels per axis × max_factor (typically 8) = 2048-voxel super-chunk
edges. At uint64 labels that's a 2048^3 * 8 byte = 68 GB read per
worker per task. Easy OOM.

Now sizes super_chunk from per-worker memory budget the same way
assembly does:
  per_worker_budget_MB = _estimate_block_target_mb_from_dask_config(
                            'dask-config.yaml')
  per_thread_budget_bytes = per_worker_budget_MB * 1e6 / num_workers
                            / amplification_factor

amplification = 1 + sum(1/f^3 for factors > 1) * 2 ≈ 1.3 (cumulative
LOD output buffers + scratch). Then super-chunk side =
floor(cuberoot(per_thread_budget / itemsize)), rounded down to a
multiple of max_factor per axis.

For a worker with 15 GB / 24 threads = 625 MB per-thread, uint64
labels, max_factor=8:
  super_chunk_side ≈ floor((625e6/8)^(1/3)) = ~427 voxels → 424 (mult
  of 8), out_chunk = 53 voxels.

Logs the chosen sizes + per-task read size so it's visible up-front
whether the values look sensible for the run.
Refactored the pyramid build to mirror the assembly pipeline's
memory-aware retry pattern:

- _size_super_chunk(thread_count) helper computes out_chunk_shape so
  each thread's super-chunk read + scratch fits in per-thread budget.
- Build wrapped in retry loop: on MemoryError, halves attempt_workers
  and recomputes a bigger super-chunk size (more memory per thread).
  Up to memory_retry_max retries (default 3).
- Partial pyramid from a failed attempt is torn down before retry so
  the next attempt starts clean.

Same effect as assembly's 'halve processes/job, restart wave' pattern,
but adapted for the driver-threaded pyramid build (one process, N
threads instead of N processes per LSF job). For the c-elegans cell
volume on a typical Lustre worker (15 GB job memory, -n 24), first
attempt uses ~625 MB/thread; if that OOMs we'd retry with 12 threads
and ~1.25 GB/thread, etc.

For genuine production scale-out across the LSF cluster (as opposed
to within a single driver process), still a follow-up — but driver-
threaded with adaptive sizing already covers up to ~100s of GB of
input volume on a reasonably-spec'd workstation/login-node.
downsample_super_chunk now accepts a write_chunk callback. When
provided, each LOD's output buffer is written to disk and dropped
immediately, instead of being collected into a dict alongside every
other LOD's output. Peak in-memory per thread drops from
'super-chunk + sum(all LOD outputs)' (~1.14x super-chunk for
isotropic 2/4/8 factors) to 'super-chunk + one LOD output' (~1.125x
super-chunk).

Kept the direct s0 → s_k downsample (mode-of-8x8x8 of s0 for LOD 3,
etc.) rather than switching to cascade. Cascade is the more common
connectomics convention but the user prefers the current behavior
for now.

Memory amplification factor updated to reflect the lower peak.
Refactored _build_missing_pyramid_scales from driver-side ThreadPoolExecutor
to genuine cluster dispatch via dask bag + run_with_oom_retry, mirroring
how get_chunked_meshes and assemble_meshes work.

- New module-level worker process_super_chunk_for_dask(sc_origin, config)
  in pyramid_builder.py. Picklable, opens its own tensorstore handles
  (cached per worker process), reads s0 super-chunk, downsamples to each
  missing LOD, writes each chunk via tensorstore, drops the buffer.
- New driver-side helper prepare_pyramid_metadata_and_arrays() writes
  the OME-NGFF multiscales metadata and creates the empty output zarr
  arrays before workers fan out.
- Proactive processes-per-job sizing using the assembly pipeline's
  helpers (_recommended_assembly_processes, _assembly_config_for_processes).
  If a super-chunk doesn't fit in (job_memory / base_processes), the
  dask config is adjusted upfront to halve processes-per-job (= double
  memory per worker), then run_with_oom_retry can still halve further
  on actual OOM.
- Tolerant of missing dask-config.yaml: tests / local runs without LSF
  just use LocalCluster.
- Diagnostic log of the driver process's RLIMIT_NOFILE so future OS-FD
  issues are visible up-front; also raises soft → hard if possible.

Removed the driver-side num_workers cap (was a 32-thread cap meant for
the in-process ThreadPoolExecutor; cluster dispatch doesn't need it).

All 355 tests pass.
…y sizing

Two correctness fixes from the c-elegans cell run:

1. On-disk output chunk shape was sized from per-thread memory budget,
   producing 18^3 chunks → 33k files per s_k array → subsequent
   meshing read of s1 hit tensorstore timeouts on Lustre metadata.

   Now: out_chunk_shape inherits the input's chunk shape clamped to
   [32, 256] and rounded to a multiple of max_factor per axis. SAME
   chunk shape used for s1, s2, s3 (uniform across scales). For the
   c-elegans cell with src chunks 132x134x166, out_chunk lands at
   128x128x160 — s1 has ~72 chunks instead of 33k.

   super_chunk_shape (the s0 read region per task) is independent: a
   multiple of (out_chunk * max_factor) sized to fit per-worker memory.

2. Per-worker memory was computed as job_memory / num_workers — wrong
   for cluster dispatch where each dask worker is its own process in
   an LSF job. Correct: per-worker = job_memory / processes_per_job
   (each LSF job hosts `processes` workers sharing the job's memory).

   For c-elegans cell with 180 GB / 12 processes, that's 15 GB per
   worker — vs the previous (incorrect) 2.5 GB / 576 = 4 MB. The
   correct accounting lets super-chunks be sized properly.

Proactive process-count adjustment via _recommended_assembly_processes
mirrors how assembly's _plan_assembly_waves chooses processes/job per
wave: if the minimum possible super-chunk wouldn't fit at
base_processes, halve processes/job upfront (not just on OOM retry).

Diagnostic FD-limit log is now silent unless the raise actually fails;
explicit log of effective worker count after task-count capping
('dispatching N super-chunks across M dask workers'). All 355 tests pass.
Previous logic grew super-chunks to fill per-worker memory, which on
small volumes (c-elegans cell, 1.5 G voxels) collapsed the task count
to 1 — single worker for all -n 576, no parallelism.

Now super_chunk_shape is FIXED at out_chunk * max_factor (the minimum
that respects alignment). For c-elegans cell that's [1024,1024,1280] =
1.34 GB per task. With processes=12 / 180 GB job memory = 15 GB per
worker, peak per task is 1.85 GB ≪ 9 GB (60% memory_fraction). 8
super-chunks total → 8 effective workers. Volumes too small to use the
full -n stay parallelism-limited; that's expected and the rest of -n
just doesn't get scaled up (dask only requests what it needs).

The memory tuning is now ONLY via processes/job: if the minimum
super-chunk wouldn't fit in (job_memory / base_processes), halve
processes/job to give each worker more memory. Same as assembly's
_recommended_assembly_processes pattern, but for a single uniform
task estimate.

Removed the cuberoot-multiplier sizing that was growing super-chunks
into one giant single-task on small volumes.
The rewrite_index_with_empty_fragments reachability pass was top-down
only — descending from REAL top-LOD positions through REAL parents.
When two LODs are independently meshed (downsample multires strategy),
mode-downsampling can drop a thin edge at LOD k+1 that still exists at
LOD k. The real LOD-k leaf would then be classified as an orphan (no
real parent chain) and dropped from the manifest entirely, producing a
chunk-shaped hole at the finest LOD in NG.

Add a bottom-up ancestor walk from every REAL position. Reachability
is now the union of both passes, so real leaves stay listed; empty
intermediates are written as 0-byte placeholders and NG's existing
fall-back-to-parent rule lets it traverse past them down to the real
leaf.

Concrete case: seg 5363 in jrc_P3_E5_D1_N2 mito had real LOD-0
fragments (3,3,4) and (3,4,4) whose LOD-1 parents (1,1,2) and (1,2,2)
were empty after mode-downsampling. Before this fix those LOD-0
fragments were dropped from the manifest; after, they're listed and
the rendered hole goes away.
keep_chunked_meshes (default False) preserves the per-block PLY tree
(tmp_chunked_s{lod}/{mesh_id}/block_*.ply) plus the per-segment
assembled meshes after a run, so debugging a specific mesh can compare
raw MC output against the post-decimation per-LOD versions. Gates all
the cleanup rmtrees in _assemble_mesh, assemble_meshes, and
_generate_meshes_at_scale.

_existing_pyramid_scales reads OME-NGFF metadata at the auto-build
output path and returns the per-LOD downsample factor for each scale
whose .zarray exists on disk. _build_missing_pyramid_scales now checks
this first and reuses the existing pyramid when it covers all the
needed factors, instead of unconditionally rebuilding from s0. Cuts
4-5 minutes off debug re-runs in the same output dir.
Workers were OOM-killed even at processes/job=1 (180 GB/worker) on
bw-1 mito. The actual per-task peak grew with each of the 8 super-chunks
a worker processes before exit. Four root causes, four fixes:

1. SANE_CHUNK_MAX 256 → 128. The src zarr chunk shape inherits into
   out_chunk_shape, which multiplies by max_factor=8 for super_chunk.
   bw-1's 224³ source chunks produced a 1792³ super-chunk (11.5 GB at
   uint16). Capping at 128 keeps super_chunk_shape ≤ 1024³ (~2 GB at
   uint16) regardless of input chunking, and triples parallelism for
   large volumes (27 tasks vs 8).

2. amplification 1.375x → 2.5x. The original estimate accounted only
   for s0 + s1/s2/s3 output buffers, not for tensorstore decode/encode
   caches and glibc fragmentation across sequential super-chunks. The
   underestimate kept the proactive process-per-job reduction from
   triggering until OOMs forced the post-hoc retry loop.

3. Disable tensorstore cache_pool (total_bytes_limit=0) in the capped
   context spec. Decoded-chunk cache is unbounded by default and lives
   for the worker's lifetime, since _PYRAMID_WORKER_TS_CACHE holds the
   input + 3 output handles across all tasks. We don't benefit from
   caching here (each super-chunk reads/writes a non-overlapping
   region). Plumb the capped context into _ts_handle_for_output too.

4. Explicit `del s0_block` + gc.collect() + libc malloc_trim() at the
   end of process_super_chunk_for_dask. Free-but-mapped pages don't
   return to OS without an explicit trim — they compound across the
   8+ sequential tasks each worker handles.

Together these address the cross-task accumulation that no single
estimate-based retry can catch.
Tests instantiate Meshify via object.__new__ + manual attribute
assignment, bypassing __init__. Direct self.keep_chunked_meshes access
in the rmtree gates raised AttributeError there. Switch to
getattr(self, "keep_chunked_meshes", False) so tests don't need to set
the attribute explicitly.
Without this, every mesh's manifest stores lod_scales=[1,2,4,8] and
lod_scale_multiplier=1, so NG reads the per-LOD spatial resolutions as
literally 1-8 nm. At any reasonable viewing zoom (pixel_size > 8 nm in
world units) every LOD satisfies "scale <= pixel_size", so NG picks
the coarsest LOD per its selection rule. Result: meshes always render
at LOD-N-1, even when zoomed in close — the "low res at moderate zoom"
bug.

By setting lod_scale_multiplier to the source voxel size (e.g. 16 for
16 nm c-elegans data), NG sees effective lod_scales of
[voxel_size, 2*voxel_size, 4*voxel_size, ...]. NG now picks LOD-0 at
pixel_size ~ voxel_size, transitioning to coarser LODs as zoom-out
makes the chunk size shrink in screen space. Matches hemibrain's setup
(which uses transform=[16,16,16] for the same effect).

Plumbed through:
- write_sharded_info_file and write_info_file gain a
  lod_scale_multiplier parameter (default 1.0 = backward-compatible).
- meshify.py passes self.base_voxel_size_funlib[0] from the source
  segmentation's voxel size to both call sites.
- multires.py CLI reads a new voxel_size_nm config knob (defaults to
  1.0) and passes it through. Users running the standalone multires
  CLI should set this in their run-config.yaml's
  optional_decimation_settings.

Users wanting "load less" at zoom-out (e.g. many simultaneous meshes)
can bump NG's "Resolution (mesh)" slider above 1.0 px to widen the
LOD transition window.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant