Add writing to zarr dataset for eval-mode of trained models by leifdenby · Pull Request #104 · mllam/neural-lam

leifdenby · 2025-01-24T17:08:53Z

Describe your changes

Adds a new CLI flag to neural_lam.train_model called --save-eval-to-zarr-path <path-to-dataset>, which can be added when running neural-lam in eval mode (i.e. neural_lam.train_model --eval ...) to write the predictions to a zarr dataset stored in <path-to-dataset>. This functionality is motivated by our wish to store model predictions for later verification.

Example of usage:

Model trained with

$> pdm run python -m neural_lam.train_model --config_path tests/datastore_examples/mdp/danra_100m_winds/config.yaml --hidden_dim 2 --epochs 1 --ar_steps_train 3 --ar_steps_eval 3 --graph 1level

used for inference with

$> pdm run python -m neural_lam.train_model \
   --config_path tests/datastore_examples/mdp/danra_100m_winds/config.yaml \
   --hidden_dim 2 --epochs 1 --ar_steps_train 1 --ar_steps_eval 3 --eval val \
   --load saved_models/train-graph_lam-4x2-01_24_17-2502/min_val_loss.ckpt \
   --val_steps_to_log 3 --graph 1level --save-eval-to-zarr-path state_predictions.zarr/

results in:

$> zarrdump state_predictions.zarr
<xarray.Dataset> Size: 123MB
Dimensions:                    (elapsed_forecast_duration: 3, start_time: 11,
                                state_feature: 2, x: 789, y: 589)
Coordinates:
  * elapsed_forecast_duration  (elapsed_forecast_duration) timedelta64[ns] 24B ...
  * start_time                 (start_time) datetime64[ns] 88B 1990-09-07T06:...
  * state_feature              (state_feature) <U5 40B 'u100m' 'v100m'
    time                       (start_time, elapsed_forecast_duration) datetime64[ns] 264B dask.array<chunksize=(4, 3), meta=np.ndarray>
  * x                          (x) float64 6kB -1.999e+06 ... -2.925e+04
  * y                          (y) float64 5kB -6.095e+05 ... 8.605e+05
Data variables:
    state                      (start_time, elapsed_forecast_duration, state_feature, x, y) float32 123MB dask.array<chunksize=(4, 3, 2, 789, 589), meta=np.ndarray>

NB: This does not implement the inversion of the transformations that take place in mllam-data-prep (e.g. splitting individual features back into separate variables and levels). Also, the zarr datasets store time as [start_time, elapsed_forecast_duration] rather than [start_time, sample] to avoid producing a large array with many empty values (NaNs), which would otherwise happen because each sample has a different start time. The snippet below demonstrates how one could return to absolute time (there is probably a better way to do this...):

import xarray as xr
import matplotlib.pyplot as plt
ds = xr.open_zarr("state_predictions.zarr/", chunks={})

ds.state.isel(x=0, y=0, start_time=slice(0, 4)).plot(hue="start_time", col="state_feature")
plt.savefig("state_predictions_relative_time.png")

ds_abs_time = xr.concat([
    ds.isel(start_time=i).swap_dims(dict(elapsed_forecast_duration="time")) for i in range(len(ds.start_time))
], dim="sample")
ds_abs_time.state.isel(x=0, y=0).plot(hue="sample", col="state_feature")
plt.savefig("state_predictions_absolute_time.png")
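For intuition, the 2D time coordinate shown in the zarrdump output can be read as the outer sum of start_time and elapsed_forecast_duration. A minimal NumPy sketch of that relationship follows; the start times and the 3-hour step are made-up values for illustration, not taken from the dataset.

```python
import numpy as np

# Hypothetical values, for illustration only
start_time = np.array(
    ["1990-09-07T06:00", "1990-09-07T09:00"], dtype="datetime64[ns]"
)
elapsed_forecast_duration = np.array([3, 6, 9], dtype="timedelta64[h]").astype(
    "timedelta64[ns]"
)

# Broadcasting gives time[i, j] = start_time[i] + elapsed_forecast_duration[j]
time = start_time[:, None] + elapsed_forecast_duration[None, :]
print(time.shape)
```

Each row of `time` is the absolute-time axis of one forecast, which is why a single shared absolute time axis would be mostly empty.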

Example plot with time-axis showing elapsed time:

Example plot with time-axis showing absolute time:

This probably needs more work, but I think it is ready for people to try it out and let me know what they think 😄

Issue Link

Implements #89

Type of change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
I have performed a self-review of my code
For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
I have updated the README to cover introduced code changes
I have added tests that prove my fix is effective or that my feature works
I have given the PR a name that clearly describes the change, written in imperative form (context).
I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

the code is readable
the code is well tested
the code is documented (including return types and parameters)
the code is easy to maintain

Author checklist after completed review

I have added a line to the CHANGELOG describing this change, in a section
reflecting type of change (add section where missing):
- added: when you have added new functionality
- changed: when default behaviour of the code has been changed
- fixes: when your contribution fixes a bug

Checklist for assignee

PR is up to date with the base branch
the tests pass
author has added an entry to the changelog (and designated the change as added, changed or fixed)
Once the PR is ready to be merged, squash commits and merge the PR.

SimonKamuk · 2025-02-07T15:10:18Z

I haven't tested the code myself yet, but this looks great! One question: is test_step called on all GPUs in a multi-GPU setup? If so, I guess multiple GPUs writing to the same zarr is not an issue as long as the chunks are different, but the metadata is only written correctly if the GPU which handles batch_idx 0 finishes first, right?

leifdenby · 2025-02-10T08:35:35Z

One question, is test_step called on all gpus in an multi-gpu setup? If so, i guess multiple gpus writing to the same zarr is not an issue as long as the chunks are different, but the metadata is only written correctly, if the GPU which handles batch_idx 0 finishes first, right?

That is a good point. I haven't actually tested writing with inference run across multiple GPUs in parallel. There could be a race condition, since I create the zarr dataset (i.e. write the meta info with ds.to_zarr(path, mode="w")) only for batch_idx == 0, and for other batch_idx I append with ds.to_zarr(path, mode="a") https://github.com/mllam/neural-lam/pull/104/files#diff-043440a2d7a2cd62bb349e74c9fe4f55e69a8d0e4801f6ab2e150d381b11c74dR429. So if batch_idx != 0 is ready before batch_idx == 0, I think this could cause an exception to be raised (since there will be no dataset to append to). I am actually not sure how to do this correctly. @observingClouds do you know? I started by looking into the region argument of to_zarr (xr.Dataset.to_zarr(..., region=...)), but for that I think you also have to write the zarr metadata first.

Could we get away with issuing a warning to say this has only been tested for single GPU inference so far?

observingClouds · 2025-02-10T08:45:59Z

@leifdenby do we know all available output times at the time of starting the write process? If so we can create the metadata first and chunks can be written independently without conflicting.

leifdenby · 2025-02-10T09:21:30Z

do we know all available output times at the time of starting the write process

Yes, we should be able to get these from the datastore. So you would go with using the region argument then? Also, does that mean that we'd create the metadata on, say, the rank 0 process first and then each separate process would write its own regions? That would mean introducing something like an MPI barrier after the metadata write, right @observingClouds?
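The ordering discussed here (metadata first, then a barrier, then independent writes) can be sketched without any GPUs. In this minimal sketch, Python threads and threading.Barrier stand in for the ranks and for a distributed barrier, and a dict holding a NumPy array stands in for the zarr store. All names are hypothetical and none of this is the neural-lam implementation.

```python
import threading

import numpy as np

N_RANKS = 4
N_SAMPLES = 8  # assumed to be known up front, e.g. from the datastore

store = {}  # stand-in for the zarr store on disk
barrier = threading.Barrier(N_RANKS)


def run_rank(rank):
    if rank == 0:
        # Only rank 0 creates the "metadata": the full, still empty output array
        store["state"] = np.full(N_SAMPLES, np.nan)
    # No rank proceeds until rank 0 has created the array
    barrier.wait()
    # Each rank writes only the samples it owns (disjoint regions)
    for idx in range(rank, N_SAMPLES, N_RANKS):
        store["state"][idx] = float(rank)


threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(N_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(store["state"])
```

Because every rank waits at the barrier, it no longer matters which rank finishes its first batch first; the only remaining requirement is that the regions written by different ranks do not overlap.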

joeloskarsson · 2025-02-10T10:55:49Z

does that mean that we'd create the metadata on say the rank 0 process first and then each separate process would write its own ranges

I think this sounds like a good idea. You could use something like https://pytorch.org/docs/stable/distributed.html#torch.distributed.barrier for that.

Am interested in having this work with multi-gpu, so I don't think we should just issue a warning and ignore the multi-gpu case. Something that might complicate things (if we allow batch_size > 1) is the note from https://github.com/joeloskarsson/neural-lam-dev?tab=readme-ov-file#evaluate-models.

Note: While it is technically possible to use multiple GPUs for running evaluation, this is strongly discouraged. If using multiple devices the DistributedSampler will replicate some samples to make sure all devices have the same batch size, meaning that evaluation metrics will be unreliable. A possible workaround is to just use batch size 1 during evaluation. This issue stems from PyTorch Lightning. See for example Lightning-AI/torchmetrics#1886 for more discussion.

If samples are duplicated you could end up with different processes writing to the same region. So that is something to think about.
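To make the duplication concrete, here is a small pure-Python emulation of the padding described in the note. It assumes the scheme used by PyTorch's DistributedSampler with shuffling disabled (pad the index list with its first entries, then stride by rank); it is a simplified stand-in, not the library code, and the function name is hypothetical.

```python
import math


def padded_indices_per_rank(n_samples, n_ranks):
    """Pad the index list so it divides evenly across ranks, then stride by rank."""
    per_rank = math.ceil(n_samples / n_ranks)
    total = per_rank * n_ranks
    indices = list(range(n_samples))
    indices += indices[: total - n_samples]  # replicate the first samples
    return {rank: indices[rank:total:n_ranks] for rank in range(n_ranks)}


assignment = padded_indices_per_rank(n_samples=11, n_ranks=4)

# Find samples that more than one rank would write
owners = {}
for rank, idxs in assignment.items():
    for idx in idxs:
        owners.setdefault(idx, []).append(rank)
duplicates = {idx: ranks for idx, ranks in owners.items() if len(ranks) > 1}
print(duplicates)  # {0: [0, 3]}
```

With 11 samples on 4 devices, sample 0 is handled by both rank 0 and rank 3, so two processes would write the same region of the output.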

leifdenby · 2025-02-10T11:43:28Z

Am interested in having this work with multi-gpu, so I don't think we should just issue a warning and ignore the multi-gpu case. Something that might complicate things (if we allow batch_size > 1) is the note from https://github.com/joeloskarsson/neural-lam-dev?tab=readme-ov-file#evaluate-models.

Ok, things are never as easy as they seem 😆

What I am proposing is that we merge this single-GPU implementation (so that people can start using it) and add multi-GPU in a later PR when we have figured out how to do that. If we did that I would issue a warning with just the single-GPU implementation

sadamov · 2025-02-13T20:01:19Z

I have implemented a version that works with multi-GPU, using region="auto" after writing the full metadata initially. You can find the code here: https://github.com/sadamov/neural-lam/tree/write_zarr

And here is the output of a --eval test on the danra test datastore:

Shortcomings:

The chunking breaks down for large datastores and needs some work (manual chunking?); this could be related to the reindexing step
The logging is too verbose and not helpful

Thought this might be useful here, or in the follow-up PR.

joeloskarsson · 2025-02-14T08:18:57Z

+            t0 = da_pred.coords["time"].values[0]
+            da_pred.coords["start_time"] = t0
+            da_pred.coords["elapsed_forecast_duration"] = da_pred.time - t0


Please have another look over how target times are used here. The batch_times being fed in here are not [analysis_time, analysis_time+time_step, analysis_time+2*time_step, ...], but rather [analysis_time+time_step, analysis_time+2*time_step, ...]. This is also because batch_predictions does not include the state at the analysis time.

This means that start_time is currently not when the forecast was started.

Just adjusting t0 would fix this: joeloskarsson@721ac5e However, you probably need to get the step length from somewhere else than in that commit, as self.step_length is not present on main.
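The off-by-one-step issue can be reproduced with plain NumPy datetimes. This is a minimal sketch with an assumed 3-hour step length and a made-up analysis time; in the real code the step length would have to come from the datastore or elsewhere.

```python
import numpy as np

step_length = np.timedelta64(3, "h")  # assumed value, for illustration only
analysis_time = np.datetime64("1990-09-07T06:00")

# Times of the predicted states; the analysis-time state itself is not included
batch_times = analysis_time + step_length * np.arange(1, 4)

wrong_t0 = batch_times[0]  # first prediction time, one step too late
t0 = batch_times[0] - step_length  # when the forecast was actually started
elapsed_forecast_duration = batch_times - t0

print(wrong_t0, t0, elapsed_forecast_duration)
```

With the corrected t0, the first prediction has an elapsed forecast duration of one step rather than zero.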

joeloskarsson · 2025-02-14T10:34:19Z

+        # Convert predictions to DataArray using _create_dataarray_from_tensor
+        das_pred = []
+        for i in range(len(batch_times)):
+            da_pred = self._create_dataarray_from_tensor(


Now that _create_dataarray_from_tensor is also used here it seems unreasonable to not properly deal with the hack where a WeatherDataset is instantiated at each call:

neural-lam/neural_lam/models/ar_model.py, lines 182 to 185 in 1c281a2

# TODO: creating an instance of WeatherDataset here on every call is
# not how this should be done but whether WeatherDataset should be
# provided to ARModel or where to put plotting still needs discussion
weather_dataset = WeatherDataset(datastore=self._datastore, split=split)

As is, we are instantiating O(NT) WeatherDatasets when saving to zarr (not a memory problem, as we throw them away, but very wasteful). This will quickly become a problem, as the WeatherDatasets will grow when we merge in more boundary-related changes.

This is an alternative hack to avoid constantly making new datasets: joeloskarsson@d277b08. However, this is still very much a hack, and the TODO remains that we should handle this in some proper way.
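One generic way to avoid the repeated construction is to memoize the dataset per split. The sketch below uses functools.lru_cache with a hypothetical stand-in class that only counts how often it is built; it is not the approach taken in the linked commit, and the real WeatherDataset also needs a datastore argument.

```python
from functools import lru_cache


class ExpensiveDataset:
    """Hypothetical stand-in for WeatherDataset; counts how often it is built."""

    n_created = 0

    def __init__(self, split):
        type(self).n_created += 1
        self.split = split


@lru_cache(maxsize=None)
def get_dataset(split):
    # Built once per split, then served from the cache
    return ExpensiveDataset(split=split)


for _ in range(100):  # e.g. one lookup per sample and time step
    get_dataset("val")
get_dataset("test")

print(ExpensiveDataset.n_created)  # 2
```

The cost then scales with the number of distinct splits rather than with the number of samples and time steps.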

joeloskarsson · 2025-02-14T11:18:27Z

+
+        if batch_idx == 0:
+            logger.info(f"Saving predictions to {zarr_output_path}")
+            da_pred_batch.to_zarr(zarr_output_path, mode="w", consolidated=True)


I am getting erroneous start_times with this, due to problems with their encoding. The change below, which sets start_time to use Unix standard time, seems to fix it for me. But I don't know if that is the best solution.

Suggested change

-            da_pred_batch.to_zarr(zarr_output_path, mode="w", consolidated=True)
+            da_pred_batch.to_zarr(
+                zarr_output_path,
+                mode="w",
+                consolidated=True,
+                encoding={
+                    "start_time": {
+                        "units": "Seconds since 1970-01-01 00:00:00",
+                        "dtype": "int64",
+                    },
+                },
+            )

@joeloskarsson which zarr version are you using?

This was with zarr 2.18.3, have not tested with zarr 3
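For reference, this is what an int64 "seconds since 1970-01-01 00:00:00" encoding amounts to, written out by hand with NumPy. It is only an illustration of the stored values with made-up start times, not how xarray performs the encoding internally.

```python
import numpy as np

# Hypothetical start times, for illustration only
start_time = np.array(
    ["1990-09-07T06:00:00", "1990-09-07T09:00:00"], dtype="datetime64[ns]"
)

# Encode: whole seconds since 1970-01-01 00:00:00, stored as int64
epoch = np.datetime64("1970-01-01T00:00:00", "s")
encoded = ((start_time - epoch) // np.timedelta64(1, "s")).astype("int64")

# Decode: the round trip is exact for times that fall on whole seconds
decoded = (epoch + encoded.astype("timedelta64[s]")).astype("datetime64[ns]")

print(encoded, decoded)
```

Because the reference date and unit are fixed, every batch is encoded the same way regardless of which start times it happens to contain.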

joeloskarsson · 2025-02-14T15:08:00Z

+            t0 = da_pred.coords["time"].values[0]
+            da_pred.coords["start_time"] = t0
+            da_pred.coords["elapsed_forecast_duration"] = da_pred.time - t0
+            da_pred = da_pred.swap_dims({"time": "elapsed_forecast_duration"})


Do we really want to leave the time coordinate in this DataArray? It looks to me like those values will only be valid for one of the forecasts (and I am not entirely sure which one).

joeloskarsson · 2025-02-15T13:07:42Z

+        if self.args.save_eval_to_zarr_path:
+            self._save_predictions_to_zarr(
+                batch_times=batch_times,
+                batch_predictions=prediction,


These predictions are in the standardized scale. At some point before these are written to disk in the zarr they should be rescaled to the original data scale.

Can be done as joeloskarsson@d3f636e
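The rescaling itself is the inverse of the standardization, i.e. multiply by the standard deviation and add the mean per feature. A minimal NumPy sketch with made-up statistics and hypothetical variable names (the linked commit may do this differently):

```python
import numpy as np

# Hypothetical per-feature statistics, e.g. for u100m and v100m
state_mean = np.array([1.5, -0.5])
state_std = np.array([4.0, 2.0])

# Standardized predictions with shape (time, grid_point, feature)
pred_standardized = np.zeros((3, 5, 2))
pred_standardized[..., 0] = 1.0

# Invert the standardization; the statistics broadcast over the last axis
pred_rescaled = pred_standardized * state_std + state_mean

print(pred_rescaled[0, 0])
```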

observingClouds · 2026-02-16T14:44:30Z

+        if batch_idx == 0:
+            logger.info(f"Saving predictions to {zarr_output_path}")
+            ds_pred_batch.to_zarr(zarr_output_path, mode="w", consolidated=True)
+        else:
+            ds_pred_batch.to_zarr(
+                zarr_output_path, mode="a", append_dim="start_time"
+            )


we might need to be a bit careful with the time encoding here. In my research branch I get errors like:

UserWarning: Times can't be serialized faithfully to int64 with requested units 'days since 2020-02-12T01:20:00'. Serializing with units 'hours since 2020-02-12T01:20:00' instead. Set encoding['dtype'] to floating point dtype to serialize with units 'days since 2020-02-12T01:20:00'. Set encoding['units'] to 'hours since 2020-02-12T01:20:00' to silence this warning.

Basically, the first time step is saved as days, because it might be 00:00UTC, but consecutive steps might be in e.g. hours.
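The mismatch can be illustrated with a toy version of unit inference: pick the coarsest unit in which all offsets from a reference time are whole numbers. This is a simplified stand-in written for this example, not xarray's actual implementation, and the times are made up.

```python
import numpy as np

UNITS = [("days", "D"), ("hours", "h"), ("minutes", "m"), ("seconds", "s")]


def coarsest_exact_unit(times, reference):
    """Return the coarsest unit in which all offsets from `reference` are whole."""
    offsets = times - reference
    for name, code in UNITS:
        one = np.timedelta64(1, code).astype("timedelta64[s]")
        if np.all(offsets % one == np.timedelta64(0, "s")):
            return name
    return "seconds"


reference = np.datetime64("2020-02-12T00:00:00", "s")
first_batch = np.array(
    ["2020-02-12T00:00:00", "2020-02-13T00:00:00"], dtype="datetime64[s]"
)
later_batch = np.array(
    ["2020-02-12T06:00:00", "2020-02-12T12:00:00"], dtype="datetime64[s]"
)

print(coarsest_exact_unit(first_batch, reference))  # days
print(coarsest_exact_unit(later_batch, reference))  # hours
```

If the units are fixed when the first batch is written, a later batch that needs a finer unit can no longer be stored exactly as integers, which is why setting the units and dtype explicitly up front avoids the problem.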

Jayant-kernel · 2026-02-28T15:06:41Z

I opened a clean follow-up PR that addresses all review items from #104:

Multi-GPU safe: rank-0 writes metadata template (compute=False), then barrier(), then all ranks write via region="auto"
Correct start_time: analysis_time = batch_times[:,0] - step_length
Rescale predictions to original scale in test_step before writing
Drop raw time coord; explicit int64 encoding for time variables
Cache WeatherDataset per (split, category) to avoid O(N*T) re-instantiation
Add --save-eval-to-zarr-path CLI flag
Add tests/test_zarr_eval.py single-GPU integration test

New PR: #289

joeloskarsson · 2026-03-15T16:01:21Z

@leifdenby please answer in #289 if you want to continue working on this or if we should close this and move efforts to #289.

sadamov · 2026-05-07T08:57:58Z

@leifdenby @observingClouds is this superseded by #289?

elbdmi and others added 2 commits January 22, 2025 13:35

add convertion to zarr logic

2f5c32e

first implementation of write-to-zarr during eval

b80d010

leifdenby mentioned this pull request Jan 24, 2025

Output predictions as zarr dataset #89

Open

sadamov linked an issue Jan 25, 2025 that may be closed by this pull request

Output predictions as zarr dataset #89

Open

leifdenby added the enhancement New feature or request label Feb 6, 2025

leifdenby added 2 commits February 6, 2025 11:41

Merge branch 'main' of https://github.com/mllam/neural-lam into feat/…

305b8d0

…write-to-zarr

add test cli inference run with zarr output

de104f4

leifdenby requested review from SimonKamuk and observingClouds February 6, 2025 14:47

leifdenby added this to the v0.5.0 milestone Feb 10, 2025

sadamov mentioned this pull request Feb 10, 2025

add write zarr functionality joeloskarsson/neural-lam-dev#8

Merged

joeloskarsson reviewed Feb 14, 2025

joeloskarsson reviewed Feb 15, 2025

leifdenby mentioned this pull request Apr 2, 2025

Enabling training on interior and boundary datastore #138

Open

leifdenby added 6 commits September 24, 2025 08:15

copy attrs from input datastore

2199e47

set attrs take 2

1105955

start_time -> analysis_time

2fe8859

don't unstack, mdp.invert will do this

e699f3f

decode datastore zarr with cf_xarray

a3cc461

can only encode MultiIndex on xr.Datasets

c54accf

leifdenby added 3 commits September 25, 2025 15:07

cf standard names on zarr output time axes

f57ab61

Merge branch 'main' into feat/write-to-zarr

82aed31

update deps

b6abd4d

joeloskarsson modified the milestones: v0.6.0, v0.7.0 Jan 12, 2026

observingClouds reviewed Feb 16, 2026

observingClouds mentioned this pull request Feb 24, 2026

Restart inference from previous run #247

Open

Jayant-kernel mentioned this pull request Feb 28, 2026

fix: save eval predictions to zarr (multi-GPU safe, correct time semantics #289

Open

21 tasks

sadamov assigned leifdenby and observingClouds May 7, 2026

joeloskarsson modified the milestones: v0.7.0, v0.8.0 Jun 8, 2026
