Add writing to zarr dataset for eval-mode of trained models#104
Add writing to zarr dataset for eval-mode of trained models#104leifdenby wants to merge 13 commits into
Conversation
|
I haven't tested the code myself yet, but this looks great! One question, is test_step called on all gpus in an multi-gpu setup? If so, i guess multiple gpus writing to the same zarr is not an issue as long as the chunks are different, but the metadata is only written correctly, if the GPU which handles batch_idx 0 finishes first, right? |
That is a good point. I haven't actually tested writing on multiple with interference run across multiple GPUs in parallel. There could be a race condition with the fact that I create the zarr dataset (i.e. write the meta info with Could we get away with issuing a warning to say this has only been tested for single GPU inference so far? |
|
@leifdenby do we know all available output times at the time of starting the write process? If so we can create the metadata first and chunks can be written independently without conflicting. |
Yes, we should be able to get these from the datastore. So you would go with using the |
I think this sounds like a good idea. You could use something like https://pytorch.org/docs/stable/distributed.html#torch.distributed.barrier for that. Am interested in having this work with multi-gpu, so I don't think we should just issue a warning and ignore the multi-gpu case. Something that might complicate things (if we allow batch_size > 1) is the note from https://github.com/joeloskarsson/neural-lam-dev?tab=readme-ov-file#evaluate-models.
If samples are duplicated you could end up with different processes writing to the same region. So that is something to think about. |
Ok, things are never as easy as they seem 😆 What I am proposing is that we merge this single-GPU implementation (so that people can start using it) and add multi-GPU in a later PR when we have figured out how to do that. If we did that I would issue a warning with just the single-GPU implementation |
|
I have implemented a version that works with multi-gpu, using region="auto" after writing the full metadata initially. you can find the code here: https://github.com/sadamov/neural-lam/tree/write_zarr And here is the output of a Shortcomings:
Thought this might be useful here, or in the follow-up PR. |
| t0 = da_pred.coords["time"].values[0] | ||
| da_pred.coords["start_time"] = t0 | ||
| da_pred.coords["elapsed_forecast_duration"] = da_pred.time - t0 |
There was a problem hiding this comment.
Please have another look over how target times are used here. batch_times being fed in here are not [analysis_time, analysis_time+time_step, analysis_time+2time_step, ...], but rather [analysis_time+time_step, analysis_time+2time_step, ...]. This also because batch_predictions does not include the state at the analysis time.
This means that start_time is currently not when the forecast was started.
There was a problem hiding this comment.
Just adjusting t0 would fix this: joeloskarsson@721ac5e However, you probably need to get the step length from somewhere else than in that commit, as self.step_length is not present on main.
| # Convert predictions to DataArray using _create_dataarray_from_tensor | ||
| das_pred = [] | ||
| for i in range(len(batch_times)): | ||
| da_pred = self._create_dataarray_from_tensor( |
There was a problem hiding this comment.
Now that _create_dataarray_from_tensor is also used here it seems unreasonable to not properly deal with the hack where a WeatherDataset is instantiatied at each call (
neural-lam/neural_lam/models/ar_model.py
Lines 182 to 185 in 1c281a2
There was a problem hiding this comment.
This is an alternative hack to avoid constantly making new datasets: joeloskarsson@d277b08 however, this is still very much a hack and the TODO remains that we should handle this some proper way.
|
|
||
| if batch_idx == 0: | ||
| logger.info(f"Saving predictions to {zarr_output_path}") | ||
| da_pred_batch.to_zarr(zarr_output_path, mode="w", consolidated=True) |
There was a problem hiding this comment.
I am getting erroneous start_times with this, due to problems with their encoding. This seems to fix it for me
| da_pred_batch.to_zarr(zarr_output_path, mode="w", consolidated=True) | |
| da_pred_batch.to_zarr( | |
| zarr_output_path, | |
| mode="w", | |
| consolidated=True, | |
| encoding={ | |
| "start_time": { | |
| "units": "Seconds since 1970-01-01 00:00:00", | |
| "dtype": "int64", | |
| }, | |
| }, | |
| ) |
, setting it to use unix standard time. But I don't know if that is the best solution.
There was a problem hiding this comment.
@joeloskarsson which zarr version are you using?
There was a problem hiding this comment.
This was with zarr 2.18.3, have not tested with zarr 3
| t0 = da_pred.coords["time"].values[0] | ||
| da_pred.coords["start_time"] = t0 | ||
| da_pred.coords["elapsed_forecast_duration"] = da_pred.time - t0 | ||
| da_pred = da_pred.swap_dims({"time": "elapsed_forecast_duration"}) |
There was a problem hiding this comment.
Do we really want to leave the time coordinate in this DataArray? It looks to me like those values will only be valid for one of the forecasts (and I am not entirely sure which one).
| if self.args.save_eval_to_zarr_path: | ||
| self._save_predictions_to_zarr( | ||
| batch_times=batch_times, | ||
| batch_predictions=prediction, |
There was a problem hiding this comment.
These predictions are in the standardized scale. At some point before these are written to disk in the zarr they should be rescaled to the original data scale.
| if batch_idx == 0: | ||
| logger.info(f"Saving predictions to {zarr_output_path}") | ||
| ds_pred_batch.to_zarr(zarr_output_path, mode="w", consolidated=True) | ||
| else: | ||
| ds_pred_batch.to_zarr( | ||
| zarr_output_path, mode="a", append_dim="start_time" | ||
| ) |
There was a problem hiding this comment.
we might need to be a bit careful with the time encoding here. In my research branch I get errors like:
UserWarning: Times can't be serialized faithfully to int64 with requested units 'days since 2020-02-12T01:20:00'. Serializing with units 'hours since 2020-02-12T01:20:00' instead. Set encoding['dtype'] to floating point dtype to serialize with units 'days since 2020-02-12T01:20:00'. Set encoding['units'] to 'hours since 2020-02-12T01:20:00' to silence this warning.
Basically, the first time step is saved as days, because it might be 00:00UTC, but consecutive steps might be in e.g. hours.
…ntics) - Multi-GPU safe: rank-0 writes metadata template (compute=False), then barrier(), then all ranks write via region='auto' - Correct start_time: analysis_time = batch_times[:,0] - step_length - Rescale predictions to original scale in test_step before writing - Drop raw 'time' coord; explicit int64 encoding for time variables - Cache WeatherDataset per (split, category) to avoid O(N*T) re-instantiation - Add --save-eval-to-zarr-path CLI flag - Add tests/test_zarr_eval.py single-GPU integration test Refs mllam#104 Part of mllam#138
…ntics) - Multi-GPU safe: rank-0 writes metadata template (compute=False), then barrier(), then all ranks write via region='auto' - Correct start_time: analysis_time = batch_times[:,0] - step_length - Rescale predictions to original scale in test_step before writing - Drop raw 'time' coord; explicit int64 encoding for time variables - Cache WeatherDataset per (split, category) to avoid O(N*T) re-instantiation - Add --save-eval-to-zarr-path CLI flag - Add tests/test_zarr_eval.py single-GPU integration test Refs mllam#104 Part of mllam#138
…ntics) - Multi-GPU safe: rank-0 writes metadata template (compute=False), then barrier(), then all ranks write via region='auto' - Correct start_time: analysis_time = batch_times[:,0] - step_length - Rescale predictions to original scale in test_step before writing - Drop raw 'time' coord; explicit int64 encoding for time variables - Cache WeatherDataset per (split, category) to avoid O(N*T) re-instantiation - Add --save-eval-to-zarr-path CLI flag - Add tests/test_zarr_eval.py single-GPU integration test Refs mllam#104 Part of mllam#138
|
I opened a clean follow-up PR that addresses all review items from #104:
New PR: #289 |
|
@leifdenby please answer in #289 if you want to continue working on this or if we should close this and move efforts to #289. |
|
@leifdenby @observingClouds is this superceded by #289? |
…ntics) - Multi-GPU safe: rank-0 writes metadata template (compute=False), then barrier(), then all ranks write via region='auto' - Correct start_time: analysis_time = batch_times[:,0] - step_length - Rescale predictions to original scale in test_step before writing - Drop raw 'time' coord; explicit int64 encoding for time variables - Cache WeatherDataset per (split, category) to avoid O(N*T) re-instantiation - Add --save-eval-to-zarr-path CLI flag - Add tests/test_zarr_eval.py single-GPU integration test Refs mllam#104 Part of mllam#138
…ntics) - Multi-GPU safe: rank-0 writes metadata template (compute=False), then barrier(), then all ranks write via region='auto' - Correct start_time: analysis_time = batch_times[:,0] - step_length - Rescale predictions to original scale in test_step before writing - Drop raw 'time' coord; explicit int64 encoding for time variables - Cache WeatherDataset per (split, category) to avoid O(N*T) re-instantiation - Add --save-eval-to-zarr-path CLI flag - Add tests/test_zarr_eval.py single-GPU integration test Refs mllam#104 Part of mllam#138
…ntics) - Multi-GPU safe: rank-0 writes metadata template (compute=False), then barrier(), then all ranks write via region='auto' - Correct start_time: analysis_time = batch_times[:,0] - step_length - Rescale predictions to original scale in test_step before writing - Drop raw 'time' coord; explicit int64 encoding for time variables - Cache WeatherDataset per (split, category) to avoid O(N*T) re-instantiation - Add --save-eval-to-zarr-path CLI flag - Add tests/test_zarr_eval.py single-GPU integration test Refs mllam#104 Part of mllam#138

Describe your changes
Adds new CLI flag to
neural_lam.train_modelcalled--save-eval-to-zarr-path <path-to-dataset>which can be added when running neural-lam inevalmode (i.e.neural_lan.train_model --eval ...) to write the predictions to a zarr dataset stored in<path-to-dataset>. This functionality is motivated by our want to be able to store model predictions for later verification.Example of usage:
Model trained with
$> pdm run python -m neural_lam.train_model --config_path tests/datastore_examples/mdp/danra_100m_winds/config.yaml --hidden_dim 2 --epochs 1 --ar_steps_train 3 --ar_steps_eval 3 --graph 1levelused for inference with
$> pdm run python -m neural_lam.train_model \ --config_path tests/datastore_examples/mdp/danra_100m_winds/config.yaml \ --hidden_dim 2 --epochs 1 --ar_steps_train 1 --ar_steps_eval 3 --eval val \ --load saved_models/train-graph_lam-4x2-01_24_17-2502/min_val_loss.ckpt \ --val_steps_to_log 3 --graph 1level --save-eval-to-zarr-path state_predictions.zarr/results in:
NB: This does not implement the inversion of the transformations that take place in
mllam-data-prep(e.g. splitting individual features back into separate variables and levels. Also, the zarr datasets store time as[start_time, elapsed_forecast_duration]rather than[start_time, sample]to avoid producing a large array with many empty-values (NaNs) which would otherwise happen because each sample has a different start time. In the snippet below I have demonstrated how one could return to absolute time (probably there is a better way to do this...):Example plot with time-axis showing elapsed time:

Example plot with time-axis showing absolute time:

This probably needs more work, but I think it is ready for people to try it out and let me know what they think 😄
Issue Link
Implements #89
Type of change
Checklist before requesting a review
pullwith--rebaseoption if possible).Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
reflecting type of change (add section where missing):
Checklist for assignee