Background
Each sync run currently rebuilds the entire Zarr store from all NetCDF files in the download directory, even when only one or two new months arrived. For datasets with long histories (CHIRPS3 over several years, ERA5-Land hourly) this means re-reading and re-writing data that hasn't changed.
The architecture already has everything needed to avoid this.
What we have
download_dataset() in downloader.py already computes changed_files — the list of NetCDF files that were newly written or modified in the current sync run — by diffing directory mtimes before and after the eo_function call. This list is returned to the ingestion layer but currently not used for anything other than logging.
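The mtime diff can be sketched roughly like this (a simplified illustration; snapshot_mtimes and diff_changed are hypothetical names, not the actual downloader.py functions):

```python
from pathlib import Path


def snapshot_mtimes(directory: Path) -> dict[str, float]:
    # Record the last-modified time of every NetCDF file in the directory.
    return {p.name: p.stat().st_mtime for p in directory.glob("*.nc")}


def diff_changed(before: dict[str, float], after: dict[str, float]) -> list[str]:
    # A file counts as changed if it is new, or if its mtime moved forward.
    return sorted(
        name
        for name, mtime in after.items()
        if name not in before or mtime > before[name]
    )
```

One snapshot is taken before the eo_function call and one after; the diff between them is what becomes changed_files.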
xr.open_mfdataset can open just a subset of NetCDF files, and to_zarr supports append_dim="time", which allows new time slices to be written into an existing Zarr store without touching the data already there.
Proposed change
Instead of always rebuilding the full Zarr from scratch, build_dataset_zarr should:
- If no Zarr store exists yet — write it in full as today.
- If a Zarr store already exists and changed_files is non-empty — open only the changed NetCDF files, select the time range they cover, and append to the existing store with to_zarr(..., append_dim="time").
This keeps sync time proportional to how much is new, not how much exists.
Why not switch eo_functions to return xarray.Dataset instead
An alternative design would have dhis2eo download functions return an xr.Dataset directly instead of writing NetCDF files, removing the intermediate disk step. This was evaluated and rejected for three reasons:
- Restartability. ERA5-Land and CHIRPS3 downloads span many months, each a separate slow API request. The file-per-month pattern means a crash resumes from where it left off. Returning a Dataset means the entire period must be in memory before anything is saved — a crash loses everything.
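The resume behaviour falls out of the file layout almost for free; a sketch, assuming one NetCDF file per month named like 2024-01.nc (the helper name and naming scheme are assumptions, not the dhis2eo convention):

```python
from pathlib import Path


def months_to_fetch(out_dir: Path, months: list[str]) -> list[str]:
    # After a crash, only months without a file on disk are re-requested;
    # already-downloaded months are left untouched.
    return [m for m in months if not (out_dir / f"{m}.nc").exists()]
```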
- Memory. Accumulating all months via xr.concat before returning can easily reach tens of GB for ERA5-Land hourly at country scale.
- Background task compatibility. FastAPI's BackgroundTasks discards return values. Capturing a returned Dataset from a background task would require replacing BackgroundTasks with a proper task queue (Celery, ARQ, etc.).
The file-based contract is the right boundary between dhis2eo and the climate-api. The incremental append improvement fits naturally within it.
Related
The changed_files return value from download_dataset() should also become part of the agreed interface with dhis2eo — currently the climate-api reconstructs it from mtime diffs rather than trusting the return value of download(). Aligning on this would give both sides a cleaner contract.