Background
Each sync run currently rebuilds the entire Zarr store from all NetCDF files in the download directory, even when only one or two new months arrived. For datasets with long histories (CHIRPS3 over several years, ERA5-Land hourly) this means re-reading and re-writing data that hasn't changed.
The architecture already has everything needed to avoid this.
What we have
download_dataset() in downloader.py already computes changed_files — the list of NetCDF files that were newly written or modified in the current sync run — by diffing directory mtimes before and after the eo_function call. This list is returned to the ingestion layer but currently not used for anything other than logging.
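The mtime diff can be sketched roughly like this (a simplified illustration; snapshot_mtimes and diff_changed are hypothetical names, not the actual downloader.py functions):

```python
from pathlib import Path


def snapshot_mtimes(directory: Path) -> dict[str, float]:
    # Record the last-modified time of every NetCDF file in the directory.
    return {p.name: p.stat().st_mtime for p in directory.glob("*.nc")}


def diff_changed(before: dict[str, float], after: dict[str, float]) -> list[str]:
    # A file counts as changed if it is new, or if its mtime moved forward.
    return sorted(
        name
        for name, mtime in after.items()
        if name not in before or mtime > before[name]
    )
```

One snapshot is taken before the eo_function call and one after; the diff between them is what becomes changed_files.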
xr.open_mfdataset can open just a subset of NetCDF files, and to_zarr supports append_dim="time", which allows new time slices to be written into an existing Zarr store without touching the data already there.
Proposed change
Instead of always rebuilding the full Zarr from scratch, build_dataset_zarr should:
- If no Zarr store exists yet — write it in full as today.
- If a Zarr store already exists and changed_files is non-empty — open only the changed NetCDF files, select the time range they cover, and append to the existing store with to_zarr(..., append_dim="time").
This keeps sync time proportional to how much is new, not how much exists.
Why not switch eo_functions to return xarray.Dataset instead
An alternative design would have dhis2eo download functions return an xr.Dataset directly instead of writing NetCDF files, removing the intermediate disk step. This was evaluated and rejected for three reasons:
- Restartability. ERA5-Land and CHIRPS3 downloads span many months, each a separate slow API request. The file-per-month pattern means a crash resumes from where it left off. Returning a Dataset means the entire period must be in memory before anything is saved — a crash loses everything.
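The resume behaviour falls out of the file layout almost for free; a sketch, assuming one NetCDF file per month named like 2024-01.nc (the helper name and naming scheme are assumptions, not the dhis2eo convention):

```python
from pathlib import Path


def months_to_fetch(out_dir: Path, months: list[str]) -> list[str]:
    # After a crash, only months without a file on disk are re-requested;
    # already-downloaded months are left untouched.
    return [m for m in months if not (out_dir / f"{m}.nc").exists()]
```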
- Memory. Accumulating all months via xr.concat before returning can easily reach tens of GB for ERA5-Land hourly at country scale.
- Background task compatibility. FastAPI's BackgroundTasks discards return values. Capturing a returned Dataset from a background task would require replacing BackgroundTasks with a proper task queue (Celery, ARQ, etc.).
The file-based contract is the right boundary between dhis2eo and the climate-api. The incremental append improvement fits naturally within it.
Related
The changed_files return value from download_dataset() should also become part of the agreed interface with dhis2eo — currently the climate-api reconstructs it from mtime diffs rather than trusting the return value of download(). Aligning on this would give both sides a cleaner contract.