
feat: temporal resampling for derived datasets (CLIM-679)#73

Open
turban wants to merge 2 commits into main from feat/temporal-resampling

Conversation

Contributor

@turban commented May 7, 2026

Summary

  • Introduces sync_kind: derived dataset type materialized by named processes rather than downloaded
  • Adds processing/ package with OGC-process-style route POST /processes/{process_id}/execution; the first process is resample, which aggregates a source dataset to a coarser temporal resolution (hourly→daily, daily→weekly/monthly) using xarray
  • Dataset registry now validates a processing: block (with process_id: resample) for derived datasets; ingestion: is only required for non-derived sync kinds
  • Adds derived YAML templates: chirps3 weekly + monthly (from daily), ERA5-Land daily temperature + precipitation (from hourly)
  • Supersedes PR #63 (Implement temporal resampling for derived managed datasets [CLIM-679]); incorporates the process registry design from issue #65 (YAML-defined process registry with OGC API Processes exposure)
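The temporal aggregation the resample process performs can be sketched in a few lines. This is a minimal illustration, not the PR's code: the real implementation operates on xarray Datasets, but pandas is used here because xarray's `.resample()` delegates to the same pandas frequency aliases.

```python
import pandas as pd

# 48 hourly steps of constant value 1.0, starting 2026-01-01 00:00.
hourly = pd.Series(
    1.0,
    index=pd.date_range("2026-01-01", periods=48, freq="h", name="time"),
)

# Hourly -> daily, with the aggregation method chosen per variable:
daily_sum = hourly.resample("1D").sum()    # e.g. precipitation: sum over each day
daily_mean = hourly.resample("1D").mean()  # e.g. temperature: mean over each day

print(len(daily_sum))     # 2 — two full days from 48 hourly steps
print(daily_sum.iloc[0])  # 24.0 — 24 one-valued hourly steps per day
```

The same call with `"W-MON"` or `"MS"` yields the daily→weekly/monthly cases listed above.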

Key design decisions (vs PR #63)

  • Route is POST /processes/{process_id}/execution instead of POST /resample — aligns with OGC API Processes style and makes the registry extensible to future process IDs
  • YAML key is processing: {process_id: resample, ...} instead of resample: {...} — separates process identity from parameters
  • DERIVED_DATA_DIR uses CACHE_OVERRIDE/XDG_DATA_HOME env pattern (consistent with artifacts and pygeoapi dirs) instead of __file__-relative path arithmetic
  • _coerce_numpy_datetime is shared from shared/time.py rather than duplicated
  • Restored XDG/CACHE_OVERRIDE path resolution for artifacts_dir and pygeoapi_dir that were lost when bringing files from the temporal-resampling branch
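The CACHE_OVERRIDE/XDG_DATA_HOME precedence mentioned above might look roughly like the following. This is a hypothetical sketch: the helper name, the fallback default, and the subdirectory names are assumptions, not the PR's actual code.

```python
import os
from pathlib import Path


def derived_data_dir() -> Path:
    """Resolve DERIVED_DATA_DIR via env vars instead of __file__-relative
    path arithmetic (hypothetical sketch of the pattern described above)."""
    override = os.environ.get("CACHE_OVERRIDE")
    if override:
        # Explicit override wins, e.g. for tests or containerized deployments.
        return Path(override) / "derived"
    # Otherwise follow the XDG base-directory convention.
    xdg = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(xdg) / "climate_api" / "derived"


os.environ["CACHE_OVERRIDE"] = "/tmp/cache"
print(derived_data_dir().as_posix())  # /tmp/cache/derived
```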

Test plan

  • uv run pytest -q — 174 passed, 1 skipped (all green)
  • POST /processes/resample/execution with a valid derived dataset returns 200 with status: "completed"
  • POST /processes/resample/execution with a non-derived dataset returns 400
  • POST /processes/resample/execution with unknown dataset returns 404
  • POST /processes/unknown/execution returns 404
  • Dataset registry loads chirps3 weekly/monthly and ERA5-Land daily templates without errors
  • Derived dataset templates without processing: block are rejected with clear error message

Introduces a new sync_kind=derived dataset type with a processing registry
design (issue #65). Derived datasets are materialized by named processes
(process_id: resample) rather than downloaded. The resample process
aggregates an existing managed source dataset to a coarser temporal
resolution (daily, weekly, monthly) using xarray resampling.

Key changes:
- processing/ package with resample logic, OGC-process-style route
  POST /processes/{process_id}/execution, schemas, and services
- dataset registry validates processing: block for derived sync_kind;
  ingestion: block is now skipped for derived datasets
- SyncKind.DERIVED added to schemas; sync engine returns NOT_SYNCABLE for
  derived datasets
- YAML templates for chirps3 weekly/monthly and ERA5-Land daily derived
  datasets added under src/climate_api/data/datasets/
- Restores XDG/CACHE_OVERRIDE path resolution for artifacts dir
  (ingestions/services.py) and pygeoapi dir (publications/services.py)
  that were overwritten by the temporal-resampling branch
- sync_engine: datetime.min.time() replaced with time(0)
- resample.py: uses _coerce_numpy_datetime from shared/time instead of
  local duplicate; DERIVED_DATA_DIR uses XDG/CACHE_OVERRIDE pattern
- Supersedes PR #63; incorporates process registry design from issue #65
Comment thread on src/climate_api/ingestions/sync_engine.py (outdated)
Member

@abyot left a comment


We dropped cache_info because it had become the wrong abstraction.

Contributor Author

@turban commented May 8, 2026

Design note: decouple resampling from YAML templates

The runtime POST /processes/resample/execution endpoint is the right approach. One remaining coupling worth addressing before merge: the endpoint takes a dataset_id that must reference a pre-defined YAML template (e.g. chirps3_precipitation_weekly). Adding bi-weekly, dekadal, or any other period type still requires a new YAML entry.

Proposed change

Pass resample parameters directly in the request body — no template needed:

POST /processes/resample/execution
{
  "source_dataset_id": "chirps3_precipitation_daily",
  "period_type": "weekly",
  "method": "sum",
  "start": "2026-W01",
  "end": "2026-W10",
  "extent_id": "sle",
  "publish": true
}

The derived artifact ID is auto-generated from the source + parameters (e.g. chirps3_precipitation_daily_weekly_sum). No YAML entry needed at all.
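Auto-generating the artifact ID from the source and parameters could be as simple as a join. The helper name below is hypothetical; it only mirrors the naming shown in the example (chirps3_precipitation_daily + weekly + sum).

```python
def derived_artifact_id(source_dataset_id: str, period_type: str, method: str) -> str:
    # Hypothetical helper: deterministic ID from source + resample parameters,
    # so repeated executions with the same inputs hit the artifact cache.
    return f"{source_dataset_id}_{period_type}_{method}"


print(derived_artifact_id("chirps3_precipitation_daily", "weekly", "sum"))
# chirps3_precipitation_daily_weekly_sum
```

A deterministic ID is what makes the artifact caching / reuse check work without any registry entry.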

What this removes

  • The sync_kind: derived and processing: blocks from YAML templates — source datasets only
  • SUPPORTED_RESAMPLE_METHODS and derived-template validation in the registry
  • The pre-defined derived templates in chirps3.yaml and era5_land.yaml
  • The NOT_SYNCABLE special case for derived sync kinds (no derived sync state to track)

What stays the same

  • The resampling logic in processing/resample.py — this is solid and reusable
  • The store_materialized_zarr_artifact() path
  • The artifact caching / reuse check
  • The weekly period format support in shared/time.py

Sync coupling

When a source artifact syncs, the derived artifacts built from it become stale. With this design, the client is responsible for re-triggering resampling after a source sync — or the sync engine can fan out to known derived artifacts by querying what has been materialized from that source. Either way, it's cleaner than a YAML-defined sync policy.

Contributor Author

@turban commented May 8, 2026

earthkit-transforms as a resampling backend?

Looked into earthkit-transforms (Apache 2.0, maintained by ECMWF, v0.5.4 released March 2026 with a 1.0 imminent). The short answer: it won't meaningfully simplify the resampling logic here — it's a thin convenience wrapper over xr.Dataset.resample() and inherits the same behaviour and limitations.

What it adds over bare xarray:

  • Named daily/monthly helpers (daily_mean, monthly_sum, etc.) with a time_shift option for local-time aggregation
  • remove_partial_periods=True, but only when time_shift is set — no general incomplete-period guard
  • deaccumulate — duplicates climate_api.transforms.deaccumulate_era5

None of these address the actual flexibility bottleneck in the branch.

Where the real constraint is

_resample_frequency() and _PERIOD_ORDER are what limit the supported period types:

_PERIOD_ORDER = {"hourly": 0, "daily": 1, "weekly": 2, "monthly": 3, "yearly": 4}

def _resample_frequency(*, target_period_type: str, week_start: str) -> str:
    if target_period_type == "daily":   return "1D"
    if target_period_type == "weekly":  return "W-MON" if week_start == "monday" else "W-SUN"
    if target_period_type == "monthly": return "MS"
    if target_period_type == "yearly":  return "YS"
    raise HTTPException(...)

Bi-weekly, dekadal, and every other unlisted period type fall through to a 400. _PERIOD_ORDER rejects them before the resampling call is even reached.

A more flexible approach

Expose the pandas offset alias directly in the request body instead of mapping through named period types:

{
  "source_dataset_id": "chirps3_precipitation_daily",
  "frequency": "2W",
  "method": "sum",
  "start": "2026-01-01",
  "end": "2026-03-01"
}

This removes _resample_frequency(), _PERIOD_ORDER, and the need to enumerate period types anywhere. xarray (and earthkit-transforms, if you want to adopt it for other reasons) both accept arbitrary pandas frequency strings, so bi-weekly ("2W"), dekadal ("10D"), seasonal ("QS"), or anything else works without code changes.
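Validating a free-form `frequency` field is straightforward with pandas itself. A sketch of one option (not the PR's code): let `to_offset()` accept or reject the alias, which covers exactly the strings xarray's `.resample()` would accept.

```python
from pandas.tseries.frequencies import to_offset

# Arbitrary pandas offset aliases cover the period types the named mapping
# could not: bi-weekly, dekadal, seasonal — no per-type code needed.
for alias in ["2W", "10D", "QS", "W-MON", "MS"]:
    offset = to_offset(alias)  # raises ValueError for an invalid alias
    print(alias, "->", repr(offset))
```

A request handler could call `to_offset()` on the incoming `frequency` and map a `ValueError` to a 400, replacing both `_resample_frequency()` and `_PERIOD_ORDER`.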

The incomplete-edge-period logic in _drop_incomplete_edge_periods() already works at the datetime level rather than the named-period level, so it would survive this change largely intact.
