
feat: temporal resampling for derived datasets (CLIM-679)#73

Open
turban wants to merge 2 commits into main from feat/temporal-resampling

Conversation

Contributor

@turban commented May 7, 2026

Summary

  • Introduces sync_kind: derived dataset type materialized by named processes rather than downloaded
  • Adds processing/ package with OGC-process-style route POST /processes/{process_id}/execution; the first process is resample, which aggregates a source dataset to a coarser temporal resolution (hourly→daily, daily→weekly/monthly) using xarray
  • Dataset registry now validates a processing: block (with process_id: resample) for derived datasets; ingestion: is only required for non-derived sync kinds
  • Adds derived YAML templates: chirps3 weekly + monthly (from daily), ERA5-Land daily temperature + precipitation (from hourly)
  • Supersedes PR #63 (Implement temporal resampling for derived managed datasets [CLIM-679]); incorporates the process registry design from issue #65 (YAML-defined process registry with OGC API Processes exposure)
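The temporal aggregation the resample process performs can be sketched in a few lines. This is a minimal illustration, not the PR's code: the real implementation operates on xarray Datasets, but pandas is used here because xarray's `.resample()` delegates to the same pandas frequency aliases.

```python
import pandas as pd

# 48 hourly steps of constant value 1.0, starting 2026-01-01 00:00.
hourly = pd.Series(
    1.0,
    index=pd.date_range("2026-01-01", periods=48, freq="h", name="time"),
)

# Hourly -> daily, with the aggregation method chosen per variable:
daily_sum = hourly.resample("1D").sum()    # e.g. precipitation: sum over each day
daily_mean = hourly.resample("1D").mean()  # e.g. temperature: mean over each day

print(len(daily_sum))     # 2 — two full days from 48 hourly steps
print(daily_sum.iloc[0])  # 24.0 — 24 one-valued hourly steps per day
```

The same call with `"W-MON"` or `"MS"` yields the daily→weekly/monthly cases listed above.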

Key design decisions (vs PR #63)

  • Route is POST /processes/{process_id}/execution instead of POST /resample — aligns with OGC API Processes style and makes the registry extensible to future process IDs
  • YAML key is processing: {process_id: resample, ...} instead of resample: {...} — separates process identity from parameters
  • DERIVED_DATA_DIR uses CACHE_OVERRIDE/XDG_DATA_HOME env pattern (consistent with artifacts and pygeoapi dirs) instead of __file__-relative path arithmetic
  • _coerce_numpy_datetime is shared from shared/time.py rather than duplicated
  • Restored XDG/CACHE_OVERRIDE path resolution for artifacts_dir and pygeoapi_dir that were lost when bringing files from the temporal-resampling branch
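The CACHE_OVERRIDE/XDG_DATA_HOME precedence mentioned above might look roughly like the following. This is a hypothetical sketch: the helper name, the fallback default, and the subdirectory names are assumptions, not the PR's actual code.

```python
import os
from pathlib import Path


def derived_data_dir() -> Path:
    """Resolve DERIVED_DATA_DIR via env vars instead of __file__-relative
    path arithmetic (hypothetical sketch of the pattern described above)."""
    override = os.environ.get("CACHE_OVERRIDE")
    if override:
        # Explicit override wins, e.g. for tests or containerized deployments.
        return Path(override) / "derived"
    # Otherwise follow the XDG base-directory convention.
    xdg = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    return Path(xdg) / "climate_api" / "derived"


os.environ["CACHE_OVERRIDE"] = "/tmp/cache"
print(derived_data_dir().as_posix())  # /tmp/cache/derived
```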

Test plan

  • uv run pytest -q — 174 passed, 1 skipped (all green)
  • POST /processes/resample/execution with a valid derived dataset returns 200 with status: "completed"
  • POST /processes/resample/execution with a non-derived dataset returns 400
  • POST /processes/resample/execution with unknown dataset returns 404
  • POST /processes/unknown/execution returns 404
  • Dataset registry loads chirps3 weekly/monthly and ERA5-Land daily templates without errors
  • Derived dataset templates without processing: block are rejected with clear error message

Introduces a new sync_kind=derived dataset type with a processing registry
design (issue #65). Derived datasets are materialized by named processes
(process_id: resample) rather than downloaded. The resample process
aggregates an existing managed source dataset to a coarser temporal
resolution (daily, weekly, monthly) using xarray resampling.

Key changes:
- processing/ package with resample logic, OGC-process-style route
  POST /processes/{process_id}/execution, schemas, and services
- dataset registry validates processing: block for derived sync_kind;
  ingestion: block is now skipped for derived datasets
- SyncKind.DERIVED added to schemas; sync engine returns NOT_SYNCABLE for
  derived datasets
- YAML templates for chirps3 weekly/monthly and ERA5-Land daily derived
  datasets added under src/climate_api/data/datasets/
- Restores XDG/CACHE_OVERRIDE path resolution for artifacts dir
  (ingestions/services.py) and pygeoapi dir (publications/services.py)
  that were overwritten by the temporal-resampling branch
- sync_engine: datetime.min.time() replaced with time(0)
- resample.py: uses _coerce_numpy_datetime from shared/time instead of
  local duplicate; DERIVED_DATA_DIR uses XDG/CACHE_OVERRIDE pattern
- Supersedes PR #63; incorporates process registry design from issue #65
Comment thread on src/climate_api/ingestions/sync_engine.py (outdated)
Member

@abyot left a comment


We dropped cache_info because it had become the wrong abstraction.

Contributor Author

@turban commented May 8, 2026

Design note: decouple resampling from YAML templates

The runtime POST /processes/resample/execution endpoint is the right approach. One remaining coupling worth addressing before merge: the endpoint takes a dataset_id that must reference a pre-defined YAML template (e.g. chirps3_precipitation_weekly). Adding bi-weekly, dekadal, or any other period type still requires a new YAML entry.

Proposed change

Pass resample parameters directly in the request body — no template needed:

POST /processes/resample/execution
{
  "source_dataset_id": "chirps3_precipitation_daily",
  "period_type": "weekly",
  "method": "sum",
  "start": "2026-W01",
  "end": "2026-W10",
  "extent_id": "sle",
  "publish": true
}

The derived artifact ID is auto-generated from the source + parameters (e.g. chirps3_precipitation_daily_weekly_sum). No YAML entry needed at all.
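Auto-generating the artifact ID from the source and parameters could be as simple as a join. The helper name below is hypothetical; it only mirrors the naming shown in the example (chirps3_precipitation_daily + weekly + sum).

```python
def derived_artifact_id(source_dataset_id: str, period_type: str, method: str) -> str:
    # Hypothetical helper: deterministic ID from source + resample parameters,
    # so repeated executions with the same inputs hit the artifact cache.
    return f"{source_dataset_id}_{period_type}_{method}"


print(derived_artifact_id("chirps3_precipitation_daily", "weekly", "sum"))
# chirps3_precipitation_daily_weekly_sum
```

A deterministic ID is what makes the artifact caching / reuse check work without any registry entry.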

What this removes

  • The sync_kind: derived and processing: blocks from YAML templates — source datasets only
  • SUPPORTED_RESAMPLE_METHODS and derived-template validation in the registry
  • The pre-defined derived templates in chirps3.yaml and era5_land.yaml
  • The NOT_SYNCABLE special case for derived sync kinds (no derived sync state to track)

What stays the same

  • The resampling logic in processing/resample.py — this is solid and reusable
  • The store_materialized_zarr_artifact() path
  • The artifact caching / reuse check
  • The weekly period format support in shared/time.py

Sync coupling

When a source artifact syncs, the derived artifacts built from it become stale. With this design, the client is responsible for re-triggering resampling after a source sync — or the sync engine can fan out to known derived artifacts by querying what has been materialized from that source. Either way, it's cleaner than a YAML-defined sync policy.

Contributor Author

@turban commented May 8, 2026

earthkit-transforms as a resampling backend?

Looked into earthkit-transforms (Apache 2.0, maintained by ECMWF, v0.5.4 released March 2026 with a 1.0 imminent). The short answer: it won't meaningfully simplify the resampling logic here — it's a thin convenience wrapper over xr.Dataset.resample() and inherits the same behaviour and limitations.

What it adds over bare xarray:

  • Named daily/monthly helpers (daily_mean, monthly_sum, etc.) with a time_shift option for local-time aggregation
  • remove_partial_periods=True, but only when time_shift is set — no general incomplete-period guard
  • deaccumulate — duplicates climate_api.transforms.deaccumulate_era5

None of these address the actual flexibility bottleneck in the branch.

Where the real constraint is

_resample_frequency() and _PERIOD_ORDER are what limit the supported period types:

_PERIOD_ORDER = {"hourly": 0, "daily": 1, "weekly": 2, "monthly": 3, "yearly": 4}

def _resample_frequency(*, target_period_type: str, week_start: str) -> str:
    if target_period_type == "daily":   return "1D"
    if target_period_type == "weekly":  return "W-MON" if week_start == "monday" else "W-SUN"
    if target_period_type == "monthly": return "MS"
    if target_period_type == "yearly":  return "YS"
    raise HTTPException(...)

Bi-weekly, dekadal, and every other unlisted period type fall through to a 400. _PERIOD_ORDER rejects them before the resampling call is even reached.

A more flexible approach

Expose the pandas offset alias directly in the request body instead of mapping through named period types:

{
  "source_dataset_id": "chirps3_precipitation_daily",
  "frequency": "2W",
  "method": "sum",
  "start": "2026-01-01",
  "end": "2026-03-01"
}

This removes _resample_frequency(), _PERIOD_ORDER, and the need to enumerate period types anywhere. xarray (and earthkit-transforms, if you want to adopt it for other reasons) both accept arbitrary pandas frequency strings, so bi-weekly ("2W"), dekadal ("10D"), seasonal ("QS"), or anything else works without code changes.
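Validating a free-form `frequency` field is straightforward with pandas itself. A sketch of one option (not the PR's code): let `to_offset()` accept or reject the alias, which covers exactly the strings xarray's `.resample()` would accept.

```python
from pandas.tseries.frequencies import to_offset

# Arbitrary pandas offset aliases cover the period types the named mapping
# could not: bi-weekly, dekadal, seasonal — no per-type code needed.
for alias in ["2W", "10D", "QS", "W-MON", "MS"]:
    offset = to_offset(alias)  # raises ValueError for an invalid alias
    print(alias, "->", repr(offset))
```

A request handler could call `to_offset()` on the incoming `frequency` and map a `ValueError` to a 400, replacing both `_resample_frequency()` and `_PERIOD_ORDER`.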

The incomplete-edge-period logic in _drop_incomplete_edge_periods() already works at the datetime level rather than the named-period level, so it would survive this change largely intact.
