# feat: temporal resampling for derived datasets (CLIM-679) #73
Introduces a new `sync_kind=derived` dataset type with a processing registry design (issue #65). Derived datasets are materialized by named processes (`process_id: resample`) rather than downloaded. The `resample` process aggregates an existing managed source dataset to a coarser temporal resolution (daily, weekly, monthly) using xarray resampling.

Key changes:

- `processing/` package with the resample logic, OGC-process-style route `POST /processes/{process_id}/execution`, schemas, and services
- dataset registry validates the `processing:` block for the derived sync kind; the `ingestion:` block is now skipped for derived datasets
- `SyncKind.DERIVED` added to schemas; the sync engine returns `NOT_SYNCABLE` for derived datasets
- YAML templates for chirps3 weekly/monthly and ERA5-Land daily derived datasets added under `src/climate_api/data/datasets/`
- Restores XDG/CACHE_OVERRIDE path resolution for the artifacts dir (`ingestions/services.py`) and the pygeoapi dir (`publications/services.py`) that were overwritten by the temporal-resampling branch
- sync_engine: `datetime.min.time()` replaced with `time(0)`
- resample.py: uses `_coerce_numpy_datetime` from `shared/time` instead of a local duplicate; `DERIVED_DATA_DIR` uses the XDG/CACHE_OVERRIDE pattern
- Supersedes PR #63; incorporates the process registry design from issue #65
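For orientation, a derived-dataset template might look roughly like the sketch below. Only `sync_kind: derived`, the `processing:` block, and `process_id: resample` are attested in this PR; every other field name is a guess for illustration.

```yaml
# Hypothetical sketch of a derived-dataset YAML template. Field names other
# than sync_kind, processing, and process_id are assumed, not copied from the PR.
id: chirps3_precipitation_weekly
sync_kind: derived
processing:
  process_id: resample
  source_dataset_id: chirps3_precipitation_daily
  period_type: weekly
  method: sum
# note: no ingestion block, since ingestion is skipped for derived datasets
```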
**abyot** left a comment:
We dropped `cache_info` because it had become the wrong abstraction.
**Design note: decouple resampling from YAML templates**

The runtime …

**Proposed change**

Pass resample parameters directly in the request body — no template needed:

```
POST /processes/resample/execution

{
  "source_dataset_id": "chirps3_precipitation_daily",
  "period_type": "weekly",
  "method": "sum",
  "start": "2026-W01",
  "end": "2026-W10",
  "extent_id": "sle",
  "publish": true
}
```

The derived artifact ID is auto-generated from the source + parameters (e.g. …).

**What this removes**
**What stays the same**
**Sync coupling**

When a source artifact syncs, the derived artifacts built from it become stale. With this design, the client is responsible for re-triggering resampling after a source sync — or the sync engine can fan out to known derived artifacts by querying what has been materialized from that source. Either way, it's cleaner than a YAML-defined sync policy.
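The fan-out option could look roughly like the following. This is a minimal in-memory sketch with hypothetical names (`record_materialization`, `fan_out_after_sync`), not something the PR implements:

```python
# Minimal sketch of sync fan-out, assuming a simple in-memory record of which
# derived artifacts were materialized from each source (all names hypothetical).
from collections import defaultdict
from typing import Callable

_materialized: dict[str, set[str]] = defaultdict(set)

def record_materialization(source_dataset_id: str, derived_artifact_id: str) -> None:
    """Called whenever the resample process writes a derived artifact."""
    _materialized[source_dataset_id].add(derived_artifact_id)

def fan_out_after_sync(source_dataset_id: str, retrigger: Callable[[str], None]) -> list[str]:
    """After a source sync, re-trigger resampling for every known derived artifact."""
    stale = sorted(_materialized[source_dataset_id])
    for derived_id in stale:
        retrigger(derived_id)
    return stale
```

In this sketch, `record_materialization` would be called by the resample process after writing an artifact, and `fan_out_after_sync` by the sync engine after a successful source sync.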
**earthkit-transforms as a resampling backend?**

Looked into `earthkit-transforms`. What it adds over bare xarray:

…
None of these address the actual flexibility bottleneck in the branch.

**Where the real constraint is**
```python
_PERIOD_ORDER = {"hourly": 0, "daily": 1, "weekly": 2, "monthly": 3, "yearly": 4}

def _resample_frequency(*, target_period_type: str, week_start: str) -> str:
    if target_period_type == "daily":
        return "1D"
    if target_period_type == "weekly":
        return "W-MON" if week_start == "monday" else "W-SUN"
    if target_period_type == "monthly":
        return "MS"
    if target_period_type == "yearly":
        return "YS"
    raise HTTPException(...)
```

Bi-weekly, dekadal, and any other period type falls through to a 400.

**A more flexible approach**

Expose the pandas offset alias directly in the request body instead of mapping through named period types:

```json
{
  "source_dataset_id": "chirps3_precipitation_daily",
  "frequency": "2W",
  "method": "sum",
  "start": "2026-01-01",
  "end": "2026-03-01"
}
```

This removes …

The incomplete-edge-period logic in …
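As a sanity check on this proposal, pandas accepts arbitrary offset aliases directly, so a request-supplied frequency like `2W` needs no named-period mapping at all. A minimal sketch with synthetic daily data (not the branch's code):

```python
# Demonstrate that pandas resampling accepts offset aliases directly, so a
# request-supplied "frequency" string needs no period-type mapping. Synthetic
# daily data stands in for a real source dataset.
import pandas as pd

daily = pd.Series(
    1.0,
    index=pd.date_range("2026-01-01", "2026-03-01", freq="D", name="time"),
)

biweekly_sum = daily.resample("2W").sum()    # "2W" works with no special-casing
weekly_sum = daily.resample("W-MON").sum()   # anchored weeks still work too

# Aggregation preserves the total: every daily value lands in exactly one bin.
assert biweekly_sum.sum() == len(daily) == weekly_sum.sum()
```

The same offset aliases work for `xarray.Dataset.resample(time=...)`, which delegates frequency parsing to pandas.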
**Summary**
- New `sync_kind: derived` dataset type, materialized by named processes rather than downloaded
- New `processing/` package with OGC-process-style route `POST /processes/{process_id}/execution`; the first process is `resample`, which aggregates a source dataset to a coarser temporal resolution (hourly→daily, daily→weekly/monthly) using xarray
- Dataset registry validates the `processing:` block (with `process_id: resample`) for derived datasets; `ingestion:` is only required for non-derived sync kinds

**Key design decisions (vs PR #63)**
- `POST /processes/{process_id}/execution` instead of `POST /resample` — aligns with OGC API Processes style and makes the registry extensible to future process IDs
- `processing: {process_id: resample, ...}` instead of `resample: {...}` — separates process identity from parameters
- `DERIVED_DATA_DIR` uses the `CACHE_OVERRIDE`/`XDG_DATA_HOME` env pattern (consistent with the artifacts and pygeoapi dirs) instead of `__file__`-relative path arithmetic
- `_coerce_numpy_datetime` is shared from `shared/time.py` rather than duplicated
- Restores `artifacts_dir` and `pygeoapi_dir` resolution that was lost when bringing files from the temporal-resampling branch

**Test plan**

- `uv run pytest -q` — 174 passed, 1 skipped (all green)
- `POST /processes/resample/execution` with a valid derived dataset returns 200 with `status: "completed"`
- `POST /processes/resample/execution` with a non-derived dataset returns 400
- `POST /processes/resample/execution` with an unknown dataset returns 404
- `POST /processes/unknown/execution` returns 404
- Derived datasets without a `processing:` block are rejected with a clear error message
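The extensible-registry decision above can be illustrated with a minimal sketch. All names here (`PROCESS_REGISTRY`, `register_process`, `execute`) are hypothetical, not the PR's actual code; the real route would translate a missing `process_id` into a 404 response:

```python
# Minimal sketch of a process registry keyed by process_id. All names are
# hypothetical; the PR's real registry lives in the processing/ package.
from typing import Any, Callable

ProcessFn = Callable[..., dict[str, Any]]
PROCESS_REGISTRY: dict[str, ProcessFn] = {}

def register_process(process_id: str) -> Callable[[ProcessFn], ProcessFn]:
    """Decorator registering a process under its ID, keeping the route generic."""
    def decorator(fn: ProcessFn) -> ProcessFn:
        PROCESS_REGISTRY[process_id] = fn
        return fn
    return decorator

@register_process("resample")
def run_resample(**params: Any) -> dict[str, Any]:
    # Stand-in for the real xarray-based resampling.
    return {"status": "completed", "params": params}

def execute(process_id: str, params: dict[str, Any]) -> dict[str, Any]:
    """What POST /processes/{process_id}/execution would dispatch to."""
    if process_id not in PROCESS_REGISTRY:
        raise LookupError(f"unknown process: {process_id}")  # route returns 404
    return PROCESS_REGISTRY[process_id](**params)
```

Adding a second process is then just another `@register_process(...)` function; the route itself never changes.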