feat: replace convert_units and pre_process with a unified transforms pipeline #79

@turban

Problem

The current codebase has two separate, ad-hoc mechanisms for transforming downloaded data before the GeoZarr is written: the convert_units and pre_process fields in the dataset YAML.

Neither supports multiple ordered transformations, and adding new conversions requires changing core code rather than configuration.

Proposed solution

Replace both fields with a single transforms list in the dataset YAML, using the same dotted-path callable pattern already used by ingestion.function:

```yaml
# era5_land temperature
transforms:
  - function: climate_api.transforms.convert_units

# era5_land precipitation
transforms:
  - function: climate_api.transforms.deaccumulate_era5
  - function: climate_api.transforms.convert_units
```

Each callable has the signature (ds: xr.Dataset, dataset: dict[str, Any]) -> xr.Dataset and is resolved at runtime via importlib, exactly like download functions. convert_units reads the existing units/convert_units fields (kept for STAC metadata). Transforms from external packages (e.g. dhis2eo) are supported without any changes to core code.
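A minimal sketch of how the pipeline could resolve and run these entries. The helper names `resolve_callable` and `run_transforms` are hypothetical, not existing functions in the codebase, and `ds` is typed as `Any` here only so the snippet runs without xarray installed; in the project it would be an `xr.Dataset`.

```python
import importlib
from typing import Any, Callable

# (ds, dataset) -> ds; ds would be an xr.Dataset in the real project.
Transform = Callable[[Any, dict[str, Any]], Any]


def resolve_callable(dotted_path: str) -> Transform:
    """Import 'pkg.module.func' and return the function object."""
    module_path, _, func_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, func_name)


def run_transforms(ds: Any, dataset: dict[str, Any]) -> Any:
    """Apply each entry of dataset['transforms'] in the order listed."""
    for entry in dataset.get("transforms", []):
        transform = resolve_callable(entry["function"])
        ds = transform(ds, dataset)
    return ds
```

Because each step receives the output of the previous one, the order in the YAML list is the order of execution, which is why deaccumulate_era5 is listed before convert_units for precipitation.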

Changes required

  • Add src/climate_api/transforms/ module with at least convert_units (replacing _UNIT_CONVERSIONS) and a placeholder/implementation for deaccumulate_era5
  • Update build_dataset_zarr() in downloader.py to run the transforms pipeline instead of calling _apply_unit_conversion() directly
  • Update era5_land.yaml to use transforms: entries, removing pre_process and keeping convert_units/units for STAC metadata only
  • Remove the hardcoded _UNIT_CONVERSIONS dict and _apply_unit_conversion() from downloader.py
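As a rough illustration of the first item, convert_units could be a table-driven transform with the pipeline signature. This is a sketch under assumptions: the registry name `UNIT_CONVERSIONS`, its keys, and the reading of `units` as the source unit and `convert_units` as the target unit are all hypothetical. It relies only on scalar arithmetic, which `xr.Dataset` also supports, so `ds` is typed as `Any`.

```python
from typing import Any, Callable

# Hypothetical registry keyed by (source unit, target unit); the real table
# and its keys may differ.
UNIT_CONVERSIONS: dict[tuple[str, str], Callable[[Any], Any]] = {
    ("K", "degC"): lambda x: x - 273.15,
    ("m", "mm"): lambda x: x * 1000.0,
}


def convert_units(ds: Any, dataset: dict[str, Any]) -> Any:
    """Convert ds from dataset['units'] to dataset['convert_units'] (assumed meaning)."""
    source = dataset.get("units")
    target = dataset.get("convert_units")
    if not target or source == target:
        return ds  # nothing to convert
    try:
        convert = UNIT_CONVERSIONS[(source, target)]
    except KeyError:
        raise ValueError(
            f"no conversion registered from {source!r} to {target!r}"
        ) from None
    return convert(ds)
```

Keeping the table inside the transforms module, rather than in downloader.py, means a new conversion is added next to the transform that uses it and the downloader does not need to change.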
