6 changes: 5 additions & 1 deletion docs/language/reference/builders/aggregates.md
Original file line number Diff line number Diff line change
@@ -16,6 +16,7 @@ Current aggregate authoring is explicit and scalar-expression-based.
| `avg` | `def avg(expr: ColumnExpr) -> AggregateMeasure` | Average one numeric scalar expression. |
| `min` | `def min(expr: ColumnExpr) -> AggregateMeasure` | Return the minimum non-null value for one orderable scalar expression. |
| `max` | `def max(expr: ColumnExpr) -> AggregateMeasure` | Return the maximum non-null value for one orderable scalar expression. |
| `approx_count_distinct` | `def approx_count_distinct(expr: ColumnExpr) -> AggregateMeasure` | Estimate distinct non-null expression values. |

## Modifiers

@@ -30,7 +31,7 @@ Aggregate measures support method-style modifiers:
## Example

```incan
-from pub::inql.functions import add, avg, col, count, count_distinct, count_expr, count_if, eq, lit, max, min, str_lit, sum
+from pub::inql.functions import add, approx_count_distinct, avg, col, count, count_distinct, count_expr, count_if, eq, lit, max, min, str_lit, sum

grouped = orders.group_by([col("customer_id")]).agg([
sum(add(col("amount"), lit(5))),
@@ -42,6 +43,7 @@ grouped = orders.group_by([col("customer_id")]).agg([
avg(col("amount")),
min(col("created_at")),
max(col("created_at")),
approx_count_distinct(col("user_id")),
])
```

@@ -54,5 +56,7 @@ grouped = orders.group_by([col("customer_id")]).agg([
- `count_if(predicate)` is compatibility sugar for `count().filter(predicate)`. Rows where the predicate is false or
null do not contribute to the aggregate.
- `sum`, `avg`, `min`, and `max` skip null values. They return backend-null results when no non-null input value exists.
- `approx_count_distinct(expr)` is approximate by contract, skips null values, allows aggregate-local filters, and rejects
an extra `distinct()` modifier because distinct estimation is already the helper's semantics.
- Unsupported aggregate modifiers fail at lowering or backend planning; they are not ignored.
- Future `.column` sugar and scoped aggregate symbols should lower to this same surface rather than replacing its semantics.
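
The modifier rules in the notes above can be pictured as a small policy table that is consulted before lowering. The following Python sketch is illustrative only: `ModifierPolicy`, `POLICIES`, and `check_modifiers` are hypothetical names, not InQL's registry API. It assumes only what the notes state, namely that `count` accepts both `filter` and `distinct`, while `approx_count_distinct` accepts `filter` and rejects `distinct`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModifierPolicy:
    """Hypothetical per-function record of which aggregate modifiers are legal."""
    allows_filter: bool
    allows_distinct: bool


# Assumed policy entries, taken from the notes above rather than from InQL's source.
POLICIES = {
    "count": ModifierPolicy(allows_filter=True, allows_distinct=True),
    "approx_count_distinct": ModifierPolicy(allows_filter=True, allows_distinct=False),
}


def check_modifiers(function_name, *, filtered=False, distinct=False):
    """Raise instead of silently ignoring a modifier the function does not support."""
    policy = POLICIES[function_name]
    if filtered and not policy.allows_filter:
        raise ValueError(f"{function_name} does not support filter()")
    if distinct and not policy.allows_distinct:
        raise ValueError(f"{function_name} does not support distinct()")
```

Failing loudly mirrors the stated rule that unsupported modifiers are rejected rather than ignored.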
31 changes: 31 additions & 0 deletions docs/language/reference/functions/approximate.md
@@ -0,0 +1,31 @@
# Approximate Functions (Reference)

Approximate helpers are explicit opt-in functions. InQL does not silently replace exact aggregates with approximate
execution merely because a backend supports it.

The currently implemented slice is one aggregate:

| Function | Meaning |
| --- | --- |
| `approx_count_distinct(expr)` | Estimate the number of distinct non-null values produced by one expression. |

```incan
from pub::inql.functions import approx_count_distinct, col

summary = (
events
.group_by([col("campaign_id")])
.agg([approx_count_distinct(col("user_id"))])
)
```

`approx_count_distinct` is registered as an approximate aggregate with HyperLogLog-family metadata. The portable author
contract is an approximate non-null distinct-count estimate; the first slice does not expose a user-tunable relative
error parameter because the standard Substrait mapping for this function is unary. Backend adapters must keep this
approximation visible in capability/error handling rather than redefining exact `count_distinct` semantics.
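
To make the "approximate non-null distinct-count estimate" contract concrete, here is a minimal Python sketch of a HyperLogLog-style estimator. It is a teaching aid under stated assumptions, not InQL's or DataFusion's implementation: the name `hll_estimate`, the `precision` parameter, and the use of SHA-256 as the hash are all choices made for this example.

```python
import hashlib
import math


def hll_estimate(values, precision=12):
    """Estimate the number of distinct non-null values with a basic HyperLogLog."""
    m = 1 << precision                # number of registers
    width = 64 - precision            # hash bits left after choosing a register
    registers = [0] * m
    for value in values:
        if value is None:             # nulls do not contribute to the estimate
            continue
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # use 64 bits of the hash
        index = h >> width                      # top bits select a register
        rest = h & ((1 << width) - 1)           # remaining low bits
        # Rank is the position of the leftmost 1-bit in `rest` (width + 1 if none).
        rank = width - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)            # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                # small-range correction
        return m * math.log(m / zeros)          # linear counting
    return raw
```

With `precision=12` the sketch uses 4096 registers, which gives a typical relative error of a few percent. The unary `approx_count_distinct(expr)` helper exposes no such knob, so a parameter like `precision` would have to be a backend-internal choice.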

The helper lowers through the standard Substrait `approx_count_distinct` aggregate extension name. The DataFusion
adapter maps that declaration to DataFusion's `approx_distinct` implementation name at the backend boundary.
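
The adapter-boundary rename described above can be pictured as a lookup applied only to the function declarations handed to the DataFusion consumer. In this Python sketch the table `DATAFUSION_FUNCTION_NAMES` and the function `rewrite_for_datafusion` are hypothetical names for illustration. Only the pair `approx_count_distinct` to `approx_distinct` comes from the text.

```python
# Hypothetical adapter-local rename table: portable Substrait name -> DataFusion name.
DATAFUSION_FUNCTION_NAMES = {
    "approx_count_distinct": "approx_distinct",
}


def rewrite_for_datafusion(declared_names):
    """Return the names a DataFusion consumer would see.

    The input list is left untouched, so the emitted plan metadata keeps the
    standard Substrait names.
    """
    return [DATAFUSION_FUNCTION_NAMES.get(name, name) for name in declared_names]
```

Returning a new list instead of mutating the input reflects the stated design: the portable plan stays on the standard name, and only the backend-facing copy is rewritten.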

Approximate percentile functions, sketch-state values, sketch serialization, and sketch merge/estimate helpers remain
future slices until their accuracy parameters, logical sketch types, and compatibility rules are explicit.
4 changes: 3 additions & 1 deletion docs/language/reference/functions/index.md
@@ -11,10 +11,11 @@ Today the concrete shipped surfaces are documented here:
- [Nested data functions](nested.md)
- [Window functions](windows.md)
- [Format functions](format.md)
- [Approximate functions](approximate.md)

The canonical scalar literal helper is `lit(...)`. Typed literal helpers construct the same scalar-expression representation.

The current registry-backed helper surface is registered in the package-owned function registry. Registry types live in `src/function_registry.incn`, the shared package registry lives in `src/functions/registry.incn`, and concrete public helper entries are produced by `function_registry.add(...)` decorators in individual `src/functions/<family>/<name>.incn` modules. The registry-backed families are references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, nested data, windows, and format helpers. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), RFC 024 policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked function signatures come from the public helper declaration, not from a second hand-written registry signature.
The current registry-backed helper surface is registered in the package-owned function registry. Registry types live in `src/function_registry.incn`, the shared package registry lives in `src/functions/registry.incn`, and concrete public helper entries are produced by `function_registry.add(...)` decorators in individual `src/functions/<family>/<name>.incn` modules. The registry-backed families are references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, nested data, windows, format helpers, and approximate aggregates. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), RFC 024 policy category, function class, null behavior, alias policy, aggregate modifier policy, approximation metadata, and Substrait mapping metadata. Checked function signatures come from the public helper declaration, not from a second hand-written registry signature.

The registry is the source for non-derivable machine facts. Public helper declarations are the source for argument names, argument types, and return types. Docstrings remain human-facing explanation, examples, and parameter intent. The `registry-metadata` check validates the checked API metadata projections produced from public facade aliases, registry decorators, and decorated callable signatures. Runtime registry entries are lazy and process-local: they support helper execution and lowering for loaded helpers, while the complete public catalog comes from checked metadata. This matters for generated docs, diagnostics, Prism lowering, and backend capability checks as the catalog grows.

Expand Down Expand Up @@ -42,5 +43,6 @@ The registered helper surface currently includes:
| `asc(...)`, `desc(...)`, `asc_nulls_first(...)`, `asc_nulls_last(...)`, `desc_nulls_first(...)`, `desc_nulls_last(...)` | ordering | structural sort-field helpers consumed by `order_by(...)` and lowered to Substrait `SortRel.sorts` |
| `sum(...)`, `count()`, `count_expr(...)`, `avg(...)`, `min(...)`, `max(...)` | aggregate | registered Substrait extension functions; `count_expr(...)` is a compatibility spelling for future `count(expr)` helper overloading |
| `count_distinct(...)`, `count_if(...)` | aggregate | compatibility helpers that lower through aggregate modifiers over canonical `count` semantics |
| `approx_count_distinct(...)` | aggregate | approximate aggregate that lowers through the standard Substrait `approx_count_distinct` extension and is adapted to DataFusion's `approx_distinct` implementation at the backend boundary |

Future ANSI-style families should grow under this section instead of bloating `dataset_types` or `dataset_methods`.
1 change: 1 addition & 0 deletions docs/release_notes/v0_1.md
@@ -19,6 +19,7 @@ Entries will be filled in as work lands (link RFCs and PRs when applicable).
- **Generator functions:** RFC 021 adds registry-backed generator applications for `explode(...)`, `explode_outer(...)`, `posexplode(...)`, and `posexplode_outer(...)`. Generators remain relation-shaping operations applied with `generate(...)`; they preserve input columns, require explicit output aliases, and lower through the current Substrait extension-relation gap encoding.
- **Window functions:** RFC 019 adds the first window-function planning slice with `window()` specs, `row_number()`, `rank()`, `dense_rank()`, and `with_window_column(...)`. Ranking windows require explicit ordering and lower through Substrait `ConsistentPartitionWindowRel`; backend execution support remains a separate adapter capability.
- **Format functions:** RFC 022 adds the first deterministic hashing slice with `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, and `sha2(...)`. Hash helpers operate on UTF-8 string bytes, return lowercase hexadecimal strings, lower through registry-owned Substrait metadata, and execute through the DataFusion-backed Session path.
- **Approximate functions:** RFC 023 adds the first approximate aggregate slice with `approx_count_distinct(...)`. The helper is opt-in, marked approximate in registry metadata, lowers through the standard Substrait `approx_count_distinct` aggregate extension name, and executes through the DataFusion-backed Session path via an adapter-local mapping to DataFusion's `approx_distinct` implementation.
- **Function registry:** RFC 014 adds declaration-site registry decorators for the current public helper surface, including stable function references, checked signature projection, lifecycle metadata, behavior categories, alias policy, Substrait mapping categories, and checked API metadata drift validation.
- **Function extension policy:** RFC 024 policy metadata now distinguishes portable core functions, namespaced extension-only functions, opt-in compatibility aliases, engine-specific functions, and rejected compatibility requests without adding an extension plugin system or backend-owned semantics.
- **Projection:** builder-based `with_column`, `add`, `mul`, and literal expression helpers now lower derived columns through Prism, Substrait, and Session execution.
54 changes: 49 additions & 5 deletions docs/rfcs/023_approximate_sketch_functions.md
@@ -1,6 +1,6 @@
# InQL RFC 023: Approximate and sketch functions

-- **Status:** Draft
+- **Status:** In Progress
- **Created:** 2026-04-27
- **Author(s):** Danny Meijer (@dannymeijer)
- **Related:**
@@ -11,7 +11,7 @@
- InQL RFC 024 (function extension policy)
- **Issue:** [InQL #40](https://github.com/dannys-code-corner/InQL/issues/40)
- **RFC PR:** —
-- **Written against:** Incan v0.2
+- **Written against:** Incan v0.3-era InQL
- **Shipped in:** —

## Summary
@@ -112,10 +112,54 @@ This RFC is additive. Existing exact aggregates must not change semantics when a
- **Execution / interchange** — Prism and Substrait lowering must preserve approximate parameters, sketch state types, and merge semantics or reject unsupported functions.
- **Documentation** — docs must label approximate functions clearly and explain accuracy parameters.

## Unresolved questions
## Design Decisions

### Resolved

- The first implementation slice is `approx_count_distinct(expr)`. It is an aggregate measure, not a scalar expression,
and its helper name makes approximate execution an explicit author choice.
- `approx_count_distinct` is registered with approximation metadata that records HyperLogLog-family semantics,
mergeability, and an approximate cardinality-result interpretation.
- The first slice follows the standard Substrait unary `approx_count_distinct` aggregate mapping. It does not expose a
user-tunable relative-error parameter because the validated standard mapping does not carry one.
- DataFusion's implementation is named `approx_distinct`; InQL keeps the standard Substrait function name in emitted
function metadata and rewrites only the DataFusion consumer declaration at the backend adapter boundary.
- `approx_count_distinct` allows aggregate-local filters and rejects an extra `distinct()` modifier because distinct
estimation is already the helper's semantics.
- `approx_percentile` is not implemented in this slice because the local Substrait aggregate-approx extension has a
standard `approx_count_distinct` mapping but no matching standard approximate percentile contract to preserve.
- Sketch-state construction, merge, estimate, serialization, and deserialization helpers remain future work until InQL
has explicit sketch logical types and compatibility rules.
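
The "mergeability" recorded in the approximation metadata refers to a general property of HyperLogLog-family sketches: two sketches built over separate partitions combine by taking the element-wise maximum of their registers. The Python sketch below illustrates that property only. `merge_registers` is a hypothetical name, and InQL does not expose sketch state or merge helpers in this slice.

```python
def merge_registers(left, right):
    """Combine two HyperLogLog register arrays of equal size."""
    if len(left) != len(right):
        raise ValueError("sketches must use the same number of registers")
    # Each register keeps the largest rank seen, so merging takes the maximum.
    return [max(a, b) for a, b in zip(left, right)]
```

Because `max` is commutative, associative, and idempotent, partial sketches can be merged in any order, and merging a sketch with itself changes nothing.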

### Remaining

- Should InQL standardize one sketch family per use case or expose multiple named families?
- What serialization format, if any, should be portable across backends?
- How should accuracy guarantees be documented without implying backend-independent statistical promises that are not true?

<!-- When every question is resolved, rename this section to **Design Decisions**, group answers under ### Resolved, and remove this comment. -->
- Should future approximate aggregates expose user-tunable accuracy parameters through aggregate options, option records,
or separate helper names when Substrait has no standard parameter slot?
- Which approximate percentile family should become the portable core contract, and how should percentile domain,
interpolation, and accuracy be specified?

## Implementation Plan

1. Add registry approximation metadata with exact-helper defaults.
2. Add `approx_count_distinct(expr)` under a logical approximate function family.
3. Add a stable Substrait anchor and keep emitted function metadata on the standard `approx_count_distinct` name.
4. Add a DataFusion adapter-local rewrite to `approx_distinct` for the first backend.
5. Add focused helper, registry, Substrait lowering, Prism, and DataFusion-backed session tests with materialized output.
6. Add user-facing approximate-function docs, aggregate-builder docs, and release notes.
7. Leave approximate percentile and sketch-state helpers for later RFC 023 slices once remaining contracts are resolved.

## Progress Checklist

- [x] RFC 023 moved to In Progress with a first implementation slice and recorded design decisions.
- [x] Registry approximation metadata added for intentionally approximate functions.
- [x] `approx_count_distinct` helper added under the function catalog.
- [x] Standard Substrait `approx_count_distinct` extension metadata added.
- [x] DataFusion adapter-local `approx_count_distinct` to `approx_distinct` mapping added.
- [x] Focused helper, registry, Substrait lowering, Prism, and DataFusion-backed session tests added.
- [x] User-facing approximate-function docs, aggregate-builder docs, and release notes added.
- [ ] Approximate percentile semantics specified and implemented.
- [ ] Sketch-state logical types specified and implemented.
- [ ] Sketch merge, estimate, serialize, and deserialize helpers specified and implemented.
2 changes: 1 addition & 1 deletion docs/rfcs/README.md
@@ -29,7 +29,7 @@ InQL uses its **own** RFC series (starting at 000), independent of the [Incan la
| [020][rfc-020] | Draft | Nested data functions | |
| [021][rfc-021] | In Progress | Generator and table-valued functions | |
| [022][rfc-022] | In Progress | Semi-structured and format functions | |
-| [023][rfc-023] | Draft | Approximate and sketch functions | |
+| [023][rfc-023] | In Progress | Approximate and sketch functions | |
| [024][rfc-024] | Draft | Function extension policy | |

<!-- TODO: #7: auto populate this table (like how we do in incan) -->
6 changes: 6 additions & 0 deletions src/aggregate_builders.incn
@@ -19,6 +19,7 @@ pub enum AggregateKind(str):
Avg = "avg"
Min = "min"
Max = "max"
ApproxCountDistinct = "approx_count_distinct"


@derive(Clone)
@@ -137,3 +138,8 @@ pub def min(expr: ColumnExpr) -> AggregateMeasure:
pub def max(expr: ColumnExpr) -> AggregateMeasure:
"""Build one `max` aggregate measure over a scalar expression."""
return _aggregate_measure("max", AggregateKind.Max, expr, true)


pub def approx_count_distinct(expr: ColumnExpr) -> AggregateMeasure:
"""Build one approximate distinct-count aggregate measure over a scalar expression."""
return _aggregate_measure("approx_count_distinct", AggregateKind.ApproxCountDistinct, expr, true)