4 changes: 3 additions & 1 deletion docs/language/reference/dataset_methods.md
@@ -19,9 +19,10 @@ The Substrait helper surface behind these methods is split by semantic role:
| `with_column` | `def with_column(self, name: str, expr: ColumnExpr) -> Self` | Add or replace one projected column using a scalar expression. |
| `group_by` | `def group_by(self, columns: list[ColumnExpr]) -> Self` | Define grouping keys using scalar expressions. |
| `agg` | `def agg(self, measures: list[AggregateMeasure]) -> Self` | Apply aggregate measures over the current relation or current grouping. |
| `generate` | `def generate(self, generator: GeneratorApplication) -> Self` | Apply a relation-shaping generator such as `explode(...)` with explicit output aliases. |
| `order_by` | `def order_by(self, columns: list[ColumnExpr]) -> Self` | Sort rows by scalar expressions or ordering helpers such as `asc(...)` and `desc(...)`. |
| `limit` | `def limit(self, n: int) -> Self` | Cap row count. |
-| `explode` | `def explode(self) -> Self` | Expand a nested list column into rows. |
+| `explode` | `def explode(self) -> Self` | Compatibility marker for the older EXPLODE extension path. Prefer `generate(explode(...))`. |

## `with_column`

@@ -67,6 +68,7 @@ def enrich(orders: LazyFrame[Order]) -> LazyFrame[Order]:

- `join(...)` is constrained to same-carrier inputs and the boolean join predicate surface shown in the signature.
- `select(...)` preserves projection shape; explicit projection lists are represented today through `with_column(...)` and scalar-expression builders.
- `generate(...)` preserves all input columns and appends generated output aliases. Alias collisions are rejected during planning/lowering.
- `DataFrame[T]` exposes materialized metadata and preview text; row-level accessors belong to the materialized DataFrame API surface.
- Query-block and scoped DSL surfaces lower into these builder APIs rather than defining separate method semantics.

32 changes: 32 additions & 0 deletions docs/language/reference/functions/generators.md
@@ -0,0 +1,32 @@
# Generator and Table-Valued Functions (Reference)

Generators are relation-shaping operations. They are registry-backed like scalar and aggregate helpers, but they return
`GeneratorApplication` values and must be applied through a relation method such as `generate(...)`.

```incan
from pub::inql import LazyFrame
from pub::inql.functions import col, explode
from models import Order

def order_lines(orders: LazyFrame[Order]) -> LazyFrame[Order]:
    return orders.generate(explode(col("line_items"), "line_item"))
```

The explicit generator surface currently includes:

| Function | Output aliases | Relation effect |
| --- | --- | --- |
| `explode(expr, as_)` | one value column | Emits one row per array element; null or empty inputs emit zero rows. |
| `explode_outer(expr, as_)` | one value column | Preserves the input row for null or empty inputs and emits a null generated value. |
| `posexplode(expr, position_as, value_as)` | position and value columns | Emits one row per array element with a zero-based position column. |
| `posexplode_outer(expr, position_as, value_as)` | position and value columns | Outer positional explode with the same zero-based position rule. |

Generator applications preserve input columns and append generated columns in declaration order. Generated aliases are
required, must be non-empty, and must not collide with existing input columns.
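
The row-level effect of `explode` and `explode_outer` described in the table can be modelled in a few lines of plain Python. This is only an illustrative sketch over lists of dicts: `explode_rows` and the sample `orders` data are hypothetical names, not part of the InQL API, and the real operation runs through planning and lowering rather than eager iteration.

```python
from typing import Any

Row = dict[str, Any]


def explode_rows(rows: list[Row], column: str, alias: str, outer: bool = False) -> list[Row]:
    """Model explode / explode_outer over plain dict rows (hypothetical helper)."""
    out: list[Row] = []
    for row in rows:
        if alias in row:
            # Alias collisions are rejected rather than silently overwritten.
            raise ValueError(f"alias {alias!r} collides with an input column")
        values = row.get(column)
        if not values:
            # Null or empty input: the inner form emits zero rows, the outer
            # form keeps the input row with a null generated value.
            if outer:
                out.append({**row, alias: None})
            continue
        for value in values:
            # Input columns are preserved; the generated column is appended.
            out.append({**row, alias: value})
    return out


orders = [
    {"order_id": 1, "line_items": ["a", "b"]},
    {"order_id": 2, "line_items": []},
    {"order_id": 3, "line_items": None},
]

inner = explode_rows(orders, "line_items", "line_item")
outer = explode_rows(orders, "line_items", "line_item", outer=True)
```

In this model `inner` contains only the two rows derived from order 1, while `outer` also keeps orders 2 and 3 with `line_item` set to `None`.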

The older zero-argument `DataSet.explode()` method remains available as a compatibility marker for the current Substrait
extension relation gap. New code should prefer `generate(explode(...))` so the relation-shaping function identity and
output schema are explicit.

Nested scalar helpers such as `array_flatten(...)` remain scalar expressions. They do not expand rows and are documented
on the [nested data functions](nested.md) page.
4 changes: 3 additions & 1 deletion docs/language/reference/functions/index.md
@@ -7,11 +7,12 @@ Today the concrete shipped surfaces are documented here:
- [Filter builders](../builders/filters.md)
- [Aggregate builders](../builders/aggregates.md)
- [Projection builders](../builders/projections.md)
- [Generator and table-valued functions](generators.md)
- [Nested data functions](nested.md)

The canonical scalar literal helper is `lit(...)`. Typed literal helpers construct the same scalar-expression representation.

-The current registry-backed helper surface is registered in the package-owned function registry. Registry types live in `src/function_registry.incn`, the shared package registry lives in `src/functions/registry.incn`, and concrete public helper entries are produced by `function_registry.add(...)` decorators in individual `src/functions/<family>/<name>.incn` modules. The registry-backed families are references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, and nested data. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), RFC 024 policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked function signatures come from the public helper declaration, not from a second hand-written registry signature.
+The current registry-backed helper surface is registered in the package-owned function registry. Registry types live in `src/function_registry.incn`, the shared package registry lives in `src/functions/registry.incn`, and concrete public helper entries are produced by `function_registry.add(...)` decorators in individual `src/functions/<family>/<name>.incn` modules. The registry-backed families are references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, and nested data. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), RFC 024 policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked function signatures come from the public helper declaration, not from a second hand-written registry signature.

The registry is the source for non-derivable machine facts. Public helper declarations are the source for argument names, argument types, and return types. Docstrings remain human-facing explanation, examples, and parameter intent. The `registry-metadata` check validates the checked API metadata projections produced from public facade aliases, registry decorators, and decorated callable signatures. Runtime registry entries are lazy and process-local: they support helper execution and lowering for loaded helpers, while the complete public catalog comes from checked metadata. This matters for generated docs, diagnostics, Prism lowering, and backend capability checks as the catalog grows.

@@ -33,6 +34,7 @@ The registered helper surface currently includes:
| `in_(...)`, `between(...)` | scalar | built-in membership/range lowering (`SingularOrList` and `between`) |
| `abs(...)`, `ceil(...)`, `floor(...)`, `round(...)` | scalar | registered Substrait math scalar mappings; `round(...)` is currently the single-argument form |
| `array(...)`, `cardinality(...)`, `array_contains(...)`, `arrays_overlap(...)`, `array_position(...)`, `element_at(...)`, `array_sort(...)`, `array_distinct(...)`, `array_except(...)`, `array_intersect(...)`, `array_union(...)`, `array_join(...)`, `array_slice(...)`, `array_reverse(...)`, `array_flatten(...)`, `map_from_arrays(...)`, `map_extract(...)`, `map_contains_key(...)`, `map_keys(...)`, `map_values(...)`, `map_entries(...)`, `named_struct(...)` | scalar | registered nested scalar helpers backed by Substrait extension mappings; `map_contains_key(...)` lowers as a documented predicate rewrite |
| `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)` | generator | relation-extension mappings consumed by `generate(...)`; positional forms use zero-based positions |
| `asc(...)`, `desc(...)`, `asc_nulls_first(...)`, `asc_nulls_last(...)`, `desc_nulls_first(...)`, `desc_nulls_last(...)` | ordering | structural sort-field helpers consumed by `order_by(...)` and lowered to Substrait `SortRel.sorts` |
| `sum(...)`, `count()`, `count_expr(...)`, `avg(...)`, `min(...)`, `max(...)` | aggregate | registered Substrait extension functions; `count_expr(...)` is a compatibility spelling for future `count(expr)` helper overloading |
| `count_distinct(...)`, `count_if(...)` | aggregate | compatibility helpers that lower through aggregate modifiers over canonical `count` semantics |
3 changes: 3 additions & 0 deletions docs/language/reference/substrait/operator_catalog.md
@@ -81,6 +81,9 @@ Core Substrait does not define a portable unnest or explode `Rel` at the logical
Current package-level RFC 002 boundary registration:

- `https://inql.io/extensions/v0.1/unnest.yaml#explode`
- `https://inql.io/extensions/v0.1/unnest.yaml#explode_outer`
- `https://inql.io/extensions/v0.1/unnest.yaml#posexplode`
- `https://inql.io/extensions/v0.1/unnest.yaml#posexplode_outer`

### Pivot / unpivot

1 change: 1 addition & 0 deletions docs/release_notes/v0_1.md
@@ -16,6 +16,7 @@ Entries will be filled in as work lands (link RFCs and PRs when applicable).
- **Core scalar functions:** RFC 015 adds registry-backed scalar function applications and the first core helper slice for casts, comparisons, boolean logic, null/NaN predicates, arithmetic, conditionals, membership/range predicates, and ordering expressions. Implemented helpers lower to Substrait IR through registry metadata, built-in Rex shapes, or structural sort-field lowering; DataFusion remains the first execution adapter rather than the semantic boundary.
- **Common scalar functions:** The first RFC 018 slice adds registry-backed math helpers for `abs(...)`, `ceil(...)`, `floor(...)`, and single-argument `round(...)`, with Substrait mappings and DataFusion-backed execution coverage.
- **Nested data functions:** RFC 020 adds registry-backed scalar helpers for array construction/access, cardinality, containment, overlap, sorting, set-like operations, joining, slicing, reversing, scalar array flattening, map construction/access, map key/value/entry extraction, map key containment, and named struct construction. These helpers lower through Substrait extension metadata and execute through the DataFusion-backed Session path without introducing generator semantics.
- **Generator functions:** RFC 021 adds registry-backed generator applications for `explode(...)`, `explode_outer(...)`, `posexplode(...)`, and `posexplode_outer(...)`. Generators remain relation-shaping operations applied with `generate(...)`; they preserve input columns, require explicit output aliases, and lower through the current Substrait extension-relation gap encoding.
- **Function registry:** RFC 014 adds declaration-site registry decorators for the current public helper surface, including stable function references, checked signature projection, lifecycle metadata, behavior categories, alias policy, Substrait mapping categories, and checked API metadata drift validation.
- **Function extension policy:** RFC 024 policy metadata now distinguishes portable core functions, namespaced extension-only functions, opt-in compatibility aliases, engine-specific functions, and rejected compatibility requests without adding an extension plugin system or backend-owned semantics.
- **Projection:** builder-based `with_column`, `add`, `mul`, and literal expression helpers now lower derived columns through Prism, Substrait, and Session execution.
34 changes: 20 additions & 14 deletions docs/rfcs/021_generator_table_functions.md
@@ -1,6 +1,6 @@
# InQL RFC 021: Generator and table-valued functions

-- **Status:** Draft
+- **Status:** In Progress
- **Created:** 2026-04-27
- **Author(s):** Danny Meijer (@dannymeijer)
- **Related:**
@@ -42,14 +42,15 @@ InQL already has an unnest/explode design direction through its Substrait work.

## Guide-level explanation (how authors think about it)

-Authors should use generators when one input row may become multiple output rows:
+Authors should use generators when one input row may become multiple output rows. In the current builder surface,
+generators are constructed as explicit applications and then applied to a relation:

```incan
-from pub::inql.functions import col
+from pub::inql.functions import col, explode

items = (
    orders
-    .explode(col("line_items"), as_="line_item")
+    .generate(explode(col("line_items"), "line_item"))
    .select(["order_id", "line_item"])
)
```
@@ -64,25 +65,25 @@ Generator functions must be registry entries with function class `generator` or

`explode_outer(array_expr)` must preserve the input row when the input array is null or empty and must produce a null generated value according to its output schema.

-`posexplode(array_expr)` and `posexplode_outer(array_expr)` must include a positional output column in addition to the generated element. The position origin must be specified before this RFC reaches Planned status.
+`posexplode(array_expr)` and `posexplode_outer(array_expr)` must include a positional output column in addition to the generated element. Positional output is zero-based because `posexplode` follows the Spark-compatible naming convention rather than InQL's one-based scalar collection indexing rule.
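
The zero-based position rule can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not the InQL implementation: `posexplode_rows` is a hypothetical helper, and emitting a null position alongside the null value in the outer case is an assumption borrowed from Spark's `posexplode_outer` behavior, since the RFC text only fixes the position origin.

```python
from typing import Any

Row = dict[str, Any]


def posexplode_rows(
    rows: list[Row],
    column: str,
    position_as: str,
    value_as: str,
    outer: bool = False,
) -> list[Row]:
    """Model posexplode / posexplode_outer over plain dict rows (hypothetical helper)."""
    out: list[Row] = []
    for row in rows:
        values = row.get(column)
        if not values:
            if outer:
                # Assumption: the outer form nulls both generated columns.
                out.append({**row, position_as: None, value_as: None})
            continue
        # enumerate() starts at 0, which matches the zero-based position rule.
        for position, value in enumerate(values):
            out.append({**row, position_as: position, value_as: value})
    return out


lines = posexplode_rows(
    [{"order_id": 1, "line_items": ["a", "b", "c"]}],
    "line_items",
    "line_no",
    "line_item",
)
positions = [row["line_no"] for row in lines]
```

Here `positions` is `[0, 1, 2]`. A scalar helper that follows InQL's one-based collection indexing would address the same first element with index 1.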

`inline(array_of_struct_expr)` must expand each struct element into output columns. `inline_outer` must preserve outer rows for null or empty input according to the outer generator rule.

`stack` must construct multiple output rows from explicit expressions according to a declared row count and output schema.

-`flatten` must be treated as a table-valued/generator operation when supported. Its exact input type, recursive behavior, path behavior, and output columns must be specified before it reaches Planned status.
+`flatten` must be treated as a table-valued/generator operation when supported. Portable InQL does not yet define Snowflake-style recursive/path flattening; scalar `array_flatten(...)` remains part of RFC 020 and does not change row cardinality.

Every generator must define output column names, output types, nullability, interaction with existing columns, and aliasing requirements. Name collisions must be diagnosed unless an explicit overwrite or qualification rule applies.

## Design details

### Syntax

-Generators may appear as dataframe relation methods, query-block clauses, or table-valued function forms. Regardless of syntax, they must lower to relation-shaping operations.
+Generators may appear as dataframe relation methods, query-block clauses, or table-valued function forms. Regardless of syntax, they must lower to relation-shaping operations. The initial builder API uses `generate(generator)` to avoid overloading the existing zero-argument compatibility `explode()` method.

### Semantics

-Generator output schema is part of the relation schema after the generator operation. Generators may preserve input columns, replace a nested column with generated columns, or produce a new relation depending on the function and syntax, but the behavior must be explicit.
+Generator output schema is part of the relation schema after the generator operation. The initial portable generator applications preserve all input columns and append generated output columns in declaration order. Generated aliases are required, must be non-empty, and must not collide with existing columns.
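
The schema rule for the initial generator applications can be sketched as a small validation function. This is an illustrative model only: `generated_schema` is a hypothetical name, the error messages are invented, and rejecting duplicates among the generated aliases themselves is an assumption that goes slightly beyond the stated rule about existing columns.

```python
def generated_schema(input_columns: list[str], aliases: list[str]) -> list[str]:
    """Return the output column order for a generator application (hypothetical helper)."""
    if not aliases:
        raise ValueError("generator output aliases are required")
    seen = set(input_columns)
    for alias in aliases:
        if not alias:
            raise ValueError("generator output aliases must be non-empty")
        if alias in seen:
            # Covers collisions with input columns and, by assumption,
            # duplicates among the generated aliases themselves.
            raise ValueError(f"alias {alias!r} collides with an existing column")
        seen.add(alias)
    # Input columns come first; generated columns follow in declaration order.
    return [*input_columns, *aliases]


schema = generated_schema(["order_id", "line_items"], ["line_no", "line_item"])
```

For a positional explode declared with `line_no` then `line_item`, the model yields `["order_id", "line_items", "line_no", "line_item"]`. An alias such as `order_id` or an empty string would raise instead of producing a schema.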

### Interaction with other InQL surfaces

@@ -112,11 +113,16 @@ Existing unnest/explode behavior should align with this RFC. If current behavior
- **Execution / interchange** — Prism and Substrait lowering must represent cardinality changes and output schemas faithfully.
- **Documentation** — generator docs should explain cardinality and schema effects before listing helper names.

-## Unresolved questions
+## Design Decisions

-- Should positional generators use zero-based or one-based positions?
-- Should `.explode(...)` preserve all input columns by default?
-- What aliasing syntax should be required for generated output columns?
-- What subset of Snowflake-style `flatten` behavior belongs in portable InQL versus a warehouse compatibility extension?
+### Resolved

-<!-- When every question is resolved, rename this section to **Design Decisions**, group answers under ### Resolved, and remove this comment. -->
+- Positional generators use zero-based positions for compatibility with the `posexplode` naming convention.
+- Explicit generator applications preserve all input columns by default and append generated output columns.
+- Generated aliases are required at builder construction time.
+- Snowflake-style recursive/path `flatten` remains outside the portable core until its output schema and compatibility category are specified separately.
+
+### Remaining
+
+- `inline`, `inline_outer`, `stack`, and portable table-valued `flatten` need separate helper slices on top of the generator application model.
+- Query-block generator syntax still needs compiler/query-surface work.
2 changes: 1 addition & 1 deletion docs/rfcs/README.md
@@ -27,7 +27,7 @@ InQL uses its **own** RFC series (starting at 000), independent of the [Incan la
| [018][rfc-018] | Draft | Common scalar function catalog | |
| [019][rfc-019] | Draft | Window functions | |
| [020][rfc-020] | Draft | Nested data functions | |
-| [021][rfc-021] | Draft | Generator and table-valued functions | |
+| [021][rfc-021] | In Progress | Generator and table-valued functions | |
| [022][rfc-022] | Draft | Semi-structured and format functions | |
| [023][rfc-023] | Draft | Approximate and sketch functions | |
| [024][rfc-024] | Draft | Function extension policy | |