31 changes: 31 additions & 0 deletions docs/language/reference/functions/format.md
@@ -0,0 +1,31 @@
# Format Functions (Reference)

Format helpers operate on scalar payloads that are already present in a relation. They do not read files, infer source
schemas from external locations, or change relation cardinality.

The currently implemented slice is deterministic string hashing:

| Function | Meaning |
| --- | --- |
| `md5(expr)` | Return the lowercase hexadecimal MD5 digest for one string expression. |
| `sha224(expr)` | Return the lowercase hexadecimal SHA-224 digest for one string expression. |
| `sha256(expr)` | Return the lowercase hexadecimal SHA-256 digest for one string expression. |
| `sha384(expr)` | Return the lowercase hexadecimal SHA-384 digest for one string expression. |
| `sha512(expr)` | Return the lowercase hexadecimal SHA-512 digest for one string expression. |
| `sha2(expr, bit_length)` | Compatibility helper that rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths. |

```incan
from pub::inql.functions import col, md5, sha2

projected = (
    events
    .with_column("user_hash", sha2(col("user_id"), 256))
    .with_column("payload_md5", md5(col("payload")))
)
```

Hash helpers operate on UTF-8 string bytes and return lowercase hexadecimal strings. `sha2(...)` accepts `224`, `256`,
`384`, and `512`; unsupported digest lengths are rejected by the helper rather than being passed through to a backend.
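
As a cross-check of that contract, the following Python sketch (outside InQL, using only the standard `hashlib` module) hashes the UTF-8 bytes of a string and confirms the lowercase hexadecimal shape of each digest. The `reference_hex_digest` name is hypothetical and exists only for this illustration; it is not an InQL helper.

```python
import hashlib


# Hypothetical reference helper (not part of InQL): models the documented
# contract of hashing a string's UTF-8 bytes into a lowercase hex digest.
def reference_hex_digest(algorithm: str, text: str) -> str:
    digest = hashlib.new(algorithm, text.encode("utf-8"))
    return digest.hexdigest()  # hexdigest() uses lowercase hex digits


# Hex length is the digest bit length divided by 4 (MD5 is 128 bits).
HEX_LENGTHS = {"md5": 32, "sha224": 56, "sha256": 64, "sha384": 96, "sha512": 128}

for name, expected_length in HEX_LENGTHS.items():
    value = reference_hex_digest(name, "héllo")  # non-ASCII input, UTF-8 encoded
    assert len(value) == expected_length
    assert value == value.lower()
```

A reference function like this can help when comparing backend output against an independent implementation, assuming the backend follows the same UTF-8 and lowercase-hex conventions.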

JSON, CSV, URL, and dynamic-value predicate helpers remain future format-function slices until their schema arguments,
option records, path validation rules, and dynamic value model are specified.
4 changes: 3 additions & 1 deletion docs/language/reference/functions/index.md
@@ -10,10 +10,11 @@ Today the concrete shipped surfaces are documented here:
- [Generator and table-valued functions](generators.md)
- [Nested data functions](nested.md)
- [Window functions](windows.md)
- [Format functions](format.md)

The canonical scalar literal helper is `lit(...)`. Typed literal helpers construct the same scalar-expression representation.

The current registry-backed helper surface is registered in the package-owned function registry. Registry types live in `src/function_registry.incn`, the shared package registry lives in `src/functions/registry.incn`, and concrete public helper entries are produced by `function_registry.add(...)` decorators in individual `src/functions/<family>/<name>.incn` modules. The registry-backed families are references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, nested data, and windows. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), RFC 024 policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked function signatures come from the public helper declaration, not from a second hand-written registry signature.
The current registry-backed helper surface is registered in the package-owned function registry. Registry types live in `src/function_registry.incn`, the shared package registry lives in `src/functions/registry.incn`, and concrete public helper entries are produced by `function_registry.add(...)` decorators in individual `src/functions/<family>/<name>.incn` modules. The registry-backed families are references, literals, casts, operators, predicates, conditionals, math, ordering, aggregates, generators, nested data, windows, and format helpers. Each runtime entry exposes a stable function reference such as `inql.functions.col`, namespace, canonical name, typed lifecycle metadata (`since`, versioned changes, and optional deprecation), RFC 024 policy category, function class, null behavior, alias policy, aggregate modifier policy, and Substrait mapping metadata. Checked function signatures come from the public helper declaration, not from a second hand-written registry signature.

The registry is the source for non-derivable machine facts. Public helper declarations are the source for argument names, argument types, and return types. Docstrings remain human-facing explanation, examples, and parameter intent. The `registry-metadata` check validates the checked API metadata projections produced from public facade aliases, registry decorators, and decorated callable signatures. Runtime registry entries are lazy and process-local: they support helper execution and lowering for loaded helpers, while the complete public catalog comes from checked metadata. This matters for generated docs, diagnostics, Prism lowering, and backend capability checks as the catalog grows.

@@ -37,6 +38,7 @@ The registered helper surface currently includes:
| `array(...)`, `cardinality(...)`, `array_contains(...)`, `arrays_overlap(...)`, `array_position(...)`, `element_at(...)`, `array_sort(...)`, `array_distinct(...)`, `array_except(...)`, `array_intersect(...)`, `array_union(...)`, `array_join(...)`, `array_slice(...)`, `array_reverse(...)`, `array_flatten(...)`, `map_from_arrays(...)`, `map_extract(...)`, `map_contains_key(...)`, `map_keys(...)`, `map_values(...)`, `map_entries(...)`, `named_struct(...)` | scalar | registered nested scalar helpers backed by Substrait extension mappings; `map_contains_key(...)` lowers as a documented predicate rewrite |
| `explode(...)`, `explode_outer(...)`, `posexplode(...)`, `posexplode_outer(...)` | generator | relation-extension mappings consumed by `generate(...)`; positional forms use zero-based positions |
| `window()`, `row_number()`, `rank()`, `dense_rank()` | window | `window()` builds structural window-spec metadata; ranking helpers lower through `ConsistentPartitionWindowRel` when placed with `with_window_column(...)` |
| `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, `sha2(...)` | scalar | registered format/hash helpers; concrete helpers lower through Substrait extension mappings, while `sha2(...)` rewrites to a supported concrete SHA-2 helper |
| `asc(...)`, `desc(...)`, `asc_nulls_first(...)`, `asc_nulls_last(...)`, `desc_nulls_first(...)`, `desc_nulls_last(...)` | ordering | structural sort-field helpers consumed by `order_by(...)` and lowered to Substrait `SortRel.sorts` |
| `sum(...)`, `count()`, `count_expr(...)`, `avg(...)`, `min(...)`, `max(...)` | aggregate | registered Substrait extension functions; `count_expr(...)` is a compatibility spelling for future `count(expr)` helper overloading |
| `count_distinct(...)`, `count_if(...)` | aggregate | compatibility helpers that lower through aggregate modifiers over canonical `count` semantics |
1 change: 1 addition & 0 deletions docs/release_notes/v0_1.md
@@ -18,6 +18,7 @@ Entries will be filled in as work lands (link RFCs and PRs when applicable).
- **Nested data functions:** RFC 020 adds registry-backed scalar helpers for array construction/access, cardinality, containment, overlap, sorting, set-like operations, joining, slicing, reversing, scalar array flattening, map construction/access, map key/value/entry extraction, map key containment, and named struct construction. These helpers lower through Substrait extension metadata and execute through the DataFusion-backed Session path without introducing generator semantics.
- **Generator functions:** RFC 021 adds registry-backed generator applications for `explode(...)`, `explode_outer(...)`, `posexplode(...)`, and `posexplode_outer(...)`. Generators remain relation-shaping operations applied with `generate(...)`; they preserve input columns, require explicit output aliases, and lower through the current Substrait extension-relation gap encoding.
- **Window functions:** RFC 019 adds the first window-function planning slice with `window()` specs, `row_number()`, `rank()`, `dense_rank()`, and `with_window_column(...)`. Ranking windows require explicit ordering and lower through Substrait `ConsistentPartitionWindowRel`; backend execution support remains a separate adapter capability.
- **Format functions:** RFC 022 adds the first deterministic hashing slice with `md5(...)`, `sha224(...)`, `sha256(...)`, `sha384(...)`, `sha512(...)`, and `sha2(...)`. Hash helpers operate on UTF-8 string bytes, return lowercase hexadecimal strings, lower through registry-owned Substrait metadata, and execute through the DataFusion-backed Session path.
- **Function registry:** RFC 014 adds declaration-site registry decorators for the current public helper surface, including stable function references, checked signature projection, lifecycle metadata, behavior categories, alias policy, Substrait mapping categories, and checked API metadata drift validation.
- **Function extension policy:** RFC 024 policy metadata now distinguishes portable core functions, namespaced extension-only functions, opt-in compatibility aliases, engine-specific functions, and rejected compatibility requests without adding an extension plugin system or backend-owned semantics.
- **Projection:** builder-based `with_column`, `add`, `mul`, and literal expression helpers now lower derived columns through Prism, Substrait, and Session execution.
39 changes: 34 additions & 5 deletions docs/rfcs/022_semi_structured_format_functions.md
@@ -1,6 +1,6 @@
# InQL RFC 022: Semi-structured and format functions

- **Status:** Draft
- **Status:** In Progress
- **Created:** 2026-04-27
- **Author(s):** Danny Meijer (@dannymeijer)
- **Related:**
@@ -12,7 +12,7 @@
- InQL RFC 020 (nested data functions)
- **Issue:** [InQL #39](https://github.com/dannys-code-corner/InQL/issues/39)
- **RFC PR:** —
- **Written against:** Incan v0.2
- **Written against:** Incan v0.3-era InQL
- **Shipped in:** —

## Summary
@@ -115,12 +115,41 @@ This RFC is additive. It should not change existing CSV ingestion behavior.
- **Execution / interchange** — Prism and Substrait lowering must preserve parser options, hash encodings, and structured return values or diagnose unsupported functions.
- **Documentation** — docs should distinguish scalar format functions from session read/write APIs.

## Unresolved questions
## Design Decisions

### Resolved

- The first implementation slice is deterministic hashing. JSON, CSV, URL, dynamic-value predicates, and structured parser helpers remain future slices because their schema arguments, option records, path validation, and dynamic value model are not settled here.
- Hash helpers in this slice operate on UTF-8 string bytes and return lowercase hexadecimal strings.
- Portable concrete hash helpers are `md5`, `sha224`, `sha256`, `sha384`, and `sha512`, each with an honest Substrait extension mapping and DataFusion-backed execution coverage.
- `sha2(expr, bit_length)` is a compatibility helper, not a separate backend mapping. It rewrites to `sha224`, `sha256`, `sha384`, or `sha512` for supported literal bit lengths and rejects unsupported values.
- `sha1`, `crc32`, and `xxhash64` are not implemented in the first slice because no honest Substrait/DataFusion mapping was validated for this branch.
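
The `sha2` rewrite decision above can be modeled in a few lines of Python. This is only a sketch of the dispatch rule, not the Incan implementation: `SUPPORTED_SHA2_HELPERS` and `resolve_sha2_helper` are hypothetical names, and the function returns the concrete helper's name rather than building an expression.

```python
# Hypothetical model of the rewrite: map a literal bit length to the name of
# the concrete helper, and reject anything else before a backend is involved.
SUPPORTED_SHA2_HELPERS = {224: "sha224", 256: "sha256", 384: "sha384", 512: "sha512"}


def resolve_sha2_helper(bit_length: int) -> str:
    try:
        return SUPPORTED_SHA2_HELPERS[bit_length]
    except KeyError:
        # Same message shape as the diagnostic described for unsupported lengths.
        raise ValueError(
            "sha2 bit_length must be one of 224, 256, 384, or 512"
        ) from None
```

Because the lookup happens while the expression is being built, an unsupported length such as `1` fails in the helper itself, so no `sha2` function ever needs its own backend mapping.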

### Remaining

- Should `from_json` accept model types directly as schema arguments, or only explicit schema values?
- Should invalid JSON path expressions be compile-time errors when literal and runtime errors otherwise?
- What option-record shape should CSV and JSON scalar parsers use?
- Should hash functions return binary values or lowercase hexadecimal strings by default?
- Should future binary-oriented hash helpers return binary values, lowercase hexadecimal strings, or an explicit typed encoding wrapper?
- Which variant-style type predicates are portable enough for InQL core, and which should stay in a Snowflake-compatibility extension?

<!-- When every question is resolved, rename this section to **Design Decisions**, group answers under ### Resolved, and remove this comment. -->
## Implementation Plan

1. Add registry-backed hashing helpers under a logical function family.
2. Add stable Substrait extension anchors for concrete hash helpers.
3. Keep `sha2(...)` as a compatibility rewrite over concrete helpers rather than a second mapping.
4. Add focused helper, registry, Substrait lowering, and DataFusion session tests with concrete digest values.
5. Add user-facing format-function docs and release notes.
6. Leave parser, URL, and dynamic-value helpers for later RFC 022 slices once their remaining design questions are resolved.
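
Step 4 calls for tests with concrete digest values. A minimal Python sketch of such a known-answer check follows. It uses the standard `hashlib` module as an independent oracle, and the `KNOWN_ANSWERS` table and `count_matching_answers` function are hypothetical names for illustration, not part of the InQL test suite.

```python
import hashlib

# Widely published digests for the empty string and for the ASCII string "abc".
KNOWN_ANSWERS = [
    ("md5", "", "d41d8cd98f00b204e9800998ecf8427e"),
    ("md5", "abc", "900150983cd24fb0d6963f7d28e17f72"),
    ("sha256", "", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"),
    ("sha256", "abc", "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"),
]


def count_matching_answers() -> int:
    """Return how many table entries match the digest computed by hashlib."""
    matches = 0
    for algorithm, text, expected in KNOWN_ANSWERS:
        actual = hashlib.new(algorithm, text.encode("utf-8")).hexdigest()
        if actual == expected:
            matches += 1
    return matches


assert count_matching_answers() == len(KNOWN_ANSWERS)
```

Fixed vectors like these make a session-level test independent of any one backend, since the expected strings do not come from the engine under test.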

## Progress Checklist

- [x] RFC 022 moved to In Progress with a first implementation slice and recorded design decisions.
- [x] `md5`, `sha224`, `sha256`, `sha384`, `sha512`, and `sha2` helpers added under the function catalog.
- [x] Concrete hash helpers registered with Substrait extension metadata.
- [x] `sha2(...)` implemented as a literal-bit-length rewrite with invalid-input diagnostics.
- [x] Focused helper, registry, Substrait lowering, and DataFusion-backed session tests added.
- [x] User-facing format-function docs and release notes added.
- [ ] JSON and CSV scalar parser helpers specified and implemented.
- [ ] URL helper semantics specified and implemented.
- [ ] Dynamic-value predicate semantics specified and implemented.
2 changes: 1 addition & 1 deletion docs/rfcs/README.md
@@ -28,7 +28,7 @@ InQL uses its **own** RFC series (starting at 000), independent of the [Incan la
| [019][rfc-019] | In Progress | Window functions | |
| [020][rfc-020] | Draft | Nested data functions | |
| [021][rfc-021] | In Progress | Generator and table-valued functions | |
| [022][rfc-022] | Draft | Semi-structured and format functions | |
| [022][rfc-022] | In Progress | Semi-structured and format functions | |
| [023][rfc-023] | Draft | Approximate and sketch functions | |
| [024][rfc-024] | Draft | Function extension policy | |

51 changes: 51 additions & 0 deletions src/functions/hashing/md5.incn
@@ -0,0 +1,51 @@
"""
MD5 hash helper.

`md5` hashes a string expression and returns its lowercase hexadecimal digest.
"""

from function_registry import (
    FunctionClass,
    FunctionLifecycle,
    FunctionNullBehavior,
    deterministic_spec,
    extension_mapping,
    v0_1,
)
from functions.registry import function_registry, registered_application
from projection_builders import ColumnExpr
from substrait.function_extensions import MD5_FUNCTION_ANCHOR


@function_registry.add("md5", deterministic_spec(
    FunctionClass.Scalar,
    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
    FunctionNullBehavior.DependsOnInputs,
    extension_mapping("md5", MD5_FUNCTION_ANCHOR),
))
pub def md5(expr: ColumnExpr) -> ColumnExpr:
    """
    Build an MD5 hexadecimal digest expression.

    Examples:
        user_digest = md5(col("user_id"))

    Parameters:
        expr: String expression whose UTF-8 bytes should be hashed.
    """
    return registered_application("md5", [expr])


module tests:
    from projection_builders import (
        ColumnExprKind,
        col,
        column_expr_argument_count,
        column_expr_function_name,
        column_expr_kind,
    )
    def test_md5_builds_registered_application() -> None:
        expr = md5(col("payload"))
        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
        assert column_expr_function_name(expr) == "md5"
        assert column_expr_argument_count(expr) == 1
76 changes: 76 additions & 0 deletions src/functions/hashing/sha2.incn
@@ -0,0 +1,76 @@
"""
SHA-2 compatibility helper.

`sha2(expr, bit_length)` rewrites to the matching concrete SHA-2 helper for supported digest lengths.
"""

from rust::incan_stdlib::errors import raise_value_error
from function_registry import (
    FunctionClass,
    FunctionDeterminism,
    FunctionErrorBehavior,
    FunctionLifecycle,
    FunctionNullBehavior,
    compatibility_alias_spec,
    core_function_namespace,
    rewrite_mapping,
    v0_1,
)
from functions.hashing.sha224 import sha224
from functions.hashing.sha256 import sha256
from functions.hashing.sha384 import sha384
from functions.hashing.sha512 import sha512
from functions.registry import function_registry
from projection_builders import ColumnExpr


@function_registry.add("sha2", compatibility_alias_spec(
    core_function_namespace(),
    FunctionClass.Scalar,
    ["sha224", "sha256", "sha384", "sha512"],
    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
    FunctionDeterminism.Deterministic,
    FunctionNullBehavior.DependsOnInputs,
    FunctionErrorBehavior.InvalidInputDiagnostic,
    rewrite_mapping("sha2(expr, bits) -> sha224/sha256/sha384/sha512(expr) for supported literal bit lengths"),
))
pub def sha2(expr: ColumnExpr, bit_length: int) -> ColumnExpr:
    """
    Build a SHA-2 hexadecimal digest expression for a supported digest length.

    Examples:
        user_digest = sha2(col("user_id"), 256)

    Parameters:
        expr: String expression whose UTF-8 bytes should be hashed.
        bit_length: Supported digest size: 224, 256, 384, or 512.
    """
    if bit_length == 224:
        return sha224(expr)
    if bit_length == 256:
        return sha256(expr)
    if bit_length == 384:
        return sha384(expr)
    if bit_length == 512:
        return sha512(expr)
    return raise_value_error("sha2 bit_length must be one of 224, 256, 384, or 512")


module tests:
    from std.testing import assert_raises
    from projection_builders import (
        ColumnExprKind,
        col,
        column_expr_argument_count,
        column_expr_function_name,
        column_expr_kind,
    )
    def test_sha2_rewrites_to_supported_sha2_helper() -> None:
        expr = sha2(col("payload"), 256)
        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
        assert column_expr_function_name(expr) == "sha256"
        assert column_expr_argument_count(expr) == 1
    def _call_sha2_with_unsupported_length() -> None:
        sha2(col("payload"), 1)
    def test_sha2_rejects_unsupported_bit_length() -> None:
        assert_raises[ValueError](_call_sha2_with_unsupported_length)
51 changes: 51 additions & 0 deletions src/functions/hashing/sha224.incn
@@ -0,0 +1,51 @@
"""
SHA-224 hash helper.

`sha224` hashes a string expression and returns its lowercase hexadecimal digest.
"""

from function_registry import (
    FunctionClass,
    FunctionLifecycle,
    FunctionNullBehavior,
    deterministic_spec,
    extension_mapping,
    v0_1,
)
from functions.registry import function_registry, registered_application
from projection_builders import ColumnExpr
from substrait.function_extensions import SHA224_FUNCTION_ANCHOR


@function_registry.add("sha224", deterministic_spec(
    FunctionClass.Scalar,
    FunctionLifecycle(since=v0_1, changed=[], deprecated=None),
    FunctionNullBehavior.DependsOnInputs,
    extension_mapping("sha224", SHA224_FUNCTION_ANCHOR),
))
pub def sha224(expr: ColumnExpr) -> ColumnExpr:
    """
    Build a SHA-224 hexadecimal digest expression.

    Examples:
        payload_digest = sha224(col("payload"))

    Parameters:
        expr: String expression whose UTF-8 bytes should be hashed.
    """
    return registered_application("sha224", [expr])


module tests:
    from projection_builders import (
        ColumnExprKind,
        col,
        column_expr_argument_count,
        column_expr_function_name,
        column_expr_kind,
    )
    def test_sha224_builds_registered_application() -> None:
        expr = sha224(col("payload"))
        assert column_expr_kind(expr) == ColumnExprKind.ScalarFunction
        assert column_expr_function_name(expr) == "sha224"
        assert column_expr_argument_count(expr) == 1