Conversation
There was a problem hiding this comment.
Pull request overview
This PR exposes several upstream DataFusion aggregate functions that were previously missing from the Python API, and adds unit tests to validate the new bindings.
Changes:
- Expose
groupingandpercentile_contfrom Rust bindings to the Python API. - Add Python-level wrapper for
var_populationas an alias ofvar_pop. - Add unit tests covering
percentile_cont,grouping, andvar_population.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
crates/core/src/functions.rs |
Enables the grouping aggregate binding and adds a new percentile_cont pyfunction export. |
python/datafusion/functions.py |
Adds public Python wrappers/exports for grouping, percentile_cont, and var_population. |
python/tests/test_functions.py |
Adds unit tests for the newly exposed aggregate functions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ation Expose upstream DataFusion aggregate functions that were not yet available in the Python API. Closes apache#1454. - grouping: returns grouping set membership indicator (rewritten by the ResolveGroupingFunction analyzer rule before physical planning) - percentile_cont: computes exact percentile using continuous interpolation (unlike approx_percentile_cont which uses t-digest) - var_population: alias for var_pop Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0c8dc78 to
d16cff1
Compare
ntjohnson1
left a comment
There was a problem hiding this comment.
Grouping examples would make it a lot clearer. Without looking around or thinking about the description more deeply the usage isn't clear to me.
| Args: | ||
| expression: The column to check grouping status for | ||
| distinct: If True, compute on distinct values only | ||
| filter: If provided, only compute against rows for which the filter is True |
python/tests/test_functions.py
Outdated
| assert result.column(0)[0].as_py() == 3.0 | ||
|
|
||
|
|
||
| def test_percentile_cont_with_filter(): |
There was a problem hiding this comment.
NIT: Would be nicer to just be a parametrized test since the only diff is the filter and the result
python/tests/test_functions.py
Outdated
| assert result.column(0)[0].as_py() == 3.5 | ||
|
|
||
|
|
||
| def test_grouping(): |
There was a problem hiding this comment.
If you add more examples then probably don't need to also add tests here but this seems to only test one of the grouping cases
Add docstring example to grouping(), parametrize percentile_cont tests, and add multi-column grouping test case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expose ROLLUP, CUBE, and GROUPING SETS via the DataFrame API by adding static methods on GroupingSet that construct the corresponding Expr variants. Update grouping() docstring and tests to use the new API. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ctly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add user documentation for GroupingSet.rollup, .cube, and .grouping_sets with Pokemon dataset examples. Document the upstream alias limitation (apache/datafusion#21411) in both the grouping() docstring and the aggregation user guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@ntjohnson1 I expanded the PR quite a bit to properly support grouping sets and I also updated the online documentation to explain it in a lot more detail. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @staticmethod | ||
| def rollup(*exprs: Expr) -> Expr: | ||
| """Create a ``ROLLUP`` grouping set for use with ``aggregate()``. | ||
|
|
||
| ``ROLLUP`` generates all prefixes of the given column list as | ||
| grouping sets. For example, ``rollup(a, b)`` produces grouping | ||
| sets ``(a, b)``, ``(a)``, and ``()`` (grand total). | ||
|
|
||
| This is equivalent to ``GROUP BY ROLLUP(a, b)`` in SQL. | ||
|
|
||
| Args: | ||
| *exprs: Column expressions to include in the rollup. | ||
|
|
||
| Examples: | ||
| >>> import pyarrow as pa | ||
| >>> import datafusion as dfn | ||
| >>> from datafusion.expr import GroupingSet | ||
| >>> ctx = dfn.SessionContext() | ||
| >>> df = ctx.from_pydict({"a": [1, 1, 2], "b": [10, 20, 30]}) | ||
| >>> result = df.aggregate( | ||
| ... [GroupingSet.rollup(dfn.col("a"))], | ||
| ... [dfn.functions.sum(dfn.col("b")).alias("s"), | ||
| ... dfn.functions.grouping(dfn.col("a"))], | ||
| ... ).sort(dfn.col("a").sort(nulls_first=False)) | ||
| >>> batches = result.collect() | ||
| >>> pa.concat_arrays([b.column("s") for b in batches]).to_pylist() | ||
| [30, 30, 60] | ||
|
|
||
| See Also: | ||
| :py:meth:`cube`, :py:meth:`grouping_sets`, | ||
| :py:func:`~datafusion.functions.grouping` | ||
| """ | ||
| args = [e.expr for e in exprs] | ||
| return Expr(expr_internal.GroupingSet.rollup(*args)) | ||
|
|
There was a problem hiding this comment.
The new GroupingSet factory methods only accept Expr objects (e.g., GroupingSet.rollup(col("a"))). DataFrame.aggregate() accepts both Expr and str column names, so for API consistency it would be helpful if GroupingSet.rollup/cube/grouping_sets also accepted str and converted via the existing expr.py helper (_to_raw_expr / Expr.column). This avoids surprising TypeError when users try GroupingSet.rollup("a").
| def percentile_cont( | ||
| sort_expression: Expr | SortExpr, | ||
| percentile: float, | ||
| filter: Expr | None = None, | ||
| ) -> Expr: | ||
| """Computes the exact percentile of input values using continuous interpolation. | ||
|
|
||
| Unlike :py:func:`approx_percentile_cont`, this function computes the exact | ||
| percentile value rather than an approximation. | ||
|
|
||
| If using the builder functions described in ref:`_aggregation` this function ignores | ||
| the options ``order_by``, ``null_treatment``, and ``distinct``. | ||
|
|
||
| Args: | ||
| sort_expression: Values for which to find the percentile | ||
| percentile: This must be between 0.0 and 1.0, inclusive | ||
| filter: If provided, only compute against rows for which the filter is True | ||
|
|
||
| Examples: | ||
| >>> ctx = dfn.SessionContext() | ||
| >>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]}) | ||
| >>> result = df.aggregate( | ||
| ... [], [dfn.functions.percentile_cont( | ||
| ... dfn.col("a"), 0.5 | ||
| ... ).alias("v")]) | ||
| >>> result.collect_column("v")[0].as_py() | ||
| 3.0 | ||
|
|
||
| >>> result = df.aggregate( | ||
| ... [], [dfn.functions.percentile_cont( | ||
| ... dfn.col("a"), 0.5, | ||
| ... filter=dfn.col("a") > dfn.lit(1.0), | ||
| ... ).alias("v")]) | ||
| >>> result.collect_column("v")[0].as_py() | ||
| 3.5 | ||
| """ | ||
| sort_expr_raw = sort_or_default(sort_expression) | ||
| filter_raw = filter.expr if filter is not None else None | ||
| return Expr(f.percentile_cont(sort_expr_raw, percentile, filter=filter_raw)) | ||
|
|
There was a problem hiding this comment.
Issues/PR rationale mention exposing percentile_cont/quantile_cont, but only percentile_cont is added. Consider adding a quantile_cont Python alias (and exporting it in all) so users can call the function using either upstream name, and add a small unit test to cover the alias.
| 9. Approximation Functions | ||
| - :py:func:`datafusion.functions.approx_distinct` | ||
| - :py:func:`datafusion.functions.approx_median` | ||
| - :py:func:`datafusion.functions.approx_percentile_cont` | ||
| - :py:func:`datafusion.functions.approx_percentile_cont_with_weight` | ||
| 10. Grouping Set Functions | ||
| - :py:func:`datafusion.functions.grouping` | ||
| - :py:meth:`datafusion.expr.GroupingSet.rollup` | ||
| - :py:meth:`datafusion.expr.GroupingSet.cube` | ||
| - :py:meth:`datafusion.expr.GroupingSet.grouping_sets` |
There was a problem hiding this comment.
The docs’ aggregate function list was updated to mention grouping sets, but it doesn’t include the newly exposed percentile_cont or var_population functions. Please add these to the appropriate sections (e.g., percentile_cont near the other percentile functions, and var_population as an alias alongside var_pop) so the user guide matches the Python API.
ntjohnson1
left a comment
There was a problem hiding this comment.
The second pass helped a lot. Just a nit on some minor formatting on the examples.
| f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")]) | ||
|
|
||
|
|
||
| Grouping Sets |
|
|
||
| Examples: | ||
| >>> import pyarrow as pa | ||
| >>> import datafusion as dfn |
There was a problem hiding this comment.
NIT: dfn is globally available to doctests. We could potentially consider adding pyarrow as pa as well
| >>> batches = result.collect() | ||
| >>> pa.concat_arrays([b.column("s") for b in batches]).to_pylist() | ||
| [30, 30, 60] |
There was a problem hiding this comment.
NIT: Is there a reason you can't just do collect_column here and print that directly?
| >>> batches = result.collect() | ||
| >>> pa.concat_arrays([b.column(2) for b in batches]).to_pylist() |
There was a problem hiding this comment.
Similar note that the batch manipulation might not be needed
| >>> batches = result.collect() | ||
| >>> pa.concat_arrays( | ||
| ... [b.column("s") for b in batches]).to_pylist() |
There was a problem hiding this comment.
Last echo on the concat arrays bit
| ... [dfn.functions.sum(dfn.col("b")).alias("s"), | ||
| ... dfn.functions.grouping(dfn.col("a"))], | ||
| ... ).sort(dfn.col("a").sort(nulls_first=False)) | ||
| >>> batches = result.collect() |
Which issue does this PR close?
Closes #861
Closes #925
Closes #1454
Rationale for this change
These functions exist upstream but were not exposed to the Python API
What changes are included in this PR?
Expose functions to Python API
Add unit tests
Are there any user-facing changes?
New addition only.