Add missing aggregate functions by timsaucer · Pull Request #1471 · apache/datafusion-python

timsaucer · 2026-03-31T17:11:08Z

Which issue does this PR close?

Closes #861
Closes #925
Closes #1454

Rationale for this change

These functions exist upstream but were not exposed to the Python API

What changes are included in this PR?

Expose functions to Python API
Add unit tests

Are there any user-facing changes?

New addition only.

Copilot

Pull request overview

This PR exposes several upstream DataFusion aggregate functions that were previously missing from the Python API, and adds unit tests to validate the new bindings.

Changes:

Expose grouping and percentile_cont from Rust bindings to the Python API.
Add Python-level wrapper for var_population as an alias of var_pop.
Add unit tests covering percentile_cont, grouping, and var_population.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`crates/core/src/functions.rs`	Enables the `grouping` aggregate binding and adds a new `percentile_cont` pyfunction export.
`python/datafusion/functions.py`	Adds public Python wrappers/exports for `grouping`, `percentile_cont`, and `var_population`.
`python/tests/test_functions.py`	Adds unit tests for the newly exposed aggregate functions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

python/datafusion/functions.py

python/tests/test_functions.py

…ation Expose upstream DataFusion aggregate functions that were not yet available in the Python API. Closes apache#1454. - grouping: returns grouping set membership indicator (rewritten by the ResolveGroupingFunction analyzer rule before physical planning) - percentile_cont: computes exact percentile using continuous interpolation (unlike approx_percentile_cont which uses t-digest) - var_population: alias for var_pop Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ntjohnson1

Grouping examples would make it a lot clearer. Without looking around or thinking about the description more deeply the usage isn't clear to me.

ntjohnson1 · 2026-04-06T11:20:04Z

python/datafusion/functions.py

+    Args:
+        expression: The column to check grouping status for
+        distinct: If True, compute on distinct values only
+        filter: If provided, only compute against rows for which the filter is True


No examples

ntjohnson1 · 2026-04-06T11:21:27Z

python/tests/test_functions.py

+    assert result.column(0)[0].as_py() == 3.0
+
+
+def test_percentile_cont_with_filter():


NIT: Would be nicer to just be a parametrized test since the only diff is the filter and the result

ntjohnson1 · 2026-04-06T11:23:29Z

python/tests/test_functions.py

+    assert result.column(0)[0].as_py() == 3.5
+
+
+def test_grouping():


If you add more examples then probably don't need to also add tests here but this seems to only test one of the grouping cases

Add docstring example to grouping(), parametrize percentile_cont tests, and add multi-column grouping test case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Expose ROLLUP, CUBE, and GROUPING SETS via the DataFrame API by adding static methods on GroupingSet that construct the corresponding Expr variants. Update grouping() docstring and tests to use the new API. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ctly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add user documentation for GroupingSet.rollup, .cube, and .grouping_sets with Pokemon dataset examples. Document the upstream alias limitation (apache/datafusion#21411) in both the grouping() docstring and the aggregation user guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

timsaucer · 2026-04-06T16:32:08Z

@ntjohnson1 I expanded the PR quite a bit to properly support grouping sets and I also updated the online documentation to explain it in a lot more detail.

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-06T16:38:18Z

python/datafusion/expr.py

+    @staticmethod
+    def rollup(*exprs: Expr) -> Expr:
+        """Create a ``ROLLUP`` grouping set for use with ``aggregate()``.
+
+        ``ROLLUP`` generates all prefixes of the given column list as
+        grouping sets. For example, ``rollup(a, b)`` produces grouping
+        sets ``(a, b)``, ``(a)``, and ``()`` (grand total).
+
+        This is equivalent to ``GROUP BY ROLLUP(a, b)`` in SQL.
+
+        Args:
+            *exprs: Column expressions to include in the rollup.
+
+        Examples:
+            >>> import pyarrow as pa
+            >>> import datafusion as dfn
+            >>> from datafusion.expr import GroupingSet
+            >>> ctx = dfn.SessionContext()
+            >>> df = ctx.from_pydict({"a": [1, 1, 2], "b": [10, 20, 30]})
+            >>> result = df.aggregate(
+            ...     [GroupingSet.rollup(dfn.col("a"))],
+            ...     [dfn.functions.sum(dfn.col("b")).alias("s"),
+            ...      dfn.functions.grouping(dfn.col("a"))],
+            ... ).sort(dfn.col("a").sort(nulls_first=False))
+            >>> batches = result.collect()
+            >>> pa.concat_arrays([b.column("s") for b in batches]).to_pylist()
+            [30, 30, 60]
+
+        See Also:
+            :py:meth:`cube`, :py:meth:`grouping_sets`,
+            :py:func:`~datafusion.functions.grouping`
+        """
+        args = [e.expr for e in exprs]
+        return Expr(expr_internal.GroupingSet.rollup(*args))
+


The new GroupingSet factory methods only accept Expr objects (e.g., GroupingSet.rollup(col("a"))). DataFrame.aggregate() accepts both Expr and str column names, so for API consistency it would be helpful if GroupingSet.rollup/cube/grouping_sets also accepted str and converted via the existing expr.py helper (_to_raw_expr / Expr.column). This avoids surprising TypeError when users try GroupingSet.rollup("a").

Copilot · 2026-04-06T16:38:19Z

python/datafusion/functions.py

+def percentile_cont(
+    sort_expression: Expr | SortExpr,
+    percentile: float,
+    filter: Expr | None = None,
+) -> Expr:
+    """Computes the exact percentile of input values using continuous interpolation.
+
+    Unlike :py:func:`approx_percentile_cont`, this function computes the exact
+    percentile value rather than an approximation.
+
+    If using the builder functions described in ref:`_aggregation` this function ignores
+    the options ``order_by``, ``null_treatment``, and ``distinct``.
+
+    Args:
+        sort_expression: Values for which to find the percentile
+        percentile: This must be between 0.0 and 1.0, inclusive
+        filter: If provided, only compute against rows for which the filter is True
+
+    Examples:
+        >>> ctx = dfn.SessionContext()
+        >>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
+        >>> result = df.aggregate(
+        ...     [], [dfn.functions.percentile_cont(
+        ...         dfn.col("a"), 0.5
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        3.0
+
+        >>> result = df.aggregate(
+        ...     [], [dfn.functions.percentile_cont(
+        ...         dfn.col("a"), 0.5,
+        ...         filter=dfn.col("a") > dfn.lit(1.0),
+        ...     ).alias("v")])
+        >>> result.collect_column("v")[0].as_py()
+        3.5
+    """
+    sort_expr_raw = sort_or_default(sort_expression)
+    filter_raw = filter.expr if filter is not None else None
+    return Expr(f.percentile_cont(sort_expr_raw, percentile, filter=filter_raw))
+


Issues/PR rationale mention exposing percentile_cont/quantile_cont, but only percentile_cont is added. Consider adding a quantile_cont Python alias (and exporting it in all) so users can call the function using either upstream name, and add a small unit test to cover the alias.

Copilot · 2026-04-06T16:38:19Z

docs/source/user-guide/common-operations/aggregations.rst

 9. Approximation Functions
    - :py:func:`datafusion.functions.approx_distinct`
    - :py:func:`datafusion.functions.approx_median`
    - :py:func:`datafusion.functions.approx_percentile_cont`
    - :py:func:`datafusion.functions.approx_percentile_cont_with_weight`
+10. Grouping Set Functions
+    - :py:func:`datafusion.functions.grouping`
+    - :py:meth:`datafusion.expr.GroupingSet.rollup`
+    - :py:meth:`datafusion.expr.GroupingSet.cube`
+    - :py:meth:`datafusion.expr.GroupingSet.grouping_sets`


The docs’ aggregate function list was updated to mention grouping sets, but it doesn’t include the newly exposed percentile_cont or var_population functions. Please add these to the appropriate sections (e.g., percentile_cont near the other percentile functions, and var_population as an alias alongside var_pop) so the user guide matches the Python API.

ntjohnson1

The second pass helped a lot. Just a nit on some minor formatting on the examples.

ntjohnson1 · 2026-04-06T17:10:14Z

docs/source/user-guide/common-operations/aggregations.rst

        f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])


+Grouping Sets


Much clearer!

ntjohnson1 · 2026-04-06T17:11:58Z

python/datafusion/expr.py

+
+        Examples:
+            >>> import pyarrow as pa
+            >>> import datafusion as dfn


NIT: dfn is globally available to doctests. We could potentially consider adding pyarrow as pa as well

ntjohnson1 · 2026-04-06T17:13:46Z

python/datafusion/expr.py

+            >>> batches = result.collect()
+            >>> pa.concat_arrays([b.column("s") for b in batches]).to_pylist()
+            [30, 30, 60]


NIT: Is there a reason you can't just do collect_column here and print that directly?

ntjohnson1 · 2026-04-06T17:14:07Z

python/datafusion/expr.py

+            >>> batches = result.collect()
+            >>> pa.concat_arrays([b.column(2) for b in batches]).to_pylist()


Similar note that the batch manipulation might not be needed

ntjohnson1 · 2026-04-06T17:14:38Z

python/datafusion/expr.py

+            >>> batches = result.collect()
+            >>> pa.concat_arrays(
+            ...     [b.column("s") for b in batches]).to_pylist()


Last echo on the concat arrays bit

ntjohnson1 · 2026-04-06T17:15:14Z

python/datafusion/functions.py

+        ...     [dfn.functions.sum(dfn.col("b")).alias("s"),
+        ...      dfn.functions.grouping(dfn.col("a"))],
+        ... ).sort(dfn.col("a").sort(nulls_first=False))
+        >>> batches = result.collect()


timsaucer requested a review from Copilot April 3, 2026 19:16

Copilot started reviewing on behalf of timsaucer April 3, 2026 19:17 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

python/datafusion/functions.py Outdated Show resolved Hide resolved

python/datafusion/functions.py Outdated Show resolved Hide resolved

python/tests/test_functions.py Outdated Show resolved Hide resolved

timsaucer and others added 2 commits April 3, 2026 16:04

Fix grouping() distinct parameter type for API consistency

d16cff1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

timsaucer force-pushed the feat/expose-agg-fns branch from 0c8dc78 to d16cff1 Compare April 3, 2026 20:22

timsaucer marked this pull request as ready for review April 5, 2026 12:38

ntjohnson1 approved these changes Apr 6, 2026

View reviewed changes

timsaucer and others added 7 commits April 6, 2026 09:00

Merge branch 'main' into feat/expose-agg-fns

271e86b

Improve aggregate function tests and docstrings per review feedback

0303716

Add docstring example to grouping(), parametrize percentile_cont tests, and add multi-column grouping test case. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove _GroupingSetInternal alias, use expr_internal.GroupingSet dire…

c52b49c

…ctly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Parametrize grouping set tests for rollup and cube

c9183dd

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add grouping sets note to DataFrame.aggregate() docstring

18ab457

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

timsaucer requested review from Copilot and ntjohnson1 April 6, 2026 16:31

Copilot started reviewing on behalf of timsaucer April 6, 2026 16:32 View session

Copilot AI reviewed Apr 6, 2026

View reviewed changes

ntjohnson1 approved these changes Apr 6, 2026

View reviewed changes

		assert result.column(0)[0].as_py() == 3.0


		def test_percentile_cont_with_filter():

		assert result.column(0)[0].as_py() == 3.5


		def test_grouping():

		f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])


		Grouping Sets

		>>> batches = result.collect()
		>>> pa.concat_arrays([b.column(2) for b in batches]).to_pylist()

Conversation

timsaucer commented Mar 31, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ntjohnson1 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

timsaucer commented Apr 6, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

Uh oh!

ntjohnson1 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants