Skip to content

Add missing aggregate functions#1471

Open
timsaucer wants to merge 9 commits intoapache:mainfrom
timsaucer:feat/expose-agg-fns
Open

Add missing aggregate functions#1471
timsaucer wants to merge 9 commits intoapache:mainfrom
timsaucer:feat/expose-agg-fns

Conversation

@timsaucer
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #861
Closes #925
Closes #1454

Rationale for this change

These functions exist upstream but were not exposed to the Python API

What changes are included in this PR?

Expose functions to Python API
Add unit tests

Are there any user-facing changes?

New addition only.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR exposes several upstream DataFusion aggregate functions that were previously missing from the Python API, and adds unit tests to validate the new bindings.

Changes:

  • Expose grouping and percentile_cont from Rust bindings to the Python API.
  • Add Python-level wrapper for var_population as an alias of var_pop.
  • Add unit tests covering percentile_cont, grouping, and var_population.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
crates/core/src/functions.rs Enables the grouping aggregate binding and adds a new percentile_cont pyfunction export.
python/datafusion/functions.py Adds public Python wrappers/exports for grouping, percentile_cont, and var_population.
python/tests/test_functions.py Adds unit tests for the newly exposed aggregate functions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

timsaucer and others added 2 commits April 3, 2026 16:04
…ation

Expose upstream DataFusion aggregate functions that were not yet
available in the Python API. Closes apache#1454.

- grouping: returns grouping set membership indicator (rewritten by
  the ResolveGroupingFunction analyzer rule before physical planning)
- percentile_cont: computes exact percentile using continuous
  interpolation (unlike approx_percentile_cont which uses t-digest)
- var_population: alias for var_pop

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer force-pushed the feat/expose-agg-fns branch from 0c8dc78 to d16cff1 Compare April 3, 2026 20:22
@timsaucer timsaucer marked this pull request as ready for review April 5, 2026 12:38
Copy link
Copy Markdown
Contributor

@ntjohnson1 ntjohnson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grouping examples would make it a lot clearer. Without looking around or thinking about the description more deeply the usage isn't clear to me.

Args:
expression: The column to check grouping status for
distinct: If True, compute on distinct values only
filter: If provided, only compute against rows for which the filter is True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No examples

assert result.column(0)[0].as_py() == 3.0


def test_percentile_cont_with_filter():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Would be nicer to just be a parametrized test since the only diff is the filter and the result

assert result.column(0)[0].as_py() == 3.5


def test_grouping():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add more examples then probably don't need to also add tests here but this seems to only test one of the grouping cases

timsaucer and others added 7 commits April 6, 2026 09:00
Add docstring example to grouping(), parametrize percentile_cont tests,
and add multi-column grouping test case.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expose ROLLUP, CUBE, and GROUPING SETS via the DataFrame API by adding
static methods on GroupingSet that construct the corresponding Expr
variants. Update grouping() docstring and tests to use the new API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ctly

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add user documentation for GroupingSet.rollup, .cube, and
.grouping_sets with Pokemon dataset examples. Document the upstream
alias limitation (apache/datafusion#21411) in both the grouping()
docstring and the aggregation user guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer requested review from Copilot and ntjohnson1 April 6, 2026 16:31
@timsaucer
Copy link
Copy Markdown
Member Author

@ntjohnson1 I expanded the PR quite a bit to properly support grouping sets and I also updated the online documentation to explain it in a lot more detail.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1447 to +1481
@staticmethod
def rollup(*exprs: Expr) -> Expr:
"""Create a ``ROLLUP`` grouping set for use with ``aggregate()``.

``ROLLUP`` generates all prefixes of the given column list as
grouping sets. For example, ``rollup(a, b)`` produces grouping
sets ``(a, b)``, ``(a)``, and ``()`` (grand total).

This is equivalent to ``GROUP BY ROLLUP(a, b)`` in SQL.

Args:
*exprs: Column expressions to include in the rollup.

Examples:
>>> import pyarrow as pa
>>> import datafusion as dfn
>>> from datafusion.expr import GroupingSet
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1, 1, 2], "b": [10, 20, 30]})
>>> result = df.aggregate(
... [GroupingSet.rollup(dfn.col("a"))],
... [dfn.functions.sum(dfn.col("b")).alias("s"),
... dfn.functions.grouping(dfn.col("a"))],
... ).sort(dfn.col("a").sort(nulls_first=False))
>>> batches = result.collect()
>>> pa.concat_arrays([b.column("s") for b in batches]).to_pylist()
[30, 30, 60]

See Also:
:py:meth:`cube`, :py:meth:`grouping_sets`,
:py:func:`~datafusion.functions.grouping`
"""
args = [e.expr for e in exprs]
return Expr(expr_internal.GroupingSet.rollup(*args))

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new GroupingSet factory methods only accept Expr objects (e.g., GroupingSet.rollup(col("a"))). DataFrame.aggregate() accepts both Expr and str column names, so for API consistency it would be helpful if GroupingSet.rollup/cube/grouping_sets also accepted str and converted via the existing expr.py helper (_to_raw_expr / Expr.column). This avoids surprising TypeError when users try GroupingSet.rollup("a").

Copilot uses AI. Check for mistakes.
Comment on lines +4312 to +4351
def percentile_cont(
sort_expression: Expr | SortExpr,
percentile: float,
filter: Expr | None = None,
) -> Expr:
"""Computes the exact percentile of input values using continuous interpolation.

Unlike :py:func:`approx_percentile_cont`, this function computes the exact
percentile value rather than an approximation.

If using the builder functions described in ref:`_aggregation` this function ignores
the options ``order_by``, ``null_treatment``, and ``distinct``.

Args:
sort_expression: Values for which to find the percentile
percentile: This must be between 0.0 and 1.0, inclusive
filter: If provided, only compute against rows for which the filter is True

Examples:
>>> ctx = dfn.SessionContext()
>>> df = ctx.from_pydict({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
>>> result = df.aggregate(
... [], [dfn.functions.percentile_cont(
... dfn.col("a"), 0.5
... ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.0

>>> result = df.aggregate(
... [], [dfn.functions.percentile_cont(
... dfn.col("a"), 0.5,
... filter=dfn.col("a") > dfn.lit(1.0),
... ).alias("v")])
>>> result.collect_column("v")[0].as_py()
3.5
"""
sort_expr_raw = sort_or_default(sort_expression)
filter_raw = filter.expr if filter is not None else None
return Expr(f.percentile_cont(sort_expr_raw, percentile, filter=filter_raw))

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issues/PR rationale mention exposing percentile_cont/quantile_cont, but only percentile_cont is added. Consider adding a quantile_cont Python alias (and exporting it in all) so users can call the function using either upstream name, and add a small unit test to cover the alias.

Copilot uses AI. Check for mistakes.
Comment on lines 373 to +382
9. Approximation Functions
- :py:func:`datafusion.functions.approx_distinct`
- :py:func:`datafusion.functions.approx_median`
- :py:func:`datafusion.functions.approx_percentile_cont`
- :py:func:`datafusion.functions.approx_percentile_cont_with_weight`
10. Grouping Set Functions
- :py:func:`datafusion.functions.grouping`
- :py:meth:`datafusion.expr.GroupingSet.rollup`
- :py:meth:`datafusion.expr.GroupingSet.cube`
- :py:meth:`datafusion.expr.GroupingSet.grouping_sets`
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs’ aggregate function list was updated to mention grouping sets, but it doesn’t include the newly exposed percentile_cont or var_population functions. Please add these to the appropriate sections (e.g., percentile_cont near the other percentile functions, and var_population as an alias alongside var_pop) so the user guide matches the Python API.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@ntjohnson1 ntjohnson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second pass helped a lot. Just a nit on some minor formatting on the examples.

f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])


Grouping Sets
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much clearer!


Examples:
>>> import pyarrow as pa
>>> import datafusion as dfn
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: dfn is globally available to doctests. We could potentially consider adding pyarrow as pa as well

Comment on lines +1471 to +1473
>>> batches = result.collect()
>>> pa.concat_arrays([b.column("s") for b in batches]).to_pylist()
[30, 30, 60]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Is there a reason you can't just do collect_column here and print that directly?

Comment on lines +1509 to +1510
>>> batches = result.collect()
>>> pa.concat_arrays([b.column(2) for b in batches]).to_pylist()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar note that the batch manipulation might not be needed

Comment on lines +1553 to +1555
>>> batches = result.collect()
>>> pa.concat_arrays(
... [b.column("s") for b in batches]).to_pylist()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last echo on the concat arrays bit

... [dfn.functions.sum(dfn.col("b")).alias("s"),
... dfn.functions.grouping(dfn.col("a"))],
... ).sort(dfn.col("a").sort(nulls_first=False))
>>> batches = result.collect()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

echo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants