Spark: Support variant_get predicate pushdown for file skipping by qlong · Pull Request #15385 · apache/iceberg

qlong · 2026-02-20T18:26:28Z

This is to support manifest-based file skipping for variant columns.
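As a rough illustration of what manifest-based file skipping buys, the sketch below models per-file lower and upper bounds for a single extracted variant field and drops the files whose range cannot contain the value being searched for. It is a simplified model under stated assumptions, not Iceberg's evaluator: `FileSkippingSketch`, `FileBounds`, `mightContain` and `filesToScan` are hypothetical names, and the bounds are plain `long` values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative model only: these names are hypothetical and are not Iceberg API.
final class FileSkippingSketch {
  // Lower and upper bound of one extracted variant field, recorded per data file.
  static final class FileBounds {
    final String path;
    final long lower;
    final long upper;

    FileBounds(String path, long lower, long upper) {
      this.path = path;
      this.lower = lower;
      this.upper = upper;
    }
  }

  // A file may hold a row with field == value only if value lies within [lower, upper].
  static boolean mightContain(FileBounds file, long value) {
    return file.lower <= value && value <= file.upper;
  }

  // Keep the files that cannot be ruled out; the rest are skipped without being opened.
  static List<String> filesToScan(List<FileBounds> files, long value) {
    return files.stream()
        .filter(f -> mightContain(f, value))
        .map(f -> f.path)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<FileBounds> files = Arrays.asList(
        new FileBounds("a.parquet", 1, 10),
        new FileBounds("b.parquet", 11, 20),
        new FileBounds("c.parquet", 21, 30));
    System.out.println(filesToScan(files, 15)); // prints [b.parquet]
  }
}
```

The check is conservative: a file whose range contains the value is still scanned even if no row matches, so skipping never changes query results.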

Changes:

SparkV2Filters: Convert variant_get/try_variant_get to Expressions.extract()
Spark3Util.describe: Output extract terms as variant_get() for EXPLAIN
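The two changes above can be pictured with a small self-contained model. This is only a sketch: `ExtractTerm`, `convert` and `describe` below are hypothetical stand-ins written for illustration, not the actual SparkV2Filters or Spark3Util code, and the exact string printed for EXPLAIN is an assumption.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical model of the conversion; none of these types are Iceberg or Spark classes.
final class VariantGetPushdownSketch {
  // Stand-in for an extract term: which variant column, which path, which target type.
  static final class ExtractTerm {
    final String column;
    final String path;
    final String type;

    ExtractTerm(String column, String path, String type) {
      this.column = column;
      this.path = path;
      this.type = type;
    }

    // Render the term back in variant_get() form (assumed format, for illustration).
    String describe() {
      return "variant_get(" + column + ", '" + path + "', '" + type + "')";
    }
  }

  // Map a pushed-down function call to an extract term. An empty result means the
  // call is not convertible, so the filter would be left for the engine to evaluate.
  static Optional<ExtractTerm> convert(String function, String column, String path, String type) {
    String name = function.toLowerCase(Locale.ROOT);
    boolean supported = name.equals("variant_get") || name.equals("try_variant_get");
    if (!supported || path == null || !path.startsWith("$")) {
      return Optional.empty();
    }
    return Optional.of(new ExtractTerm(column, path, type));
  }

  public static void main(String[] args) {
    Optional<ExtractTerm> term = convert("try_variant_get", "payload", "$.user.id", "int");
    System.out.println(term.map(ExtractTerm::describe).orElse("not pushed down"));
  }
}
```

In this model both function names map to the same term, which is why the rendered form shows `variant_get` for either input.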

Tests:

Added unit tests
Manual e2e testing with spark-sql built with the dependent PRs: verified that variant_get is pushed down to Iceberg for file skipping, and verified from the Spark history that files are skipped.

The PR depends on:

Do not merge until Api: Support variant extract and fix manifest bounds byte order #15384 is merged.
[SPARK-55617] Add VariantGet to V2ExpressionBuilder for DSv2 filter pushdown spark#54394: Spark side change to add VariantGet to DSv2 filter

Related issue:

Variant Data Type Support #10392

CI status (expected until #15384)

CI fails because testDescribeExtractExpression expects bracket paths in Spark3Util.describe() output, which requires #15384’s UnboundExtract + PathUtil.toNormalizedPath() on main. Without that API, main still stores/prints dot paths, so the test fails; cancelled matrix jobs are fail-fast side effects. Merge #15384 first, then rebase and re-run CI here.
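To make the dot-path versus bracket-path difference concrete, here is a minimal sketch of such a normalization. It is not the PathUtil.toNormalizedPath() implementation from #15384: the helper name `toBracketPath`, the exact bracket format (`$['a']['b']`), and the restriction to plain dotted field names are all assumptions made for illustration.

```java
// Hypothetical normalizer; the output format is an assumption, not taken from #15384.
final class PathNormalizeSketch {
  // Rewrite a simple dot path such as $.a.b into bracket form such as $['a']['b'].
  static String toBracketPath(String dotPath) {
    if (dotPath == null || !dotPath.startsWith("$")) {
      throw new IllegalArgumentException("Path must start with $: " + dotPath);
    }
    String rest = dotPath.substring(1);
    if (rest.isEmpty()) {
      return "$"; // the root path has no fields to rewrite
    }
    if (!rest.startsWith(".")) {
      throw new IllegalArgumentException("Only dotted field paths are handled: " + dotPath);
    }
    StringBuilder normalized = new StringBuilder("$");
    // The -1 limit keeps trailing empty segments so that "$.a." is rejected below.
    for (String field : rest.substring(1).split("\\.", -1)) {
      if (field.isEmpty()) {
        throw new IllegalArgumentException("Empty field name in path: " + dotPath);
      }
      normalized.append("['").append(field).append("']");
    }
    return normalized.toString();
  }

  public static void main(String[] args) {
    System.out.println(toBracketPath("$.user.id")); // prints $['user']['id']
  }
}
```

A test that asserts on the bracket form would fail against a build that still prints the dot form, which matches the failure described above.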

huaxingao · 2026-02-25T05:19:44Z

cc @aihuaxu

github-actions · 2026-03-31T00:27:39Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

steveloughran · 2026-03-31T09:16:29Z

Not stale! We need this! Queries on variants are pretty bad right now, and skipping files can start to recover that.

steveloughran · 2026-04-29T20:38:46Z

This is why my benchmarking of rowgroup filtering isn't working... this is needed for the variant_get() passthrough.

nssalian

Mostly looks good otherwise. Left a nit comment. @huaxingao @aihuaxu PTAL

steveloughran · 2026-04-30T14:27:35Z

I'm going to pull this into my rowgroup filtering PR because:

- it's needed
- that's actually how this work is validated

…kipping copilot's solution to why pushdown wasn't working, independent of qlong's apache#15385 I plan to take qlong's and pull what is extra from this one.

soumilshah1995 · 2026-05-19T23:37:56Z

Sharing a small test I completed on my local laptop.
I pulled the branch, built the JARs, and then ran the build locally.

Although this was done on my local machine with a small dataset, I will run tests on ~400GB of data next. I do see improvements in the results so far.

steveloughran · 2026-05-20T14:36:01Z

FYI I have evidence of this working in my latest benchmark runs.

soumilshah1995 · 2026-05-20T15:25:58Z

> FYI I have evidence of this working in my latest benchmark runs.

Can you share the links, please? @qlong @steveloughran

qlong · 2026-05-20T17:56:17Z

@soumilshah1995 I verified file skipping by looking at the Spark UI. If you have any shredded columns, performance could be worse than with unshredded variants or plain JSON. See #16448

soumilshah1995 · 2026-05-20T19:29:53Z

> @soumilshah1995 i verified file skipping by looking at spark UI. If you have any shredded columns, the performance could be worse compared to unshredded or plain json. See #16448

Yes, you are correct. My test shows the same thing: unshredded still performs better than shredded.

- SparkV2Filters: Convert variant_get/try_variant_get to Expressions.extract()
- Spark3Util.describe: Output extract terms as variant_get() for EXPLAIN
- Add tests for both

Depends on Spark PR:
- apache#15384
- apache/spark#54394

github-actions Bot added the spark label Feb 20, 2026

qlong mentioned this pull request Feb 20, 2026

[SPARK-55617] Add VariantGet to V2ExpressionBuilder for DSv2 filter pushdown apache/spark#54394

Open

This was referenced Feb 28, 2026

Variant Data Type Support #10392

Open

Support row group skipping for shredded variant columns #15510

Open

github-actions Bot added the stale label Mar 31, 2026

github-actions Bot removed the stale label Apr 1, 2026

nssalian reviewed Apr 30, 2026

Comment thread spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkV2Filters.java

steveloughran mentioned this pull request Apr 30, 2026

Core, Spark: Performant queries over (shredded) Variant data #16172

Open

3 tasks

steveloughran added a commit to steveloughran/iceberg that referenced this pull request Apr 30, 2026

Combine with apache#15385 changes; tests are main diff.

2098c8e

steveloughran added a commit to steveloughran/iceberg that referenced this pull request May 11, 2026

Combine with apache#15385 changes; tests are main diff.

b698c26

qlong force-pushed the variant-file-skipping-sparkv2filters branch 2 times, most recently from a1bf0b6 to e926014 Compare May 18, 2026 18:45

github-actions Bot added API core labels May 18, 2026

qlong force-pushed the variant-file-skipping-sparkv2filters branch from e926014 to 044867f Compare May 19, 2026 23:08

qlong force-pushed the variant-file-skipping-sparkv2filters branch from 044867f to 2fa7c02 Compare May 20, 2026 14:32

qlong mentioned this pull request May 20, 2026

Spark: implement SupportsPushDownVariantExtractions for shredded variant column pruning (plan rewrite) #16448

Open

3 tasks

qlong force-pushed the variant-file-skipping-sparkv2filters branch 3 times, most recently from 0701d8e to b5eeafc Compare May 21, 2026 01:49

github-actions Bot removed the API label May 21, 2026

qlong force-pushed the variant-file-skipping-sparkv2filters branch from b5eeafc to bfa668d Compare May 21, 2026 03:30

qlong force-pushed the variant-file-skipping-sparkv2filters branch from bfa668d to c12ae73 Compare May 21, 2026 03:41

qlong wants to merge 1 commit into apache:main from qlong:variant-file-skipping-sparkv2filters

5 participants
