Spark: Support variant_get predicate pushdown for file skipping#15385

Open
qlong wants to merge 1 commit into apache:main from qlong:variant-file-skipping-sparkv2filters

@qlong qlong commented Feb 20, 2026

This PR adds support for manifest-based file skipping on variant columns.

Changes:

  • SparkV2Filters: Convert variant_get/try_variant_get to Expressions.extract()
  • Spark3Util.describe: Output extract terms as variant_get() for EXPLAIN
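
To make the idea behind these changes concrete, here is a minimal, self-contained sketch of how a pushed-down predicate such as `variant_get(v, '$.price', 'int') > 100` could be checked against per-file bounds to decide which files to read. It is not the Iceberg or Spark implementation: `ExtractTerm`, `GreaterThan`, `DataFile`, `might_match`, and `plan_files` are hypothetical names invented for illustration, and it assumes each file carries optional lower/upper bounds keyed by column and path.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ExtractTerm:
    """Hypothetical stand-in for an extract term: variant column + path."""
    column: str
    path: str


@dataclass(frozen=True)
class GreaterThan:
    """Hypothetical predicate: extract(column, path) > literal."""
    term: ExtractTerm
    literal: int


@dataclass
class DataFile:
    """Hypothetical data file entry with optional per-path bounds."""
    name: str
    # (column, path) -> (lower bound, upper bound); a missing key means no stats.
    bounds: Dict[Tuple[str, str], Tuple[int, int]] = field(default_factory=dict)


def might_match(pred: GreaterThan, data_file: DataFile) -> bool:
    """Return False only when the bounds prove that no row can match."""
    key = (pred.term.column, pred.term.path)
    if key not in data_file.bounds:
        return True  # no stats for this path: the file must be read
    _lower, upper = data_file.bounds[key]
    return upper > pred.literal


def plan_files(pred: GreaterThan, files: List[DataFile]) -> List[str]:
    """Keep only the files that might contain matching rows."""
    return [f.name for f in files if might_match(pred, f)]


# Illustrative data only: three files with made-up bounds.
files = [
    DataFile("a.parquet", {("v", "$.price"): (1, 50)}),    # upper <= 100: skipped
    DataFile("b.parquet", {("v", "$.price"): (80, 200)}),  # may match: kept
    DataFile("c.parquet"),                                 # no stats: kept
]
pred = GreaterThan(ExtractTerm("v", "$.price"), 100)
kept = plan_files(pred, files)
print(kept)
```

The check is deliberately conservative: a file is skipped only when its bounds rule out every possible match, so missing statistics never cause rows to be dropped.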

Tests:

  • Added unit tests
  • Manual e2e testing with spark-sql built with the dependency PRs: verified that variant_get is pushed down to Iceberg for file skipping, and verified from Spark history that files are skipped.

The PR depends on:

  1. Do not merge until Api: Support variant extract and fix manifest bounds byte order #15384 is merged.
  2. [SPARK-55617] Add VariantGet to V2ExpressionBuilder for DSv2 filter pushdown spark#54394: Spark side change to add VariantGet to DSv2 filter

Related issue:

  1. Variant Data Type Support #10392

CI status (expected until #15384)

CI fails because testDescribeExtractExpression expects bracket paths in Spark3Util.describe() output, which requires #15384’s UnboundExtract + PathUtil.toNormalizedPath() on main. Without that API, main still stores/prints dot paths, so the test fails; cancelled matrix jobs are fail-fast side effects. Merge #15384 first, then rebase and re-run CI here.
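
For readers unfamiliar with the two path styles, the following sketch shows what converting a simple dot path into a bracket path might look like. It is an assumption-laden illustration, not the `PathUtil.toNormalizedPath()` implementation from #15384: the function name `to_bracket_path` is hypothetical, it handles only plain `$.name.name` paths (no array indexes or wildcards), and it assumes the bracket form follows the single-quoted style of RFC 9535 normalized paths.

```python
def to_bracket_path(dot_path: str) -> str:
    """Convert a simple dot path like $.a.b into bracket form $['a']['b'].

    Hypothetical helper for illustration; supports member names only.
    """
    if not dot_path.startswith("$"):
        raise ValueError("path must start with '$'")
    rest = dot_path[1:]
    if rest == "":
        return "$"  # the root path has no segments
    if not rest.startswith("."):
        raise ValueError("only simple dot paths are supported")
    names = rest[1:].split(".")
    if any(name == "" for name in names):
        raise ValueError("empty path segment")

    def quote(name: str) -> str:
        # Escape backslashes first, then single quotes.
        escaped = name.replace("\\", "\\\\").replace("'", "\\'")
        return "['" + escaped + "']"

    return "$" + "".join(quote(name) for name in names)


bracket = to_bracket_path("$.event.id")
print(bracket)
```

Under these assumptions, a test written against bracket output would fail on a build that still prints `$.event.id`, which matches the failure mode described above.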

@huaxingao
Contributor

cc @aihuaxu

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Mar 31, 2026
@steveloughran
Contributor

Not stale! We need this! Queries on variants are pretty bad right now, and skipping files can start to recover that.

@github-actions github-actions Bot removed the stale label Apr 1, 2026
@steveloughran
Contributor

This is why my benchmarking of rowgroup filtering isn't working... this is needed for the variant_get() passthrough.

Contributor

@nssalian nssalian left a comment

Mostly looks good otherwise. Left a nit comment. @huaxingao @aihuaxu PTAL

@steveloughran
Contributor

I'm going to pull this into my rowgroup filtering PR, as

  1. it's needed
  2. that's actually how this work is validated

steveloughran added a commit to steveloughran/iceberg that referenced this pull request Apr 30, 2026
…kipping

copilot's solution to why pushdown wasn't working, independent of
qlong's apache#15385

I plan to take qlong's and pull what is extra from this one.
steveloughran added a commit to steveloughran/iceberg that referenced this pull request Apr 30, 2026
steveloughran added a commit to steveloughran/iceberg that referenced this pull request May 11, 2026
steveloughran added a commit to steveloughran/iceberg that referenced this pull request May 11, 2026
steveloughran added a commit to steveloughran/iceberg that referenced this pull request May 14, 2026
@qlong qlong force-pushed the variant-file-skipping-sparkv2filters branch 2 times, most recently from a1bf0b6 to e926014 Compare May 18, 2026 18:45
@qlong qlong force-pushed the variant-file-skipping-sparkv2filters branch from e926014 to 044867f Compare May 19, 2026 23:08
@soumilshah1995

Sharing a small test I completed on my local laptop.
I pulled the branch, built the JARs, and then ran the build locally.

image

Although this was done on my local machine with a small dataset, I will run tests on ~400GB of data next. I do see improvements in the results so far.

@qlong qlong force-pushed the variant-file-skipping-sparkv2filters branch from 044867f to 2fa7c02 Compare May 20, 2026 14:32
@steveloughran
Contributor

FYI I have evidence of this working in my latest benchmark runs.

@soumilshah1995

soumilshah1995 commented May 20, 2026

> FYI I have evidence of this working in my latest benchmark runs.

Can you share the links, please? @qlong @steveloughran

@qlong
Author

qlong commented May 20, 2026

@soumilshah1995 I verified file skipping by looking at the Spark UI. If you have any shredded columns, the performance could be worse compared to unshredded or plain JSON. See #16448

file_skipping

@soumilshah1995

> @soumilshah1995 I verified file skipping by looking at the Spark UI. If you have any shredded columns, the performance could be worse compared to unshredded or plain JSON. See #16448
>
> file_skipping

Yes, you are correct. My test shows the same: unshredded still performs better than shredded.

@qlong qlong force-pushed the variant-file-skipping-sparkv2filters branch 3 times, most recently from 0701d8e to b5eeafc Compare May 21, 2026 01:49
@github-actions github-actions Bot removed the API label May 21, 2026
@qlong qlong force-pushed the variant-file-skipping-sparkv2filters branch from b5eeafc to bfa668d Compare May 21, 2026 03:30
- SparkV2Filters: Convert variant_get/try_variant_get to
  Expressions.extract()
- Spark3Util.describe: Output extract terms as variant_get() for EXPLAIN
- Add tests for both

Depends on:
- apache#15384
- apache/spark#54394
@qlong qlong force-pushed the variant-file-skipping-sparkv2filters branch from bfa668d to c12ae73 Compare May 21, 2026 03:41
steveloughran added a commit to steveloughran/iceberg that referenced this pull request May 21, 2026
5 participants