Spark: Support variant_get predicate pushdown for file skipping #15385
qlong wants to merge 1 commit into
cc @aihuaxu

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

Not stale! We need this! Queries on variants are pretty bad right now, and skipping files can start to recover that.

This is why my benchmarking of rowgroup filtering isn't working... needed for the variant_get() passthrough.
nssalian left a comment:
Mostly looks good otherwise. Left a nit comment. @huaxingao @aihuaxu PTAL

I'm going to pull this into my rowgroup filtering PR as

…kipping copilot's solution to why pushdown wasn't working, independent of qlong's apache#15385 I plan to take qlong's and pull what is extra from this one.
a1bf0b6 to e926014
e926014 to 044867f
044867f to 2fa7c02

FYI I have evidence of this working in my latest benchmark runs.

Can you share the links, please? @qlong steveloughran

@soumilshah1995 I verified file skipping by looking at the Spark UI. If you have any shredded columns, the performance could be worse compared to unshredded or plain JSON. See #16448

Yes, you are correct. My test shows the same: unshredded still performs better than shredded.
0701d8e to b5eeafc
b5eeafc to bfa668d
- SparkV2Filters: Convert variant_get/try_variant_get to Expressions.extract()
- Spark3Util.describe: Output extract terms as variant_get() for EXPLAIN
- Add tests for both

Depends on Spark PR:
- apache#15384
- apache/spark#54394
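
To make the commit message concrete, here is a minimal Python sketch of the round trip it describes: a pushed `variant_get`/`try_variant_get` call is turned into an extract-style term, and that term is printed back as `variant_get()` for EXPLAIN output. This is not the Iceberg code. The `Extract` class, `convert_call`, and `describe` are hypothetical names used for illustration, and the sketch assumes the three-argument form of the function (column, path, type).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extract:
    """Hypothetical stand-in for an extract term: column, path, and target type."""
    column: str
    path: str
    type_name: str


# Function names the sketch treats as convertible (taken from the commit message).
SUPPORTED = {"variant_get", "try_variant_get"}


def convert_call(func_name: str, args: tuple) -> Optional[Extract]:
    """Return an Extract term, or None when the call cannot be pushed down."""
    if func_name.lower() not in SUPPORTED or len(args) != 3:
        return None
    column, path, type_name = args
    # Assumption: only literal string paths rooted at "$" are convertible.
    if not isinstance(path, str) or not path.startswith("$"):
        return None
    return Extract(column, path, type_name)


def describe(term: Extract) -> str:
    """Render the term the way a reader would expect to see it in EXPLAIN."""
    return f"variant_get({term.column}, '{term.path}', '{term.type_name}')"


term = convert_call("try_variant_get", ("payload", "$.event.id", "int"))
print(describe(term))  # variant_get(payload, '$.event.id', 'int')
```

Returning `None` for anything unrecognized keeps the sketch conservative: a predicate that cannot be converted is simply not pushed down, so it can never cause a file to be skipped incorrectly.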
bfa668d to c12ae73



This is to support manifest-based file skipping for variant columns.
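
For readers unfamiliar with the idea, the following Python sketch illustrates file skipping from per-file bounds in general terms. It is not the Iceberg implementation: `FileStats`, `may_contain_greater_than`, and `plan_files` are hypothetical names, and the sketch assumes that lower/upper bounds for the extracted variant field are available for each data file.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileStats:
    """Hypothetical per-file bounds for one extracted variant field."""
    path: str
    lower: Optional[int]  # None means no bound was recorded
    upper: Optional[int]


def may_contain_greater_than(stats: FileStats, literal: int) -> bool:
    """Return False only when the bounds prove no row satisfies `value > literal`."""
    if stats.upper is None:
        return True  # no statistics, so the file must be read
    return stats.upper > literal


def plan_files(files: List[FileStats], literal: int) -> List[str]:
    """Keep only the files that might contain matching rows."""
    return [f.path for f in files if may_contain_greater_than(f, literal)]


files = [
    FileStats("f1.parquet", 1, 10),
    FileStats("f2.parquet", 50, 90),
    FileStats("f3.parquet", None, None),
]
print(plan_files(files, 20))  # ['f2.parquet', 'f3.parquet']
```

The check is deliberately one-sided: a file is dropped only when its bounds rule out every row, and a file without bounds is always kept.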
Changes:
Tests:
The PR depends on:
Related issue:
CI status (expected until #15384)
CI fails because testDescribeExtractExpression expects bracket paths in Spark3Util.describe() output, which requires #15384’s UnboundExtract + PathUtil.toNormalizedPath() on main. Without that API, main still stores/prints dot paths, so the test fails; cancelled matrix jobs are fail-fast side effects. Merge #15384 first, then rebase and re-run CI here.
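
The difference between dot paths and bracket paths can be sketched as follows. This Python function is purely illustrative and is not `PathUtil.toNormalizedPath()`: the name `to_bracket_path` is hypothetical, the `$['a']['b']` output form is an assumption about what "bracket paths" look like, and only simple dot-separated field names are handled.

```python
import re

# Matches one ".name" segment made of letters, digits, and underscores.
_SEGMENT = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)")


def to_bracket_path(dot_path: str) -> str:
    """Convert a simple dot path such as $.a.b into bracket form $['a']['b']."""
    if not dot_path.startswith("$"):
        raise ValueError(f"path must start with '$': {dot_path!r}")
    rest = dot_path[1:]
    parts = ["$"]
    pos = 0
    while pos < len(rest):
        match = _SEGMENT.match(rest, pos)
        if match is None:
            # Anything other than plain ".name" segments is out of scope here.
            raise ValueError(f"unsupported path segment in {dot_path!r}")
        parts.append(f"['{match.group(1)}']")
        pos = match.end()
    return "".join(parts)


print(to_bracket_path("$.event.id"))  # $['event']['id']
```

A test that asserts on the bracket form will fail against a build that still prints `$.event.id`, which matches the failure mode described above.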