Core, Spark: Add JMH benchmarks for Variants by steveloughran · Pull Request #15629 · apache/iceberg

steveloughran · 2026-03-13T21:25:53Z

core:VariantSerializationBenchmark

Separate benchmarks for

serializing a prebuilt object
deserializing

Variables are:

depth: [shallow, nested, deep-nested]
percentage of fields shredded: [0, 33, 67, 100]
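As a rough illustration of the size of that parameter space, here is a minimal plain-Java sketch that enumerates the combinations listed above. The class and method names are hypothetical and are not taken from the PR; in a JMH benchmark these values would normally be declared as `@Param` fields rather than looped over by hand. The mapping from percentage to a field count (integer division) is an assumption made for the example only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: enumerates the benchmark parameter matrix from the description. */
final class VariantParamMatrix {
  // Values copied from the PR description.
  static final String[] DEPTHS = {"shallow", "nested", "deep-nested"};
  static final int[] SHREDDED_PERCENTAGES = {0, 33, 67, 100};

  /** Every (depth, percentage) pair, formatted as "depth/percent". */
  static List<String> combinations() {
    List<String> result = new ArrayList<>();
    for (String depth : DEPTHS) {
      for (int percent : SHREDDED_PERCENTAGES) {
        result.add(depth + "/" + percent);
      }
    }
    return result;
  }

  /** Assumption: the percentage is applied to the field count with integer division. */
  static int shreddedFieldCount(int totalFields, int shreddedPercent) {
    return totalFields * shreddedPercent / 100;
  }

  public static void main(String[] args) {
    for (String combination : combinations()) {
      System.out.println(combination);
    }
    System.out.println("33% of 1000 fields: " + shreddedFieldCount(1000, 33));
  }
}
```

Each serialize and deserialize benchmark would run once per combination, so the matrix grows multiplicatively with every extra variable.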

spark-4.1:IcebergSourceVariantReadBenchmark

Generate Avro, unshredded Parquet and shredded Parquet tables with the same variant data and then compare performance for basic filter and project operations against the normal columns and the variant fields.

Key findings:

Although they have the smallest file size, Parquet files with shredded variants perform significantly worse than unshredded files when working with the variant structs.
Avro is best for the variant data. Because every operation has to read the entire file, operations on other columns are (as expected) slower.
Filtering is the slow operation. Projecting a variant column, shredded or unshredded, is as fast as projecting a normal Parquet column.

I'm not reaching any conclusion about why this is the case. I am looking at improving the performance of reconstructing string fields in parquet-java, as those benchmarks show needless byte-string-byte conversion. For the Iceberg benchmark and the layers below, I think knowing where the issues lie is enough of a change.

Writeup

See https://steveloughran.github.io/benchmarking-variants/ for the writeup and the interactive benchmark results of Iceberg and Parquet benchmarks.

steveloughran · 2026-04-01T10:09:59Z

@rashworld-max still a WiP I'm afraid. Need to know I'm measuring the right thing. Also I can't tell from your profile whether or not you are a human.

rdblue · 2026-04-10T17:10:18Z


-class ParquetVariantUtil {
+@VisibleForTesting
+public final class ParquetVariantUtil {


Is it possible to relocate the tests rather than expose this? We can do this, but generally prefer not to if we can avoid it.

happy to do that...I have done it in the parquet PR

steveloughran · 2026-04-14T20:05:11Z

I think this is ready for review. I've got the initial results and it's good for PRs like #3477 to be able to run before/after benchmarks.

More stuff can go in later; I've outlined it in my report. Equality deletes would be a fun one.

manuzhang · 2026-04-16T15:37:32Z

+   */
+  private long materializeNonEmpty(String operation, Dataset<?> ds) {
+    LOG.info("{} table={}", operation, tableType);
+    final long count = ds.count();


Spark doesn't need to evaluate projection to count records.

I needed something to do the entire compute and count() worked. Otherwise it's evaluate every row and feed to a black hole. What do you prefer?

Spark can count records without evaluating projection so it's not really testing the projection here.

it seemed to work, but I will get and discard each record instead

Rashworld-max

@manuzhang fyi, now using the same sequence as the IcebergSourceBenchmark superclass, with the retention of the count for use in assertions

final long count = ds.queryExecution().toRdd().toJavaRDD().count(); blackhole.consume(count);
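The point of this thread, that a bare count can be answered without running the projection, can be illustrated with a plain-Java analogy. This is only a sketch: it uses no Spark or JMH classes, every name in it is hypothetical, and the volatile `sink` field merely stands in for a JMH `Blackhole`. Since Java 9 the `Stream.count()` documentation allows an implementation to skip the pipeline when the size is known from the source, which is a similar kind of shortcut to the one described for Spark above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Plain-Java analogy only; no Spark or JMH classes are used here. */
final class ForcedEvaluationSketch {
  /** Number of times the stand-in projection has run. */
  static final AtomicLong PROJECTIONS = new AtomicLong();

  /** Stand-in for a JMH Blackhole: a volatile write that is not optimized away. */
  static volatile Object sink;

  /** Hypothetical stand-in for an expensive projection such as variant_get. */
  static String project(int row) {
    PROJECTIONS.incrementAndGet();
    return "row-" + row;
  }

  /** Counts through the pipeline; the runtime is permitted to skip project(). */
  static long countOnly(List<Integer> rows) {
    return rows.stream().map(ForcedEvaluationSketch::project).count();
  }

  /** Materializes every row and feeds it to the sink, so project() must run. */
  static long consumeAll(List<Integer> rows) {
    long count = 0;
    for (int row : rows) {
      sink = project(row);
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<Integer> rows = Arrays.asList(1, 2, 3, 4);
    long counted = countOnly(rows);
    long afterCount = PROJECTIONS.get();
    long consumed = consumeAll(rows);
    System.out.println("count(): rows=" + counted + " projections=" + afterCount);
    System.out.println("consume: rows=" + consumed
        + " projections=" + (PROJECTIONS.get() - afterCount));
  }
}
```

How many projections the `count()` path runs depends on the runtime, which is exactly why it is a poor basis for a benchmark; the consuming path always evaluates every row and still yields a count for assertions.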

manuzhang · 2026-04-16T15:40:29Z

+      "variant_get(nested, '$.varcategory', 'int')";
+
+  /** Get the ID field from inside the variant: {@value}. */
+  private static final String VARIANT_GET_NESTED_ID = "variant_get(nested, '$.varid', 'int')";


should this be int64 as in the comments above?

will review

steveloughran · 2026-05-08T17:11:28Z

switching to draft again as I'm reworking the benchmark to show variant rowgroup filtering of shredded variants works (#15510), with changes including

move to single thread spark worker (less variance in results)
going to a single large file with multiple rowgroups
exploring the difference between variant_get(...) is 5 and variant_get(...) in (5) to see if spark is treating them differently.

Needs an extra PR in iceberg from qlong and a snapshot of spark 4.1 with his changes for spark to pass variant_get down

So, surprisingly complex. If I can show the chain works then it's time to start with the feature merges in spark and then here.

This branch can merge without a direct dependency; it's just that a key goal of the spark benchmark is to "show pushdown working". It's not ready to merge unless it can do that.

Fixes apache#15628

Core: benchmark of variant creation and ser/deser costs.

Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
- fields: [1000, 10000]
- depth: [shallow, nested]
- percentage of fields shredded: [0, 33, 67, 100]

Note: the current benchmark does NOT fork the JVM, as it allows for fast iterative development. A final merge should switch to fork(1).

Spark: Full test of predicate pushdown of variants
- avro
- parquet unshredded
- parquet shredded

For this to return useful numbers, requires PRs for
- Passing down variant_get between spark and iceberg
- ParquetRowGroupFilter to filter on shredded variants.

contains Add some more benchmarks

Change-Id: I4231280f08cf63db5960ecb79301ae9458b35272

* file skipping can be observed as out of range equals/element tests are skipped completely
* varcat filters still very slow

Changes
- blackhole consumption of rows
- wiring up to ParquetMetricsRowGroupFilter.resetShreddedMetricsCounter() shows the filtering is happening on shredded files
- cutting back on category count and attempting to change structure of file

Note: this is the benchmark branch, and has had the assertions on counters and counter reset cut; all the other wiring up for the assertions is present.

Change-Id: I686be12d51b13d2048b631b8cf198651012cc474

- increasing category count reduces # of matches on the single category, so amplifying shredding filtering advantage
- and varcat select with a range > 0 and < 1. That's the same as the = 1 and `in (1)` selections, but with two scans of the values.

Change-Id: Ib2b139697e235cb4674503784c6c909a5c460d1a
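For readers unfamiliar with row-group skipping, the following is a minimal sketch of the general technique the commits above rely on: compare a predicate against per-row-group min/max statistics and count the groups that can be skipped. It is not the `ParquetMetricsRowGroupFilter` implementation; the class, its methods and the `Stats` type are all hypothetical, and real statistics handling (nulls, missing stats, typed values) is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of min/max row-group skipping with a skip counter. */
final class RowGroupSkipSketch {
  /** Min/max statistics for one column in one row group. */
  static final class Stats {
    final long min;
    final long max;

    Stats(long min, long max) {
      this.min = min;
      this.max = max;
    }
  }

  private long skipped;

  /** Returns indexes of row groups that may contain {@code value}. */
  List<Integer> selectEquals(List<Stats> rowGroups, long value) {
    List<Integer> selected = new ArrayList<>();
    for (int i = 0; i < rowGroups.size(); i++) {
      Stats s = rowGroups.get(i);
      if (value < s.min || value > s.max) {
        skipped++; // value cannot be in this row group
      } else {
        selected.add(i);
      }
    }
    return selected;
  }

  /** Same idea for lower < x < upper: one comparison per bound. */
  List<Integer> selectOpenRange(List<Stats> rowGroups, long lower, long upper) {
    List<Integer> selected = new ArrayList<>();
    for (int i = 0; i < rowGroups.size(); i++) {
      Stats s = rowGroups.get(i);
      if (s.max <= lower || s.min >= upper) {
        skipped++; // no value in this row group can fall inside the range
      } else {
        selected.add(i);
      }
    }
    return selected;
  }

  long skippedCount() {
    return skipped;
  }

  void resetSkippedCount() {
    skipped = 0;
  }

  public static void main(String[] args) {
    List<Stats> groups = Arrays.asList(new Stats(0, 9), new Stats(10, 19), new Stats(20, 29));
    RowGroupSkipSketch filter = new RowGroupSkipSketch();
    System.out.println("selected=" + filter.selectEquals(groups, 15)
        + " skipped=" + filter.skippedCount());
  }
}
```

A counter like this lets a test or benchmark assert that skipping happened even when the timing difference is too small to see. The more distinct value ranges the row groups cover, the more groups a selective predicate can rule out.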

github-actions Bot added the core label Mar 13, 2026

steveloughran changed the title ~~Add JMH benchmarks for Variants~~ Core: Add JMH benchmarks for Variants Mar 13, 2026

steveloughran force-pushed the pr/benchmark-variant branch from 7c4f806 to 2be00b9 Compare March 13, 2026 21:51

steveloughran marked this pull request as draft March 16, 2026 16:52

steveloughran mentioned this pull request Mar 16, 2026

Improve benchmark docs page coverage and formatting #15623

Open

github-actions Bot added spark parquet labels Mar 20, 2026

steveloughran closed this Mar 23, 2026

steveloughran reopened this Mar 24, 2026

steveloughran changed the title ~~Core: Add JMH benchmarks for Variants~~ Core, Spark: Add JMH benchmarks for Variants Mar 24, 2026

This was referenced Mar 24, 2026

Spark: Support writing shredded variant in Iceberg-Spark #14297

Merged

Variant Data Type Support #10392

Open

GH-3451. Add a JMH benchmark for variants apache/parquet-java#3452

Merged

rashworld-max approved these changes Mar 31, 2026

steveloughran force-pushed the pr/benchmark-variant branch from 70c69f8 to 25c6f29 Compare April 10, 2026 15:23

rdblue reviewed Apr 10, 2026

steveloughran marked this pull request as ready for review April 14, 2026 20:03

manuzhang reviewed Apr 27, 2026

rashworld-max approved these changes Apr 29, 2026

github-actions Bot added the build label Apr 30, 2026

steveloughran marked this pull request as draft May 8, 2026 17:02

steveloughran mentioned this pull request May 13, 2026

Spark: Add compaction only benchmark - rewrite data files #16219

Merged

steveloughran force-pushed the pr/benchmark-variant branch 2 times, most recently from 20d7bda to 2690e5c Compare May 18, 2026 17:41

steveloughran force-pushed the pr/benchmark-variant branch from 53dc868 to 34f347c Compare May 18, 2026 17:46

steveloughran added 3 commits May 21, 2026 19:32

ParquetMetricsRowGroupFilter enhancement: now counting RGs skipped.

0a2d28d

Allows for assertions in tests and in benchmarks that rowgroup skipping is taking place. Needed as there's not much tangible speedup, yet.

Change-Id: I8c03eb33d2d3d8a2139c347e6a72a7284e627f62

Benchmark dumps rowgroup info of shredded columns

8e84230

...which shows the configuration changes needed for data to be saved to multiple rowgroups. File size shrunk; increasing iterations of runs.

Change-Id: Icb622958a068ed67de5bb895d88a9aa1713d2b11


4 participants
