Core, Spark: Add JMH benchmarks for Variants #15629

Draft
steveloughran wants to merge 5 commits into apache:main from steveloughran:pr/benchmark-variant

@steveloughran steveloughran commented Mar 13, 2026

Fixes #15628

core:VariantSerializationBenchmark

Separate benchmarks for

  • serializing a prebuilt object
  • deserializing

Variables are:

  • depth: [shallow, nested, deep-nested]
  • percentage of fields shredded [0, 33, 67, 100]
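
As a rough illustration of the parameter space above, the sketch below enumerates every depth and shredding-percentage combination. It assumes the two variables are crossed in full, which is how JMH treats multiple `@Param` fields. The class and method names are hypothetical and not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: enumerates the benchmark parameter combinations listed above. */
public class VariantBenchmarkMatrix {
  // Values copied from the PR description.
  public static final List<String> DEPTHS = List.of("shallow", "nested", "deep-nested");
  public static final List<Integer> SHREDDED_PERCENTAGES = List.of(0, 33, 67, 100);

  /** Returns one "depth/percentage" label per combination, assuming a full cross product. */
  public static List<String> combinations() {
    List<String> result = new ArrayList<>();
    for (String depth : DEPTHS) {
      for (int pct : SHREDDED_PERCENTAGES) {
        result.add(depth + "/" + pct + "%");
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Print each combination on its own line.
    combinations().forEach(System.out::println);
  }
}
```

Under that assumption, each serialize or deserialize benchmark would run once per combination, twelve times in total with these values.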

spark-4.1:IcebergSourceVariantReadBenchmark

Generate Avro, unshredded Parquet, and shredded Parquet tables with the same variant data, then compare performance for basic filter and project operations against the normal columns and the variant fields.

Key findings:

  • Although they have the smallest file size, Parquet files with shredded variants perform significantly worse than unshredded ones when working with the variant structs.
  • Avro is best for the variant data. However, because every operation has to read the entire file, operations on other columns are (as expected) slower.
  • Filtering is the slow operation. Projecting a variant column, shredded or unshredded, is as fast as projecting a normal Parquet column.

I'm not reaching any conclusion as to why this is the case. I am looking at improving the performance of reconstructing string fields in parquet-java, as those benchmarks show needless byte-string-byte conversion. For the Iceberg benchmark and the layers below, I think knowing where the issues lie is enough of a change.

Writeup

See https://steveloughran.github.io/benchmarking-variants/ for the writeup and the interactive benchmark results of Iceberg and Parquet benchmarks.

@github-actions github-actions Bot added the core label Mar 13, 2026
@steveloughran steveloughran changed the title from "Add JMH benchmarks for Variants" to "Core: Add JMH benchmarks for Variants" Mar 13, 2026
@steveloughran steveloughran marked this pull request as draft March 16, 2026 16:52
@steveloughran steveloughran reopened this Mar 24, 2026
@steveloughran steveloughran changed the title from "Core: Add JMH benchmarks for Variants" to "Core, Spark: Add JMH benchmarks for Variants" Mar 24, 2026
steveloughran commented Apr 1, 2026

@rashworld-max still a WiP I'm afraid. Need to know I'm measuring the right thing. Also I can't tell from your profile whether or not you are a human.


- class ParquetVariantUtil {
+ @VisibleForTesting
+ public final class ParquetVariantUtil {
Contributor

Is it possible to relocate the tests rather than expose this? We can do this, but generally prefer not to if we can avoid it.

Contributor Author

@steveloughran steveloughran Apr 12, 2026

Happy to do that; I have done it in the parquet PR.

Hi

Ho

@steveloughran steveloughran marked this pull request as ready for review April 14, 2026 20:03
@steveloughran
Contributor Author

I think this is ready for review. I've got the initial results, and it's good for PRs like #3477 to be able to run before/after benchmarks.

More stuff can go in later; I've outlined them in my report. Equality deletes would be a fun one

*/
private long materializeNonEmpty(String operation, Dataset<?> ds) {
LOG.info("{} table={}", operation, tableType);
final long count = ds.count();
Member

Spark doesn't need to evaluate projection to count records.

Contributor Author

I needed something to force the entire computation, and count() worked. The alternative is to evaluate every row and feed it to a black hole. Which do you prefer?

Member

Spark can count records without evaluating projection so it's not really testing the projection here.

Contributor Author

it seemed to work, but I will get and discard each record instead

Rashworld-max

Contributor Author

@manuzhang fyi, now using the same sequence as the IcebergSourceBenchmark superclass, with the retention of the count for use in assertions

    final long count = ds.queryExecution().toRdd().toJavaRDD().count();
    blackhole.consume(count);
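
The concern in this thread has a close analogue in plain Java, which may help when reasoning about it. The sketch below is not Spark code: it uses `java.util.stream` as a stand-in, and the class and method names are hypothetical. Since Java 9, `Stream.count()` is permitted to skip the pipeline when the size can be computed from the source, so a `map` step standing in for a projection may never run, whereas consuming every element forces it.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical stand-in for the count-versus-consume question; not Spark code. */
public class ProjectionCountSketch {
  /** Counts how often the stand-in "projection" function was invoked. */
  public static final AtomicLong PROJECTIONS = new AtomicLong();

  public static String project(int row) {
    PROJECTIONS.incrementAndGet();
    return "row-" + row;
  }

  /** count() only: the JDK may answer from the source size without calling project(). */
  public static long countOnly(List<Integer> rows) {
    return rows.stream().map(ProjectionCountSketch::project).count();
  }

  /** Feeds every projected row to a sink, so project() must run once per row. */
  public static long consumeAll(List<Integer> rows, Consumer<String> sink) {
    AtomicLong consumed = new AtomicLong();
    rows.stream()
        .map(ProjectionCountSketch::project)
        .forEach(r -> {
          sink.accept(r);
          consumed.incrementAndGet();
        });
    return consumed.get();
  }

  public static void main(String[] args) {
    List<Integer> rows = List.of(1, 2, 3, 4);
    PROJECTIONS.set(0);
    long counted = countOnly(rows);
    System.out.println("count=" + counted + " projections=" + PROJECTIONS.get());
    PROJECTIONS.set(0);
    long consumed = consumeAll(rows, r -> { });
    System.out.println("consumed=" + consumed + " projections=" + PROJECTIONS.get());
  }
}
```

Both paths report the same row count, but only the consuming path guarantees the projection ran for every row, so it is the safer shape for a benchmark that claims to measure projection cost.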

"variant_get(nested, '$.varcategory', 'int')";

/** Get the ID field from inside the variant: {@value}. */
private static final String VARIANT_GET_NESTED_ID = "variant_get(nested, '$.varid', 'int')";
Member

should this be int64 as in the comments above?

Contributor Author

will review

@github-actions github-actions Bot added the build label Apr 30, 2026
@steveloughran steveloughran marked this pull request as draft May 8, 2026 17:02
@steveloughran
Contributor Author

switching to draft again as I'm reworking the benchmark to show variant rowgroup filtering of shredded variants works (#15510), with changes including

  • move to single thread spark worker (less variance in results)
  • going to a single large file with multiple rowgroups
  • exploring the difference between variant_get(...) is 5 and variant_get(...) in (5) to see if spark is treating them differently.

Needs an extra PR in Iceberg from qlong and a snapshot of Spark 4.1 with his changes for Spark to pass variant_get down.

So, surprisingly complex. If I can show the chain works then it's time to start with the feature merges in spark and then here.

This branch will merge without a direct dependency; it's just that a key goal of the Spark benchmark is to "show pushdown working". It's not ready to merge unless it can do that.

Fixes apache#15628

Core: benchmark of variant creation and ser/deser costs.

Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
 - fields: [1000, 10000]
 - depth: [shallow, nested]
 - percentage of fields shredded [0, 33, 67, 100]

Note: the current benchmark does NOT fork the JVM, as that allows for fast iterative development.
A final merge should switch to fork(1).

Spark:

Full test of predicate pushdown of variants
- avro
- parquet unshredded
- parquet shredded

For this to return useful numbers, requires PRs for
- Passing down variant_get between spark and iceberg
- ParquetRowGroupFilter to filter on shredded variants.

contains

Add some more benchmarks

Change-Id: I4231280f08cf63db5960ecb79301ae9458b35272
@steveloughran steveloughran force-pushed the pr/benchmark-variant branch from 53dc868 to 34f347c Compare May 18, 2026 17:46
* file skipping can be observed as out of range equals/element tests are skipped completely
* varcat filters still very slow
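
For readers unfamiliar with why out-of-range tests can be skipped entirely, here is a minimal sketch of min/max row-group pruning for an equality predicate. It is a generic illustration under stated assumptions, not the Iceberg or Parquet implementation: the `RowGroupStats` class, the `mightContain` check and the `skipped` counter are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of min/max row-group pruning; not the Iceberg implementation. */
public class RowGroupSkipSketch {
  /** Hypothetical per-row-group statistics for one integer column. */
  public static final class RowGroupStats {
    final int min;
    final int max;

    public RowGroupStats(int min, int max) {
      this.min = min;
      this.max = max;
    }
  }

  /** Number of row groups skipped by the last call to selectForEquals(). */
  public static int skipped;

  /** A row group can only hold value if it lies within [min, max]. */
  public static boolean mightContain(RowGroupStats stats, int value) {
    return value >= stats.min && value <= stats.max;
  }

  /** Returns the indexes of row groups that must still be read for "column = value". */
  public static List<Integer> selectForEquals(List<RowGroupStats> groups, int value) {
    skipped = 0;
    List<Integer> toRead = new ArrayList<>();
    for (int i = 0; i < groups.size(); i++) {
      if (mightContain(groups.get(i), value)) {
        toRead.add(i);
      } else {
        skipped++;
      }
    }
    return toRead;
  }

  public static void main(String[] args) {
    List<RowGroupStats> groups =
        List.of(new RowGroupStats(0, 9), new RowGroupStats(10, 19), new RowGroupStats(20, 29));
    System.out.println(selectForEquals(groups, 15) + " skipped=" + skipped);
    System.out.println(selectForEquals(groups, 99) + " skipped=" + skipped);
  }
}
```

A value outside every range leaves nothing to read, which is consistent with the "skipped completely" observation. A counter like `skipped` is also the kind of signal an assertion could check to confirm that pruning took place.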

Changes
- blackhole consumption of rows
- wiring up to ParquetMetricsRowGroupFilter.resetShreddedMetricsCounter() shows the filtering is
  happening on shredded files
- cutting back on category count and attempting to change structure of file

Note: this is the benchmark branch, and has had the assertions on counters and counter reset
cut; all the other wiring up for the assertions is present.

Change-Id: I686be12d51b13d2048b631b8cf198651012cc474
Allows for assertions in tests and in benchmarks that rowgroup skipping is taking place.

Needed as there's not much tangible speedup, yet

Change-Id: I8c03eb33d2d3d8a2139c347e6a72a7284e627f62
...which shows the configuration changes needed for data to be saved to multiple rowgroups.

File size shrunk; increasing iterations of runs.

Change-Id: Icb622958a068ed67de5bb895d88a9aa1713d2b11
- increasing category count reduces # of matches on the single category, so amplifying shredding filtering advantage
- and varcat select with a range > 0 and < 1. That's the same as the = 1 and `in (1)` selections, but with two scans of the values.

Change-Id: Ib2b139697e235cb4674503784c6c909a5c460d1a