Core, Spark: Add JMH benchmarks for Variants #15629
@rashworld-max still a WiP I'm afraid. I need to know I'm measuring the right thing. Also, I can't tell from your profile whether or not you are a human.
class ParquetVariantUtil {
@VisibleForTesting
public final class ParquetVariantUtil {
Is it possible to relocate the tests rather than expose this? We can do this, but generally prefer not to if we can avoid it.
happy to do that...I have done it in the parquet PR
I think this is ready for review. I've got the initial results, and it's good for PRs like #3477 to be able to run before/after benchmarks. More can go in later; I've outlined the candidates in my report. Equality deletes would be a fun one.
*/
private long materializeNonEmpty(String operation, Dataset<?> ds) {
  LOG.info("{} table={}", operation, tableType);
  final long count = ds.count();
Spark doesn't need to evaluate projection to count records.
I needed something to force the entire computation, and count() worked. The alternative is to evaluate every row and feed it to a black hole. What do you prefer?
Spark can count records without evaluating projection so it's not really testing the projection here.
it seemed to work, but I will get and discard each record instead
@manuzhang fyi, now using the same sequence as the IcebergSourceBenchmark superclass, with the retention of the count for use in assertions
final long count = ds.queryExecution().toRdd().toJavaRDD().count();
blackhole.consume(count);
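The point raised in this thread, that a lazy engine can answer a count without running the projection, can be illustrated outside Spark. The sketch below is only an analogy using plain JDK streams, not Spark or Iceberg code: `LazyCountSketch`, `project` and the `long[]` sink are hypothetical names, and the sink merely stands in for a JMH `Blackhole`. The `Stream.count()` documentation notes that an implementation may skip the pipeline when the size is computable from the source, so the number of projections seen during `count()` depends on the JDK in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Plain-JDK analogy; none of these names come from Spark or Iceberg. */
final class LazyCountSketch {
  /** Number of times the stand-in "projection" ran. */
  static final AtomicLong PROJECTIONS = new AtomicLong();

  /** Hypothetical per-row projection, standing in for an expression such as variant_get. */
  static String project(int row) {
    PROJECTIONS.incrementAndGet();
    return "row-" + row;
  }

  /** Builds n stand-in rows. */
  static List<Integer> rows(int n) {
    List<Integer> rows = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      rows.add(i);
    }
    return rows;
  }

  /** Returns how many projections ran when only a count was requested. */
  static long projectionsDuringCount(int n) {
    PROJECTIONS.set(0);
    // count() is allowed to skip map() entirely when the source size is known.
    rows(n).stream().map(LazyCountSketch::project).count();
    return PROJECTIONS.get();
  }

  /** Returns how many projections ran when every projected row was consumed. */
  static long projectionsWhenConsumed(int n) {
    PROJECTIONS.set(0);
    long[] sink = new long[1]; // stand-in for a JMH Blackhole
    rows(n).stream().map(LazyCountSketch::project).forEach(s -> sink[0] += s.length());
    return PROJECTIONS.get();
  }

  public static void main(String[] args) {
    System.out.println("count():  " + projectionsDuringCount(1000) + " projections");
    System.out.println("consumed: " + projectionsWhenConsumed(1000) + " projections");
  }
}
```

Consuming each row guarantees one projection per row, which is the property a projection benchmark needs; a bare count gives no such guarantee.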
"variant_get(nested, '$.varcategory', 'int')";

/** Get the ID field from inside the variant: {@value}. */
private static final String VARIANT_GET_NESTED_ID = "variant_get(nested, '$.varid', 'int')";
should this be int64 as in the comments above?
Switching to draft again as I'm reworking the benchmark to show that variant rowgroup filtering of shredded variants works (#15510), with changes including
Needs an extra PR in Iceberg from qlong and a snapshot of Spark 4.1 with his changes for Spark to pass variant_get down. So, surprisingly complex. If I can show the chain works, then it's time to start with the feature merges in Spark and then here. This branch will merge without a direct dependency; it's just that a key goal of the Spark benchmark is to "show pushdown working". It's not ready to merge unless it can do that.
Fixes apache#15628

Core: benchmark of variant creation and ser/deser costs.

Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing

Variables are:
- fields: [1000, 10000]
- depth: [shallow, nested]
- percentage of fields shredded: [0, 33, 67, 100]

Note: the current benchmark does NOT fork the JVM, as that allows for fast iterative development. A final merge should switch to fork(1).

Spark: Full test of predicate pushdown of variants
- avro
- parquet unshredded
- parquet shredded

For this to return useful numbers, requires PRs for
- Passing down variant_get between spark and iceberg
- ParquetRowGroupFilter to filter on shredded variants.

contains Add some more benchmarks

Change-Id: I4231280f08cf63db5960ecb79301ae9458b35272
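As a rough illustration of how the variables in this commit message multiply out, the sketch below enumerates the combinations of field count, depth and shredded percentage. It is plain Java rather than JMH, `VariantBenchmarkGridSketch` is a hypothetical name, and the way a percentage is turned into a number of shredded fields (integer truncation) is an assumption, not something the PR states.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical enumeration of the benchmark variables; not the PR's JMH code. */
final class VariantBenchmarkGridSketch {
  static final int[] FIELD_COUNTS = {1000, 10000};
  static final String[] DEPTHS = {"shallow", "nested"};
  static final int[] SHREDDED_PERCENTAGES = {0, 33, 67, 100};

  /** Assumption: the percentage is applied to the field count with integer truncation. */
  static int shreddedFields(int fields, int percent) {
    return fields * percent / 100;
  }

  /** Lists every combination of the three variables as a readable label. */
  static List<String> combinations() {
    List<String> result = new ArrayList<>();
    for (int fields : FIELD_COUNTS) {
      for (String depth : DEPTHS) {
        for (int percent : SHREDDED_PERCENTAGES) {
          result.add(String.format(
              "fields=%d depth=%s shredded=%d%% (%d fields)",
              fields, depth, percent, shreddedFields(fields, percent)));
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    combinations().forEach(System.out::println);
  }
}
```

Each of the separate benchmarks (building, serializing, deserializing) would run against every entry in this grid, which is why the field counts and percentages are kept to a handful of values.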
* file skipping can be observed, as out-of-range equals/element tests are skipped completely
* varcat filters still very slow

Changes
- blackhole consumption of rows
- wiring up to ParquetMetricsRowGroupFilter.resetShreddedMetricsCounter() shows the filtering is happening on shredded files
- cutting back on category count and attempting to change structure of file

Note: this is the benchmark branch, and has had the assertions on counters and counter reset cut; all the other wiring up for the assertions is present.

Change-Id: I686be12d51b13d2048b631b8cf198651012cc474

Allows for assertions in tests and in benchmarks that rowgroup skipping is taking place. Needed as there's not much tangible speedup, yet.

Change-Id: I8c03eb33d2d3d8a2139c347e6a72a7284e627f62

...which shows the configuration changes needed for data to be saved to multiple rowgroups. File size shrunk; increasing iterations of runs.

Change-Id: Icb622958a068ed67de5bb895d88a9aa1713d2b11

- increasing category count reduces # of matches on the single category, so amplifying shredding filtering advantage
- and varcat select with a range > 0 and < 1. That's the same as the = 1 and `in (1)` selections, but with two scans of the values.

Change-Id: Ib2b139697e235cb4674503784c6c909a5c460d1a
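The commit notes above rely on a counter to prove that row groups are skipped even when the speedup is small. A minimal sketch of that idea, assuming simple min/max statistics per row group, might look like the following. Everything here (`RowGroupSkipSketch`, `RowGroup`, `skippedRowGroups`, `resetCounter`) is hypothetical and is not the Iceberg or Parquet implementation; it only shows why an out-of-range equality test can skip every group and how a counter makes that observable.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical min/max row-group skipping with an assertable counter. */
final class RowGroupSkipSketch {
  /** Stand-in row group: min/max statistics plus the values they summarize. */
  static final class RowGroup {
    final int min;
    final int max;
    final int[] values;

    RowGroup(int... values) {
      this.values = values;
      this.min = Arrays.stream(values).min().orElse(Integer.MAX_VALUE);
      this.max = Arrays.stream(values).max().orElse(Integer.MIN_VALUE);
    }
  }

  /** Counter a test or benchmark could assert on; reset before each query. */
  static int skippedRowGroups = 0;

  static void resetCounter() {
    skippedRowGroups = 0;
  }

  /** Counts rows equal to target, skipping groups whose statistics rule it out. */
  static int countEquals(List<RowGroup> groups, int target) {
    int matches = 0;
    for (RowGroup group : groups) {
      if (target < group.min || target > group.max) {
        skippedRowGroups++; // statistics prove no row in this group can match
        continue;
      }
      for (int value : group.values) {
        if (value == target) {
          matches++;
        }
      }
    }
    return matches;
  }

  /** Three non-overlapping groups, so at most one can hold any given value. */
  static List<RowGroup> sampleGroups() {
    return Arrays.asList(
        new RowGroup(0, 1, 2), new RowGroup(10, 11, 12), new RowGroup(20, 21, 22));
  }

  public static void main(String[] args) {
    resetCounter();
    int matches = countEquals(sampleGroups(), 11);
    System.out.println("matches=" + matches + " skipped=" + skippedRowGroups);
  }
}
```

With non-overlapping groups, an in-range lookup scans one group and skips the rest, while an out-of-range lookup skips all of them; asserting on the counter checks that behaviour directly instead of inferring it from timings.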
Fixes #15628
core: VariantSerializationBenchmark
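Separate benchmarks for
* building
* serializing a prebuilt object
* deserializing
Variables are:
- fields: [1000, 10000]
- depth: [shallow, nested]
- percentage of fields shredded: [0, 33, 67, 100]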
spark-4.1: IcebergSourceVariantReadBenchmark
Generate Avro, unshredded Parquet and shredded Parquet tables with the same variant data, then compare performance for basic filter and project operations against the normal columns and the variant fields.
Key findings:
I'm not reaching any conclusion about why this is the case. I am looking at improving the performance of reconstructing string fields in parquet-java, as those benchmarks show needless byte-string-byte conversion. For the Iceberg benchmark and the layers below, I think knowing where the issues lie is enough of a change.
Writeup
See https://steveloughran.github.io/benchmarking-variants/ for the writeup and the interactive benchmark results of Iceberg and Parquet benchmarks.