Flink: Support writing shredded variant in Flink by Guosmilesmile · Pull Request #15596 · apache/iceberg

Guosmilesmile · 2026-03-12T08:09:38Z

This PR adds Flink support for writing shredded variant data to Iceberg tables. It is based on #14297 and will be kept in sync with it.

Guosmilesmile · 2026-05-07T08:36:30Z

Hi @aihuaxu @nssalian @pvary @mxm. Now that the Spark part has been merged, I have adjusted the Flink part accordingly. Please help review it when you have time.

Thanks!
GuoYu.

pvary · 2026-05-08T13:49:46Z

+        .tableProperty(TableProperties.PARQUET_SHRED_VARIANTS)
+        .defaultValue(TableProperties.PARQUET_SHRED_VARIANTS_DEFAULT)


How will we handle this when ORC supports shredding variants?

Good catch. I renamed shred-variants to parquet-shred-variants to clarify that this feature only supports Parquet. If ORC supports it later, we can add another config.

Let's do Parquet for now since we followed that pattern for the Spark implementation.

pvary · 2026-05-08T14:07:37Z

-                FlinkParquetReaders.buildReader(icebergSchema, fileSchema, idToConstant)));
+                FlinkParquetReaders.buildReader(icebergSchema, fileSchema, idToConstant),
+            new FlinkVariantShreddingAnalyzer(),
+            (row, rowType) -> new RowDataSerializer(rowType).copy(row)));


Isn't it costly to recreate this every time we copy a row?

It does increase the cost, but without copying, buffered data can be corrupted. We ran into this during early development, and the unit tests can reproduce it.
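To illustrate why buffering needs a copy, here is a minimal sketch that assumes the upstream hands the writer the same mutable row instance for every record. `MutableRow` and the method names are hypothetical stand-ins, not Flink or Iceberg classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: shows how buffering a reused mutable object loses earlier values. */
class BufferCopySketch {
  /** Hypothetical stand-in for a row object that the engine mutates and reuses. */
  static final class MutableRow {
    int value;

    MutableRow copy() {
      MutableRow copied = new MutableRow();
      copied.value = value;
      return copied;
    }
  }

  /** Buffers references to the reused instance, so every entry sees the last write. */
  static List<Integer> bufferWithoutCopy(int rows) {
    MutableRow reused = new MutableRow();
    List<MutableRow> buffer = new ArrayList<>();
    for (int i = 0; i < rows; i++) {
      reused.value = i;
      buffer.add(reused);
    }
    return values(buffer);
  }

  /** Buffers a copy per record, so each entry keeps its own value. */
  static List<Integer> bufferWithCopy(int rows) {
    MutableRow reused = new MutableRow();
    List<MutableRow> buffer = new ArrayList<>();
    for (int i = 0; i < rows; i++) {
      reused.value = i;
      buffer.add(reused.copy());
    }
    return values(buffer);
  }

  private static List<Integer> values(List<MutableRow> buffer) {
    List<Integer> out = new ArrayList<>();
    for (MutableRow row : buffer) {
      out.add(row.value);
    }
    return out;
  }
}
```

With three records, the version without copying ends up holding the last value three times, while the copying version keeps 0, 1, 2.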

Can we reuse the RowDataSerializer?

With the current BiFunction, (row, rowType) -> new RowDataSerializer(rowType).copy(row) creates a new RowDataSerializer for every buffered row (default buffer = 100). This construction is not free: it walks rowType.getChildren() and builds a TypeSerializer[] via InternalSerializers.create, a BinaryRowDataSerializer, and a RowData.FieldGetter[]. Since the engine schema is fixed for the entire file, a factory lets us build the serializer once per schema and reuse it for every incoming record.

Yes, we can use Function<S, UnaryOperator<D>> instead of BiFunction<D, S, D> to implement this.

+1. We should be able to reuse RowDataSerializer so we don't need to create a new instance for every row.
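The difference between the two shapes discussed above can be sketched without any Flink dependency. `FakeSerializer` below is a hypothetical stand-in for an expensive, schema-bound serializer such as RowDataSerializer, and plain strings stand in for the row and schema types; the counter exists only to make the construction cost visible.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Sketch only: per-row construction versus a per-schema factory. */
class CopyFuncFactorySketch {
  /** Counts how many times the stand-in serializer is constructed. */
  static final AtomicInteger CONSTRUCTED = new AtomicInteger();

  /** Hypothetical stand-in for an expensive serializer bound to one schema. */
  static final class FakeSerializer {
    private final String schema;

    FakeSerializer(String schema) {
      this.schema = schema;
      CONSTRUCTED.incrementAndGet();
    }

    String copy(String row) {
      return new String(row); // returns a fresh object, like a defensive row copy
    }
  }

  /** BiFunction<D, S, D> shape: a serializer is built for every row. */
  static final BiFunction<String, String, String> PER_ROW =
      (row, schema) -> new FakeSerializer(schema).copy(row);

  /** Function<S, UnaryOperator<D>> shape: one serializer per schema, reused per row. */
  static final Function<String, UnaryOperator<String>> FACTORY =
      schema -> {
        FakeSerializer serializer = new FakeSerializer(schema);
        return serializer::copy;
      };

  static int constructionsPerRow(int rows) {
    CONSTRUCTED.set(0);
    for (int i = 0; i < rows; i++) {
      PER_ROW.apply("row-" + i, "schema");
    }
    return CONSTRUCTED.get();
  }

  static int constructionsWithFactory(int rows) {
    CONSTRUCTED.set(0);
    UnaryOperator<String> copy = FACTORY.apply("schema"); // built once per file/schema
    for (int i = 0; i < rows; i++) {
      copy.apply("row-" + i);
    }
    return CONSTRUCTED.get();
  }
}
```

For a buffer of 100 rows, the BiFunction shape constructs the stand-in 100 times and the factory shape constructs it once.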


Guosmilesmile · 2026-05-20T02:49:05Z

@pvary @talatuyarer @nssalian @aihuaxu Hey all, I rebased on main but ran into some CI failures. It looks like a recently added check does not allow changing the parameter types of ParquetFormatModel methods directly.

As a workaround, I added a new method createWithCopyFuncFactory in ParquetFormatModel. The original create method now delegates to it, so the Spark code stays untouched, while FlinkFormatModels calls createWithCopyFuncFactory explicitly.

Would really appreciate it if you could help take another look at these changes. Thanks a lot!

java.method.parameterTypeChanged: The type of the parameter changed from 'java.util.function.UnaryOperator<D extends java.lang.Object>' to 'java.util.function.Function<S extends java.lang.Object, java.util.function.UnaryOperator<D extends java.lang.Object>>'.

old: parameter <D, S> org.apache.iceberg.parquet.ParquetFormatModel<D, S, org.apache.iceberg.parquet.ParquetValueReader<?>> org.apache.iceberg.parquet.ParquetFormatModel<D, S, R>::create(java.lang.Class<D>, java.lang.Class<S>, org.apache.iceberg.formats.BaseFormatModel.WriterFunction<org.apache.iceberg.parquet.ParquetValueWriter<?>, S, org.apache.parquet.schema.MessageType>, org.apache.iceberg.formats.BaseFormatModel.ReaderFunction<org.apache.iceberg.parquet.ParquetValueReader<?>, S, org.apache.parquet.schema.MessageType>, org.apache.iceberg.parquet.VariantShreddingAnalyzer<D, S>, ===java.util.function.UnaryOperator<D>===)
new: parameter <D, S> org.apache.iceberg.parquet.ParquetFormatModel<D, S, org.apache.iceberg.parquet.ParquetValueReader<?>> org.apache.iceberg.parquet.ParquetFormatModel<D, S, R>::create(java.lang.Class<D>, java.lang.Class<S>, org.apache.iceberg.formats.BaseFormatModel.WriterFunction<org.apache.iceberg.parquet.ParquetValueWriter<?>, S, org.apache.parquet.schema.MessageType>, org.apache.iceberg.formats.BaseFormatModel.ReaderFunction<org.apache.iceberg.parquet.ParquetValueReader<?>, S, org.apache.parquet.schema.MessageType>, org.apache.iceberg.parquet.VariantShreddingAnalyzer<D, S>, ===java.util.function.Function<S, java.util.function.UnaryOperator<D>>===)

https://github.com/apache/iceberg/actions/runs/26136650474/job/76873207059?pr=15596

Guosmilesmile · 2026-05-20T08:20:29Z

After discussing with @pvary , we decided to keep the same create name and go with deprecating the old method instead.
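A rough sketch of the approach settled on here: keep the old signature as a deprecated overload that adapts its argument and delegates to the new one, so existing callers keep compiling. `FormatModelSketch` is a hypothetical, heavily reduced stand-in and does not reproduce the real ParquetFormatModel parameter list.

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Sketch only: a deprecated overload delegating to a new factory-based overload. */
class FormatModelSketch<D, S> {
  private final Function<S, UnaryOperator<D>> copyFuncFactory;

  private FormatModelSketch(Function<S, UnaryOperator<D>> copyFuncFactory) {
    this.copyFuncFactory = copyFuncFactory;
  }

  /** New overload: the copy function is derived from the engine schema. */
  static <D, S> FormatModelSketch<D, S> create(Function<S, UnaryOperator<D>> copyFuncFactory) {
    return new FormatModelSketch<>(copyFuncFactory);
  }

  /**
   * Old overload, kept so the public signature does not change.
   *
   * @deprecated use the factory-based overload instead
   */
  @Deprecated
  static <D, S> FormatModelSketch<D, S> create(UnaryOperator<D> copyFunc) {
    // Adapt the schema-independent function into a factory that ignores the schema.
    Function<S, UnaryOperator<D>> factory = schema -> copyFunc;
    return create(factory);
  }

  D copy(S schema, D row) {
    return copyFuncFactory.apply(schema).apply(row);
  }
}
```

One caveat of sharing the `create` name: both parameter types are single-argument functional interfaces, so callers passing an untyped lambda may hit an ambiguous-overload error and need a typed variable or a cast.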

pvary · 2026-05-21T05:35:33Z

  }

+  /**
+   * @deprecated Will be removed in 1.12.0; use {@link #create(Class, Class, WriterFunction,


Since 1.11.0 has now been released, this change will ship in 1.12.0 and we will remove the deprecated method in 1.13.0.

Please adjust the comment.

Thanks for pointing it out. Adjusted it now.

pvary · 2026-05-21T05:40:07Z

     */
    private FileAppender<D> buildShreddedAppender() {
+      UnaryOperator<D> copyFunc = copyFuncFactory.apply(engineSchema);
+      Preconditions.checkState(copyFunc != null, "copyFunc must not return null");


Checking only the copyFunc seems a bit odd to me. Should we check the factory first?

OK, I added a check for copyFuncFactory first.
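The ordering of the two checks can be sketched as follows. To stay self-contained this uses plain `IllegalStateException` instead of `Preconditions.checkState`; the class, method names, and messages are illustrative assumptions rather than the merged code.

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Sketch only: validate the factory before validating what it returns. */
class CopyFuncChecksSketch {
  static <D, S> UnaryOperator<D> resolveCopyFunc(
      Function<S, UnaryOperator<D>> copyFuncFactory, S engineSchema) {
    // Check the factory first so a missing factory fails with a clear message
    // instead of a NullPointerException on apply().
    if (copyFuncFactory == null) {
      throw new IllegalStateException("copyFuncFactory must not be null");
    }

    UnaryOperator<D> copyFunc = copyFuncFactory.apply(engineSchema);
    if (copyFunc == null) {
      throw new IllegalStateException("copyFuncFactory must not return null");
    }

    return copyFunc;
  }

  /** Helper for demonstration: returns "ok" or the failure message. */
  static String messageFor(Function<String, UnaryOperator<String>> factory) {
    try {
      resolveCopyFunc(factory, "schema");
      return "ok";
    } catch (IllegalStateException e) {
      return e.getMessage();
    }
  }
}
```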

pvary · 2026-05-21T15:00:04Z

@nssalian: Any more comments?

nssalian

One nit in the documentation description. The rest looks good.

nssalian · 2026-05-21T15:27:43Z

 | write-parallelism                       | Upstream operator parallelism              | Overrides the writer parallelism                                                                                                                |
 | uid-suffix                              | As per table property                      | Overrides the uid suffix used in the underlying IcebergSink for this table                                                                      |
+| shred-variants                          | Table write.parquet.shred-variants         | Overrides this table's shred variants for this write |
+| variant-inference-buffer-size           | Table write.parquet.variant-inference-buffer-size | Overrides this table's variant inference buffer size this write |


Suggested change

-| variant-inference-buffer-size | Table write.parquet.variant-inference-buffer-size | Overrides this table's variant inference buffer size this write |
+| variant-inference-buffer-size | Table write.parquet.variant-inference-buffer-size | Overrides this table's variant inference buffer size for this write |

Thanks. Added it now.
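As a rough illustration of what the documentation rows describe, the sketch below resolves the setting from a per-write option first and falls back to the table property. The keys are copied from the rows quoted above; the class, the method, and the caller-supplied default are assumptions, and the real resolution in the Flink connector may involve additional sources.

```java
import java.util.Map;

/** Sketch only: a per-write option overriding a table property. */
class ShredVariantsOptionSketch {
  static final String WRITE_OPTION = "shred-variants";
  static final String TABLE_PROPERTY = "write.parquet.shred-variants";

  static boolean resolve(
      Map<String, String> writeOptions,
      Map<String, String> tableProperties,
      boolean defaultValue) {
    // 1. A value set for this write wins.
    String fromOption = writeOptions.get(WRITE_OPTION);
    if (fromOption != null) {
      return Boolean.parseBoolean(fromOption);
    }

    // 2. Otherwise fall back to the table property.
    String fromTable = tableProperties.get(TABLE_PROPERTY);
    if (fromTable != null) {
      return Boolean.parseBoolean(fromTable);
    }

    // 3. Otherwise use the default supplied by the caller.
    return defaultValue;
  }
}
```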

nssalian

Thanks for all the work @Guosmilesmile. lgtm

Guosmilesmile · 2026-05-22T00:12:19Z

This seems to be an unrelated error in Kafka Connect. Retriggering CI.

TestIntegrationDynamicTable > testIcebergSink(String) > [1] null FAILED
    java.lang.AssertionError at IntegrationTestBase.java:237

https://github.com/apache/iceberg/actions/runs/26236304974/job/77210319934?pr=15596

pvary · 2026-05-22T04:39:43Z

Merged to main.
Thanks @Guosmilesmile for the PR and @nssalian, @talatuyarer and @aihuaxu for the reviews!

Guosmilesmile marked this pull request as draft March 12, 2026 08:09

github-actions Bot added spark parquet flink ORC labels Mar 12, 2026

Guosmilesmile mentioned this pull request Mar 12, 2026

Spark: Support writing shredded variant in Iceberg-Spark #14297

Merged

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 15ff223 to 5b448b9 Compare March 12, 2026 08:22

github-actions Bot removed the ORC label Mar 12, 2026

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 2 times, most recently from 8f6198a to b03caf6 Compare March 12, 2026 08:59

Guosmilesmile changed the title ~~Core,Flink: Support writing shredded variant in Flink~~ Flink: Support writing shredded variant in Flink Mar 12, 2026

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 3 times, most recently from 88045e1 to cbfa8c2 Compare March 13, 2026 07:17

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from fae2814 to f3a2fba Compare March 24, 2026 05:50

github-actions Bot added the core label Mar 24, 2026

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 4 times, most recently from b07b00b to c95d78f Compare March 24, 2026 08:36

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch 3 times, most recently from fc8c45a to b116f25 Compare April 1, 2026 01:50

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from b116f25 to 650cb7a Compare April 10, 2026 09:42

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 770d9c4 to 7d48389 Compare May 7, 2026 06:55

Guosmilesmile marked this pull request as ready for review May 7, 2026 08:34

pvary reviewed May 8, 2026


Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 63ae5ae to 0f2ae10 Compare May 9, 2026 05:33

Guosmilesmile added 10 commits May 20, 2026 09:59

fix spark 4.0

d8463a6

Fix RowDataSerializer create every row

fc8a53e

move set param to after

3c1ff52

Address aihua's Comment

41d832c

Address Comments

41eeeeb

Address talatuyarer's Comments

31a8bc0

Address Peter's Comments

8a5c995

Update the Flink config to prioritize the table-level setting.

cee4e35

rename to unused

d87d3ca

rename to unused for spark4.0

03fb902

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from d0fe6b7 to 03fb902 Compare May 20, 2026 02:00

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from 0cdb80e to e8ad73e Compare May 20, 2026 07:45

Add new create method in ParquetFormatModel

9480782

Guosmilesmile force-pushed the flink_shredded_varisnt_fileformat branch from e8ad73e to 9480782 Compare May 20, 2026 07:50

pvary reviewed May 21, 2026


Adjust deprecated doc and add check for copyFuncFactory

60ccdbe

pvary approved these changes May 21, 2026


nssalian reviewed May 21, 2026


update doc

f46fe6a

nssalian approved these changes May 21, 2026


Guosmilesmile closed this May 22, 2026

Guosmilesmile reopened this May 22, 2026

pvary merged commit 10ba4ee into apache:main May 22, 2026
108 of 110 checks passed

Guosmilesmile deleted the flink_shredded_varisnt_fileformat branch May 22, 2026 04:40

