Arrow: Fix vectorized reads of decimal columns with default values by harperjiang · Pull Request #16501 · apache/iceberg

harperjiang · 2026-05-21T05:40:02Z

The vectorized Arrow reader fails to allocate a vector for any decimal column that has an initialDefault or writeDefault on its Iceberg field. Reads through VectorizedTableScanIterable throw:

java.lang.IllegalArgumentException: Cannot cast default value to FIXED: <default>
  at org.apache.iceberg.types.Types$NestedField.castDefault(Types.java:892)
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.getPhysicalType(VectorizedArrowReader.java:255)
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:228)

(For a FIXED_LEN_BYTE_ARRAY-backed decimal the target type is fixed[N]; the stack trace has the same shape.)

VectorizedArrowReader#getPhysicalType rewrites a decimal Iceberg field to its underlying physical type (fixed[N]) so the right Arrow vector class is allocated. It does this with Types.NestedField.from(logicalType).ofType(type).build(), which copies the field's initialDefault / writeDefault onto the new physical type.

NestedField's constructor then runs castDefault(literal, type), which calls DecimalLiteral.to(FixedType) — that conversion is not defined for decimal literals and returns null, tripping the Preconditions.checkArgument in castDefault.

The defaults are semantically tied to the logical (decimal) view of the column and should not flow to the physical representation — the physical type is an implementation detail used only to size the Arrow vector. The fix constructs the physical field with a fresh Types.NestedField.builder(), carrying over only id, name, optionality, and doc, and omitting both defaults.
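The failure mode and the fix described above can be illustrated with a small stand-alone model. This is only a sketch: `PhysicalFieldSketch`, `Field`, `TypeId`, `convert`, `physicalByCopy`, and `physicalFresh` are hypothetical names invented for illustration and are not Iceberg classes. The model assumes, as the description states, that the field constructor validates defaults against the field type and that a decimal literal has no conversion to a fixed type.

```java
import java.math.BigDecimal;

// Stand-alone model of the bug and the fix. None of these types are Iceberg's;
// they only mimic the behaviour described in the PR text.
final class PhysicalFieldSketch {

  enum TypeId { DECIMAL, INTEGER, LONG, FIXED }

  /** Hypothetical literal conversion: a decimal literal converts only to DECIMAL. */
  static Object convert(BigDecimal literal, TypeId target) {
    return target == TypeId.DECIMAL ? literal : null;
  }

  /** Hypothetical field whose constructor validates defaults against its type. */
  static final class Field {
    final int id;
    final String name;
    final boolean optional;
    final TypeId type;
    final String doc;
    final BigDecimal initialDefault;
    final BigDecimal writeDefault;

    Field(int id, String name, boolean optional, TypeId type, String doc,
          BigDecimal initialDefault, BigDecimal writeDefault) {
      this.id = id;
      this.name = name;
      this.optional = optional;
      this.type = type;
      this.doc = doc;
      this.initialDefault = castDefault(initialDefault, type);
      this.writeDefault = castDefault(writeDefault, type);
    }

    private static BigDecimal castDefault(BigDecimal literal, TypeId type) {
      if (literal == null) {
        return null; // no default, nothing to validate
      }
      if (convert(literal, type) == null) {
        throw new IllegalArgumentException(
            "Cannot cast default value to " + type + ": " + literal);
      }
      return literal;
    }
  }

  /** Pre-fix shape: copy every attribute and swap the type, so defaults ride along. */
  static Field physicalByCopy(Field logical, TypeId physical) {
    return new Field(logical.id, logical.name, logical.optional, physical, logical.doc,
        logical.initialDefault, logical.writeDefault);
  }

  /** Post-fix shape: carry id, name, optionality, and doc; omit both defaults. */
  static Field physicalFresh(Field logical, TypeId physical) {
    return new Field(logical.id, logical.name, logical.optional, physical, logical.doc,
        null, null);
  }

  public static void main(String[] args) {
    BigDecimal one = new BigDecimal("1.00");
    Field price = new Field(1, "price", true, TypeId.DECIMAL, null, one, one);
    try {
      physicalByCopy(price, TypeId.FIXED);
    } catch (IllegalArgumentException e) {
      System.out.println("copy failed: " + e.getMessage());
    }
    Field physical = physicalFresh(price, TypeId.FIXED);
    System.out.println("fresh built: " + physical.type);
  }
}
```

In this model the logical field keeps its defaults untouched; only the throwaway physical field drops them, which mirrors the argument that defaults belong to the logical view of the column.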

The bug only surfaces when the column is not dictionary-encoded, because allocateDictEncodedVector does not call getPhysicalType. The new test disables dictionary encoding to make the regression deterministic.

Testing

Added TestArrowReader#testDecimalWithDefaultIsReadByVectorizedReader, which:

creates a v3 table with a DECIMAL(5, 2) column carrying both initialDefault and writeDefault,
writes a Parquet file using INT32-backed decimal with dictionary encoding disabled, and
reads via VectorizedTableScanIterable and asserts the raw INT32 values.

Without the fix the test fails at vector allocation; with the fix all rows are read correctly.
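For readers unfamiliar with INT32-backed decimals: the Parquet format stores such a decimal as its unscaled integer, so the "raw INT32 values" asserted by the test are presumably of that form. The sketch below uses only `java.math.BigDecimal`; the class and method names (`UnscaledDecimalSketch`, `toUnscaledInt`, `fromUnscaledInt`) are hypothetical and not taken from the PR.

```java
import java.math.BigDecimal;

// Illustrative helper, not part of the PR: converts between a decimal value
// and the unscaled int that an INT32-backed decimal column would hold.
final class UnscaledDecimalSketch {

  /** Unscaled int for a decimal at the given scale (fits INT32 when precision <= 9). */
  static int toUnscaledInt(BigDecimal value, int scale) {
    // setScale(int) throws ArithmeticException if rounding would be needed;
    // intValueExact() throws if the unscaled value does not fit in an int.
    return value.setScale(scale).unscaledValue().intValueExact();
  }

  /** Rebuilds the decimal from its unscaled int and scale. */
  static BigDecimal fromUnscaledInt(int unscaled, int scale) {
    return BigDecimal.valueOf(unscaled, scale);
  }

  public static void main(String[] args) {
    int raw = toUnscaledInt(new BigDecimal("123.45"), 2); // DECIMAL(5, 2)
    System.out.println(raw);
    System.out.println(fromUnscaledInt(raw, 2));
  }
}
```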

pvary · 2026-05-21T13:23:03Z

+   * IllegalArgumentException: Cannot cast default value to ...}.
+   */
+  @Test
+  public void testDecimalWithDefaultIsReadByVectorizedReader() throws Exception {


How does arrow reader handle the default values now?

So the short answer is that when the Parquet schema is visited by the vectorized read builder and it identifies a field that is not in the Parquet file but is in the table schema with a default, a Constant vector reader is created with the default value.

My main point was that I don't see an end-to-end test here, and I was wondering whether the coverage was there.
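The default handling described in this thread can be sketched roughly as a per-field decision. Everything here is a simplified model under stated assumptions: `DefaultReaderSketch`, `Kind`, `Plan`, and `planFor` are hypothetical names, not the reader builder's API, and the all-nulls fallback for a missing field without a default is an assumption rather than something stated in the thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified model of choosing a reader per expected field; not Iceberg code.
final class DefaultReaderSketch {

  enum Kind { COLUMN, CONSTANT, NULLS }

  /** Hypothetical description of the reader chosen for one field. */
  static final class Plan {
    final Kind kind;
    final Object constant; // only set for CONSTANT

    Plan(Kind kind, Object constant) {
      this.kind = kind;
      this.constant = constant;
    }
  }

  static Plan planFor(int fieldId, Object defaultValue, Set<Integer> idsInFile) {
    if (idsInFile.contains(fieldId)) {
      // The column exists in the file: read it normally.
      return new Plan(Kind.COLUMN, null);
    }
    if (defaultValue != null) {
      // Missing from the file but the table schema has a default:
      // every row gets the same constant value.
      return new Plan(Kind.CONSTANT, defaultValue);
    }
    // Assumption: a missing field with no default reads as all nulls.
    return new Plan(Kind.NULLS, null);
  }

  public static void main(String[] args) {
    Set<Integer> idsInFile = new HashSet<>(Arrays.asList(1, 2));
    System.out.println(planFor(1, null, idsInFile).kind);
    System.out.println(planFor(3, "9.99", idsInFile).kind);
    System.out.println(planFor(4, null, idsInFile).kind);
  }
}
```

Note that in this model a field present in the file never consults its default, which is why the failing path in this PR (a decimal column that is present in the file) is separate from the constant-reader path.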

amogh-jahagirdar

Thanks @harperjiang I think I agree with the fix, just a comment on where the tests should be (I also think this would apply for UUID)

amogh-jahagirdar · 2026-05-21T15:53:07Z

+   * IllegalArgumentException: Cannot cast default value to ...}.
+   */
+  @Test
+  public void testDecimalWithDefaultIsReadByVectorizedReader() throws Exception {


So the short answer is that when the Parquet schema is visited by the vectorized read builder and it identifies a field that is not in the Parquet file but is in the table schema with a default, a Constant vector reader is created with the default value.

amogh-jahagirdar · 2026-05-21T15:55:43Z

  }

+  /**
+   * Regression test: a decimal column whose Iceberg field carries an initialDefault/writeDefault


Is there a way to update the existing TestParquetVectorizedReads? It already implements a mixin supportsDefaultValues() and already runs through the decimal type. I think the issue, as you pointed out, is specifically when the column is not dictionary encoded (which for decimal I'd actually expect to be the general case). Maybe in that class, when we produce the writer, there's a way to pass through options that disable dictionary encoding?

I'd also look at goldenFilesAndEncodings in that existing test class

Thanks @amogh-jahagirdar. Moved the test to spark/v4.1/.../TestParquetVectorizedReads.java and added a getParquetWriterWithoutDictionary helper next to the existing writer helpers. The new testDecimalWithDefaultValueNotDictionaryEncoded covers all three decimal physical encodings (INT32, INT64, FIXED) and goes through the regular assertRecordsMatch path.

Looked at goldenFilesAndEncodings, but it doesn't currently cover decimal. Adding decimal to the list would introduce extra changes. Happy to do it as a follow-up.

On UUID: traced the path and the bug doesn't apply. getPhysicalType only rewrites the field when primitive.getLogicalTypeAnnotation() instanceof DecimalLogicalTypeAnnotation; ArrowSchemaUtil on UUIDType produces FixedSizeBinary(16) directly, which is already the vector class the reader populates, so no surrogate Iceberg type is needed.

I would also like to see direct Arrow tests in a follow-up. I think it is very bad practice to test fixes in the Arrow module with tests running in Spark modules.

+1 to having a test in the arrow module

harperjiang · 2026-05-21T23:54:55Z

> Thanks @harperjiang I think I agree with the fix, just a comment on where the tests should be (I also think this would apply for UUID)

Thanks @amogh-jahagirdar! Moved the test cases and commented on UUID (details in the conversation). Please review again at your convenience.

nastra · 2026-05-22T08:21:59Z

        // Use FixedSizeBinaryVector for binary backed decimal
        type = Types.FixedType.ofLength(primitive.getTypeLength());
      }
-      physicalType = Types.NestedField.from(logicalType).ofType(type).build();


what about keeping the original call but nulling out the defaults?

.withInitialDefault(null)
.withWriteDefault(null)

I think that would be more explicit and show the intent

nastra

left some comments but I agree with the direction of the fix

wip

7983e0d

github-actions Bot added the arrow label May 21, 2026

harperjiang mentioned this pull request May 21, 2026

Arrow: Vectorized reads of decimal columns with default values fail with IllegalArgumentException #16502 (Open, 3 tasks)

pvary reviewed May 21, 2026

amogh-jahagirdar reviewed May 21, 2026

address comments

5037e58

github-actions Bot added the spark label May 21, 2026

harperjiang requested a review from amogh-jahagirdar May 21, 2026 23:53

nastra reviewed May 22, 2026

nastra approved these changes May 22, 2026
