Skip to content

Arrow: Fix vectorized reads of decimal columns with default values#16501

Open
harperjiang wants to merge 2 commits into
apache:mainfrom
harperjiang:fixdefvalue
Open

Arrow: Fix vectorized reads of decimal columns with default values#16501
harperjiang wants to merge 2 commits into
apache:mainfrom
harperjiang:fixdefvalue

Conversation

@harperjiang
Copy link
Copy Markdown

The vectorized Arrow reader fails to allocate a vector for any decimal column that has an initialDefault or writeDefault on its Iceberg field. Reads through VectorizedTableScanIterable throw:

java.lang.IllegalArgumentException: Cannot cast default value to FIXED: <default>
  at org.apache.iceberg.types.Types$NestedField.castDefault(Types.java:892)
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.getPhysicalType(VectorizedArrowReader.java:255)
  at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateFieldVector(VectorizedArrowReader.java:228)

(fixed[N] for FIXED_LEN_BYTE_ARRAY-backed decimal, with the same shape.)

VectorizedArrowReader#getPhysicalType rewrites a decimal Iceberg field to its underlying physical type (fixed[N]) so the right Arrow vector class is allocated. It does this with Types.NestedField.from(logicalType).ofType(type).build(), which copies the field's initialDefault / writeDefault onto the new physical type.

NestedField's constructor then runs castDefault(literal, type), which calls DecimalLiteral.to(FixedType) — that conversion is not defined for decimal literals and returns null, tripping the Preconditions.checkArgument in castDefault.

The defaults are semantically tied to the logical (decimal) view of the column and should not flow to the physical representation — the physical type is an implementation detail used only to size the Arrow vector. The fix constructs the physical field with a fresh Types.NestedField.builder(), carrying over only id, name, optionality, and doc, and omitting both defaults.

The bug only surfaces when the column is not dictionary-encoded, because allocateDictEncodedVector does not call getPhysicalType. The new test disables dictionary encoding to make the regression deterministic.

Testing

Added TestArrowReader#testDecimalWithDefaultIsReadByVectorizedReader, which:

  • creates a v3 table with a DECIMAL(5, 2) column carrying both initialDefault and writeDefault,
  • writes a Parquet file using INT32-backed decimal with dictionary encoding disabled, and
  • reads via VectorizedTableScanIterable and asserts the raw INT32 values.

Without the fix the test fails at vector allocation; with the fix all rows are read correctly.

* IllegalArgumentException: Cannot cast default value to ...}.
*/
@Test
public void testDecimalWithDefaultIsReadByVectorizedReader() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does arrow reader handle the default values now?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the short answer is that when the parquet schema is visited by the vectorized read builder and identifies there's a field not in the parquet file but in the table schema with a default , a Constant vector reader is created with the default value.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main point here was, that I don't see an end-to-end test here, and I was wondering if the coverage was there.

Copy link
Copy Markdown
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @harperjiang I think I agree with the fix, just a comment on where the tests should be (I also think this would apply for UUID)

* IllegalArgumentException: Cannot cast default value to ...}.
*/
@Test
public void testDecimalWithDefaultIsReadByVectorizedReader() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the short answer is that when the parquet schema is visited by the vectorized read builder and identifies there's a field not in the parquet file but in the table schema with a default , a Constant vector reader is created with the default value.

}

/**
* Regression test: a decimal column whose Iceberg field carries an initialDefault/writeDefault
Copy link
Copy Markdown
Contributor

@amogh-jahagirdar amogh-jahagirdar May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to update the existing TestParquetVectorizedReads. That already implements a mixin supportsDefaultValues() and already runs through decimal type. I think the issue as you pointed out is specifically when it's not dictionary encoded (which for decimal I'd actually expect to generally be the case that it is not dictionary encoded). Maybe in that class when we produce the writer there's a way to pass through options that disable dictionary encoding?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also look at goldenFilesAndEncodings in that existing test class

Copy link
Copy Markdown
Author

@harperjiang harperjiang May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amogh-jahagirdar. Moved the test to spark/v4.1/.../TestParquetVectorizedReads.java and added a getParquetWriterWithoutDictionary helper next to the existing writer helpers. The new testDecimalWithDefaultValueNotDictionaryEncoded covers all three decimal physical encodings ( INT32, INT64, FIXED), and goes through the regular assertRecordsMatch path.

Looked at goldenFilesAndEncodings but that doesn't cover decimal currently. Adding decimal to the list will intro extra changes. Happy to do it as a follow up.

On UUID: traced the path and the bug doesn't apply. getPhysicalType only rewrites the field when primitive.getLogicalTypeAnnotation() instanceof DecimalLogicalTypeAnnotation; ArrowSchemaUtil on UUIDType produces FixedSizeBinary(16) directly, which is already the vector class the reader populates, so no surrogate Iceberg type is needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to see direc Arrow tests in a follow-up. I think ít is very bad practice to test fixes in the Arrow module with tests running in Spark modules.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to having a test in the arrow module

@harperjiang
Copy link
Copy Markdown
Author

Thanks @harperjiang I think I agree with the fix, just a comment on where the tests should be (I also think this would apply for UUID)

Thanks @amogh-jahagirdar ! Moved the test cases and commented on UUID. (Details in the conversation) Please kindly review again at your convenience.

// Use FixedSizeBinaryVector for binary backed decimal
type = Types.FixedType.ofLength(primitive.getTypeLength());
}
physicalType = Types.NestedField.from(logicalType).ofType(type).build();
Copy link
Copy Markdown
Contributor

@nastra nastra May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about keeping the original call but nulling out the defaults?

.withInitialDefault(null)
.withWriteDefault(null)

I think that would be more explicit and show the intent

Copy link
Copy Markdown
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments but I agree with the direction of the fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants