Arrow: Fix vectorized reads of decimal columns with default values#16501
Arrow: Fix vectorized reads of decimal columns with default values#16501harperjiang wants to merge 2 commits into
Conversation
| * IllegalArgumentException: Cannot cast default value to ...}. | ||
| */ | ||
| @Test | ||
| public void testDecimalWithDefaultIsReadByVectorizedReader() throws Exception { |
There was a problem hiding this comment.
How does arrow reader handle the default values now?
There was a problem hiding this comment.
So the short answer is that when the parquet schema is visited by the vectorized read builder and identifies there's a field not in the parquet file but in the table schema with a default , a Constant vector reader is created with the default value.
There was a problem hiding this comment.
My main point here was, that I don't see an end-to-end test here, and I was wondering if the coverage was there.
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Thanks @harperjiang I think I agree with the fix, just a comment on where the tests should be (I also think this would apply for UUID)
| * IllegalArgumentException: Cannot cast default value to ...}. | ||
| */ | ||
| @Test | ||
| public void testDecimalWithDefaultIsReadByVectorizedReader() throws Exception { |
There was a problem hiding this comment.
So the short answer is that when the parquet schema is visited by the vectorized read builder and identifies there's a field not in the parquet file but in the table schema with a default , a Constant vector reader is created with the default value.
| } | ||
|
|
||
| /** | ||
| * Regression test: a decimal column whose Iceberg field carries an initialDefault/writeDefault |
There was a problem hiding this comment.
Is there a way to update the existing TestParquetVectorizedReads. That already implements a mixin supportsDefaultValues() and already runs through decimal type. I think the issue as you pointed out is specifically when it's not dictionary encoded (which for decimal I'd actually expect to generally be the case that it is not dictionary encoded). Maybe in that class when we produce the writer there's a way to pass through options that disable dictionary encoding?
There was a problem hiding this comment.
I'd also look at goldenFilesAndEncodings in that existing test class
There was a problem hiding this comment.
Thanks @amogh-jahagirdar. Moved the test to spark/v4.1/.../TestParquetVectorizedReads.java and added a getParquetWriterWithoutDictionary helper next to the existing writer helpers. The new testDecimalWithDefaultValueNotDictionaryEncoded covers all three decimal physical encodings ( INT32, INT64, FIXED), and goes through the regular assertRecordsMatch path.
Looked at goldenFilesAndEncodings but that doesn't cover decimal currently. Adding decimal to the list will intro extra changes. Happy to do it as a follow up.
On UUID: traced the path and the bug doesn't apply. getPhysicalType only rewrites the field when primitive.getLogicalTypeAnnotation() instanceof DecimalLogicalTypeAnnotation; ArrowSchemaUtil on UUIDType produces FixedSizeBinary(16) directly, which is already the vector class the reader populates, so no surrogate Iceberg type is needed.
There was a problem hiding this comment.
I would also like to see direc Arrow tests in a follow-up. I think ít is very bad practice to test fixes in the Arrow module with tests running in Spark modules.
There was a problem hiding this comment.
+1 to having a test in the arrow module
Thanks @amogh-jahagirdar ! Moved the test cases and commented on UUID. (Details in the conversation) Please kindly review again at your convenience. |
| // Use FixedSizeBinaryVector for binary backed decimal | ||
| type = Types.FixedType.ofLength(primitive.getTypeLength()); | ||
| } | ||
| physicalType = Types.NestedField.from(logicalType).ofType(type).build(); |
There was a problem hiding this comment.
what about keeping the original call but nulling out the defaults?
.withInitialDefault(null)
.withWriteDefault(null)
I think that would be more explicit and show the intent
nastra
left a comment
There was a problem hiding this comment.
left some comments but I agree with the direction of the fix
The vectorized Arrow reader fails to allocate a vector for any decimal column that has an
initialDefaultorwriteDefaulton its Iceberg field. Reads throughVectorizedTableScanIterablethrow:(
fixed[N]for FIXED_LEN_BYTE_ARRAY-backed decimal, with the same shape.)VectorizedArrowReader#getPhysicalTyperewrites a decimal Iceberg field to its underlying physical type (fixed[N]) so the right Arrow vector class is allocated. It does this withTypes.NestedField.from(logicalType).ofType(type).build(), which copies the field'sinitialDefault/writeDefaultonto the new physical type.NestedField's constructor then runscastDefault(literal, type), which callsDecimalLiteral.to(FixedType)— that conversion is not defined for decimal literals and returnsnull, tripping thePreconditions.checkArgumentincastDefault.The defaults are semantically tied to the logical (decimal) view of the column and should not flow to the physical representation — the physical type is an implementation detail used only to size the Arrow vector. The fix constructs the physical field with a fresh
Types.NestedField.builder(), carrying over onlyid,name,optionality, anddoc, and omitting both defaults.The bug only surfaces when the column is not dictionary-encoded, because
allocateDictEncodedVectordoes not callgetPhysicalType. The new test disables dictionary encoding to make the regression deterministic.Testing
Added
TestArrowReader#testDecimalWithDefaultIsReadByVectorizedReader, which:DECIMAL(5, 2)column carrying bothinitialDefaultandwriteDefault,VectorizedTableScanIterableand asserts the raw INT32 values.Without the fix the test fails at vector allocation; with the fix all rows are read correctly.