GH-49470: [C++][Gandiva] Fix crashes in substring_index and truncate with extreme integer values#49471
Merged
zanmato1984 merged 2 commits intoapache:mainfrom Mar 15, 2026
Conversation
…ncate with extreme integer values
|
|
lriggs
reviewed
Mar 9, 2026
lriggs
reviewed
Mar 11, 2026
zanmato1984
approved these changes
Mar 13, 2026
lriggs
approved these changes
Mar 14, 2026
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit d6ce56e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
Member
|
@github-actions crossbow submit test-r-linux-sanitizers |
|
Revision: 8b17dff Submitted crossbow builds: ursacomputing/crossbow @ actions-90c6287765
|
thisisnic
pushed a commit
to thisisnic/arrow
that referenced
this pull request
Apr 6, 2026
…ncate with extreme integer values (apache#49471) ### Rationale for this change Two Gandiva functions crash when called with extreme integer parameter values: 1. `substring_index(VARCHAR, VARCHAR, INT)` crashes with SIGBUS when count is `INT_MIN` 2. `truncate(BIGINT, INT)` crashes with SIGSEGV when scale is `INT_MAX` or `INT_MIN` ### What changes are included in this PR? **substring_index fix** (`gdv_string_function_stubs.cc`): - Replace `abs(cnt)` with safe `int64_t` computation to avoid undefined behavior when `cnt == INT_MIN` **truncate fix** (`precompiled/extended_math_ops.cc`): - Return input unchanged for positive scales (no-op for integers) - Return 0 for scales < -38 to prevent out-of-bounds access in `GetScaleMultiplier` ### Are these changes tested? Yes. Added coverage for `INT_MAX`/`INT_MIN` values in `gdv_function_stubs_test.cc` and `extended_math_ops_test.cc`. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** These changes fix crashes caused by: - `abs(INT_MIN)` triggering undefined behavior (integer overflow) in `substring_index` - Out-of-bounds array access in `GetScaleMultiplier` when `truncate` receives extreme scale values * GitHub Issue: apache#49470 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
lriggs
added a commit
to dremio/arrow
that referenced
this pull request
Apr 24, 2026
…eGH-49421 Backport several upstream fixes. (#136) * apacheGH-49420: [C++][Gandiva] Fix castVARCHAR memory allocation and len<=0 handling (apache#49421) ### Rationale for this change The `castVARCHAR` functions in Gandiva have memory allocation inefficiencies and missing edge case handling. See apacheGH-49420 for details. ### What changes are included in this PR? **Functional fixes:** - `bool`: Remove unused 5-byte arena allocation; return string literal directly - `int32`/`int64`: Add handling for `len=0` (return empty string) and `len<0` (set error) **Memory allocation optimizations:** - `int32`/`int64`: Allocate fixed small buffer (11/20 bytes) directly in arena, use optimized digit-pair conversion writing right-to-left, then `memmove` to align. Returns `min(len, actual_size)` bytes. - `date64`: Allocate only `min(len, 10)` bytes upfront (output is always "YYYY-MM-DD") - `float32`/`float64`: Allocate only `min(len, 24)` bytes upfront (max output length) **Code cleanup:** - Extract common code into helper macros to reduce duplication ### Are these changes tested? Yes. Added tests for `len=0` and `len<0` edge cases for int64, date64, float32, float64, and bool types. All existing Gandiva tests pass. Adhoc performance benchmarking was performed both via direct expression evaluation as well as via query execution via Dremio. ### Are there any user-facing changes? No. Users will see reduced memory usage and proper error messages for invalid len parameter values. Note: Error messages for negative `len` remain different between precompiled ("Output buffer length can't be negative") and interpreted ("Buffer length cannot be negative") code paths, preserving existing behavior. * GitHub Issue: apache#49420 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * apacheGH-49438: [C++][Gandiva] Optimize LPAD/RPAD functions (apache#49439) ### Rationale for this change The `lpad_utf8_int32_utf8` and `rpad_utf8_int32_utf8` functions have performance inefficiency and a potential memory safety issue: 1. **Performance**: Single-byte fills iterate character-by-character when `memset` would suffice. Multi-byte fills use O(n) iterations instead of O(log n) with a doubling strategy. 2. **Memory safety**: When the fill string is longer than the padding space needed, the code could write more bytes than allocated. Fixed preventatively. ### What changes are included in this PR? 1. **Memory safety fix**: Use `std::min(fill_text_len, total_fill_bytes)` for the initial copy to prevent overflow 2. **Fast path**: Add single-byte fill optimization using `memset` 3. **General path**: Replace character-by-character loop with doubling strategy for multi-byte fills 4. **Tests**: Add comprehensive tests for the new code paths ### Are these changes tested? Yes. Added tests covering: - Large UTF-8 fill characters (4-byte emoji, 3-byte Chinese) - Single-byte fill boundaries (1 char and 65536 char padding) - Content verification for fill patterns - Doubling strategy boundaries - Partial fill scenarios (fill text longer than padding needed) ### Are there any user-facing changes? No. * GitHub Issue: apache#49438 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * apacheGH-49441: [C++][Gandiva] Add rand_integer function (apache#49442) ### Rationale for this change Add `rand_integer` function to Gandiva to generate random integers, complementing the existing `rand`/`random` functions that generate random doubles. This provides native integer random number generation and offers a more efficient alternative to `CAST(rand() * range AS INT)`. ### What changes are included in this PR? - Add `RandomIntegerGeneratorHolder` class following the existing `RandomGeneratorHolder` pattern - Implement three function signatures: - `rand_integer()` → int32 in range [INT32_MIN, INT32_MAX] - `rand_integer(int32 range)` → int32 in range [0, range-1] - `rand_integer(int32 min, int32 max)` → int32 in range [min, max] inclusive - Add parameter validation (range > 0, min <= max) at expression compilation time - Add 8 unit tests covering all signatures and edge cases - Use `std::uniform_int_distribution<int32_t>` with Mersenne Twister engine ### Are these changes tested? Yes, added 8 unit tests in `random_generator_holder_test.cc`: - `NoParams` - verifies full int32 range - `WithRange` - verifies [0, range-1] bounds - `WithMinMax` - verifies [min, max] inclusive bounds - `WithNegativeMinMax` - verifies negative range handling - `InvalidRangeZero` - verifies range=0 is rejected - `InvalidRangeNegative` - verifies negative range is rejected - `InvalidMinGreaterThanMax` - verifies min > max is rejected - `NullRangeDefaultsToOne` - verifies null parameter handling ### Are there any user-facing changes? Yes, this adds a new `rand_integer` function to Gandiva with three signatures as described above. * GitHub Issue: apache#49441 * apacheGH-49454: [C++][Gandiva] Fix castVARCHAR_timestamp for pre-epoch timestamps (apache#49455) ### Rationale for this change apacheGH-49454 castVARCHAR_timestamp_int64 produces negative milliseconds for pre-epoch timestamps ### What changes are included in this PR? Fixed `castVARCHAR_timestamp_int64` to correctly handle pre-epoch timestamps (before 1970-01-01). The issue was that using `in % MILLIS_IN_SEC` on negative timestamps produces negative milliseconds, resulting in output like `"0107-10-17 12:20:03.-10"`. ### Are these changes tested? Yes, added 4 new test cases covering pre-epoch timestamps with milliseconds ### Are there any user-facing changes? **This PR contains a "Critical Fix".** This fixes a bug that caused **incorrect data to be produced** when casting pre-epoch timestamps to VARCHAR in Gandiva. Previously, timestamps before 1970-01-01 with non-zero milliseconds would produce invalid output with negative millisecond values (e.g., `"0107-10-17 12:20:03.-10"` instead of `"0107-10-17 12:20:03.900"`). * GitHub Issue: apache#49454 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * apacheGH-49470: [C++][Gandiva] Fix crashes in substring_index and truncate with extreme integer values (apache#49471) ### Rationale for this change Two Gandiva functions crash when called with extreme integer parameter values: 1. `substring_index(VARCHAR, VARCHAR, INT)` crashes with SIGBUS when count is `INT_MIN` 2. `truncate(BIGINT, INT)` crashes with SIGSEGV when scale is `INT_MAX` or `INT_MIN` ### What changes are included in this PR? **substring_index fix** (`gdv_string_function_stubs.cc`): - Replace `abs(cnt)` with safe `int64_t` computation to avoid undefined behavior when `cnt == INT_MIN` **truncate fix** (`precompiled/extended_math_ops.cc`): - Return input unchanged for positive scales (no-op for integers) - Return 0 for scales < -38 to prevent out-of-bounds access in `GetScaleMultiplier` ### Are these changes tested? Yes. Added coverage for `INT_MAX`/`INT_MIN` values in `gdv_function_stubs_test.cc` and `extended_math_ops_test.cc`. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** These changes fix crashes caused by: - `abs(INT_MIN)` triggering undefined behavior (integer overflow) in `substring_index` - Out-of-bounds array access in `GetScaleMultiplier` when `truncate` receives extreme scale values * GitHub Issue: apache#49470 Authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com> Signed-off-by: Rossi Sun <zanmato1984@gmail.com> --------- Signed-off-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Rossi Sun <zanmato1984@gmail.com> Co-authored-by: Dmitry Chirkov <dmitry.chirkov@dremio.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Two Gandiva functions crash when called with extreme integer parameter values:
substring_index(VARCHAR, VARCHAR, INT)crashes with SIGBUS when count isINT_MINtruncate(BIGINT, INT)crashes with SIGSEGV when scale isINT_MAXorINT_MINWhat changes are included in this PR?
substring_index fix (
gdv_string_function_stubs.cc):abs(cnt)with safeint64_tcomputation to avoid undefined behavior whencnt == INT_MINtruncate fix (
precompiled/extended_math_ops.cc):GetScaleMultiplierAre these changes tested?
Yes. Added coverage for
INT_MAX/INT_MINvalues ingdv_function_stubs_test.ccandextended_math_ops_test.cc.Are there any user-facing changes?
No.
This PR contains a "Critical Fix". These changes fix crashes caused by:
abs(INT_MIN)triggering undefined behavior (integer overflow) insubstring_indexGetScaleMultiplierwhentruncatereceives extreme scale values