Skip to content
This repository was archived by the owner on Apr 30, 2026. It is now read-only.

Commit 2d0052a

Browse files
committed
fix: dict-form by: rejects alias collisions with input columns
ray_table_add_col appends a new column without dedup'ing schema names, and ray_table_get_col returns the FIRST matching column by id. So when a dict-form by: reused an alias that already existed in the input table — e.g. (table [o G V] ...) (select {from: t by: {o: G} s: (sum V)}) my previous commit happily added a second column named 'o, but the group-by key scanner then returned the ORIGINAL 'o column, silently grouping on the wrong data. Two cases to distinguish: 1. Trivial self-alias {x: x}: the user just wants the existing column as a group key with the same name. Skip the eval/add step entirely and use the input column directly. 2. Any other collision {x: <anything-but-x>}: error out with a clear domain message. Users must pick a distinct alias or drop the shadowed input column first. Also guards against within-dict duplicate keys ({x: A x: B}) by tracking the names added in this loop. Regression tests: - shadow case raises "domain" error - trivial self-alias {G: G} groups on G as expected - xbar-bucket dict case still passes (no regression) 652/652 tests pass.
1 parent 4f857f2 commit 2d0052a

2 files changed

Lines changed: 44 additions & 0 deletions

File tree

src/ops/query.c

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -890,6 +890,31 @@ ray_t* ray_select_fn(ray_t** args, int64_t n) {
890890
fail_err = ray_error("domain", "by-dict key must be a symbol name");
891891
failed = true; break;
892892
}
893+
/* Collision check: if the dict key already exists in the
894+
* input table, ray_table_add_col would append a second
895+
* column with the same name and ray_table_get_col finds
896+
* the ORIGINAL, so the group-by would silently scan the
897+
* user's existing column instead of our materialised
898+
* one. Allow two specific exceptions:
899+
* - {x: x} (trivial self-alias — existing col already
900+
* is what we want)
901+
* - a later dict key reusing a name we just added in
902+
* this loop (guarded separately via i_added[]). */
903+
bool already_in_tbl = (ray_table_get_col(tbl, k->i64) != NULL);
904+
bool trivial_self = (v->type == -RAY_SYM && v->i64 == k->i64);
905+
bool we_added_it = false;
906+
for (int64_t j = 0; j < i && !we_added_it; j++)
907+
if (d_elems[j * 2]->i64 == k->i64) we_added_it = true;
908+
if (already_in_tbl && !trivial_self && !we_added_it) {
909+
fail_err = ray_error("domain",
910+
"by-dict alias shadows an existing input column");
911+
failed = true; break;
912+
}
913+
if (trivial_self && !we_added_it) {
914+
/* No eval / no add: just group on the existing col. */
915+
sv_data[i] = k->i64;
916+
continue;
917+
}
893918
ray_t* col_vec = ray_eval(v);
894919
if (!col_vec || RAY_IS_ERR(col_vec)) {
895920
fail_err = col_vec ? col_vec : ray_error("domain", "by-dict val eval");

test/test_lang_rf.inc

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5447,6 +5447,25 @@ static MunitResult test_rf_group(const void* params, void* fixture) {
54475447
ASSERT_EQ("(at (at rX 's) 3)", "5"); /* 5 */
54485448
ASSERT_EQ("(at (at rX 's) 4)", "6"); /* 6 */
54495449

5450+
/* Shadow protection: a dict alias that already exists as an
5451+
* input column would cause ray_table_add_col to append a
5452+
* second column with the same name; ray_table_get_col finds
5453+
* the original and the group-by would silently scan the
5454+
* wrong data. Reject with a clear domain error. */
5455+
ASSERT_ER(
5456+
"(set tSh (table [o G V] (list [1 2 3] [a a b] [10 20 30])))"
5457+
"(select {from: tSh by: {o: G} s: (sum V)})", "domain");
5458+
5459+
/* Trivial self-alias {X: X} is a no-op rename onto the column
5460+
* that's already there — allowed and produces the expected
5461+
* group-by on the existing column. */
5462+
ASSERT_EQ(
5463+
"(set tTS (table [G V] (list [a a b] [1 2 3])))"
5464+
"(set rTS (select {from: tTS by: {G: G} s: (sum V)}))"
5465+
"(count rTS)", "2");
5466+
ASSERT_EQ("(at (at rTS 's) 0)", "3"); /* a: 1+2 */
5467+
ASSERT_EQ("(at (at rTS 's) 1)", "3"); /* b: 3 */
5468+
54505469
// ========== GROUP BY STRING COLUMN ==========
54515470
// Tests for grouping by list of strings (TYPE_LIST of TYPE_C8)
54525471
// This tests the fix for parallel radix partitioning bug where identical strings

0 commit comments

Comments
 (0)