Optimize memory usage of DataStore #553
Open
wudidapaopao wants to merge 8 commits into chdb-io:main
Conversation
- Column pruning and SQL groupby/fillna/union pushdown to reduce memory
- Detect overlapping non-key columns in merge() and fall back to pandas for suffixes
- Fall back to pandas merge when sort=True or validate is set
- Add 43 merge/join compatibility tests
Prevents ClickHouse ILLEGAL_AGGREGATION when WHERE column name conflicts with aggregate alias (e.g. filter→assign→groupby chain).
UNION ALL resets the index, which is incompatible with the pandas concat default (ignore_index=False) that preserves original indices.
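The index mismatch can be seen with a tiny pure-Python sketch (the values are illustrative, not DataStore code):

```python
# Two pandas-style frames, each indexed 0..1 (modeled as plain dicts).
a = {"idx": [0, 1], "val": ["a", "b"]}
b = {"idx": [0, 1], "val": ["c", "d"]}

# pandas.concat default (ignore_index=False) keeps the original indices,
# so the result carries duplicate labels:
default_idx = a["idx"] + b["idx"]  # [0, 1, 0, 1]

# SQL UNION ALL has no notion of a row index; rows come back freshly
# numbered, which corresponds to ignore_index=True:
union_idx = list(range(len(a["val"]) + len(b["val"])))  # [0, 1, 2, 3]
```

This is why the UNION ALL fast-path is only taken when the caller passed ignore_index=True.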
Filter NULL/empty groups after SQL GROUP BY to match pandas dropna=True. Let exceptions propagate instead of silently falling back.
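A minimal sketch of the NULL-group filtering, using SQLite's in-memory engine as a stand-in for chDB (table and values are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (dept TEXT, salary REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("a", 10.0), ("a", 20.0), (None, 99.0), ("b", 30.0)])

# A plain GROUP BY would keep the NULL-keyed group, but pandas
# groupby(dropna=True) drops it, so the pushed-down query filters
# NULL keys before grouping:
rows = con.execute(
    "SELECT dept, AVG(salary) FROM t "
    "WHERE dept IS NOT NULL GROUP BY dept ORDER BY dept"
).fetchall()
# rows == [('a', 15.0), ('b', 30.0)] -- the NULL group is gone
```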
Skip the single-SQL GROUP BY optimization when execution_engine is explicitly set to 'pandas', avoiding floating-point precision differences between chDB and pandas aggregation.
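The precision concern is generic IEEE-754 behavior: two engines that sum floats in a different order or with a different algorithm can disagree in the last bits. A small illustration (not measured chDB/pandas output):

```python
import math

vals = [0.1] * 10
# Left-to-right accumulation, as a naive engine might do it;
# the result is not exactly 1.0.
naive = sum(vals)
# Compensated (correctly rounded) summation gives exactly 1.0.
compensated = math.fsum(vals)
assert naive != compensated  # same data, different aggregation result
```

When the user explicitly asked for the pandas engine, silently substituting a chDB aggregate could change results at this level, so the optimization is skipped.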
Replace the broad has_existing_lazy_ops guard with targeted detection of WHERE clauses that reference the aggregation column. This allows more queries (e.g., filter on column A + groupby agg on column B) to use the memory-saving single-SQL path while still avoiding chDB's ILLEGAL_AGGREGATION when WHERE and the aggregation alias collide.
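The shape of the targeted guard can be sketched as follows; the real implementation inspects the lazy-op AST, and all names below are invented for illustration:

```python
def should_fall_back(where_columns: set, agg_alias: str) -> bool:
    """Hypothetical guard: skip the single-SQL GROUP BY path only when
    a WHERE clause references the aggregation alias, the collision that
    makes chDB raise ILLEGAL_AGGREGATION."""
    return agg_alias in where_columns

# Filter on column 'a' + aggregate aliased 'b': single-SQL path is safe.
safe_case = should_fall_back({"a"}, "b")          # False
# WHERE references the alias itself: must fall back.
collision = should_fall_back({"total"}, "total")  # True
```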
Pull request overview
This PR targets lower peak memory usage in DataStore by increasing SQL pushdown opportunities (JOIN/UNION/GROUP BY/value_counts) and by checkpointing execution results to release now-unused intermediate sources, while adding an extensive pandas-compatibility test suite for merge/join behaviors.
Changes:
- Add lazy SQL paths for merge() (JOIN) and concat() (UNION ALL) when inputs are SQL-backed and features are compatible.
- Introduce column-pruning and single-SQL execution paths in ColumnExpr/groupby to avoid materializing wide intermediates.
- Add new merge/join pandas-compatibility test coverage, plus SQL helper support (SubqueryTableFunction) and row-order SQL clause handling updates.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| datastore/tests/test_merge_join_compat.py | New pandas-compat coverage for merge/join edge cases, file-backed joins, and concat/union column-ordering. |
| datastore/table_functions.py | Adds SubqueryTableFunction to treat composite SQL (e.g., UNION) as a table source. |
| datastore/pandas_compat.py | Adds lazy SQL fast-paths for merge() and concat(), and checkpoints loc.__setitem__ mutations into a DataFrame source. |
| datastore/pandas_api.py | Adds a module-level concat() UNION ALL fast-path for DataStore inputs. |
| datastore/core.py | Improves SQL-source “pristine” detection, reduces retained memory after execution, improves columns metadata for SQL/join cases, and adds SQL UNION implementation in union(). |
| datastore/connection.py | Updates row-order preservation SQL rewrite to insert before SETTINGS as well as LIMIT/OFFSET. |
| datastore/column_expr.py | Adds column pruning for expression/value_counts/groupby and introduces a single-SQL groupby aggregation path for SQL-backed sources. |
Resolved #552
Changelog category (leave one):
Changelog entry
Optimize DataStore memory usage by pushing more operations to SQL and releasing intermediate data earlier
Summary
This PR reduces DataStore peak memory consumption through five strategies:
- SQL JOIN pushdown for merge() — avoids materializing both sides into Python DataFrames
- SQL UNION ALL pushdown for concat()/union() — avoids loading all concatenation operands into memory simultaneously
- Column pruning in expression evaluation, to avoid SELECT * when only a few columns are referenced
- A single-SQL GROUP BY path for SQL-backed sources
- SQL ifNull() for numeric scalar fillna()

Additionally adds 43 merge/join pandas-compatibility tests and fixes several edge cases in the groupby SQL path.
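The column-pruning strategy can be sketched with SQLite as a stand-in engine (table and column names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (price REAL, quantity INTEGER, note TEXT)")
con.execute("INSERT INTO sales VALUES (2.0, 3, 'wide unused column')")

# Fields an expression like ds['price'] * ds['quantity'] references:
referenced = ["price", "quantity"]
# Instead of SELECT *, only the referenced columns cross the engine
# boundary; 'note' is never loaded into Python memory.
pruned_sql = f"SELECT {', '.join(referenced)} FROM sales"
price, quantity = con.execute(pruned_sql).fetchone()
result = price * quantity  # 6.0
```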
Detailed Changes
1. SQL JOIN pushdown for merge() (pandas_compat.py)
Before: ds1.merge(ds2, on='key') always materialized both DataStores into Python DataFrames, then called pd.merge().
After: When both sides are SQL-backed DataStores and no pandas-only feature (index join, suffixes, sort, validate) is needed, merge() delegates to DataStore.join(), which generates a SQL JOIN — neither side needs to be loaded into Python memory.
Falls back to pandas merge when overlapping non-key columns require suffixes (_x/_y), or when sort=True, validate, left_index/right_index, or indicator is set.

2. SQL UNION ALL pushdown for concat()/union() (core.py, pandas_api.py, pandas_compat.py)
Before: concat([ds1, ds2, ds3]) materialized all DataStores into DataFrames, then called pd.concat().
After: When all inputs are SQL-backed DataStores with ignore_index=True and join='outer', generates a (SELECT ...) UNION ALL (SELECT ...) query — data stays in the SQL engine.
A new SubqueryTableFunction class wraps composite SQL expressions (UNION, subqueries) as table sources. Each SELECT is parenthesized to safely handle ORDER BY/LIMIT/SETTINGS clauses.

3. Column pruning in expression evaluation (column_expr.py)
Before: Any expression like ds['price'] * ds['quantity'] called ds._execute(), which does SELECT *, loading all columns.
After: Expression evaluation inspects the AST to find referenced Field nodes, then generates SELECT price, quantity FROM ... — only the needed columns are transferred.
The same optimization is applied to value_counts().

4. Single-SQL GROUP BY path (column_expr.py)
Before: ds.groupby('dept')['salary'].mean() first executed ds._execute() (a full-table download), then performed groupby+agg via chDB on the local DataFrame.
After: For SQL-backed sources, generates a single SELECT dept, AVG(salary) FROM ... GROUP BY dept query — no intermediate full-table materialization.
Handles edge cases:
- Skips the single-SQL path when execution_engine='pandas' is explicitly set, to avoid floating-point precision differences between chDB and pandas
- Detects WHERE clauses that reference the aggregation alias (which would raise chDB's ILLEGAL_AGGREGATION) and falls back
- Filters NULL/empty groups after SQL GROUP BY to match pandas dropna=True behavior

5. SQL fillna() optimization (column_expr.py)
Before: ds['col'].fillna(0) always added a lazy pandas operation, requiring full materialization.
After: For numeric scalar fills on SQL-backed sources, uses SQL ifNull(col, 0), which stays in the expression tree — no materialization needed.

6. Execution result checkpointing (core.py)
Before: After executing a mixed SQL+pandas pipeline, the original source DataFrame was retained in memory alongside the result.
After: After successful execution, _source_df is updated to point to the result DataFrame, allowing the old source to be garbage collected.

7. .columns metadata optimization (core.py)
Before: Accessing ds.columns after assign() or join() triggered full execution.
After: Derives column names from schema + computed-columns metadata, or probes with LIMIT 0 for JOINs — avoids materializing full results.

8. Test coverage (test_merge_join_compat.py)
43 new pandas-compatibility tests covering:
- left_on/right_on with different column names
- sort parameter in merge (fallback detection)
- from_df() DataStore merge (in-memory path)
- concat()/union() column ordering

Documentation entry for user-facing changes
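The fillna() pushdown described in change 5 can be illustrated with SQLite's IFNULL as a stand-in for chDB's ifNull (an illustrative sketch, not the DataStore implementation):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col REAL)")
con.executemany("INSERT INTO t VALUES (?)", [(1.5,), (None,), (3.0,)])

# Instead of materializing the column and calling pandas fillna(0),
# the fill is expressed inside the query itself, so NULLs never
# reach Python as NaNs that need a second pass:
rows = [r[0] for r in con.execute("SELECT IFNULL(col, 0) FROM t")]
# rows == [1.5, 0, 3.0]
```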
CI Settings
NOTE: If you merge the PR with modified CI, you MUST KNOW what you are doing.
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step
Run these jobs only (required builds will be added automatically):
Deny these jobs:
Extra options:
Only specified batches in multi-batch jobs: