feat(query): virtual column support materialized cte#19902

Open: b41sh wants to merge 5 commits into databendlabs:main from b41sh:feat-cte-virtual-column

b41sh (Member) commented May 21, 2026

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Materialized CTEs can hide the original Variant column path from the virtual column rewrite. For example, after a CTE projects v['message'] AS message, downstream expressions such as message['attribute']['user_id'] are bound against the materialized CTE output instead of the original table column. As a result, the planner can no longer recover the full path v['message']['attribute']['user_id'], so virtual column pushdown is missed.

This also affects large chained CTE queries where a reused CTE is auto-materialized: downstream consumers may repeatedly read large intermediate Variant objects instead of extracting only the required nested fields earlier.

This PR adds a binder-side rewrite for auto-materialized CTEs that preserves static Variant path access across CTE boundaries.
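The idea of preserving a static path across a CTE boundary can be illustrated with a minimal, self-contained sketch. This is not the PR's implementation: `VariantPath`, `resolve_through_cte`, and the `cte_outputs` map are hypothetical names introduced only for illustration. The sketch assumes the binder can record, for each CTE output column, the base column and key path it was projected from.

```rust
use std::collections::HashMap;

/// Hypothetical model of a static Variant path: a base table column
/// followed by a list of string keys.
#[derive(Debug, Clone, PartialEq, Eq)]
struct VariantPath {
    column: String,
    keys: Vec<String>,
}

impl VariantPath {
    /// Render the path as SQL-like text, e.g. v['message']['attribute'].
    fn render(&self) -> String {
        let mut out = self.column.clone();
        for key in &self.keys {
            out.push_str(&format!("['{}']", key));
        }
        out
    }
}

/// Resolve a key access on a CTE output column back to the base table path.
/// `cte_outputs` (an assumption of this sketch) maps each CTE output name to
/// the path it was projected from. Returns None when the output column is
/// not a recorded static Variant projection.
fn resolve_through_cte(
    cte_outputs: &HashMap<String, VariantPath>,
    output_column: &str,
    downstream_keys: &[&str],
) -> Option<VariantPath> {
    let base = cte_outputs.get(output_column)?;
    let mut keys = base.keys.clone();
    keys.extend(downstream_keys.iter().map(|k| k.to_string()));
    Some(VariantPath {
        column: base.column.clone(),
        keys,
    })
}

fn main() {
    // The CTE projects v['message'] AS message.
    let mut cte_outputs = HashMap::new();
    cte_outputs.insert(
        "message".to_string(),
        VariantPath {
            column: "v".to_string(),
            keys: vec!["message".to_string()],
        },
    );

    // The consumer reads message['attribute']['user_id'].
    let full = resolve_through_cte(&cte_outputs, "message", &["attribute", "user_id"])
        .expect("message is a known CTE output");
    println!("{}", full.render()); // prints v['message']['attribute']['user_id']
}
```

Without such a record, the consumer sees only `message` as an opaque CTE output, which is why the full path is lost.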

The following example shows the original SQL and the SQL after the rewrite:

CREATE TABLE t(v Variant) enable_virtual_column=true;

-- original SQL
WITH logs AS (
    SELECT v['message'] AS message FROM t
)
SELECT message['attribute']['user_id'] FROM logs;

-- rewritten SQL
WITH logs AS (
    SELECT v['message']['attribute']['user_id'] AS __databend_virtual_column__0 FROM t
)
SELECT __databend_virtual_column__0 FROM logs;

SETTINGS (
  enable_experimental_virtual_column = 1,
  enable_auto_materialize_cte = 1
)
EXPLAIN WITH logs AS (
    SELECT v['message'] AS message FROM t
)
SELECT message['attribute']['user_id'] FROM logs;
╭─────────────────────────────────────────────────────────────────╮
│                             explain                             │
│                              String                             │
├─────────────────────────────────────────────────────────────────┤
│ TableScan                                                       │
│ ├── table: default.default.t                                    │
│ ├── scan id: 0                                                  │
│ ├── output columns: [v['message']['attribute']['user_id'] (#2)] │
│ ├── read rows: 0                                                │
│ ├── read size: 0                                                │
│ ├── partitions total: 0                                         │
│ ├── partitions scanned: 0                                       │
│ ├── push downs: [filters: [], limit: NONE]                      │
│ ├── virtual columns: [v['message']['attribute']['user_id']]     │
│ └── estimated rows: 0.00                                        │
╰─────────────────────────────────────────────────────────────────╯

Other changes

  • add_internal_column_into_expr and add_virtual_column_into_expr now add internal columns and virtual columns in batches, which avoids redundant operations
  • The column_mapping of MaterializedCTERef has been changed from HashMap to BTreeMap so that EXPLAIN output is stable across runs
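The effect of the HashMap to BTreeMap change can be shown with a small standalone sketch. `render_mapping` and the `usize` column indexes are hypothetical and do not reflect the real `MaterializedCTERef` type. The sketch relies only on a property of the Rust standard library: `BTreeMap` iterates in ascending key order, whereas `HashMap` iteration order is unspecified.

```rust
use std::collections::BTreeMap;

/// Render a column mapping as a comma-separated string.
/// The key and value types are hypothetical column indexes.
fn render_mapping(mapping: &BTreeMap<usize, usize>) -> String {
    mapping
        .iter()
        .map(|(from, to)| format!("#{} -> #{}", from, to))
        .collect::<Vec<_>>()
        .join(", ")
}

fn main() {
    let mut mapping = BTreeMap::new();
    // Insertion order is deliberately unsorted.
    mapping.insert(7, 2);
    mapping.insert(3, 0);
    mapping.insert(5, 1);

    // BTreeMap iterates in ascending key order regardless of insertion order.
    println!("{}", render_mapping(&mapping)); // prints #3 -> #0, #5 -> #1, #7 -> #2
}
```

With a `HashMap`, the same rendering could come out in a different order from one run to the next, which makes text-based plan comparisons in tests flaky.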

fixes: #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

github-actions Bot added the pr-feature label (this PR introduces a new feature to the codebase) May 21, 2026
github-actions Bot (Contributor) commented May 21, 2026

🤖 CI Job Analysis

Workflow: 26674127669

📊 Summary

  • Total Jobs: 88
  • Failed Jobs: 2
  • Retryable: 0
  • Code Issues: 2

NO RETRY NEEDED

All failures appear to be code/test issues requiring manual fixes.

🔍 Job Details

  • linux / sqllogic / cluster (tpcds, 4c, hybrid): Not retryable (Code/Test)
  • linux / sqllogic / cluster (tpcds, 4c, http): Not retryable (Code/Test)

🤖 About

Automated analysis using job annotations to distinguish infrastructure issues (auto-retried) from code/test issues (manual fixes needed).

b41sh force-pushed the feat-cte-virtual-column branch from c84d932 to c79fbf9 May 29, 2026 04:11
b41sh requested review from SkyFan2002 and sundy-li May 29, 2026 08:27
b41sh marked this pull request as ready for review May 29, 2026 08:28
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c79fbf971f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/query/sql/src/planner/binder/virtual_column.rs