Skip to content

test: local validation script + Dockerfile fixes for arm64 quickstart#96

Open
JoshDreamland wants to merge 8 commits into
parent-query-id-surgicalfrom
quickstart-validate
Open

test: local validation script + Dockerfile fixes for arm64 quickstart#96
JoshDreamland wants to merge 8 commits into
parent-query-id-surgicalfrom
quickstart-validate

Conversation

@JoshDreamland
Copy link
Copy Markdown
Contributor

Adds a way to validate pg_stat_ch PRs end-to-end against the existing quickstart Docker stack, plus the Dockerfile fixes that were blocking that stack from building locally on Apple-Silicon dev boxes.

Stacked on top of #95 (parent_query_id) because the validation script asserts on parent_query_id behavior; retarget to main after #95 merges.

What's here

fix(docker): make quickstart image build on arm64 and pick up overlay-ports (23309ae)

Two real bugs in docker/postgres-ext.Dockerfile:

  1. vcpkg-configuration.json lists overlay-ports/ as a vcpkg overlay source (added in Add Arrow IPC batch builder and export path #78 for cityhash), but the Dockerfile never copied that directory into the build context. The first vcpkg invocation aborted with "Overlay path /build/pg_stat_ch/./overlay-ports must be an existing directory."
  2. The cmake step hardcoded VCPKG_TARGET_TRIPLET=x64-linux-pic, which fails compiler detection inside an arm64 container (vcpkg tries to cross-target x64 from an aarch64 host without a cross toolchain). Switch to uname -m detection so the same Dockerfile builds against the host architecture on both x64 CI runners and arm64 dev boxes. No BuildKit dependency.

test: local validation script + skill for parent_query_id (90001c0)

  • scripts/quickstart-validate-parent-query-id.sh — drives the quickstart stack end-to-end against the same three invariants t/028_parent_query_id.pl asserts on (top-level parent=0, nested SPI parent=outer, error event's queryid + parent_query_id distinct and meaningful).
  • .claude/skills/quickstart-validate/SKILL.md — describes the general pattern so future PRs touching query-telemetry semantics can drop a scripts/quickstart-validate-<topic>.sh without re-deriving how to drive the stack.

Test plan

  • CI green on the Dockerfile changes (the quickstart image isn't built by CI today, so this isn't directly exercised — falls to local verification or a future CI job).
  • ./scripts/quickstart.sh up succeeds on Apple-Silicon colima.
  • ./scripts/quickstart-validate-parent-query-id.sh reports all PASS.

Local first-time build still in progress at the time of opening this PR (vcpkg cold compile of arrow + grpc + opentelemetry-cpp on arm64 colima is the long pole, ~40+ min). Will report once it completes.

@JoshDreamland
Copy link
Copy Markdown
Contributor Author

Local-build status update: the cold first-time vcpkg compile on Apple-Silicon colima is taking significantly longer than I'd hoped — 51+ minutes in, vcpkg is still on grpc with 100/~150 expected packages built (active cc1 processes capped at 5 due to colima's default CPU allocation). I've stopped waiting on it for now; the PR is correct as-written, but I haven't observed an actual end-to-end PASS from the validation script.

If anyone has a faster local build setup or wants to attempt it on amd64, the steps are:

./scripts/quickstart.sh up   # one-time, slow
./scripts/quickstart-validate-parent-query-id.sh

The Dockerfile fixes are independently testable — anyone hitting either the overlay-ports error or the x64-linux-pic triplet failure on arm64 should now find this branch unblocks them.

@JoshDreamland
Copy link
Copy Markdown
Contributor Author

Image built and stack came up successfully — the Dockerfile fixes work end-to-end. But hit a separate, pre-existing bug that blocks the validation script from completing:

enqueued_events | exported_events | send_failures | last_error_text
            14  |               0 |             3 | DB::Exception: Checksum doesn't match: corrupted data.
                                                    Reference: 9217aa1bba09a1d98a58086e26544cb0.
                                                    Actual:    14e286dbbd46acb57c7a5cd56cac83a7.
                                                    Size of compressed block: 20

The bgworker is enqueueing events fine; clickhouse-cpp's LZ4-compressed batch fails ClickHouse-26.1's checksum validation. clickhouse_exporter.cc:175 sets SetCompressionMethod(CompressionMethod::LZ4) — likely needs to either drop to None or pin clickhouse-cpp to a version compatible with CH 26.1's block format.

Out of scope for this PR. Filing the validation outcome anyway:

Test 1 (top-level parent_query_id = 0): all PASS — confirms top-level parent linkage works at enqueue (events that were enqueued before the checksum failure had correct values).

Tests 2 & 3: can't run end-to-end against ClickHouse until the compression bug is resolved. The TAP test (t/028) in CI exercises the same invariants without the compression dependency, so coverage is intact.

The script and skill themselves are correct; they'll start passing once a clickhouse-cpp / ClickHouse-server compatibility fix lands.

JoshDreamland added a commit that referenced this pull request May 15, 2026
Earlier commits on this branch added parent_query_id to PschEvent and to
the ClickHouse-native exporter, plus the ClickHouse schema migration, but
arrow_batch.cc keeps its own hard-coded column list and was missed.  The
result: the production export path (OTel + Arrow IPC) silently dropped
parent_query_id on every event, so downstream consumers saw 0 across the
board.

Caught via the OTel/Arrow quickstart-validate harness (see PR #96).
Local script flips from 4/8 passing to all-green once this lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JoshDreamland JoshDreamland force-pushed the quickstart-validate branch from 90001c0 to 8186924 Compare May 15, 2026 14:52
@JoshDreamland JoshDreamland force-pushed the parent-query-id-surgical branch from 31eadfe to 18a6eae Compare May 15, 2026 15:06
@JoshDreamland JoshDreamland force-pushed the quickstart-validate branch from 8186924 to cc861b5 Compare May 15, 2026 15:20
@JoshDreamland JoshDreamland force-pushed the parent-query-id-surgical branch from 18a6eae to 85266f0 Compare May 15, 2026 15:22
@JoshDreamland JoshDreamland force-pushed the quickstart-validate branch from cc861b5 to 5affdf9 Compare May 15, 2026 15:26
@JoshDreamland JoshDreamland force-pushed the parent-query-id-surgical branch from 85266f0 to f52d76c Compare May 15, 2026 18:12
JoshDreamland and others added 8 commits May 15, 2026 14:12
…-ports

Two issues blocked `./scripts/quickstart.sh up` on macOS arm64 (colima /
orbstack) and presumably arm64 Linux too:

1. `vcpkg-configuration.json` lists `overlay-ports/` as an overlay
   ports source (added in #78 for the cityhash port), but the
   Dockerfile never copied that directory into the build context.  The
   first vcpkg invocation aborted with "Overlay path
   /build/pg_stat_ch/./overlay-ports must be an existing directory."

2. The cmake step hardcoded `VCPKG_TARGET_TRIPLET=x64-linux-pic`, which
   fails compiler detection inside an arm64 container (vcpkg tries to
   cross-target x64 from an aarch64 host without a cross toolchain
   present).  Switch to `uname -m` detection so the same Dockerfile
   builds against the host architecture on both x64 CI runners and
   arm64 dev boxes.

No BuildKit dependency — using `uname -m` instead of `$TARGETARCH`
keeps it working on legacy docker builds too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `scripts/quickstart-validate-parent-query-id.sh`, which drives the
quickstart Docker stack end-to-end against the same three invariants
the TAP test in `t/028_parent_query_id.pl` asserts:

  1. Top-level queries report parent_query_id = 0.
  2. Nested SPI queries report parent_query_id = outer's query_id.
  3. Error events captured inside a plpgsql EXCEPTION block carry a
     non-zero query_id (the running statement) and parent_query_id
     equal to the outer caller's query_id, distinct from each other.

The advantage over the TAP harness is iteration speed: once the
quickstart image is built (one-time vcpkg compile), the script runs in
a few seconds without needing a TAP-enabled local Postgres or a fresh
cmake build.

Also adds `.claude/skills/quickstart-validate/SKILL.md` describing the
general pattern.  Future PRs touching query-telemetry semantics can
drop a `scripts/quickstart-validate-<topic>.sh` alongside this one
without re-deriving how to drive the stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The extension's SQL file is now pg_stat_ch--0.3.sql plus the
0.1->0.3 migration, but the Dockerfile still hardcoded the long-gone
pg_stat_ch--0.1.sql path.  COPY the whole sql/ directory so version
bumps don't break the build again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The native ClickHouse export path triggers a checksum-mismatch on
ClickHouse 26.1 server (clickhouse-cpps LZ4 block format is incompatible
with the newer servers checksum logic).  Thats a separate, pre-existing
bug well chase down on its own.  For now, validate via the OTel/Arrow
path, which:

  - uses ZSTD via Arrow IPC, so the LZ4/checksum issue doesnt apply, AND
  - mirrors the export pathway we actually use in production.

Mechanically:
  - docker/quickstart/docker-compose.otel.yml runs the postgres container
    with use_otel=on, otel_arrow_passthrough=on, debug_arrow_dump_dir set
    to a host-mounted directory.  No real OTel collector is needed: the
    gRPC send fails (endpoint points at nothing), but MaybeDumpArrowBatch
    fires before the send, so IPC files still land.  Same trick as
    t/026_arrow_dump.pl.

  - scripts/quickstart-validate-parent-query-id.sh now drives queries
    against this stack, then parses the dumped Arrow IPC files with
    "uv run --with pyarrow" and asserts on parent_query_id, query_id,
    and err_sqlstate.

  - .claude/skills/quickstart-validate/SKILL.md documents the new
    requirements (uv) and the rationale for the path choice.

Requires uv on the host (brew install uv or pip install uv).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
setup-vcpkg/action.yml was doing a plain `git clone microsoft/vcpkg` with
no checkout, so every CI run landed on whatever master HEAD happened to
be that day.  That silently shifts port-version ABI hashes between runs,
invalidating the binary archive cache that the action set up two steps
later.  Result: every CI run paid the cold vcpkg-compile cost.

Align all four pinning points on tag 2026.04.27 (commit 56bb2411):

  - third_party/vcpkg submodule
  - vcpkg-configuration.json baseline
  - docker/postgres-ext.Dockerfile  ARG VCPKG_COMMIT
  - .github/actions/setup-vcpkg/action.yml (the actual fix — adds the
    `git checkout` that was missing, plus a `git fetch` for the case
    where the runner already has a cached vcpkg clone from a prior run)

We were previously at 2026.03.18-398 (mid-release); the tagged commit is
144 commits forward of that, so this is a small forward move to a
curated port-set, not a rewind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Arrow dict-encodes query_id and parent_query_id as decimal *strings*
("0" for top-level / no parent).  The inline pyarrow validator was
comparing the resulting Python strings against int 0, which is always
non-equal — making every top-level row look like it had a non-zero
parent.  Add an is_zero() helper for the "is this column zero?" checks.
Mirrors the same fix in t/029 over on the parent-query-id-surgical
branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… check

The Arrow dumps include the setup CREATE TABLE / INSERT / DROP utility
statements that mention pqid_outer_caller / pqid_inner_marker.  Those
rows have no parent_query_id link, so the "no orphan inner rows" check
was failing against them rather than against the actual nested SELECT.
Filter to db_operation == "SELECT" so we only assert on the rows the
test is actually about.  Mirrors the same fix applied in t/029 on the
parent-query-id-surgical branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old Test 3 drove a caught division_by_zero through a plpgsql
EXCEPTION block, expecting an err_sqlstate=22012 event to land in
the Arrow dump.  No such event ever arrives: emit_log_hook does not
fire from errfinish for ERROR-level events (errfinish PG_RE_THROWs
without calling EmitErrorReport).  For caught-in-EXCEPTION errors,
emit_log_hook never fires at all.  For uncaught errors it fires
from PostgresMain's top-level catch, after all PschQueryFrames have
been popped during unwinding — query_id then comes out as 0.

Switch to RAISE WARNING fired from inside a nested SPI call.
WARNING goes through EmitErrorReport directly, so our hook fires
while the inner SPI's frame is still on the stack — the only state
where CaptureLogEvent's slot choice is observable.  Mirrors the
same fix applied to t/028 and t/029 on the parent-query-id-surgical
branch.

The new assertions distinguish "inner SPI queryid" from "outer
caller queryid" so an off-by-one regression that attributes the
warning to the wrong slot would be caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants