test: local validation script + Dockerfile fixes for arm64 quickstart#96
test: local validation script + Dockerfile fixes for arm64 quickstart#96JoshDreamland wants to merge 8 commits into
Conversation
|
Local-build status update: the cold first-time vcpkg compile on Apple-Silicon colima is taking significantly longer than I'd hoped — 51+ minutes in, vcpkg is still on grpc with 100/~150 expected packages built (active cc1 processes capped at 5 due to colima's default CPU allocation). I've stopped waiting on it for now; the PR is correct as-written, but I haven't observed an actual end-to-end PASS from the validation script. If anyone has a faster local build setup or wants to attempt it on amd64, the steps are: ./scripts/quickstart.sh up # one-time, slow
./scripts/quickstart-validate-parent-query-id.shThe Dockerfile fixes are independently testable — anyone hitting either the overlay-ports error or the x64-linux-pic triplet failure on arm64 should now find this branch unblocks them. |
|
Image built and stack came up successfully — the Dockerfile fixes work end-to-end. But hit a separate, pre-existing bug that blocks the validation script from completing: The bgworker is enqueueing events fine; clickhouse-cpp's LZ4-compressed batch fails ClickHouse-26.1's checksum validation. clickhouse_exporter.cc:175 sets Out of scope for this PR. Filing the validation outcome anyway: Test 1 (top-level parent_query_id = 0): all PASS — confirms top-level parent linkage works at enqueue (events that were enqueued before the checksum failure had correct values). Tests 2 & 3: can't run end-to-end against ClickHouse until the compression bug is resolved. The TAP test ( The script and skill themselves are correct; they'll start passing once a clickhouse-cpp / ClickHouse-server compatibility fix lands. |
Earlier commits on this branch added parent_query_id to PschEvent and to the ClickHouse-native exporter, plus the ClickHouse schema migration, but arrow_batch.cc keeps its own hard-coded column list and was missed. The result: the production export path (OTel + Arrow IPC) silently dropped parent_query_id on every event, so downstream consumers saw 0 across the board. Caught via the OTel/Arrow quickstart-validate harness (see PR #96). Local script flips from 4/8 passing to all-green once this lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
90001c0 to
8186924
Compare
31eadfe to
18a6eae
Compare
8186924 to
cc861b5
Compare
18a6eae to
85266f0
Compare
cc861b5 to
5affdf9
Compare
85266f0 to
f52d76c
Compare
…-ports Two issues blocked `./scripts/quickstart.sh up` on macOS arm64 (colima / orbstack) and presumably arm64 Linux too: 1. `vcpkg-configuration.json` lists `overlay-ports/` as an overlay ports source (added in #78 for the cityhash port), but the Dockerfile never copied that directory into the build context. The first vcpkg invocation aborted with "Overlay path /build/pg_stat_ch/./overlay-ports must be an existing directory." 2. The cmake step hardcoded `VCPKG_TARGET_TRIPLET=x64-linux-pic`, which fails compiler detection inside an arm64 container (vcpkg tries to cross-target x64 from an aarch64 host without a cross toolchain present). Switch to `uname -m` detection so the same Dockerfile builds against the host architecture on both x64 CI runners and arm64 dev boxes. No BuildKit dependency — using `uname -m` instead of `$TARGETARCH` keeps it working on legacy docker builds too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `scripts/quickstart-validate-parent-query-id.sh`, which drives the
quickstart Docker stack end-to-end against the same three invariants
the TAP test in `t/028_parent_query_id.pl` asserts:
1. Top-level queries report parent_query_id = 0.
2. Nested SPI queries report parent_query_id = outer's query_id.
3. Error events captured inside a plpgsql EXCEPTION block carry a
non-zero query_id (the running statement) and parent_query_id
equal to the outer caller's query_id, distinct from each other.
The advantage over the TAP harness is iteration speed: once the
quickstart image is built (one-time vcpkg compile), the script runs in
a few seconds without needing a TAP-enabled local Postgres or a fresh
cmake build.
Also adds `.claude/skills/quickstart-validate/SKILL.md` describing the
general pattern. Future PRs touching query-telemetry semantics can
drop a `scripts/quickstart-validate-<topic>.sh` alongside this one
without re-deriving how to drive the stack.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The extension's SQL file is now pg_stat_ch--0.3.sql plus the 0.1->0.3 migration, but the Dockerfile still hardcoded the long-gone pg_stat_ch--0.1.sql path. COPY the whole sql/ directory so version bumps don't break the build again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The native ClickHouse export path triggers a checksum-mismatch on
ClickHouse 26.1 server (clickhouse-cpps LZ4 block format is incompatible
with the newer servers checksum logic). Thats a separate, pre-existing
bug well chase down on its own. For now, validate via the OTel/Arrow
path, which:
- uses ZSTD via Arrow IPC, so the LZ4/checksum issue doesnt apply, AND
- mirrors the export pathway we actually use in production.
Mechanically:
- docker/quickstart/docker-compose.otel.yml runs the postgres container
with use_otel=on, otel_arrow_passthrough=on, debug_arrow_dump_dir set
to a host-mounted directory. No real OTel collector is needed: the
gRPC send fails (endpoint points at nothing), but MaybeDumpArrowBatch
fires before the send, so IPC files still land. Same trick as
t/026_arrow_dump.pl.
- scripts/quickstart-validate-parent-query-id.sh now drives queries
against this stack, then parses the dumped Arrow IPC files with
"uv run --with pyarrow" and asserts on parent_query_id, query_id,
and err_sqlstate.
- .claude/skills/quickstart-validate/SKILL.md documents the new
requirements (uv) and the rationale for the path choice.
Requires uv on the host (brew install uv or pip install uv).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
setup-vcpkg/action.yml was doing a plain `git clone microsoft/vcpkg` with
no checkout, so every CI run landed on whatever master HEAD happened to
be that day. That silently shifts port-version ABI hashes between runs,
invalidating the binary archive cache that the action set up two steps
later. Result: every CI run paid the cold vcpkg-compile cost.
Align all four pinning points on tag 2026.04.27 (commit 56bb2411):
- third_party/vcpkg submodule
- vcpkg-configuration.json baseline
- docker/postgres-ext.Dockerfile ARG VCPKG_COMMIT
- .github/actions/setup-vcpkg/action.yml (the actual fix — adds the
`git checkout` that was missing, plus a `git fetch` for the case
where the runner already has a cached vcpkg clone from a prior run)
We were previously at 2026.03.18-398 (mid-release); the tagged commit is
144 commits forward of that, so this is a small forward move to a
curated port-set, not a rewind.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Arrow dict-encodes query_id and parent_query_id as decimal *strings*
("0" for top-level / no parent). The inline pyarrow validator was
comparing the resulting Python strings against int 0, which is always
non-equal — making every top-level row look like it had a non-zero
parent. Add an is_zero() helper for the "is this column zero?" checks.
Mirrors the same fix in t/029 over on the parent-query-id-surgical
branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… check The Arrow dumps include the setup CREATE TABLE / INSERT / DROP utility statements that mention pqid_outer_caller / pqid_inner_marker. Those rows have no parent_query_id link, so the "no orphan inner rows" check was failing against them rather than against the actual nested SELECT. Filter to db_operation == "SELECT" so we only assert on the rows the test is actually about. Mirrors the same fix applied in t/029 on the parent-query-id-surgical branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old Test 3 drove a caught division_by_zero through a plpgsql EXCEPTION block, expecting an err_sqlstate=22012 event to land in the Arrow dump. No such event ever arrives: emit_log_hook does not fire from errfinish for ERROR-level events (errfinish PG_RE_THROWs without calling EmitErrorReport). For caught-in-EXCEPTION errors, emit_log_hook never fires at all. For uncaught errors it fires from PostgresMain's top-level catch, after all PschQueryFrames have been popped during unwinding — query_id then comes out as 0. Switch to RAISE WARNING fired from inside a nested SPI call. WARNING goes through EmitErrorReport directly, so our hook fires while the inner SPI's frame is still on the stack — the only state where CaptureLogEvent's slot choice is observable. Mirrors the same fix applied to t/028 and t/029 on the parent-query-id-surgical branch. The new assertions distinguish "inner SPI queryid" from "outer caller queryid" so an off-by-one regression that attributes the warning to the wrong slot would be caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5affdf9 to
2992779
Compare
Adds a way to validate pg_stat_ch PRs end-to-end against the existing quickstart Docker stack, plus the Dockerfile fixes that were blocking that stack from building locally on Apple-Silicon dev boxes.
Stacked on top of #95 (parent_query_id) because the validation script asserts on parent_query_id behavior; retarget to main after #95 merges.
What's here
fix(docker): make quickstart image build on arm64 and pick up overlay-ports(23309ae)Two real bugs in
docker/postgres-ext.Dockerfile:vcpkg-configuration.jsonlistsoverlay-ports/as a vcpkg overlay source (added in Add Arrow IPC batch builder and export path #78 for cityhash), but the Dockerfile never copied that directory into the build context. The first vcpkg invocation aborted with "Overlay path /build/pg_stat_ch/./overlay-ports must be an existing directory."VCPKG_TARGET_TRIPLET=x64-linux-pic, which fails compiler detection inside an arm64 container (vcpkg tries to cross-target x64 from an aarch64 host without a cross toolchain). Switch touname -mdetection so the same Dockerfile builds against the host architecture on both x64 CI runners and arm64 dev boxes. No BuildKit dependency.test: local validation script + skill for parent_query_id(90001c0)scripts/quickstart-validate-parent-query-id.sh— drives the quickstart stack end-to-end against the same three invariantst/028_parent_query_id.plasserts on (top-level parent=0, nested SPI parent=outer, error event's queryid + parent_query_id distinct and meaningful)..claude/skills/quickstart-validate/SKILL.md— describes the general pattern so future PRs touching query-telemetry semantics can drop ascripts/quickstart-validate-<topic>.shwithout re-deriving how to drive the stack.Test plan
./scripts/quickstart.sh upsucceeds on Apple-Silicon colima../scripts/quickstart-validate-parent-query-id.shreports all PASS.Local first-time build still in progress at the time of opening this PR (vcpkg cold compile of arrow + grpc + opentelemetry-cpp on arm64 colima is the long pole, ~40+ min). Will report once it completes.