diff --git a/.claude/skills/quickstart-validate/SKILL.md b/.claude/skills/quickstart-validate/SKILL.md new file mode 100644 index 0000000..e168fda --- /dev/null +++ b/.claude/skills/quickstart-validate/SKILL.md @@ -0,0 +1,88 @@ +--- +name: quickstart-validate +description: Validate pg_stat_ch behavior end-to-end against the local quickstart Docker stack via the OTel/Arrow export path (with debug_arrow_dump_dir capturing Arrow IPC batches). Use when a change affects emitted event content (parent_query_id, cmd_type, err_sqlstate, buffer usage, etc.) and you want to confirm it round-trips through the production export pathway before pushing — i.e., the kind of verification the TAP test gives you but without needing a TAP-enabled local Postgres build. +--- + +# Quickstart-based PR validation (OTel/Arrow path) + +Drive pg_stat_ch through its production export path (OTel + Arrow IPC) against the local quickstart Postgres container, capture Arrow batches via `debug_arrow_dump_dir`, and assert on the contents with pyarrow. + +## Steps + +1. **Install `uv`** (one-time, for inline pyarrow): + ```bash + brew install uv # macOS + pip install uv # otherwise + ``` + +2. **Bring up the OTel/Arrow quickstart**: + ```bash + ./scripts/quickstart.sh up # builds image once, ~10-15 min cold + docker compose -f docker/quickstart/docker-compose.otel.yml up -d --wait # arrow-dump variant + ``` + The `docker-compose.otel.yml` postgres container is configured with + `use_otel=on`, `otel_arrow_passthrough=on`, and + `debug_arrow_dump_dir=/var/lib/postgresql/arrow-dump` (host-mounted to + `docker/quickstart/arrow-dump/`). The `otel_endpoint` points at a + non-existent collector — the gRPC send fails, but + `MaybeDumpArrowBatch()` writes the IPC file *before* the send, so we + still get fully-formed Arrow batches on the host. + +3. **Pick the validation script for the topic under test.** Convention: + - `scripts/quickstart-validate-parent-query-id.sh` — exercises parent_query_id linkage and the off-by-one in CaptureLogEvent. + - …add more as new PRs need them. + + If none matches, write one (see "Adding a new script" below). + +4. **Run it**: + ```bash + ./scripts/quickstart-validate-parent-query-id.sh + ``` + Auto-creates fixtures, runs queries, flushes, parses the Arrow IPC dumps with pyarrow, prints per-assertion PASS/FAIL, exits non-zero on any failure. + +5. **On failure, inspect the dumps directly**: + ```bash + ls docker/quickstart/arrow-dump/*.ipc + uv run --with pyarrow python -c " + import pyarrow.ipc, sys + with open(sys.argv[1], 'rb') as f: + t = pyarrow.ipc.open_stream(f).read_all() + print(t.to_pandas().to_string()) + " docker/quickstart/arrow-dump/.ipc + ``` + +6. **Tear down when done** (optional): + ```bash + docker compose -f docker/quickstart/docker-compose.otel.yml down + ``` + +## Why the OTel/Arrow path and not the native ClickHouse path + +`docker/quickstart/docker-compose.yml` (the ClickHouse-native variant) +currently can't be used end-to-end: clickhouse-cpp's LZ4-compressed +block format trips a checksum-mismatch on ClickHouse 26.1 server +(separate pre-existing bug, not pg_stat_ch's fault). The OTel/Arrow +path uses ZSTD via Arrow IPC and is unaffected, plus it mirrors the +export pathway actually used in production. + +## When to use this skill vs alternatives + +- **Quickstart-validate** (this skill): cheap, fast iteration on event-content semantics via the production export path. Best for "did my change produce the right rows downstream?" +- **`./scripts/run-tests.sh 18 regress`**: PG regression suite. Best for SQL-level functionality (does `pg_stat_ch_stats()` return the right shape, does the extension load, etc.). Doesn't exercise the export side. +- **`./scripts/run-tests.sh ../postgres/install_tap tap`**: TAP harness, which CI uses. Most thorough but requires a TAP-enabled local Postgres and a fresh local build. + +## Adding a new script + +Use `scripts/quickstart-validate-parent-query-id.sh` as a template. Each script should: + +1. `ensure_stack_up`: idempotently bring up the OTel/Arrow quickstart compose if it isn't running. +2. **Set up fixtures** with `pg_exec <> "$GITHUB_ENV" echo "$HOME/vcpkg" >> "$GITHUB_PATH" diff --git a/docker/postgres-ext.Dockerfile b/docker/postgres-ext.Dockerfile index 6dac61e..c1c3a7a 100644 --- a/docker/postgres-ext.Dockerfile +++ b/docker/postgres-ext.Dockerfile @@ -22,7 +22,8 @@ RUN apt-get update && apt-get install -y curl ca-certificates gnupg \ # Install vcpkg # Pin to the same commit as the vcpkg submodule for reproducible builds -ARG VCPKG_COMMIT=12159785447291b4069c82a3fe9c2770a393ac7f +# (vcpkg tag 2026.04.27). +ARG VCPKG_COMMIT=56bb2411609227288b70117ead2c47585ba07713 RUN git clone https://github.com/microsoft/vcpkg.git /opt/vcpkg \ && git -C /opt/vcpkg checkout "$VCPKG_COMMIT" \ && /opt/vcpkg/bootstrap-vcpkg.sh -disableMetrics @@ -34,6 +35,7 @@ WORKDIR /build/pg_stat_ch # Copy dependency manifests first for layer caching COPY vcpkg.json vcpkg-configuration.json ./ COPY triplets/ triplets/ +COPY overlay-ports/ overlay-ports/ COPY CMakeLists.txt ./ COPY cmake/ cmake/ COPY include/ include/ @@ -41,11 +43,15 @@ COPY src/ src/ COPY sql/ sql/ COPY pg_stat_ch.control ./ -RUN cmake -B build -G Ninja \ - -DCMAKE_BUILD_TYPE=RelWithDebInfo \ - -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake \ - -DVCPKG_TARGET_TRIPLET=x64-linux-pic \ - -DVCPKG_OVERLAY_TRIPLETS=/build/pg_stat_ch/triplets \ +# Pick the vcpkg triplet that matches the host architecture so the same +# Dockerfile builds on x64 CI runners and on arm64 dev boxes (Apple Silicon +# under colima/orbstack/etc.). +RUN TRIPLET="$([ "$(uname -m)" = "aarch64" ] && echo arm64-linux-pic || echo x64-linux-pic)" \ + && cmake -B build -G Ninja \ + -DCMAKE_BUILD_TYPE=RelWithDebInfo \ + -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake \ + -DVCPKG_TARGET_TRIPLET="$TRIPLET" \ + -DVCPKG_OVERLAY_TRIPLETS=/build/pg_stat_ch/triplets \ && cmake --build build --parallel $(nproc) FROM postgres:18-bookworm @@ -56,7 +62,7 @@ RUN apt-get update && apt-get install -y \ && rm -rf /var/lib/apt/lists/* COPY --from=builder /build/pg_stat_ch/build/pg_stat_ch.so /usr/lib/postgresql/18/lib/ -COPY --from=builder /build/pg_stat_ch/sql/pg_stat_ch--0.1.sql /usr/share/postgresql/18/extension/ +COPY --from=builder /build/pg_stat_ch/sql/ /usr/share/postgresql/18/extension/ COPY --from=builder /build/pg_stat_ch/pg_stat_ch.control /usr/share/postgresql/18/extension/ HEALTHCHECK --interval=10s --timeout=5s --start-period=30s --retries=3 \ diff --git a/docker/quickstart/arrow-dump/.gitignore b/docker/quickstart/arrow-dump/.gitignore new file mode 100644 index 0000000..d6b7ef3 --- /dev/null +++ b/docker/quickstart/arrow-dump/.gitignore @@ -0,0 +1,2 @@ +* +!.gitignore diff --git a/docker/quickstart/docker-compose.otel.yml b/docker/quickstart/docker-compose.otel.yml new file mode 100644 index 0000000..b08ac85 --- /dev/null +++ b/docker/quickstart/docker-compose.otel.yml @@ -0,0 +1,44 @@ +services: + # Postgres + pg_stat_ch configured to use the OTel/Arrow export path with + # debug_arrow_dump_dir set, so each Arrow IPC batch is written to a + # host-mounted directory. No real OTel collector is needed for validation: + # the gRPC send fails (otel_endpoint points at a non-existent collector), + # but MaybeDumpArrowBatch() runs *before* the send, so IPC files still land. + # See t/026_arrow_dump.pl for the same trick at the TAP layer. + postgres: + image: quickstart-postgres:latest + pull_policy: never + container_name: psch-quickstart-pg-otel + ports: + - "55432:5432" + environment: + POSTGRES_USER: postgres + POSTGRES_PASSWORD: postgres + POSTGRES_DB: postgres + command: + - "postgres" + - "-c" + - "shared_preload_libraries=pg_stat_ch" + - "-c" + - "pg_stat_ch.enabled=on" + - "-c" + - "pg_stat_ch.use_otel=on" + - "-c" + - "pg_stat_ch.otel_arrow_passthrough=on" + - "-c" + - "pg_stat_ch.otel_endpoint=localhost:14317" + - "-c" + - "pg_stat_ch.debug_arrow_dump_dir=/var/lib/postgresql/arrow-dump" + - "-c" + - "pg_stat_ch.flush_interval_ms=250" + - "-c" + - "pg_stat_ch.batch_max=1000" + - "-c" + - "track_io_timing=on" + volumes: + - ./arrow-dump:/var/lib/postgresql/arrow-dump + healthcheck: + test: ["CMD-SHELL", "pg_isready -U postgres"] + interval: 2s + timeout: 5s + retries: 10 diff --git a/scripts/quickstart-validate-parent-query-id.sh b/scripts/quickstart-validate-parent-query-id.sh new file mode 100755 index 0000000..e6aec91 --- /dev/null +++ b/scripts/quickstart-validate-parent-query-id.sh @@ -0,0 +1,264 @@ +#!/usr/bin/env bash +# End-to-end local validation for parent_query_id linkage. +# +# Drives pg_stat_ch through the OTel/Arrow export path (the production +# pathway) with debug_arrow_dump_dir set so each Arrow IPC batch is +# written to a host-mounted directory. We then parse the IPC files with +# pyarrow (via uv) and assert that: +# +# 1. Top-level queries report parent_query_id = 0. +# 2. Nested SPI queries report parent_query_id = outer's query_id. +# 3. Log events captured by emit_log_hook while a nested SPI query is +# on the stack carry query_id of the running (inner) statement and +# parent_query_id of the outer caller (catches CaptureLogEvent's +# off-by-one). We use RAISE WARNING for this rather than a caught +# ERROR: emit_log_hook does not fire from errfinish for ERROR-level +# events (errfinish PG_RE_THROWs before EmitErrorReport runs; +# emit_log_hook only fires later from PostgresMain's top-level +# catch, after all frames have been popped — or never, for errors +# caught inside a plpgsql EXCEPTION block). WARNING goes through +# EmitErrorReport directly so the frame stack is intact when our +# hook runs. +# +# Why not the native ClickHouse path? clickhouse-cpp's LZ4-compressed +# block format currently triggers a checksum-mismatch on ClickHouse 26.1 +# server (separate, pre-existing bug). The OTel/Arrow path uses ZSTD via +# Arrow IPC and is unaffected; it also mirrors what we ship in prod. +# +# Requirements: docker compose, uv (`brew install uv` / `pip install uv`). +# First-time `up` builds the postgres image (~10-15 min cold; cached afterwards). + +set -euo pipefail + +COMPOSE_FILE="docker/quickstart/docker-compose.otel.yml" +DUMP_DIR="docker/quickstart/arrow-dump" +PASSED=0 +FAILED=0 + +if ! command -v uv >/dev/null 2>&1; then + echo "error: uv not found in PATH. Install with 'brew install uv' or 'pip install uv'." >&2 + exit 1 +fi + +pg_exec() { + docker compose -f "$COMPOSE_FILE" exec -T postgres \ + psql -U postgres -d postgres -v ON_ERROR_STOP=1 "$@" +} + +ensure_stack_up() { + if ! docker compose -f "$COMPOSE_FILE" ps --status running 2>/dev/null | grep -q postgres; then + echo "OTel/Arrow quickstart stack not running. Bringing it up..." + # Reuse the image built by ./scripts/quickstart.sh up + if ! docker image inspect quickstart-postgres:latest >/dev/null 2>&1; then + echo " Building image first (one-time, ~10-15 min cold)..." + docker compose -f docker/quickstart/docker-compose.yml build postgres + fi + docker compose -f "$COMPOSE_FILE" up -d --wait + fi +} + +expect() { + local name="$1" expected="$2" actual="$3" comparator="${4:-=}" + local ok=0 + case "$comparator" in + '=') [[ "$actual" == "$expected" ]] && ok=1 ;; + '>=') [[ "$actual" -ge "$expected" ]] && ok=1 ;; + esac + if [[ "$ok" -eq 1 ]]; then + printf ' \033[32mPASS\033[0m %s\n' "$name" + PASSED=$((PASSED + 1)) + else + printf ' \033[31mFAIL\033[0m %s (expected %s %s, got %s)\n' \ + "$name" "$comparator" "$expected" "$actual" + FAILED=$((FAILED + 1)) + fi +} + +ensure_stack_up + +echo "Resetting state and setting up fixtures..." +rm -f "$DUMP_DIR"/*.ipc 2>/dev/null || true + +pg_exec <<'SQL' >/dev/null +CREATE EXTENSION IF NOT EXISTS pg_stat_ch; +SELECT pg_stat_ch_reset(); + +DROP TABLE IF EXISTS pqid_top_marker CASCADE; +DROP TABLE IF EXISTS pqid_inner_marker CASCADE; +DROP TABLE IF EXISTS pqid_warn_tbl CASCADE; +DROP FUNCTION IF EXISTS pqid_outer_caller(); +DROP FUNCTION IF EXISTS pqid_warn_outer(); +DROP FUNCTION IF EXISTS pqid_emit_warn(int); + +CREATE TABLE pqid_top_marker(x int); +CREATE TABLE pqid_inner_marker(x int); +INSERT INTO pqid_inner_marker VALUES (1); +CREATE TABLE pqid_warn_tbl(x int); +INSERT INTO pqid_warn_tbl VALUES (1); + +CREATE FUNCTION pqid_outer_caller() RETURNS int +LANGUAGE plpgsql AS $$ +DECLARE v int; +BEGIN + SELECT x INTO v FROM pqid_inner_marker; + RETURN v; +END$$; + +CREATE FUNCTION pqid_emit_warn(x int) RETURNS int +LANGUAGE plpgsql AS $$ +BEGIN + RAISE WARNING 'pqid_warn_marker' USING ERRCODE = '01001'; + RETURN x; +END$$; + +CREATE FUNCTION pqid_warn_outer() RETURNS int +LANGUAGE plpgsql AS $$ +DECLARE v int; +BEGIN + SELECT pqid_emit_warn(x) INTO v FROM pqid_warn_tbl; + RETURN v; +END$$; +SQL + +# Clear dumps generated by setup, then drive the test queries. +rm -f "$DUMP_DIR"/*.ipc 2>/dev/null || true +pg_exec -c "SELECT pg_stat_ch_reset();" >/dev/null + +echo "Driving test queries..." +pg_exec <<'SQL' >/dev/null +SELECT * FROM pqid_top_marker; +SELECT count(*) FROM pqid_top_marker; +SELECT pqid_outer_caller(); +SELECT pqid_warn_outer(); +SELECT pg_stat_ch_flush(); +SQL + +# Wait for IPC files to land. +echo "Waiting for Arrow IPC dumps..." +for _ in $(seq 1 30); do + if compgen -G "$DUMP_DIR"/*.ipc >/dev/null; then break; fi + sleep 1 +done + +# Parse the dumped Arrow IPC batches and run assertions. We emit one +# `KEY=VALUE` line per assertion result from the inline pyarrow script so +# the shell can pick them up without re-implementing Arrow parsing. +results=$(uv run --quiet --with pyarrow --python 3.12 - "$DUMP_DIR" <<'PY' +import glob, os, sys +import pyarrow as pa +import pyarrow.ipc as ipc + +dump_dir = sys.argv[1] +rows = [] +for path in sorted(glob.glob(os.path.join(dump_dir, "*.ipc"))): + with open(path, "rb") as f: + reader = ipc.open_stream(f) + table = reader.read_all() + for batch_dict in table.to_pylist(): + rows.append(batch_dict) + +def has(text, predicate=None): + out = [] + for r in rows: + q = r.get("query_text") or "" + if text in q and (predicate is None or predicate(r)): + out.append(r) + return out + +def by_query_id(qid): + return [r for r in rows if r.get("query_id") == qid] + +# Arrow dict-encodes query_id and parent_query_id as decimal *strings*; "0" +# means top-level / no parent. Compare against strings, not ints. +def is_zero(v): + return v is None or v == "" or v == "0" + +# Test 1: top-level marker queries report parent_query_id == 0 +top_rows = has("pqid_top_marker") +top_with_nonzero_parent = [r for r in top_rows if not is_zero(r.get("parent_query_id"))] +print(f"top_rows={len(top_rows)}") +print(f"top_nonzero_parent={len(top_with_nonzero_parent)}") + +# Test 2: nested SPI inner.parent_query_id matches outer.query_id. Filter +# to SELECT so the setup CREATE TABLE / INSERT / DROP utility statements +# (which mention the table) don't pollute the "no orphans" check. +outer_rows = [r for r in has("pqid_outer_caller") if r.get("db_operation") == "SELECT"] +inner_rows = [r for r in has("pqid_inner_marker") if r.get("db_operation") == "SELECT"] +linked = 0 +for inner in inner_rows: + p = inner.get("parent_query_id") + if not is_zero(p) and any(o.get("query_id") == p for o in outer_rows): + linked += 1 +inner_orphans = [r for r in inner_rows if is_zero(r.get("parent_query_id"))] +outer_self_parent = [r for r in outer_rows if not is_zero(r.get("parent_query_id"))] +print(f"outer_rows={len(outer_rows)}") +print(f"inner_rows={len(inner_rows)}") +print(f"inner_linked_to_outer={linked}") +print(f"inner_with_zero_parent={len(inner_orphans)}") +print(f"outer_with_nonzero_parent={len(outer_self_parent)}") + +# Test 3: log event captured inside nested SPI (via RAISE WARNING). +# The event has no query_text (CaptureLogEvent leaves it empty), so we +# identify it by err_sqlstate '01001' (our RAISE WARNING's custom code). +warn_outer = [r for r in has("pqid_warn_outer") if r.get("db_operation") == "SELECT"] +warn_inner = [r for r in has("pqid_emit_warn") if r.get("db_operation") == "SELECT"] +warn_outer_qids = {r.get("query_id") for r in warn_outer if not is_zero(r.get("query_id"))} +warn_inner_qids = {r.get("query_id") for r in warn_inner if not is_zero(r.get("query_id"))} +warn_rows = [r for r in rows if r.get("err_sqlstate") == "01001"] +warn_linked_to_outer = sum(1 for r in warn_rows + if r.get("parent_query_id") in warn_outer_qids) +warn_qid_is_inner = sum(1 for r in warn_rows if r.get("query_id") in warn_inner_qids) +warn_qid_is_outer = sum(1 for r in warn_rows if r.get("query_id") in warn_outer_qids) +warn_self_parent = sum(1 for r in warn_rows + if not is_zero(r.get("query_id")) + and r.get("query_id") == r.get("parent_query_id")) +print(f"warn_rows={len(warn_rows)}") +print(f"warn_linked_to_outer={warn_linked_to_outer}") +print(f"warn_qid_is_inner={warn_qid_is_inner}") +print(f"warn_qid_is_outer={warn_qid_is_outer}") +print(f"warn_self_parent={warn_self_parent}") +PY +) + +echo "$results" +echo + +# Pull individual values out. +get() { echo "$results" | awk -F= -v k="$1" '$1==k {print $2}'; } + +echo "Test 1: top-level queries report parent_query_id = 0" +expect "top-level rows landed" 1 "$(get top_rows)" '>=' +expect "top-level parent_query_id = 0" 0 "$(get top_nonzero_parent)" + +echo +echo "Test 2: nested SPI parent_query_id links to outer query_id" +expect "outer rows landed" 1 "$(get outer_rows)" '>=' +expect "inner rows landed" 1 "$(get inner_rows)" '>=' +expect "inner row joins outer via parent_query_id" 1 "$(get inner_linked_to_outer)" '>=' +expect "nested SPI is not reported as top-level" 0 "$(get inner_with_zero_parent)" +expect "outer call still reports parent_query_id = 0" 0 "$(get outer_with_nonzero_parent)" + +echo +echo "Test 3: log event inside nested SPI" +expect "RAISE WARNING log event landed" 1 "$(get warn_rows)" '>=' +expect "log event parent_query_id = outer's query_id" 1 "$(get warn_linked_to_outer)" '>=' +expect "log event query_id = inner SPI (the running)" 1 "$(get warn_qid_is_inner)" '>=' +expect "log event query_id is NOT outer (no off-by-one)" 0 "$(get warn_qid_is_outer)" +expect "log event query_id != parent_query_id" 0 "$(get warn_self_parent)" + +echo +echo "Cleaning up fixtures..." +pg_exec <<'SQL' >/dev/null +DROP FUNCTION IF EXISTS pqid_outer_caller(); +DROP FUNCTION IF EXISTS pqid_warn_outer(); +DROP FUNCTION IF EXISTS pqid_emit_warn(int); +DROP TABLE IF EXISTS pqid_top_marker; +DROP TABLE IF EXISTS pqid_inner_marker; +DROP TABLE IF EXISTS pqid_warn_tbl; +SQL + +echo +echo "===================================================" +printf "Result: \033[32m%d passed\033[0m, \033[31m%d failed\033[0m\n" "$PASSED" "$FAILED" +echo "===================================================" +[[ "$FAILED" -eq 0 ]] diff --git a/third_party/vcpkg b/third_party/vcpkg index 1215978..56bb241 160000 --- a/third_party/vcpkg +++ b/third_party/vcpkg @@ -1 +1 @@ -Subproject commit 12159785447291b4069c82a3fe9c2770a393ac7f +Subproject commit 56bb2411609227288b70117ead2c47585ba07713 diff --git a/vcpkg-configuration.json b/vcpkg-configuration.json index 2103a7c..fc9dc4f 100644 --- a/vcpkg-configuration.json +++ b/vcpkg-configuration.json @@ -1,7 +1,7 @@ { "default-registry": { "kind": "builtin", - "baseline": "12159785447291b4069c82a3fe9c2770a393ac7f" + "baseline": "56bb2411609227288b70117ead2c47585ba07713" }, "overlay-triplets": ["./triplets"], "overlay-ports": ["./overlay-ports"]