Skip to content

Commit a2fe99f

Browse files
feat(kernel): wire CUJ-gap fixes — staging fail-loud, error context, sync cancel (#825)
* feat(kernel): wire CUJ-gap fixes — staging fail-loud, error context, sync cancel Connector half of the API+CUJ audit fixes (kernel half: databricks-sql-kernel PR #121). Bumps KERNEL_REV to pick up the kernel surface. Staging fail-loud (kernel/client.py): - Volume/staging PUT/GET/REMOVE silently no-op'd on the kernel path (KernelResultSet.is_staging_operation is always False, so the connector's _handle_staging_operation never fired and no file was transferred). Detect the leading verb in execute_command and raise NotSupportedError so ETL fails loud instead of ingesting stale data. Error context (_errors.py): - Forward display_message / diagnostic_info / error_details_json (now exposed across the pyo3 boundary in #121) onto the re-raised PEP-249 exception, and populate ServerOperationError.context with "diagnostic-info" (Spark stack trace) + "operation-id" — matching the Thrift backend so callers reading err.context work identically. Sync cancel wiring (client.py, kernel/client.py): - cursor.cancel() was a silent no-op for the default blocking execute() (active_command_id is None until execute returns). The kernel backend now registers a detached StatementCanceller (keyed by the cursor) before the blocking execute and exposes cancel_running_cursor(cursor). Cursor.cancel() routes to that hook via getattr when there's no command id yet — opt-in, so Thrift/SEA backends are unaffected. Tests: unit (staging fail-loud, _is_staging_statement, cancel registry + routing, error context/diagnostic-info forwarding) and e2e (tz-aware TIMESTAMP, scientific DECIMAL, staging NotSupportedError, diagnostic-info context, cross-thread sync cancel interrupts a running query). All e2e verified live against dogfood with use_kernel=True. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * fix(kernel): address review — comment-prefixed staging + tolerant sync cancel Review fixes on PR #825 (+ KERNEL_REV bump to the amended kernel #121 which now folds the original message on attach failure and bounds error_details_json): P1 #1 — staging fail-loud missed comment-prefixed statements: _is_staging_statement took the first whitespace token without stripping SQL comments, so "-- upload\nPUT ..." / "/* c */ PUT ..." (common in ETL) classified as non-staging and slipped into the silent-no-op bug. Added _strip_leading_sql_comments (handles leading -- line and /* */ block comments, multiple/mixed) before extracting the verb. Tests for both comment forms, mixed, and verb-only-in-comment (must NOT match). P1 #2 — sync cancel could raise out of cursor.cancel(): cursor.cancel() is best-effort per PEP-249, but cancel_running_cursor re-raised a canceller failure (e.g. an early cancel before the server statement id is observed, or a transport hiccup) via the public cancel(). Now swallow+log and still return True (a canceller was present and attempted) so Cursor doesn't emit the misleading "no executing command" warning. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> * chore: re-pin KERNEL_REV to merged kernel #121 (cbeaf44) #121 (tz-aware/scientific param binds, error context, sync cancel + Ctrl-C) is merged to kernel main. Re-pin from the orphaned branch HEAD (f62d941) to the merged squash SHA (cbeaf44) — content-identical, but reachable from main so no orphan-SHA risk. Verified against a wheel built from cbeaf44: connector unit + kernel e2e (tz-aware TIMESTAMP, scientific DECIMAL, staging fail-loud incl. comment-prefixed, diagnostic-info context) all pass. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> --------- Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
1 parent 614dd91 commit a2fe99f

7 files changed

Lines changed: 526 additions & 10 deletions

File tree

KERNEL_REV

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
101aa465e71991eec98102bba77aad2f7ad8faed
1+
cbeaf44c66ebc57c5b3eed2a5c5d0290d4bf035b

src/databricks/sql/backend/kernel/_errors.py

Lines changed: 23 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -97,7 +97,23 @@ def reraise_kernel_error(exc: "_kernel.KernelError") -> "Error":
9797
"""
9898
code = getattr(exc, "code", "Unknown")
9999
cls = _CODE_TO_EXCEPTION.get(code, DatabaseError)
100-
new = cls(getattr(exc, "message", str(exc)))
100+
101+
# For ServerOperationError, reproduce the Thrift backend's
102+
# ``context`` dict so callers that read
103+
# ``err.context["diagnostic-info"]`` (the Spark stack trace) /
104+
# ``err.context["operation-id"]`` get the same shape on the kernel
105+
# path. ``diagnostic_info`` is forwarded from the kernel error (it
106+
# now crosses the PyO3 boundary; older wheels return ``None`` via
107+
# ``getattr``, so this degrades gracefully). Matches
108+
# thrift_backend.py's ServerOperationError construction.
109+
context = None
110+
if cls is ServerOperationError:
111+
context = {
112+
"operation-id": getattr(exc, "query_id", None),
113+
"diagnostic-info": getattr(exc, "diagnostic_info", None),
114+
}
115+
new = cls(getattr(exc, "message", str(exc)), context)
116+
101117
for attr in (
102118
"code",
103119
"sql_state",
@@ -106,6 +122,12 @@ def reraise_kernel_error(exc: "_kernel.KernelError") -> "Error":
106122
"http_status",
107123
"retryable",
108124
"query_id",
125+
# Extended server status now forwarded across the PyO3 boundary
126+
# (kernel #121). ``getattr(..., None)`` keeps this forward-safe
127+
# against an older wheel that doesn't set these attrs.
128+
"display_message",
129+
"diagnostic_info",
130+
"error_details_json",
109131
):
110132
setattr(new, attr, getattr(exc, attr, None))
111133
return new

src/databricks/sql/backend/kernel/client.py

Lines changed: 132 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -71,6 +71,56 @@
7171
# per-request skip-and-warn.
7272
_KERNEL_MANAGED_HEADERS = frozenset({"authorization", "x-databricks-org-id"})
7373

74+
# Leading verbs of SQL volume/staging statements. Detected by the
75+
# leading token (case-insensitive) so the kernel backend can fail loud
76+
# on staging ops it can't service — see ``execute_command``.
77+
_STAGING_VERBS = ("PUT", "GET", "REMOVE")
78+
79+
80+
def _strip_leading_sql_comments(sql: str) -> str:
81+
"""Strip leading whitespace and SQL comments (``-- …`` line and
82+
``/* … */`` block, possibly several) from ``sql``, returning the
83+
remainder.
84+
85+
Needed so staging detection sees the real leading verb: a
86+
comment-prefixed staging op (``-- upload\\nPUT …`` or
87+
``/* c */ PUT …``, common in ETL scripts) must still be classified
88+
as staging, or it would slip past the guard into the silent-no-op
89+
bug. Block comments do not nest in Databricks SQL, so a simple
90+
scan-to-``*/`` is correct.
91+
"""
92+
i = 0
93+
n = len(sql)
94+
while i < n:
95+
if sql[i].isspace():
96+
i += 1
97+
elif sql.startswith("--", i):
98+
# Line comment: skip to end of line (or string).
99+
nl = sql.find("\n", i)
100+
i = n if nl == -1 else nl + 1
101+
elif sql.startswith("/*", i):
102+
# Block comment: skip to closing */ (or end if unterminated).
103+
close = sql.find("*/", i + 2)
104+
i = n if close == -1 else close + 2
105+
else:
106+
break
107+
return sql[i:]
108+
109+
110+
def _is_staging_statement(operation: str) -> bool:
111+
"""True iff ``operation`` is a volume/staging statement (PUT / GET /
112+
REMOVE).
113+
114+
Strips leading whitespace + SQL comments first (so a comment-
115+
prefixed staging op is still caught), then matches the leading token
116+
only — so a normal query that merely *contains* the word (e.g.
117+
``SELECT 'GET' AS x``) isn't misflagged.
118+
"""
119+
stripped = _strip_leading_sql_comments(operation)
120+
# First whitespace-delimited token, uppercased.
121+
verb = stripped.split(None, 1)[0].upper() if stripped.strip() else ""
122+
return verb in _STAGING_VERBS
123+
74124

75125
# ─── Client ─────────────────────────────────────────────────────────────────
76126

@@ -172,6 +222,17 @@ def __init__(
172222
# path. Same lock as ``_async_handles``.
173223
self._closed_commands: Set[str] = set()
174224
self._async_handles_lock = threading.RLock()
225+
# Sync-execute cancellers keyed by ``id(cursor)``. A blocking
226+
# ``execute()`` sets ``cursor.active_command_id`` only AFTER it
227+
# returns, so a concurrent ``cursor.cancel()`` (the documented
228+
# cross-thread PEP-249 shape) has no command id to target while
229+
# the query runs. We register a detached kernel
230+
# ``StatementCanceller`` here just before the blocking call and
231+
# drop it after; ``cancel_running_cursor`` (invoked by
232+
# ``Cursor.cancel`` when there's no command id yet) fires it.
233+
# Guarded by its own lock — cancel can race execute teardown.
234+
self._sync_cancellers: Dict[int, Any] = {}
235+
self._sync_cancellers_lock = threading.RLock()
175236

176237
# ── Session lifecycle ──────────────────────────────────────────
177238

@@ -354,6 +415,24 @@ def execute_command(
354415
# ``_async_statements`` and closed by ``close_command``); the sync
355416
# path drops it in finally. ``close_stmt`` is the post-success
356417
# decision flag — it stays True on sync, flips to False on async.
418+
# Volume/staging (PUT/GET/REMOVE) is not supported on the kernel
419+
# path: the kernel returns the staging control row as a normal
420+
# result set (``KernelResultSet.is_staging_operation`` is always
421+
# False), so the connector's ``_handle_staging_operation`` never
422+
# fires and NO file is transferred. Rather than silently no-op
423+
# (the Thrift path performs the presigned-URL upload/download),
424+
# fail loud at the call site so ETL scripts don't ingest
425+
# stale/missing data. Detected by the leading SQL verb — the
426+
# only signal available pre-execute, since the kernel exposes no
427+
# staging marker today.
428+
if _is_staging_statement(operation):
429+
raise NotSupportedError(
430+
"Volume / staging operations (PUT / GET / REMOVE) are not "
431+
"supported on the kernel backend (use_kernel=True); the file "
432+
"transfer would silently not happen. Use the Thrift backend "
433+
"for staging operations."
434+
)
435+
357436
close_stmt = True
358437
try:
359438
try:
@@ -382,10 +461,23 @@ def execute_command(
382461
self._async_statements[command_id.guid] = stmt
383462
close_stmt = False
384463
return None
464+
# Register a detached canceller BEFORE the blocking
465+
# execute so a concurrent ``cursor.cancel()`` can reach
466+
# the running statement (its server id is populated mid-
467+
# execute). Keyed by ``id(cursor)`` since no command id
468+
# exists yet. Dropped in the finally.
469+
try:
470+
with self._sync_cancellers_lock:
471+
self._sync_cancellers[id(cursor)] = stmt.canceller()
472+
except Exception:
473+
# Canceller is best-effort; never block execute on it.
474+
pass
385475
executed = stmt.execute()
386476
except Exception as exc:
387477
raise _wrap_kernel_exception("execute_command", exc) from exc
388478
finally:
479+
with self._sync_cancellers_lock:
480+
self._sync_cancellers.pop(id(cursor), None)
389481
if close_stmt:
390482
# Sync path: ``Statement`` is a lifecycle owner separate
391483
# from the executed handle. Drop it here so the parent
@@ -422,6 +514,46 @@ def cancel_command(self, command_id: CommandId) -> None:
422514
except Exception as exc:
423515
raise _wrap_kernel_exception("cancel_command", exc) from exc
424516

517+
def cancel_running_cursor(self, cursor: "Cursor") -> bool:
518+
"""Cancel an in-flight SYNC ``execute()`` on ``cursor``.
519+
520+
Invoked by ``Cursor.cancel()`` when ``active_command_id`` is
521+
still ``None`` — i.e. a blocking ``execute()`` hasn't returned,
522+
so the command id isn't set yet but the server statement may be
523+
running. Fires the detached ``StatementCanceller`` registered in
524+
``execute_command`` before the blocking call.
525+
526+
Returns ``True`` if a canceller was found and fired (the
527+
statement was in flight), ``False`` otherwise so ``Cursor`` can
528+
emit its "no executing command" warning. Safe to call from
529+
another thread.
530+
531+
Tolerant by design: ``cursor.cancel()`` is a best-effort
532+
PEP-249 method (callers don't expect it to raise), so a cancel
533+
failure is logged and swallowed rather than propagated. This
534+
also covers the early-cancel window — a cancel arriving before
535+
the kernel has observed the server statement id is a no-op in
536+
the kernel canceller, but if it ever raised (e.g. a transport
537+
hiccup on the cancel RPC) we must not surface that out of
538+
``cancel()``. We still return ``True`` (a canceller was present
539+
and we attempted it) so ``Cursor`` doesn't emit the misleading
540+
"no executing command" warning.
541+
"""
542+
with self._sync_cancellers_lock:
543+
canceller = self._sync_cancellers.get(id(cursor))
544+
if canceller is None:
545+
return False
546+
try:
547+
canceller.cancel()
548+
except Exception:
549+
logger.warning(
550+
"cancel_running_cursor: best-effort cancel of in-flight "
551+
"sync statement failed; swallowing (cursor.cancel() is "
552+
"tolerant by PEP-249 contract)",
553+
exc_info=True,
554+
)
555+
return True
556+
425557
def close_command(self, command_id: CommandId) -> None:
426558
with self._async_handles_lock:
427559
handle = self._async_handles.pop(command_id.guid, None)

src/databricks/sql/client.py

Lines changed: 16 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1704,11 +1704,22 @@ def cancel(self) -> None:
17041704
"""
17051705
if self.active_command_id is not None:
17061706
self.backend.cancel_command(self.active_command_id)
1707-
else:
1708-
logger.warning(
1709-
"Attempting to cancel a command, but there is no "
1710-
"currently executing command"
1711-
)
1707+
return
1708+
# No command id yet. A backend whose synchronous ``execute()``
1709+
# blocks without first publishing a command id (the kernel
1710+
# backend) may still have a server statement in flight. Such a
1711+
# backend exposes ``cancel_running_cursor(cursor)`` -> bool to
1712+
# cancel it; it returns True if something was actually
1713+
# cancelled. Opt-in via getattr so the Thrift / SEA backends
1714+
# (which set ``active_command_id`` before blocking) are
1715+
# unaffected.
1716+
cancel_running_cursor = getattr(self.backend, "cancel_running_cursor", None)
1717+
if cancel_running_cursor is not None and cancel_running_cursor(self):
1718+
return
1719+
logger.warning(
1720+
"Attempting to cancel a command, but there is no "
1721+
"currently executing command"
1722+
)
17121723

17131724
def close(self) -> None:
17141725
"""Close cursor"""

tests/e2e/test_kernel_backend.py

Lines changed: 118 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@
2424
import pytest
2525

2626
import databricks.sql as sql
27-
from databricks.sql.exc import DatabaseError
27+
from databricks.sql.exc import DatabaseError, NotSupportedError, ServerOperationError
2828

2929
# Skip the whole module unless the kernel wheel is genuinely installed.
3030
# ``pytest.importorskip`` alone isn't enough: the kernel unit tests inject a
@@ -361,3 +361,120 @@ def test_user_agent_entry_and_http_headers_round_trip(kernel_conn_params):
361361
with c.cursor() as cur:
362362
cur.execute("SELECT 1 AS n")
363363
assert cur.fetchall()[0][0] == 1
364+
365+
366+
# ── Parameter parity (tz-aware timestamp, scientific decimal) ──────
367+
368+
369+
def test_tz_aware_timestamp_parameter_binds(conn):
370+
"""A tz-aware datetime parameter (datetime with tzinfo) binds on
371+
the kernel path and resolves to the correct UTC instant. Previously
372+
rejected at bind on kernel; works on Thrift. (kernel #121)"""
373+
import datetime
374+
375+
tzdt = datetime.datetime(
376+
2026,
377+
5,
378+
15,
379+
18,
380+
0,
381+
0,
382+
tzinfo=datetime.timezone(datetime.timedelta(hours=5, minutes=30)),
383+
)
384+
with conn.cursor() as cur:
385+
cur.execute("SELECT ? AS ts", [tzdt])
386+
ts = cur.fetchall()[0][0]
387+
# 18:00 +05:30 == 12:30 UTC.
388+
assert (ts.hour, ts.minute) == (12, 30)
389+
390+
391+
def test_scientific_notation_decimal_parameter_binds(conn):
392+
"""A Decimal whose str() is exponential (e.g. 1E-7) binds on the
393+
kernel path. Previously rejected at bind; the server/Thrift accept
394+
scientific-notation decimal literals. (kernel #121)"""
395+
import decimal
396+
from databricks.sql.parameters.native import DecimalParameter
397+
398+
with conn.cursor() as cur:
399+
# 1E+2 == 100, round-trips cleanly at scale 0.
400+
cur.execute("SELECT ? AS d", [DecimalParameter(decimal.Decimal("1E+2"))])
401+
assert int(cur.fetchall()[0][0]) == 100
402+
403+
404+
# ── Staging / volume — fail loud, not silent no-op ────────────────
405+
406+
407+
def test_staging_put_raises_not_supported(conn):
408+
"""A PUT (volume/staging) statement fails loud on the kernel path
409+
rather than silently no-opping (which would make ETL ingest
410+
stale/missing data). (CUJ-gap audit)"""
411+
with conn.cursor() as cur:
412+
with pytest.raises(NotSupportedError, match="staging"):
413+
cur.execute("PUT '/tmp/x.csv' INTO '/Volumes/main/default/v/x.csv'")
414+
415+
416+
def test_comment_prefixed_staging_put_raises_not_supported(conn):
417+
"""A comment-prefixed staging op (common in ETL scripts) must also
418+
fail loud — the leading-verb detection strips SQL comments first, so
419+
it can't slip past into the silent-no-op bug (PR #825 review #1)."""
420+
with conn.cursor() as cur:
421+
with pytest.raises(NotSupportedError, match="staging"):
422+
cur.execute(
423+
"-- upload the daily extract\n"
424+
"PUT '/tmp/x.csv' INTO '/Volumes/main/default/v/x.csv'"
425+
)
426+
427+
428+
# ── Error fidelity — diagnostic-info reaches .context ─────────────
429+
430+
431+
def test_server_error_exposes_diagnostic_info_context(conn):
432+
"""A server-side query failure surfaces as ServerOperationError
433+
with the Spark diagnostic context in ``.context['diagnostic-info']``
434+
— Thrift parity (kernel #121 forwards diagnostic_info across pyo3;
435+
the connector populates .context)."""
436+
with conn.cursor() as cur:
437+
with pytest.raises(ServerOperationError) as exc_info:
438+
cur.execute("SELECT * FROM definitely_not_a_table_xyz_kernel_e2e")
439+
err = exc_info.value
440+
# context shape matches Thrift; diagnostic-info may be None if
441+
# the server didn't attach one, but the KEY must exist.
442+
assert "diagnostic-info" in err.context
443+
assert "operation-id" in err.context
444+
445+
446+
# ── Sync cancel (cursor.cancel() from another thread) ─────────────
447+
448+
449+
def test_sync_cancel_interrupts_blocking_execute(conn):
450+
"""cursor.cancel() from another thread cancels a long-running
451+
blocking execute() on the kernel path. Previously a silent no-op
452+
(active_command_id was None until execute returned). (kernel #121
453+
StatementCanceller + connector cancel_running_cursor wiring)"""
454+
import threading
455+
import time
456+
457+
cur = conn.cursor()
458+
459+
# The kernel publishes the server statement id once the initial
460+
# POST returns — within the server's default wait window (~10s).
461+
# Cancel after that so the canceller has an id to target; a cancel
462+
# before then is a no-op by design (id not yet known). Pick a query
463+
# that runs well past this so the cancel lands mid-flight.
464+
def cancel_after_delay():
465+
time.sleep(15.0)
466+
cur.cancel()
467+
468+
t = threading.Thread(target=cancel_after_delay)
469+
t.start()
470+
try:
471+
# Cancel should make execute() raise rather than run to
472+
# completion — proving the server-side statement was cancelled.
473+
with pytest.raises(Exception):
474+
cur.execute(
475+
"SELECT count(*) FROM range(0, 1000000000000) "
476+
"WHERE pow(rand(), 2) < 0.5 AND sqrt(id) > 1"
477+
)
478+
finally:
479+
t.join()
480+
cur.close()

0 commit comments

Comments
 (0)