Skip to content

Commit a5817f9

Browse files
test(e2e): disable telemetry process-wide to deflake retry tests
Follow-up to #822, which was insufficient. The retry tests in PySQLRetryTestsMixin patch urllib3 globally (HTTPSConnectionPool._get_conn / _validate_conn) and assert the connection's own request was retried exactly N times. #822 disabled telemetry on the retry tests' OWN connections, but the flake persisted: under pytest-xdist, MANY other e2e tests run in the SAME worker process before the retry tests, and each opens a connection with telemetry enabled by default. Telemetry is a process-global singleton (TelemetryClientFactory) with a shared executor and a daemon flush thread, so those prior tests leave real TelemetryClient instances and background threads/futures registered in the worker. When a retry test installs its global urllib3 mock, an in-flight or subsequently-fired telemetry POST to /telemetry-ext from a *previous* test's client goes through the same mocked pool and inflates the mock call_count past the expected value -> flaky AssertionError. CI logs on #825 show /telemetry-ext retries firing inside the failing retry test even though that test's own connection had telemetry disabled. Fix: add an autouse, function-scoped fixture in a new tests/e2e/conftest.py that force-installs NoopTelemetryClient for every e2e connection (and drains the factory before/after each test). With no real TelemetryClient ever created in the worker, there are no background telemetry threads or futures to leak into any test's urllib3 mock. This removes the flake at its source deterministically rather than draining racy background threads. The telemetry suite (tests/e2e/test_telemetry_e2e.py) asserts real telemetry behavior and is marked @pytest.mark.serial while managing its own factory state, so the fixture explicitly skips serial-marked tests. Verified against dogfood: - A normal connection opened with telemetry defaulting ON gets a NoopTelemetryClient (no real executor/flush thread). - serial-marked tests keep the real initialize_telemetry_client; non-serial tests get the no-op installer. - test_telemetry_e2e.py still passes (12 tests, real telemetry). - Full PySQLRetryTestsMixin passes (36 tests). Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
1 parent a2fe99f commit a5817f9

1 file changed

Lines changed: 107 additions & 0 deletions

File tree

tests/e2e/conftest.py

Lines changed: 107 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,107 @@
1+
"""Shared e2e test fixtures.
2+
3+
The most important fixture here is ``_disable_telemetry_for_e2e``, which
4+
eliminates a long-standing source of flakiness in the retry tests.
5+
6+
Background
7+
----------
8+
The retry tests in ``tests/e2e/common/retry_test_mixins.py`` patch urllib3
9+
*globally* (``HTTPSConnectionPool._get_conn`` / ``._validate_conn``) and assert
10+
that the connection's own request was retried exactly N times (e.g.
11+
``assert call_count == 6``). The connector's telemetry feature is a
12+
process-global singleton (``TelemetryClientFactory``) with a shared
13+
``ThreadPoolExecutor`` and a daemon periodic-flush thread. Every ordinary e2e
14+
test (``test_dates``, ``test_decimals``, ...) opens a connection with telemetry
15+
enabled by default, which leaves a real ``TelemetryClient`` plus background
16+
threads/futures registered in that global factory.
17+
18+
Under pytest-xdist, those ordinary tests run in the SAME worker process as the
19+
retry tests. When a retry test installs its global urllib3 mock, an in-flight
20+
or subsequently-fired telemetry POST to ``/telemetry-ext`` from a *previous*
21+
test's lingering client goes through the same mocked pool and inflates the
22+
mock's ``call_count`` past the expected value -> flaky ``AssertionError``. CI
23+
logs confirm ``/telemetry-ext`` retries appearing *inside* the failing retry
24+
test even after that test's own connection disabled telemetry.
25+
26+
Fix
27+
---
28+
Disable telemetry for *every* e2e connection (not just the retry tests) by
29+
force-installing ``NoopTelemetryClient`` for the duration of each test. With no
30+
real ``TelemetryClient`` ever created in the worker, there are no background
31+
telemetry threads/futures to leak into any test's urllib3 mock. This removes
32+
the flake at its source deterministically, rather than trying to drain racy
33+
background threads.
34+
35+
The telemetry suite (``tests/e2e/test_telemetry_e2e.py``) is the one place that
36+
asserts real telemetry behavior; it is marked ``@pytest.mark.serial`` and
37+
manages its own ``TelemetryClientFactory`` state, so this fixture explicitly
38+
skips ``serial``-marked tests.
39+
"""
40+
41+
import pytest
42+
43+
from databricks.sql.telemetry.telemetry_client import (
44+
NoopTelemetryClient,
45+
TelemetryClientFactory,
46+
_TelemetryClientHolder,
47+
)
48+
49+
50+
def _drain_telemetry_factory():
51+
"""Hard-reset the process-global telemetry factory: close clients, stop the
52+
flush thread, shut the executor down, and clear all state."""
53+
with TelemetryClientFactory._lock:
54+
for holder in list(TelemetryClientFactory._clients.values()):
55+
try:
56+
holder.client.close()
57+
except Exception:
58+
pass
59+
try:
60+
TelemetryClientFactory._stop_flush_thread()
61+
except Exception:
62+
pass
63+
if TelemetryClientFactory._executor is not None:
64+
try:
65+
TelemetryClientFactory._executor.shutdown(wait=True)
66+
except Exception:
67+
pass
68+
TelemetryClientFactory._clients = {}
69+
TelemetryClientFactory._executor = None
70+
TelemetryClientFactory._initialized = False
71+
72+
73+
@pytest.fixture(autouse=True)
74+
def _disable_telemetry_for_e2e(request, monkeypatch):
75+
"""Force every e2e connection to use NoopTelemetryClient so no real
76+
telemetry background work runs in the worker process.
77+
78+
Skipped for ``serial``-marked tests (the telemetry suite), which assert
79+
real telemetry behavior and manage their own factory state.
80+
"""
81+
if request.node.get_closest_marker("serial") is not None:
82+
# Let the telemetry suite exercise real telemetry.
83+
yield
84+
return
85+
86+
# Clear any real telemetry state left by an earlier test in this worker,
87+
# then make all subsequent telemetry initialization a no-op.
88+
_drain_telemetry_factory()
89+
90+
def _install_noop(*args, host_url=None, **kwargs):
91+
host_url = TelemetryClientFactory.getHostUrlSafely(host_url)
92+
with TelemetryClientFactory._lock:
93+
TelemetryClientFactory._clients[host_url] = _TelemetryClientHolder(
94+
NoopTelemetryClient()
95+
)
96+
97+
monkeypatch.setattr(
98+
TelemetryClientFactory,
99+
"initialize_telemetry_client",
100+
staticmethod(_install_noop),
101+
)
102+
103+
try:
104+
yield
105+
finally:
106+
# Leave the factory clean for the next test regardless of what ran.
107+
_drain_telemetry_factory()

0 commit comments

Comments
 (0)