refactor(api): share a single URLSession across all PostHogApi requests#593
refactor(api): share a single URLSession across all PostHogApi requests#593turnipdabeets wants to merge 34 commits intofeat/logs-public-apifrom
Conversation
PR 1 of 2 for the logs feature. Lands the on-disk storage and queue plumbing for shipping log records to PostHog's /i/v1/logs endpoint as OTLP/JSON. No public API yet — PR 2 will add captureLog, the PostHogLogger entry point, and the lifecycle wiring (background flush, shutdown drain). What's in scope: - PostHogLogsQueue: standalone sibling of PostHogQueue (events) and PostHogReplayQueue (snapshots). File-backed via the existing PostHogFileBackedQueue primitive at a new posthog.logsFolder directory. Serial dispatch queue + NSLock-guarded state. - 413 adaptive batch sizing: halve currentBatchCap on payload-too-large, retry the same records, ramp +1 on each healthy send, drop the offender if a single-record batch still 413s. - Tumbling-window rate cap (default 500 / 10s) with one-shot warning per window so chatty callers can't spam the console. - Defensive crash-proof flush path: no force-unwraps, do/catch on every user-facing entry, completion handlers always invoked exactly once. - PostHogLogsOTLP: pure helpers building the resourceLogs[].scopeLogs[] .logRecords[] payload. NaN/Infinity correctly serialized as stringValue per the proto3 JSON int64/float rules. - PostHogLogsConfig: public @objc config holding flush/buffer/batch knobs, OTLP resource attributes, rate cap settings, and a Swift beforeSend closure. - PostHogLogRecord and PostHogLogSeverity: public types carrying body, attributes, optional W3C trace fields, and a per-capture context snapshot (distinctId, sessionId, screenName, appState, feature flag keys) so identity changes between capture and flush cannot corrupt a record. - /i/v1/logs endpoint added to PostHogApi (gzipped JSON, token in query string). - 19 tests covering persistence, FIFO eviction, threshold flush, 413 halve+ramp, 413 single-record poison drop with cap reset, 5xx retain, non-413 4xx poison-pill, rate cap window roll, beforeSend drop/mutate/empty-body, concurrent add+flush race, clear-during-add race, and end-to-end OTLP wire shape via gzip+JSON decode. Defaults match the iOS events queue (30s flush interval, 50/batch) rather than the JS/web defaults — mobile platforms have ~30s background budgets and tighter battery/radio constraints. Android events use the same numbers. What's deliberately not done: - No public captureLog / logger / flushLogs API. Internal-only for now. - No reachability subscription on the logs queue. Events queue owns the single-subscriber Reachability slots; logs handles transient failures via the 5xx-retry path. - Logs queue preserved across reset() (matches events / replay behaviour). Records carry capture-time distinctId so they always flush as the user who captured them. A reciprocal TODO has been added to PostHogQueue.handleResult and to PostHogLogsQueue noting that the backoff state, periodic timer, and 413 handling are duplicated and should be extracted into shared helpers under PostHog/Utils. That extraction (plus three real bug fixes uncovered in events: 413 silent drop, 5xx silent drop, timer start/stop race) will land as a separate small follow-up PR. No effect on the events or replay queues. Verified: 480 tests pass, 0 serious lint violations, iOS + macOS xcodebuild clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Reachability multicast hooks (onReachable / onUnreachable) landed in #584. This wires the logs queue into them so it gains the same network-aware behaviour the events queue has: - Pauses on unreachable network. - Pauses on cellular when config.dataMode == .wifi. - Auto-flushes on WiFi reconnect. Subscriptions are held via RegistrationToken instances cleared in stop(). [weak self] in callbacks. The events queue still owns startNotifier(); the multicast broadcasts to all subscribers regardless of who started it. Adds a reachabilityPauseAndResume integration test on PostHogLogsQueue that drives a real Reachability through unreachable, reachable transitions and verifies the queue's pause/resume behaviour. 485 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ies safeguard Brings the logs queue HTTP error handling into parity with the events queue (PostHogQueue) after the 413 hardening landed in PR #585: 1. Halve from min(currentBatchCap, batchSize) instead of currentBatchCap directly, so a 413 on a partial batch doesn't waste halvings on caps the server hasn't accepted. 2. Drop every queued record once retryCount exceeds config.maxRetries, mirroring PostHogQueue.dropAllQueuedEvents. Without this, a permanently-broken backend (wrong API key, deterministic 5xx, etc.) leaves the logs queue retrying the same batch forever. Extracts halveBatchCap(_:actualBatchSize:) and retryCountExceeded(_:maxRetries:) to module-internal free functions in PostHogQueue.swift so both queues share the formula and the > vs >= rule. PostHogReplayQueue wraps a PostHogQueue(.snapshot) instance so it inherits the same handling for free. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e cases in tests `flushIfOverThreshold` was running `executeFlushOnDispatchQueue` inline, which meant `add()` (callable from any thread) could block on disk peek + JSON decode + OTLP build before returning. The doc-comment said "caller must already be on dispatchQueue" but the only caller is `add()` on the user's thread — so the optimization was unsafe. Hop to `dispatchQueue.async` to match `flush()`. Tests added: - `halves cap from min(cap, actualBatchSize) when queue depth was below cap` — exercises the `actualBatchSize` bound on the 413 halve path. A regression that read `currentBatchCap` directly would still pass the existing depth-equals-cap test but shrink too aggressively here. - `OTLP encodes non-string attributes (Int / Double / Bool / array / dict)` — pins down `intValue` (string-encoded), `doubleValue`, `boolValue` (NSNumber-bridge trap), `arrayValue.values`, and `kvlistValue.values` shapes so wire-format regressions don't slip past the previous all-strings test.
…eEndpoint
The events queue and the new logs queue had ~80% identical infrastructure
— file persistence, reachability subscription, flush timer, isFlushing /
pausedUntil / retryCount state machine, 413 halving, maxRetries drop.
The duplication landed reviewer feedback ("PostHogQueue already serves
batch and snapshot, why a third class?").
Make `PostHogQueue` generic over `Record` and extract a
`QueueEndpoint<Record>` struct that captures everything per-wire-specific:
- Storage key, dispatch queue label
- Disk codec (encode / decode)
- Sender (the actual `api.batch / api.snapshot / api.logs` call)
- Retry policy (retriable status set, whether 3xx is retriable)
- Cap policy (after-success ramp, after-poison-drop, after-drop-all)
- Per-config runtime knobs (initial cap, flushAt, maxQueueSize, flush
interval) read from `PostHogConfig` via closures
Three concrete factories: `.batch(api:)`, `.snapshot(api:)`,
`.logs(api:resourceAttributes:)`. The first two replace the old
`PostHogApiEndpoint` enum and switch in `eventHandler`.
`PostHogReplayQueue` keeps its existing wrapper shape — it now composes
`PostHogQueue<PostHogEvent>` with `.snapshot(api:)`. `PostHogLogsQueue`
becomes a thin (~200 LOC) wrapper around `PostHogQueue<PostHogLogRecord>`
that owns only the rate cap and `beforeSend` chain — everything else
delegates to the inner generic queue. Net diff is +197 / −449 LOC.
Two policy fixes shaken out by the unification:
- 413 poison-drop (cap == 1) no longer counts as a retry. Original
logs queue had a separate short-circuit that skipped the retryCount
increment; the unified queue now matches that for both endpoints.
The drop *is* the resolution, not another attempt.
- `dropAllQueuedRecords` now applies the endpoint's `capAfterDropAll`
policy. Events keep cap where it is; logs reset to max because the
queue is empty and new records should start fresh. Matches the
original logs-queue behaviour.
All 488 existing tests pass without modification — the refactor is
behaviour-preserving for every call site. Tests that reach into queue
internals (`fileQueue`, `currentBatchCapForTesting`) work via passthrough
properties on `PostHogLogsQueue`.
The user-facing API it describes (`captureLog`, `logger`, `flushLogs`, `config.logs`) ships in feat/logs-public-api, which already carries this same file. Keeping it on this branch would double-publish the entry.
`PostHogLogRecord.traceFlags` was declared as `Int?`, which can't be expressed in Objective-C — ObjC consumers couldn't read or set it. The "Swift-only" framing wasn't a design decision worth defending; it was a small interop bug. Switch the storage to `@objc public var traceFlags: NSNumber?` (idiomatic ObjC pattern for "boxed optional primitive"). Add a Swift-friendly `traceFlagsValue: Int?` computed accessor backed by the same storage so Swift callers don't have to deal with NSNumber. Behaviour is unchanged: nil → field absent on the wire, value (incl. 0) → explicitly emitted. Matches RN / JS SDKs and OTLP spec.
…bundle helpers Both fields now have eager init-time defaults via the existing bundleIdentifier(fallback:) and appVersionString() helpers in BundleUtils (same helpers PostHogContext / PostHogAppLifeCycleIntegration use). The fallback logic moves from the consumption site (buildResourceAttributes) into the property defaults; ObjC consumers see non-nullable NSString; Swift callers no longer write `?? bundleIdentifier(...)` at every read. serviceVersion stays a String (defaulting to empty string when no CFBundleShortVersionString is available); empty values are omitted from the wire payload.
…tifier() Use the same helper the events queue uses for `$app_namespace` and storage namespacing — single source of truth for the bundle-id fallback chain (incl. the TESTING vs production distinction baked into `getBundleIdentifier()`).
Four narrow tests that pin behaviour the existing coverage didn't fully
nail down after the QueueEndpoint refactor:
- Logs cap ramps +1 toward maxBatchSize on healthy 2xx (logs'
capAfterSuccess policy). A regression that returned cap unchanged
(events policy) would silently ship.
- Events cap stays put after maxRetries dropAll (events'
capAfterDropAll policy). A regression that wired logs' reset-to-max
policy to the events factory would slip through the existing
'queue keeps working after a maxRetries drop' test.
- PostHogLogRecord round-trips every optional field through
toStorageJSON / fromStorageJSON, including traceFlags (newly
NSNumber? for ObjC bridging), traceId, spanId, attributes of mixed
types, and per-record context.
- traceFlags propagates as the OTLP 'flags' integer field on the wire
after the Int? -> NSNumber? change. Catches encoder regressions in
the bridging path.
… budget A regression that incremented retryCount on cap=1 + 413 would drop the entire queue via dropAll once the cap-halving sequence consumed the retry budget — wiping all records together instead of letting them drain record-by-record via repeated halve→poison-drop cycles. Test drives 8 oversized records through repeated 413s and asserts the queue makes more HTTP requests than the buggy dropAll path would (≤ 4 — 3 halvings then dropAll). Record-by-record drainage produces > 8 requests since each record costs at least one halve cycle plus a poison drop.
… stay Logs queue's adaptive batch-cap policy (`+1` ramp on 2xx, reset to max on poison-drop and dropAll) was the only divergence between the three queues after the QueueEndpoint generic refactor. Align logs to the same "halve on 413, stay otherwise" behaviour the events queue ships with — matching what we did for events in the HTTP 413 work (PR #585) and what posthog-android / posthog-js-lite already do. Result: - Cap stays put on 2xx (no ramp). - Cap stays at 1 after a size-1 + 413 poison-drop (offender popped, next records use the small cap until success). - Cap stays where it was after a `dropAll` from `maxRetries` (queue is empty; new records inherit the conservative state). DRY follow-through: with all three endpoints now sharing one cap policy, the per-endpoint `capAfterSuccess` / `capAfterPoisonDrop` / `capAfterDropAll` closures on `QueueEndpoint` become identical boilerplate. Drop them and hardcode the behaviour in `PostHogQueue.handleResult`. The QueueEndpoint abstraction stays — it still expresses the meaningful divergence (storage key, codec, sender, retriable status set) — just smaller and more honest. Tests updated: - "cap ramps +1 on 2xx" deleted; replaced with "cap stays put on 2xx" asserting no ramp. - "413 single-record drops + resets cap" → "413 single-record drops + leaves the cap at 1". - "drops the entire queue once retryCount exceeds maxRetries" updated to assert cap stays at 16 (the halved value), not 64 (max). Cross-SDK note: this aligns iOS logs to events / replay behaviour but diverges from posthog-js-lite logs (which ramps on 2xx). Discussion about unifying the JS / RN logs queue to the same conservative policy is a follow-up in the JS repo.
…lushAt - Collapse PostHogLogLevel (replay 3-level) and PostHogLogSeverity (logs 6-level) into one PostHogLogLevel with 6 cases ordered by severity (trace < debug < info < warn < error < fatal). Drop the iOS-only guard so the type is cross-platform. Replay case names unchanged; only consumers reading raw values directly are affected. - config.logs.maxBufferSize default now reads from PostHogConfig.Defaults.maxQueueSize (1000 / 100 on tvOS) instead of the hardcoded 100. - New config.logs.flushAt knob separates the flush threshold from maxBatchSize. Default matches config.flushAt.
beforeSend is the user-facing sensitive-data filter — it should run at the SDK boundary so records are scrubbed before they enter any internal pipeline. PostHogLogsQueue.add no longer invokes the chain; that becomes the SDK layer's job (captureLog, in PR2). Rate cap stays at the queue layer because its windowed state is queue-lifecycle-scoped (resets on clear()). Existing beforeSend tests now exercise the chain mechanism on PostHogLogsConfig directly, mirroring how the SDK layer will use it.
After unifying PostHogLogLevel from 3 to 6 cases, the rrweb console plugin's switch became non-exhaustive. Map the four extra cases (trace, debug, fatal) onto rrweb's three-level set (info, warn, error) since that's the wire format rrweb expects. Also renamed `v` -> `version` in osVersionString() to satisfy the identifier-name lint rule (the rule was tripped by the file moving through lint after the osName() addition).
e51c72b to
f46ed04
Compare
posthog-ios Compliance ReportDate: 2026-05-06 21:38:06 UTC
|
| Test | Status | Duration |
|---|---|---|
| Request Payload.Request With Person Properties Device Id | ❌ | 233ms |
| Request Payload.Flags Request Uses V2 Query Param | ❌ | 209ms |
| Request Payload.Flags Request Hits Flags Path Not Decide | ❌ | 173ms |
| Request Payload.Flags Request Omits Authorization Header | ❌ | 253ms |
| Request Payload.Token In Flags Body Matches Init | ❌ | 207ms |
| Request Payload.Groups Round Trip | ❌ | 199ms |
| Request Payload.Groups Default To Empty Object | ❌ | 238ms |
| Request Payload.Person Properties Distinct Id Auto Populated When Caller Omits It | ❌ | 252ms |
| Request Payload.Disable Geoip False Propagates As Geoip Disable False | ❌ | 246ms |
| Request Payload.Disable Geoip Omitted Defaults To False | ❌ | 318ms |
| Request Payload.Flag Keys To Evaluate Contains Only Requested Key | ❌ | 105ms |
| Request Lifecycle.No Flags Request On Init Alone | ❌ | 159ms |
| Request Lifecycle.No Flags Request On Normal Capture | ❌ | 2413ms |
| Request Lifecycle.Two Flag Calls Produce Two Remote Requests | ❌ | 245ms |
| Request Lifecycle.Mock Response Value Is Returned To Caller | ❌ | 195ms |
| Side Effect Events.Get Feature Flag Captures Feature Flag Called Event | ❌ | 252ms |
Failures
request_payload.request_with_person_properties_device_id
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.flags_request_uses_v2_query_param
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.flags_request_hits_flags_path_not_decide
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.flags_request_omits_authorization_header
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.token_in_flags_body_matches_init
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.groups_round_trip
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.groups_default_to_empty_object
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.person_properties_distinct_id_auto_populated_when_caller_omits_it
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.disable_geoip_false_propagates_as_geoip_disable_false
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.disable_geoip_omitted_defaults_to_false
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_payload.flag_keys_to_evaluate_contains_only_requested_key
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_lifecycle.no_flags_request_on_init_alone
Expected 0 /flags requests, got 1
request_lifecycle.no_flags_request_on_normal_capture
Expected 0 /flags requests, got 2
request_lifecycle.two_flag_calls_produce_two_remote_requests
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
request_lifecycle.mock_response_value_is_returned_to_caller
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
side_effect_events.get_feature_flag_captures_feature_flag_called_event
404, message='Not Found', url='http://host.docker.internal:8080/get_feature_flag'
User-facing surface on top of the storage + queue infrastructure:
- PostHogSDK.captureLog(_:level:attributes:traceId:spanId:traceFlags:)
with @objc overloads and a Swift default-arg variant
- PostHogSDK.logger — lazy facade with trace / debug / info / warn /
error / fatal
The existing PostHogSDK.flush() now drains all three queues (events,
replay, logs) — no separate flushLogs() needed.
beforeSend runs at the SDK layer (before logsQueue.add) so sensitive
data is scrubbed before it can enter any internal pipeline. Rate cap
stays at the queue layer.
Capture-time context (distinctId, sessionId, screenName, appState,
featureFlagKeys) is snapshotted on the calling thread; identity or
session changes between capture and flush cannot mutate a queued
record. Screen name is cached on PostHogSDK via the existing
screen-view publisher so log records can be tagged without needing
main-thread access to UIViewController helpers.
Also adds:
- PostHogTests/PostHogLogsCaptureTest.swift — integration tests
covering main / background / concurrent capture, empty-body and
opt-out drops, distinctId snapshotting, all six severity numbers
on the wire, OTLP resource attributes, and backgrounding triggers
flush.
- PostHogExample/Views/LogsView.swift — demo each level, flush,
traced-log, and a 1000-record flood.
- .changeset/logs-feature.md — combined-feature entry.
f46ed04 to
c1eb83a
Compare
User-facing surface on top of the storage + queue infrastructure:
- PostHogSDK.captureLog(_:level:attributes:traceId:spanId:traceFlags:)
with @objc overloads and a Swift default-arg variant
- PostHogSDK.logger — lazy facade with trace / debug / info / warn /
error / fatal
The existing PostHogSDK.flush() now drains all three queues (events,
replay, logs) — no separate flushLogs() needed.
beforeSend runs at the SDK layer (before logsQueue.add) so sensitive
data is scrubbed before it can enter any internal pipeline. Rate cap
stays at the queue layer.
Capture-time context (distinctId, sessionId, screenName, appState,
featureFlagKeys) is snapshotted on the calling thread; identity or
session changes between capture and flush cannot mutate a queued
record. Screen name is cached on PostHogSDK via the existing
screen-view publisher so log records can be tagged without needing
main-thread access to UIViewController helpers.
Also adds:
- PostHogTests/PostHogLogsCaptureTest.swift — integration tests
covering main / background / concurrent capture, empty-body and
opt-out drops, distinctId snapshotting, all six severity numbers
on the wire, OTLP resource attributes, and backgrounding triggers
flush.
- PostHogExample/Views/LogsView.swift — demo each level, flush,
traced-log, and a 1000-record flood.
- .changeset/logs-feature.md — combined-feature entry.
9e5d922 to
6c45165
Compare
Per-request URLSession is an anti-pattern (Apple WWDC 2018 session 714,
DTS forum guidance). Each fresh URLSession has its own connection pool,
so back-to-back requests pay a TCP + TLS handshake every time and HTTP/2
multiplexing — which the PostHog backend supports — never engages.
Lift URLSession to a stored property on PostHogApi, built once in init.
All five API methods (batch, snapshot, logs, flags, remoteConfig) now
share the same session and connection pool. Per-request gzip semantics
move from session-level httpAdditionalHeaders onto URLRequest via
Content-Encoding, so /flags and /array don't carry a misleading gzip
header.
Concrete wins:
- TLS handshake amortised across requests instead of paid per-call.
- HTTP/2 multiplexing kicks in when events / replay / logs / flags
flush concurrently — one TCP connection carries all four.
- Reduced per-request memory + setup cost (session is "fairly
expensive to create" per Apple).
No behavior change at the wire level. Tests unchanged.
|
nice! |
0217342 to
5989997
Compare
💡 Motivation and Context
Stacked on #592. Base:
feat/logs-public-api, notmain.Per-request
URLSessionis an anti-pattern. The SDK currently constructs a freshURLSession(configuration: ...).uploadTask(...)for every batch / snapshot / logs / flags / remoteConfig call (5 sites inPostHogApi.swift). Each fresh session has its own connection pool, so back-to-back requests pay a TCP + TLS handshake every time and HTTP/2 multiplexing — which the PostHog backend supports (us.i.posthog.comandeu.i.posthog.comboth negotiate HTTP/2 via Envoy) — never engages.Apple's guidance (WWDC 2018 session 714 + DTS forum thread 84663) is explicit: reuse a single
URLSessioninstance. Quinn the Eskimo: "This is a common anti-pattern… creating a session per request prevents connection reuse, which can radically slow down back-to-back requests. This is especially bad for HTTP/2."This PR lifts
URLSessionto a stored property onPostHogApi, built once ininit. All five API methods now share the same session and connection pool. Per-request gzip semantics move from session-levelhttpAdditionalHeadersontoURLRequestviaContent-Encoding, so/flagsand/array/<token>/configdon't carry a misleading gzip header (their bodies aren't gzipped).Concrete wins:
URLSessionis "fairly expensive to create" with "non-trivial memory footprint."💚 How did you test it?
make test— 503 tests pass, no behaviour change at the wire level.xcodebuild -scheme PostHog -destination 'generic/platform=iOS Simulator' build— green.Manual: not yet validated against a real backend. Worth confirming with a captured network trace that subsequent requests reuse the connection (single SYN/ACK + TLS, then multiple POSTs over it) rather than opening fresh connections per request.
📝 Checklist
If releasing new changes
pnpm changesetto generate a changeset file. (not yet — pure perf refactor with no observable wire-level change; will add if the team wants a changelog entry for it)