feat(trace-exporter): add fail-closed fallback to v04#2037
feat(trace-exporter): add fail-closed fallback to v04#2037anais-raison wants to merge 13 commits into
Conversation
📚 Documentation Check Results📦
|
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
🔒 Cargo Deny Results📦
|
|
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2037 +/- ##
==========================================
+ Coverage 72.83% 72.97% +0.14%
==========================================
Files 459 460 +1
Lines 76134 76682 +548
==========================================
+ Hits 55452 55959 +507
- Misses 20682 20723 +41
🚀 New features to boost your workflow:
|
…allback # Conflicts: # libdd-data-pipeline/src/trace_exporter/trace_serializer.rs # libdd-trace-utils/src/msgpack_encoder/v04/mod.rs # libdd-trace-utils/src/msgpack_encoder/v1/mod.rs
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 73cbd6de94
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // and the payload is encoded as V1. | ||
| let start = std::time::Instant::now(); | ||
| while libdd_data_pipeline::agent_info::get_agent_info().is_none() { | ||
| if start.elapsed() > std::time::Duration::from_secs(5) { |
There was a problem hiding this comment.
I think this is probably ok, but we need to be careful anytime we introduce waiting to tests. Could we try to stress test this a bit to see if it's flaky on CI? I'd suggest temporarily modifying the CI workflow to run the tracing integration step 1000x on the various platforms in our matrix in a loop (the loop should only apply to the integration tests, not the entire suite) to see if any flakes appear.
There was a problem hiding this comment.
I added a temporary step to stress test it 👍
| .as_ref() | ||
| .is_some_and(|e| e.iter().any(|p| p == "/v1.0/traces")); | ||
| let previous = self.v1_active.swap(supports_v1, Ordering::Relaxed); | ||
| match (previous, supports_v1) { |
There was a problem hiding this comment.
Should we also log (false, false)? Or is there somewhere else we should log the scenario where the SDK is requesting V1, but the agent doesn't support it? I'm not sure if we care about that?
There was a problem hiding this comment.
Logging every (false, false) would be noisy in the common "agent never supports V1" case. We could gate it behind an AtomicBool so we emit a single warn! the first time we see it but I don't think it's that necessary or useful.
There was a problem hiding this comment.
I agree logging every (false, false) is not helpful. I was thinking logging it at least once could be helpful for debugging, but realistically the only reason that would happen is if the agent is version is rolled back, which we should be able to tell via other signals. We can leave it as is.
| match (previous, supports_v1) { | ||
| (false, true) => debug!("V1 trace protocol enabled (agent advertises /v1.0/traces)"), | ||
| (true, false) => { | ||
| warn!("V1 trace protocol no longer advertised by agent; falling back to v0.4") |
There was a problem hiding this comment.
Related to this comment, when would this happen?
There was a problem hiding this comment.
There is three cases i can think about :
- Agent rolled back to a version without V1
- Agent restarted with V1 disabled via config
- The new 404 fail-closed hook (on a 404 from /v1.0/traces we flip v1_active directly and trigger an /info refresh)
| @@ -347,6 +351,9 @@ impl<C: HttpClientCapability + SleepCapability + MaybeSend + Sync + 'static> Tra | |||
| fn check_agent_info(&self) { | |||
| if let Some(agent_info) = agent_info::get_agent_info() { | |||
| if self.has_agent_info_state_changed(&agent_info) { | |||
There was a problem hiding this comment.
I might be concern trolling here...but if a likely scenario we have to support is the agent advertising that it supports V1 to advertising it does not support v1 then it's possible we get trapped in a loop where we don't get a correct state hash from the agent.
- Agent says it's ok to send V1, we send V1.
- Agent changes its mind and does not support V1
- We check if agent state has changed based on the last successful response that happened before the agent changed its mind.
- We send V1, get a 404 because the endpoint isn't available. We don't get a new state hash on a 404.
- We keep sending V1, keep 404'ing, never get a new state hash.
We poll /info every 5 minutes independently of the state hash. But 5 minutes is a long time to be dropping traces.
I think this would also apply to CSS, which has the same logic. Is this a valid concern?
CC @VianneyRuhlmann and @ajgajg1134
There was a problem hiding this comment.
hmm, yeah I think it's reasonable to say that a 404 response from the agent at any point should probably invalidate the info state cache, especially for something like the newer v1 endpoint where a 404 would likely indicate the customer rolled back their agent to a non-v1 version or restarted it with the v1 endpoint explicitly disabled.
I will note that this is probably an area that we should have a clearer spec (and tests) for across all the languages since this issue is not unique to libdatadog. E.g. the go tracer just fetches once on startup and every 5s afterwards, it doesn't look at the header values at all nor does it have any special "invalidation" logic on a failed request like a 404
There was a problem hiding this comment.
I added a hook in send_trace_chunks_inner that catches 404s on /v1.0/traces: when that happens we immediately flip v1_active to false so next send fallback to v0.4, and call info_response_observer.manual_trigger() to force an /info refetch instead of waiting for the next 5-minute poll. So the trap closes on the first request after a rollback rather than the next scheduled poll.
ekump
left a comment
There was a problem hiding this comment.
LGTM, just don't forget to remove the temporary code in the GH workflow
What does this PR do?
Adds runtime V1 trace protocol negotiation with fail-closed fallback to V0.4. Opt in via
enable_v1_protocol()orDD_TRACE_AGENT_PROTOCOL_VERSION=1. V1 is only used after the agent advertises/v1.0/tracesin/info.Motivation
APMSP-2809
Makes V1 safe to enable: falls back to V0.4 against agents that don't support it.
Additional Notes
Dynamic rollback supported (if the agent stops advertising V1, the exporter switches back to V0.4).
How to test the change?
All test are working.
End to end tests against a real agent were done and successful as well.