feat: drop guards and cancellation wrappers to "fix" stuck pubkeys #1172
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
magicblock-chainlink/src/remote_account_provider/mod.rs (1)
971-993: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Rollback currently tears down pre-existing subscriptions.
`setup_subscriptions()` treats every `Ok(())` as a newly created subscription, but `subscribe()` also returns `Ok(())` for pubkeys that were already watched and only got LRU-promoted. On a partial failure, this loop unsubscribes those long-lived watches too, which can drop unrelated updates. Track only subscriptions created by this call, or have `subscribe()` return `Created` vs `AlreadyWatching`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-chainlink/src/remote_account_provider/mod.rs` around lines 971 - 993, setup_subscriptions() currently treats every Ok(()) from subscribe() as a newly created watch and then unsubscribes all succeeded entries on partial failure, which tears down pre-existing subscriptions; change the behavior so only subscriptions actually created by this call are rolled back: either modify subscribe() to return a discriminated result (e.g., an enum like SubscribeResult::Created | SubscribeResult::AlreadyWatching) and treat only Created as eligible for unsubscribe in the error cleanup loop, or have setup_subscriptions() track which pubkeys were newly created (via a boolean flag returned or a separate created list) and call unsubscribe() only for those entries; adjust matching/code around unsubscribe and RemoteAccountProviderError handling accordingly.
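The discriminated result suggested above might look like the following minimal std-only sketch. The `u32` keys stand in for `Pubkey`, and `rollback_targets` is a hypothetical helper, not code from this PR:

```rust
// Hypothetical sketch: a discriminated subscribe result so the rollback
// path only tears down subscriptions this call actually created.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SubscribeResult {
    Created,
    AlreadyWatching, // pubkey was already watched; only LRU-promoted
}

// On partial failure, collect only the keys eligible for unsubscribe.
fn rollback_targets(results: &[(u32, SubscribeResult)]) -> Vec<u32> {
    results
        .iter()
        .filter(|(_, r)| *r == SubscribeResult::Created)
        .map(|(k, _)| *k)
        .collect()
}

fn main() {
    let results = vec![
        (1, SubscribeResult::Created),
        (2, SubscribeResult::AlreadyWatching),
        (3, SubscribeResult::Created),
    ];
    // Pre-existing watch (key 2) survives the rollback.
    assert_eq!(rollback_targets(&results), vec![1, 3]);
    println!("rollback targets: {:?}", rollback_targets(&results));
}
```

The same effect can be had without changing `subscribe()`'s signature by tracking newly created pubkeys in a separate list inside `setup_subscriptions()`.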
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 34-35: The 15s hardcoded PENDING_REQUEST_STALE_AFTER can evict
healthy in-flight fetches; change it so the dedup layer doesn't mark a live
fetch stale after a short fixed interval: replace the constant
PENDING_REQUEST_STALE_AFTER = Duration::from_secs(15) with a configurable
timeout (e.g., read from config/env or derived from the remote fetch timeout) or
increase it to a safe default (e.g., 2+ minutes), and update the dedup staleness
checks that reference PENDING_REQUEST_STALE_AFTER to use the new configurable
value (or to consult request progress/status before evicting) so live
fetch+clone operations aren't prematurely considered stale.
- Around line 1923-1941: The 5s timeout in the waiter loop is too short and
causes spurious ChainlinkError::Custom("timeout waiting...") failures; update
the timeout used in the tokio::time::timeout call (replace
Duration::from_secs(5)) to a much longer, configurable value (e.g.,
PENDING_REQUEST_TIMEOUT = Duration::from_secs(60) or higher) or remove the
timeout to await the receiver indefinitely; adjust the await_pending loop where
tokio::time::timeout(Duration::from_secs(5), receiver).await is invoked and
ensure the new constant or config is referenced so callers can tune it if needed
(keep handling of PendingRequestCompletion variants unchanged).
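Both findings above amount to "make the hardcoded duration tunable with a safe default". One way to sketch that, with the parameter source and the 60s default as assumptions (the real code would read e.g. a config field or an env var):

```rust
use std::time::Duration;

// Sketch: resolve a pending-request timeout from an optional override
// string, falling back to a default well above the remote retry budget.
fn pending_request_timeout(override_secs: Option<&str>) -> Duration {
    override_secs
        .and_then(|s| s.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(Duration::from_secs(60)) // assumed safe default
}

fn main() {
    // Unset or unparsable overrides yield the default.
    assert_eq!(pending_request_timeout(None), Duration::from_secs(60));
    assert_eq!(pending_request_timeout(Some("bogus")), Duration::from_secs(60));
    // Callers can tune the budget without recompiling.
    assert_eq!(pending_request_timeout(Some("120")), Duration::from_secs(120));
    println!("timeout = {:?}", pending_request_timeout(None));
}
```

The resolved value would then replace both the `Duration::from_secs(15)` staleness cutoff and the `Duration::from_secs(5)` in the `tokio::time::timeout` call.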
In `@magicblock-chainlink/src/chainlink/fetch_cloner/pending_request_guard.rs`:
- Around line 13-17: PendingRequestCompletion::Success currently carries no
payload, which discards the owner's FetchAndCloneResult metadata; change the
enum so Success holds the owner's FetchAndCloneResult (e.g.,
Success(FetchAndCloneResult)), then update all sites that construct or match on
PendingRequestCompletion (in pending_request_guard.rs and any completion/notify
helpers) to pass through and propagate that FetchAndCloneResult, and update
fetch_and_clone_accounts_with_dedup() waiter-handling to extract and return the
carried FetchAndCloneResult instead of an empty result so waiters observe
not_found_on_chain / missing_delegation_record metadata.
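The payload-carrying variant could look like this sketch. The fields on `FetchAndCloneResult` are assumptions based on the names mentioned in the comment, not the actual struct:

```rust
// Sketch: Success carries the owner's result so waiters observe the
// same not_found_on_chain / missing_delegation_record metadata.
#[derive(Debug, Clone, PartialEq)]
struct FetchAndCloneResult {
    not_found_on_chain: Vec<String>,
    missing_delegation_record: Vec<String>,
}

#[derive(Debug)]
enum PendingRequestCompletion {
    Success(FetchAndCloneResult),
    Failed(String),
}

// What the waiter side of fetch_and_clone_accounts_with_dedup() would
// do: extract and return the carried result instead of an empty one.
fn waiter_result(c: PendingRequestCompletion) -> Result<FetchAndCloneResult, String> {
    match c {
        PendingRequestCompletion::Success(res) => Ok(res),
        PendingRequestCompletion::Failed(err) => Err(err),
    }
}

fn main() {
    let owner_result = FetchAndCloneResult {
        not_found_on_chain: vec!["SomePubkey".into()],
        missing_delegation_record: vec![],
    };
    let seen = waiter_result(PendingRequestCompletion::Success(owner_result.clone())).unwrap();
    // The waiter now sees exactly what the owner observed.
    assert_eq!(seen, owner_result);
    println!("waiter observed: {:?}", seen);
}
```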
In `@magicblock-chainlink/src/chainlink/fetch_cloner/tests.rs`:
- Around line 3947-3963: The sleep is racy because it doesn't ensure task_waiter
registered as a waiter in fetch_cloner before aborting task_owner; replace the
fixed sleep with a short polling loop (with an overall timeout) that queries
fetch_cloner.pending_requests for the entry keyed by account_pubkey and verifies
the waiter count/state indicates the waiter is registered, then only call
task_owner.abort(); use symbols fetch_cloner, task_waiter,
fetch_and_clone_accounts_with_dedup, pending_requests and task_owner.abort in
the check and fail the test if the timeout elapses without observing the waiter.
In `@magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs`:
- Around line 411-418: The loop in wait_for_subscribe_attempts registers the
notified() future after checking subscribe_attempts, which can race; change the
pattern in wait_for_subscribe_attempts to create the waiter first (let notified
= self.subscribe_notify.notified()), then check
self.subscribe_attempts.load(AtomicOrdering::SeqCst) and only await the
previously created notified if the condition still isn't met; apply the same
change to the other similar loop that uses subscribe_attempts and
subscribe_notify so the waiter is always registered before re-checking the
condition.
In `@magicblock-chainlink/src/remote_account_provider/mod.rs`:
- Line 86: The stale cutoff is too short (const FETCHING_ACCOUNT_STALE_AFTER)
and can evict a still-active fetch() owner because fetch() can run ~24s (10 * 2s
timeouts + 9 * 400ms backoffs); increase FETCHING_ACCOUNT_STALE_AFTER to exceed
that retry budget (e.g., set to Duration::from_secs(30) or at least 25) so
in-flight owners in the fetching_accounts path are not prematurely considered
stale and evicted.
- Around line 237-245: The Drop implementation for FetchingAccountGuard
currently calls self.fetching_accounts.lock().unwrap(), which can panic if the
mutex is poisoned; change it to handle the Result from lock() safely (match or
map_or_else) so the drop path never panics: attempt to acquire the mutex with
self.fetching_accounts.lock() and on Ok(guard) remove the entry with
guard.remove(&self.pubkey), and on Err(poisoned) obtain the inner guard via
poisoned.into_inner() or otherwise handle the poisoned case (log the poison and
still attempt to remove the waiter) so the waiter notification path still runs;
reference FetchingAccountGuard::drop, fetching_accounts, and pubkey when
applying the fix.
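The panic-free drop path might look like this std-only sketch, with the map simplified to `HashMap<String, ()>` as a stand-in for the real entry type:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Sketch: a Drop impl that never panics on a poisoned mutex; it
// recovers the inner guard from the PoisonError instead of unwrapping.
struct FetchingAccountGuard {
    fetching_accounts: Arc<Mutex<HashMap<String, ()>>>,
    pubkey: String,
}

impl Drop for FetchingAccountGuard {
    fn drop(&mut self) {
        let mut map = match self.fetching_accounts.lock() {
            Ok(guard) => guard,
            // A poisoned lock still yields a usable guard; a real impl
            // would also log the poison before continuing.
            Err(poisoned) => poisoned.into_inner(),
        };
        map.remove(&self.pubkey); // waiter notification path still runs
    }
}

fn main() {
    let map = Arc::new(Mutex::new(HashMap::new()));
    map.lock().unwrap().insert("acct".to_string(), ());

    // Poison the mutex by panicking while holding the lock.
    let m2 = map.clone();
    let _ = std::thread::spawn(move || {
        let _g = m2.lock().unwrap();
        panic!("poison the lock");
    })
    .join();

    // Dropping the guard removes the entry despite the poisoned lock.
    drop(FetchingAccountGuard {
        fetching_accounts: map.clone(),
        pubkey: "acct".to_string(),
    });
    let inner = map.lock().unwrap_or_else(|p| p.into_inner());
    assert!(inner.is_empty());
    println!("drop path survived the poisoned mutex");
}
```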
In `@magicblock-chainlink/src/remote_account_provider/tests.rs`:
- Around line 40-60: The test drops the forward receiver `_forward_rx` before
returning from setup, which can cause sends on `forward_tx` to fail; modify the
setup/init function (where `forward_tx` and `_forward_rx` are created—e.g., in
`setup_provider()` or `init_remote_account_provider()`) to store the receiver on
`ProviderTestCtx` (add a field like `forward_rx` or `_fwd_rx`) and return that
context so the receiver lives for the full test lifetime; update
`ProviderTestCtx` and call sites to accept/ignore the stored receiver but keep
it alive.
- Around line 196-220: The test races because it sleeps a fixed 50ms before
aborting first_task_handle; instead, poll provider.fetching_accounts until the
entry for the pubkey has a non-empty waiters list to ensure second_task_handle
has registered as a waiter before aborting the owner. Locate the test code
around try_get_multi/second_task_handle and replace the fixed sleep with a
short-loop that reads provider.fetching_accounts (or its relevant map), breaks
when waiters.len() > 0 for that pubkey (with a timeout to avoid hangs), then
abort first_task_handle and proceed to _pubsub_client.release_subscribe().
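Both test fixes above replace a fixed sleep with a bounded polling loop. A generic std-only helper for that pattern, with the names and the 5ms poll interval as illustrative choices, could be:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Keep checking a condition until it holds or a deadline passes;
// returns whether the condition was eventually observed.
fn wait_until(timeout: Duration, mut cond: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if cond() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(5));
    }
}

fn main() {
    // Simulate another task registering as a waiter after a short delay.
    let waiters = Arc::new(AtomicUsize::new(0));
    let w = waiters.clone();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(20));
        w.fetch_add(1, Ordering::SeqCst);
    });
    // Instead of a fixed 50ms sleep, poll until the waiter shows up.
    assert!(wait_until(Duration::from_secs(1), || waiters
        .load(Ordering::SeqCst)
        > 0));
    println!("waiter registered; safe to abort the owner task");
}
```

In the actual tests the closure would inspect `fetch_cloner.pending_requests` or `provider.fetching_accounts` for the pubkey's waiter entry, and the test would fail if `wait_until` returns false.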
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: dc3c1b11-b8f0-472f-919e-9f1a8bc5e8a7
📒 Files selected for processing (7)
- magicblock-chainlink/src/chainlink/errors.rs
- magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs
- magicblock-chainlink/src/chainlink/fetch_cloner/pending_request_guard.rs
- magicblock-chainlink/src/chainlink/fetch_cloner/tests.rs
- magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs
- magicblock-chainlink/src/remote_account_provider/mod.rs
- magicblock-chainlink/src/remote_account_provider/tests.rs
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
magicblock-chainlink/src/remote_account_provider/mod.rs (1)
803-890: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Only claimed pubkeys should subscribe and fetch.
Lines 874-890 still make every caller run `setup_subscriptions()` and `fetch()` for the full `pubkeys` list, even when this call only joined an existing `fetching_accounts` entry as a waiter. That means a waiter can win the race and resolve all receivers with its own `mark_empty_if_not_found`/`fetch_start_slot`/`program_ids`, or roll back a subscription it created even though another in-flight owner depends on it. Restrict owner side effects to the pubkeys inserted or stale-replaced by this call; waiter-only pubkeys should just await their oneshot result.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-chainlink/src/remote_account_provider/mod.rs` around lines 803 - 890, The bug is that setup_subscriptions() and fetch() are being called for all pubkeys even when this call only joined as a waiter; change the loop around fetching.entry to track which pubkeys this call actually inserted or replaced (e.g., collect claimed_pubkeys and only push their (pubkey, receiver) into subscription_overrides and create FetchingAccountGuard for them), while for waiter-only pubkeys only register the receiver to be awaited (do not add to subscription_overrides or owner_guards); then call self.setup_subscriptions(&subscription_overrides).await? and self.fetch(claimed_pubkeys, ...) using only the claimed_pubkeys so side effects (setup_subscriptions, fetch, dismiss via owner_guards) are restricted to entries this call owns.
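The claimed-vs-waiter split described above can be sketched as follows. `String` keys stand in for `Pubkey`, and `partition_claims` is a hypothetical helper, not code from this PR:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Only pubkeys this call inserted get owner side effects
// (setup_subscriptions + fetch); the rest only await their oneshot.
fn partition_claims(
    fetching: &mut HashMap<String, ()>,
    pubkeys: &[String],
) -> (Vec<String>, Vec<String>) {
    let mut claimed = Vec::new(); // run subscribe/fetch for these
    let mut waiting = Vec::new(); // only await the shared completion
    for pk in pubkeys {
        match fetching.entry(pk.clone()) {
            Entry::Vacant(v) => {
                v.insert(());
                claimed.push(pk.clone());
            }
            Entry::Occupied(_) => waiting.push(pk.clone()),
        }
    }
    (claimed, waiting)
}

fn main() {
    let mut fetching = HashMap::new();
    fetching.insert("in-flight".to_string(), ());
    let keys = vec!["in-flight".to_string(), "fresh".to_string()];
    let (claimed, waiting) = partition_claims(&mut fetching, &keys);
    // Only the freshly inserted pubkey triggers owner side effects.
    assert_eq!(claimed, vec!["fresh".to_string()]);
    assert_eq!(waiting, vec!["in-flight".to_string()]);
    println!("owner side effects limited to: {:?}", claimed);
}
```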
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs`:
- Around line 294-344: The stale-entry replacement is unsafe because ownership
is only keyed by Pubkey; add a per-entry generation token to PendingRequestState
(e.g., a monotonic u64 or reuse created_at as a unique token) and store that
token in the PendingRequestGuard when you create Owner in claim_pending_request
(both the vacant and stale-replace paths). Update finish_pending_request (and
the PendingRequestGuard Drop) to check the map entry for the given pubkey and
only remove/complete it when the stored token matches the current map entry's
token; when evicting/replacing, increment/set a new token so old guards no
longer match. Apply the same token-checking pattern to the other path mentioned
(the code around 357-375) so removes only occur when the entry identity matches
the guard's token.
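The generation-token pattern suggested above can be sketched like this; the struct and method names are simplified stand-ins for `PendingRequestState`/`PendingRequestGuard`:

```rust
use std::collections::HashMap;

// Each claim gets a monotonic token; a guard may only remove the entry
// it actually owns, so a stale guard cannot evict a newer owner.
struct PendingEntry {
    generation: u64,
}

struct PendingMap {
    entries: HashMap<String, PendingEntry>,
    next_generation: u64,
}

impl PendingMap {
    fn new() -> Self {
        Self { entries: HashMap::new(), next_generation: 0 }
    }

    // Claim (or stale-replace) an entry; the returned token is what the
    // caller's guard must present later to finish it.
    fn claim(&mut self, pubkey: &str) -> u64 {
        self.next_generation += 1;
        let token = self.next_generation;
        self.entries
            .insert(pubkey.to_string(), PendingEntry { generation: token });
        token
    }

    // Remove only when the token matches the live entry's generation.
    fn finish(&mut self, pubkey: &str, token: u64) -> bool {
        match self.entries.get(pubkey) {
            Some(e) if e.generation == token => {
                self.entries.remove(pubkey);
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let mut map = PendingMap::new();
    let first = map.claim("acct");
    let second = map.claim("acct"); // stale-replace by a newer owner
    assert!(!map.finish("acct", first)); // old guard is a no-op
    assert!(map.finish("acct", second)); // current owner cleans up
    println!("stale guard could not evict the new owner");
}
```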
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 18a748ab-7401-4b28-8384-3bd34f6a80b6
📒 Files selected for processing (7)
- magicblock-chainlink/src/chainlink/fetch_cloner/mod.rs
- magicblock-chainlink/src/chainlink/fetch_cloner/pending_request_guard.rs
- magicblock-chainlink/src/chainlink/fetch_cloner/tests.rs
- magicblock-chainlink/src/chainlink/fetch_cloner/types.rs
- magicblock-chainlink/src/remote_account_provider/chain_pubsub_client.rs
- magicblock-chainlink/src/remote_account_provider/mod.rs
- magicblock-chainlink/src/remote_account_provider/tests.rs
Add waiter reconciliation check to ensure accounts are in valid terminal state before accepting owner's result. If not, perform fresh fetch for that waiter pubkey only, making ownership/waiter race deterministic. Amp-Thread-ID: https://ampcode.com/threads/T-019de1e0-a654-777c-bdd6-264e5b7c7d53 Co-authored-by: Amp <amp@ampcode.com>
Adds two integration tests in magicblock-chainlink/tests/09_waiter_reconciliation_race.rs:
1. test_waiter_reconciliation_detects_valid_delegated_state() - Verifies that when an account is pre-populated in the bank as delegated at the current slot, multiple concurrent fetch requests all succeed with the valid terminal state accepted.
2. test_multiple_concurrent_requests_with_valid_delegated_state() - Verifies that three concurrent requests on the same pre-populated delegated account all succeed without unnecessary fresh fetches.
These tests exercise the full system-level behavior of waiter_reconciliation_check() from Step 1, demonstrating how concurrent tasks interact with the deduplication logic and how the reconciliation check ensures valid terminal states are accepted. Tests use the TestContext integration test harness with mocked RPC client, pubsub client, and accounts bank, following the pattern established by existing integration tests in the chainlink/tests directory.
Amp-Thread-ID: https://ampcode.com/threads/T-019de1ea-b5a6-742a-b14f-135943ef0b13 Co-authored-by: Amp <amp@ampcode.com>
GabrielePicco
left a comment
I do not think this should land in its current shape. The PR improves some abort paths, but the production stuck-pubkey failure mode is still mostly handled by passive stale timers. The main issue is that ownership is still tied to caller futures and unbounded awaits, rather than to an owned operation with a hard deadline and guaranteed completion.
The current 30s/60s/120s values are too high, but lowering them alone is not the right fix. A pending pubkey operation should be an owned task with a total budget, cancellation token, generation, and shared completion channel. The owner task should always remove the pending entry and notify waiters with Ok/Err/TimedOut. Stale eviction should be a diagnostic fallback, not a correctness mechanism.
I also think the dedup key is too weak: requests are deduped by pubkey even though mark_empty_if_not_found, min_context_slot/slot, (e.g. incoming request with higher min_context_slot) and fetch semantics affect the result (this may require a huge refactoring in the cloning, so I think we can accept pubkey as keys for now).
Instead of caller-owned pending entries plus stale timers, I believe a better approach is to:
- Insert `Pending { generation, deadline, waiters, cancel }`.
- Spawn exactly one owner task for that operation.
- Run the full lifecycle under `timeout(total_budget, ...)`.
- The owner task always removes `{pubkey, generation}` and sends a terminal result to every waiter.
- Waiters only await shared completion; their cancellation removes their waiter, not the owner.
- Manage subscriptions with per-pubkey state/refcount, not last-writer rollback tokens.
This design makes completion active and bounded by construction.
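A std-thread analogue of that design, purely as a sketch: one owner per pending operation runs under a total budget and always delivers a terminal result to every waiter. All names and durations are illustrative; the real code would use tokio tasks, a cancellation token, and `timeout(total_budget, ...)`:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
enum Terminal {
    Ok,
    TimedOut,
}

// Exactly one owner per pending operation; it always sends a terminal
// result to every registered waiter, so completion is active and
// bounded rather than depending on stale-entry timers.
fn spawn_owner(
    waiters: Vec<mpsc::Sender<Terminal>>,
    budget: Duration,
    simulated_work: Duration,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // Run the "fetch" under a hard deadline.
        let result = if simulated_work <= budget {
            thread::sleep(simulated_work);
            Terminal::Ok
        } else {
            thread::sleep(budget);
            Terminal::TimedOut
        };
        // Removing the {pubkey, generation} entry is elided here; the
        // owner then notifies every waiter with the terminal result.
        for w in &waiters {
            let _ = w.send(result.clone());
        }
    })
}

fn main() {
    let (tx1, rx1) = mpsc::channel();
    let (tx2, rx2) = mpsc::channel();
    spawn_owner(
        vec![tx1, tx2],
        Duration::from_millis(50),
        Duration::from_millis(10),
    )
    .join()
    .unwrap();
    assert_eq!(rx1.recv().unwrap(), Terminal::Ok);
    assert_eq!(rx2.recv().unwrap(), Terminal::Ok);

    // Over-budget work still yields a terminal result, never a stuck waiter.
    let (tx3, rx3) = mpsc::channel();
    spawn_owner(vec![tx3], Duration::from_millis(10), Duration::from_millis(100))
        .join()
        .unwrap();
    assert_eq!(rx3.recv().unwrap(), Terminal::TimedOut);
    println!("all waiters received terminal results");
}
```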
* master: chore(ci): cut integration test wall-clock (#1178)
Summary
Make repeated per-pubkey coordination cancellation-safe by adding guard-based cleanup and stale-entry recovery to both the outer `pending_requests` queue (FetchCloner) and the inner `fetching_accounts` queue (RemoteAccountProvider). This prevents pubkeys from getting stuck when owner tasks are cancelled during subscription setup or due to edge cases. This should fix the issue we saw where specific pubkeys time out when fetched via `getAccountInfo()`.
Details
This implementation adds the "smallest effective change first" cancellation-safety rollout, addressing the core race condition where dropped owner tasks leave stale entries in coordination maps that block subsequent operations on the same pubkey.
FetchCloner Cancellation Safety (fetch_cloner)
New Module: `pending_request_guard.rs`
- `PendingRequestGuard` for the outer dedup queue
- `PendingRequestCompletion::Failed`

Changes to `fetch_cloner/mod.rs`:
- `claim_pending_request()` now returns `PendingRequestClaim::Owner` or `PendingRequestClaim::Waiter`
- `finish_pending_request()` cleans up entries on successful completion
- `fetch_and_clone_accounts_with_dedup()` rewritten to:

RemoteAccountProvider Cancellation Safety (remote_account_provider)
Changes to `remote_account_provider/mod.rs`:
- `FetchingAccountGuard` now with explicit Drop cleanup
- `try_get_multi()` insertion loop now:
- stale `fetching_accounts` entries (>15s old)
- `AccountResolutionsFailed`

Observability
Both modules now log lifecycle events: