Skip to content

report hung radostrace ops on interrupt#92

Open
zhhuabj wants to merge 1 commit into
taodd:mainfrom
zhhuabj:detect_hang_ops_with_radostrace
Open

report hung radostrace ops on interrupt#92
zhhuabj wants to merge 1 commit into
taodd:mainfrom
zhhuabj:detect_hang_ops_with_radostrace

Conversation

@zhhuabj
Copy link
Copy Markdown
Contributor

@zhhuabj zhhuabj commented May 13, 2026

Handle SIGINT without exiting immediately so radostrace can dump operations still pending in the BPF ops map before cleanup.

  • add SIGINT state tracking in radostrace
  • iterate the ops BPF map and print unfinished operations
  • include stuck duration, client, tid, target osd, object, and rw mode
  • return cleanly after interrupt/timeout-triggered shutdown

@zhhuabj
Copy link
Copy Markdown
Contributor Author

zhhuabj commented May 13, 2026

test result - https://paste.ubuntu.com/p/FvpMC8XsCX/

Comment thread src/radostrace.cc Outdated
clog << "process killed" << endl;
got_sigint = 1;
} else {
exit(signum);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exit(3) isn't a async-signal-safe function: https://www.man7.org/linux/man-pages/man7/signal-safety.7.html

better to use _exit() instead.

The same concern is applicable for clog too. We need to review "signal safety" across the codease. But that can be done separately later.

Comment thread src/radostrace.cc Outdated
int found = 0;

memset(&key, 0, sizeof(key));
while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get the first key that's not in the map, bpf_map_get_next_key needs to be called with NULL as key. But a valid "key" is passed now (key is zero-initalized but it's not NULL), so this is likely to miss the correct first key.

struct client_op_k *cur = NULL;
while (bpf_map_get_next_key(map_fd, cur, &next_key) == 0) {
    if (bpf_map_lookup_elem(map_fd, &next_key, &val) == 0) {
        ...
    }
    key = next_key;
    cur = &key;
}

would fix it.

@pponnuvel
Copy link
Copy Markdown
Collaborator

I also think "hung ops" better describes than "hang ops", so suggest name changes as such.

@zhhuabj zhhuabj force-pushed the detect_hang_ops_with_radostrace branch from 3baee40 to aa20743 Compare May 14, 2026 13:34
Report unfinished radostrace ops from the BPF ops map before exit when tracing stops
via Ctrl+C or when the session timeout expires.
- keep SIGINT on a graceful shutdown path
- dump pending ops with duration, client, tid, osd, object, and rw mode
- report pending ops before exiting on -t timeout
- prevent false positives when ring buffer reservation cannot reserve space

Signed-off-by: Zhang Hua <joshua.zhang@canonical.com>
@zhhuabj zhhuabj force-pushed the detect_hang_ops_with_radostrace branch from aa20743 to b05de17 Compare May 14, 2026 14:16
Copy link
Copy Markdown
Owner

@taodd taodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to add a test to deliberately cause a hang Op

Comment thread src/radostrace.bpf.c
}
if (name_len > 0) {
bpf_probe_read_user(val->object_name, name_len, (void *)name_base);
}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the original code here

Comment thread src/radostrace.cc
return 0;
}

static int report_hung_ops(struct radostrace_bpf *skel,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to change the format about the hung ops output.

  1. Keep the same format as the complete ops output
  2. Add a column to indicate whether an op is complete or incomplete

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants