feat: Enhanced Trace System: Flexible Hierarchical Tracing and New Frontend UI (Trace追踪功能完善,灵活追踪和前端新UI)#6168
feat: Enhanced Trace System: Flexible Hierarchical Tracing and New Frontend UI (Trace追踪功能完善,灵活追踪和前端新UI)#6168crt106 wants to merge 1 commit intoAstrBotDevs:masterfrom
Conversation
There was a problem hiding this comment.
Sorry @crt106, your pull request is larger than the review limit of 150000 diff characters
|
Related Documentation 1 document(s) may need updating based on files changed in this PR: AstrBotTeam's Space pr4697的改动View Suggested Changes@@ -1364,6 +1364,147 @@
- JWT 处理和错误处理机制增强,提升系统安全性和稳定性
- UI 细节优化,提升用户体验
- 日志与异常处理增强,便于问题追踪
+
+---
+
+### 15. Trace 系统(层次化可观测性)(PR #6168)
+
+#### 功能概述
+[PR #6168](https://github.com/AstrBotDevs/AstrBot/pull/6168) 引入了完整的层次化追踪系统,将原有扁平记录模型改写为 Span 树结构,覆盖消息处理全生命周期。该系统追踪以下所有环节:
+
+- Pipeline 阶段执行时序
+- LLM 调用与 token 用量
+- 工具/函数调用
+- 插件处理器调用
+- SubAgent 编排
+
+所有 trace 数据会:
+- 通过 SSE(Server-Sent Events)实时广播到 WebUI
+- 写入 trace 日志文件
+- 异步持久化到 SQLite 数据库
+- 在新的 Trace 页面通过 SpanTree 和 SpanDetail UI 组件展示
+
+#### 配置方式
+在配置文件中启用或禁用追踪:
+
+```yaml
+trace_enable: true # 默认为 false
+```
+
+禁用时,所有追踪操作的开销为零。
+
+#### 面向插件开发者的 Trace API
+
+trace 系统通过 `astrbot.api.trace` 模块向第三方插件开发者开放公开 API。
+
+##### 装饰器用法
+```python
+from astrbot.api.trace import span_record
+from astrbot.core.star import Star, command
+
+class MyPlugin(Star):
+ @command("weather")
+ @span_record("plugin.weather", span_type="plugin_call", record_input=True)
+ async def get_weather(self, event: AstrMessageEvent, city: str):
+ result = await self._fetch_weather(city)
+ yield event.plain_result(result)
+```
+
+##### 上下文管理器用法
+```python
+from astrbot.api.trace import span_context
+
+async def fetch_data(self, url: str):
+ async with span_context("http_fetch", span_type="io_call") as s:
+ s.set_input(url=url)
+ response = await httpx.get(url)
+ s.set_output(status=response.status_code, size=len(response.content))
+ return response.json()
+```
+
+##### 手动 Span 操作
+```python
+from astrbot.api.trace import get_current_span
+
+def process_data(self, data):
+ span = get_current_span()
+ if span:
+ span.set_meta(data_size=len(data), format="json")
+ # ... 处理逻辑
+```
+
+#### Trace Span 树结构
+
+每个 trace 以一个根 span 开始,代表一个完整的 `AstrMessageEvent` 处理周期。子 span 嵌套在父节点下:
+
+- **Root Span** (trace_id == span_id)
+ - **Pipeline Stages** (type: `pipeline_stage`)
+ - **LLM Calls** (type: `llm_call`)
+ - 元数据包含 `input_tokens`、`output_tokens`、`model`
+ - **Tool Calls** (type: `tool_call`)
+ - **Plugin Handlers** (type: `plugin_call`)
+ - 元数据包含 `plugin` 名称和 `plugin_type`
+ - **SubAgent Execution** (type: `subagent_call`)
+ - **IO Operations** (type: `io_call`)
+
+#### 插件归属
+插件处理器内创建的所有 span 会自动从父节点继承 `plugin` 和 `plugin_type` 元数据,便于在 dashboard 中按插件过滤和查询 trace。
+
+#### 数据库模式
+trace 记录存储在 `traces` 表中,包含以下字段:
+- `trace_id`:trace 唯一标识符
+- `umo`:统一消息来源标识符
+- `sender_name`:消息发送者名称
+- `message_outline`:消息摘要
+- `started_at`、`finished_at`、`duration_ms`:时序信息
+- `status`:"ok"、"error"、"running"
+- `spans`:完整的层次化 span 树(JSON)
+- `total_input_tokens`、`total_output_tokens`:所有 LLM 调用的聚合 token 统计
+
+#### 插件开发者最佳实践
+
+1. **使用有意义的 span 名称**:前缀加上插件名,例如 `"plugin.weather.fetch"`
+2. **设置合适的 span 类型**:使用 `plugin_call`、`io_call`、`tool_call` 等
+3. **记录输入和输出**:使用 `set_input()` 和 `set_output()` 捕获相关数据
+4. **保持数据简洁**:输入/输出字符串会自动截断,避免数据膨胀
+5. **使用元数据**:通过 `set_meta()` 存储额外上下文,如 API 密钥(脱敏后)、重试次数等
+
+#### 禁用 trace 时的行为
+当 `trace_enable = false` 时:
+- 使用 `@span_record` 装饰的函数直接调用,无额外对象创建或 ContextVar 操作,开销约 200 纳秒(一次配置字典查找)
+- `span_context` 返回 `_NullSpan` 空桩对象,其方法均为空操作
+- `get_current_span()` 返回 `None`
+
+插件代码无需检查 trace 是否启用即可安全调用 trace API。
+
+#### 相关文件
+- `astrbot/api/trace.py`:插件开发者公开 API
+- `astrbot/core/utils/trace.py`:核心追踪实现
+- `astrbot/core/db/po.py`:TraceEntry 数据库模型
+- `astrbot/core/pipeline/scheduler.py`:Pipeline 级 trace 集成
+- Dashboard UI:`SpanTree.vue`、`SpanDetail.vue`
+
+#### 技术实现要点
+
+##### ContextVar 隐式传播
+trace 系统使用 Python 标准库的 `contextvars.ContextVar` 在异步调用链中隐式传播当前活跃的 span,无需在每个函数调用中显式传递 span 对象。
+
+##### 故障隔离
+trace 基础设施的任何错误(SSE 广播失败、JSON 序列化错误、数据库写入失败等)不会影响原始函数的正常执行:
+
+- `_on_root_finish()` 方法(负责广播和持久化)完全包裹在 `try/except` 中,异常仅记录在 debug 级别
+- `span_context` 和 `trace_span` 中的 `finish()` 调用也是异常安全的,原始业务异常始终能正确传播
+- `span_record` 的同步和异步包装器都会静默抑制所有 trace 相关异常
+
+##### 性能开销
+
+| 场景 | 额外开销 |
+|------|----------|
+| 禁用 trace(默认) | ~200 ns(一次配置字典查找) |
+| 启用 trace,创建子 span | ~3–5 μs(主要为 `uuid4` 生成) |
+| 请求结束时的序列化和广播 | ~50–200 μs(一次性,非阻塞) |
+
+对于典型的 LLM 请求(500 ms–5 s),总 trace 开销 < 0.05%。
---
Note: You must be authenticated to accept/decline updates. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly upgrades the system's observability by transforming the flat trace record model into a comprehensive hierarchical span tree. This enhancement provides granular visibility into the entire message processing lifecycle, from pipeline stages and LLM interactions to tool and plugin executions. The changes also introduce a robust public API for third-party plugin integration and a new, visually rich frontend UI for intuitive trace analysis, enabling developers to better understand and debug complex workflows. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request is a significant and well-executed enhancement to the observability of the system. It replaces the flat tracing model with a full hierarchical span tree, which is a great improvement for debugging and performance analysis. The introduction of a developer-facing tracing API (span_context, span_record) is a fantastic addition for plugin authors. The new frontend UI for traces also looks very polished and useful.
The implementation is thorough, covering various parts of the pipeline from the scheduler to agent execution and sub-agent orchestration. The use of ContextVar for implicit span propagation is a clean and modern approach.
I have a few suggestions, mostly related to improving robustness and maintainability by avoiding silent exception swallowing and reducing code duplication. Overall, this is an excellent contribution.
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Silently swallowing exceptions with except Exception: pass can hide bugs and make debugging very difficult. This pattern appears multiple times in this function. It's crucial to at least log these exceptions, even if it's at a debug level, to maintain visibility into potential issues within the tracing logic.
| except Exception: | |
| pass | |
| except Exception as e: | |
| logger.debug(f"Error processing tool_call span: {e}") |
| try: | ||
| loop = asyncio.get_event_loop() | ||
| if loop.is_running(): | ||
| asyncio.create_task(self._persist_to_db(trace_dict)) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
This block has two issues:
- It uses
asyncio.get_event_loop(), which is deprecated since Python 3.10. It's recommended to useasyncio.get_running_loop()instead, which is more explicit and raises aRuntimeErrorif no loop is running. - The broad
except Exception: passwill silently swallow any errors, making it very hard to debug issues with trace persistence.
Please consider updating this block to use the modern API and to log any exceptions for better debuggability.
| try: | |
| loop = asyncio.get_event_loop() | |
| if loop.is_running(): | |
| asyncio.create_task(self._persist_to_db(trace_dict)) | |
| except Exception: | |
| pass | |
| try: | |
| loop = asyncio.get_running_loop() | |
| loop.create_task(self._persist_to_db(trace_dict)) | |
| except RuntimeError: | |
| logger.warning("[trace] Cannot persist trace to DB: not in a running event loop.") | |
| except Exception as e: | |
| logger.debug(f"[trace] Failed to schedule DB persistence: {e}") |
| try: | ||
| s.set_output(result=str(result)[:2000]) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
Throughout this file, there are several try...except Exception: pass blocks. While it's important that tracing logic doesn't crash the main application, silently swallowing all exceptions makes it impossible to debug problems with the tracing system itself. Please consider logging these exceptions at a debug level. This will provide visibility during development without cluttering production logs.
| try: | |
| s.set_output(result=str(result)[:2000]) | |
| except Exception: | |
| pass | |
| try: | |
| s.set_output(result=str(result)[:2000]) | |
| except Exception as e: | |
| logger.debug(f"[trace] Failed to record output: {e}") |
| ) | ||
| # Per-step span and per-tool-call spans | ||
| _step_span = None | ||
| _tool_spans: dict[str, object] = {} # call_id -> TraceSpan |
There was a problem hiding this comment.
For better type safety and code clarity, consider using a more specific type hint for _tool_spans. Instead of dict[str, object], you could use dict[str, "TraceSpan"]. This uses a forward reference and improves readability, making it clear what kind of object is expected in the dictionary.
| _tool_spans: dict[str, object] = {} # call_id -> TraceSpan | |
| _tool_spans: dict[str, "TraceSpan"] = {} # call_id -> TraceSpan |
| except Exception: | ||
| return None |
There was a problem hiding this comment.
This broad except Exception clause silently swallows any errors that might occur while resolving tool metadata. This can make debugging difficult. It would be better to log the exception at a debug or warning level before returning None.
| except Exception: | |
| return None | |
| except Exception as e: | |
| logger.debug(f"Failed to resolve tool plugin meta for {tool_name}: {e}") | |
| return None |
| if _trace_on and _trace_parent is not None: | ||
| _step_count = 0 | ||
| _step_span = None | ||
| _tool_spans: dict[str, object] = {} | ||
|
|
||
| def _get_chain_tool_info(chain): | ||
| try: | ||
| first = chain.chain[0] if chain and chain.chain else None | ||
| data = getattr(first, "data", None) | ||
| return data if isinstance(data, dict) else None | ||
| except Exception: | ||
| return None | ||
|
|
||
| async def _run_one_step(): | ||
| nonlocal _step_span, _tool_spans | ||
| async for resp in agent_runner.step(): | ||
| if resp.type == "tool_call": | ||
| try: | ||
| ti = _get_chain_tool_info(resp.data.get("chain")) | ||
| if ti and _step_span is not None: | ||
| ts = _step_span.child( | ||
| ti.get("name", "tool"), span_type="tool_call" | ||
| ) | ||
| args = ti.get("arguments", {}) | ||
| ts.set_input( | ||
| **( | ||
| args | ||
| if isinstance(args, dict) | ||
| else {"args": args} | ||
| ) | ||
| ) | ||
| tid = str(ti.get("id", "")) | ||
| if tid: | ||
| _tool_spans[tid] = ts | ||
| except Exception: | ||
| pass | ||
| elif resp.type == "tool_call_result": | ||
| try: | ||
| chain = resp.data.get("chain") | ||
| rd = _get_chain_tool_info(chain) | ||
| if rd: | ||
| tid = str(rd.get("id", "")) | ||
| ts = _tool_spans.pop(tid, None) | ||
| if ts is not None and ts.finished_at is None: | ||
| result = chain.get_plain_text( | ||
| with_other_comps_mark=True | ||
| ) | ||
| ts.set_output(result=result[:4000]) | ||
| ts.finish() | ||
| except Exception: | ||
| pass | ||
| elif resp.type == "llm_result": | ||
| try: | ||
| resp_chain = resp.data.get("chain") | ||
| if _step_span is not None: | ||
| _step_span.set_output( | ||
| completion=( | ||
| resp_chain.get_plain_text()[:2000] | ||
| if resp_chain | ||
| else "" | ||
| ) | ||
| ) | ||
| if ( | ||
| agent_runner.stats | ||
| and agent_runner.stats.token_usage | ||
| ): | ||
| _step_span.set_meta( | ||
| input_tokens=agent_runner.stats.token_usage.input, | ||
| output_tokens=agent_runner.stats.token_usage.output, | ||
| ) | ||
| except Exception: | ||
| pass | ||
| finally: | ||
| if ( | ||
| _step_span is not None | ||
| and _step_span.finished_at is None | ||
| ): | ||
| _step_span.finish() | ||
| _step_span = None | ||
|
|
||
| while not agent_runner.done() and _step_count < max_steps: | ||
| _step_count += 1 | ||
| _step_span = _trace_parent.child( | ||
| f"llm_step_{_step_count}", | ||
| span_type="llm_call", | ||
| model=agent_runner.provider.get_model() | ||
| if agent_runner.provider | ||
| else "", | ||
| ) | ||
| _tool_spans = {} | ||
| await _run_one_step() | ||
| if _step_span is not None and _step_span.finished_at is None: | ||
| _step_span.finish() | ||
| _step_span = None | ||
|
|
||
| if not agent_runner.done(): | ||
| # Max steps reached — strip tools and force a final response | ||
| if agent_runner.req: | ||
| agent_runner.req.func_tool = None | ||
| agent_runner.run_context.messages.append( | ||
| Message( | ||
| role="user", | ||
| content="工具调用次数已达到上限,请停止使用工具,并根据已经收集到的信息,对你的任务和发现进行总结,然后直接回复用户。", | ||
| ) | ||
| ) | ||
| _step_span = _trace_parent.child( | ||
| f"llm_step_{_step_count + 1}", | ||
| span_type="llm_call", | ||
| model=agent_runner.provider.get_model() | ||
| if agent_runner.provider | ||
| else "", | ||
| ) | ||
| _tool_spans = {} | ||
| await _run_one_step() | ||
| if _step_span is not None and _step_span.finished_at is None: | ||
| _step_span.finish() | ||
| else: | ||
| async for _ in agent_runner.step_until_done(max_steps): | ||
| pass | ||
| # ───────────────────────────────────────────────────────────────────── |
There was a problem hiding this comment.
This block of code for a traced step loop is very similar to the logic in run_agent in astrbot/core/astr_agent_run_util.py. This duplication could make future maintenance harder as any change would need to be applied in two places. Consider refactoring this complex logic into a shared utility function or class that both tool_loop_agent and run_agent can use. This would improve maintainability and reduce code duplication.
📝 Description | 项目描述
🛠 Modifications | 改动点
1. Core Trace Infrastructure | 核心追踪基础设施
astrbot/core/utils/trace.pyTraceSpanextended withparent,span_type,input/output/meta, andchildren— forming a proper tree structure.树结构构建:
TraceSpan扩展了parent、span_type、input/output/meta和children属性,形成了完整的树形结构。ContextVar[_current_span]for implicit span propagation through async call chains without explicit passing.隐式传播:利用
ContextVar[_current_span]在异步调用链中隐式传播当前 Span,无需手动显式传参。span_context()async context manager,span_record()decorator, andget_current_span()as a general-purpose plugin tracing API.开发者 API:引入
span_context()异步上下文管理器、span_record()装饰器以及get_current_span(),作为通用的插件追踪 API。_NullSpanwhen disabled to avoid null-checks.容错与稳定性:所有追踪内部错误均被静默捕获,确保不影响原始函数执行;禁用时返回
_NullSpan空桩,插件代码无需判空。2. Pipeline Span Hierarchy | Pipeline Span 层次
scheduler.py,internal.py,star_request.pyContextVarbefore executing, so LLMAgent, plugin_handler, and tool_call spans automatically nest under the correct stage.自动嵌套:每个阶段 Span 执行前推入
ContextVar,使 LLMAgent、插件处理器及工具调用等 Span 自动挂载在正确的阶段节点下。system_prompt_chars,context_length,tools,model,reasoning, andtool_callsin the output.丰富元数据:LLMAgent 完整记录系统提示词长度、上下文长度、工具集、模型、推理过程及工具调用细节。
llm_callspans and per-tooltool_callspans inside the main agent loop.细粒度循环:在主 Agent 循环中为每一步 LLM 调用和每一次工具调用生成独立 Span。
3. Plugin Attribution | 插件归属标注
star_request.py,astr_agent_run_util.pymeta.plugin(name) andmeta.plugin_type("builtin" / "third_party").身份打标:每个 Span 携带
meta.plugin(插件名)和meta.plugin_type("builtin" 或 "third_party")。继承机制:子 Span 自动从祖先继承插件归属,即使是深层嵌套操作仍可溯源至原始插件。
@llm_tooltool_call spans resolve attribution viahandler_module_path→star_map.工具反查:使用
@llm_tool的工具调用通过handler_module_path→star_map机制精准定位所属插件。4. SubAgent Orchestration | SubAgent 编排支持
astr_agent_tool_exec.py,context.py_execute_handoff()creates a dedicatedLLMAgent [subagent_name]span as a child of the current context before running the subagent.移交追踪:
_execute_handoff()在调用子 Agent 前创建专属的LLMAgent [subagent_name]Span 并设为当前上下文的子节点。tool_loop_agent()replacesstep_until_done()with a traced step loop; falls back to the original loop when tracing is off for zero performance impact.逐步循环:启用追踪时使用逐步迭代模式产生详细 Span;禁用时回退至原始路径,实现零性能损耗。
5. Public API & Frontend | 开放接口与前端展示
astrbot/api/,dashboard/src/span_context,span_record,get_current_spanfor third-party plugin use.接口开放:向第三方插件开发者开放核心追踪 API。
SpanTree.vuefor recursive rendering andSpanDetail.vuefor I/O/Metadata inspection.视觉增强:新增
SpanTree.vue支持递归树渲染,SpanDetail.vue用于查看输入/输出/元数据详情。状态标识:基于哈希的插件唯一色;内置插件使用齿轮图标,第三方插件使用拼图图标。
📚 Documentation | 文档
完整中英文指南:涵盖启用追踪、三种使用模式(装饰器、上下文管理器、手动获取)及故障隔离机制。
Important
This is NOT a breaking change. / 这不是一个破坏性变更。
🖼 Screenshots or Test Results | 运行截图或测试结果
运行检查:
✅ Checklist | 检查清单
讨论:PR 中的新功能已预先与作者沟通。
测试:更改已通过测试,并提供了验证步骤与截图。
pyproject.toml).依赖:确保未引入未记录的新依赖(但是前端有依赖更新)。
安全:代码无恶意逻辑。
Would you like me to help you refine the technical wording of any specific section or generate a sample screenshot layout description?