
feat: Enhanced Trace System: Flexible Hierarchical Tracing and New Frontend UI (Trace追踪功能完善,灵活追踪和前端新UI) #6168

Open
crt106 wants to merge 1 commit into AstrBotDevs:master from crt106:feat/new_trace

Conversation

@crt106 crt106 commented Mar 13, 2026


📝 Description | 项目描述

This PR rewrites the Trace/observability system from a flat record model into a full hierarchical span tree covering the entire message processing lifecycle — pipeline stages, LLM calls, tool calls, plugin handlers, and SubAgent orchestration — and exposes the system as a general-purpose API for third-party plugins to use.
此 PR 将原有的 Trace/追踪记录系统从扁平记录模型改写为完整的层次化 Span 树,覆盖消息处理全生命周期——Pipeline 阶段、LLM 调用、工具调用、插件处理器与 SubAgent 编排——并将其作为通用 API 开放给第三方插件使用。


🛠 Modifications | 改动点

1. Core Trace Infrastructure | 核心追踪基础设施

astrbot/core/utils/trace.py

  • Tree Structure Construction: TraceSpan extended with parent, span_type, input/output/meta, and children — forming a proper tree structure.
    树结构构建:TraceSpan 扩展了 parent、span_type、input/output/meta、children 属性,形成了完整的树形结构。
  • Implicit Propagation: Utilizes ContextVar[_current_span] for implicit span propagation through async call chains without explicit passing.
    隐式传播:利用 ContextVar[_current_span] 在异步调用链中隐式传播当前 Span,无需手动显式传参。
  • Developer API: Introduced span_context() async context manager, span_record() decorator, and get_current_span() as a general-purpose plugin tracing API.
    开发者 API:引入 span_context() 异步上下文管理器、span_record() 装饰器以及 get_current_span(),作为通用的插件追踪 API。
  • Resilience & Stability: All trace errors are swallowed so failures never affect original function execution; returns _NullSpan when disabled to avoid null-checks.
    容错与稳定性:所有追踪内部错误均被静默捕获,确保不影响原始函数执行;禁用时返回 _NullSpan 空桩,插件代码无需判空。

2. Pipeline Span Hierarchy | Pipeline Span 层次

scheduler.py, internal.py, star_request.py

  • Automated Nesting: Each stage span is pushed to ContextVar before executing, so LLMAgent, plugin_handler, and tool_call spans automatically nest under the correct stage.
    自动嵌套:每个阶段 Span 执行前推入 ContextVar,使 LLMAgent、插件处理器及工具调用等 Span 自动挂载在正确的阶段节点下。
  • Rich Metadata: LLMAgent records system_prompt_chars, context_length, tools, model, reasoning, and tool_calls in the output.
    丰富元数据:LLMAgent 完整记录系统提示词长度、上下文长度、工具集、模型、推理过程及工具调用细节。
  • Granular Loops: Includes per-step llm_call spans and per-tool tool_call spans inside the main agent loop.
    细粒度循环:在主 Agent 循环中为每一步 LLM 调用和每一次工具调用生成独立 Span。

3. Plugin Attribution | 插件归属标注

star_request.py, astr_agent_run_util.py

  • Identity Tagging: Every span carries meta.plugin (name) and meta.plugin_type ("builtin" / "third_party").
    身份打标:每个 Span 携带 meta.plugin(插件名)和 meta.plugin_type("builtin" 或 "third_party")。
  • Inheritance Mechanism: Attribution is inherited from ancestor spans so deeply nested operations remain traceable to their originating plugin.
    继承机制:子 Span 自动从祖先继承插件归属,即使是深层嵌套操作仍可溯源至原始插件。
  • Tool Resolution: @llm_tool tool_call spans resolve attribution via handler_module_path → star_map.
    工具反查:使用 @llm_tool 的工具调用通过 handler_module_path → star_map 机制精准定位所属插件。
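
One way to picture the inheritance rule is a walk up the ancestor chain until a span carrying `meta.plugin` is found. The sketch below is an assumption about the general idea only: the PR does not say whether attribution is copied at span creation or resolved at lookup time, and the `Span` dataclass and `resolve_attribution` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:  # hypothetical minimal span, for illustration only
    name: str
    parent: Optional["Span"] = None
    meta: dict = field(default_factory=dict)


def resolve_attribution(span: Span) -> dict:
    """Walk up the ancestor chain until a span carrying meta.plugin is found."""
    node = span
    while node is not None:
        if "plugin" in node.meta:
            return {
                "plugin": node.meta["plugin"],
                "plugin_type": node.meta.get("plugin_type"),
            }
        node = node.parent
    return {}  # no ancestor is attributed to a plugin


handler = Span("plugin_handler", meta={"plugin": "weather", "plugin_type": "third_party"})
fetch = Span("http_fetch", parent=handler)
parse = Span("parse_json", parent=fetch)
attribution = resolve_attribution(parse)
print(attribution)
```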

4. SubAgent Orchestration | SubAgent 编排支持

astr_agent_tool_exec.py, context.py

  • Handoff Tracing: _execute_handoff() creates a dedicated LLMAgent [subagent_name] span as a child of the current context before running the subagent.
    移交追踪:_execute_handoff() 在调用子 Agent 前创建专属的 LLMAgent [subagent_name] Span 并设为当前上下文的子节点。
  • Step-by-Step Loop: tool_loop_agent() replaces step_until_done() with a traced step loop; falls back to the original loop when tracing is off for zero performance impact.
    逐步循环:启用追踪时使用逐步迭代模式产生详细 Span;禁用时回退至原始路径,实现零性能损耗。

5. Public API & Frontend | 开放接口与前端展示

astrbot/api/, dashboard/src/

  • Developer Access: Exposes span_context, span_record, get_current_span for third-party plugin use.
    接口开放:向第三方插件开发者开放核心追踪 API。
  • Visual Enhancements: New SpanTree.vue for recursive rendering and SpanDetail.vue for I/O/Metadata inspection.
    视觉增强:新增 SpanTree.vue 支持递归树渲染,SpanDetail.vue 用于查看输入/输出/元数据详情。
  • Status Indicators: Deterministic, hash-based colors for plugins; a gear icon for built-in components and a puzzle-piece icon for third-party extensions.
    状态标识:基于哈希的插件唯一色;内置插件使用齿轮图标,第三方插件使用拼图图标。
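
As a rough illustration of hash-based deterministic coloring, the sketch below maps a plugin name to a stable hex color. The real logic lives in the Vue frontend and is not shown in this PR description, so the hashing scheme, the fixed lightness and saturation, and the `plugin_color` name are all assumptions.

```python
import colorsys
import hashlib


def plugin_color(plugin_name: str) -> str:
    """Map a plugin name to a stable hex color."""
    # hashlib is used because Python's built-in hash() for str is salted per process.
    digest = hashlib.sha256(plugin_name.encode("utf-8")).digest()
    hue = int.from_bytes(digest[:2], "big") / 65536  # in [0, 1)
    r, g, b = colorsys.hls_to_rgb(hue, 0.5, 0.6)  # fixed lightness and saturation
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


color = plugin_color("astrbot_plugin_weather")
print(color)
```

Deriving only the hue from the hash keeps every plugin color at the same brightness, so the tree stays readable regardless of which plugins appear in a trace.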

📚 Documentation | 文档

  • Full Developer Guides: Cover enabling trace, the three usage patterns (decorator, context manager, manual span retrieval), and fault isolation, in both English and Chinese.
    完整中英文指南:涵盖启用追踪、三种使用模式(装饰器、上下文管理器、手动获取)及故障隔离机制。
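
To make the decorator pattern and its fault isolation concrete, here is a minimal sketch of a `span_record`-style decorator that handles both sync and async functions. It is an illustration under assumptions, not the PR's implementation: the in-memory `records` list stands in for real spans, and async generator handlers (functions that `yield`) are not covered.

```python
import asyncio
import functools
import inspect
import logging

logger = logging.getLogger("trace_sketch")
records = []  # hypothetical in-memory stand-in for the span store


def span_record(name, record_input=False):
    def decorator(func):
        def start(args, kwargs):
            try:
                entry = {"name": name, "status": "running"}
                if record_input:
                    entry["input"] = repr((args, kwargs))[:2000]
                records.append(entry)
                return entry
            except Exception:  # tracing must never break the wrapped call
                logger.debug("[trace] failed to start span", exc_info=True)
                return None

        def finish(entry, status):
            if entry is not None:
                entry["status"] = status

        if inspect.iscoroutinefunction(func):

            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                entry = start(args, kwargs)
                try:
                    result = await func(*args, **kwargs)
                except Exception:
                    finish(entry, "error")
                    raise  # the business exception still propagates
                finish(entry, "ok")
                return result

            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            entry = start(args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                finish(entry, "error")
                raise
            finish(entry, "ok")
            return result

        return sync_wrapper

    return decorator


@span_record("demo.add", record_input=True)
def add(a, b):
    return a + b


@span_record("demo.fail")
async def fail():
    raise ValueError("boom")


total = add(1, 2)
try:
    asyncio.run(fail())
except ValueError:
    pass
print(records)
```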

⚠️ Breaking Changes | 变更说明

Important

This is NOT a breaking change. / 这不是一个破坏性变更。


🖼 Screenshots or Test Results | 运行截图或测试结果

[Three screenshots of the new trace UI]

Check run output:

➜  AstrBot git:(feat/new_trace) ✗ make pr-test-neo

./scripts/pr_test_env.sh --profile neo
==> Profile: neo
==> Sync dependencies: true
==> Run lint: true
==> Run smoke test: true
==> Build dashboard: false
==> Syncing dependencies with uv
Resolved 153 packages in 1ms
Uninstalled 1 package in 1ms
 - mistletoe==1.4.0
==> Preparing test directories
==> Running Ruff format check
397 files already formatted
==> Running Ruff lint check
All checks passed!
==> Running pytest
.........                                                                            [100%]
===================================== warnings summary =====================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

tests/test_dashboard.py::test_neo_skills_routes
  /data00/home/chaoruitao/Project/AstrBot/.venv/lib/python3.12/site-packages/lark_oapi/ws/pb/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
    _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)

tests/test_dashboard.py::test_neo_skills_routes
  /data00/home/chaoruitao/Project/AstrBot/.venv/lib/python3.12/site-packages/lark_oapi/ws/client.py:67: DeprecationWarning: websockets.InvalidStatusCode is deprecated
    def _parse_ws_conn_exception(e: websockets.InvalidStatusCode):

tests/test_dashboard.py::test_neo_skills_routes
  /data00/home/chaoruitao/Project/AstrBot/.venv/lib/python3.12/site-packages/websockets/legacy/__init__.py:6: DeprecationWarning: websockets.legacy is deprecated; see https://websockets.readthedocs.io/en/stable/howto/upgrade.html for upgrade instructions
    warnings.warn(  # deprecated in 14.0 - 2024-11-09

tests/test_dashboard.py::test_neo_skills_routes
  /data00/home/chaoruitao/Project/AstrBot/data/plugins/astrbot_plugin_hapi_connector/main.py:41: DeprecationWarning: The 'register_star' decorator is deprecated and will be removed in a future version.
    @register("astrbot_plugin_hapi_connector", "LiJinHao999",

tests/test_dashboard.py::test_neo_skills_routes
  /data00/home/chaoruitao/Project/AstrBot/astrbot/dashboard/routes/auth.py:85: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
    "exp": datetime.datetime.utcnow() + datetime.timedelta(days=7),

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
9 passed, 8 warnings in 10.82s
==> Starting smoke test on http://localhost:6185
==> Smoke test passed
==> PR checks completed successfully

✅ Checklist | 检查清单

  • Discussion: I have discussed new features with the authors through issues/emails.
    讨论:PR 中的新功能已预先与作者沟通。
  • Testing: Changes are well-tested; "Verification Steps" and "Screenshots" are provided.
    测试:更改已通过测试,并提供了验证步骤与截图。
  • Dependencies: No new dependencies introduced (or they are correctly added to pyproject.toml). Note: the frontend does have a dependency update.
    依赖:确保未引入未记录的新依赖(但是前端有依赖更新)。
  • Security: My changes do not introduce malicious code.
    安全:代码无恶意逻辑。


@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 13, 2026

@sourcery-ai sourcery-ai bot left a comment

Sorry @crt106, your pull request is larger than the review limit of 150000 diff characters

@dosubot dosubot bot added area:core The bug / feature is about astrbot's core, backend area:webui The bug / feature is about webui(dashboard) of astrbot. labels Mar 13, 2026

dosubot bot commented Mar 13, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

AstrBotTeam's Space

pr4697的改动
View Suggested Changes
@@ -1364,6 +1364,147 @@
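 - Enhanced JWT handling and error handling, improving system security and stability
 - UI detail refinements for a better user experience
 - Enhanced logging and exception handling, making issues easier to trace
+
+---
+
+### 15. Trace System (Hierarchical Observability) (PR #6168)
+
+#### Overview
+[PR #6168](https://github.com/AstrBotDevs/AstrBot/pull/6168) introduces a complete hierarchical tracing system, rewriting the original flat record model as a span tree that covers the full message processing lifecycle. The system traces all of the following:
+
+- Pipeline stage execution timing
+- LLM calls and token usage
+- Tool/function calls
+- Plugin handler calls
+- SubAgent orchestration
+
+All trace data is:
+- Broadcast to the WebUI in real time via SSE (Server-Sent Events)
+- Written to the trace log file
+- Persisted asynchronously to the SQLite database
+- Displayed on the new Trace page through the SpanTree and SpanDetail UI components
+
+#### Configuration
+Enable or disable tracing in the configuration file:
+
+```yaml
+trace_enable: true  # defaults to false
+```
+
+When disabled, tracing adds effectively zero overhead (about 200 ns per decorated call, one config dictionary lookup).
+
+#### Trace API for Plugin Developers
+
+The trace system exposes a public API to third-party plugin developers through the `astrbot.api.trace` module.
+
+##### Decorator Usage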
+```python
+from astrbot.api.trace import span_record
+from astrbot.core.star import Star, command
+
+class MyPlugin(Star):
+    @command("weather")
+    @span_record("plugin.weather", span_type="plugin_call", record_input=True)
+    async def get_weather(self, event: AstrMessageEvent, city: str):
+        result = await self._fetch_weather(city)
+        yield event.plain_result(result)
+```
+
+##### Context Manager Usage
+```python
+import httpx
+from astrbot.api.trace import span_context
+
+async def fetch_data(self, url: str):
+    async with span_context("http_fetch", span_type="io_call") as s:
+        s.set_input(url=url)
+        # httpx.get() is synchronous; use AsyncClient for an awaitable request
+        async with httpx.AsyncClient() as client:
+            response = await client.get(url)
+        s.set_output(status=response.status_code, size=len(response.content))
+        return response.json()
+```
+
+##### Manual Span Operations
+```python
+from astrbot.api.trace import get_current_span
+
+def process_data(self, data):
+    span = get_current_span()  # returns None when tracing is disabled
+    if span:
+        span.set_meta(data_size=len(data), format="json")
+    # ... processing logic
+```
+
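+#### Trace Span Tree Structure
+
+Each trace starts with a root span representing one complete `AstrMessageEvent` processing cycle. Child spans nest under their parent:
+
+- **Root Span** (trace_id == span_id)
+  - **Pipeline Stages** (type: `pipeline_stage`)
+    - **LLM Calls** (type: `llm_call`)
+      - Metadata includes `input_tokens`, `output_tokens`, `model`
+    - **Tool Calls** (type: `tool_call`)
+    - **Plugin Handlers** (type: `plugin_call`)
+      - Metadata includes the `plugin` name and `plugin_type`
+    - **SubAgent Execution** (type: `subagent_call`)
+    - **IO Operations** (type: `io_call`)
+
+#### Plugin Attribution
+All spans created inside a plugin handler automatically inherit the `plugin` and `plugin_type` metadata from their parent, which makes it easy to filter and query traces by plugin in the dashboard.
+
+#### Database Schema
+Trace records are stored in the `traces` table, which has the following fields:
+- `trace_id`: unique trace identifier
+- `umo`: unified message origin identifier
+- `sender_name`: name of the message sender
+- `message_outline`: message summary
+- `started_at`, `finished_at`, `duration_ms`: timing information
+- `status`: "ok", "error", or "running"
+- `spans`: the complete hierarchical span tree (JSON)
+- `total_input_tokens`, `total_output_tokens`: aggregate token counts across all LLM calls
+
+#### Best Practices for Plugin Developers
+
+1. **Use meaningful span names**: prefix them with the plugin name, e.g. `"plugin.weather.fetch"`
+2. **Set an appropriate span type**: use `plugin_call`, `io_call`, `tool_call`, etc.
+3. **Record input and output**: capture relevant data with `set_input()` and `set_output()`
+4. **Keep data concise**: input/output strings are truncated automatically to avoid bloat
+5. **Use metadata**: store extra context via `set_meta()`, such as API keys (redacted) or retry counts
+
+#### Behavior When Trace Is Disabled
+When `trace_enable = false`:
+- Functions decorated with `@span_record` are called directly, with no extra object creation or ContextVar operations; the overhead is about 200 ns (one config dictionary lookup)
+- `span_context` returns a `_NullSpan` stub object whose methods are all no-ops
+- `get_current_span()` returns `None`
+
+Plugin code can safely call the trace API without checking whether trace is enabled.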
+
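+#### Related Files
+- `astrbot/api/trace.py`: public API for plugin developers
+- `astrbot/core/utils/trace.py`: core tracing implementation
+- `astrbot/core/db/po.py`: TraceEntry database model
+- `astrbot/core/pipeline/scheduler.py`: pipeline-level trace integration
+- Dashboard UI: `SpanTree.vue`, `SpanDetail.vue`
+
+#### Key Implementation Points
+
+##### Implicit Propagation via ContextVar
+The trace system uses `contextvars.ContextVar` from the Python standard library to implicitly propagate the currently active span through async call chains, so span objects do not need to be passed explicitly in every function call.
+
+##### Fault Isolation
+No error in the trace infrastructure (SSE broadcast failure, JSON serialization error, database write failure, etc.) affects normal execution of the original function:
+
+- The `_on_root_finish()` method (responsible for broadcasting and persistence) is fully wrapped in `try/except`; exceptions are only logged at debug level
+- The `finish()` calls in `span_context` and `trace_span` are also exception-safe, so the original business exception always propagates correctly
+- Both the sync and async wrappers of `span_record` silently suppress all trace-related exceptions
+
+##### Performance Overhead
+
+| Scenario | Added overhead |
+|------|----------|
+| Trace disabled (default) | ~200 ns (one config dictionary lookup) |
+| Trace enabled, creating a child span | ~3–5 μs (mostly `uuid4` generation) |
+| Serialization and broadcast at request end | ~50–200 μs (one-time, non-blocking) |
+
+For a typical LLM request (500 ms–5 s), total trace overhead is < 0.05%.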
 
 ---
 


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the system's observability by transforming the flat trace record model into a comprehensive hierarchical span tree. This enhancement provides granular visibility into the entire message processing lifecycle, from pipeline stages and LLM interactions to tool and plugin executions. The changes also introduce a robust public API for third-party plugin integration and a new, visually rich frontend UI for intuitive trace analysis, enabling developers to better understand and debug complex workflows.

Highlights

  • Core Trace Infrastructure: Rewrote the TraceSpan to support a full hierarchical tree structure with parent, span_type, input/output/meta, and children. Implemented implicit span propagation using ContextVar[_current_span] for async call chains. Introduced a developer API including span_context() async context manager, span_record() decorator, and get_current_span(). Ensured resilience by swallowing all trace errors and returning _NullSpan when disabled to avoid null-checks.
  • Pipeline Span Hierarchy: Enabled automated nesting of LLMAgent, plugin_handler, and tool_call spans under the correct pipeline stage by pushing each stage span to ContextVar. Enhanced metadata recording for LLMAgent, including system_prompt_chars, context_length, tools, model, reasoning, and tool_calls. Added granular llm_call and tool_call spans within the main agent loop.
  • Plugin Attribution: Implemented identity tagging for every span with meta.plugin (name) and meta.plugin_type ('builtin' / 'third_party'). Introduced an inheritance mechanism for attribution, ensuring deeply nested operations are traceable to their originating plugin. Enabled tool resolution for @llm_tool tool_call spans via handler_module_path.
  • SubAgent Orchestration: Added handoff tracing by creating a dedicated LLMAgent [subagent_name] span as a child of the current context before running a subagent. Replaced step_until_done() with a traced step loop in tool_loop_agent() that falls back to the original loop when tracing is off to ensure zero performance impact.
  • Public API & Frontend: Exposed span_context, span_record, and get_current_span for third-party plugin development. Introduced new Vue components (SpanTree.vue and SpanDetail.vue) for visual enhancements, supporting recursive rendering and detailed inspection of I/O/Metadata. Added status indicators with deterministic colors for plugins and distinct icons for built-in vs. third-party components.
  • Documentation: Provided full developer guides in both English and Chinese, covering how to enable tracing, usage patterns, and fault isolation.
Changelog
  • .gitignore
    • Added new entries for CLAUDE.md and .claude/ to the ignore list.
  • astrbot/api/all.py
    • Imported new tracing API functions: span_context, span_record, and get_current_span.
  • astrbot/api/trace.py
    • Added a new module to expose the public tracing API for AstrBot plugins.
  • astrbot/core/astr_agent_run_util.py
    • Integrated tracing for LLM agent steps and tool calls, including span creation, input/output recording, and plugin attribution resolution.
  • astrbot/core/astr_agent_tool_exec.py
    • Modified _execute_handoff to create dedicated LLMAgent spans for subagent orchestration, capturing input and output.
  • astrbot/core/astr_main_agent.py
    • Updated persona selection to create a child span for tracing, recording persona ID and toolset.
  • astrbot/core/db/__init__.py
    • Added TraceEntry to the database models and introduced abstract methods for trace management (insert, get, delete).
  • astrbot/core/db/po.py
    • Defined the TraceEntry SQLModel for persisting hierarchical trace data, including span details, token usage, and metadata.
  • astrbot/core/db/sqlite.py
    • Implemented _ensure_traces_table for database migration and concrete methods for insert_trace, get_traces, get_trace_detail, and delete_traces_before.
  • astrbot/core/log.py
    • Introduced a dedicated trace_cache in LogBroker and a publish_trace method for real-time trace broadcasting. Updated configure_trace_logger to enable file logging based on trace_enable setting.
  • astrbot/core/pipeline/process_stage/method/agent_sub_stages/internal.py
    • Integrated LLMAgent span creation and context propagation within the internal processing stage, capturing system prompts, context length, and tool usage.
  • astrbot/core/pipeline/process_stage/method/star_request.py
    • Integrated plugin handler span creation and context propagation, including plugin attribution and error handling.
  • astrbot/core/pipeline/scheduler.py
    • Integrated pipeline stage span creation and context propagation, ensuring the entire request lifecycle is traced and spans are properly managed on exit.
  • astrbot/core/platform/astr_message_event.py
    • Updated AstrMessageEvent to initialize a hierarchical TraceSpan as its root, capturing message origin, sender, and outline.
  • astrbot/core/star/context.py
    • Imported astrbot_config and _trace_current_span to support traced step loops in tool_loop_agent.
  • astrbot/core/utils/trace.py
    • Rewrote the core tracing system to use hierarchical TraceSpan objects and ContextVar for implicit propagation. Added span_context and span_record APIs, _NullSpan for disabled tracing, and logic for persistence and broadcasting.
  • astrbot/dashboard/routes/log.py
    • Added new API endpoints for managing trace data, including get_trace_history, list_traces, get_trace_detail, and clear_traces. Updated LogRoute initialization to accept db_helper.
  • astrbot/dashboard/server.py
    • Updated the Quart app initialization to include instance_path and modified LogRoute instantiation to pass the db_helper.
  • dashboard/package.json
    • Added vue-json-pretty as a new dependency.
  • dashboard/pnpm-lock.yaml
    • Updated the pnpm lock file to include vue-json-pretty.
  • dashboard/src/components/shared/SpanDetail.vue
    • Added a new Vue component for displaying detailed information of a selected trace span.
  • dashboard/src/components/shared/SpanTree.vue
    • Added a new Vue component for recursively rendering the hierarchical span tree with visual indicators for type, status, and plugin attribution.
  • dashboard/src/components/shared/TraceDisplayer.vue
    • Rewrote the TraceDisplayer component to fetch and display paginated trace records, integrate real-time updates via SSE, and support search and filtering.
  • dashboard/src/i18n/locales/en-US/features/trace.json
    • Updated English localization strings for the trace feature, including new hints and actions.
  • dashboard/src/i18n/locales/zh-CN/features/trace.json
    • Updated Chinese localization strings for the trace feature, including new hints and actions.
  • dashboard/src/layouts/full/FullLayout.vue
    • Adjusted the layout to accommodate the new Trace page, ensuring proper height and overflow handling.
  • dashboard/src/views/TracePage.vue
    • Rewrote the TracePage to integrate TraceDisplayer, SpanTree, and SpanDetail components, enabling interactive viewing, filtering, and management of trace data.
  • docs/en/dev/star/guides/trace.md
    • Added new English documentation detailing the request tracing system, core concepts, API usage in plugins, and dashboard viewing instructions.
  • docs/zh/dev/star/guides/trace.md
    • Added new Chinese documentation detailing the request tracing system, core concepts, API usage in plugins, and dashboard viewing instructions.
Activity
  • The pull request includes a comprehensive rewrite of the tracing system.
  • All checks passed successfully, as indicated by the make pr-test-neo output.
  • Screenshots of the new frontend UI for trace visualization have been provided.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a significant and well-executed enhancement to the observability of the system. It replaces the flat tracing model with a full hierarchical span tree, which is a great improvement for debugging and performance analysis. The introduction of a developer-facing tracing API (span_context, span_record) is a fantastic addition for plugin authors. The new frontend UI for traces also looks very polished and useful.

The implementation is thorough, covering various parts of the pipeline from the scheduler to agent execution and sub-agent orchestration. The use of ContextVar for implicit span propagation is a clean and modern approach.

I have a few suggestions, mostly related to improving robustness and maintainability by avoiding silent exception swallowing and reducing code duplication. Overall, this is an excellent contribution.

Comment on lines +285 to +286
    except Exception:
        pass

high

Silently swallowing exceptions with except Exception: pass can hide bugs and make debugging very difficult. This pattern appears multiple times in this function. It's crucial to at least log these exceptions, even if it's at a debug level, to maintain visibility into potential issues within the tracing logic.

Suggested change
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug(f"Error processing tool_call span: {e}")
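
Since the same pattern recurs many times, one option is to centralize it in a small context manager so each call site stays short while failures remain visible. The following is a minimal sketch of that idea; the `trace_guard` name and the logger name are hypothetical and not part of the PR.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("trace_sketch")


@contextmanager
def trace_guard(what: str):
    """Suppress errors raised by tracing code, but log them at debug level."""
    try:
        yield
    except Exception:
        # exc_info=True attaches the traceback, so the failure stays debuggable.
        logger.debug("[trace] %s failed", what, exc_info=True)


# Demo: capture log messages in a list to show that the error was recorded.
captured = []


class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())


logger.addHandler(ListHandler())
logger.setLevel(logging.DEBUG)

with trace_guard("record tool_call output"):
    raise ValueError("boom")  # simulated failure inside tracing code

print(captured)
```

The guard should wrap only tracing calls such as `set_output()` or `finish()`. Wrapping business logic in it would swallow exceptions that callers need to see.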

Comment on lines +166 to +171
    try:
        loop = asyncio.get_event_loop()
        if loop.is_running():
            asyncio.create_task(self._persist_to_db(trace_dict))
    except Exception:
        pass

high

This block has two issues:

  1. It uses asyncio.get_event_loop(), which is deprecated since Python 3.10. It's recommended to use asyncio.get_running_loop() instead, which is more explicit and raises a RuntimeError if no loop is running.
  2. The broad except Exception: pass will silently swallow any errors, making it very hard to debug issues with trace persistence.

Please consider updating this block to use the modern API and to log any exceptions for better debuggability.

Suggested change
-    try:
-        loop = asyncio.get_event_loop()
-        if loop.is_running():
-            asyncio.create_task(self._persist_to_db(trace_dict))
-    except Exception:
-        pass
+    try:
+        loop = asyncio.get_running_loop()
+        loop.create_task(self._persist_to_db(trace_dict))
+    except RuntimeError:
+        logger.warning("[trace] Cannot persist trace to DB: not in a running event loop.")
+    except Exception as e:
+        logger.debug(f"[trace] Failed to schedule DB persistence: {e}")
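
A related detail when scheduling fire-and-forget work: the event loop keeps only weak references to tasks, so the asyncio documentation recommends holding a strong reference until the task completes. The sketch below combines that with `get_running_loop()`; the `schedule_persist` helper, its factory argument, and the demo coroutine are hypothetical names, not code from the PR.

```python
import asyncio
import logging

logger = logging.getLogger("trace_sketch")
_background_tasks = set()  # strong references keep pending tasks alive


def schedule_persist(coro_factory):
    """Schedule coro_factory() on the running loop; return the task or None."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        logger.warning("[trace] Cannot persist trace: no running event loop.")
        return None
    # The coroutine is created only once a loop is known to exist, which
    # avoids a "coroutine was never awaited" warning on the failure path.
    task = loop.create_task(coro_factory())
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task


saved = []


async def persist_trace():
    saved.append("trace")  # stand-in for the real database write


async def main():
    task = schedule_persist(persist_trace)
    await task  # awaited here only so the demo can observe the result


outside = schedule_persist(persist_trace)  # called with no running loop
asyncio.run(main())
print(outside, saved)
```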

Comment on lines +414 to +417
    try:
        s.set_output(result=str(result)[:2000])
    except Exception:
        pass

high

Throughout this file, there are several try...except Exception: pass blocks. While it's important that tracing logic doesn't crash the main application, silently swallowing all exceptions makes it impossible to debug problems with the tracing system itself. Please consider logging these exceptions at a debug level. This will provide visibility during development without cluttering production logs.

Suggested change
-    try:
-        s.set_output(result=str(result)[:2000])
-    except Exception:
-        pass
+    try:
+        s.set_output(result=str(result)[:2000])
+    except Exception as e:
+        logger.debug(f"[trace] Failed to record output: {e}")

    )
    # Per-step span and per-tool-call spans
    _step_span = None
    _tool_spans: dict[str, object] = {}  # call_id -> TraceSpan

medium

For better type safety and code clarity, consider using a more specific type hint for _tool_spans. Instead of dict[str, object], you could use dict[str, "TraceSpan"]. This uses a forward reference and improves readability, making it clear what kind of object is expected in the dictionary.

Suggested change
-    _tool_spans: dict[str, object] = {}  # call_id -> TraceSpan
+    _tool_spans: dict[str, "TraceSpan"] = {}  # call_id -> TraceSpan

Comment on lines +379 to +380
    except Exception:
        return None

medium

This broad except Exception clause silently swallows any errors that might occur while resolving tool metadata. This can make debugging difficult. It would be better to log the exception at a debug or warning level before returning None.

Suggested change
-    except Exception:
-        return None
+    except Exception as e:
+        logger.debug(f"Failed to resolve tool plugin meta for {tool_name}: {e}")
+        return None

Comment on lines +251 to +370
    if _trace_on and _trace_parent is not None:
        _step_count = 0
        _step_span = None
        _tool_spans: dict[str, object] = {}

        def _get_chain_tool_info(chain):
            try:
                first = chain.chain[0] if chain and chain.chain else None
                data = getattr(first, "data", None)
                return data if isinstance(data, dict) else None
            except Exception:
                return None

        async def _run_one_step():
            nonlocal _step_span, _tool_spans
            async for resp in agent_runner.step():
                if resp.type == "tool_call":
                    try:
                        ti = _get_chain_tool_info(resp.data.get("chain"))
                        if ti and _step_span is not None:
                            ts = _step_span.child(
                                ti.get("name", "tool"), span_type="tool_call"
                            )
                            args = ti.get("arguments", {})
                            ts.set_input(
                                **(
                                    args
                                    if isinstance(args, dict)
                                    else {"args": args}
                                )
                            )
                            tid = str(ti.get("id", ""))
                            if tid:
                                _tool_spans[tid] = ts
                    except Exception:
                        pass
                elif resp.type == "tool_call_result":
                    try:
                        chain = resp.data.get("chain")
                        rd = _get_chain_tool_info(chain)
                        if rd:
                            tid = str(rd.get("id", ""))
                            ts = _tool_spans.pop(tid, None)
                            if ts is not None and ts.finished_at is None:
                                result = chain.get_plain_text(
                                    with_other_comps_mark=True
                                )
                                ts.set_output(result=result[:4000])
                                ts.finish()
                    except Exception:
                        pass
                elif resp.type == "llm_result":
                    try:
                        resp_chain = resp.data.get("chain")
                        if _step_span is not None:
                            _step_span.set_output(
                                completion=(
                                    resp_chain.get_plain_text()[:2000]
                                    if resp_chain
                                    else ""
                                )
                            )
                            if (
                                agent_runner.stats
                                and agent_runner.stats.token_usage
                            ):
                                _step_span.set_meta(
                                    input_tokens=agent_runner.stats.token_usage.input,
                                    output_tokens=agent_runner.stats.token_usage.output,
                                )
                    except Exception:
                        pass
                    finally:
                        if (
                            _step_span is not None
                            and _step_span.finished_at is None
                        ):
                            _step_span.finish()
                        _step_span = None

        while not agent_runner.done() and _step_count < max_steps:
            _step_count += 1
            _step_span = _trace_parent.child(
                f"llm_step_{_step_count}",
                span_type="llm_call",
                model=agent_runner.provider.get_model()
                if agent_runner.provider
                else "",
            )
            _tool_spans = {}
            await _run_one_step()
            if _step_span is not None and _step_span.finished_at is None:
                _step_span.finish()
            _step_span = None

        if not agent_runner.done():
            # Max steps reached: strip tools and force a final response
            if agent_runner.req:
                agent_runner.req.func_tool = None
            agent_runner.run_context.messages.append(
                Message(
                    role="user",
                    content="工具调用次数已达到上限,请停止使用工具,并根据已经收集到的信息,对你的任务和发现进行总结,然后直接回复用户。",
                )
            )
            _step_span = _trace_parent.child(
                f"llm_step_{_step_count + 1}",
                span_type="llm_call",
                model=agent_runner.provider.get_model()
                if agent_runner.provider
                else "",
            )
            _tool_spans = {}
            await _run_one_step()
            if _step_span is not None and _step_span.finished_at is None:
                _step_span.finish()
    else:
        async for _ in agent_runner.step_until_done(max_steps):
            pass
    # ─────────────────────────────────────────────────────────────────────

medium

This block of code for a traced step loop is very similar to the logic in run_agent in astrbot/core/astr_agent_run_util.py. This duplication could make future maintenance harder as any change would need to be applied in two places. Consider refactoring this complex logic into a shared utility function or class that both tool_loop_agent and run_agent can use. This would improve maintainability and reduce code duplication.
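
One possible shape for such a shared helper is a small class that owns the per-step and per-tool span bookkeeping, leaving each caller with only its own event loop. This is a sketch under assumptions, not a proposal taken from the PR: `StepTracer` and `FakeSpan` are hypothetical names, and `FakeSpan` mimics only the `child()`, `set_input()`, `set_output()`, and `finish()` calls that appear in the snippet above.

```python
class FakeSpan:
    """Hypothetical stand-in for TraceSpan, just enough for the sketch."""

    def __init__(self, name, span_type="generic", **meta):
        self.name, self.span_type, self.meta = name, span_type, dict(meta)
        self.children, self.input, self.output = [], {}, {}
        self.finished = False

    def child(self, name, span_type="generic", **meta):
        span = FakeSpan(name, span_type, **meta)
        self.children.append(span)
        return span

    def set_input(self, **kw):
        self.input.update(kw)

    def set_output(self, **kw):
        self.output.update(kw)

    def finish(self):
        self.finished = True


class StepTracer:
    """Owns the per-step and per-tool span bookkeeping for one agent run."""

    def __init__(self, parent):
        self.parent = parent
        self.step_count = 0
        self.step_span = None
        self.tool_spans = {}  # call_id -> span

    def begin_step(self, model=""):
        self.step_count += 1
        self.step_span = self.parent.child(
            f"llm_step_{self.step_count}", span_type="llm_call", model=model
        )
        self.tool_spans = {}

    def on_tool_call(self, call_id, name, arguments):
        if self.step_span is None:
            return
        span = self.step_span.child(name, span_type="tool_call")
        payload = arguments if isinstance(arguments, dict) else {"args": arguments}
        span.set_input(**payload)
        self.tool_spans[call_id] = span

    def on_tool_result(self, call_id, result):
        span = self.tool_spans.pop(call_id, None)
        if span is not None:
            span.set_output(result=result[:4000])
            span.finish()

    def end_step(self, completion=""):
        if self.step_span is not None:
            self.step_span.set_output(completion=completion[:2000])
            self.step_span.finish()
            self.step_span = None


root = FakeSpan("LLMAgent")
tracer = StepTracer(root)
tracer.begin_step(model="example-model")
tracer.on_tool_call("call_1", "get_weather", {"city": "Paris"})
tracer.on_tool_result("call_1", "sunny")
tracer.end_step(completion="It is sunny in Paris.")
print([c.name for c in root.children])
```

With a helper like this, both `tool_loop_agent` and `run_agent` would translate agent responses into `on_tool_call`, `on_tool_result`, and `end_step` calls, so a change to span naming or truncation limits would be made in one place.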


Labels

area:core The bug / feature is about astrbot's core, backend area:webui The bug / feature is about webui(dashboard) of astrbot. size:XXL This PR changes 1000+ lines, ignoring generated files.
