chore(test): Feishu report readability + skill audit + CI hardening #59
Merged
Conversation
An item-by-item inspection based on a real ai-case sample (实体商品一期-需求评审-2026032500) found
10 issues in the report.md template that hurt readability. This change rebuilds the
TEMPLATES.md report.md template and the PHASES.md 5.1 description around the Feishu cloud-doc scenario.
The 10 fixes:
1. Remove the duplicated H1 title — the Feishu doc name already contains the words 「需求评审」, so having
   the AI also write `# {需求名称} - 需求评审报告` produces two nearly identical title lines
2. Broken feature-list format — the old template mixed a half-list `F1 | 模块 | 描述` with grouping, and
   on render the group header merged into one line with the first feature; changed to `### {模块}` H3 groups
   + `- F1: 描述` list items
3. (abridged) Add a header to the review statistics — the review_checklist.md template already has one;
   on the report side we only make explicit that bare numeric fields are no longer embedded
4. Replace the ASCII status symbols [OK]/[!]/[X] everywhere with emoji ✅/⚠️/❌; a new "key constraints"
   note explicitly forbids falling back to ASCII
5. Delete the whole §7 "review completeness statement" — 90% of it duplicates the §1 metadata and the
   counts at the end of §3, and AI self-assessments like "X% blocking items is within a reasonable range"
   are worthless to readers
6. The 46-item feature list is too long — force it into a `<details><summary>` wrapper collapsed by
   default; after import, Feishu cloud docs render it as an expandable block
7. Internal dimension codes such as FB2/LC2 + UX3 — all removed; keep only the human-readable dimension
   names 「功能边界」「逻辑一致性」
8. New §4 "time-box checklist" — reorganizes the to-dos from §3 + §5 by "before the review meeting / in
   the meeting / before dev / before test handoff", using reference numbers + one-line actions instead of
   repeating the original issue text
9. Metadata adds `分析方式:AI 辅助 + 评审会决策`, stressing that conclusions are finalized by humans in
   the review meeting; no AI signatures like 「评审人:QA AI 助手」 (avoids enterprise users questioning
   the credibility of the conclusions)
10. Section renumbering: 1 review verdict / 2 requirement understanding / 3 per-role issues / 4 time-box
    checklist / 5 risks / 6 review scope statement (the old version jumped 1/2/3/4/5/7)
The next requirement-review session will automatically generate from the new template. Expected:
- Report size drops from 8.2KB to roughly 3-4KB (collapsed feature list + removal of §7)
- Key to-dos are visible at a glance in §4, instead of being pieced together from the PM/Dev/QA sections of §3
- The Feishu doc title and body no longer repeat each other
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The first version of PR #59 introduced emoji ✅/⚠️/❌ and the HTML <details><summary> collapse block on the assumption that they "look better". Uploading a mock report to Feishu docx today and verifying via the doc_blocks API showed:
1. ❌ triggers Feishu's bitable block parsing logic; the whole line gets swallowed
2. ✅ is auto-converted to [OK], ⚠️ is auto-converted to [!]
3. The <details>/<summary> HTML tags are ignored entirely; the summary content instead becomes a misleading plain-text "click to expand" line (users cannot actually click it)
Five immediate fixes:
1. Unify status symbols as Chinese square-bracket markers (preserved verbatim by Feishu import):
   - readiness: [通过] / [有条件] / [不通过] (replacing ✅/⚠️/❌)
   - information boundary: [已获取] / [未获取] / [不完整] (replacing ✅/❌/⚠️)
   - the intermediate artifact review_checklist.md keeps its 4 markers [已确认]/[需关注]/[待确认]/[不适用]; a new "symbol conventions" paragraph at the top of TEMPLATES explicitly forbids mixing the sets
2. Remove the <details> collapse constraint; list the feature points in full, grouped per module with H4 `#### {模块名}` (H4 has lower visual weight than H3 and does not disturb the outline)
3. Key constraints gain explicit bans on emoji and details, with the tested reasons attached
Also fixed 4 structural issues found during this review (A1/A2/A4/F2):
4. The TEMPLATES.md table of contents said "7-section structure" but there are actually 6; corrected
5. The review_checklist.md template had the section number ## 三、 appear twice (standard QA checklist + doc-vs-design cross-comparison); renumbered to 三/四/五
6. The SKILL.md intent-recognition table adds routing for smoke-test / test-case-generation, so a user saying "run a smoke test" is no longer misrouted
7. SKILL.md output-phase chat wording aligned with PHASES.md 5.2: chat gives a metrics summary + the Feishu link; the detailed template lives in PHASES.md
8. The TEMPLATES.md section 三、standard QA checklist header now explicitly references the concrete check items in CHECKLIST.md, so the AI does not have to reinvent them
Verified:
- A mock report generated from the latest template uploads to Feishu with all 58 blocks rendering correctly
- 0 emoji residue, 0 ASCII [OK] residue, 0 HTML tag residue
- H2/H3/H4 heading hierarchy is correct; bullet lists and the §4 time-box structure are intact
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
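A pre-upload guard in the spirit of these bans could scan a rendered report for the forbidden tokens before handing it to the uploader. A minimal sketch — the function name and token lists are illustrative, derived from the constraints above, not actual ai-case code:

```python
# Tokens the Feishu docx import corrupts or swallows (per the findings above).
FORBIDDEN_TOKENS = [
    "✅", "⚠️", "❌",            # emoji: auto-converted, ❌ triggers bitable parsing
    "[OK]", "[!]", "[X]",        # ASCII fallbacks: banned for cross-workflow drift
    "<details>", "<summary>",    # HTML collapse: ignored, summary becomes plain text
]

def lint_report(markdown: str) -> list[str]:
    """Return the forbidden tokens present in a report; empty list means clean."""
    return [tok for tok in FORBIDDEN_TOKENS if tok in markdown]

# Chinese square-bracket markers survive Feishu import verbatim:
assert lint_report("就绪度:[通过]\n- F1: 描述") == []
assert lint_report("就绪度: ✅ 通过") == ["✅"]
```

A check like this would run before every upload, so a template regression fails loudly instead of silently producing a corrupted Feishu doc.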
The test-case-review skill previously had no TEMPLATES.md; the review_summary.md produced in the summary phase was left entirely to the AI's improvisation. A real sample (分包上传 DC 开放入口-用例评审-2026-03-04) showed these problems:
1. §1 review overview, §2.1 coverage mapping, §3.2 missing scenarios, and §4.1 / §7 all used markdown tables; after Feishu import they split into discrete text/bullet blocks and read as prose, not tables (5 spots ≈ 31 scattered lines in the sample)
2. Heavy use of emoji ✅/⚠️; Feishu import auto-converts them to [OK]/[!], and ❌ additionally triggers bitable block parsing (see the PR #59 D1 verification)
3. No "test-case readiness verdict" — after reading, you still don't know whether the suite passed review
4. §4 / §5 / §6 overlap heavily: the 15 high/medium/low improvement suggestions in §6 restate §4 (6 items) + §5 (9 items) almost 1:1
5. The same case (e.g. b3769bb8) is called out in §3 / §4 / §5; a fixer has to flip back and forth to piece together what that case actually needs changed
6. The "QA AI 助手自动生成" signature weakens enterprise users' trust in the conclusions
New TEMPLATES.md enforces an 8-section structure (0-7):
- §0 review verdict: readiness [通过]/[有条件]/[不通过] + graded fix list
- §1 review overview: core metrics + coverage target line ≥ 90%
- §2 requirement coverage: covered / partially covered / uncovered listed as three groups (no tables)
- §3 scenario completeness: non-closed end-to-end flows + missing edge scenarios grouped by requirement point
- §4 case correctness / §5 case style conformance
- §6 fix to-dos (grouped by case ID): the core actionable section; cross-section issues gathered in one place, case-ID lines in bold + to-do lines with a `· ` prefix to keep the hierarchy
- §7 suggested additional cases
Key constraints ("关键约束" subsection):
- No markdown tables (Feishu import scatters them into loose lines)
- No emoji ✅/⚠️/❌ (Feishu import corrupts the content)
- §6 must not restate §3/§4/§5 details (refer by section number instead)
- No AI-assistant signature (use 「分析方式:AI 辅助 + 评审人决策」)
- No H1 (the Feishu doc name already contains 「{需求名称}-用例评审-{YYYYMMDDHH}」)
- §6 to-dos keep hierarchy via "bold case ID + `· ` prefix" (Feishu flattens nested bullets, but bold + the `·` character still separate the levels visually)
Also updated:
- PHASES.md 5.4 references the new TEMPLATES and lists the 8-section structure
- The SKILL.md intermediate-file list marks review_summary.md as the final Feishu deliverable
Verified (mock report, 88 blocks):
- 0 emoji residue / 0 ASCII residue / 0 table triggers / 0 bitable triggers
- §6 case-ID lines carry the [BOLD] marker; to-do lines have no bold + a `·` prefix
- All Chinese square-bracket markers ([有条件]/[已覆盖]/[P0]/[完整性] etc.) preserved verbatim
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The change-analysis skill previously had no TEMPLATES.md; code_change_analysis.md and test_coverage_report.md were left entirely to the AI's improvisation. A real sample (进社区默认落 tab 实验-变更分析-2026042310) showed these problems:
1. All 8 markdown tables triggered Feishu-import line scattering (change groups, API model, forum module, 4.1 direct impact, 4.2 indirect impact, confidence, PR Test plan, coverage summary) — the highest table density of the 4 reports
2. Heavy use of ASCII [OK]/[X]/[!] (the auto-conversion output of Feishu import)
3. ★/★★/★★★ risk marks inconsistent with the cross-workflow [阻断]/[高]/[中]/[低] grading
4. No global "analysis readiness" verdict — you finish reading without knowing whether the change is low/medium/high risk
5. Section 5 "confidence notes" sits at the very end — only after 9KB do you learn some conclusions are speculation
6. No cross-references between the multi-doc outputs — code_change_analysis §4.4 regression scope, test_coverage_report §2 gaps, and the ca_sup TC cases are three isolated islands
7. Vague coverage stats — "total key change points ≈ 13 → ≈ 54%" with no stated basis
8. Extremely long code diff blocks (Swift source) embedded in the natural-language report: redundant from the PM/QA viewpoint, yet needed by Dev
New TEMPLATES.md enforces two report structures:
【code_change_analysis.md】6 sections
- §0 change assessment: change risk [低]/[中]/[高] + evidence-vs-speculation ratio
- §1 change group overview (H3 + bullets, no tables)
- §2 core change deep dive (every conclusion line ends with [实证]/[推测])
  * each code diff block must be preceded by the quote hint 「以下仅 Dev 可跳过」
- §3 call-chain analysis (code blocks keep the ASCII tree)
- §4 impact assessment (4.1/4.2 as bullets; 4.4 regression scope must cross-reference → 详见 [测试覆盖评估] 缺口 N)
- §5 confidence index (evidence-vs-speculation ratio + section list; an index, not a restatement of conclusions)
【test_coverage_report.md】5 sections
- §0 coverage verdict: test coverage [已达标]/[有缺口]/[严重不足] + an exact total change-point count (no "about N") + coverage target line ≥ 80%
- §1 PR Test Plan coverage (bullet list)
- §2 coverage gap analysis (each gap → 详见 [代码变更分析] §X cross-reference + → ca_sup TC-XX back-reference)
- §3 coverage summary (covered / uncovered / to-assess as three lists, exact formula at the end)
- §4 high-priority supplementary test suggestions (bidirectional references between gaps and TC-XX)
Key constraints ("关键约束" subsection):
- No markdown tables (Feishu import scatters them)
- No emoji ✅/⚠️/❌ (Feishu import corrupts content)
- No ASCII [OK]/[!]/[X] (semantic drift across workflows)
- No ★/★★/★★★ (inconsistent with cross-workflow grading symbols)
- Code diff blocks must be preceded by a quote hint
- Multi-doc outputs must reference each other in both directions
- Confidence tagged per line as [实证]/[推测]
- Coverage stats must be exact; "about N" is forbidden
Also updated:
- PHASES.md phase 4 / phase 5 reference the new TEMPLATES and list the section structures
- The SKILL.md Story-scenario 7-phase table adds TEMPLATES links + a Feishu multi-doc note
Verified (mock report, 35 blocks):
- 0 emoji residue / 0 ASCII residue / 0 table triggers / 0 star residue
- code blocks + quote blocks render correctly
- bold markers preserved (§0 metadata, change-point titles, §5 ratio)
- Chinese square brackets [高]/[中]/[实证]/[推测] all preserved verbatim
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The smoke-test workflow reuses the requirement-traceability skill's smoke-test mode, but PHASES.md previously only constrained the structured smoke_test_report.json output, not the human-readable report.md uploaded to Feishu cloud docs. Problems exposed by a real sample (插件升级提示-冒烟测试报告-2026040710):
1. All 5 markdown tables triggered Feishu-import line scattering (basic info + metrics + P0 case assessment + requirement coverage matrix + code change traceability) — tied with ca for the most
2. Decorative emoji section prefixes 📋/🐛/📊/💡 auto-converted by Feishu to the ASCII forms [Doc]/[Bug]/[Chart]/[Tip]
3. The out-of-scope emoji ⭕ also corrupted
4. Heavy use of ASCII [OK]/[!]
5. The smoke verdict is not on the first line (you must read through the basic-info table first)
6. Over-long defect code blocks (Kotlin source) with no Dev-skip hint
7. Confidence scattered per requirement point (87%/75%/85%) with no overall mean
8. Weak linkage between defects and cases (the case linked to DEF-01 is not among the 6 P0-assessed cases)
9. "Untraced changes: 0 (no scope creep)" is vague; PMs don't know what "scope creep" means
10. The "本报告由 QA AI 助手...自动生成" signature weakens trust
New TEMPLATES.md section 「report.md (smoke-test mode only)」 with a 6-section structure:
- §0 smoke-test verdict (first line `判定:[通过]/[不通过]` + overall confidence mean)
- §1 core metrics (bullet list, no tables)
- §2 P0 case assessment (heading must state `共 N/总数`, numerator and denominator explicit)
  * bold case ID + bullet `· ` prefix (following the test-case-review §6 pattern)
- §3 dual-channel traceability verdict (§3.1 requirement coverage matrix + §3.2 code change traceability)
  * untraced changes at the end of §3.2 must be stated explicitly (no vague "no scope creep" wording)
- §4 defect list (each DEF's code block preceded by the quote hint 「以下代码块仅 Dev 排查时阅读」)
- §5 other observations (bold short-word categories)
Key constraints ("关键约束" subsection):
- No markdown tables (Feishu scatters them)
- No emoji ⭕/✅/⚠️/❌ (⭕ turns into mojibake, ❌ triggers bitable)
- No ASCII [OK]/[!]/[X] (semantic drift across workflows)
- No decorative emoji section prefixes 📋/🐛/📊/💡 (Feishu auto-converts them to ASCII)
- No 「QA AI 助手」 signature
- Code blocks must be preceded by a quote hint
- The §2 heading must state numerator and denominator (共 N/总数)
- §3.2 untraced changes must be stated explicitly
- Confidence tagged per item and aggregated into §0
Synced PHASES.md 5S.2: between the defect_list.json and smoke_test_report.json artifacts, add a third step "write report.md" referencing the 6-section structure in TEMPLATES.
Verified (mock report, 53 blocks):
- 0 emoji residue / 0 ASCII residue / 0 table triggers / 0 decorative section prefixes
- bold markers preserved (§0 metadata / defect field names / §5 categories)
- Chinese square brackets [通过]/[待定]/[已覆盖]/[范围外]/[P0]/[推测] all preserved verbatim
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The PHASES.md 5.2 chat output wording '单 Agent 串行' (single-agent serial) is inconsistent with the schema enum ['multi_perspective', 'single_agent']. When writing rr_summary.json the AI literally translated '串行' → 'serial' and produced single_agent_serial, violating contracts/rr-summary.schema.json validation (non-blocking, but it pollutes output_data._schema_violations).
Fix:
1. The 5.2 wording changes from '单 Agent 串行' to '单 Agent', aligning with the enum
2. The 5.1.5 constraint list gains a review_mode enum note + a blacklist of the single_agent_serial/serial variants, using the same mandatory phrasing as verdict
Observed case: ai-case session 248 triggered this violation on completion; verdict + 49 issues were all fine, only the review_mode field was polluted.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
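The violation is mechanical: the enum admits exactly two values, and anything else is collected rather than raised (which is why the run completed with a polluted field). A stdlib-only sketch of that non-blocking check — the field subset is invented for illustration; the real check runs against contracts/rr-summary.schema.json:

```python
# Enum from the schema cited above; any other value is a violation.
REVIEW_MODE_ENUM = ("multi_perspective", "single_agent")

def schema_violations(summary: dict) -> list[str]:
    """Non-blocking validation: collect violations instead of raising."""
    problems = []
    mode = summary.get("review_mode")
    if mode is not None and mode not in REVIEW_MODE_ENUM:
        problems.append(f"review_mode: {mode!r} not in {REVIEW_MODE_ENUM}")
    return problems

assert schema_violations({"review_mode": "single_agent"}) == []
# The literal translation '串行' -> 'serial' produced this polluted value:
assert schema_violations({"review_mode": "single_agent_serial"}) != []
```

Because violations are logged into output_data._schema_violations rather than aborting the session, a doc/enum mismatch like this one survives until someone reads the log — hence the blacklist added to the prompt constraints.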
…isions resolved
A systematic review covering all 16 skills, 10 agents, 5 contract schemas, and shared
infrastructure (CONVENTIONS.md, CONTRACT_SPEC.md, root docs). Lifts version 0.0.8 → 0.0.17.
Highlights:
CI baseline (tests/validate.sh): 5 → 9 check categories.
- Add Check 10/11/12/13: SKILL frontmatter name vs dir, handoffs targets, subagent
references, references/* paths (test plugin scope only)
- Add Check 14: cross-skill contract.yaml consistency via validate_contracts.py +
known-collisions.yaml allowlist (primary+fallback / shared-helper patterns)
- Extend tests/check-schemas.sh: 1 → 5 schemas (testcase / ca-summary / defect-list
/ rr-summary / smoke-test-report) with ~30 negative samples
- Fix sandbox false-green: new_tmpdir hard-fails on mktemp denial
Zombie agent cleanup: 3 deleted, ~13 dead refs cleaned.
- agents/requirement-traceability/forward-tracer.md (DEPRECATED, prompt inlined to
PHASES 3.2.4)
- agents/requirement-traceability/reverse-tracer.md (already 'main agent inline' per
PHASES 3.3, zero Task calls)
- agents/failure-classifier.md (zero refs in test-failure-analyzer/{SKILL,PHASES})
Contract collision resolution (8 detected → 4 renamed + 4 allowlisted):
- risk_assessment.json: change-analysis renamed to bug_risk_assessment.json
- test_execution_report.json: split into unit_test_execution_report.json /
integration_test_execution_report.json
- supplementary_cases.json (3-way): change-analysis → change_supplementary_cases.json,
test-case-review → review_supplementary_cases.json, test-case-generation keeps
canonical (internal scratch)
- ui_fidelity_report.json + 4 writeback files + requirement_points.md: allowlisted as
primary+fallback or shared-helper outputs
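The collision check underlying these renames can be approximated as: collect each contract's declared output filenames, then flag any filename produced by more than one skill unless the allowlist covers it. A simplified sketch — in-memory dicts stand in for parsing the contract.yaml files and known-collisions.yaml, and the second `ui_fidelity_report.json` producer is invented for the demo:

```python
# Hypothetical stand-ins for parsed contract.yaml outputs and the allowlist.
outputs_by_skill = {
    "change-analysis": ["change_supplementary_cases.json", "bug_risk_assessment.json"],
    "test-case-review": ["review_supplementary_cases.json"],
    "requirement-traceability": ["ui_fidelity_report.json"],
    "ui-review": ["ui_fidelity_report.json"],  # invented second producer for the demo
}
allowlist = {"ui_fidelity_report.json"}  # primary+fallback pattern: same semantics

def collisions(outputs: dict[str, list[str]], allow: set[str]) -> dict[str, list[str]]:
    """Map each non-allowlisted output filename to the skills producing it."""
    producers: dict[str, list[str]] = {}
    for skill, files in outputs.items():
        for f in files:
            producers.setdefault(f, []).append(skill)
    return {f: s for f, s in producers.items() if len(s) > 1 and f not in allow}

assert collisions(outputs_by_skill, allowlist) == {}          # allowlisted: clean
assert "ui_fidelity_report.json" in collisions(outputs_by_skill, set())
```

The renames above resolve a collision by making the filenames distinct; the allowlist resolves it by declaring the duplication intentional (same semantics, mutually exclusive producers).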
Dependency honesty:
- shared-tools/requirements.txt: corrected from 'stdlib only' to list pycryptodome /
pyyaml (required) + jsonschema / python-dotenv (optional)
- feedback/requirements.txt: new, declares aiohttp (required by feishu_api.py)
- feedback/contract.yaml: env_vars now includes FEISHU_PLUGIN_ID/SECRET/USER_KEY
Helper UX consistency: 5 scripts now support -h/--help cleanly (search_mrs/prs,
gitlab/github/metersphere helper).
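The help-flag fix boils down to a guard at the top of each helper's entry point, so `-h`/`--help` is answered locally instead of being forwarded as a query. A sketch of the pattern — the usage text and function shape here are illustrative, not the actual script code:

```python
# Illustrative usage text; the real scripts define their own.
USAGE = "usage: search_mrs.py [-h] QUERY"

def main(argv: list[str]) -> int:
    # Intercept -h/--help before any argument reaches the network path,
    # mirroring the fix: previously '--help' was sent as a search query.
    if len(argv) > 1 and argv[1] in ("-h", "--help"):
        print(USAGE)
        return 0
    query = argv[1] if len(argv) > 1 else ""
    print(f"searching for: {query}")  # the real helper would call the API here
    return 0

assert main(["search_mrs.py", "--help"]) == 0  # prints usage, no network call
assert main(["search_mrs.py", "login bug"]) == 0
```

The same guard doubles as the "unknown subcommand" handler in the multi-command helpers: print usage, exit 0.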
Documentation:
- 3 root-level docs (AI_CODING_BEST_PRACTICES / PIPELINES / CONTRACT_SPEC) tagged
with audience notes (developer references, not runtime deps)
- README directory-structure section: each root doc annotated with its role
- CONVENTIONS L466 numbering prefix table fixed: RP- belongs to TCG/TCR understand
(not 'TCG review')
- 2 broken anchors fixed: CONVENTIONS.md#审计输出格式 → _shared/ASSERTION_AUDIT.md
- AGENT_TEMPLATE: section-name exception extended for sub-agents (case-tracer)
- WORKFLOW_DEFS qa-lite: writeback behavior corrected from 'will fail' → 'gracefully
skip' (matches PHASES 6.1.b actual behavior)
- integration-test-design: 5 phases extracted to PHASES.md (matches unit-test-design
pattern)
- feedback SKILL: added Quick Start + Closing Checklist
- metersphere-sync: documented list-stages command + stdout shape
- change-analysis SKILL: added codex-change-analyzer to model layering table
- metersphere-sync: deprecated confidence_threshold field removed (helper has ignored it
  since v0.0.16; safe to drop)
CI: 79 PASS / 0 FAIL, EXIT=0.
…reakdown positive/negative examples (v0.0.18)
ca-summary.schema.json:
- Remove the scenario field (no backend, no downstream skill consumes it)
- risk_count description gains an anti-confusion note (it is the integer total; bucketed counts go in risk_breakdown)
- risk_breakdown description gains an example
rr-summary.schema.json:
- Remove the role_breakdown field (the frontend does not display it and downstream never reads it)
- Remove the review_mode field (no consumer)
change-analysis/PHASES.md:
- ca_summary example reorders fields, risk_breakdown before risk_count, so the AI does not write the dict into risk_count while reading and writing in one pass (confusion observed in session 304)
- Add a positive/negative example pair making explicit that the bucketed dict belongs in risk_breakdown, not risk_count
- Remove the scenario-related constraint line
requirement-review/PHASES.md:
- rr_summary example drops role_breakdown / review_mode
- Constraint section updated accordingly
contract.yaml:
- ca/rr descriptions drop the removed fields
tests/check-schemas.sh:
- Remove the valid and negative samples for scenario / role_breakdown / review_mode
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
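The session-304 confusion (the bucketed dict landing in risk_count) comes down to a pair like the following; a hedged sketch of the intended shapes, with invented values and a made-up check function mirroring the anti-confusion note:

```python
# Correct: risk_count is the integer total; risk_breakdown holds the buckets.
good = {
    "risk_count": 7,
    "risk_breakdown": {"高": 2, "中": 3, "低": 2},
}

# Wrong (the observed confusion): the bucket dict written into risk_count.
bad = {
    "risk_count": {"高": 2, "中": 3, "低": 2},
}

def check(summary: dict) -> bool:
    """risk_count must be an int; risk_breakdown, when present, a dict."""
    return isinstance(summary.get("risk_count"), int) and isinstance(
        summary.get("risk_breakdown", {}), dict
    )

assert check(good) is True
assert check(bad) is False
```

Reordering the example (risk_breakdown before risk_count) plus this explicit positive/negative pair is what keeps a model reading the schema top-to-bottom from conflating the two fields.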
Follow-up to the v0.0.10 contract renames (risk_assessment.json, supplementary_cases.json, test_execution_report.json) — propagate the new names through PIPELINES.md, README.md, and two TEMPLATES.md files that still pointed at the old canonical names. Also fix a broken anchor in requirement-traceability/PHASES.md and a self-contradicting description in change-analysis/contract.yaml.
- requirement-traceability/PHASES.md: anchor #coverage_reportjson → #traceability_coverage_reportjson
- change-analysis/contract.yaml: description aligned with outputs (change_supplementary_cases.json)
- PIPELINES.md: 8 stale refs updated; work_dir layout split risk/supplementary into per-skill entries
- README.md: 7 stale refs updated across scenario tables and link diagrams
- test-case-review/TEMPLATES.md: §7 references review_supplementary_cases.json
- change-analysis/TEMPLATES.md: §0 + §5 cross-doc note references change_supplementary_cases.json
Reset versions from rolling patch bumps (0.0.18 → 0.0.19 over the branch lifetime) down to production+1: marketplace 0.1.40 → 0.1.41, test 0.0.8 → 0.0.9. Updated CHANGELOG.md, plugins/test/README.md version history, and root README.md / README.en.md plugin tables.
…(v0.0.11)
- Bind testcase.schema.json on *_cases.json outputs in tcg/tcr/ca contract.yaml
- Add demand-driven vs change-driven distinction in tcg/ca SKILL descriptions + cross-reference SKIPs
- Mark review_result.json as non-strict JSON (verdict object, not TestCaseList)
- Sync testcase.schema.json from ai-case Pydantic: Step.expected now required (allows empty string)
- Relax CONVENTIONS.md test_method to optional (Pydantic was already Optional; doc was lagging)
- Move AGENT_TEMPLATE.md from agents/ to docs/ (avoid loader picking it up as agent)
- Bump test 0.0.9 -> 0.0.11, marketplace metadata 0.1.41 -> 0.1.43
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
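The "required but allows empty string" semantics of Step.expected can be expressed in Pydantic v2 roughly as follows; a sketch of the idea, not the actual ai-case case_schema.py:

```python
from pydantic import BaseModel, ValidationError

class Step(BaseModel):
    action: str
    expected: str  # required field, but "" is a valid value

Step(action="tap login", expected="")  # ok: empty string allowed
try:
    Step(action="tap login")           # missing field -> rejected
    raise AssertionError("expected a ValidationError")
except ValidationError:
    pass

# Required-ness is visible in the generated JSON schema, which is what the
# MCP tool input_schema reflects via TypeAdapter at tool-call time.
assert Step.model_json_schema()["required"] == ["action", "expected"]
```

This is the distinction the sync preserves: an annotated `str` with no default is required yet accepts "", whereas `Optional[str] = None` would have made the field omittable.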
…0.0.12)
- 5-A: Cross-reference known-collisions.yaml from PHASES.md & PIPELINES.md
to clarify trace Phase 6 writeback vs ms-sync mode=execute share the
same helper but are mutually exclusive entry points
- 5-C: qa-workflow accepts re_entry_phase + requirement_change_summary;
passes through to test-case-generation for "rerun after requirement
change" scenarios
- 8-A: README selection guide adds explicit demand-driven (tcg) vs
change-driven (ca) rows to match D2-Q3 boundary decision
- 8-B: README architecture features section drops stale v0.0.10+ version
gates (current is 0.0.12)
- 8-C: README directory tree moves AGENT_TEMPLATE.md from agents/ to docs/
(matches the file move done in 0.0.10)
- Bump test 0.0.11 -> 0.0.12, marketplace metadata 0.1.43 -> 0.1.44
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…clarify sample artifacts (v0.0.13)
- 7-C: Fix _shared/AGENT_PROTOCOL.md broken link to AGENT_TEMPLATE.md
(was agents/AGENT_TEMPLATE.md, now ../../docs/AGENT_TEMPLATE.md);
remove AGENT_TEMPLATE from agents/ tree diagram; add note that
template lives outside agents/ to avoid loader pickup
- 7-B: Add quickstart line in requirement-clarification SKILL.md noting
output/*.json files are pre-shipped format samples, not runtime
artifacts (avoid confusion for first-time users)
- 7-A: Keep AI_CODING_BEST_PRACTICES.md (developer onboarding value); no change
- Bump test 0.0.12 -> 0.0.13, marketplace metadata 0.1.44 -> 0.1.45
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… was wrong (v0.0.14)
Reverts the move from v0.0.10 (D1-Q2 decision). The original justification was that AGENT_TEMPLATE.md would be picked up by the Claude/Codex agent loader and pollute the agent picker. Verified with claude-code-guide that this is incorrect: Claude Code does NOT auto-scan plugins/<name>/agents/ for subagents — they must be explicitly declared. Additionally, all 9 .md files under plugins/test/agents/ have NO YAML frontmatter, so they would be skipped by the loader even if scanning happened. They are prompt templates loaded via Task tool calls inside skills, not subagents.
- mv plugins/test/docs/AGENT_TEMPLATE.md back to plugins/test/agents/
- Remove empty plugins/test/docs/ directory
- AGENT_PROTOCOL.md: tree restored, link reverted, add inline note explaining why the loader is not a concern
- README.md: directory tree updated to reflect the agents/ location and add an inline note about loader behavior
- Bump test 0.0.13 -> 0.0.14, marketplace metadata 0.1.45 -> 0.1.46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reset versions back to production+1 baseline (test 0.0.9 / marketplace 0.1.41) after consolidating 4 in-flight commits (836cb61/3564b01/bb3381b/b3c4f1a) into a single release-prep entry. File contents from those commits are preserved; only version markers + CHANGELOG + plugin README are updated. CHANGELOG 0.1.41 entry extended with a "Skill audit phase 1-3 follow-ups" section covering: schema bindings on *_cases.json, demand/change-driven gating for tcg/ca, the review_result.json non-strict JSON note, the ai-case Pydantic Step.expected required sync, CONVENTIONS.md test_method optional, known-collisions cross-references, qa-workflow re_entry_phase support, README selection guide updates, output sample annotations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
em0t
approved these changes
May 8, 2026
TL;DR
Block 1 inspects 4 real Feishu report samples produced by ai-case workflows and constrains output format: 3 new TEMPLATES.md created (test-case-review / change-analysis / requirement-traceability had none) + 1 rebuilt (requirement-review) + 1 schema violation fixed (
review_mode blacklist). All changes verified via real Feishu docx import (zero emoji / ASCII / table residue).
Block 2 systematically audits the test plugin (16 skills / 10 agents / 5 contract schemas) and hardens CI from 5 → 9 check categories. Resolves 8 contract output collisions, removes 3 zombie agents, makes dependencies honest.
Block 3 runs an 8-dimension cross-skill review: structural compliance, scope/overlap, trigger description, contract closure, orchestration, executability, dead code, cross-cutting docs consistency. Surfaces and resolves contract drift (schema bindings, demand-driven vs change-driven gating, MS writeback dual-entry clarity) and syncs
Step.expected from ai-case Pydantic into testcase.schema.json (now required, allows empty string).
Block 1 — Feishu Report Readability
Background
ai-case auto-uploads markdown reports to Feishu Docs after each AI workflow. Real-world inspection of 4 samples surfaced format issues hurting readability, and one triggered a schema violation:
- [OK]/[!]/[X] ASCII markers / QA AI 助手 self-praise signature / ★/★★/★★★ markers inconsistent across workflows / no multi-doc cross-references / vague "约 N 个" coverage stats
- rr_summary.json wrote an illegal review_mode: "single_agent_serial", triggering a contract schema validation failure (logged in output_data._schema_violations)
Feishu docx import compatibility traps (discovered via real upload)
The first iteration (commit
3562807) introduced emoji + HTML <details> based on "looks better" assumptions — all failed in Feishu docx import testing. These 4 traps are the empirical basis for every constraint in this PR:
- ✅ 已获取 → imported as [OK] 已获取; ⚠️ 待定 → imported as [!] 待定
- ❌ 未获取 → line swallowed (triggers bitable block parsing)
- 📋 基本信息 (decorative section prefix) → imported as [Doc] 基本信息
- <details><summary>...</summary>...</details> → tags ignored, summary becomes misleading plain text
- | ... | ... | markdown tables → split into discrete blocks
Fix direction: unify Chinese square-bracket markers
[通过]/[已覆盖]/[实证] to replace all emoji + ASCII + star ratings; tables → H3 + bullets; collapse → full-content H4 grouping.
Changes (6 commits / 4 skills / 11 files)
- 3562807 — TEMPLATES.md for report.md (10 readability optimizations)
- 26d3b03, 91ace7e — TEMPLATES.md (no template existed before) — 8-section strict template
- fb47249 — TEMPLATES.md (multi-doc constraints for code_change_analysis.md + test_coverage_report.md) — bidirectional cross-references + per-line [实证]/[推测] confidence tagging
- e459975 — report.md template (PHASES 5S.2 only constrained JSON output before, no markdown template)
- 32eb7f1 — rr_summary.json writing illegal review_mode values like single_agent_serial (was PR #58, merged in here)
- requirement-review: report.md
- test-case-review: review_summary.md (· prefix to preserve hierarchy after Feishu flattens nested bullets)
- change-analysis: code_change_analysis.md + test_coverage_report.md ([实证]/[推测] tagging + §5 confidence index)
- requirement-traceability smoke-test: report.md (共 N/总数 + §3.2 explicit untraced changes)
Unified conventions (11 rules across 4 workflows)
PR consolidates per-skill report conventions into unified rules:
Banned:
- [OK]/[!]/[X] — semantic drift across workflows (rr means "conditional", ca_cov means "low risk but hard to reproduce")
- <details><summary> collapse — Feishu doesn't recognize it; the summary becomes misleading text
Unified marker sets:
- readiness [通过]/[有条件]/[不通过] · coverage [已覆盖]/[部分覆盖]/[未覆盖]/[范围外] · severity [高]/[中]/[低] · confidence [实证]/[推测] · priority [P0]/[P1]/[P2]
- no QA AI 助手自动生成 self-signature (which weakened enterprise trust)
- code blocks prefixed with the quote > 以下代码块仅 Dev 排查时阅读,PM/QA 可跳过 — balances PM/QA vs Dev viewpoints
Verification (real Feishu docx upload)
Used ai-case
create_feishu_doc.py to upload 4 mock reports to Feishu docx, then verified rendering via the docx_v1_documents/{id}/blocks API (· prefix preserves hierarchy).
Impact
- rr_summary.json schema unchanged (commit 32eb7f1 only constrains AI output; the schema enum was always strict)
- defect_list.json / smoke_test_report.json schemas unchanged
Block 2 — Test Plugin Audit + CI Baseline Hardening
TL;DR
Systematic audit of test plugin (16 skills / 10 agents / 5 contract schemas / CONVENTIONS / 3 root docs). CI baseline expanded 5 → 9 check categories. Caught 8 real contract collisions (resolved via 4 renames + 4 whitelist entries), removed 3 zombie agents, made dependencies honest.
CI baseline hardening (validate.sh 5 → 9 check categories)
- Check 10: SKILL frontmatter name ↔ directory name match
- Check 11: handoffs[].skill targets exist
- Check 12: subagent_type=... referenced agent files exist
- Check 13: references/*.md paths resolve (incl. Chinese filenames)
- Check 14: validate_contracts.py cross-skill consistency + whitelist (contracts/known-collisions.yaml)
tests/check-schemas.sh from 1 → 5 schemas (testcase / ca-summary / defect-list / rr-summary / smoke-test-report) with ~30 negative cases. Fixed mktemp failure silently passing under sandbox mode.
- risk_assessment.json → change-analysis renamed to bug_risk_assessment.json (semantic distinction from traceability)
- test_execution_report.json → split into unit_test_execution_report.json / integration_test_execution_report.json
- supplementary_cases.json (3-way collision) → change-analysis gets change_supplementary_cases.json; test-case-review → review_supplementary_cases.json; test-case-generation keeps canonical (internal scratch)
- whitelisted: ui_fidelity_report.json, forward_verification.enriched.json + 3 writeback files (shared metersphere_helper.py writeback-from-fv helper, runtime-mutually-exclusive), requirement_points.md
plugins/test/contracts/known-collisions.yamlas whitelist source of truth, with self-binding "review protocol" rule (rejects semantically different same-name entries).Zombie agent cleanup (3 deletes + 13 dead references)
- agents/requirement-traceability/forward-tracer.md
- agents/requirement-traceability/reverse-tracer.md
- agents/failure-classifier.md
Cleaned 13 dead references across 7 files (SKILL/PHASES/TEMPLATES/TRACEABILITY_PROTOCOL/AGENT_PROTOCOL/README).
Dependency honesty
- shared-tools/requirements.txt: corrected from "stdlib only" to list pycryptodome / pyyaml (required) + jsonschema / python-dotenv (optional)
- feedback/requirements.txt: new, declares aiohttp (required by feishu_api.py)
- feedback/contract.yaml: env_vars now includes FEISHU_PLUGIN_ID/SECRET/USER_KEY (was [] despite scripts needing them)
5 shared scripts now support standard
-h/--help (previously --help was sent as a network query in search_mrs.py):
- search_mrs.py / search_prs.py: added a USAGE constant + --help interception
- gitlab_helper.py / github_helper.py / metersphere_helper.py: the "unknown subcommand" path now does if argv[1] in ("-h", "--help"): print(usage); exit 0
- AI_CODING_BEST_PRACTICES.md / PIPELINES.md / CONTRACT_SPEC.md get audience notes ("developer reference, not skill runtime dependency")
- contracts/ directory annotated
- Broken anchor fixed: CONVENTIONS.md#审计输出格式 → _shared/ASSERTION_AUDIT.md#审计输出格式
- Section-name exception extended for sub-agents (they use ## 执行流程/输出/置信度评分 instead of standard names)
- integration-test-design: 5-phase content split from SKILL into a standalone PHASES.md (matches unit-test-design style; SKILL trimmed 387 → 252 lines)
- feedback/SKILL.md adds Quick Start + Closing Checklist
- metersphere-sync/SKILL.md adds list-stages command docs + stdout shape
- change-analysis/SKILL.md model-tier table adds codex-change-analyzer
- metersphere-sync: deprecated confidence_threshold field removed (the helper hadn't honored it; only a contract.yaml placeholder)
Block 3 — 8-Dimension Cross-Skill Review Phase 1-3 Follow-ups
TL;DR
Multi-dimensional review of all 15 skills, 7 agents (under
agents/), and 5 contracts. 23 decision points across 8 dimensions resolved. 7 actionable fixes implemented; 3 "keep current state" decisions noted as known limitations.
Coverage (8 dimensions)
Changes implemented
- Bind testcase.schema.json on 5 *_cases.json outputs (was implicit via CONVENTIONS.md inheritance; now machine-checkable)
- review_result.json non-strict JSON: test-case-review/SKILL.md notes it's a verdict object (not a TestCaseList), deliberately not schema-bound
- Step.expected now required: synced from the ai-case Pydantic source (companion ai-case MR !40); allows an empty string but the field must exist. MCP tool input_schema auto-reflects via TypeAdapter at LLM tool-call time
- test_method relaxed to optional: CONVENTIONS.md updated to match Pydantic (which was already Optional; the doc was lagging) — supplementary case producers may omit it
- Cross-reference from known-collisions.yaml explaining that trace Phase 6 and metersphere-sync mode=execute share the metersphere_helper.py writeback-from-fv helper as auto vs manual entry points (not duplicate implementations)
- qa-workflow: re_entry_phase + requirement_change_summary optional inputs, passed through to test-case-generation for "rerun after requirement change" scenarios
- requirement-clarification: note that output/*.json files are pre-shipped format samples, not runtime artifacts
- _shared/AGENT_PROTOCOL.md: add an inline note explaining that plugins/test/agents/ markdown files lack YAML frontmatter and are loaded explicitly via Task tool calls inside skills, not auto-registered as Claude/Codex subagents
Known limitations (deferred decisions)
- dependencies.agents kept as-is (per D6-A)
- feedback / tcg / tcr / qa-workflow lack explicit fallback documentation when external deps are unavailable (per D6-B; relies on fail-loud + error-message guidance)
- requirement-clarification ↔ requirement-review overlap not gated; the LLM routes (per D2-Q1)
- tcg internal redundancy-pair review vs standalone test-case-review boundary not enforced (per D2-Q2)
Companion change in ai-case
ai-case MR !40 (
fix/skill-loading-doc) commits the case_schema.py change making Step.expected required. Merge order recommendation: this PR first → ai-case MR !40 second → submodule pointer update via a small follow-up MR.
Commit map
3562807, 26d3b03, 91ace7e, fb47249, e459975, 32eb7f1, 4679a7f, fad14aa, 117cbcf, 3a57da6, 836cb61, 3564b01, bb3381b, b3c4f1a, bf41bce
Test plan
- bash tests/validate.sh PASS (79 PASS / 0 FAIL)
- Verified Step.required = ['action', 'expected']
🤖 Generated with Claude Code