life-echo/api/tests/evaluation_calibration/README.md
Kevin 6772e1269c feat(evaluation): memoir readiness, judge/replay updates, eval web playground
Add memoir_readiness_service and router tests; extend judge schemas/services, replay_service, and conversation rubric; align story route agent, payload, prompts, and story_pipeline_sync; update agent logging, config, and DI. Document internal-eval; add replayDraft util and PlaygroundPage changes in app-eval-web.
2026-04-08 09:43:34 +08:00

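# Evaluation rubric regression (calibration and regression)
Use this to run a **low-cost** regression after changing `rubrics/*.py` or `judge_schemas.py`, so that prompt or schema changes do not introduce silent breakage.
## Automated checks (no LLM calls)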
```bash
cd api && uv run pytest tests/test_judge_schemas.py tests/test_eval_composite.py -q
```
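- `test_judge_schemas.py`: per-item score caps and the arithmetic consistency of `total_score`.
- `test_eval_composite.py`: semantics of the composite score in the conversation-only, memoir-only, and both-sides-missing cases.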
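
As a rough illustration of the kind of invariant the schema test guards, the sketch below checks per-item caps and that `total_score` equals the sum of the items. The item names, cap values, and the `JudgeResult` / `validate` names are hypothetical and are not taken from `judge_schemas.py`.

```python
from dataclasses import dataclass

# Hypothetical per-item caps; the real rubric items and limits live in the project code.
ITEM_CAPS = {"empathy": 5, "coverage": 5, "accuracy": 10}


@dataclass
class JudgeResult:
    item_scores: dict[str, int]
    total_score: int


def validate(result: JudgeResult) -> list[str]:
    """Return human-readable problems; an empty list means the result is consistent."""
    problems = []
    for name, score in result.item_scores.items():
        cap = ITEM_CAPS.get(name)
        if cap is None:
            problems.append(f"unknown item: {name}")
        elif not 0 <= score <= cap:
            problems.append(f"{name}={score} outside 0..{cap}")
    # total_score must be exactly the sum of the per-item scores.
    expected_total = sum(result.item_scores.values())
    if result.total_score != expected_total:
        problems.append(
            f"total_score={result.total_score}, items sum to {expected_total}"
        )
    return problems
```

A check like this needs no model call, which is what keeps the automated step cheap.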
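## Qualitative calibration set
See `fixtures.json` in this directory: it describes several **miniature transcripts / draft excerpts** together with their **expected tendency** (a band or keywords), and is not tied to a specific model version.
After changing a rubric, the recommended steps are:
1. Get the pytest run above to pass.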
2. Pick any 1-2 fixtures and run a manual spot-check against GLM-5, using the internal eval or `EvalJudgeManualService`, then check whether `expected_band` / `must_flag_issues` still look reasonable.
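
The comparison in step 2 could be sketched as follows. This assumes, purely for illustration, that `expected_band` is a `[low, high]` pair, that `must_flag_issues` is a list of keywords, and that the judge output carries `total_score` and an `issues` list of strings. The real shapes in `fixtures.json` and in the judge schema may differ, and `spot_check` is a hypothetical helper.

```python
def spot_check(fixture: dict, judge_output: dict) -> list[str]:
    """Compare one judge output with a fixture's expectations (assumed shapes)."""
    findings = []

    # Assumed: expected_band is an inclusive [low, high] range for total_score.
    low, high = fixture["expected_band"]
    score = judge_output["total_score"]
    if not low <= score <= high:
        findings.append(f"score {score} outside expected band {low}..{high}")

    # Assumed: each must_flag_issues keyword should appear in some reported issue.
    issues_text = " ".join(judge_output.get("issues", [])).lower()
    for keyword in fixture.get("must_flag_issues", []):
        if keyword.lower() not in issues_text:
            findings.append(f"missing expected issue: {keyword}")

    return findings
```

Because the fixtures express tendencies and not exact scores, findings from a helper like this are prompts for human review, not hard failures.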
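## Rubric version
The `rubric_id` values in `fixtures.json` are aligned with `conversation_v1` / `memoir_v1` in the code; when making major changes to a rubric, update the descriptions and expectations in `fixtures.json` accordingly.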
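
One way to catch drift between the fixtures and the code is a small check like the one below. It is a sketch that assumes `fixtures.json` is either a top-level list of entries or an object with a `fixtures` list (both layouts are guesses), and `unknown_rubric_ids` is a hypothetical helper, not part of the repository.

```python
import json

# Rubric ids named in this README; extend when a new rubric version is added.
KNOWN_RUBRIC_IDS = {"conversation_v1", "memoir_v1"}


def unknown_rubric_ids(fixtures_json: str) -> set[str]:
    """Return rubric_id values in the fixtures that the code does not know about."""
    data = json.loads(fixtures_json)
    # Assumed layout: a list of entries, or {"fixtures": [...]}.
    entries = data if isinstance(data, list) else data.get("fixtures", [])
    found = {entry.get("rubric_id", "<missing>") for entry in entries}
    return found - KNOWN_RUBRIC_IDS
```

An empty result means every fixture points at a rubric id that still exists.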