life-echo/api/tests/evaluation_calibration/README.md
Kevin 99543d04c6 feat(eval): internal-eval stack, judge fixes, and eval web overhaul
- Merge internal-eval into development.sh (single Celery/infra); internal-eval.sh
  wraps with LIFE_ECHO_WITH_INTERNAL_EVAL; EVAL_ATTACH_ONLY for attaching 8001
  when :8000 is already up; document in api/docs/internal-eval.md.
- Evaluation: transcript_for_judge, judge error surfacing, rubric/schema tweaks,
  execution_service and router updates; tests for judge and composite eval.
- Memory: ingest nested transaction for embedding/enrichment rollback safety.
- Conversation WS: logger.exception for pipeline errors (avoid loguru KeyError).
- app-eval-web: Playground saved replays, dialogue turns helper, hash user_id
  for Memoir; Memoir chapter baseline↔DB row compare with title heuristics;
  Stories page (#memoir-stories); Markdown + copy buttons; toolbar/panel UI;
  react-markdown; development proxy and fixture updates.
2026-04-07 17:18:47 +08:00

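# Evaluation rubric calibration and regression
Use this to run a **low-cost** regression check after changing `rubrics/*.py` or `judge_schemas.py`, so that prompt or schema edits do not introduce silent breakage.
## Automated checks (no LLM calls)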
```bash
cd api && uv run pytest tests/test_judge_schemas.py tests/test_eval_composite.py -q
```
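- `test_judge_schemas.py`: checks the per-item score caps and that `total_score` is arithmetically consistent with them.
- `test_eval_composite.py`: checks the semantics of the composite score in the conversation-only, memoir-only, and both-sides-missing cases.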
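
As a rough illustration of what these two checks cover, here is a minimal sketch. The function names, the shape of the score mappings, and the "mean of the sides that are present" composite rule are all assumptions made for this example; the real logic lives in `judge_schemas.py` and the composite eval code and may differ.

```python
from typing import Mapping, Optional


def total_is_consistent(
    sub_scores: Mapping[str, int],
    caps: Mapping[str, int],
    total_score: int,
) -> bool:
    """Hypothetical check: every sub-item is within [0, cap] and the
    sub-items sum exactly to total_score."""
    for name, score in sub_scores.items():
        cap = caps.get(name)
        # An item without a declared cap, or outside its cap, is invalid.
        if cap is None or not 0 <= score <= cap:
            return False
    return sum(sub_scores.values()) == total_score


def composite_score(
    conversation: Optional[float], memoir: Optional[float]
) -> Optional[float]:
    """Hypothetical composite rule (an assumption, not the project's):
    average the sides that are present, return None if both are missing."""
    present = [s for s in (conversation, memoir) if s is not None]
    if not present:
        return None
    return sum(present) / len(present)
```

Under these assumptions, `composite_score(80.0, None)` falls back to the conversation side alone, and `composite_score(None, None)` yields `None` rather than a misleading zero.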
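## Qualitative calibration set
See `fixtures.json` in this directory. It describes several **miniature transcripts / finished-draft excerpts** together with their **expected tendency** (a score band or keywords), and is not tied to a specific model version.
After changing a rubric, we recommend:
1. Get the pytest run above passing.
2. Pick any 1-2 fixtures and run a manual spot-check against the real GLM, using the internal eval stack or `EvalJudgeManualService`, then confirm that `expected_band` / `must_flag_issues` still look reasonable.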
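
The comparison in step 2 could be scripted along these lines. This is only a sketch: it assumes `expected_band` is a `[low, high]` pair and `must_flag_issues` is a list of keyword strings, which this README does not confirm, and `spot_check` is a hypothetical helper rather than part of `EvalJudgeManualService`.

```python
from typing import Any, List, Mapping, Sequence


def spot_check(
    fixture: Mapping[str, Any],
    score: float,
    flagged_issues: Sequence[str],
) -> List[str]:
    """Return human-readable mismatches between one judge result and one
    fixture; an empty list means the result matches expectations."""
    problems: List[str] = []

    # Assumed shape: expected_band is an inclusive [low, high] pair.
    low, high = fixture["expected_band"]
    if not low <= score <= high:
        problems.append(f"score {score} outside expected_band [{low}, {high}]")

    # Assumed shape: must_flag_issues holds keywords that should appear
    # somewhere in the issues reported by the judge (case-insensitive).
    flagged_text = " ".join(flagged_issues).lower()
    for keyword in fixture.get("must_flag_issues", []):
        if keyword.lower() not in flagged_text:
            problems.append(f"missing expected issue: {keyword}")

    return problems
```

A non-empty result is a prompt for human review, not an automatic failure: the fixtures describe tendencies, so a score just outside the band may still be acceptable.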
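## Rubric versions
The `rubric_id` values in `fixtures.json` match `conversation_v1` / `memoir_v1` in the code. When you make a major change to a rubric, update the descriptions and expectations in `fixtures.json` as well.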
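
One way to catch a stale `rubric_id` early is a small check like the following. It is a sketch under assumptions: it takes the fixtures as an already-loaded list of dicts (the actual layout of `fixtures.json` is not described here), and `KNOWN_RUBRIC_IDS` and `find_stale_fixtures` are illustrative names.

```python
from typing import Any, List, Mapping, Optional, Sequence, Tuple

# The two rubric ids named in this README; extend when a new rubric is added.
KNOWN_RUBRIC_IDS = frozenset({"conversation_v1", "memoir_v1"})


def find_stale_fixtures(
    fixtures: Sequence[Mapping[str, Any]],
) -> List[Tuple[int, Optional[str]]]:
    """Return (index, rubric_id) for every fixture whose rubric_id is
    missing or not among the known rubric ids."""
    return [
        (index, fixture.get("rubric_id"))
        for index, fixture in enumerate(fixtures)
        if fixture.get("rubric_id") not in KNOWN_RUBRIC_IDS
    ]
```

An empty list means every fixture points at a rubric that still exists in code; anything else lists the entries to revisit after a rubric rename or version bump.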