api/tests/evaluation_calibration/README.md

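# Evaluation rubric regression (calibration and regression)

Use this after changing `rubrics/*.py` or `judge_schemas.py` to run a **low-cost** regression pass, so that prompt or schema changes do not introduce silent breakage.

## Automated checks (no LLM calls)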

```bash
cd api && uv run pytest tests/test_judge_schemas.py tests/test_eval_composite.py -q
```

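- `test_judge_schemas.py`: per-item score caps and arithmetic consistency of `total_score`.
- `test_eval_composite.py`: semantics of the composite score when only the conversation side is present, only the memoir side is present, or both sides are missing.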
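
To make "per-item caps and arithmetic consistency" concrete, here is a minimal sketch of that kind of check. The item names, the caps, and the `check_scores` helper are hypothetical and chosen for illustration only; the real caps and validation live in `judge_schemas.py` and may be structured differently.

```python
# Hypothetical sub-item caps; the real values are defined in judge_schemas.py.
SUB_ITEM_CAPS = {"coverage": 40, "accuracy": 40, "tone": 20}


def check_scores(sub_scores: dict[str, int], total_score: int) -> list[str]:
    """Return a list of consistency errors; an empty list means the scores are consistent."""
    errors = []
    for name, value in sub_scores.items():
        cap = SUB_ITEM_CAPS.get(name)
        if cap is None:
            errors.append(f"unknown sub-item: {name}")
        elif not 0 <= value <= cap:
            errors.append(f"{name}={value} is outside [0, {cap}]")
    # total_score must equal the sum of the sub-item scores.
    expected_total = sum(sub_scores.values())
    if expected_total != total_score:
        errors.append(f"total_score={total_score} but sub-items sum to {expected_total}")
    return errors
```

A judge output that exceeds a cap or reports a `total_score` different from the sum of its items yields a non-empty list, which is the kind of silent drift these tests are meant to catch.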

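## Qualitative calibration set

See `fixtures.json` in the same directory. It describes a handful of **micro transcripts / draft excerpts** together with their **expected tendencies** (a score band or keywords), and is not tied to any specific model version.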

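Recommended steps after changing a rubric:

1. Make sure the pytest command above passes.
2. Pick one or two fixtures and run a manual spot-check against GLM-5, using either the internal evaluation stack or `EvalJudgeManualService`, then confirm that `expected_band` / `must_flag_issues` still look reasonable.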
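
The comparison in step 2 can be sketched as a small helper. This is an illustration under assumptions, not the project's code: `JudgeResult` and `spot_check` are hypothetical names, `expected_band` is assumed to be a `[low, high]` pair, and `must_flag_issues` is assumed to be a list of keywords matched case-insensitively against the judge's reported issues.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeResult:
    """Hypothetical shape of one judge output; not the project's real schema."""

    total_score: int
    issues: list[str] = field(default_factory=list)


def spot_check(fixture: dict, result: JudgeResult) -> list[str]:
    """Return human-readable mismatches between a fixture and a judge result."""
    problems = []
    low, high = fixture["expected_band"]  # assumed to be a [low, high] pair
    if not low <= result.total_score <= high:
        problems.append(
            f"total_score {result.total_score} is outside band [{low}, {high}]"
        )
    # Keyword match against all reported issues, ignoring case.
    flagged = " ".join(result.issues).lower()
    for keyword in fixture.get("must_flag_issues", []):
        if keyword.lower() not in flagged:
            problems.append(f"expected issue not flagged: {keyword}")
    return problems
```

An empty return value means the fixture's expectations still hold; anything else is a prompt to review either the rubric change or the fixture itself.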

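## Rubric version

The `rubric_id` values in `fixtures.json` must match `conversation_v1` / `memoir_v1` in the code. When making a major rubric change, update the descriptions and expectations in `fixtures.json` at the same time.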
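
A guard against `rubric_id` drift might look like the sketch below. It assumes `fixtures.json` is a top-level JSON list of objects that each carry a `rubric_id` key; the actual file layout may differ, and `load_fixtures` / `unknown_rubric_ids` are hypothetical helper names.

```python
import json
from pathlib import Path

# Rubric ids named in this README; extend this set when a new rubric version ships.
KNOWN_RUBRIC_IDS = {"conversation_v1", "memoir_v1"}


def load_fixtures(path: str) -> list[dict]:
    """Load fixtures, assuming the file is a top-level JSON list of objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def unknown_rubric_ids(fixtures: list[dict]) -> list[str]:
    """Return the sorted rubric ids that do not match any known rubric."""
    seen = {fixture.get("rubric_id", "<missing>") for fixture in fixtures}
    return sorted(seen - KNOWN_RUBRIC_IDS)
```

Calling `unknown_rubric_ids(load_fixtures("fixtures.json"))` from this directory should return an empty list while the fixtures and the code agree.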