feat(evaluation): memoir readiness, judge/replay updates, eval web playground

Add memoir_readiness_service and router tests; extend judge schemas/services, replay_service, and conversation rubric; align story route agent, payload, prompts, and story_pipeline_sync; update agent logging, config, and DI. Document internal-eval; add replayDraft util and PlaygroundPage changes in app-eval-web.
2026-04-08 09:38:07 +08:00
parent 99543d04c6
commit 6772e1269c
26 changed files with 1255 additions and 124 deletions
--- a/api/tests/evaluation_calibration/README.md
+++ b/api/tests/evaluation_calibration/README.md
@@ -18,7 +18,7 @@ cd api && uv run pytest tests/test_judge_schemas.py tests/test_eval_composite.py
 After changing the rubric, we recommend:

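 1. Make sure the pytest command above passes.
-2. Pick any 1-2 fixtures and run a manual spot-check against a real GLM model, using internal eval or `EvalJudgeManualService`, and confirm that `expected_band` / `must_flag_issues` still look reasonable.
+2. Pick any 1-2 fixtures and run a manual spot-check against GLM-5, using internal eval or `EvalJudgeManualService`, and confirm that `expected_band` / `must_flag_issues` still look reasonable.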

 ## Rubric version