feat(evaluation): memoir readiness, judge/replay updates, eval web playground

Add memoir_readiness_service and router tests; extend judge schemas/services, replay_service, and conversation rubric; align story route agent, payload, prompts, and story_pipeline_sync; update agent logging, config, and DI. Document internal-eval; add replayDraft util and PlaygroundPage changes in app-eval-web.
Kevin
2026-04-08 09:38:07 +08:00
parent 99543d04c6
commit 6772e1269c
26 changed files with 1255 additions and 124 deletions

@@ -18,7 +18,7 @@ cd api && uv run pytest tests/test_judge_schemas.py tests/test_eval_composite.py
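 After changing the rubric, it is recommended to:
 1. Get the pytest run above passing.
-2. Pick any 12 fixtures and run one manual spot-check with the real GLM, using the internal evaluation or `EvalJudgeManualService`, and check whether `expected_band` / `must_flag_issues` are still reasonable.
+2. Pick any 12 fixtures and run one manual spot-check against GLM-5, using the internal evaluation or `EvalJudgeManualService`, and check whether `expected_band` / `must_flag_issues` are still reasonable.
 ## rubric version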