- Merge internal-eval into development.sh (single Celery/infra); internal-eval.sh wraps with LIFE_ECHO_WITH_INTERNAL_EVAL; EVAL_ATTACH_ONLY for attaching 8001 when :8000 is already up; document in api/docs/internal-eval.md.
- Evaluation: transcript_for_judge, judge error surfacing, rubric/schema tweaks, execution_service and router updates; tests for judge and composite eval.
- Memory: ingest nested transaction for embedding/enrichment rollback safety.
- Conversation WS: logger.exception for pipeline errors (avoid loguru KeyError).
- app-eval-web: Playground saved replays, dialogue turns helper, hash user_id for Memoir; Memoir chapter baseline↔DB row compare with title heuristics; Stories page (#memoir-stories); Markdown + copy buttons; toolbar/panel UI; react-markdown; development proxy and fixture updates.
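Evaluation rubric regression (calibration and regression)
Use this after changing rubrics/*.py or judge_schemas.py to run a low-cost regression check, so that prompt or schema changes do not introduce silent breakage.
Automated (no LLM calls)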
cd api && uv run pytest tests/test_judge_schemas.py tests/test_eval_composite.py -q
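test_judge_schemas.py: per-item score caps and the arithmetic consistency of total_score. test_eval_composite.py: the semantics of the composite score in the "conversation only / memoir only / both sides missing" cases.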
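As a rough illustration of the kind of invariants these tests cover, the sketch below checks that each sub-score stays within its cap and that total_score equals the sum of the sub-scores. The names `ITEM_CAPS`, `check_judge_result`, `composite_score`, and the item keys are hypothetical and not taken from judge_schemas.py; the composite rule shown (average of the sides that are present, None when both are missing) is an assumption, not the project's confirmed behaviour.

```python
from typing import Optional

# Hypothetical per-item caps; the real values live in judge_schemas.py.
ITEM_CAPS = {"empathy": 30, "coverage": 40, "accuracy": 30}


def check_judge_result(items: dict[str, int], total_score: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    for name, score in items.items():
        cap = ITEM_CAPS.get(name)
        if cap is None:
            problems.append(f"unknown item: {name}")
        elif not 0 <= score <= cap:
            problems.append(f"{name}={score} outside 0..{cap}")
    items_sum = sum(items.values())
    if items_sum != total_score:
        problems.append(f"total_score={total_score} != sum of items ({items_sum})")
    return problems


def composite_score(conversation: Optional[float], memoir: Optional[float]) -> Optional[float]:
    """Assumed rule: average the sides that are present; None if both are missing."""
    present = [s for s in (conversation, memoir) if s is not None]
    if not present:
        return None
    return sum(present) / len(present)
```

Returning a list of problems rather than raising on the first one makes it easier to see every violation a rubric change introduced in a single run.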
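Qualitative calibration set
See fixtures.json in the same directory: it describes several miniature transcripts / finished-draft excerpts together with their expected tendencies (score bands or keywords), and is not tied to any specific model version.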
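After changing a rubric, we recommend:
- Run the pytest command above and make sure it passes.
- Pick one or two fixtures and run a manual spot-check against the real GLM, using the internal eval or EvalJudgeManualService, to confirm that expected_band / must_flag_issues still look reasonable.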
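The comparison in the second step can be sketched as a small helper. The fixture layout assumed here (expected_band as an inclusive [low, high] pair, must_flag_issues as a list of keywords) is a guess at the shape of fixtures.json, and `spot_check` is a hypothetical name. The judge output itself would come from the internal eval or EvalJudgeManualService, which this sketch does not call.

```python
def spot_check(fixture: dict, total_score: float, flagged_issues: list[str]) -> list[str]:
    """Compare one judge result against a fixture; return the mismatches."""
    mismatches = []
    low, high = fixture["expected_band"]  # assumed shape: [low, high], inclusive
    if not low <= total_score <= high:
        mismatches.append(f"score {total_score} outside expected band {low}..{high}")
    flagged_text = " ".join(flagged_issues).lower()
    for keyword in fixture.get("must_flag_issues", []):
        # Assumed semantics: case-insensitive substring match on the judge's issues.
        if keyword.lower() not in flagged_text:
            mismatches.append(f"expected issue not flagged: {keyword}")
    return mismatches


# Hypothetical fixture and judge output, for illustration only.
fixture = {
    "rubric_id": "conversation_v1",
    "expected_band": [40, 60],
    "must_flag_issues": ["leading question"],
}
print(spot_check(fixture, 55, ["Interviewer asked a leading question"]))
```

Because the fixtures describe tendencies rather than exact scores, a mismatch here is a prompt to read the judge output by hand, not an automatic failure.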
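Rubric version
The rubric_id values in fixtures.json are aligned with conversation_v1 / memoir_v1 in the code; when making major changes to a rubric, update the descriptions and expectations in fixtures.json at the same time.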
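One way to catch drift between the two is a small check like the following. It is only a sketch: it assumes fixtures.json is a top-level list of objects that each carry a rubric_id, and `KNOWN_RUBRIC_IDS` and `unknown_rubric_ids` are hypothetical names rather than anything defined in the codebase.

```python
import json

# Rubric IDs named in this document; extend when a new rubric version is added.
KNOWN_RUBRIC_IDS = {"conversation_v1", "memoir_v1"}


def unknown_rubric_ids(fixtures: list[dict]) -> set[str]:
    """Return rubric_id values in the fixtures that the code does not know about."""
    seen = {str(f.get("rubric_id", "<missing>")) for f in fixtures}
    return seen - KNOWN_RUBRIC_IDS


# Inline sample standing in for the real file (assumed layout).
sample = json.loads('[{"rubric_id": "conversation_v1"}, {"rubric_id": "memoir_v2"}]')
print(unknown_rubric_ids(sample))  # {'memoir_v2'}
```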