Kevin 99543d04c6 feat(eval): internal-eval stack, judge fixes, and eval web overhaul

- Merge internal-eval into development.sh (single Celery/infra); internal-eval.sh
  wraps with LIFE_ECHO_WITH_INTERNAL_EVAL; EVAL_ATTACH_ONLY for attaching 8001
  when :8000 is already up; document in api/docs/internal-eval.md.
- Evaluation: transcript_for_judge, judge error surfacing, rubric/schema tweaks,
  execution_service and router updates; tests for judge and composite eval.
- Memory: ingest nested transaction for embedding/enrichment rollback safety.
- Conversation WS: logger.exception for pipeline errors (avoid loguru KeyError).
- app-eval-web: Playground saved replays, dialogue turns helper, hash user_id
  for Memoir; Memoir chapter baseline↔DB row compare with title heuristics;
  Stories page (#memoir-stories); Markdown + copy buttons; toolbar/panel UI;
  react-markdown; development proxy and fixture updates.

2026-04-07 17:18:47 +08:00

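Evaluation rubric regression (calibration and regression)

Use this after changing rubrics/*.py or judge_schemas.py to run a low-cost regression check, so that prompt or schema changes do not introduce silent breakage.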

Automated checks (no LLM calls)

cd api && uv run pytest tests/test_judge_schemas.py tests/test_eval_composite.py -q

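test_judge_schemas.py: per-item score caps and the arithmetic consistency of total_score.
test_eval_composite.py: semantics of the composite score when only the conversation side is present, only the memoir side is present, or both sides are missing.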
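The two kinds of checks described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the item names and caps in `ITEM_CAPS`, the helper names `validate_scores` and `composite_score`, and the composite semantics (average when both sides exist, fall back to the single available side, `None` when both are missing) are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical per-item caps; the real caps live in the rubric / judge schema code.
ITEM_CAPS = {"empathy": 5, "accuracy": 5, "coherence": 10}


def validate_scores(items: dict[str, int], total_score: int) -> list[str]:
    """Return a list of problems; an empty list means the result is consistent."""
    problems = []
    for name, score in items.items():
        cap = ITEM_CAPS.get(name)
        if cap is None:
            problems.append(f"unknown item: {name}")
        elif not 0 <= score <= cap:
            problems.append(f"{name}={score} outside 0..{cap}")
    expected_total = sum(items.values())
    if total_score != expected_total:
        problems.append(f"total_score={total_score} != sum of items {expected_total}")
    return problems


def composite_score(conversation: Optional[float], memoir: Optional[float]) -> Optional[float]:
    """Assumed semantics: average of the sides that are present, None if neither is."""
    present = [s for s in (conversation, memoir) if s is not None]
    if not present:
        return None
    return sum(present) / len(present)
```

Checks of this shape need no model call, which is why they are cheap enough to run on every rubric or schema change.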

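Qualitative calibration set

See fixtures.json in the same directory. It describes several miniature transcripts and draft excerpts together with the expected tendency (a score band or keywords), and is not tied to a specific model version.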

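After changing a rubric, the recommended steps are:

Run the pytest command above and make sure it passes.
Pick one or two fixtures and run a manual spot-check against a real GLM, using either the internal evaluation stack or EvalJudgeManualService, then check whether expected_band / must_flag_issues still look reasonable.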
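The comparison in the spot-check step could look roughly like the sketch below. The field names expected_band and must_flag_issues come from this document, but their shapes are assumptions: here expected_band is taken to be an inclusive `[low, high]` pair and must_flag_issues a list of keywords. The judge result layout (`total_score`, `issues`) and the helper name `check_against_fixture` are likewise hypothetical.

```python
def check_against_fixture(fixture: dict, judge_result: dict) -> list[str]:
    """Compare one judge result with a fixture's expectations.

    Returns human-readable findings; an empty list means nothing looks off.
    """
    findings = []

    # Assumed shape: expected_band is an inclusive [low, high] pair.
    low, high = fixture["expected_band"]
    score = judge_result["total_score"]
    if not low <= score <= high:
        findings.append(f"total_score {score} outside expected band [{low}, {high}]")

    # Assumed shape: must_flag_issues is a list of keywords that should
    # appear somewhere in the issues reported by the judge.
    issues_text = " ".join(judge_result.get("issues", [])).lower()
    for keyword in fixture.get("must_flag_issues", []):
        if keyword.lower() not in issues_text:
            findings.append(f"judge did not flag: {keyword}")

    return findings
```

Because the fixtures express a tendency rather than an exact score, a finding here is a prompt for human judgment, not an automatic failure.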

Rubric version

The rubric_id values in fixtures.json match conversation_v1 / memoir_v1 in the code. When making major changes to a rubric, also update the descriptions and expectations in fixtures.json.
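A small guard against the fixtures and the code drifting apart might look like the sketch below. It assumes fixtures.json is either a JSON array of fixture objects or an object with a `fixtures` key holding that array; the real layout may differ. The helper names are hypothetical, and the set of known ids is simply the two named in this document.

```python
import json
from pathlib import Path

# The two rubric ids mentioned in this document.
KNOWN_RUBRIC_IDS = {"conversation_v1", "memoir_v1"}


def load_fixtures(path: Path) -> list[dict]:
    """Load fixtures, accepting a bare array or an object with a 'fixtures' key (assumed layouts)."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return data["fixtures"] if isinstance(data, dict) else data


def unknown_rubric_ids(fixtures: list[dict]) -> set[str]:
    """Return rubric ids used by fixtures that the code does not know about."""
    used = {f.get("rubric_id", "<missing>") for f in fixtures}
    return used - KNOWN_RUBRIC_IDS
```

A non-empty result after a rubric rename or version bump is a signal that fixtures.json needs updating along with the code.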
