api/docs/internal-eval.md

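# Internal Regression Evaluation Platform

Deployed as a separate process from the main API (`app/main.py`), so that candidate pipelines under evaluation are never exposed to the consumer app.

## Startup

**Recommended: one command.** `internal-eval.sh` calls `development.sh` under the hood and, in a single process tree, starts the main site `main:app` (**8000**), **one** Celery worker, the internal evaluation app `internal_app` (default **8001**), and `app-eval-web` (default **5174**). You no longer need to run two startup scripts in parallel.

| | Single command `./internal-eval.sh` |
|---|-------------------------------|
| HTTP | Main site **8000** + internal **8001** |
| Celery | Only **one** worker (shares queues with the main site) |
| Frontend | Starts `app-eval-web` by default (disable with `START_EVAL_WEB=0`) |

If the **main site and Celery are already running in another terminal** via `./development.sh`, and you only want to add the evaluation HTTP server and frontend on the same machine **without starting a second worker**: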

```bash
cd api
# Make sure .env.development / .env contains INTERNAL_EVAL_API_KEY and the main site is already listening on :8000
SKIP_INFRA=1 SKIP_INSTALL=1 EVAL_ATTACH_ONLY=1 ./internal-eval.sh
```

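Legacy form: `SKIP_CELERY=1` is mapped to `EVAL_ATTACH_ONLY=1` (it still requires that **8000 is already listening**).

For the main service only, without the evaluation console, nothing changes: run `./development.sh` (and leave `LIFE_ECHO_WITH_INTERNAL_EVAL` unset).

If you only need **8001** and deliberately do not want the main site on **8000**, use the manual uvicorn command below with an existing Celery worker instead of `./internal-eval.sh` (the one-command script also brings up the main site).

**By default `app-eval-web` is started and Vite's `--open` tries to open a browser** (`http://127.0.0.1:5174/`). Set `START_EVAL_WEB=0` to skip the frontend; set `OPEN_EVAL_WEB=0` to keep the frontend but not open a browser window.

The database is shared with the main service. To run the internal evaluation process manually, set the environment variables and then start the dedicated process: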

```bash
cd api
export INTERNAL_EVAL_API_KEY='your-long-random-secret'
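export INTERNAL_EVAL_ENABLE_DOCS=1   # optional; enables /docs
# Judge settings (manual scoring of dialogues and finished memoirs from Playground / Memoir)
# Zhipu: uses EVAL_JUDGE_API_KEY by default, falling back to ZHIPU_API_KEY
export EVAL_JUDGE_API_KEY='...'        # optional
export EVAL_JUDGE_MODEL='glm-5'
# DeepSeek (API model name deepseek-reasoner, i.e. R1): same key as the main interview pipeline, with its own default model name
export DEEPSEEK_API_KEY='...'           # required when DeepSeek is the judge (or falls back to LLM_API_KEY)
export EVAL_JUDGE_DEEPSEEK_MODEL='deepseek-reasoner'   # optional
export EVAL_JUDGE_DEEPSEEK_CONTEXT_WINDOW_TOKENS='64000' # optional; used for transcript truncation so length is not estimated against GLM's 200K window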

uv run uvicorn app.internal_main:internal_app --host 0.0.0.0 --port 8001
```

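The Celery worker is shared with the main site (`celery_app` already `include`s the memoir tasks and others; it **no longer** includes the retired `evaluation_tasks` experiment batch runs). Start a worker when you need Phase1 / narrative processing to advance: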

```bash
uv run celery -A app.tasks.celery_app worker -l info -Q celery,memory_idle
```

## Frontend (`app-eval-web`)

```bash
cd app-eval-web
npm install
# VITE_EVAL_API_KEY must be the same value as INTERNAL_EVAL_API_KEY above
VITE_EVAL_API_BASE=http://127.0.0.1:8001 VITE_EVAL_API_KEY='your-long-random-secret' npm run dev
```

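Alternatively, run `npm run eval-web` from the repository root (requires that `npm install` has already been run in `app-eval-web`).

## Streaming Judge

`POST /internal/api/evaluation/judge/conversation-stream` returns **SSE that is read through `fetch`** (chunked). Sending the `X-Internal-Eval-Key` request header is enough, and the browser `EventSource` API is not required. The body optionally accepts **`judge_provider`**: `zhipu` (default) | `deepseek`, and **`judge_model`** (when empty, the provider's default from the environment is used). The first `meta` event echoes back `judge_provider` / `judge_model`.

New events:

- `compare_summary`: a structured A/B comparison summary containing `group_deltas`, the key dimensions that regressed, whether a repeated-questioning risk was detected, and transcript truncation notices.
- `compare_delta`: the original free-text stream, suited to human reading; it does not replace the structured conclusion.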
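
As a rough illustration of consuming this stream outside a browser, the sketch below parses SSE frames from raw byte chunks. It assumes standard SSE framing (`event:` / `data:` fields, frames separated by a blank line) and JSON payloads in `data:`; this document does not confirm either, and `iter_sse_events` is a hypothetical helper name.

```python
import codecs
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_events(chunks: Iterable[bytes]) -> Iterator[Tuple[str, object]]:
    """Yield (event_name, payload) pairs from a chunked SSE byte stream.

    Assumption: frames use standard SSE fields and end with a blank line.
    `data:` is parsed as JSON when possible, otherwise returned as text.
    """
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer = (buffer + decoder.decode(chunk)).replace("\r\n", "\n")
        while "\n\n" in buffer:
            frame, buffer = buffer.split("\n\n", 1)
            event, data_lines = "message", []
            for line in frame.split("\n"):
                if not line or line.startswith(":"):
                    continue  # empty line or SSE comment / keep-alive
                field, _, value = line.partition(":")
                if value.startswith(" "):
                    value = value[1:]
                if field == "event":
                    event = value
                elif field == "data":
                    data_lines.append(value)
            if not data_lines:
                continue
            raw = "\n".join(data_lines)
            try:
                payload = json.loads(raw)
            except json.JSONDecodeError:
                payload = raw  # e.g. free-text compare_delta fragments
            yield event, payload
```

A caller would typically read `judge_provider` / `judge_model` from the first `meta` event, show `compare_delta` text as it arrives, and base decisions on `compare_summary`.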

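## Evaluation Web (`app-eval-web`)

- **Playground · step-by-step evaluation**: pick a user-export MD as the baseline → `eval-sandbox` + per-turn `replay/conversation` (with **`skip_memoir: true`** only the dialogue is run) → **`memoir-submit`**, then optionally poll **`memoir-phase1-ready`** → jump to **Memoir / Stories** to view the finished text. Streaming dialogue scoring with **Zhipu / DeepSeek R1** is supported (toolbar item 「评审模型」, "judge model").
- **Memoir**: pulls the chapter snapshot from the database by `user_id` and judges it against the baseline.
- **Stories**: story list and judging.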

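## Real-Pipeline Passthrough Replay (Same as the App)

| Method | Path | Description |
|------|------|------|
| `POST` | `/internal/api/evaluation/sessions/eval-sandbox` | No body: creates a **temporary user** (pseudo phone number prefixed `eval_`) plus an empty `conversation_id` |
| `POST` | `/internal/api/evaluation/sessions/replay-bootstrap` | Body: `{ "user_id" }`; returns a new `conversation_id` under an existing user |
| `POST` | `/internal/api/evaluation/replay/conversation` | Body: `conversation_id`, plus `fixture_filename` **or** `user_utterances`; optional **`skip_memoir`** (default false; when true, `queue_message` is not called and `flush_memoir_after` alone does not trigger `flush_pending`), `flush_memoir_after` (default true), `skip_tts` (default true). The response includes `segment_ids` (the user segments created in this batch) |
| `POST` | `/internal/api/evaluation/sessions/{conversation_id}/memoir-submit` | No body: collects the segments in this conversation with `topic_category IS NULL` and `processed` false, then calls `flush_pending(user_id, extra_segment_ids=…)`; returns `segment_ids` and `celery_task_id` |
| `GET` | `/internal/api/evaluation/sessions/{conversation_id}/memoir-phase1-ready` | Query: `segment_ids`, repeatable. Returns `ready: true` once every listed segment has `topic_category` written |

**Default (`skip_memoir: false`)**: each turn still follows the main-site path: `create_user_segment` → `process_user_message` → `background_runner.queue_message`; `flush_pending` can run at the end.

**Playground step-by-step (`skip_memoir: true` + `flush_memoir_after: false`)**: only `create_user_segment` and `process_user_message` run, and nothing is enqueued for the memoir; after the dialogue ends, call **`memoir-submit`** to flush everything at once.

- **TTS**: replay defaults to `skip_tts: true`.
- **Celery**: Phase1 / narrative still depend on a worker; with only the HTTP server running and no worker, tasks pile up after `memoir-submit`.
- **Playground**: step 2 can optionally poll `memoir-phase1-ready` (the frontend waits up to about **10 minutes** by default; override with `VITE_MEMOIR_PHASE1_WAIT_MAX_MS`). After an interruption, the local draft can resume on the same `conversation_id` via 「继续未完成重放」 ("resume unfinished replay"). Only the dialogue progress is restored; drafts from the old "wait for Phase1 after every turn" flow are skipped with a prompt to use `memoir-submit` instead.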
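
The step-by-step flow above can be sketched as a small driver. This is only an outline under assumptions: `post` and `get` are hypothetical transport callables that return the decoded JSON response, `user_utterances` is assumed to be a list of strings, and the `eval-sandbox` response is assumed to carry `conversation_id` at the top level.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

BASE = "/internal/api/evaluation"
Post = Callable[[str, Optional[dict]], dict]        # hypothetical transport
Get = Callable[[str, List[Tuple[str, str]]], dict]  # hypothetical transport


def run_stepwise_replay(post: Post, get: Get, utterances: Sequence[str],
                        max_wait_s: float = 600.0, poll_s: float = 5.0,
                        sleep: Callable[[float], None] = time.sleep) -> dict:
    # Temporary user + empty conversation (response shape is an assumption).
    conversation_id = post(f"{BASE}/sessions/eval-sandbox", None)["conversation_id"]

    # Dialogue only, one turn per request; nothing enters the memoir queue.
    for text in utterances:
        post(f"{BASE}/replay/conversation", {
            "conversation_id": conversation_id,
            "user_utterances": [text],
            "skip_memoir": True,
            "flush_memoir_after": False,
        })

    # Flush everything at once, then poll Phase1 readiness.
    submitted = post(f"{BASE}/sessions/{conversation_id}/memoir-submit", None)
    segment_ids = submitted["segment_ids"]
    query = [("segment_ids", str(s)) for s in segment_ids]  # repeated query key

    waited, ready = 0.0, False
    while True:
        ready = bool(get(f"{BASE}/sessions/{conversation_id}/memoir-phase1-ready", query)["ready"])
        if ready or waited >= max_wait_s:
            break
        sleep(poll_s)
        waited += poll_s
    return {"conversation_id": conversation_id, "segment_ids": segment_ids, "ready": ready}
```

The 600-second default mirrors the frontend's roughly 10-minute wait; if no Celery worker is running, the loop simply times out with `ready` still false.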

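## A/B Release Criteria (Parity with A / Surpassing A)

In Playground's structured summary, the backend returns a `gate`:

- `regressed`: still clearly behind A, or key items such as `context_memory` / `emotion_carry` dropped noticeably, or "repeated questioning / ignoring information already given" has reappeared.
- `parity`: the total score is roughly level with A, and no key dimension has clearly regressed.
- `surpass`: the total score is significantly higher than A, key items such as `context_memory` and character modeling have not regressed, and no repeated-questioning risk was detected.

Before a release, do not rely on a single case:

1. First fix a set of **golden-sample fixtures** (covering childhood, schooling, career, family, values, plus long-dialogue samples).
2. After every prompt / state / anti-repeat change, replay the full set of the same fixtures.
3. Across the whole set, require that:
   - no protected sample is `regressed`;
   - most samples reach at least `parity`;
   - only the target samples need `surpass` to mark the upgrade as done.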
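
The suite-level rule above could be checked mechanically along the following lines. This is a sketch, not the backend's logic: the `protected` / `target` flags and the `SampleResult` type are hypothetical, and "most samples" is interpreted here as a strict majority.

```python
from dataclasses import dataclass
from typing import Iterable, List

_RANK = {"regressed": 0, "parity": 1, "surpass": 2}


@dataclass(frozen=True)
class SampleResult:
    fixture: str
    gate: str                # "regressed" | "parity" | "surpass"
    protected: bool = False  # hypothetical flag: must never regress
    target: bool = False     # hypothetical flag: the change aims to improve it


def release_blockers(results: Iterable[SampleResult]) -> List[str]:
    """Return human-readable reasons that block a release; empty means pass."""
    results = list(results)
    blockers = []
    for r in results:
        if r.protected and r.gate == "regressed":
            blockers.append(f"protected sample regressed: {r.fixture}")
        if r.target and r.gate != "surpass":
            blockers.append(f"target sample has not surpassed A: {r.fixture}")
    at_least_parity = sum(1 for r in results if _RANK[r.gate] >= _RANK["parity"])
    # Assumption: "most" means strictly more than half of the suite.
    if results and at_least_parity * 2 <= len(results):
        blockers.append("a majority of samples does not reach parity")
    return blockers
```

Returning reasons rather than a boolean keeps the per-fixture detail visible, which matters because the gate is meant to be read alongside per-turn scores.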

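If `compare_summary.truncation.*_truncated_for_compare = true`, the transcripts used for the A/B comparison still exceeded the combined budget (`compare_cap_total_chars`) and were trimmed. When one side is short, the combined character pool is used up first, and only then is the tail of the longer side trimmed. If truncation persists, slightly raise `EVAL_JUDGE_CONTEXT_WINDOW_TOKENS`, lower `EVAL_JUDGE_APPROX_TOKENS_PER_CHAR`, or see `EVAL_JUDGE_MAX_COMPARE_TRANSCRIPT_CHARS_EACH`. Conclusions should still be cross-checked against per-turn scores and a manual review of key samples.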
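
A client can surface these flags before trusting a gate. The helper below is a minimal sketch that assumes `truncation` is a flat object whose boolean keys end in `_truncated_for_compare`; the prefixes in front of that suffix are not specified here, so the function just returns whatever it finds.

```python
from typing import Any, Dict, List

_SUFFIX = "_truncated_for_compare"


def truncated_sides(compare_summary: Dict[str, Any]) -> List[str]:
    """List the key prefixes whose `*_truncated_for_compare` flag is true."""
    truncation = compare_summary.get("truncation") or {}
    return sorted(
        key[: -len(_SUFFIX)]
        for key, value in truncation.items()
        if key.endswith(_SUFFIX) and value is True
    )
```

A non-empty result is a cue to review the affected transcripts by hand instead of relying on the structured summary alone.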

## Manual GLM-5 Judging (Does Not Write to the `eval_runs` Table)

| Method | Path | Description |
|------|------|------|
| `POST` | `/internal/api/evaluation/judge/conversation` | Body: `{ "conversation_id" }`; returns per-turn scores plus a whole-dialogue score |
| `POST` | `/internal/api/evaluation/judge/memoir-chapters` | Body: `{ "user_id", "baseline_sections"? }`; returns Chapter/Story sub-scores |
| `GET` | `/internal/api/evaluation/users/{user_id}/memoir-snapshot` | Read-only snapshot of chapter and story text |
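
For scripted calls to these endpoints, a request can be assembled with the standard library as sketched below. It assumes the `X-Internal-Eval-Key` header documented for the streaming endpoint also authenticates these routes, and that the service listens on the default port 8001; `build_eval_request` is a hypothetical helper.

```python
import json
import urllib.request
from typing import Optional


def build_eval_request(base_url: str, path: str, api_key: str,
                       body: Optional[dict] = None,
                       method: str = "POST") -> urllib.request.Request:
    """Build (but do not send) an authenticated internal-eval request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"X-Internal-Eval-Key": api_key}  # assumed to apply to all routes
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )
```

For example, `build_eval_request("http://127.0.0.1:8001", "/internal/api/evaluation/judge/conversation", key, {"conversation_id": cid})` can be passed to `urllib.request.urlopen` to trigger a manual judge run.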

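## Memoir Judging: Traceable Evidence Closure (lineage)

For the **product and tier definitions (strict / partial / fallback), the synthetic vs library table split, PM alignment rules, and the backlog**, see **[traceable-memoir-lineage.md](./traceable-memoir-lineage.md)** in the same directory.

The manual `/judge/memoir-chapters` endpoint and the `judge_bundle_json` of historical automated runs now build the prompt from **artifact-bound evidence**, and no longer concatenate "the full text of the most recent N conversations" by default:

- **`lineage_tier`**: `strict` / `partial` / `fallback`. For chapters, **a resolvable transcript chain plus structured memory is strict**, while **structured memory only, with no bound segment/transcript, is partial**, consistent with the annotation guidelines. On the story side, the tier is derived mainly from `StoryEvidenceLink` and from the chapters. `fallback` means an explicit downgrade to recent-conversation transcripts, so that such cases are never silently treated as strict.
- **`evidence_trace`**: the complete bundle JSON (segment / conversation / chunk / fact / timeline / summary, `notes`, and so on). This is usually enough for internal audits; type-specific deep links in the UI will be scheduled if needed.
- **`format_meta`**: `truncated`, `dropped_sections`, `included_token_estimate`, and so on, which distinguish "trimmed from the prompt" from "no lineage in the database".
- **Production side**: after every Story write, the narrative pipeline overwrites `story_evidence_links` and writes the stable ids retrieved in that round into the current `story_versions.prompt_meta.memoir_retrieval` (see `story_pipeline_sync._persist_story_lineage_sync`).
- **Chapter snapshots, Phase C**: `chapter_evidence_snapshots` + `chapter_evidence_links`, with `chapters.current_evidence_snapshot_id` pointing at the current version; `evidence_bundle_json` is kept as a mirror. Evaluation read order: table snapshot → JSON → live `source_segments` (`notes` flags any mismatch). For the refresh logic, see `memoir/chapter_evidence_snapshot.py`. For historical databases, `uv run python scripts/backfill_chapter_evidence_snapshots.py` is optional (not mandatory for old data).
- **Dialogue memory trace (Phase 8)**: on the interview route, `conversation_messages.memory_retrieval_trace_json` records, on the paired **AI** message, the ids of the chunk/fact/timeline/summary/story items hit by `HybridRetriever` in that turn (see `memory/retrieval_trace.py`).

Historical data may have no links: evaluation still runs with partial/fallback. Any optional offline backfill must be explicitly tagged in the job and must not pass itself off as strict.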
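
To make the chapter read order (table snapshot → JSON → live `source_segments`) concrete, here is a simplified sketch. It is not the code in `memoir/chapter_evidence_snapshot.py`: it assumes each source can be reduced to a list of segment ids, and the function name and note wording are hypothetical.

```python
from typing import List, Optional, Tuple


def resolve_chapter_evidence(
    table_snapshot: Optional[List[str]],
    bundle_json: Optional[List[str]],
    live_segments: Optional[List[str]],
) -> Tuple[Optional[str], List[str], List[str]]:
    """Pick the first non-empty source in priority order and note mismatches."""
    sources = [
        ("table_snapshot", table_snapshot),
        ("evidence_bundle_json", bundle_json),
        ("source_segments", live_segments),
    ]
    available = [(name, ids) for name, ids in sources if ids]
    if not available:
        return None, [], ["no chapter evidence found in any source"]
    chosen_name, chosen_ids = available[0]
    # Lower-priority sources are only compared, never merged into the result.
    notes = [
        f"{name} differs from {chosen_name}"
        for name, ids in available[1:]
        if set(ids) != set(chosen_ids)
    ]
    return chosen_name, list(chosen_ids), notes
```

Keeping the mismatch in `notes` rather than raising lets historical chapters without a table snapshot still be judged while the discrepancy stays visible.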

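## Fixture Detail Extensions

`GET /internal/api/evaluation/fixtures/user-exports/{filename}` returns, in addition to the existing `turns`:

- `source_user_id`: the User ID from the export header
- `memoir_sections`: the baseline text under `## 回忆录章节（生成正文）`, split by heading (with `{{IMAGE:...}}` placeholders removed)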
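
The `memoir_sections` extraction can be pictured roughly as follows. This is an illustrative sketch rather than the project's importer: it assumes chapters are `###` headings beneath the `## 回忆录章节（生成正文）` section and returns a title-to-text mapping, whereas the real response shape may differ.

```python
import re
from typing import Dict, List, Optional

SECTION_HEADING = "## 回忆录章节（生成正文）"
_IMAGE = re.compile(r"\{\{IMAGE:[^}]*\}\}")


def split_memoir_sections(markdown: str) -> Dict[str, str]:
    """Split the memoir section of a user export into {chapter title: text}."""
    sections: Dict[str, str] = {}
    inside = False
    title: Optional[str] = None
    lines: List[str] = []

    def flush() -> None:
        # Store the chapter collected so far, with image placeholders removed.
        if title is not None:
            sections[title] = _IMAGE.sub("", "\n".join(lines)).strip()

    for line in markdown.splitlines():
        if line.startswith("## "):
            # Any level-2 heading closes the current chapter and section.
            flush()
            title, lines = None, []
            inside = line.strip() == SECTION_HEADING
        elif inside and line.startswith("### "):  # assumed chapter heading level
            flush()
            title, lines = line[4:].strip(), []
        elif inside and title is not None:
            lines.append(line)
    flush()
    return sections
```

The resulting mapping is the kind of value that could be passed as `baseline_sections` when calling `/judge/memoir-chapters`, subject to the actual schema.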