api/docs/internal-eval.md

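# Internal Regression Evaluation Platform

Deployed as a separate process from the main API (`app/main.py`), so that candidate evaluation pipelines are never exposed to the consumer App.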

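## Startup

**Recommended: one command.** `internal-eval.sh` simply calls `development.sh`, which starts the main site `main:app` (**8000**), **a single** Celery worker, the internal evaluation app `internal_app` (default **8001**), and `app-eval-web` (default **5174**) in one process tree. There is no need to run two startup scripts side by side.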

| | Single command `./internal-eval.sh` |
|---|-------------------------------|
| HTTP | Main site **8000** + internal **8001** |
| Celery | **One** worker only (shares the queue with the main site) |
| Frontend | Starts `app-eval-web` by default (set `START_EVAL_WEB=0` to disable) |

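If **the main site and Celery are already running in another terminal** via `./development.sh`, and you only want to add the evaluation HTTP service and the frontend on the same machine **without starting a second worker**: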

```bash
cd api
# Make sure .env.development / .env contains INTERNAL_EVAL_API_KEY and that the main site is already listening on :8000
SKIP_INFRA=1 SKIP_INSTALL=1 EVAL_ATTACH_ONLY=1 ./internal-eval.sh
```

Legacy form: `SKIP_CELERY=1` is mapped to `EVAL_ATTACH_ONLY=1` (and still requires **8000 to be listening already**).
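Attach-only mode depends on the main site already listening on port 8000. As a minimal sketch of a pre-flight check you could run before attaching, the helper below simply tries a TCP connection. The function name and the `127.0.0.1` host are assumptions for illustration and are not part of the scripts.

```python
import socket


def is_port_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all as "not listening".
        return False


if __name__ == "__main__":
    # 8000 is the main site port named above; the host is an assumption.
    if is_port_listening("127.0.0.1", 8000):
        print("main site is up; attach-only mode should be usable")
    else:
        print("nothing is listening on :8000; start ./development.sh first")
```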

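For the main service only, without the evaluation console, run `./development.sh` as before (leave `LIFE_ECHO_WITH_INTERNAL_EVAL` unset).

If you only need **8001** and deliberately do not want the main site on **8000**, use the manual uvicorn command below together with an existing Celery worker rather than `./internal-eval.sh` (the one-command script also brings up the main site).

**By default the script starts `app-eval-web` and uses Vite's `--open` to try to open a browser** (`http://127.0.0.1:5174/`). Set `START_EVAL_WEB=0` to skip the frontend; set `OPEN_EVAL_WEB=0` to keep the frontend but not open a browser window.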
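The switches described in this section interact in a small way that is easy to misread, so here is a hedged sketch of how they could resolve. The real logic lives in the shell scripts; `resolve_eval_flags` and its return keys are hypothetical, and the exact parsing (for example, which values count as "on") is an assumption.

```python
from typing import Mapping


def resolve_eval_flags(env: Mapping[str, str]) -> dict[str, bool]:
    """Resolve the documented switches from an environment mapping."""

    def is_one(name: str) -> bool:
        return env.get(name, "0") == "1"

    # Legacy alias: SKIP_CELERY=1 is treated as EVAL_ATTACH_ONLY=1.
    attach_only = is_one("EVAL_ATTACH_ONLY") or is_one("SKIP_CELERY")
    # The frontend starts unless START_EVAL_WEB is explicitly 0.
    start_web = env.get("START_EVAL_WEB", "1") != "0"
    # Opening a browser only applies when the frontend starts at all.
    open_web = start_web and env.get("OPEN_EVAL_WEB", "1") != "0"
    return {
        "attach_only": attach_only,
        "start_web": start_web,
        "open_web": open_web,
    }
```

For instance, `resolve_eval_flags({"SKIP_CELERY": "1", "OPEN_EVAL_WEB": "0"})` yields attach-only mode with the frontend running but no browser window.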

The database is shared with the main service. To start the dedicated process manually, set the environment variables first:

```bash
cd api
export INTERNAL_EVAL_API_KEY='your-long-random-secret'
export INTERNAL_EVAL_ENABLE_DOCS=1   # optional: enables /docs
# GLM judge (reuses the Zhipu key by default; can also be configured separately)
export EVAL_JUDGE_API_KEY='...'        # optional: defaults to ZHIPU_API_KEY
export EVAL_JUDGE_MODEL='glm-4-flash'

uv run uvicorn app.internal_main:internal_app --host 0.0.0.0 --port 8001
```

The Celery worker must already include `app.tasks.evaluation_tasks` (registered in the repository's `celery_app.include`). Before running experiments, start it:

```bash
uv run celery -A app.tasks.celery_app worker -l info
```

## Frontend (`app-eval-web`)

```bash
cd app-eval-web
npm install
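# VITE_EVAL_API_KEY must match INTERNAL_EVAL_API_KEY above (this assumes it is exported in the current shell)
VITE_EVAL_API_BASE=http://127.0.0.1:8001 VITE_EVAL_API_KEY="$INTERNAL_EVAL_API_KEY" npm run dev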
```

Alternatively, run `npm run eval-web` from the repository root (requires that `npm install` has already been run in `app-eval-web`).

## SSE / EventSource

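The browser `EventSource` API cannot send custom headers, so streaming endpoints also accept the key as a **query** parameter, `?key=`, which is equivalent to the `X-Internal-Eval-Key` header.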
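As an illustration of the query-parameter form, the sketch below appends `key` to a URL while preserving any existing query string. The helper name and the stream path in the usage note are hypothetical; only the `key` parameter name comes from this document.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_eval_key(url: str, key: str) -> str:
    """Return url with ?key=<key> set, replacing any existing key parameter."""
    parts = urlsplit(url)
    # Keep every existing parameter except a stale "key".
    query = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != "key"
    ]
    query.append(("key", key))
    # urlencode percent-escapes the secret so it is safe inside a URL.
    return urlunsplit(parts._replace(query=urlencode(query)))
```

A call such as `with_eval_key("http://127.0.0.1:8001/some/stream?run=1", secret)` (placeholder path) gives a URL that can be handed to `EventSource` directly. Keep in mind that a key in the query string may end up in access logs, which is one reason to prefer the header wherever the client allows it.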

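## Evaluation Web: two modules

- **Conversation evaluation**: pick an `api/tests/user_exports/*.md` file as the baseline → "New evaluation session" (「新建评测会话」) or enter an existing `conversation_id` → "Run replay" (「执行回放」) → "GLM judge conversation" (「GLM 评审对话」).
- **Memoir chapters**: the same fixture carries the `source_user_id` and `memoir_sections` from the exported Markdown; "Refresh chapters/stories from DB" (「刷新库中章节/故事」) pulls a DB snapshot → "GLM judge chapters" (「GLM 评审章节」), which sends the baseline excerpts together with the current drafts for judging.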

## Real-pipeline passthrough replay (same as the App)

| Method | Path | Description |
|------|------|------|
| `POST` | `/internal/api/evaluation/sessions/eval-sandbox` | No body: creates a **temporary user** (pseudo phone number prefixed with `eval_`) plus an empty `conversation_id` |
| `POST` | `/internal/api/evaluation/sessions/replay-bootstrap` | body: `{ "user_id" }`; returns a new `conversation_id` under an existing user |
| `POST` | `/internal/api/evaluation/replay/conversation` | body: `conversation_id`, plus `fixture_filename` **or** `user_utterances`; optional `flush_memoir_after` (default true) and `skip_tts` (default true) |

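Each turn is equivalent to the WebSocket text path: `create_user_segment` → `process_user_message` (which can apply `force_skip_tts` internally) → `background_runner.queue_message`.

- **TTS**: replay defaults to `skip_tts: true`, so no speech synthesis runs on the evaluation console.
- **Memory / memoir pipeline**: `queue_message` and the final `flush_pending` depend on a **Celery worker** (`process_memoir_phase1` and related tasks). If only the internal API is running and no worker is up, the conversation is persisted but the asynchronous chapter generation does not advance.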
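To make the replay request body concrete, here is a small sketch that builds it with the documented defaults. It assumes that exactly one of `fixture_filename` and `user_utterances` should be sent, that `conversation_id` is a string, and that `user_utterances` is a list of strings; the source does not spell these out. `build_replay_body` is a hypothetical helper.

```python
from typing import Any, Optional, Sequence


def build_replay_body(
    conversation_id: str,
    fixture_filename: Optional[str] = None,
    user_utterances: Optional[Sequence[str]] = None,
    flush_memoir_after: bool = True,
    skip_tts: bool = True,
) -> dict[str, Any]:
    """Build the JSON body for POST /internal/api/evaluation/replay/conversation."""
    # Assumption: the two input sources are mutually exclusive.
    if (fixture_filename is None) == (user_utterances is None):
        raise ValueError("pass exactly one of fixture_filename or user_utterances")
    body: dict[str, Any] = {
        "conversation_id": conversation_id,
        "flush_memoir_after": flush_memoir_after,
        "skip_tts": skip_tts,
    }
    if fixture_filename is not None:
        body["fixture_filename"] = fixture_filename
    else:
        body["user_utterances"] = list(user_utterances or [])
    return body
```

Leaving `flush_memoir_after` at its default only has a visible effect when a Celery worker is running, as noted above.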

## Manual GLM judging (does not write to the `eval_runs` table)

| Method | Path | Description |
|------|------|------|
| `POST` | `/internal/api/evaluation/judge/conversation` | body: `{ "conversation_id" }`; returns per-turn scores plus a whole-conversation score |
| `POST` | `/internal/api/evaluation/judge/memoir-chapters` | body: `{ "user_id", "baseline_sections"? }`; returns separate Chapter and Story scores |
| `GET` | `/internal/api/evaluation/users/{user_id}/memoir-snapshot` | Read-only snapshot of chapter and story text |
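A judge call from a script might be prepared as sketched below, using only the standard library. The sketch assumes the endpoints accept the `X-Internal-Eval-Key` header and a JSON body; `eval_post` is a hypothetical helper, and the response shape is deliberately not modelled because it is not specified here.

```python
import json
import urllib.request
from typing import Any, Mapping


def eval_post(
    base_url: str, api_key: str, path: str, body: Mapping[str, Any]
) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(dict(body)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Internal-Eval-Key": api_key,
        },
        method="POST",
    )


# Usage (needs the internal app running on 8001):
# req = eval_post("http://127.0.0.1:8001", key,
#                 "/internal/api/evaluation/judge/conversation",
#                 {"conversation_id": "..."})
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```

Building the request separately from sending it keeps the sketch easy to inspect without a running server.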

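## Fixture detail extensions

`GET /internal/api/evaluation/fixtures/user-exports/{filename}` returns the following in addition to the existing `turns`:

- `source_user_id`: the User ID from the export header
- `memoir_sections`: the baseline text under the `## 回忆录章节（生成正文）` heading (the "Memoir chapters (generated text)" section of the export), split by title, with `{{IMAGE:...}}` placeholders removed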
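The splitting described for `memoir_sections` can be sketched roughly as follows. This is not the importer's actual code: it assumes chapter titles are `###` headings under the memoir section and that each section is returned as a `title`/`body` pair, neither of which is confirmed by this document.

```python
import re

# Matches placeholders such as {{IMAGE:photo-1.png}}.
IMAGE_PLACEHOLDER = re.compile(r"\{\{IMAGE:[^}]*\}\}")
MEMOIR_HEADING = "## 回忆录章节（生成正文）"


def extract_memoir_sections(markdown: str) -> list[dict[str, str]]:
    """Split the memoir part of an export into titled baseline sections."""
    sections: list[dict[str, str]] = []
    in_memoir = False
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            # Any level-2 heading opens or closes the memoir part.
            in_memoir = line.strip() == MEMOIR_HEADING
            current = None
            continue
        if not in_memoir:
            continue
        if line.startswith("### "):  # assumed chapter-title level
            current = {"title": line[4:].strip(), "body": ""}
            sections.append(current)
        elif current is not None:
            current["body"] += IMAGE_PLACEHOLDER.sub("", line) + "\n"
    for section in sections:
        section["body"] = section["body"].strip()
    return sections
```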

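## Gate rules (v1)

- Mean composite score across all cases: the candidate must be **strictly higher** than the baseline.
- Cases with `is_protected=true`: the composite score must not drop by more than `EVAL_GATE_PROTECTED_REGRESSION_THRESHOLD` (default: 2 points).

Results are written to `eval_gate_verdicts` and do not affect `git`; a pre-commit hook or CI can be wired in later.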
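The two v1 rules can be read as the following minimal sketch. `CaseScore` and `gate_passes` are hypothetical names, and the treatment of an empty run is an assumption; the real implementation may aggregate scores differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CaseScore:
    case_id: str
    baseline: float   # composite score of the baseline
    candidate: float  # composite score of the candidate
    is_protected: bool = False


def gate_passes(cases: list[CaseScore], protected_threshold: float = 2.0) -> bool:
    """Apply both v1 gate rules; True means the candidate passes."""
    if not cases:
        return False  # assumption: an empty run cannot pass the gate
    # Rule 1: the candidate mean must be strictly higher than the baseline mean.
    if not mean(c.candidate for c in cases) > mean(c.baseline for c in cases):
        return False
    # Rule 2: no protected case may drop by more than the threshold.
    return all(
        c.baseline - c.candidate <= protected_threshold
        for c in cases
        if c.is_protected
    )
```

Note that a drop of exactly the threshold still passes under this reading, since the rule only forbids drops that exceed it.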