feat: OpenTelemetry LGTM observability, dev tooling, and memoir UX fixes (#31)

* add staging ios app build script

* feat(api): add OpenTelemetry LGTM stack for local observability

Wire OTel traces, metrics, and logs through a collector to Tempo,
Prometheus, and Loki, with custom LLM instrumentation, dev compose overlay,
Grafana provisioning, env templates, and development.sh auto-start.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat: expand observability, harden dev tooling, and fix expo staging UX

Add business and LLM Prometheus metrics with Grafana dashboards, alerting,
and a metrics verification script. Wire telemetry through adapters and core
LLM paths, and document the local LGTM workflow.

Fix development.sh for macOS bash 3.2, open Grafana and eval-web in Chrome,
and repair eval-web auto-open (unbound EVAL_WEB_BROWSER_SCHEDULED). Merge
internal-eval into the main dev script with improved compose handling.

Require EXPO_PUBLIC_* at build time, improve iOS HTTP ATS for staging IPs,
show memoir empty state instead of load errors when no chapters exist, and
add jest env setup plus chapter list response normalization.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore: enable Grafana Assistant Cursor plugin

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: memoir empty state and repair withdrawn 0020_chapters_book_id stamp

Show empty memoir UI when the chapter list succeeds with no items; treat auth/404 as non-fatal. Extend alembic revision repair so local dev DBs stamped with the removed 0020_chapters_book_id migration can roll back and upgrade to 0019.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Kevin <kevin@brighteng.org>
Co-authored-by: Cursor <cursoragent@cursor.com>
Sully
2026-05-20 15:12:21 +08:00
committed by GitHub
parent 0d417331fd
commit fa42757916
85 changed files with 3894 additions and 405 deletions


@@ -4,29 +4,30 @@
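## Startup
**Recommended single command:** `internal-eval.sh` actually calls `development.sh`, which starts the main site `main:app` (**8000**), **one** Celery worker, the internal eval app `internal_app` (default **8001**), and `app-eval-web` (default **5174**) in the same process tree. There is no need to run two startup scripts in parallel.
**Recommended single command:** `./development.sh` starts the main site (**8000**), Celery, the internal eval API (default **7999**), and the eval web UI (**5174**) by default. When `.env` sets `OTEL_ENABLED=true`, it also starts Grafana and opens the browser automatically. `./internal-eval.sh` remains only as a compatibility forwarder.
| | Single command `./internal-eval.sh` |
| | `./development.sh` (default) |
|---|-------------------------------|
| HTTP | Main site **8000** + internal **8001** |
| Celery | Only **one** worker (shares the queue with the main site) |
| Frontend | Starts `app-eval-web` by default (set `START_EVAL_WEB=0` to disable) |
| HTTP | Main site **8000** + internal **7999** |
| Celery | Only **one** worker |
| Eval UI | `open` → http://127.0.0.1:5174/ (set `OPEN_EVAL_WEB=0` to disable) |
| Observability | Grafana :48300 (set `OPEN_OBSERVABILITY_UI=0` to disable) |
**Main site + Celery already running in another terminal** via `./development.sh`, and you only want to add the eval HTTP service and frontend on the same machine **without starting a second worker**: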
```bash
cd api
# Make sure .env.development / .env contains INTERNAL_EVAL_API_KEY and the main site is already listening on :8000
SKIP_INFRA=1 SKIP_INSTALL=1 EVAL_ATTACH_ONLY=1 ./internal-eval.sh
SKIP_INFRA=1 SKIP_INSTALL=1 EVAL_ATTACH_ONLY=1 ./development.sh
```
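Legacy form: `SKIP_CELERY=1` is mapped to `EVAL_ATTACH_ONLY=1` (still requires **8000 to already be listening**).
Main service only, without the eval console, works as before: `./development.sh` (leave `LIFE_ECHO_WITH_INTERNAL_EVAL` unset).
Main service only, without the eval console: `LIFE_ECHO_WITH_INTERNAL_EVAL=0 ./development.sh`
If you only need **8001** and deliberately do not want the main site on **8000**, use the "manual uvicorn" steps below with your existing Celery; do not use `./internal-eval.sh` (the one-command script also brings up the main site).
If you only need **7999** and not the main site on **8000**, see "manual uvicorn" below; do not use the one-command script.
**`app-eval-web` starts by default and tries to open a browser via Vite `--open`** (`http://127.0.0.1:5174/`). Set `START_EVAL_WEB=0` to skip the frontend; set `OPEN_EVAL_WEB=0` to keep the frontend without opening a browser window.
**`app-eval-web` starts by default and the eval console is opened in the system browser** (`http://127.0.0.1:5174/`, via `open`, the same as Grafana). Set `START_EVAL_WEB=0` to skip the frontend; set `OPEN_EVAL_WEB=0` to keep the frontend without opening a browser window.
The database is shared with the main service; set the environment variables, then start the dedicated process: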

api/docs/observability.md Normal file

@@ -0,0 +1,139 @@
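# Observability (OpenTelemetry + Grafana LGTM)
Local development uses **OpenTelemetry** to collect traces / metrics / logs, writes them through the **OTel Collector** to **Tempo / Prometheus / Loki**, and views them together in **Grafana**.
Configuration lives in **`.env`** (synced from `.env.development` by `development.sh`, or copied from [`.env.example`](../.env.example)). `app.core.config.settings` reads it automatically at startup, so there is **no need** to `export OTEL_*` in the shell.
## Starting the stack
In the `api/` directory: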
```bash
# 1. Database and Redis
docker compose -f docker-compose.dev.yml up -d
# 2. Observability (requires the life-echo-dev network to exist; ports come from .env or the defaults below)
docker compose -f docker-compose.dev.yml -f docker-compose.observability.yml up -d
```
| Service | Default host address | compose variable |
|------|----------------|--------------|
| Grafana | http://127.0.0.1:48300 (admin / admin) | `GRAFANA_HOST_PORT` |
| Prometheus | http://127.0.0.1:49090 | `PROMETHEUS_HOST_PORT` |
| OTLP gRPC | http://127.0.0.1:48317 | `OTEL_GRPC_HOST_PORT` |
| OTLP HTTP | http://127.0.0.1:48318 | `OTEL_HTTP_HOST_PORT` |
| Collector health | http://127.0.0.1:48333 | `OTEL_COLLECTOR_HEALTH_HOST_PORT` |
**Inside** the containers the standard ports are still used (e.g. Collector `4317`); only the host mappings use the `48xxx` range, in the same style as Postgres `48291` and Redis `48307`.
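As a rough illustration of how the host-side addresses follow from the compose variables, the sketch below resolves each URL from an environment mapping and falls back to the defaults in the table. The helper name `host_url` and the dictionary are hypothetical and not part of the repository.

```python
import os

# Defaults copied from the table above; keys are the compose variables.
DEFAULT_HOST_PORTS = {
    "GRAFANA_HOST_PORT": 48300,
    "PROMETHEUS_HOST_PORT": 49090,
    "OTEL_GRPC_HOST_PORT": 48317,
    "OTEL_HTTP_HOST_PORT": 48318,
    "OTEL_COLLECTOR_HEALTH_HOST_PORT": 48333,
}


def host_url(variable, env=None):
    """Return the host-side URL for one compose port variable (hypothetical helper)."""
    env = os.environ if env is None else env
    # An override in the environment wins; otherwise use the documented default.
    port = int(env.get(variable, DEFAULT_HOST_PORTS[variable]))
    return f"http://127.0.0.1:{port}"
```

Calling `host_url("GRAFANA_HOST_PORT", {})` yields the documented Grafana address, while passing an overridden value shows which URL a customized `.env` would produce.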
Preprovisioned dashboards (**Life Echo** folder):
| Dashboard | Purpose |
|-----------|------|
| Life Echo Overview | API RED, LLM summary, dependency latency |
| Life Echo LLM | `call_type` / agent / tokens, outcome distribution |
| Life Echo Business | Memoir phases, WS/ASR/TTS, Celery business spans |
| Life Echo Logs | Loki search by `event` / `trace_id` |
## Enabling export from the application
[`.env.example`](../.env.example) already provides local defaults; sync them into `.env`, for example:
```env
OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:48317
OTEL_TRACES_SAMPLER=always_on
OTEL_SERVICE_NAME=life-echo-api
```
Recommended: start it together with the full stack (`./development.sh` brings up the observability compose when `.env` sets `OTEL_ENABLED=true`, and opens a Grafana browser tab by default):
```bash
cd api
./development.sh
```
To start only the API manually (Grafana is not opened automatically):
```bash
cd api
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
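The Celery worker uses the same `.env`; when `OTEL_SERVICE_NAME` is unset, the worker defaults to `life-echo-celery-worker`.
If the API runs inside **Docker compose**, set `OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317` (service name + in-container port) rather than `localhost`.
When observability is not needed: set `OTEL_ENABLED=false` in `.env` (or simply do not start the observability compose).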
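The endpoint rule can be summarized in a few lines. This is a minimal sketch under the assumption that the only two cases are "API on the host" and "API inside compose"; both function names are hypothetical, and the second one mirrors the check that the endpoint port agrees with `OTEL_GRPC_HOST_PORT`.

```python
from urllib.parse import urlparse


def otlp_endpoint(in_compose, grpc_host_port=48317):
    """Pick the OTLP gRPC endpoint for the API process (hypothetical helper)."""
    if in_compose:
        # Inside the compose network: service name + in-container port.
        return "http://otel-collector:4317"
    # On the host: localhost + the mapped host port.
    return f"http://localhost:{grpc_host_port}"


def endpoint_port_matches(endpoint, grpc_host_port):
    """True when the endpoint's port equals the mapped OTEL_GRPC_HOST_PORT."""
    return urlparse(endpoint).port == int(grpc_host_port)
```

A mismatch reported by `endpoint_port_matches` is one plausible reason for an empty Grafana when the API runs on the host.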
## What is collected
| Type | Source |
|------|------|
| HTTP | FastAPI auto-instrumentation (`/health` excluded) |
| DB | SQLAlchemy |
| Redis | redis-py |
| Outbound HTTP | httpx (DeepSeek, etc.) |
| Celery | Task spans + W3C trace propagation |
| LLM | `llm_telemetry` (LangChain / DeepSeek / `llm_call`) + `llm.call.*` / `llm.tokens.*` metrics |
| Business | `business_telemetry` (WS turns, memoir phases, ASR/TTS, payments, and other child spans) |
| Logs | loguru patcher injects `trace_id`; Promtail parses `event` / `tid=`; optional `LOG_JSON_FILE` JSON sink |
Log fields: `request_id`, `trace_id`, `span_id`. For HTTP they are set by the middleware via `contextualize`; for **Celery / background** work the loguru **patcher** merges them from the current OTel span, without going through the HTTP middleware.
## Common troubleshooting
1. **Slow API**: Grafana → Tempo, search traces by `service.name=life-echo-api`, then inspect the DB / httpx / `llm.*` / `conversation.ws.*` child spans.
2. **Slow LLM**: the **Life Echo LLM** dashboard, or Loki: `{compose_service=~".+"} |= "event=llm_json_call"`
3. **Memoir stuck in a phase**: search Tempo for `memoir.phase1` / `memoir.phase2` / `memoir.story_pipeline.*`; check `business_operation_duration_milliseconds` on the **Life Echo Business** dashboard.
4. **Logs ↔ Trace**: copy the `trace_id` from Tempo → Loki: `{compose_service=~".+"} |= "tid=<first 12 chars>"` (the short console format). Promtail writes `trace_id` into **structured metadata** (not a high-cardinality label).
5. **Celery backlog**: filter Tempo by `life-echo-celery-worker`; in Loki search `event=celery_task_failed`.
6. **No data**: check that `.env` has `OTEL_ENABLED=true`; that the port in `OTEL_EXPORTER_OTLP_ENDPOINT` matches `OTEL_GRPC_HOST_PORT`; that Collector health at `http://127.0.0.1:48333` responds; and that the Prometheus target `otel-collector:8889` is UP.
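For the logs ↔ trace step, the Loki query can be derived mechanically from a trace ID. The sketch below assumes the usual 32-character lowercase hex rendering of a 128-bit OTel trace ID and the 12-character `tid=` prefix mentioned in the list; the function names are hypothetical.

```python
def format_trace_id(trace_id):
    """Render a 128-bit integer trace ID as 32 lowercase hex characters."""
    return format(trace_id, "032x")


def loki_tid_query(trace_id_hex, prefix_len=12):
    """Build the LogQL line filter for the short console `tid=` format."""
    tid = trace_id_hex.lower()[:prefix_len]
    # Stream selector and filter shape copied from the troubleshooting list.
    return '{compose_service=~".+"} |= "tid=' + tid + '"'
```

Pasting a trace ID copied from Tempo into `loki_tid_query` gives a filter that can be run directly in the Loki explore view.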
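### LOG_JSON_FILE and Promtail
- **Default**: human-readable loguru lines → Docker stdout → Promtail **regex** extracts `tid` / `event` / `duration_ms`; `trace_id` goes into structured metadata and is **not used as a Loki label**.
- **Optional**: `LOG_JSON_FILE=/path/to/app.jsonl` enables a JSON sink (`serialize=true`), which makes it easier to align with OTLP logs or a custom collector; it can **coexist** with Promtail (stdout from the same container still goes through the regex).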
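To make the default regex path concrete, here is a small Python approximation of pulling `tid` / `event` / `duration_ms` out of a console line. The patterns and the sample line layout are assumptions for illustration; the real Promtail pipeline stages live in the repository's Promtail config and may differ.

```python
import re

# Assumed key=value layout of the console line; not copied from the Promtail config.
TID_RE = re.compile(r"\btid=(?P<tid>[0-9a-f]{12})\b")
EVENT_RE = re.compile(r"\bevent=(?P<event>[A-Za-z0-9_.]+)")
DURATION_RE = re.compile(r"\bduration_ms=(?P<duration_ms>\d+(?:\.\d+)?)")


def extract_fields(line):
    """Return whichever of tid / event / duration_ms appear in a log line."""
    fields = {}
    for pattern in (TID_RE, EVENT_RE, DURATION_RE):
        match = pattern.search(line)
        if match:
            fields.update(match.groupdict())
    return fields
```

Lines that carry none of the keys simply produce an empty dict, which matches the idea that extraction is best-effort and never drops the log line itself.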
## Sampling (staging/prod, phase two)
| Environment | Recommendation |
|------|------|
| development | `OTEL_TRACES_SAMPLER=always_on` |
| staging/production | `OTEL_TRACES_SAMPLER=parentbased_traceidratio`, `OTEL_TRACES_SAMPLER_ARG=0.1` |
To disable telemetry: `OTEL_ENABLED=false`, with no exporter overhead.
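The intuition behind `parentbased_traceidratio` with an argument of `0.1` can be sketched in plain Python. This is a simplified model, not the SDK's code: it assumes the decision compares the low 64 bits of the trace ID against a bound derived from the ratio, and that a present parent decision is simply inherited.

```python
LOW_64_MASK = (1 << 64) - 1


def ratio_bound(ratio):
    """Upper bound on the low 64 bits of a trace ID for a given ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0, 1]")
    return round(ratio * (1 << 64))


def should_sample(trace_id, ratio, parent_sampled=None):
    """Simplified parent-based ratio decision (illustrative only)."""
    if parent_sampled is not None:
        # Parent-based part: a child span follows its parent's decision.
        return parent_sampled
    # Root span: deterministic decision from the trace ID itself.
    return (trace_id & LOW_64_MASK) < ratio_bound(ratio)
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same verdict, so roughly one trace in ten is kept end to end at `0.1`.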
## Prometheus metric names (OTel → Prometheus)
| OTel instrument | Prometheus series (histogram) |
|-----------|------------------------------|
| `llm.call.duration` (ms) | `llm_call_duration_milliseconds_bucket` |
| `business.operation.duration` (ms) | `business_operation_duration_milliseconds_bucket` |
| `http.server.request.duration` (s) | `http_server_request_duration_seconds_bucket` |
| `db.client.operation.duration` (s) | `db_client_operation_duration_seconds_bucket` |
| `http.client.request.duration` (s) | `http_client_request_duration_seconds_bucket` |
Counter examples: `llm_call_total`, `llm_tokens_input_total`.
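The naming pattern in the table can be reproduced with a small helper: dots become underscores, the unit is spelled out as a suffix, and histograms gain `_bucket` while counters gain `_total`. This is a sketch of the pattern only, not the Collector's full translation rules, and the counter instrument name used in the usage note is an assumption.

```python
# Unit abbreviations seen in the table above, mapped to their Prometheus suffixes.
UNIT_SUFFIXES = {"ms": "milliseconds", "s": "seconds"}


def prometheus_name(instrument, unit="", kind="histogram"):
    """Approximate the Prometheus series name for an OTel instrument."""
    name = instrument.replace(".", "_")
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith("_" + suffix):
        name += "_" + suffix
    if kind == "histogram":
        return name + "_bucket"
    if kind == "counter":
        return name if name.endswith("_total") else name + "_total"
    return name
```

For example, `prometheus_name("llm.call.duration", "ms")` gives the first row of the table, and a hypothetical counter named `llm.tokens.input` would map to `llm_tokens_input_total`.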
Verification script (requires the observability compose and some traffic):
```bash
chmod +x scripts/verify_observability_metrics.sh
./scripts/verify_observability_metrics.sh
```
## Acceptance checklist (local E2E)
- [ ] `OTEL_ENABLED=true`; start compose + API + Celery worker
- [ ] Run one WS conversation; Tempo shows `conversation.ws.process_turn` and `llm.chat_invoke`
- [ ] Trigger memoir phase1; Tempo shows `memoir.phase1.*` and `memoir.story_pipeline.*`
- [ ] Prometheus: the `call_type` label exists; after a real LLM call, `llm_tokens_input_total` > 0
- [ ] Loki: `|= "tid=<first 12 chars of trace>"` finds the logs for the same request
- [ ] `./scripts/verify_observability_metrics.sh` passes
- [ ] The Grafana Alerting page shows no provisioning errors (notification channels may be empty)
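The Prometheus item in the checklist can also be checked offline against a dump of exposition text. The sketch below is a deliberately simplified parser (it does not handle `}` or escaped quotes inside label values) and is not what `verify_observability_metrics.sh` does; where the text comes from, for example the Collector's Prometheus exporter, is an assumption about the setup.

```python
import re

# Simplified matcher for one exposition line: name{labels} value
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?"
    r"\s+(?P<value>\S+)"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_samples(text, metric):
    """Yield (labels, value) for every sample line of the given metric."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group("name") == metric:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            yield labels, float(match.group("value"))


def llm_tokens_recorded(text):
    """True when llm_tokens_input_total has a call_type label and a value > 0."""
    return any(
        "call_type" in labels and value > 0
        for labels, value in iter_samples(text, "llm_tokens_input_total")
    )
```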
## Configuration directories
- [`deploy/observability/`](../deploy/observability/): Collector, Tempo, Loki, Prometheus, and Grafana provisioning
- [`docker-compose.observability.yml`](../docker-compose.observability.yml): local overlay


@@ -305,11 +305,13 @@ sudo journalctl -u life-echo-api -f
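### 8. Monitoring and alerting
For the local development and staging observability stack (OpenTelemetry + Grafana LGTM), see the **[Observability guide](observability.md)**. Full staging/production rollout is phase two (`docker-compose` profile).
#### 8.1 Configure log monitoring
The following tools are recommended:
- **Grafana + Loki + Tempo + Prometheus** (in-repo `deploy/observability/`, recommended)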
- ELK Stack (Elasticsearch + Logstash + Kibana)
- Grafana + Loki
- Your cloud provider's log service
#### 8.2 Configure performance monitoring