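Observability (OpenTelemetry + Grafana LGTM)

Local development uses OpenTelemetry to collect traces, metrics, and logs. The OTel Collector writes them to Tempo, Prometheus, and Loki, and everything is viewed in one place in Grafana.

Configuration lives in .env (synced from .env.development by development.sh, or copied from .env.example). app.core.config.settings reads it automatically at startup, so there is no need to export OTEL_* in the shell.

Starting the stack

From the api/ directory: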

# 1. Database and Redis
docker compose -f docker-compose.dev.yml up -d

# 2. Observability (the life-echo-dev network must already exist; ports come from .env or the defaults below)
docker compose -f docker-compose.dev.yml -f docker-compose.observability.yml up -d

| Service | Default host address | Compose variable |
| --- | --- | --- |
| Grafana | http://127.0.0.1:48300 (admin / admin) | GRAFANA_HOST_PORT |
| Prometheus | http://127.0.0.1:49090 | PROMETHEUS_HOST_PORT |
| OTLP gRPC | http://127.0.0.1:48317 | OTEL_GRPC_HOST_PORT |
| OTLP HTTP | http://127.0.0.1:48318 | OTEL_HTTP_HOST_PORT |
| Collector health | http://127.0.0.1:48333 | OTEL_COLLECTOR_HEALTH_HOST_PORT |

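Containers still use the standard ports internally (for example, 4317 for the Collector). Only the host mappings use the 48xxx range, in the same style as Postgres 48291 and Redis 48307.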

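Provisioned dashboards (Life Echo folder):

| Dashboard | Purpose |
| --- | --- |
| Life Echo Overview | API RED, LLM summary, dependency latency |
| Life Echo LLM | call_type / agent / tokens, outcome distribution |
| Life Echo Business | Memoir phases, WS/ASR/TTS, Celery business spans |
| Life Echo Logs | Loki search by event / trace_id |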

Enabling export from the application

.env.example already provides local defaults; sync them into .env, for example:

OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:48317
OTEL_TRACES_SAMPLER=always_on
OTEL_SERVICE_NAME=life-echo-api

Recommended: start it together with the full stack (when OTEL_ENABLED=true in .env, ./development.sh also brings up the observability compose and opens a Grafana browser tab by default):

cd api
./development.sh

To start only the API manually (Grafana is not opened automatically):

cd api
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

The Celery worker uses the same .env. When OTEL_SERVICE_NAME is not set, the worker defaults to life-echo-celery-worker.

If the API runs inside Docker compose, set OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 (service name + in-container port) rather than localhost.
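The endpoint choice can be sketched as a small helper. This is only an illustration: the two URLs come from this document, while the function name and the RUNNING_IN_COMPOSE flag are hypothetical and are not part of the project.

```python
import os

# URLs taken from this document; everything else is a hypothetical illustration.
HOST_ENDPOINT = "http://localhost:48317"         # API on the host: mapped host port
COMPOSE_ENDPOINT = "http://otel-collector:4317"  # API in compose: service name + container port


def pick_otlp_endpoint(running_in_compose: bool) -> str:
    """Return the OTLP gRPC endpoint that matches where the API process runs."""
    return COMPOSE_ENDPOINT if running_in_compose else HOST_ENDPOINT


if __name__ == "__main__":
    # RUNNING_IN_COMPOSE is an assumed flag, used here only to drive the example.
    in_compose = os.environ.get("RUNNING_IN_COMPOSE", "false").lower() == "true"
    print(f"OTEL_EXPORTER_OTLP_ENDPOINT={pick_otlp_endpoint(in_compose)}")
```

In practice the value is simply written into .env; the helper only makes the rule explicit: localhost with the 48xxx port on the host, service name with the standard port inside the compose network.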

When observability is not needed: set OTEL_ENABLED=false in .env (or do not start the observability compose).

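What is collected

| Type | Source |
| --- | --- |
| HTTP | FastAPI auto-instrumentation (/health excluded) |
| DB | SQLAlchemy |
| Redis | redis-py |
| Outbound HTTP | httpx (DeepSeek, etc.) |
| Celery | Task spans + W3C trace propagation |
| LLM | llm_telemetry (LangChain / DeepSeek / llm_call) + llm.call.* / llm.tokens.* metrics |
| Business | business_telemetry (WS turns, memoir phases, ASR/TTS, payments, and other child spans) |
| Logs | loguru patcher injects trace_id; Promtail parses event / tid=; optional LOG_JSON_FILE JSON sink |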

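Log fields: request_id, trace_id, span_id. For HTTP requests the middleware contextualizes them. For Celery and background work, the loguru patcher merges them from the current OTel span, so they do not need to pass through the HTTP middleware.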

Common troubleshooting

  1. Slow API: Grafana → Tempo, search traces with service.name=life-echo-api, then inspect the DB / httpx / llm.* / conversation.ws.* child spans.
  2. Slow LLM: the Life Echo LLM dashboard, or Loki: {compose_service=~".+"} |= "event=llm_json_call"
  3. Memoir stuck in a phase: search Tempo for memoir.phase1 / memoir.phase2 / memoir.story_pipeline.*; on the Life Echo Business dashboard, check business_operation_duration_milliseconds.
  4. Logs ↔ trace: copy the trace_id in Tempo → Loki: {compose_service=~".+"} |= "tid=<first 12 characters>" (the console short format). Promtail writes trace_id into structured metadata (not a high-cardinality label).
  5. Celery backlog: filter Tempo by life-echo-celery-worker; in Loki, search for event=celery_task_failed.
  6. No data: check that OTEL_ENABLED=true in .env, that the port in OTEL_EXPORTER_OTLP_ENDPOINT matches OTEL_GRPC_HOST_PORT, that Collector health at http://127.0.0.1:48333 responds, and that the Prometheus target otel-collector:8889 is UP.
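The configuration part of the "no data" step can be sketched as a pure check over the .env values. This is a minimal sketch: the function name is hypothetical, and it covers only the two settings named above, not the Collector health or Prometheus target checks. The port comparison is applied only to host-side endpoints, on the assumption that an API running inside compose uses the in-container port instead.

```python
from urllib.parse import urlparse


def check_otel_env(env: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the basic settings look consistent."""
    problems = []
    if env.get("OTEL_ENABLED", "").lower() != "true":
        problems.append("OTEL_ENABLED is not true")

    endpoint = urlparse(env.get("OTEL_EXPORTER_OTLP_ENDPOINT", ""))
    # 48317 is the documented default for OTEL_GRPC_HOST_PORT.
    expected_port = int(env.get("OTEL_GRPC_HOST_PORT", "48317"))
    if endpoint.port is None:
        problems.append("OTEL_EXPORTER_OTLP_ENDPOINT has no port")
    elif endpoint.hostname in ("localhost", "127.0.0.1") and endpoint.port != expected_port:
        problems.append(
            f"endpoint port {endpoint.port} does not match OTEL_GRPC_HOST_PORT {expected_port}"
        )
    return problems


if __name__ == "__main__":
    sample = {"OTEL_ENABLED": "true", "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317"}
    for problem in check_otel_env(sample):
        print("problem:", problem)
```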

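LOG_JSON_FILE and Promtail

  • Default: loguru human-readable lines → Docker stdout → Promtail regex extracts tid / event / duration_ms; trace_id goes into structured metadata and is not used as a Loki label.
  • Optional: LOG_JSON_FILE=/path/to/app.jsonl enables a JSON sink (serialize=true), which makes it easier to align with OTLP logs or a self-built collection pipeline. It can coexist with Promtail (the same container's stdout still goes through the regex).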
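To make the default path concrete, here is a rough sketch of pulling tid / event / duration_ms out of a human-readable line. Neither the real loguru format nor the actual Promtail regex appears in this document, so the key=value layout, the sample line, and both function names are assumptions for illustration only.

```python
import re

# Assumed layout: space-separated key=value pairs somewhere in the line.
FIELD_RE = re.compile(r"\b(tid|event|duration_ms)=(\S+)")


def short_tid(trace_id: int) -> str:
    """First 12 hex characters of a 128-bit trace id rendered as 32 lowercase hex digits."""
    return format(trace_id, "032x")[:12]


def extract_fields(line: str) -> dict[str, str]:
    """Collect the tid / event / duration_ms pairs found in one log line."""
    return dict(FIELD_RE.findall(line))


if __name__ == "__main__":
    tid = short_tid(0x0AF7651916CD43DD8448EB211C80319C)
    line = f"INFO app.llm tid={tid} event=llm_json_call duration_ms=842 call finished"
    print(extract_fields(line))
```

The short tid is what the Loki filter `|= "tid=<first 12 characters>"` matches against, which is why the console format carries only a prefix of the full trace_id.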

Sampling (staging/prod, phase two)

| Environment | Recommendation |
| --- | --- |
| development | OTEL_TRACES_SAMPLER=always_on |
| staging/production | OTEL_TRACES_SAMPLER=parentbased_traceidratio, OTEL_TRACES_SAMPLER_ARG=0.1 |
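As a rough mental model of what parentbased_traceidratio with an argument of 0.1 does, the sketch below follows the parent's decision when a parent span exists and otherwise keeps a root trace when the low 64 bits of its trace id fall under a ratio-derived bound. It is a simplified illustration of the idea, not the SDK's code, and the function names are hypothetical.

```python
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits of the trace id are compared


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace id always yields the same answer."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def parent_based_sampled(trace_id: int, ratio: float, parent_sampled: Optional[bool]) -> bool:
    """Follow the parent's decision if there is a parent; otherwise apply the ratio to the root."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    ids = [rng.getrandbits(128) for _ in range(10_000)]
    kept = sum(parent_based_sampled(i, 0.1, None) for i in ids)
    print(f"kept {kept} of {len(ids)} root traces")
```

Because the decision depends only on the trace id and the parent flag, every service in a request either records the whole trace or none of it, which keeps sampled traces complete across the API and the Celery worker.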

To turn telemetry off: set OTEL_ENABLED=false; there is then no exporter overhead.

Prometheus metric names (OTel → Prometheus)

| OTel instrument | Prometheus series (histogram) |
| --- | --- |
| llm.call.duration (ms) | llm_call_duration_milliseconds_bucket |
| business.operation.duration (ms) | business_operation_duration_milliseconds_bucket |
| http.server.request.duration (s) | http_server_request_duration_seconds_bucket |
| db.client.operation.duration (s) | db_client_operation_duration_seconds_bucket |
| http.client.request.duration (s) | http_client_request_duration_seconds_bucket |

Counter examples: llm_call_total, llm_tokens_input_total
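The naming pattern in the table can be sketched as: dots become underscores, the unit is spelled out as a suffix, histograms gain _bucket, and counters gain _total. The helper below is a hypothetical illustration that reproduces the names listed here. It is not the Collector's exporter logic, and the OTel counter name llm.tokens.input used in the example is inferred from llm_tokens_input_total rather than stated in this document.

```python
# Only the two units that appear in the table are handled.
UNIT_SUFFIX = {"ms": "milliseconds", "s": "seconds"}


def prometheus_series(name: str, unit: str = "", kind: str = "histogram") -> str:
    """Map an OTel instrument name to the Prometheus series name used in queries."""
    base = name.replace(".", "_")
    suffix = UNIT_SUFFIX.get(unit)
    if suffix and not base.endswith("_" + suffix):
        base = f"{base}_{suffix}"
    if kind == "counter":
        return base if base.endswith("_total") else f"{base}_total"
    if kind == "histogram":
        return f"{base}_bucket"
    return base


if __name__ == "__main__":
    print(prometheus_series("llm.call.duration", "ms"))
    print(prometheus_series("http.server.request.duration", "s"))
    print(prometheus_series("llm.tokens.input", kind="counter"))  # assumed OTel name
```

Histograms also expose matching _sum and _count series alongside _bucket, which is useful for computing averages in a dashboard panel.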

Verification script (requires the observability compose to be running and some traffic to have been generated):

chmod +x scripts/verify_observability_metrics.sh
./scripts/verify_observability_metrics.sh

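Acceptance checklist (local E2E)

  • OTEL_ENABLED=true; start compose + API + Celery worker
  • Run one WS conversation: Tempo shows conversation.ws.process_turn and llm.chat_invoke
  • Trigger memoir phase1: Tempo shows memoir.phase1.* and memoir.story_pipeline.*
  • Prometheus: the call_type label exists; after a real LLM call, llm_tokens_input_total > 0
  • Loki: |= "tid=<first 12 characters of the trace_id>" finds the logs for the same request
  • ./scripts/verify_observability_metrics.sh passes
  • The Grafana Alerting page shows no provisioning errors (notification channels may be empty)

Configuration directories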