# Observability (OpenTelemetry + Grafana LGTM)

Local development uses **OpenTelemetry** to collect traces, metrics, and logs. The **OTel Collector** writes them to **Tempo / Prometheus / Loki**, and **Grafana** provides a single place to view them.

Configuration lives in **`.env`** (synced from `.env.development` by `development.sh`, or copied from [`.env.example`](../.env.example)). `app.core.config.settings` reads it automatically at startup, so there is **no need** to `export OTEL_*` in your shell.

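
To make the "read from `.env`, no shell exports" flow concrete, here is a minimal standard-library sketch. It is not the project's `app.core.config.settings` (whose implementation is not shown in this document); `parse_env_text`, `OtelSettings`, and `load_otel_settings` are hypothetical names, and the fallback values are assumptions borrowed from the example `.env` values in this document.

```python
from dataclasses import dataclass
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


@dataclass(frozen=True)
class OtelSettings:
    enabled: bool
    endpoint: str
    service_name: str


def load_otel_settings(env: Dict[str, str]) -> OtelSettings:
    # Fallbacks are assumptions for illustration; the real defaults
    # live in the project's own settings module.
    return OtelSettings(
        enabled=env.get("OTEL_ENABLED", "false").lower() in {"1", "true", "yes"},
        endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:48317"),
        service_name=env.get("OTEL_SERVICE_NAME", "life-echo-api"),
    )
```

Calling `load_otel_settings(parse_env_text(open(".env").read()))` at startup would give the application typed values without anything being exported in the shell.
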

## Starting the stack

From the `api/` directory:

```bash
# 1. Database and Redis
docker compose -f docker-compose.dev.yml up -d

# 2. Observability (requires the life-echo-dev network to exist; ports come from .env or the defaults below)
docker compose -f docker-compose.dev.yml -f docker-compose.observability.yml up -d
```


| Service | Default host address | Compose variable |
|---------|----------------------|------------------|
| Grafana | http://127.0.0.1:48300 (admin / admin) | `GRAFANA_HOST_PORT` |
| Prometheus | http://127.0.0.1:49090 | `PROMETHEUS_HOST_PORT` |
| OTLP gRPC | http://127.0.0.1:48317 | `OTEL_GRPC_HOST_PORT` |
| OTLP HTTP | http://127.0.0.1:48318 | `OTEL_HTTP_HOST_PORT` |
| Collector health | http://127.0.0.1:48333 | `OTEL_COLLECTOR_HEALTH_HOST_PORT` |

**Inside** the containers the standard ports are still used (for example, Collector `4317`). Only the host mappings use the `48xxx` range, in the same style as Postgres `48291` and Redis `48307`.

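
The "default port unless the compose variable overrides it" rule can be sketched as follows. This is only an illustration of the lookup, assuming the variables are available as a plain mapping; `DEFAULT_HOST_PORTS` and `host_urls` are hypothetical names, and the port numbers are copied from the host address table in this section.

```python
from typing import Dict, Mapping, Tuple

# Compose variable -> (service label, default host port), copied from the table.
DEFAULT_HOST_PORTS: Dict[str, Tuple[str, int]] = {
    "GRAFANA_HOST_PORT": ("Grafana", 48300),
    "PROMETHEUS_HOST_PORT": ("Prometheus", 49090),
    "OTEL_GRPC_HOST_PORT": ("OTLP gRPC", 48317),
    "OTEL_HTTP_HOST_PORT": ("OTLP HTTP", 48318),
    "OTEL_COLLECTOR_HEALTH_HOST_PORT": ("Collector health", 48333),
}


def host_urls(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the host-side URL for each service, honoring overrides in env."""
    urls: Dict[str, str] = {}
    for variable, (service, default_port) in DEFAULT_HOST_PORTS.items():
        port = int(env.get(variable, default_port))
        urls[service] = f"http://127.0.0.1:{port}"
    return urls
```

Only the host side changes when a variable is overridden; the in-container ports stay at their standard values.
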

Pre-provisioned dashboards (in the **Life Echo** folder):

| Dashboard | Purpose |
|-----------|---------|
| Life Echo Overview | API RED metrics, LLM summary, dependency latency |
| Life Echo LLM | `call_type` / agent / tokens, outcome distribution |
| Life Echo Business | Memoir phases, WS/ASR/TTS, Celery business spans |
| Life Echo Logs | Loki search by `event` / `trace_id` |


## Enabling export from the application

[`.env.example`](../.env.example) already provides local defaults; sync them into `.env`, for example:

```env
OTEL_ENABLED=true
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:48317
OTEL_TRACES_SAMPLER=always_on
OTEL_SERVICE_NAME=life-echo-api
```


The recommended approach is to start observability together with the full stack. When `.env` has `OTEL_ENABLED=true`, `./development.sh` also brings up the observability compose file and, by default, opens Grafana in a browser tab:

```bash
cd api
./development.sh
```

To start only the API manually (Grafana is not opened automatically):

```bash
cd api
uv run uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```


The Celery worker uses the same `.env`. When `OTEL_SERVICE_NAME` is not set, the worker defaults to `life-echo-celery-worker`.

If the API runs inside **Docker Compose**, set `OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317` (service name plus in-container port) instead of `localhost`.

When observability is not needed, set `OTEL_ENABLED=false` in `.env` (or simply do not start the observability compose file).

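
The host-versus-compose endpoint choice is easy to get wrong, so here is a small sketch of the rule. It assumes the caller already knows whether the process runs inside the compose network; `otlp_endpoint` is a hypothetical helper, not something the project is known to ship.

```python
from typing import Mapping, Optional

COLLECTOR_SERVICE = "otel-collector"  # compose service name used in this document
COLLECTOR_GRPC_PORT = 4317            # standard in-container OTLP gRPC port
DEFAULT_GRPC_HOST_PORT = 48317        # default host mapping (OTEL_GRPC_HOST_PORT)


def otlp_endpoint(in_compose: bool, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the OTLP gRPC endpoint for a process on the host or inside compose."""
    env = env or {}
    if in_compose:
        # Inside the compose network: service name + in-container port.
        return f"http://{COLLECTOR_SERVICE}:{COLLECTOR_GRPC_PORT}"
    # On the host: localhost + the mapped host port.
    host_port = int(env.get("OTEL_GRPC_HOST_PORT", DEFAULT_GRPC_HOST_PORT))
    return f"http://localhost:{host_port}"
```

A mismatch here (for example `localhost:4317` from the host, or `localhost:48317` from inside a container) is a common reason for an empty Tempo.
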

## What is collected

| Type | Source |
|------|--------|
| HTTP | FastAPI auto-instrumentation (`/health` excluded) |
| DB | SQLAlchemy |
| Redis | redis-py |
| Outbound HTTP | httpx (DeepSeek and others) |
| Celery | Task spans + W3C trace context propagation |
| LLM | `llm_telemetry` (LangChain / DeepSeek / `llm_call`) + `llm.call.*` / `llm.tokens.*` metrics |
| Business | `business_telemetry`: child spans for WS turns, memoir phases, ASR/TTS, payments, and so on |
| Logs | loguru patcher injects `trace_id`; Promtail parses `event` / `tid=`; optional `LOG_JSON_FILE` JSON sink |

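
For readers unfamiliar with W3C trace propagation (the mechanism named in the Celery row), the sketch below shows the `traceparent` header format that carries a trace from the API into a task. It is a simplified illustration of the header only, handling version `00`; how the project's Celery instrumentation actually wires it is not shown in this document, and `SpanContext`, `inject`, and `extract` are hypothetical names.

```python
import re
from typing import Mapping, MutableMapping, NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def inject(ctx: SpanContext, headers: MutableMapping[str, str]) -> None:
    """Producer side: write the current context into message headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Mapping[str, str]) -> Optional[SpanContext]:
    """Consumer side: recover the parent context, or None if absent/invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # values the specification treats as invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))
```

Because the worker continues the same `trace_id`, a task span shows up under the originating HTTP request in Tempo instead of as a separate trace.
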

Log fields: `request_id`, `trace_id`, `span_id`. For HTTP requests the middleware sets them via `contextualize`. For **Celery and background tasks** the loguru **patcher** merges them in from the current OTel span, so no HTTP middleware is involved.

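
A patcher of this kind might look roughly like the sketch below. It is written against plain dictionaries so it runs without loguru or the OTel SDK installed; `make_patcher` and `current_ids` are hypothetical, and the choice not to overwrite values already set by the HTTP middleware is an assumption about what "merge" means here.

```python
from typing import Callable, Optional, Tuple


def format_ids(trace_id: int, span_id: int) -> Tuple[str, str]:
    """Render OTel integer IDs as 32- and 16-character lowercase hex strings."""
    return format(trace_id, "032x"), format(span_id, "016x")


def make_patcher(current_ids: Callable[[], Optional[Tuple[int, int]]]):
    """Build a loguru-style patcher: a function that mutates the log record.

    current_ids is a hypothetical callable returning (trace_id, span_id) for
    the active span, or None when there is no valid span.
    """

    def patcher(record: dict) -> None:
        ids = current_ids()
        if ids is None:
            return
        trace_id, span_id = format_ids(*ids)
        extra = record["extra"]
        # Assumption: keep values already set by the HTTP middleware.
        extra.setdefault("trace_id", trace_id)
        extra.setdefault("span_id", span_id)

    return patcher
```

With loguru installed, such a function would typically be registered through `logger.configure(patcher=...)`, with `current_ids` reading the span context from `opentelemetry.trace.get_current_span()`.
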

## Common troubleshooting

1. **Slow API**: Grafana → Tempo, search traces by `service.name=life-echo-api`; inspect the DB / httpx / `llm.*` / `conversation.ws.*` child spans.
2. **Slow LLM**: use the **Life Echo LLM** dashboard, or query Loki: `{compose_service=~".+"} |= "event=llm_json_call"`.
3. **Memoir stuck in a phase**: search Tempo for `memoir.phase1` / `memoir.phase2` / `memoir.story_pipeline.*`; check `business_operation_duration_milliseconds` on the **Life Echo Business** dashboard.
4. **Logs ↔ traces**: copy the `trace_id` from Tempo, then query Loki: `{compose_service=~".+"} |= "tid=<first 12 characters>"` (the short console format). Promtail writes `trace_id` into **structured metadata** (not a high-cardinality label).
5. **Celery backlog**: filter Tempo by `life-echo-celery-worker`; in Loki, search for `event=celery_task_failed`.
6. **No data**: confirm that `.env` has `OTEL_ENABLED=true` and that the port in `OTEL_EXPORTER_OTLP_ENDPOINT` matches `OTEL_GRPC_HOST_PORT`; check Collector health at `http://127.0.0.1:48333`; confirm the Prometheus target `otel-collector:8889` is UP.

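
The logs-to-trace lookup in step 4 amounts to truncating the trace ID and embedding it in a LogQL line filter. A minimal sketch, with `short_tid` and `logql_for_trace` as hypothetical helper names:

```python
def short_tid(trace_id: str, length: int = 12) -> str:
    """Return the short form of a trace ID as used in console log lines."""
    return trace_id.strip().lower()[:length]


def logql_for_trace(trace_id: str, selector: str = '{compose_service=~".+"}') -> str:
    """Build the Loki query that finds log lines for one trace."""
    return f'{selector} |= "tid={short_tid(trace_id)}"'
```

For example, passing the full ID copied from Tempo yields a query that can be pasted straight into Grafana Explore with the Loki data source selected.
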

### LOG_JSON_FILE and Promtail

- **Default**: loguru human-readable lines → Docker stdout → Promtail extracts `tid` / `event` / `duration_ms` with a **regex**; `trace_id` goes into structured metadata and is **not used as a Loki label**.
- **Optional**: `LOG_JSON_FILE=/path/to/app.jsonl` enables a JSON sink (`serialize=true`), which makes it easier to align with OTLP logs or a custom collector. It can **coexist** with Promtail (the same container's stdout still goes through the regex).

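
To illustrate the regex extraction on the default path, the sketch below pulls the three fields out of a console line. The exact line layout and the real Promtail pipeline (which uses Go regular expressions) are not shown in this document, so the sample format with space- or pipe-separated `key=value` tokens is an assumption, and `extract_fields` is a hypothetical name.

```python
import re
from typing import Dict

# Assumed token shape: key=value, terminated by whitespace or a "|" separator.
_FIELD = re.compile(r"\b(tid|event|duration_ms)=([^\s|]+)")


def extract_fields(line: str) -> Dict[str, str]:
    """Return whichever of tid / event / duration_ms appear in a log line."""
    return {key: value for key, value in _FIELD.findall(line)}
```

Lines that carry none of the fields simply produce an empty mapping, which mirrors how a log pipeline would pass them through without extracted metadata.
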

## Sampling (staging/prod, phase two)

| Environment | Recommendation |
|-------------|----------------|
| development | `OTEL_TRACES_SAMPLER=always_on` |
| staging/production | `OTEL_TRACES_SAMPLER=parentbased_traceidratio`, `OTEL_TRACES_SAMPLER_ARG=0.1` |

To turn telemetry off, set `OTEL_ENABLED=false`; no exporter overhead is incurred.

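
What `parentbased_traceidratio` with an argument of `0.1` means can be sketched as follows: a span with a parent follows the parent's decision, and a root span is kept when the low 64 bits of its trace ID fall below a threshold proportional to the ratio. This is a simplified model for intuition, loosely following how ratio-based samplers are commonly implemented; it is not the SDK's code, and `ratio_bound` and `should_sample` are hypothetical names.

```python
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits of the trace ID are used


def ratio_bound(ratio: float) -> int:
    """Threshold below which a trace ID's low 64 bits count as sampled."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    return round(ratio * (TRACE_ID_LIMIT + 1))


def should_sample(trace_id: int, ratio: float, parent_sampled: Optional[bool] = None) -> bool:
    """Parent-based decision: inherit from the parent, else apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return (trace_id & TRACE_ID_LIMIT) < ratio_bound(ratio)
```

Because the decision depends only on the trace ID and the parent flag, every service in a request path reaches the same verdict, so traces are kept or dropped whole rather than in fragments.
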

## Prometheus metric names (OTel → Prometheus)

| OTel instrument | Prometheus series (histogram) |
|-----------------|-------------------------------|
| `llm.call.duration` (ms) | `llm_call_duration_milliseconds_bucket` |
| `business.operation.duration` (ms) | `business_operation_duration_milliseconds_bucket` |
| `http.server.request.duration` (s) | `http_server_request_duration_seconds_bucket` |
| `db.client.operation.duration` (s) | `db_client_operation_duration_seconds_bucket` |
| `http.client.request.duration` (s) | `http_client_request_duration_seconds_bucket` |

Counter examples: `llm_call_total`, `llm_tokens_input_total`.

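
The naming pattern in the table (dots become underscores, the unit is spelled out, and a type suffix is appended) can be captured in a few lines. This sketch only reproduces the pattern visible in the table; the Collector's real translation handles more units and edge cases. `prometheus_name` is a hypothetical helper, and the counter instrument name `llm.tokens.input` used in the usage note is an assumption inferred from the series name.

```python
UNIT_SUFFIXES = {"ms": "milliseconds", "s": "seconds"}


def prometheus_name(instrument: str, unit: str = "", kind: str = "histogram") -> str:
    """Map an OTel instrument name to the Prometheus series name to query."""
    name = instrument.replace(".", "_")
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not name.endswith("_" + suffix):
        name = f"{name}_{suffix}"
    if kind == "counter":
        return name if name.endswith("_total") else f"{name}_total"
    if kind == "histogram":
        # Histograms also expose _sum and _count series alongside _bucket.
        return f"{name}_bucket"
    return name
```

For instance, `prometheus_name("llm.tokens.input", kind="counter")` gives `llm_tokens_input_total`, matching the counter example in this section.
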

Verification script (requires the observability compose stack to be running and some traffic):

```bash
chmod +x scripts/verify_observability_metrics.sh
./scripts/verify_observability_metrics.sh
```

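
The contents of `verify_observability_metrics.sh` are not shown in this document. As a rough idea of what such a check involves, the sketch below asks the Prometheus HTTP API whether a series has any samples. The function names are hypothetical, and the base URL assumes the default host port from this document.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def has_samples(payload: dict) -> bool:
    """True when a query response succeeded and returned at least one series."""
    return payload.get("status") == "success" and bool(payload.get("data", {}).get("result"))


def metric_present(base_url: str, promql: str, timeout: float = 5.0) -> bool:
    """Query a running Prometheus; raises on connection errors."""
    with urlopen(query_url(base_url, promql), timeout=timeout) as response:
        return has_samples(json.load(response))
```

With the stack running and some traffic generated, `metric_present("http://127.0.0.1:49090", "llm_call_total")` would indicate whether LLM metrics have reached Prometheus.
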

## Acceptance checklist (local E2E)

- [ ] `OTEL_ENABLED=true`; start compose + API + Celery worker
- [ ] Run one WS conversation; Tempo shows `conversation.ws.process_turn` and `llm.chat_invoke`
- [ ] Trigger memoir phase1; Tempo shows `memoir.phase1.*` and `memoir.story_pipeline.*`
- [ ] Prometheus: the `call_type` label exists; after a real LLM call, `llm_tokens_input_total` > 0
- [ ] Loki: `|= "tid=<first 12 characters of the trace ID>"` finds the logs for the same request
- [ ] `./scripts/verify_observability_metrics.sh` passes
- [ ] The Grafana Alerting page shows no provisioning errors (notification channels may be empty)


## Configuration directories

- [`deploy/observability/`](../deploy/observability/): Collector, Tempo, Loki, Prometheus, and Grafana provisioning
- [`docker-compose.observability.yml`](../docker-compose.observability.yml): local overlay