feat(eval): memoir A/B chapter judging and eval-web parity with dialogue

- Judge baseline excerpt and library chapter separately; build_memoir_compare_summary for gate, nine-dim and leaf deltas.

- Memoir SSE chapter payload: baseline_judge, compare_summary, baseline_judge_error.

- MemoirJudgeOutput: loose score coercion and post-validate clamp; memoir judge prompt caps from settings.

- app-eval-web: two-column MemoirScoreCard layout, MemoirCompareSummary, chapter blocks and CSS.

- Add memoir_compare_summary, log_events, celery_log_context, memoir_pipeline_progress; tests and migration 0014.

- Misc: memory/evidence and enrichment paths, task/orchestrator updates, internal-eval docs, env examples.
Kevin
2026-04-10 10:23:43 +08:00
parent b0251e5b26
commit ac49bc7f23
59 changed files with 4773 additions and 696 deletions


@@ -1,28 +1,45 @@
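 """
 Agent / LLM diagnostic logging: latency, input/output size, truncated previews.
-- **Details** (full prompt preview): emitted via ``logger.debug`` only when ``LOG_LEVEL`` is ``TRACE`` / ``DEBUG``.
+- **Details** (prompt preview / hash / response preview): emitted via ``logger.debug`` only when ``LOG_LEVEL`` is ``TRACE`` / ``DEBUG``.
 - **Summary** (single line: elapsed time, character counts, operation name): emitted via ``logger.info`` when ``LOG_AGENT_VERBOSE=1``,
 so that Agent performance and code paths can be investigated in production without raising the global log level to DEBUG.
-Sensitive content: DEBUG logs user-related text; ``AGENT_LOG_MAX_CHARS=0`` logs the full text, so do not leave DEBUG enabled long-term in production.
+For production/staging, ``LOG_LEVEL=INFO`` is recommended; set ``LOG_AGENT_VERBOSE=1`` when Agent latency and payload size are needed, with no need for long-term DEBUG.
+Sensitive content: DEBUG logs user-related text; with ``AGENT_LOG_MAX_CHARS=0`` previews are not truncated (full output, use with caution).
 Configuration (excerpt): ``AGENT_LOG_OMIT_SYSTEM_MESSAGE_BODY`` (default true) omits the chat System message body and logs only len + sha12;
 ``AGENT_LOG_JSON_PROMPT_PREFIX_CHARS`` + ``AGENT_LOG_JSON_PROMPT_PREFIX_ONLY_IF_LEN_GT``: under DEBUG, skip
-the prefix of an overlong single-segment prompt before previewing.
+the prefix of an overlong single-segment ``*.prompt`` before previewing;
+with ``AGENT_LOG_PROMPT_MODE=hash_only``, ``*.prompt`` logs only sha12 + length, with no body;
+with ``AGENT_LOG_PROMPT_DEDUP=1``, a line is skipped when the same label repeats identical full text consecutively.
 """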
 from __future__ import annotations
+import hashlib
+import threading
 import time
 from contextlib import contextmanager
 from typing import Any, Iterator
 from app.core.config import settings
+_dedup_lock = threading.Lock()
+_last_prompt_sha256_by_label: dict[str, str] = {}
+def _payload_sha256_hex(text: str) -> str:
+    return hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest()
+def _payload_sha12(text: str) -> str:
+    return _payload_sha256_hex(text)[:12]
 def agent_verbose_enabled() -> bool:
-    """Whether to emit DEBUG-level details, including the full prompt preview."""
+    """Whether to emit DEBUG-level details such as prompt/response previews."""
     raw = (settings.log_level or "INFO").strip().upper()
     return raw in ("TRACE", "DEBUG")
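Review note: the `errors="replace"` argument in `_payload_sha256_hex` matters because a strict UTF-8 encode raises on lone surrogates, which can turn up in text decoded from malformed input. The following is a minimal standalone sketch of the same two helpers (renamed without the leading underscore for illustration; not the project's module) showing that hashing never raises and that `sha12` is simply the first 12 hex characters of the SHA-256 digest.

```python
import hashlib


def payload_sha256_hex(text: str) -> str:
    # errors="replace" substitutes "?" for unencodable code points
    # (for example lone surrogates) instead of raising UnicodeEncodeError.
    return hashlib.sha256(text.encode("utf-8", errors="replace")).hexdigest()


def payload_sha12(text: str) -> str:
    # Short, log-friendly fingerprint: first 12 hex chars of the full digest.
    return payload_sha256_hex(text)[:12]


if __name__ == "__main__":
    print(payload_sha12("hello"))
    # A lone surrogate would fail under strict encoding; here it hashes as "?".
    print(payload_sha12("\ud800") == payload_sha12("?"))
```

One consequence worth keeping in mind: two texts that differ only in unencodable code points hash to the same value, which is acceptable for log correlation but not for integrity checks.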
@@ -97,26 +114,55 @@ def log_agent_payload(
     *,
     max_chars: int | None = None,
 ) -> None:
-    """Log text length and a truncated preview under DEBUG."""
+    """Log text length and a truncated preview under DEBUG (``*.prompt`` supports hash_only / dedup)."""
     if not agent_verbose_enabled():
         return
     raw = text or ""
     total_len = len(raw)
+    digest = _payload_sha256_hex(raw)
+    sha12 = digest[:12]
+    is_prompt = label.endswith(".prompt")
+    if is_prompt and settings.agent_log_prompt_dedup:
+        with _dedup_lock:
+            if _last_prompt_sha256_by_label.get(label) == digest:
+                logger.debug(
+                    "agent_payload_skipped label={} reason=same_as_previous sha12={} total_len={}",
+                    label,
+                    sha12,
+                    total_len,
+                )
+                return
+            _last_prompt_sha256_by_label[label] = digest
     preview_source = raw
     extra_note = ""
     if (
-        label.endswith(".prompt")
+        is_prompt
         and settings.agent_log_json_prompt_prefix_chars > 0
         and total_len > settings.agent_log_json_prompt_prefix_only_if_len_gt
     ):
         skip = settings.agent_log_json_prompt_prefix_chars
         preview_source = raw[skip:]
         extra_note = f" skipped_prefix_chars={skip}"
+    mode = (settings.agent_log_prompt_mode or "preview").strip().lower()
+    if is_prompt and mode == "hash_only":
+        logger.debug(
+            "agent_payload label={} total_len={} sha12={} mode=hash_only{}",
+            label,
+            total_len,
+            sha12,
+            extra_note,
+        )
+        return
     preview = truncate_for_log(preview_source, max_chars=max_chars)
     logger.debug(
-        "agent_payload {} total_len={}{} sha12={} preview={}".replace(" sha12={}", ""),
+        "agent_payload {} total_len={}{} sha12={} preview={}",
         label,
         total_len,
         extra_note,
+        sha12,
         preview,
     )
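Review note: the dedup and `hash_only` branches in this hunk can be exercised in isolation. The sketch below is a simplified approximation, not the project's code: `SketchSettings` is a hypothetical stand-in for `app.core.config.settings`, the function returns the would-be log line instead of calling `logger.debug`, and truncation is reduced to a plain slice in place of `truncate_for_log`. The prefix-skipping step is omitted.

```python
import hashlib
import threading
from dataclasses import dataclass


@dataclass
class SketchSettings:
    """Hypothetical stand-in for the real settings object."""
    prompt_mode: str = "preview"   # "preview" or "hash_only"
    prompt_dedup: bool = True
    max_chars: int = 40            # 0 means no truncation


_lock = threading.Lock()
_last_digest_by_label: dict = {}


def format_payload_line(label: str, text: str, cfg: SketchSettings) -> str:
    raw = text or ""
    digest = hashlib.sha256(raw.encode("utf-8", errors="replace")).hexdigest()
    sha12 = digest[:12]
    is_prompt = label.endswith(".prompt")

    # Dedup applies only to *.prompt labels and only to consecutive repeats.
    if is_prompt and cfg.prompt_dedup:
        with _lock:
            if _last_digest_by_label.get(label) == digest:
                return (
                    f"agent_payload_skipped label={label} "
                    f"reason=same_as_previous sha12={sha12} total_len={len(raw)}"
                )
            _last_digest_by_label[label] = digest

    # hash_only suppresses the body for prompts; other labels still preview.
    if is_prompt and cfg.prompt_mode.strip().lower() == "hash_only":
        return f"agent_payload label={label} total_len={len(raw)} sha12={sha12} mode=hash_only"

    preview = raw if cfg.max_chars == 0 else raw[: cfg.max_chars]
    return f"agent_payload {label} total_len={len(raw)} sha12={sha12} preview={preview}"


if __name__ == "__main__":
    cfg = SketchSettings()
    print(format_payload_line("judge.prompt", "Score this chapter.", cfg))
    print(format_payload_line("judge.prompt", "Score this chapter.", cfg))
    print(format_payload_line("judge.response", "Score this chapter.", cfg))
```

As in the hunk, the digest is recorded before the `hash_only` check, so a repeated prompt is reported as skipped in either mode. The per-label map is never pruned, which is probably fine for a fixed set of labels but would grow if labels were generated dynamically.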