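Database and models: adds several migrations (chapter evidence snapshots, conversation lineage, memory fact/timeline lineage, etc.) so that provenance between finished drafts and the underlying conversations/memories is persisted in the table schema.
Business pipeline: sessions and WS, the memoir/story pipeline, and memory writes and enrichment are now wired to the lineage links and snapshots; adds a chapter evidence snapshot module, the eval-side EvalTraceService, and related modules, which makes it easier to assemble evidence packages for review.
Internal evaluation: automated runs and manual memoir reviews share the same traceable evidence; the rubric/judge scripts and docs are adjusted to match.
app-eval-web: the Memoir/experiment detail views can expand to show the evidence summary and evidence_trace (including conversation turn ids); the Vite proxy and the API port injected by development.sh now match the current default internal evaluation port, so the page no longer connects to the wrong service after a port change.
Engineering misc: GitHub Actions / repository docs are updated; the adapters and payment/quota/plan code have small changes or follow-up cleanup for the main change; added/expanded?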

"""Transcript chunker: split raw text into retrieval-ready chunks."""

# Sentence-ending separators, in priority order (full-width first, then ASCII).
_SENTENCE_SEPARATORS = ("。", "!", "?", "\n", ";", ".", "!", "?")


def chunk_transcript(
    text: str, *, max_chars: int = 800, overlap_chars: int = 100
) -> list[str]:
    """
    Split transcript text into overlapping chunks.

    Uses character count as a rough proxy for tokens. The ~4 chars/token
    rule of thumb applies to English; Chinese text usually yields fewer
    characters per token, so size max_chars conservatively.
    """
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < max_chars")
    if not text or not text.strip():
        return []
    text = text.strip()
    if len(text) <= max_chars:
        return [text]

    chunks: list[str] = []
    start = 0

    while start < len(text):
        end = min(start + max_chars, len(text))
        chunk = text[start:end]
        # Prefer to cut at the end of a sentence.
        if end < len(text):
            for sep in _SENTENCE_SEPARATORS:
                last_sep = chunk.rfind(sep)
                # Only accept a cut in the second half so chunks stay reasonably large.
                if last_sep > max_chars // 2:
                    chunk = chunk[: last_sep + 1]
                    end = start + len(chunk)
                    break
        if chunk.strip():
            chunks.append(chunk.strip())
        if end >= len(text):
            break
        # Step back by overlap_chars so consecutive chunks share context.
        # (Previously start advanced by len(chunk), so no overlap was produced.)
        # max() guarantees forward progress when the chunk was cut short.
        start = max(end - overlap_chars, start + 1)

    return chunks
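
The step arithmetic in chunk_transcript (step = max_chars - overlap_chars) is easier to follow without the sentence-boundary logic. The block below is a minimal sketch under that simplification, not the module's implementation: sliding_windows and shared_overlap are hypothetical helper names introduced here for illustration only. It shows that advancing by max_chars - overlap_chars makes each window repeat the last overlap_chars characters of the previous one.

```python
def sliding_windows(text: str, max_chars: int, overlap_chars: int) -> list[str]:
    """Hypothetical helper: fixed-size windows advancing by max_chars - overlap_chars."""
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < max_chars")
    step = max_chars - overlap_chars
    windows: list[str] = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
    return windows


def shared_overlap(left: str, right: str) -> int:
    """Hypothetical helper: length of the longest suffix of left that is a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


# 60 characters cycling through a..z, so every window is easy to inspect.
sample = "".join(chr(ord("a") + i % 26) for i in range(60))
windows = sliding_windows(sample, max_chars=25, overlap_chars=5)
overlaps = [shared_overlap(a, b) for a, b in zip(windows, windows[1:])]
```

The same shared_overlap check can be run over the output of chunk_transcript to confirm that consecutive chunks really share context; because chunks are stripped and may be cut at a sentence boundary, the measured overlap there can differ slightly from overlap_chars.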