life-echo/api/app/features/memory/chunker.py
yangshilin e1341c6d18 feat:
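1. Build a question-bank outline mapped to each life-stage slot
2. Encourage more everyday, conversational language for empathy and summaries
3. Reduce the likelihood that the review model's output gets truncated
4. Strengthen emotional expression and contextual coherence in the draft-quality dimensions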
2026-04-09 15:32:35 +08:00

"""Transcript chunker: split raw text into retrieval-ready chunks."""


def chunk_transcript(
    text: str, *, max_chars: int = 800, overlap_chars: int = 100
) -> list[str]:
    """
    Split transcript text into overlapping chunks.

    Character count is only a rough proxy for tokens: the chars-per-token
    ratio depends on the tokenizer and is lower for Chinese than for English.
    """
    if not text or not text.strip():
        return []
    text = text.strip()
    if len(text) <= max_chars:
        return [text]
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = start + max_chars
        chunk = text[start:end]
        # Prefer to cut at the end of a sentence.
        if end < len(text):
            # Full-width CJK punctuation first, then newline, then ASCII.
            for sep in ["。", "!", "?", "\n", ";", ".", "!", "?"]:
                last_sep = chunk.rfind(sep)
                # Only accept a cut that keeps more than half of the window.
                if last_sep > max_chars // 2:
                    chunk = chunk[: last_sep + 1]
                    end = start + len(chunk)
                    break
        if chunk.strip():
            chunks.append(chunk.strip())
        if end >= len(text):
            break
        # Step back by overlap_chars so consecutive chunks share context;
        # the max() guarantees forward progress even for a large overlap.
        start = max(end - overlap_chars, start + 1)
    return chunks
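
The sentence-boundary search can also be written as a standalone helper. The sketch below is a hypothetical variant, not part of chunker.py: instead of trying separators in priority order, it takes the rightmost sentence ending of any kind, which keeps each chunk as long as possible. The names `SENTENCE_ENDINGS` and `last_sentence_break` and the exact separator set are assumptions made for illustration.

```python
# Assumed separator set (full-width CJK punctuation, newline, ASCII).
SENTENCE_ENDINGS = ("。", "!", "?", ";", "\n", ".", "!", "?")


def last_sentence_break(chunk: str, min_pos: int) -> int:
    """Return the cut index just past the rightmost sentence ending.

    Returns -1 when no separator is found at an index greater than min_pos,
    so the caller can fall back to a hard cut at the window edge.
    """
    best = max(chunk.rfind(sep) for sep in SENTENCE_ENDINGS)
    return best + 1 if best > min_pos else -1


sample = "第一句话。第二句话!第三句还没说完"
cut = last_sentence_break(sample, min_pos=len(sample) // 2)
print(sample[:cut] if cut != -1 else sample)
```

A caller would use the returned index as the slice end and keep the full window when the helper returns -1.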