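This squash merge brings the full set of changes from codex-story-first-image-intent into development. Core changes:
1. Backend data and migrations: adds the stories, story_versions, story_image_intents, chapter_cover_intents, and assets models with their Alembic migrations, establishing a story-first, markdown-first, asset-first primary data pipeline.
2. Generation and task chain: introduces StoryBuilderOrchestrator, ChapterComposerOrchestrator, story_image_tasks, and chapter_cover_tasks; image generation moves from in-text placeholders to a structured intent -> asset -> markdown backfill flow.
3. Concurrency and consistency: adds claim_token, claimed_at, and attempt_count to story/chapter intents; an atomic database claim is the primary guard and a Redis lock is secondary, preventing duplicate generation, accidental deletion of another worker's lock, and rows stuck in processing.
4. Memoir read/write paths: a chapter's canonical_markdown becomes the source of truth for body text; list/detail endpoints now return markdown, cover_asset, word_count, and related fields; the PDF and asset resolution paths are upgraded to match.
5. Memory / Retrieval: extends the transcript ingest, chunking, evidence retrieval, and story aggregation infrastructure as the foundation for later story-first RAG and multi-agent orchestration.
6. App experience: the chapter page continues to use the MarkdownRenderer reading path and picks up the cross-platform UI glitch fixes from fix3-19; updates the conversation page, home page, copy resources, and chapter list mapping logic.
7. Tests and docs: adds regression tests for the asset resolver, story image task, chapter cover dispatch, and markdown mapping, and adds a design doc for retiring image placeholders.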
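The database-first claim described in item 3 can be sketched roughly as below. This is an illustration only, not the code in this merge: it uses an in-memory SQLite table in place of the real models, and the status values `pending` and `done`, the `STALE_AFTER_SECONDS` timeout, and the helpers `make_db`, `claim_intent`, and `finish_intent` are all hypothetical. Only the column names claim_token, claimed_at, and attempt_count and the `processing` state come from the commit message.

```python
import sqlite3
import time
import uuid
from typing import Optional

# Hypothetical timeout after which a stuck 'processing' row may be reclaimed.
STALE_AFTER_SECONDS = 600


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the story_image_intents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE story_image_intents (
            id INTEGER PRIMARY KEY,
            status TEXT NOT NULL DEFAULT 'pending',
            claim_token TEXT,
            claimed_at REAL,
            attempt_count INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    return conn


def claim_intent(
    conn: sqlite3.Connection, intent_id: int, now: Optional[float] = None
) -> Optional[str]:
    """Atomically claim an intent; return the claim token, or None if taken."""
    now = time.time() if now is None else now
    token = uuid.uuid4().hex
    # A single conditional UPDATE is the atomic step: only one worker can
    # flip a pending (or stale processing) row, so only one sees rowcount 1.
    cur = conn.execute(
        """
        UPDATE story_image_intents
        SET status = 'processing', claim_token = ?, claimed_at = ?,
            attempt_count = attempt_count + 1
        WHERE id = ?
          AND (status = 'pending'
               OR (status = 'processing' AND claimed_at < ?))
        """,
        (token, now, intent_id, now - STALE_AFTER_SECONDS),
    )
    conn.commit()
    return token if cur.rowcount == 1 else None


def finish_intent(conn: sqlite3.Connection, intent_id: int, token: str) -> bool:
    """Mark the intent done, but only if this worker still owns the claim."""
    cur = conn.execute(
        """
        UPDATE story_image_intents
        SET status = 'done', claim_token = NULL
        WHERE id = ? AND claim_token = ? AND status = 'processing'
        """,
        (intent_id, token),
    )
    conn.commit()
    return cur.rowcount == 1
```

Matching on claim_token at completion is what keeps a slow worker from overwriting a row that another worker has since reclaimed, and the staleness check on claimed_at is one way to recover rows stuck in processing. A Redis lock, if used, would sit in front of this as an optimization rather than as the source of truth.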
39 lines
1.1 KiB
Python
"""Transcript chunker — split raw text into retrieval-ready chunks."""

# Sentence-ending separators, tried in priority order (full-width forms first).
_SENTENCE_SEPARATORS = ("。", "!", "?", "\n", ";", ".", "!", "?")


def chunk_transcript(
    text: str, *, max_chars: int = 800, overlap_chars: int = 100
) -> list[str]:
    """
    Split transcript text into overlapping chunks.

    Uses character count as a rough proxy for tokens. The common
    ~4 chars/token estimate applies to English; Chinese text usually
    yields more tokens per character, so tune max_chars for the
    tokenizer in use.
    """
    if not text or not text.strip():
        return []
    text = text.strip()
    if len(text) <= max_chars:
        return [text]

    chunks: list[str] = []
    start = 0

    while start < len(text):
        end = start + max_chars
        chunk = text[start:end]
        # Prefer to cut at a sentence boundary in the second half of the window.
        if end < len(text):
            for sep in _SENTENCE_SEPARATORS:
                last_sep = chunk.rfind(sep)
                if last_sep > max_chars // 2:
                    chunk = chunk[: last_sep + 1]
                    end = start + len(chunk)
                    break
        if chunk.strip():
            chunks.append(chunk.strip())
        if end >= len(text):
            break
        # Step back by overlap_chars so consecutive chunks share context,
        # but always advance by at least one character so the loop terminates.
        start = max(end - overlap_chars, start + 1)

    return chunks
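Because each chunk returned by chunk_transcript is a stripped substring of the input, taken in order, its position in the source can be recovered afterwards. This is useful when evidence retrieval needs to point back into the transcript. The sketch below is an assumption about how that might be done, not part of this module; `locate_chunks` is a hypothetical helper name.

```python
def locate_chunks(text: str, chunks: list[str]) -> list[tuple[int, int]]:
    """Return the (start, end) offsets of each chunk within text, in order."""
    spans: list[tuple[int, int]] = []
    cursor = 0
    for chunk in chunks:
        # Search from just past the previous chunk's start, so overlapping
        # chunks are still found while the order is preserved.
        start = text.find(chunk, cursor)
        if start == -1:
            raise ValueError("chunk is not an in-order substring of text")
        spans.append((start, start + len(chunk)))
        cursor = start + 1
    return spans


if __name__ == "__main__":
    # Two hand-written chunks that share the characters "ef".
    print(locate_chunks("abcdefghij", ["abcdef", "efghij"]))
```

If the transcript contains repeated passages, `str.find` may match an earlier occurrence than the one the chunk was cut from. Recording offsets inside the chunker itself would avoid that ambiguity.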