api/app/features/memory/chunker.py

"""Transcript chunker — split raw text into retrieval-ready chunks."""


def chunk_transcript(
    text: str, *, max_chars: int = 800, overlap_chars: int = 100
) -> list[str]:
    """
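    Split transcript text into overlapping chunks.

    Character count is a rough proxy for token count. The ratio depends on
    the tokenizer and language: about 4 chars/token is a common rule of
    thumb for English, while Chinese is usually closer to 1-2 chars/token.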
    """
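    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap_chars < max_chars")
    if not text or not text.strip():
        return []
    text = text.strip()
    if len(text) <= max_chars:
        return [text]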

    chunks: list[str] = []
    start = 0

    while start < len(text):
        end = min(start + max_chars, len(text))
        chunk = text[start:end]
        # Prefer to cut at a sentence boundary in the second half of the window.
        if end < len(text):
            for sep in ["。", "！", "？", "\n", "；", ".", "!", "?"]:
                last_sep = chunk.rfind(sep)
                if last_sep > max_chars // 2:
                    chunk = chunk[: last_sep + 1]
                    end = start + len(chunk)
                    break
        if chunk.strip():
            chunks.append(chunk.strip())
        if end >= len(text):
            break
        # Step back by overlap_chars so consecutive chunks share context,
        # while always advancing by at least one character.
        start = max(end - overlap_chars, start + 1)

    return chunks
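
The docstring's notion of overlapping chunks can be illustrated in isolation. The sketch below is a hypothetical helper (`window_spans` is not part of this module) that computes fixed-size window boundaries in which each window starts `overlap` characters before the previous one ended. It deliberately ignores sentence-boundary cutting and assumes `0 <= overlap < size`.

```python
def window_spans(length: int, size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) spans covering range(length) with the given overlap.

    Hypothetical helper for illustration only; assumes 0 <= overlap < size.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    spans: list[tuple[int, int]] = []
    start = 0
    while start < length:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:
            break
        # Step back by `overlap` so the next window repeats the tail of this one.
        start = end - overlap
    return spans


print(window_spans(10, 4, 1))  # [(0, 4), (3, 7), (6, 10)]
```

Each span after the first begins one character before the previous span's end, which is the shared context a retrieval chunk would carry over.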
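Refactor memoir into a story-first / markdown-first architecture and integrate image intents and frontend UI fixes

This squash merge brings the full set of changes from codex-story-first-image-intent into development. Key contents:

1. Backend data and migrations: add the stories, story_versions, story_image_intents, chapter_cover_intents, and assets models with Alembic migrations, establishing a story-first, markdown-first, asset-first primary data pipeline.

2. Generation and task chain: introduce StoryBuilderOrchestrator, ChapterComposerOrchestrator, story_image_tasks, and chapter_cover_tasks; image generation moves from in-body placeholders to a structured intent -> asset -> markdown backfill flow.

3. Concurrency and consistency: add claim_token, claimed_at, and attempt_count to story/chapter intents, using an atomic database claim as the primary mechanism with a Redis lock as a secondary one, to avoid duplicate generation, accidental lock deletion, and rows stuck in processing.

4. Memoir read/write paths: the chapter's canonical_markdown becomes the source of truth for body text; the list/detail endpoints gain markdown, cover_asset, word_count, and related fields; the PDF and asset resolution paths are upgraded accordingly.

5. Memory / Retrieval: extend the transcript ingest, chunking, evidence retrieval, and story aggregation infrastructure as a foundation for later story-first RAG and multi-agent orchestration.

6. App experience: the chapter page continues to use the MarkdownRenderer reading path while incorporating the cross-platform UI glitch fixes from fix3-19; the conversation page, home page, copy resources, and chapter-list mapping logic are updated.

7. Tests and docs: add regression tests for the asset resolver, story image task, chapter cover dispatch, and markdown mapping, and add a design document on retiring image placeholders.

2026-03-20 10:30:07 +08:00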
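
Item 3 above describes claiming an intent atomically in the database. A minimal sketch of that idea follows, using SQLite from the standard library so it is self-contained. Only `story_image_intents`, `claim_token`, `claimed_at`, and `attempt_count` come from the commit message; the `status` column, its values, and the `claim_intent` helper are assumptions for illustration, not the project's actual schema or code.

```python
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Optional


def claim_intent(conn: sqlite3.Connection, intent_id: int) -> Optional[str]:
    """Try to claim one intent; return the claim token, or None if already claimed."""
    token = uuid.uuid4().hex
    now = datetime.now(timezone.utc).isoformat()
    # A single conditional UPDATE is atomic: only one worker can move
    # a row from 'pending' to 'processing' (status values are assumed).
    cur = conn.execute(
        """
        UPDATE story_image_intents
           SET status = 'processing',
               claim_token = ?,
               claimed_at = ?,
               attempt_count = attempt_count + 1
         WHERE id = ? AND status = 'pending'
        """,
        (token, now, intent_id),
    )
    conn.commit()
    return token if cur.rowcount == 1 else None


# Demo with an in-memory database and a hypothetical minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE story_image_intents (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'pending',
        claim_token TEXT,
        claimed_at TEXT,
        attempt_count INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.execute("INSERT INTO story_image_intents (id) VALUES (1)")
conn.commit()

first = claim_intent(conn, 1)   # succeeds and returns a token
second = claim_intent(conn, 1)  # row is no longer 'pending', so returns None
print(first is not None, second is None)  # True True
```

Under this scheme, a worker finishing or releasing a row would presumably compare the stored `claim_token` with its own token first, so it cannot clear a claim it no longer holds.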
Merge branch 'refactor/backend-architecture' into development

2026-03-18 17:18:23 +08:00

chore/ remove unused files

2026-03-19 14:36:14 +08:00