life-echo/api/docs/memoir_reliability.md

# Memoir & memory reliability

This document summarizes production-oriented behavior for the memoir narrative pipeline, memory evidence, compaction, and async orchestration.

## Correlation ID (`memoir_correlation_id`)

- Phase 1 (`process_memoir_phase1`) generates a UUID at task start and logs `event=memoir_phase1_* … memoir_correlation_id=`.
- Phase 2 receives it via Celery `kwargs` and resolves the id to use with `effective_correlation_id`: an explicitly passed id wins; otherwise the Celery task id is used.
- The same id is passed into `run_story_pipeline_for_category_batch`, structured logs, and `compaction_extra` when scheduling memory compaction after Phase 2.
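
The resolution rule above can be sketched as follows. This is a minimal illustration, not the project's code: the signature of `effective_correlation_id` and the helper `new_memoir_correlation_id` are assumptions, and treating an empty string as "no explicit id" is a choice made here for the example.

```python
import uuid
from typing import Optional


def new_memoir_correlation_id() -> str:
    """Phase 1: generate a fresh UUID at task start (hypothetical helper name)."""
    return str(uuid.uuid4())


def effective_correlation_id(explicit_id: Optional[str], celery_task_id: str) -> str:
    """Phase 2: an explicitly passed id wins; otherwise use the Celery task id.

    Assumed signature. None and "" are both treated as "no explicit id".
    """
    return explicit_id or celery_task_id


# Phase 1 would pass its id along in the Phase 2 task kwargs.
phase1_id = new_memoir_correlation_id()
resolved = effective_correlation_id(phase1_id, celery_task_id="celery-task-123")
```

With this shape, a Phase 2 task that was enqueued without a Phase 1 id still logs a usable identifier, because the Celery task id fills the gap.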

## Feature flags (`app.core.config.Settings`)

| Flag | Default | Purpose |
|------|---------|---------|
| `memoir_fidelity_fail_open_on_parse_error` | `False` | When `True`, fidelity JSON/LLM failures pass the gate even for new stories. Enable only as a rollback measure when ops require it. |
| `memoir_narrative_evidence_overlap_min_chars` | `14` | Deterministic overlap check between body and evidence plain text. |
| `memoir_title_slots_require_body_or_oral_match` | `True` | Restricts title-generation slot inputs to those with a body or oral match. |
| `memory_compaction_enabled` | `True` | Soft-excludes near-duplicate chunks. Requires a Celery worker plus **Beat** for the periodic `memory_compaction_sweep`. |
| `memoir_recompose_retry_on_lock_contention` | `True` | Chapter recompose retries with backoff when the chapter pipeline lock is held. |
| `memoir_phase2_singleflight_immediate` | `True` | Immediate Phase 2 `send_task` uses a stable `task_id` per user/category to reduce duplicate queue entries. |
| `chapter_pipeline_lock_ttl_seconds` | `360` | Shared lock TTL for Phase 2 and `recompose_chapter`; tune it to the longest expected runtime of either task. |
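
To make the `memoir_narrative_evidence_overlap_min_chars` row more concrete, here is one possible reading of a deterministic overlap check: the body passes if it shares a contiguous run of at least `min_chars` characters with the evidence text. The function names are hypothetical, and the actual check may normalize or tokenize text differently; this is only a sketch under that assumption.

```python
from difflib import SequenceMatcher


def longest_shared_run(body: str, evidence: str) -> int:
    """Length of the longest contiguous substring present in both texts."""
    # autojunk=False keeps the matcher deterministic on long inputs.
    matcher = SequenceMatcher(None, body, evidence, autojunk=False)
    match = matcher.find_longest_match(0, len(body), 0, len(evidence))
    return match.size


def has_evidence_overlap(body: str, evidence: str, min_chars: int = 14) -> bool:
    """True when body and evidence share a run of at least min_chars characters."""
    return longest_shared_run(body, evidence) >= min_chars


body = "She grew up beside the old harbor in 1952."
evidence = "I remember the old harbor well."
grounded = has_evidence_overlap(body, evidence)
ungrounded = has_evidence_overlap("short text", "totally different words")
```

Because the check uses no model call, the same inputs always give the same verdict, which is what makes it usable as a gate in regression tests.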

## Memory compaction → facts

When a chunk is soft-excluded as a near-duplicate loser, `mark_facts_stale_for_excluded_chunk_sync` sets the linked `MemoryFact` rows (matched by `source_chunk_id`, with status `confirmed` or `candidate`) to **`stale`**. Default search/browse paths retrieve only `confirmed` facts, so staled facts no longer appear there.
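
The staling rule can be illustrated with an in-memory sketch. The `Fact` dataclass and both function names below are stand-ins invented for this example; the real `mark_facts_stale_for_excluded_chunk_sync` presumably operates on database rows, and its signature is not shown in this document.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Statuses that may transition to "stale", per the rule described above.
STALEABLE_STATUSES = {"confirmed", "candidate"}


@dataclass
class Fact:
    """Hypothetical in-memory stand-in for a MemoryFact row."""
    id: int
    source_chunk_id: int
    status: str


def mark_facts_stale(facts: Iterable[Fact], excluded_chunk_id: int) -> int:
    """Mark facts linked to the excluded chunk as stale; return how many changed."""
    staled = 0
    for fact in facts:
        if fact.source_chunk_id == excluded_chunk_id and fact.status in STALEABLE_STATUSES:
            fact.status = "stale"
            staled += 1
    return staled


def default_search(facts: Iterable[Fact]) -> List[Fact]:
    """Default search/browse paths only see confirmed facts."""
    return [fact for fact in facts if fact.status == "confirmed"]


facts = [
    Fact(1, source_chunk_id=10, status="confirmed"),
    Fact(2, source_chunk_id=10, status="candidate"),
    Fact(3, source_chunk_id=11, status="confirmed"),
]
staled_count = mark_facts_stale(facts, excluded_chunk_id=10)
visible_ids = [fact.id for fact in default_search(facts)]
```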

## Acceptance-oriented metrics (log queries)

Monitor structured log events:

- `event=fidelity_parse_fail_closed` / `fidelity_check_fail`
- `event=memoir_phase2_*` with `memoir_correlation_id`
- `memory_compaction_exclude` / `memory_compaction_facts_staled`
- `event=recompose_chapter status=lock_busy_retry`
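
As a rough illustration of such a log query, the sketch below counts events by name from `key=value` log lines. The sample lines are invented for the example, and the helper `count_events` is hypothetical; a production setup would more likely run the equivalent query in its log aggregation tool.

```python
import re
from collections import Counter
from typing import Iterable, Sequence

# Matches the event name in lines such as "... event=memoir_phase2_start ...".
EVENT_RE = re.compile(r"\bevent=(\S+)")


def count_events(lines: Iterable[str], prefixes: Sequence[str]) -> Counter:
    """Count log lines per event name, keeping only events with a given prefix."""
    counts: Counter = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match.group(1).startswith(tuple(prefixes)):
            counts[match.group(1)] += 1
    return counts


# Invented sample lines, shaped like the events listed above.
sample = [
    "INFO event=memoir_phase2_start memoir_correlation_id=abc",
    "INFO event=memoir_phase2_done memoir_correlation_id=abc",
    "WARNING event=fidelity_parse_fail_closed story_id=1",
    "INFO event=recompose_chapter status=lock_busy_retry",
    "INFO no structured fields on this line",
]
counts = count_events(sample, ["memoir_phase2_", "fidelity_"])
```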

## Tests

Targeted regressions live under `api/tests/`:

- `test_fidelity_gate.py`, `test_narrative_boundary_regressions.py`
- `test_memory_consistency_rules.py`, `test_memoir_idempotency.py`
- `test_recompose_retry_policy.py`
- `test_llm_json_call.py`, `test_stage_slot_registry.py`

## LLM JSON (`llm_json_call`) and compat strip

- Standard path: `response_format=json_object` → `json.loads` → Pydantic validate.
- On decode failure only, `extract_json_payload` runs once (fence / brace strip). A hit emits **`event=llm_json_compat_strip_hit`** at WARNING.
- **Step 13 (sunset)**: observe this event in production for ~1–2 weeks; if zero hits, remove the compat branch from `app.core.llm_call` and migrate remaining callers off `extract_json_payload` for JSON-mode paths.
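
The parse-then-strip flow described in this section can be sketched with the standard library alone. This is an assumption-laden illustration: `parse_llm_json` is a hypothetical name, the `extract_json_payload` body below is a stand-in rather than the project's implementation, and the Pydantic validation step is omitted to keep the example dependency-free.

````python
import json
import logging
import re
from typing import Any

logger = logging.getLogger("llm_json_sketch")

FENCE = "`" * 3  # Markdown code fence, built indirectly to keep this block readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json_payload(text: str) -> str:
    """Stand-in: strip a Markdown fence, else trim to the outermost braces."""
    fenced = FENCE_RE.search(text)
    if fenced:
        return fenced.group(1)
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return text[start : end + 1]
    return text


def parse_llm_json(raw: str) -> Any:
    """Standard path first; run the compat strip once, only on decode failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        parsed = json.loads(extract_json_payload(raw))  # a second failure propagates
        logger.warning("event=llm_json_compat_strip_hit")
        return parsed


clean = parse_llm_json('{"title": "Harbor"}')
fenced = parse_llm_json(FENCE + 'json\n{"title": "Harbor"}\n' + FENCE)
wrapped = parse_llm_json('Here is the result: {"title": "Harbor"} Thanks.')
````

Logging only after the stripped payload parses keeps the WARNING count equal to the number of responses the compat branch actually rescued, which is the number the sunset decision depends on.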