Synthesis, Footnotes, and Citations

By the time a deal reaches synthesis, the domain agents have already done the work: the financial, risk, comparables, market, property, and exit agents have each produced a section of analysis backed by retrieved evidence (see retrieval and reranking). Synthesis is where those sections become one cited institutional memo — assembled, citation-anchored, footnoted, chart-annotated, and validated for consistency before it is handed to Canvas.

This page covers the two stages that own that transformation:

Synthesis — merges per-section outputs into a single memo and enforces that the canonical underwriting numbers (IRR, DSCR, LTV, cap rate, debt yield, equity multiple) survive into the prose unchanged.
The footnote pipeline — a fixed-order, multi-stage process that turns inline [SRC:n] evidence markers into numbered footnotes with definitions, source clustering, freshness warnings, FormulaGraph lineage, and a consistency audit.

It also documents the chart insertion boundary validator — the last gate a chart passes before it is encoded into the memo.

The evidence-citation contract: `[SRC:n]`

Every claim in a Memosa memo is meant to trace back to a piece of retrieved evidence. The mechanism is an inline marker the agents emit directly in their prose:

The property's in-place NOI of $4.2M [SRC:12] supports a going-in cap
rate of 5.1% [SRC:7] against the $82M basis.

[SRC:12] says “this claim is supported by retrieval source #12.” The footnote pipeline later converts each marker into a numbered footnote reference and emits a definition block that names the underlying document and page.

The citation regex (and why it tolerates a space)

The canonical pattern is:

import re
SRC_MARKER_PATTERN = re.compile(r'\[SRC:\s*(\d+)\]')

The \s* is load-bearing, not cosmetic. Production LLM output sometimes emits [SRC: 12] with a space after the colon; an earlier pattern without \s* silently failed to match those, dropping real citations. The capture group is the integer ID. This exact pattern is duplicated across the pipeline — claim_injector.py, deduplicator.py, src_id_normalizer.py, section_quality_scorer.py, plus the shared src/utils/citation_extractor.py and src/utils/footnotes.py — and the rule is that all of them stay identical. A drift in one regex would mean a marker that one stage sees and another misses.
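To make the difference concrete, the following minimal sketch runs the pattern above next to a stricter variant without \s*. The helper name extract_src_ids and the STRICT_PATTERN constant are illustrative, not names taken from the codebase.

```python
import re

# Pattern exactly as quoted in the text above.
SRC_MARKER_PATTERN = re.compile(r'\[SRC:\s*(\d+)\]')
# Stricter variant without \s*, included only to show what it misses.
STRICT_PATTERN = re.compile(r'\[SRC:(\d+)\]')


def extract_src_ids(text: str, pattern: re.Pattern = SRC_MARKER_PATTERN) -> list[int]:
    """Return the integer source IDs in reading order (hypothetical helper)."""
    return [int(n) for n in pattern.findall(text)]


sample = "In-place NOI of $4.2M [SRC:12] supports a 5.1% cap rate [SRC: 7]."
tolerant_ids = extract_src_ids(sample)                 # sees both markers
strict_ids = extract_src_ids(sample, STRICT_PATTERN)   # misses "[SRC: 7]"
```

With the tolerant pattern both citations are found; the strict variant drops the one with a space after the colon, which is the silent loss described above.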

Synthesis: merging sections into one memo

Synthesis lives under src/langchain/workflows/tools/synthesis/. Its orchestration pipeline (orchestration/synthesis_pipeline.py) merges the per-section agent outputs, persists the result, and computes quality scores. Two production lessons shape how it behaves.

State and Store must not diverge

The memo content of record lives in PGStore, not in graph state — state carries only a metadata manifest (see namespacing and isolation and the CLAUDE.md store-backed-persistence rule). On the “Store has all sections” path, the final merged memo_sections are assembled in a parameter, and that parameter has to be written back to state. If it isn’t, the downstream quality scorer reads an empty state["memo_sections"], scores everything 0.0, and the quality gate concludes “no sections were produced” — when in fact the Store holds a complete memo.

The fix (resilience pattern R4.1, in _build_result()) is to prefer the non-empty memo_sections parameter unconditionally and fall back to state only when the parameter is empty. A three-layer silent-failure defense in agent_executor.py backs this up: when all scores are 0.0 it queries the Store directly, rehydrates state from the Store when the Store version is materially longer, and re-scores with a content-aware formula instead of a hard floor.
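The preference rule can be sketched in a few lines. This is an illustration of the R4.1 idea only, assuming memo_sections is a dict of section name to text; resolve_memo_sections is a hypothetical helper, not the body of _build_result().

```python
from typing import Any, Dict, Optional


def resolve_memo_sections(
    memo_sections: Optional[Dict[str, str]],
    state: Dict[str, Any],
) -> Dict[str, str]:
    """Prefer the non-empty parameter; fall back to state only when it is empty.

    The chosen value is also written back to state so that a downstream
    scorer reading state["memo_sections"] sees the same content.
    """
    if memo_sections:
        chosen = memo_sections
    else:
        chosen = state.get("memo_sections") or {}
    state["memo_sections"] = chosen
    return chosen
```

Writing the chosen value back is the step whose absence produced the all-zero scores: the merged memo existed, but only in a local parameter.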

Canonical metrics are grammar-enforced, not prompt-begged

A recurring, expensive failure mode was the executive-summary LLM hallucinating the wrong IRR into narrative prose — substituting a nearby percentage from RAG context (a vacancy rate, a sensitivity-scenario value) for the canonical target_irr from the underwriting model. The wrong number never shipped (a runtime corrector rewrote it with a [VERIFIED: corrected from N per source data] annotation), but the corrector fired on every memo, masking the real bug: the LLM’s output grammar should never have allowed the substitution in the first place. Seven-plus prompt patches each attacked a symptom without changing the grammar.

The architectural resolution moved the canonical numbers out of the prose and into the structured output schema. SynthesisExecutiveSummaryResponse (and the parallel risk and financial schemas) now carry six Optional[Decimal] canonical-metric fields — target_irr_pct, dscr, ltv_pct, cap_rate_pct, debt_yield_pct, equity_multiple. Anthropic constrained decoding grammar-enforces type and range, so the LLM writes the canonical value into a structured slot, separate from the narrative. Two enforcement layers run on top:

Structured-vs-canonical check. The caller compares each populated structured field against key_metrics[...]; a delta above the shared STRUCTURED_MISMATCH_THRESHOLD of 0.20 (20%) triggers a retry.
Prose @field_validator decorators. Five validators scan the prose fields for canonical-metric phrasings (IRR / DSCR / LTV / cap rate / debt yield / equity multiple) and raise ValueError at decode time when an un-cited canonical-metric number appears, tagging the response ParseQuality.partial and forcing a retry.
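The first layer can be sketched as a small comparison function. It assumes the delta is relative to the canonical value, which the text does not spell out, and find_mismatches is a hypothetical name; only STRUCTURED_MISMATCH_THRESHOLD and the field names come from the description above.

```python
from decimal import Decimal
from typing import Dict, List, Optional

STRUCTURED_MISMATCH_THRESHOLD = Decimal("0.20")  # 20%, as stated in the text


def find_mismatches(
    structured: Dict[str, Optional[Decimal]],
    key_metrics: Dict[str, Decimal],
) -> List[str]:
    """Return the structured fields that diverge from the canonical values.

    Assumption: delta is |structured - canonical| / |canonical|.
    Unpopulated fields and metrics without a canonical value are skipped.
    """
    mismatched: List[str] = []
    for name, value in structured.items():
        canonical = key_metrics.get(name)
        if value is None or canonical is None or canonical == 0:
            continue
        delta = abs(value - canonical) / abs(canonical)
        if delta > STRUCTURED_MISMATCH_THRESHOLD:
            mismatched.append(name)
    return mismatched


structured = {
    "target_irr_pct": Decimal("6.0"),  # e.g. a vacancy rate picked up by mistake
    "dscr": Decimal("1.30"),
    "ltv_pct": None,                   # not populated by the model
}
key_metrics = {
    "target_irr_pct": Decimal("14.5"),
    "dscr": Decimal("1.25"),
    "ltv_pct": Decimal("65"),
}
needs_retry = find_mismatches(structured, key_metrics)
```

In this example only target_irr_pct exceeds the threshold, so a caller following the rule above would retry.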

The runtime corrector stays as a safety net, but its log lines were escalated to ERROR — post-fix, any firing is a P1 signal that the structured-output enforcement has a gap, not steady-state noise. There is one deliberate release valve: cite-anchored canonical values (a number followed or preceded by a [SOURCE-X] / [TABLE-N] anchor in the same sentence) are preserved, because an evidence-backed number is allowed to diverge from the headline (e.g. the risk section is explicitly instructed to discuss a sponsor-IRR vs computed-IRR discrepancy). Only un-cited divergences are rejected.

The footnote pipeline: fixed order, one orchestrator

Once the merged memo exists, the footnote pipeline turns [SRC:n] markers into a finished citation apparatus. The orchestrator is src/langchain/workflows/tools/final_editor/footnotes/orchestrator.py, and it runs a sequence of numbered stages. Stage order is non-negotiable — each stage depends on the data shape produced by the prior one (claim injection must precede budget enforcement; deduplication must precede renumbering).

The pipeline runs an ordered set of stages, the first labeled 0a-strip and the last 7.1, marked by # STAGE <label>: comments in the orchestrator. A static-scan regression test (tests/unit/langchain/workflows/tools/final_editor/test_footnote_stage_ordering.py) parses those comment markers and compares them against a frozen golden list (EXPECTED_STAGE_LABELS). Reordering, deleting, or inserting a stage without updating that list breaks the test by design; silent drift is exactly what it prevents. As of this writing the list holds 36 stages, but it grows as new injection stages ship; the test’s golden list, not any hardcoded number, is the source of truth.
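A static scan of this kind can be sketched as follows. The regex, the helper name, and the three-stage excerpt are assumptions made for illustration; the real test and its 36-entry golden list live in the file named above.

```python
import re

# Matches comment lines of the form "# STAGE <label>: ..." and captures the label.
STAGE_COMMENT = re.compile(r'^\s*#\s*STAGE\s+([^:\n]+):', re.MULTILINE)


def extract_stage_labels(source: str) -> list[str]:
    """Return stage labels in the order they appear in the source text."""
    return [label.strip() for label in STAGE_COMMENT.findall(source)]


# Hypothetical excerpt standing in for the orchestrator source.
orchestrator_source = '''
    # STAGE 0a-strip: remove defensive markers
    run_strip()
    # STAGE 0e: global SRC normalization
    run_normalize()
    # STAGE 1: aggregate footnotes
    run_aggregate()
'''

# Shortened stand-in for the frozen golden list.
EXPECTED_STAGE_LABELS = ["0a-strip", "0e", "1"]

labels = extract_stage_labels(orchestrator_source)
```

Because the comparison is an ordered list equality, any reorder, insertion, or deletion in the source changes labels and fails the check until the golden list is updated deliberately.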

What the stages do, in groups

Stages 0a-strip → 0.5 — normalization. Before any aggregation, the pipeline harmonizes marker syntax. It strips defensive [CANONICAL: …] markers; runs per-section post-hoc citation injection and a document-level rescue that salvage zero-citation sections from deal metadata; converts legacy data-source tags ([CoStar], [Excel], [PDF]) and [SOURCE-X] tags to [SRC:n]; sanitizes marker positions; and runs a global [SRC:n] normalization (stage 0e, covered below) that resolves cross-section ID collisions.

Stages 1 → 2.9 — aggregation and injection. Footnotes are aggregated from all sources, metrics-table superscripts are processed, claim-level citations are injected ([^n]), same-source footnotes are consolidated, and a per-section citation budget enforces a minimum citation count for sections over ~500 words. Four KG-enriched injections follow — analyst IC-Notes sidenotes (2.7), model-quality red-flag warnings (2.8), structured source-conflict footnotes (2.85), and assumption-lineage footnotes (2.9, covered below). Each KG stage is gated on model_intelligence_context / llm_factory presence and silently no-ops when absent.

Stage 3 is permanently skipped. Section-header superscripts are structural, not claims, so the orchestrator unconditionally emits a section_headers_skipped event. The stage-ordering test pins the skip so it cannot be silently re-enabled.

Stages 4 → 4.95 — fallback and quality. Remaining [SRC:n] markers get a final fallback pass, orphans and unrecognized tags are stripped with recovery, cross-section coherence-conflict footnotes are collected and injected, citation freshness warnings flag sources older than 180 days (when ingestion_date is in chunk metadata), and per-section citation quality grades (A/B/C/D) are computed.
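The freshness rule in this group reduces to a date comparison. The sketch below assumes ingestion_date has already been parsed into a date and that "older than 180 days" means strictly more than 180; is_stale and FRESHNESS_LIMIT_DAYS are hypothetical names.

```python
from datetime import date
from typing import Optional

FRESHNESS_LIMIT_DAYS = 180  # threshold stated in the text


def is_stale(ingestion_date: Optional[date], today: date) -> bool:
    """Flag a source whose ingestion date is more than 180 days before today.

    A source with no ingestion_date in its chunk metadata is never flagged,
    matching the "when ingestion_date is in chunk metadata" condition.
    """
    if ingestion_date is None:
        return False
    return (today - ingestion_date).days > FRESHNESS_LIMIT_DAYS
```

Passing today explicitly keeps the check deterministic in tests rather than depending on the wall clock.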

Stages 5 → 5.8 — definitions and source rollups. The footnote definitions block and Data Source Notes are appended, definitions are consolidated under a heading, a “Sources Consulted” summary table is built, and definitions are clustered by source document under bold sub-headings.

Stages 6 → 7.1 — cleanup, renumber, validate. References are deduplicated, consecutive reference clusters are capped at three, low-confidence references are marked, footnotes are renumbered by document reading order (not by Store-dict order), citation density is re-verified after dedup (and re-injected if it dropped below budget), consistency is validated, and a final orphan-reference auto-repair runs.
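Renumbering by reading order can be illustrated with a single regex pass over the body. This is a sketch under stated assumptions: references use the [^n] form mentioned earlier, definitions are held in a dict keyed by old number, and renumber_by_reading_order is a hypothetical name.

```python
import re

FOOTNOTE_REF = re.compile(r'\[\^(\d+)\]')


def renumber_by_reading_order(
    body: str, definitions: dict[int, str]
) -> tuple[str, dict[int, str]]:
    """Renumber [^n] references 1, 2, 3, ... in order of first appearance."""
    mapping: dict[int, int] = {}

    def _assign(match: re.Match) -> str:
        old = int(match.group(1))
        if old not in mapping:
            mapping[old] = len(mapping) + 1  # next number in reading order
        return f"[^{mapping[old]}]"

    # One pass with a callback: each reference is rewritten exactly once,
    # so a new number can never be mistaken for an old one.
    new_body = FOOTNOTE_REF.sub(_assign, body)
    new_definitions = {
        new: definitions[old] for old, new in mapping.items() if old in definitions
    }
    return new_body, new_definitions


body = "NOI is stable [^7]. Rents trail market [^2]. NOI grew 3% [^7]."
definitions = {2: "Rent roll, p. 3", 7: "T-12 statement, p. 1"}
new_body, new_definitions = renumber_by_reading_order(body, definitions)
```

The first reference a reader meets becomes 1 regardless of the order in which the definitions were stored.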

A subset of stages emit waterfall checkpoints: citation-count snapshots used for diagnostics. There are nine of them, opening with pre_pipeline and closing with final, in order: pre_pipeline, after_stage_2_claim_injection, after_stage_2.5_budget, after_stage_2.7_insights, after_stage_2.8_model_warnings, after_stage_2.85_conflicts, after_stage_2.9_lineage, before_stage_6_dedup, final. They let an operator see exactly which stage gained or lost citations on a given memo.
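One way to read the checkpoints is as a series of deltas between consecutive snapshots. The sketch below assumes the snapshots arrive as a dict of checkpoint name to citation count and that gated stages may be absent; citation_deltas and the sample counts are illustrative.

```python
# Checkpoint names in the order listed above.
WATERFALL_STAGES = [
    "pre_pipeline",
    "after_stage_2_claim_injection",
    "after_stage_2.5_budget",
    "after_stage_2.7_insights",
    "after_stage_2.8_model_warnings",
    "after_stage_2.85_conflicts",
    "after_stage_2.9_lineage",
    "before_stage_6_dedup",
    "final",
]


def citation_deltas(snapshots: dict[str, int]) -> list[tuple[str, int]]:
    """Return (checkpoint, change since previous recorded checkpoint) pairs."""
    deltas: list[tuple[str, int]] = []
    previous = None
    for stage in WATERFALL_STAGES:
        if stage not in snapshots:
            continue  # checkpoint not recorded for this memo
        count = snapshots[stage]
        if previous is not None:
            deltas.append((stage, count - previous))
        previous = count
    return deltas


# Made-up counts for one memo, with the KG-stage checkpoints missing.
snapshots = {
    "pre_pipeline": 10,
    "after_stage_2_claim_injection": 18,
    "after_stage_2.5_budget": 21,
    "before_stage_6_dedup": 21,
    "final": 17,
}
deltas = citation_deltas(snapshots)
```

A negative delta at final, as in this made-up example, points at the cleanup stages as the place where citations were removed.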

A key implementation contract

The footnote stages accept memo_sections directly as a parameter — they do not read or write graph-state objects. This avoids a redundant Store→State→Store round-trip that the synthesis layer already has to manage carefully (see R4.1 above). document_paths is also passed explicitly so that an orphaned [SRC:n] marker can be resolved back to a real document even when its definition was lost upstream.

Footnote ID re-indexing across sections

The hardest part of the pipeline is keeping footnote IDs globally unique while N sections are each producing their own local [SRC:1], [SRC:2], … numbering. Two mechanisms cooperate.

Per-section injection uses a running global offset

When a section’s post-hoc injection assigns local IDs, the injector threads a running global_offset so each section’s markers land above the previous section’s range (citation_injection.py). When a section already carries existing IDs that overlap the offset, the injector shifts them up by the overlap amount — and it does the shift in reverse numeric order. The reverse order is not arbitrary: rewriting [SRC:1] before [SRC:10] would let the 1 substring match inside 10, corrupting the higher ID. Processing highest-first avoids that class of bug. The offset always advances afterward so the next section starts clean.
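A highest-first shift can be sketched as below. This is not the code in citation_injection.py; shift_src_ids is a hypothetical helper that rewrites one ID at a time, matching whole markers so that 1 cannot match inside 10, and working downward so that a freshly written ID is never shifted a second time.

```python
import re

SRC_MARKER_PATTERN = re.compile(r'\[SRC:\s*(\d+)\]')


def shift_src_ids(text: str, shift: int) -> str:
    """Shift every [SRC:n] marker up by `shift`, processing the highest ID first."""
    if shift <= 0:
        return text
    ids = sorted({int(n) for n in SRC_MARKER_PATTERN.findall(text)}, reverse=True)
    for old in ids:
        # Match the complete marker for this one ID, tolerating "[SRC: n]".
        one_id = re.compile(r'\[SRC:\s*' + str(old) + r'\]')
        text = one_id.sub(f"[SRC:{old + shift}]", text)
    return text


shifted = shift_src_ids("a [SRC:1] b [SRC:10] c [SRC: 2]", 1)
```

Run in ascending order with a shift of 1, the same loop would turn [SRC:1] into [SRC:2] and then shift it again together with the original [SRC:2]; descending order leaves each marker rewritten exactly once.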

The global normalizer resolves true collisions

Stage 0e runs normalize_global_src_ids() in src_id_normalizer.py — a single-pass algorithm that splits the merged content by H2 headers, collects the [SRC:n] IDs per section, finds IDs that appear in two or more sections, and remaps the duplicates to fresh IDs assigned strictly above max(all_ids). The remap is applied to the document text via a regex substitution and to the structured claim_citations in memo_sections, with a final sweep that catches sections whose headers didn’t resolve to a text index.

Crucially, not every shared ID is a collision. The same retrieval chunk can be validly cited from multiple sections (one chunk legitimately informs financial and risk and exit content). The normalizer looks up each apparent collision’s chunk identity — (doc_id, page) — from each host section’s claim_citations; if every host resolves to the same chunk, the marker is preserved as legitimate cross-section reuse rather than split into three different footnotes for the same source.
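The collision-versus-reuse decision can be sketched as a planning step that returns which occurrences to remap. The data shapes here are assumptions: section_ids maps a section name to its [SRC:n] IDs, and chunk_of maps a (section, id) pair to its (doc_id, page) identity. Letting the first host section keep the original ID is also an assumption; the text only says duplicates receive fresh IDs above max(all_ids).

```python
from typing import Dict, List, Tuple

ChunkKey = Tuple[str, int]     # (doc_id, page)
Occurrence = Tuple[str, int]   # (section name, SRC id)


def plan_remap(
    section_ids: Dict[str, List[int]],
    chunk_of: Dict[Occurrence, ChunkKey],
) -> Dict[Occurrence, int]:
    """Return {(section, old_id): new_id} for true collisions only."""
    all_ids = {i for ids in section_ids.values() for i in ids}
    next_id = max(all_ids, default=0) + 1  # fresh IDs start above every existing ID

    # Which sections host each ID, in document order.
    hosts: Dict[int, List[str]] = {}
    for section, ids in section_ids.items():
        for src_id in sorted(set(ids)):
            hosts.setdefault(src_id, []).append(section)

    remap: Dict[Occurrence, int] = {}
    for src_id in sorted(hosts):
        sections = hosts[src_id]
        if len(sections) < 2:
            continue
        chunks = [chunk_of.get((s, src_id)) for s in sections]
        if chunks[0] is not None and all(c == chunks[0] for c in chunks):
            continue  # every host cites the same chunk: legitimate reuse
        for section in sections[1:]:  # assumption: first host keeps the ID
            remap[(section, src_id)] = next_id
            next_id += 1
    return remap


section_ids = {"Financial": [1, 2], "Risk": [1, 2], "Exit": [3]}
chunk_of = {
    ("Financial", 1): ("om.pdf", 4),
    ("Risk", 1): ("om.pdf", 4),        # same chunk as Financial: reuse
    ("Financial", 2): ("t12.xlsx", 1),
    ("Risk", 2): ("rentroll.pdf", 9),  # different chunk: true collision
}
remap = plan_remap(section_ids, chunk_of)
```

Here ID 1 is shared but resolves to one chunk, so it is left alone, while ID 2 in the Risk section is moved to 4, the first value above the existing maximum of 3.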

This normalizer is a safety net, not the primary defense. The real fix is a process-wide GlobalSourceIdAllocator wired into WorkflowContext.source_id_allocator, which hands every subgraph a reserved, non-overlapping ID range up front. When the allocator is doing its job, collisions are zero.

FormulaGraph provenance in footnotes

Stage 2.9 — the assumption-lineage injector (lineage_injector.py) — is where the Excel underwriting model’s structure reaches the reader. Memosa builds a FormulaGraph of every cell, formula, and dependency in the workbook; for the memo’s most influential metrics, the lineage injector documents the assumption chain that drives each metric and the depth of its formula chain, as a footnote.

The data comes from metric_lineage in model_intelligence_context, which the orchestrator enriches with FormulaGraph-bridge provenance for every predicate the bridge can resolve. The injector applies a per-section cap (default 8), a minimum influence score of 0.5 to qualify, and a one-per-paragraph limit so lineage footnotes don’t stack. The per-section cap replaced an older global cap of 5 once the bridge began resolving a long tail of predicates; capping globally would have silently dropped the tail.
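The three limits combine into a simple selection loop. The sketch below is an assumption-laden illustration: LineageCandidate and select_lineage are hypothetical, and ranking candidates by influence before applying the limits is a guess at a sensible order, not something the text states.

```python
from dataclasses import dataclass
from typing import List

MAX_PER_SECTION_DEFAULT = 8  # per-section cap named in the text
MIN_INFLUENCE = 0.5          # minimum influence score to qualify


@dataclass
class LineageCandidate:
    metric: str
    paragraph_index: int
    influence: float


def select_lineage(
    candidates: List[LineageCandidate],
    max_per_section: int = MAX_PER_SECTION_DEFAULT,
    min_influence: float = MIN_INFLUENCE,
) -> List[LineageCandidate]:
    """Pick lineage footnotes for one section under the three limits."""
    chosen: List[LineageCandidate] = []
    used_paragraphs = set()
    # Assumption: the most influential candidates are considered first.
    for cand in sorted(candidates, key=lambda c: c.influence, reverse=True):
        if len(chosen) >= max_per_section:
            break
        if cand.influence < min_influence:
            continue
        if cand.paragraph_index in used_paragraphs:
            continue  # one lineage footnote per paragraph
        chosen.append(cand)
        used_paragraphs.add(cand.paragraph_index)
    return chosen


candidates = [
    LineageCandidate("irr", 0, 0.9),
    LineageCandidate("levered_irr", 0, 0.8),  # same paragraph as irr
    LineageCandidate("dscr", 1, 0.4),         # below the influence floor
    LineageCandidate("cap_rate", 2, 0.6),
]
selected = [c.metric for c in select_lineage(candidates)]
```

Because the cap is applied per call, each section gets its own budget, which is the property the move away from a global cap was meant to secure.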

CRE metric tags are mapped to display labels (irr → IRR, levered_irr → Levered IRR, …) via a _CRE_TAG_LABELS map kept in lockstep with the bridge’s own display map, so a single metric never renders two different spellings across the footnote and visualization paths. The result is a footnote that tells an analyst not just what the IRR is, but which assumptions the model used to get there — turning an opaque spreadsheet output into traceable, defensible evidence.

The chart insertion boundary validator (P32)

Synthesis also inserts charts into the memo as ProseMirror ChartNodes. The last gate every chart passes is format_chart_html() in src/langchain/workflows/tools/synthesis/chart_inserter.py, and it fails closed: a chart that can’t be validated becomes an empty string (a no-op) rather than a blank <div> in the memo.

_validate_for_insertion() enforces two gates:

Schema validity. ChartConfig.model_validate() catches structural errors — missing type, empty series, malformed data points, non-numeric scatter/bubble X labels.
Reader-usefulness invariants. The chart must carry an insightCaption of at least 10 characters and a decisionQuestion of at least 5 characters. A chart with neither a caption that explains what it shows nor a question it helps the reader answer is dropped.
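The fail-closed behavior can be sketched with plain dicts standing in for ChartConfig. Everything here is illustrative: rejection_reason and render_or_drop are hypothetical names, the schema check is reduced to two of the structural errors listed above, both usefulness fields are assumed to be required, and the returned markup is a placeholder rather than the real ChartNode encoding.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

MIN_CAPTION_CHARS = 10   # insightCaption minimum from the text
MIN_QUESTION_CHARS = 5   # decisionQuestion minimum from the text


def rejection_reason(chart: Dict[str, Any]) -> Optional[str]:
    """Return why a chart must be dropped, or None if it passes both gates."""
    # Gate 1 (simplified stand-in for schema validation).
    if not chart.get("type"):
        return "missing type"
    if not chart.get("series"):
        return "empty series"
    # Gate 2: reader-usefulness invariants (assumed: both are required).
    if len((chart.get("insightCaption") or "").strip()) < MIN_CAPTION_CHARS:
        return "insightCaption shorter than 10 characters"
    if len((chart.get("decisionQuestion") or "").strip()) < MIN_QUESTION_CHARS:
        return "decisionQuestion shorter than 5 characters"
    return None


def render_or_drop(chart: Dict[str, Any]) -> str:
    """Fail closed: an invalid chart becomes an empty string, logged at ERROR."""
    reason = rejection_reason(chart)
    if reason is not None:
        logger.error("chart %s dropped: %s", chart.get("suggested_id"), reason)
        return ""
    return f'<div data-chart-id="{chart.get("suggested_id")}"></div>'  # placeholder


good_chart = {
    "suggested_id": "returns_waterfall",
    "type": "bar",
    "series": [{"name": "IRR", "data": [12.1, 14.5]}],
    "insightCaption": "Levered IRR clears the hurdle in the base case.",
    "decisionQuestion": "Does the deal meet the return target?",
}
bad_chart = dict(good_chart, insightCaption="IRR")
```

Returning an empty string rather than raising keeps one bad chart from blocking the memo, while the ERROR log preserves the trail back to the builder.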

Every drop is logged at ERROR (promoted from WARNING after charts with real data silently disappeared because production log filters hid WARNING), naming the chart’s suggested_id and the specific reason, so a missing chart is always traceable to its upstream builder.

Decision questions drive a distinctness filter

The decisionQuestion field is not only a gate — it is also how select_charts_for_section() avoids redundancy. Candidate charts are sorted by confidence (with a density penalty as a soft tiebreaker), then a distinctness filter drops any later candidate whose decisionQuestion (case-insensitive, whitespace-tolerant) duplicates an already-selected chart. Two charts that answer the same IC question collapse to one.
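The distinctness filter can be sketched as a sort followed by a seen-set. The sketch assumes charts are dicts with confidence and decisionQuestion keys, and it omits the density-penalty tiebreaker and the protected exception for brevity; filter_distinct is a hypothetical name, not select_charts_for_section() itself.

```python
from typing import Any, Dict, List


def normalize_question(question: str) -> str:
    """Case-insensitive, whitespace-tolerant key for a decision question."""
    return " ".join(question.lower().split())


def filter_distinct(charts: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the highest-confidence chart for each distinct decision question."""
    ranked = sorted(charts, key=lambda c: c["confidence"], reverse=True)
    seen = set()
    kept: List[Dict[str, Any]] = []
    for chart in ranked:
        key = normalize_question(chart["decisionQuestion"])
        if key in seen:
            continue  # a better-ranked chart already answers this question
        seen.add(key)
        kept.append(chart)
    return kept


charts = [
    {"id": "a", "confidence": 0.7,
     "decisionQuestion": "Does the deal clear the return hurdle?"},
    {"id": "b", "confidence": 0.9,
     "decisionQuestion": "does the deal  clear the return hurdle?"},
    {"id": "c", "confidence": 0.8,
     "decisionQuestion": "Is debt service covered?"},
]
kept_ids = [c["id"] for c in filter_distinct(charts)]
```

Charts a and b ask the same question once case and spacing are normalized, so only the higher-confidence b survives alongside c.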

There is one protected exception: a chart whose ID satisfies a critical canonical IC decision question (returns, debt safety, capital stack, downside protection) is never dropped by distinctness, family-cap, or max-charts truncation — silently dropping it would leave the memo unable to claim it answered that question. Which questions a memo’s inserted charts cover is rolled up into a memo-level decision-coverage scorecard (src/intel/readiness/decision_coverage.py) computed over the full sanitized memo (never the 80KB-truncated synthesis-state copy, whose tail charts would be silently lost), and surfaced to analysts as a quality-and-readiness dimension.

Where this fits

Synthesis and the footnote pipeline are the convergence point of everything upstream — they consume the agents’ analysis, the retrieved evidence, the reranked sources, and the FormulaGraph provenance, and emit a single cited memo. From here the memo enters Canvas, where analysts edit it, collaborate on it, and — once it is approved — export it. Approval-gated exports (the Investor Packet, Canvas PDF) are documented under the approval gate; the per-chart and per-section evidence an analyst sees while editing comes from the citation apparatus built here.

Sources

src/langchain/workflows/tools/final_editor/footnotes/orchestrator.py — footnote pipeline orchestrator; the # STAGE <label>: markers are the canonical stage inventory.
tests/unit/langchain/workflows/tools/final_editor/test_footnote_stage_ordering.py — frozen EXPECTED_STAGE_LABELS (36 stages) and EXPECTED_WATERFALL_STAGES (9 checkpoints); the source of truth for stage order/count.
src/langchain/workflows/tools/final_editor/footnotes/claim_injector.py, deduplicator.py, src_id_normalizer.py, section_quality_scorer.py — the [SRC:n] regex \[SRC:\s*(\d+)\] and the global ID normalizer / collision logic.
src/utils/citation_extractor.py — citation-marker pattern with the \s* fix (production emitted [SRC: n] with a space).
src/langchain/workflows/tools/final_editor/footnotes/citation_injection.py — R47 per-section global_offset re-indexing (reverse-order shift to avoid substring collisions).
src/langchain/workflows/tools/final_editor/footnotes/lineage_injector.py — Stage 2.9 assumption-lineage footnotes; per-section cap (_MAX_PER_SECTION_DEFAULT = 8), min influence 0.5, FormulaGraph-bridge provenance.
src/langchain/workflows/tools/synthesis/chart_inserter.py — format_chart_html / _validate_for_insertion P32 boundary validator (insightCaption ≥ 10, decisionQuestion ≥ 5) and select_charts_for_section distinctness filter.
src/langchain/workflows/tools/synthesis/orchestration/synthesis_pipeline.py and orchestrators/components/agent_executor.py — R4.1 state/Store divergence fix + 3-layer silent-failure defense.
src/llm/schemas/synthesis_executive_summary_schemas.py and src/langchain/workflows/tools/shared/structured_metric_validation.py — structured canonical-metric fields + prose validators; STRUCTURED_MISMATCH_THRESHOLD = 0.20.
src/intel/readiness/decision_coverage.py, decision_question_catalog.py — memo-level decision-coverage rollup and the 10 canonical IC questions.
Native memory: footnote_system.md (stage pipeline + two-hop alias routing + global re-indexing), synthesis_resilience_patterns.md (R4.1/R2/R5/R7), irr_substitution_leak_paths.md (structured-output enforcement), chart_decision_coverage.md (P32 scorecard).