Document Processing

A deal arrives as a pile of heterogeneous files: a sponsor’s PDF offering memorandum, an underwriting model in Excel, and one or more CoStar market exports. Memosa’s processor layer (src/processors/) turns each into namespaced, source-tagged chunks ready for retrieval. The three document families take three very different paths — a PDF is a render-and-read problem, an Excel workbook is a structure-extraction problem, and a CoStar report is a classify-and-table problem — but they converge on the same contract: every chunk is written into the deal’s Pinecone namespace (see Namespacing and Tenant Isolation) and tagged with its DataSource so the precedence ladder can resolve conflicts downstream.

This page focuses on what is non-obvious and production-hardened: the Excel LLM header rescue (when the model is allowed to infer a sheet’s structure, and the budget that bounds it) and the two-layer Excel validation stack (a heuristic cross-tab validator plus an additive, default-off SQL invariant layer). The deep mechanics of Excel formula evaluation live on the Formula Graph page; the retrieval side lives on Retrieval Pipeline.

The three processors

Source	Entry point	What it does
PDF	`PDFProcessor.process_pdf(self, pinecone_namespace, thread_ts, pdf_path, …)` in `src/processors/pdf/pdf_processor.py`	Renders pages, profiles them, runs OCR strategies and image discovery/classification, builds chunks, persists image artifacts. The page-render pipeline is what powers Image Intelligence.
Excel	`ExcelProcessor.process_excel_async(…)` in `src/processors/excel/excel_processor.py`	Classifies sheets, parses each financial tab with a family-specific parser, builds a formula graph, runs cross-tab validation, and vectorizes the extracted metrics + raw sheets.
CoStar	`CoStarProcessor.process_costar_report(…)` in `src/processors/costar/costar_processor.py`	Classifies pages, parses tables and charts, serializes visualization context, and builds chunks.

All three take the deal’s pinecone_namespace as a required argument — there is no path that writes chunks outside the deal boundary.

The Excel pipeline is by far the most involved, because an institutional underwriting model is a dense, idiosyncratic artifact. A model of this scale is the normal workload, not an outlier: a standard model such as 151 Avenue A runs roughly 21 sheets and on the order of 119,000 formula-graph nodes. The rest of this page is about the two things that make Excel extraction robust against the variety of real workbooks: header rescue and layered validation.

Excel LLM header rescue

Memosa’s Excel parsers are heuristic first. Each financial family — rent roll, OpEx, debt schedule, waterfall, sensitivity grid, assumptions, construction draws — has a dedicated parser that finds its header row and columns by matching against an alias registry. For the canonical case (an EquityMultiple-style “EM” model with standard headers), the heuristic ColumnMatcher succeeds outright and no LLM is ever invoked.

The problem is non-canonical workbooks. A sponsor’s model might label its rent column “Monthly Scheduled” instead of anything the alias table knows, and the heuristic parser silently returns nothing — leaving a tab unparsed, which cascades into thinner cross-tab coverage and a lower model-integrity score. LLM header rescue is the second chance: when the heuristic returns nothing on a sheet the classifier has already endorsed as belonging to a given family, the parser calls an LLM inferencer to read a snapshot of the sheet and propose its structure.
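A minimal sketch of alias-based header matching, showing why an unknown label such as "Monthly Scheduled" leaves the heuristic with nothing. The alias table, the `find_header` helper, and its `min_matches` / `max_scan` parameters are hypothetical illustrations, not the real ColumnMatcher API.

```python
from typing import Optional

# Hypothetical alias registry: canonical field -> accepted header spellings.
ALIASES = {
    "unit": {"unit", "unit #", "suite"},
    "monthly_rent": {"monthly rent", "rent", "contract rent"},
}


def _norm(cell) -> str:
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join(str(cell).strip().lower().split()) if cell is not None else ""


def find_header(rows, min_matches: int = 2, max_scan: int = 20) -> Optional[tuple]:
    """Return (header_row, {field: column_index}) or None if no row qualifies."""
    for r, row in enumerate(rows[:max_scan]):
        mapping = {}
        for c, cell in enumerate(row):
            text = _norm(cell)
            for field, names in ALIASES.items():
                if text in names and field not in mapping:
                    mapping[field] = c
        if len(mapping) >= min_matches:
            return r, mapping
    return None  # the case that makes a hinted sheet a rescue candidate


canonical = [["Rent Roll", None], ["Unit #", "Monthly Rent"], ["101", 2500]]
odd = [["Unit #", "Monthly Scheduled"], ["101", 2500]]
```

Here `find_header(canonical)` resolves the header on row 1, while `find_header(odd)` returns `None` because only one alias matches.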

When rescue fires — and when it cannot

Rescue is deliberately narrow. It fires only on the classifier-routed, hinted-sheet path — the second-chance retry the orchestrator runs after a parser’s first-pass heuristic fails on a sheet the classifier flagged. It does not run during the first-pass alias iteration, and it never runs on sheets the classifier didn’t hint. The gate is structural: only the hinted-sheet path is given a budget tracker, and every rescue facade short-circuits to a no-op when the budget is None.

The consequence that matters for safety: EM workbooks have a zero regression surface. Their sheets resolve via strict alias match on the first pass, so they never reach the rescue path — and golden tests assert the LLM budget is untouched (budget.used == 0) even when an LLM factory is wired. The same is true of any well-formed non-EM workbook: if the heuristic finds the structure, the LLM is never consulted.
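The structural gate can be sketched as follows. `Budget`, `try_rescue`, and `parse_sheet` are hypothetical stand-ins for the real tracker and facades; the only behavior taken from the text above is that a `None` budget makes rescue a no-op and that a successful heuristic never touches the budget.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Budget:
    """Hypothetical stand-in for the real budget tracker."""
    used: int = 0


def try_rescue(budget: Optional[Budget],
               infer: Callable[[], Optional[dict]]) -> Optional[dict]:
    """Illustrative facade: a no-op unless a budget tracker was supplied."""
    if budget is None:  # first pass or unhinted sheet: never call the LLM
        return None
    result = infer()
    if result is not None:
        budget.used += 1
    return result


def parse_sheet(heuristic: Callable[[], Optional[dict]],
                infer: Callable[[], Optional[dict]],
                budget: Optional[Budget]) -> Optional[dict]:
    """Heuristic first; rescue is consulted only if the heuristic found nothing."""
    parsed = heuristic()
    if parsed is not None:
        return parsed
    return try_rescue(budget, infer)
```

A golden-style check under these assumptions would parse a canonical sheet with an inferencer wired and then assert `budget.used == 0`.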

The seven inferencer families

Rescue is implemented as per-family inferencers in src/processors/excel/vlm/, each validating the LLM’s JSON output before the parser trusts it. There are seven structural families:

Family	Rescues	Returns
`column_header`	The `ColumnMatcher`-family parsers (rent roll, capex, debt schedule, construction draw)	`(header_row, column_mapping)`
`year_row`	The cash-flow parser	An annual `PeriodHeader`
`amount_column`	The OpEx parser	`(amount_col, year)`
`waterfall`	The waterfall parser	Promote-tier dicts
`sensitivity_grid`	The sensitivity parser	`{axes, metric_names}`
`assumptions_kv`	The assumptions parser	`(label, value)` pairs
`construction_finance`	The construction-loan / GMP / completion-guarantee path	GMP value, retainage %, guarantee type, interest reserve

column_header is the highest-ROI family — four production parsers route through it. Each inferencer enforces its own validation gates (header-row bounds, column-index bounds, no duplicate column assignments, a min_matches floor, required_any presence) before returning; a result that fails any gate becomes a None return, never an exception.
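The gates listed above might look roughly like this for the column_header family. The function name, the JSON key names (`header_row`, `column_mapping`), and the default `min_matches` value are assumptions for illustration; only the list of gates and the "None, never an exception" contract come from this page.

```python
from typing import Optional


def validate_column_header(
    raw,
    n_rows: int,
    n_cols: int,
    min_matches: int = 2,                      # hypothetical default
    required_any: frozenset = frozenset(),
) -> Optional[tuple]:
    """Return (header_row, column_mapping), or None if any gate fails."""
    try:
        header_row = raw["header_row"]
        mapping = raw["column_mapping"]
        if not isinstance(header_row, int) or isinstance(header_row, bool):
            return None
        if not isinstance(mapping, dict):
            return None
        if not 0 <= header_row < n_rows:                      # header-row bounds
            return None
        cols = list(mapping.values())
        for c in cols:                                        # column-index bounds
            if not isinstance(c, int) or isinstance(c, bool) or not 0 <= c < n_cols:
                return None
        if len(set(cols)) != len(cols):                       # no duplicate columns
            return None
        if len(mapping) < min_matches:                        # min_matches floor
            return None
        if required_any and not (required_any & set(mapping)):  # required_any
            return None
        return header_row, dict(mapping)
    except Exception:
        # Malformed model output must surface as None, never as an exception.
        return None
```

Returning `None` on every failure path lets the calling parser treat "rescue failed" and "rescue unavailable" identically and fall back to heuristic-only behavior.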

The confidence ceiling

A successful rescue carries a confidence score, and that score is capped at 0.85. The ceiling sits below the sheet-type classifier’s 0.90 ceiling on purpose: inferring a sheet’s internal structure is a lower-trust act than classifying its type, so an LLM-rescued extraction can never present itself as more certain than 0.85. Every inferencer clamps its output through clamp_confidence(), so the ceiling holds regardless of what the model returns.
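A sketch of what such a clamp could look like. The names `MAX_CONFIDENCE` and `clamp_confidence` appear in the Sources list, but the body here is an assumption; in particular, mapping unparseable or NaN values to 0.0 is an illustrative choice, not confirmed behavior.

```python
import math

MAX_CONFIDENCE = 0.85  # below the sheet-type classifier's 0.90 ceiling


def clamp_confidence(value) -> float:
    """Clamp a model-reported confidence into [0.0, MAX_CONFIDENCE]."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        return 0.0  # assumption: unparseable confidence is treated as zero
    if math.isnan(score):
        return 0.0  # assumption: NaN is treated as zero
    return max(0.0, min(score, MAX_CONFIDENCE))
```

Under this sketch a model that claims 0.99 is recorded at 0.85, and a model that returns a non-numeric string is recorded at 0.0.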

The budget that bounds it

Letting an LLM into the parse loop introduces a cost-and-latency risk: a pathological workbook could trigger an unbounded series of inference calls. The budget tracker (_LLMBudgetTracker in src/processors/excel/header_scanning/llm_rescue.py) bounds this on three independent axes:

Axis	Limit	Meaning
Per-family cap	5	No single family can spend more than 5 successful rescues — one family can’t monopolize the budget.
Global success budget	20	Hard ceiling on total successful rescues per workbook — cost-bounded.
Global failure budget	8	Separate ceiling on validation-failed attempts — the pathological-workbook trap.

The split between a success budget and a failure budget is the key design point. Failed attempts (the LLM returned unparseable JSON, or the result flunked a validation gate) charge the failure budget, not the success budget — so a workbook that legitimately needs many rescues isn’t penalized by a few early misfires, while a workbook where the LLM keeps returning garbage hits the failure ceiling at 8 attempts and stops calling. The tracker is shared across parsers running concurrently in a thread pool and guards its counters with a lock, so the budget invariants hold under concurrency.
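The three axes and the lock can be sketched as below. The constants match the limits in the table; the class name, the reserve/settle protocol, and counting in-flight attempts against the ceilings are assumptions made so the sketch stays correct under concurrency. The real _LLMBudgetTracker may be organized differently.

```python
import threading
from collections import Counter

PER_FAMILY_CAP = 5
GLOBAL_SUCCESS_BUDGET = 20
GLOBAL_FAILURE_BUDGET = 8


class BudgetTrackerSketch:
    """Illustrative three-axis budget; not the real _LLMBudgetTracker."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._family_successes = Counter()
        self._in_flight = Counter()
        self.successes = 0
        self.failures = 0

    def try_reserve(self, family: str) -> bool:
        """Reserve one attempt, or refuse if any axis is exhausted."""
        with self._lock:
            pending = sum(self._in_flight.values())
            if self.failures + pending >= GLOBAL_FAILURE_BUDGET:
                return False
            if self.successes + pending >= GLOBAL_SUCCESS_BUDGET:
                return False
            if (self._family_successes[family] + self._in_flight[family]
                    >= PER_FAMILY_CAP):
                return False
            self._in_flight[family] += 1
            return True

    def settle(self, family: str, succeeded: bool) -> None:
        """Convert a reservation into a success or a failure charge."""
        with self._lock:
            self._in_flight[family] -= 1
            if succeeded:
                self._family_successes[family] += 1
                self.successes += 1
            else:
                self.failures += 1  # failures never charge the success budget
```

Reserving before the LLM call and settling afterwards keeps the lock out of the slow path: the lock is held only for counter arithmetic, never across an inference request.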

Provenance

Rescue is never silent. Every successful rescue:

Appends a provenance note — LLM header inference applied to '<tab>' (parser=<p>, family=<f>, confidence=<c>) — which excel_processor.py merges into ExcelKeyMetrics.normalization_notes.
Emits an INFO log line of the same form (grep "LLM header inference applied" for staging telemetry).
Contributes to a per-workbook EXCEL_LLM_RESCUE_SUMMARY event that rolls up per-family call counts and timings — emitted even when zero rescues fired, so the absence of rescue is observable.

Exhaustion (a hinted sheet that wanted rescue but had no budget or no inferencer) emits a one-shot WARNING naming the reason and falls back to heuristic-only.
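A small sketch of the note-plus-log pattern described above. The note text follows the form quoted on this page; the helper names, the logger name, and the two-decimal confidence formatting are assumptions.

```python
import logging

logger = logging.getLogger("excel.llm_rescue")  # hypothetical logger name


def format_rescue_note(tab: str, parser: str, family: str, confidence: float) -> str:
    """Build the provenance note in the form quoted above."""
    return (f"LLM header inference applied to '{tab}' "
            f"(parser={parser}, family={family}, confidence={confidence:.2f})")


def record_rescue(notes: list, tab: str, parser: str,
                  family: str, confidence: float) -> str:
    """Append the note for later merging and emit the same text at INFO."""
    note = format_rescue_note(tab, parser, family, confidence)
    notes.append(note)
    logger.info(note)
    return note
```

Because the note and the log line share one formatter, a grep for "LLM header inference applied" matches exactly the rescues that also appear in the normalization notes.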

Two-layer Excel validation

Once the tabs are parsed, Memosa cross-checks them for internal consistency — does the NOI on the cash-flow tab reconcile with effective gross revenue minus OpEx? Do the debt-schedule balances roll forward correctly? This is cross-tab validation, and it runs in two layers that fire side by side inside one function: run_cross_tab_validation() in src/processors/excel/tab_orchestrator.py.

Layer 1 — the heuristic cross-tab validator (always on)

validate_cross_tab_consistency() in src/processors/excel/cross_tab_validator.py is the established validator. It applies 23 rules and carries hard-won, distinctly non-relational logic:

Unit-mismatch detection — spotting 12× / 0.083× ratios that betray a monthly-vs-annual confusion.
Scale-misparse band — an 800–1250× window that catches a thousands-vs-units misread.
Confidence-weighted severity — downgrading a discrepancy’s severity when the underlying parse confidence was low.

It returns a CrossTabValidationResult carrying validation_passed, the list of discrepancies, and checks_performed. This layer is always active — it is core to the model-integrity score.
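The ratio checks and the severity downgrade can be illustrated with a short sketch. The 12× ratio and the 800–1250× window come from the list above; the function names, the 2% tolerance around 12×, the folding of inverse ratios into one test, the severity labels, and the 0.6 confidence floor are all hypothetical.

```python
from typing import Optional

SEVERITIES = ["LOW", "MEDIUM", "HIGH"]  # hypothetical ordering


def classify_ratio(a: float, b: float, rel_tol: float = 0.02) -> Optional[str]:
    """Label a suspicious ratio between two values that should agree."""
    if a == 0 or b == 0:
        return None
    ratio = abs(a / b)
    if ratio < 1:
        ratio = 1 / ratio  # direction-agnostic: 0.083x is treated like 12x
    if abs(ratio - 12.0) <= 12.0 * rel_tol:
        return "unit_mismatch_monthly_vs_annual"
    if 800.0 <= ratio <= 1250.0:
        return "scale_misparse_thousands"
    return None


def weighted_severity(severity: str, parse_confidence: float,
                      floor: float = 0.6) -> str:
    """Downgrade one step when the underlying parse confidence was low."""
    idx = SEVERITIES.index(severity)
    if parse_confidence < floor and idx > 0:
        idx -= 1
    return SEVERITIES[idx]
```

For example, an annual figure of 1,200,000 compared against a monthly 100,000 is labeled a unit mismatch rather than a genuine discrepancy, and a HIGH finding on a tab parsed at 0.4 confidence is reported as MEDIUM.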

Layer 2 — the SQL invariant validator (additive, default-off)

SqlInvariantValidator in src/processors/excel/sql_invariant_validator.py is a newer layer that runs a battery of SQL checks over the parsed tabs using an in-memory DuckDB connection. It is layered on top of the heuristic validator, not in place of it. The two are complementary: the SQL invariants target gaps the heuristic doesn’t cover — multi-year reconciliation (the heuristic checks year 1 only), per-tranche debt continuity, sensitivity-grid monotonicity, per-line-item outliers, non-negativity.

The validator iterates ALL_INVARIANTS from src/processors/excel/sql_invariant_definitions.py, which currently holds 20 invariants — the original twelve (INV-01…INV-12) plus a v2 batch (INV-13…INV-20) added in May 2026 for property-facts cross-checks, per-tranche debt math, capex reconciliation, exit-cap derivation, and sensitivity centering. Each invariant declares the parser tables it needs; if a required table is absent (the workbook had no waterfall tab, say), that invariant is marked skipped rather than run. Per finding, the status is one of passed, violated, skipped, or error, and every finding — including skips — is written to the sql_invariant_findings observability table so trend queries can scope to comparable denominators.
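The skip-when-table-absent behavior and the four statuses can be sketched as follows. To stay runnable with the standard library alone, this sketch substitutes an in-memory sqlite3 connection for DuckDB; the `Invariant` shape, the `DEMO-*` identifiers, and the "one row per violation" query convention are hypothetical and do not reproduce any real INV-xx definition.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Invariant:
    inv_id: str
    required_tables: tuple
    violation_sql: str  # assumption: query returns one row per violation


def run_invariants(conn: sqlite3.Connection, invariants) -> list:
    """Return (inv_id, status) with status in passed/violated/skipped/error."""
    present = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    findings = []
    for inv in invariants:
        if not set(inv.required_tables) <= present:
            findings.append((inv.inv_id, "skipped"))  # record skips, too
            continue
        try:
            rows = conn.execute(inv.violation_sql).fetchall()
            findings.append((inv.inv_id, "violated" if rows else "passed"))
        except sqlite3.Error:
            findings.append((inv.inv_id, "error"))
    return findings


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opex (line_item TEXT, amount REAL)")
conn.executemany("INSERT INTO opex VALUES (?, ?)",
                 [("Taxes", 120000.0), ("Insurance", -5.0)])
demo = [
    Invariant("DEMO-NONNEG", ("opex",), "SELECT line_item FROM opex WHERE amount < 0"),
    Invariant("DEMO-PASS", ("opex",), "SELECT 1 FROM opex WHERE amount > 1e12"),
    Invariant("DEMO-WATERFALL", ("waterfall",), "SELECT 1 FROM waterfall"),
    Invariant("DEMO-BROKEN", ("opex",), "SELECT missing_col FROM opex"),
]
findings = run_invariants(conn, demo)
```

Recording the skipped row for `DEMO-WATERFALL` is what lets a trend query divide violations by the number of workbooks where the invariant could actually run.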

It is opt-in, and it never blocks

Two properties make the SQL layer safe to ship before its tolerances are fully tuned:

Default-off. SqlInvariantConfig.enabled defaults to False; it is turned on per environment by setting SQL_INVARIANTS_ENABLED=true. When disabled, validate() short-circuits before any table construction, returning immediately with disabled=True. This is an emergency-disable switch and a per-environment rollout control, not a per-deal allowlist: the intended sequence is ship-disabled → enable on staging → soak and tune tolerances against real distributions → enable globally in production.
Never raises. validate() wraps its entire body in a top-level try/except. Any exception is logged (at TRANSIENT severity) and surfaced as a validator-level error inside the result — the Excel pipeline keeps going. The SQL layer is observability; it must never break a parse.

The SQL layer reports validation_passed=False only when there is a HIGH-severity violation; lower-severity findings are recorded but don’t fail the banner.
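The three properties above (default-off, never raises, HIGH-only failure) fit in a short sketch. `ConfigSketch`, `ResultSketch`, the `run_checks` callable, and the case-insensitive parsing of the environment variable are assumptions; the real code logs at its own TRANSIENT severity, for which `logger.warning` is only a stand-in here.

```python
import logging
import os
from dataclasses import dataclass, field
from typing import Callable, Optional

logger = logging.getLogger("excel.sql_invariants")  # hypothetical logger name


@dataclass
class ConfigSketch:
    enabled: bool = False  # default-off

    @classmethod
    def from_env(cls, env=None) -> "ConfigSketch":
        env = os.environ if env is None else env
        raw = str(env.get("SQL_INVARIANTS_ENABLED", "")).strip().lower()
        return cls(enabled=(raw == "true"))


@dataclass
class ResultSketch:
    disabled: bool = False
    violations: list = field(default_factory=list)  # (inv_id, severity) pairs
    validator_error: Optional[str] = None

    @property
    def validation_passed(self) -> bool:
        # Only a HIGH-severity violation fails the banner.
        return not any(sev == "HIGH" for _, sev in self.violations)


def validate(config: ConfigSketch, run_checks: Callable[[], list]) -> ResultSketch:
    if not config.enabled:
        return ResultSketch(disabled=True)  # short-circuit before any work
    try:
        return ResultSketch(violations=list(run_checks()))
    except Exception as exc:
        logger.warning("sql invariant validation failed: %r", exc)
        return ResultSketch(validator_error=repr(exc))  # surfaced, not raised
```

With this shape the caller never needs its own guard around the SQL layer's logic: a disabled layer, a crashed layer, and a clean run all come back as an ordinary result object.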

How the two layers compose

Inside run_cross_tab_validation(), the SQL layer runs first (when wired and enabled), then the heuristic validator runs unconditionally. Each is wrapped in its own try/except, so a failure in one never blocks the other. The function returns both results together (a CrossTabValidationOutput carrying the heuristic CrossTabValidationResult and the optional SqlInvariantResult); the caller lands both on the immutable TabOrchestrationResult (cross_tab_validation and sql_invariant_validation). The SQL phase emits its own telemetry line so its cost is visible in production:

EXCEL_PHASE: phase=sql_invariant_validation elapsed_ms=<N> thread_ts=<TS> violations=<N> skipped=<N> errored=<N> disabled=<bool>

Warm cost is on the order of single-digit milliseconds per workbook — negligible against the Excel parse envelope — because the DuckDB module is already resident (the observability stack loads it) and the per-call connection is built fresh and torn down.
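The composition described in this section can be sketched as two independently guarded calls plus a timed telemetry line. The function and class names, the dict shape returned by the SQL callable, and the rendering of `disabled` as a Python bool are assumptions; the field order of the line follows the EXCEL_PHASE format shown above.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class OutputSketch:
    heuristic: Optional[Any]
    sql: Optional[dict]
    telemetry: Optional[str]


def run_cross_tab_sketch(heuristic_validate: Callable[[], Any],
                         sql_validate: Optional[Callable[[], dict]],
                         thread_ts: str) -> OutputSketch:
    sql_result, telemetry = None, None
    if sql_validate is not None:  # SQL layer runs first, when wired
        start = time.perf_counter()
        try:
            sql_result = sql_validate()
            elapsed_ms = int((time.perf_counter() - start) * 1000)
            telemetry = (
                "EXCEL_PHASE: phase=sql_invariant_validation "
                f"elapsed_ms={elapsed_ms} thread_ts={thread_ts} "
                f"violations={sql_result['violations']} "
                f"skipped={sql_result['skipped']} "
                f"errored={sql_result['errored']} "
                f"disabled={sql_result['disabled']}"
            )
        except Exception:
            sql_result, telemetry = None, None  # never blocks the heuristic
    heuristic_result = None
    try:  # heuristic validator runs unconditionally
        heuristic_result = heuristic_validate()
    except Exception:
        heuristic_result = None  # never blocks returning the SQL result
    return OutputSketch(heuristic_result, sql_result, telemetry)
```

Keeping the two try/except blocks separate is what makes the layers independent: each result slot is filled or left empty without affecting the other.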

PDF and CoStar in brief

The PDF processor is a page-render pipeline: it rasterizes pages, profiles each one, runs OCR where text extraction is thin, discovers and classifies embedded images (charts, site plans, diagrams), persists image artifacts to the store, and builds text chunks. Its output feeds both retrieval and the Image Intelligence subsystem.

The CoStar processor classifies each page of a market export, parses the structured tables and charts it finds, serializes the visualization context so charts are retrievable as text, and builds chunks tagged as CoStar (precedence weight 70 — authoritative for market facts, above the sponsor PDF). Both processors, like Excel, write exclusively into the deal’s namespace.

Sources

src/processors/excel/header_scanning/llm_rescue.py — _LLMBudgetTracker (three-axis budget: PER_FAMILY_CAP=5, GLOBAL_SUCCESS_BUDGET=20, GLOBAL_FAILURE_BUDGET=8), the seven try_*_rescue family facades, provenance notes, and the EXCEL_LLM_RESCUE_SUMMARY roll-up.
src/processors/excel/vlm/ — the seven per-family inferencers; vlm/llm_header_inferencer.py defines MAX_CONFIDENCE = 0.85 and clamp_confidence().
src/processors/excel/cross_tab_validator.py — validate_cross_tab_consistency(): the 23-rule heuristic validator with unit-mismatch / scale-misparse / confidence-weighted severity logic.
src/processors/excel/sql_invariant_validator.py — SqlInvariantValidator.validate(): the default-off, never-raises SQL layer; the disabled short-circuit and the HIGH-severity validation_passed rule.
src/processors/excel/sql_invariant_definitions.py — ALL_INVARIANTS (20 invariants: INV-01…INV-12 plus v2 INV-13…INV-20).
src/processors/excel/sql_invariant_contracts.py — SqlInvariantConfig (enabled defaults to False; reads SQL_INVARIANTS_ENABLED), Severity, InvariantStatus.
src/processors/excel/tab_orchestrator.py — run_cross_tab_validation(): the single hook point where both validators fire side by side, and the EXCEL_PHASE: phase=sql_invariant_validation telemetry.
src/processors/pdf/pdf_processor.py — PDFProcessor.process_pdf() (the namespaced PDF render pipeline).
src/processors/costar/costar_processor.py — CoStarProcessor.process_costar_report() (CoStar classify + table/chart parse).
src/processors/excel/formula_graph/graph_builder.py — the @lru_cache(maxsize=32768)-memoized _expand_range and the 100,000-cell hard ceiling referenced for workbook scale (full detail on the Formula Graph page).
src/di/document_processors.py — wires the LLM factory into ExcelProcessor; llm_factory=None makes rescue a no-op.
Native memory: excel_llm_header_rescue.md, sql_invariant_validator.md; .claude/rules/20-patterns.md “SQL Invariant Validation Layer”.