FormulaGraph

FormulaGraph is the dependency graph of every formula in a deal’s underwriting workbook. When an analyst uploads an Excel model, Memosa parses each sheet, extracts the cell references inside every formula, and builds a directed graph whose nodes are cells and whose edges are “this cell feeds that cell.” That graph is what lets the system answer questions a flat parse never could: where does this IRR come from?, which assumptions would move if I change the exit cap rate?, is there a circular reference hiding in the debt schedule?

The graph is the spine of two product surfaces:

  • Metric Ripple — when an analyst changes a headline number the night before IC, the engine walks the FormulaGraph to find exactly which sections depend on that cell and propagates the change surgically instead of regenerating the whole memo.
  • Cell-level financial provenance — footnotes, the executive summary’s key-driver block, and the Investor Packet’s critical-assumptions list all trace headline metrics back to their root input cells through the graph, rather than asking an LLM to guess the lineage.

This page covers how the graph is built (out-of-process, with hard memory and time guards), the load-bearing range-expansion cache, circular-reference detection, the watchdog that keeps a stuck build from wedging the worker, and the bridge that exposes the live graph to consumers outside the financial pipeline.

Building the graph for a real institutional workbook is memory- and CPU-heavy: it loads the workbook twice (a values pass and a formula pass via openpyxl), walks every formula, expands ranges, and finalizes a compact adjacency representation. Cold-cache, the openpyxl load alone drives a ~1.17 GB RSS delta on the reference workbook. If that work ran in the main LangGraph worker process, a single heavy build could starve every other deal sharing the worker — and a pathological formula could wedge the whole process.

So the build is isolated in a bounded pool of long-running subprocess slots. The pool (src/utils/formula_graph_pool.py) owns pool_size slots, each a dedicated subprocess connected over a Unix domain socket, plus a FIFO wait queue of depth queue_depth. A caller acquires a slot, runs one build_full request on it, and the slot is reset and returned to the idle set on context exit:

from src.utils.formula_graph_pool import get_formula_graph_pool

pool = get_formula_graph_pool()
with pool.acquire() as slot:
    # One build_full request per acquisition; the slot is reset on context exit.
    result = slot.build_full(excel_path, all_sheet_names, thread_ts)

The pool is invoked from tab_orchestrator.py during Excel processing. When the pool is exhausted (all slots busy and the queue full), or its circuit breaker has tripped, the caller falls through to an in-process build as a backstop — but the subprocess path is the primary one.
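
The fall-through described above can be sketched as follows. This is a minimal illustration, not the code in tab_orchestrator.py: FakePool, FakeSlot, build_in_process, and build_graph are hypothetical stand-ins, and only the exception names PoolBusy and PoolCircuitOpen come from the pool module.

```python
from contextlib import contextmanager


class PoolBusy(Exception):
    """Stand-in for the pool's 'no slot available' error."""


class PoolCircuitOpen(PoolBusy):
    """Stand-in: subclassing PoolBusy lets one handler cover both cases."""


class FakeSlot:
    """Hypothetical slot double that pretends to build in a subprocess."""

    def build_full(self, excel_path, sheet_names, thread_ts):
        return {"ok": True, "path": "subprocess", "sheets": list(sheet_names)}


class FakePool:
    """Hypothetical pool double; fail_with simulates exhaustion or an open breaker."""

    def __init__(self, fail_with=None):
        self.fail_with = fail_with

    @contextmanager
    def acquire(self):
        if self.fail_with is not None:
            raise self.fail_with
        yield FakeSlot()


def build_in_process(excel_path, sheet_names):
    # Hypothetical backstop; the real in-process builder is not shown on this page.
    return {"ok": True, "path": "in_process", "sheets": list(sheet_names)}


def build_graph(pool, excel_path, sheet_names, thread_ts):
    """Prefer the subprocess slot; fall back in-process when the pool refuses."""
    try:
        with pool.acquire() as slot:
            return slot.build_full(excel_path, sheet_names, thread_ts)
    except PoolBusy:
        return build_in_process(excel_path, sheet_names)
```

Catching the base class keeps the fallback handler unchanged when a more specific refusal reason is added later.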

Memory isolation is enforced by the OS, not by Python

Each slot subprocess runs under a Linux RLIMIT_AS address-space cap of subprocess_address_limit_mb. This is the load-bearing memory bound. Without it, an over-allocation ends in a kernel SIGKILL, which the parent sees only as a silent socket EOF with no diagnostic. With it, the over-allocation surfaces as a Python MemoryError inside the worker’s exception handler, which returns a structured {"ok": False, "error": ...} response: the slot stays alive and the parent gets a real trace.

The per-slot RSS target and hard-cap labels (slot_peak_rss_target_mb, slot_peak_rss_hard_cap_mb) are observability only — the slot does not self-OOM on them. RLIMIT_AS is the real ceiling; the labels just decide when a slot_rss_hard_cap_exceeded telemetry line fires.
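
The effect of the address-space cap can be reproduced in isolation. The sketch below is Linux-only and is not the slot worker's code: run_capped and the CHILD script are hypothetical, and the 1 GiB cap and 4 GiB allocation are illustrative numbers rather than the configured subprocess_address_limit_mb.

```python
import json
import subprocess
import sys

# Child script: cap its own address space, then attempt an oversized allocation.
CHILD = r"""
import json, resource, sys

limit = int(sys.argv[1])
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)  # the soft limit may not exceed the hard limit
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
try:
    blob = bytearray(int(sys.argv[2]))  # deliberately larger than the cap
    print(json.dumps({"ok": True, "bytes": len(blob)}))
except MemoryError as exc:
    # The failure is an ordinary exception, so a structured reply is possible.
    print(json.dumps({"ok": False, "error": type(exc).__name__}))
"""


def run_capped(limit_bytes, alloc_bytes):
    """Run the child under an RLIMIT_AS cap and return (returncode, reply)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes), str(alloc_bytes)],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return proc.returncode, json.loads(proc.stdout)
```

Calling run_capped(1 << 30, 4 << 30) should return a zero exit code together with an {"ok": False, ...} reply, which is the property the paragraph above relies on: the process survives and reports the failure.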

Slots are recycled when their idle RSS after a build exceeds slot_idle_rss_recycle_mb. Python’s allocator does not reliably return freed memory to the OS, so a standard reference-class build leaves the slot holding ~1.7 GB post-release, which means the slot recycles on essentially every standard build. A second, lower threshold (slot_post_release_rss_refuse_threshold_mb = 700) closes a dead zone where a slot sitting between 700 MB and 1500 MB would silently refuse the next build without getting recycled.

The IPC wire protocol: the live graph never crosses the boundary

The build produces a FormulaGraph whose adjacency is stored as compact NumPy int32 CSR (compressed sparse row) arrays, not Python lists. The wire format between slot and parent is versioned — the slot reports protocol_version in its ping handshake, and the client force-restarts the slot on a mismatch (raising FormulaGraphProtocolMismatch) so a wire-format desync after a deploy can’t corrupt a build.

The framing is multi-buffer: a 4-byte payload length, a 2-byte out-of-band buffer count, the pickle-protocol-5 payload, then each NumPy buffer as an 8-byte length plus raw bytes. Protocol 5’s buffer_callback lets the CSR arrays ride out-of-band as raw byte buffers instead of being serialized inline — this is what keeps the wire size proportional to the graph, not bloated by per-node Python object overhead.
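
A minimal sketch of that frame layout, using only the standard library. The field widths follow the description above; the big-endian byte order, the function names, and the use of array plus pickle.PickleBuffer in place of NumPy arrays are assumptions made to keep the example self-contained.

```python
import pickle
import struct

HEADER = struct.Struct("!IH")   # 4-byte payload length, 2-byte buffer count
BUF_LEN = struct.Struct("!Q")   # 8-byte length before each raw buffer


def encode_frame(obj):
    """Pickle obj with protocol 5 and append its out-of-band buffers."""
    buffers = []
    # list.append returns None, which tells pickle to keep the buffer out-of-band.
    payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
    parts = [HEADER.pack(len(payload), len(buffers)), payload]
    for buf in buffers:
        raw = buf.raw()  # flat byte view of the underlying memory
        parts.append(BUF_LEN.pack(raw.nbytes))
        parts.append(raw)
    return b"".join(parts)


def decode_frame(frame):
    """Inverse of encode_frame; buffers come back as views into frame."""
    view = memoryview(frame)
    payload_len, n_buffers = HEADER.unpack_from(view, 0)
    pos = HEADER.size
    payload = bytes(view[pos:pos + payload_len])
    pos += payload_len
    buffers = []
    for _ in range(n_buffers):
        (size,) = BUF_LEN.unpack_from(view, pos)
        pos += BUF_LEN.size
        buffers.append(view[pos:pos + size])
        pos += size
    return pickle.loads(payload, buffers=buffers)
```

For example, encoding {"offsets": pickle.PickleBuffer(array("i", [0, 2, 5]))} places the twelve raw bytes of the array after the pickle payload instead of inside it, and the decoded value can be loaded back with array.frombytes.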

The important architectural property, as of the current protocol: the live FormulaGraph is never shipped across IPC on the success path. The slot pre-runs everything that needs the live graph — topology mapping, time-series detection, circular-reference and other formula-tier validators, the vector-text body, and the Ripple reachability map — then serializes the graph to small JSON-safe dicts and releases the live graph before sending the response frame. The parent receives graph_serialized, model_topology_serialized, time_series_serialized, bucket_a_findings, vector_text_body, and a ripple reachability map — a payload on the order of ~150 KB, not the ~50 MB the live-graph form once cost. The parent does zero CSR rebuild.

Releasing before send means the slot’s idle RSS shrinks pre-IPC, and it removes a footgun the earlier post-IPC release hook carried (a caller could “forget” to release). The 256 MB frame ceiling (max_frame_size_bytes) is kept high purely as headroom in case a future protocol bump re-introduces buffer-bearing payloads.

Serialization has its own memory escape hatches

Serializing a ~119k-node graph to a dict is itself memory-intensive — the materialized dict has to exist somewhere. Two layers protect the slot’s ~2 GB budget:

  1. Streaming serialize. serialize_formula_graph_streaming writes per-node JSON to a temporary JSONL file during the iteration loop, then reads it back at return time, instead of holding the entire dict in memory at once. The during-loop transient peak drops from a few hundred MB to a few MB. Telemetry field graph_serialize_strategy records in_subprocess_streaming on this (~99%) happy path.
  2. Parent-side deferral. If streaming still hits MemoryError, the slot can defer: it keeps the live graph alive, ships it via the out-of-band pickle path, and the parent — which has far more RSS headroom than the slot’s 2 GB — materializes the dict (deferred_to_parent). A pre-emptive headroom guard refuses even this when free RSS is too tight, falling back to an in-process rebuild on the parent instead (failed_no_pickle_headroom).
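
The streaming layer can be illustrated with a small spill-to-disk helper. This is a sketch of the general technique only: serialize_nodes_streaming and its record shape are hypothetical and do not reproduce serialize_formula_graph_streaming.

```python
import json
import tempfile


def serialize_nodes_streaming(nodes):
    """Spill one JSON line per node to a temp file, then read the dict back.

    nodes is any iterable of (addr, deps) pairs, so the caller can pass a
    generator and never hold a second full copy while iterating.
    """
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as spill:
        for addr, deps in nodes:
            spill.write(json.dumps({"addr": addr, "deps": list(deps)}))
            spill.write("\n")
        spill.seek(0)
        out = {}
        for line in spill:
            record = json.loads(line)
            out[record["addr"]] = {"deps": record["deps"]}
        return out
```

During the write loop only one record is alive at a time; the full dict exists only at return time, after the producer side has finished.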

The two legacy entry points (serialize_formula_graph / deserialize_formula_graph) are retained for Postgres rehydration only — the Canvas Ripple cache cold-stores the graph as a JSON dict, where a pickle would be hostile across schema evolution.

A single formula like =SUM(Overview!M10:N19) references a range, and the builder has to expand that range into the individual cell addresses it touches so it can wire up the edges. Real underwriting workbooks reference the same ranges thousands of times — on the reference workbook, the hottest range (Overview!M10:N19, 20 cells) was referenced by ~68,792 distinct formulas. Expanding it once per formula reference is O(N_formulas × N_refs); expanding it once per unique range is O(N_unique_ranges).

So _expand_range in src/processors/excel/formula_graph/graph_builder.py is memoized with an LRU cache capped at maxsize=32768 entries:

from functools import lru_cache
from typing import List

@lru_cache(maxsize=32768)
def _expand_range(range_ref: str) -> List[str]:
    ...

The 32,768 maxsize is conservative. Most cached entries are single-cell refs with one-element lists; range entries are bounded by a sampling target, so the realistic memory cost is ~10 MB and the worst case is well inside the slot’s address limit.
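
To make the memoization concrete, here is a self-contained sketch of a cached range expander. It is not the body of _expand_range: the regular expression, the helper names, and the tuple return type are assumptions, and it handles only plain A1-style references with an optional sheet prefix.

```python
import re
from functools import lru_cache
from typing import Tuple

_RANGE_RE = re.compile(
    r"^(?:(?P<sheet>[^!]+)!)?(?P<c1>[A-Z]+)(?P<r1>\d+)"
    r"(?::(?P<c2>[A-Z]+)(?P<r2>\d+))?$"
)


def col_to_index(col: str) -> int:
    """'A' -> 1, 'Z' -> 26, 'AA' -> 27."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def index_to_col(n: int) -> str:
    """Inverse of col_to_index."""
    out = ""
    while n:
        n, rem = divmod(n - 1, 26)
        out = chr(ord("A") + rem) + out
    return out


@lru_cache(maxsize=32768)
def expand_range(range_ref: str) -> Tuple[str, ...]:
    """Expand 'Sheet!A1:B2' into its cell addresses, once per unique string."""
    m = _RANGE_RE.match(range_ref)
    if m is None:
        raise ValueError(f"unsupported reference: {range_ref!r}")
    prefix = f"{m['sheet']}!" if m["sheet"] else ""
    c1, r1 = col_to_index(m["c1"]), int(m["r1"])
    c2 = col_to_index(m["c2"]) if m["c2"] else c1
    r2 = int(m["r2"]) if m["r2"] else r1
    c_lo, c_hi = sorted((c1, c2))
    r_lo, r_hi = sorted((r1, r2))
    return tuple(
        f"{prefix}{index_to_col(c)}{r}"
        for r in range(r_lo, r_hi + 1)
        for c in range(c_lo, c_hi + 1)
    )
```

The sketch returns a tuple rather than a list because lru_cache hands every caller the same object, and an immutable result cannot be corrupted by one caller for the next. expand_range.cache_info() reports hits and misses, which is a cheap way to confirm that repeated references to the same range are served from the cache.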

A naive expansion of a pathological reference can be catastrophic. A formula referencing a whole column on a phantom-column sheet (where openpyxl reports max_column = 16384, Excel’s last column XFD) can imply tens of millions of cells. On one incident, the legacy path allocated the full cell list before sampling and locked the walk for ~656 s inside a single expansion.

The fix is a wrapper, _expand_range_aware, that runs two integer-only guards before the memoized lookup so a giant first-hit can’t even start the allocation:

  • Quarantine. If the reference targets a sheet in the process-global _QUARANTINED_SHEETS set (populated at build start from the caller’s skip_sheets), it returns an empty list immediately — cross-sheet refs into a quarantined sheet get no node downstream anyway, so this produces the same graceful-degradation shape without the allocation.
  • Hard cell-count ceiling. If the parsed bounding box exceeds _RANGE_EXPANSION_HARD_CEILING_CELLS cells, it refuses, logs expand_range_refused_giant, and returns empty.

Both guards parse the range bounds with integers only — they allocate nothing — which keeps _expand_range itself pure (same key, same return shape). The ceiling is checked before the cache lookup specifically so a giant-range first hit cannot start the allocation that the cache would otherwise memoize.
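
The two guards can be sketched as a wrapper that does integer arithmetic on the parsed bounds before it ever calls the expander. The names below are hypothetical stand-ins for _expand_range_aware and its module globals, the 100,000-cell ceiling is taken from the source-file summary at the end of this page, and references are assumed to be already normalized to explicit bounds such as Phantom!A1:XFD1048576.

```python
import re

HARD_CEILING_CELLS = 100_000   # assumed ceiling, see lead-in
QUARANTINED_SHEETS = set()     # stand-in for the process-global quarantine set

_BOUNDED = re.compile(r"^(?:([^!]+)!)?([A-Z]+)(\d+):([A-Z]+)(\d+)$")


def _col_index(col):
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def bounding_box(range_ref):
    """Return (sheet, cell_count) without building any cell list."""
    m = _BOUNDED.match(range_ref)
    if m is None:
        # Single-cell reference: one cell, sheet taken from the prefix if any.
        sheet = range_ref.split("!", 1)[0] if "!" in range_ref else None
        return sheet, 1
    sheet, c1, r1, c2, r2 = m.groups()
    width = abs(_col_index(c2) - _col_index(c1)) + 1
    height = abs(int(r2) - int(r1)) + 1
    return sheet, width * height


def expand_range_aware(range_ref, expand, refused):
    """Run both guards, then delegate to the (memoized) expand callable."""
    sheet, cells = bounding_box(range_ref)
    if sheet in QUARANTINED_SHEETS:
        return ()
    if cells > HARD_CEILING_CELLS:
        # Stand-in for the expand_range_refused_giant log line.
        refused.append(("expand_range_refused_giant", range_ref, cells))
        return ()
    return expand(range_ref)
```

Because the refusal happens before expand is called, a giant reference never reaches the cache and so can never be memoized.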

Circular-reference detection without the O(deg²) trap

detect_circular_refs (src/processors/excel/model_quality/circular_ref_detector.py) finds cycles in the dependency graph with an iterative depth-first search using three-colour marking (white = unvisited, gray = on the current path, black = finished). It caps reported cycles (default 10) and DFS depth at 100 so a pathological graph can’t hang it.

The subtlety is performance. FormulaGraph.dependencies_of(addr) materializes a fresh tuple of neighbour addresses on every call — each call does an int-to-string lookup per neighbour off the CSR arrays, which is O(degree) work. The DFS while-loop iterates degree + 1 times per node. If the code called dependencies_of inside the loop body, per-node cost would be O(deg²).

The invariant that prevents this: the DFS stack stores (node_addr, dep_idx, deps) tuples — the deps tuple is materialized once when a node’s frame is pushed and reused for every iteration until the frame pops.

stack: List[Tuple[str, int, Tuple[str, ...]]] = [(start, 0, start_deps)]

When the graph is unavailable, or its tier is values_only (no formula edges to walk), the detector returns an empty list by design — there is no structural fallback, because cycle detection requires the formula reference text. Downstream consumers handle the empty result transparently; the formula_graph_summary.unavailable_reason field carries the why (for example slot_pre_build_rss_refused, size_capped, workbook_no_formulas) so the absence is operationally visible, not silent.
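
The deps-on-stack invariant is easier to see in a complete, if simplified, detector. The sketch below is not detect_circular_refs: find_cycles and its deps_of callable are hypothetical, and the defaults of 10 cycles and depth 100 mirror the caps described above.

```python
WHITE, GRAY, BLACK = 0, 1, 2


def find_cycles(deps_of, nodes, max_cycles=10, max_depth=100):
    """Iterative three-colour DFS; deps_of(addr) is called once per node."""
    colour = {}
    cycles = []
    for start in nodes:
        if colour.get(start, WHITE) != WHITE:
            continue
        colour[start] = GRAY
        path = [start]
        # Each frame carries its own deps tuple, materialized exactly once.
        stack = [(start, 0, tuple(deps_of(start)))]
        while stack:
            node, idx, deps = stack.pop()
            if idx == len(deps):
                colour[node] = BLACK  # every neighbour handled
                path.pop()
                continue
            stack.append((node, idx + 1, deps))  # resume here later
            nxt = deps[idx]
            state = colour.get(nxt, WHITE)
            if state == GRAY:
                # nxt is on the current path, so the path suffix is a cycle.
                cycles.append(path[path.index(nxt):] + [nxt])
                if len(cycles) >= max_cycles:
                    return cycles
            elif state == WHITE and len(path) < max_depth:
                colour[nxt] = GRAY
                path.append(nxt)
                stack.append((nxt, 0, tuple(deps_of(nxt))))
    return cycles
```

Re-pushing the frame with idx + 1 and the same deps tuple is what keeps per-node cost at O(degree): the neighbour list is never rebuilt while the node is on the stack.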

The watchdog: keeping a stuck build from wedging the worker

A subprocess that hangs mid-build is worse than one that crashes: the parent waits, the slot holds its lock, and downstream the deal’s whole Excel grant can be consumed before anything notices. The pool runs a dedicated watchdog thread (_watchdog_loop) that SIGKILLs any slot whose stderr has gone silent during an active build for slot_watchdog_silent_kill_secs (30 s), polling every slot_watchdog_poll_secs (5 s). It is decoupled from the per-slot health thread, which only detects process death after a build returns; the watchdog catches the in-progress hang.

Liveness flows through stderr. The slot emits structured formula_graph_* log lines as it works; the parent’s stderr-reader thread updates _last_heartbeat_ts for any line matching the heartbeat rule. That rule is deliberately broad and level-agnostic — it accepts any line containing "event": "formula_graph_ (or the rolling formula_graph_profile_v1 sampler line), at any log level:

if self._build_active and (
    "formula_graph_profile_v1" in line
    or '"event": "formula_graph_' in line
):
    self._last_heartbeat_ts = time.monotonic()

The 30 s threshold is intentional and is not the knob to turn when the watchdog false-kills — six heartbeat windows fit inside it. The correct response to a false-kill is to add a heartbeat at the newly-silent path. The threshold stays sharp so it still catches a real pathological hang (the motivating incident was a slot wedged for 656 s on one sheet with zero observability).
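
The heartbeat rule and the silence check fit together roughly as below. SlotLiveness is a hypothetical class written for illustration, with an injectable clock so the timing can be exercised without sleeping; the real logic is split between the client's stderr reader and the pool's _watchdog_loop.

```python
import time


def is_heartbeat(line):
    """Level-agnostic match: any formula_graph_* event or the profile sampler."""
    return "formula_graph_profile_v1" in line or '"event": "formula_graph_' in line


class SlotLiveness:
    """Tracks the last heartbeat seen on stderr during an active build."""

    def __init__(self, silent_kill_secs=30.0, clock=time.monotonic):
        self.silent_kill_secs = silent_kill_secs
        self._clock = clock
        self.build_active = False
        self.last_heartbeat_ts = clock()

    def start_build(self):
        self.build_active = True
        self.last_heartbeat_ts = self._clock()

    def on_stderr_line(self, line):
        if self.build_active and is_heartbeat(line):
            self.last_heartbeat_ts = self._clock()

    def should_kill(self):
        """True once an active build has been silent for the full threshold."""
        silent_for = self._clock() - self.last_heartbeat_ts
        return self.build_active and silent_for >= self.silent_kill_secs
```

Seen this way, a false kill is a code path that produced no matching line for 30 s, which is why the remedy is another log line rather than a longer threshold.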

The watchdog only fires on silent slots. A slot that emits heartbeats but grinds through one sheet far too slowly would never trip it. So the builder also enforces a per-sheet wall-clock cap, per_sheet_timeout_secs (90 s): when a single sheet’s walk exceeds it, that sheet’s partial edges are dropped and the build continues to the next sheet (sheet-local quarantine, not a workbook-wide unwind). The timed-out sheet is recorded on the graph so the degradation is visible.

These thresholds form a strict ladder, locked by a config invariant test so inverting any pair can’t ship:

slot_watchdog_silent_kill (30s)
< per_sheet_timeout (90s)
< slot_peak_timeout (1150s)
≤ the Excel tier grant

The outer slot_peak_timeout_secs is the wall-clock ceiling for a whole build_full request — sized so the slot times out (SIGTERM, fall to in-process) before the outer Excel grant cancels with no fallback window. It accommodates cold-cache P95 variance on a standard institutional workbook; warm-cache builds, the typical repeat-upload case, complete in roughly 70 s and never approach it.
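
An invariant check for this ladder might look like the following sketch. check_timeout_ladder is a hypothetical helper, not the project's config test, and the Excel tier grant is passed in as a parameter because its value is not stated on this page.

```python
def check_timeout_ladder(
    silent_kill_secs,
    per_sheet_timeout_secs,
    slot_peak_timeout_secs,
    excel_tier_grant_secs,
):
    """Return a list of violated rungs; an empty list means the ladder holds."""
    violations = []
    if not silent_kill_secs < per_sheet_timeout_secs:
        violations.append("slot_watchdog_silent_kill must be < per_sheet_timeout")
    if not per_sheet_timeout_secs < slot_peak_timeout_secs:
        violations.append("per_sheet_timeout must be < slot_peak_timeout")
    if not slot_peak_timeout_secs <= excel_tier_grant_secs:
        violations.append("slot_peak_timeout must be <= the Excel tier grant")
    return violations
```

With the documented values 30, 90, and 1150 and any grant of at least 1150 s, the function returns an empty list; swapping any two adjacent values produces a named violation.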

When the pool itself is genuinely broken — subprocess startup failing, a protocol mismatch, repeated OOMs — every new build would otherwise spawn a fresh subprocess that also fails, each caller waiting tens of seconds for the health thread to detect death and respawn. The pool circuit breaker collapses that cascade to a fast-fail: after circuit_breaker_threshold consecutive failures within a 60 s window, it opens, and acquire() raises PoolCircuitOpen (a PoolBusy subclass, so existing fallback handlers catch it unchanged) — routing builds straight to in-process. After a cooldown it half-opens to probe a single build, then closes on success or re-opens on failure. Counters reset on any successful build, so a single bad workbook in a stream of healthy ones never trips it.
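
The breaker's state machine can be sketched in a few dozen lines. This is an illustration of the closed, open, and half-open transitions described above, not FormulaGraphPoolCircuitBreaker itself: the class name, the method names, and the 30 s cooldown are assumptions, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time


class CircuitBreaker:
    """Consecutive-failure breaker with a failure window and a cooldown."""

    def __init__(self, threshold, window_secs=60.0, cooldown_secs=30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window_secs = window_secs
        self.cooldown_secs = cooldown_secs  # assumed value, not from the source
        self._clock = clock
        self.state = "closed"
        self._failures = []  # timestamps of consecutive failures
        self._opened_at = 0.0

    def allow(self):
        """May a build use the pool right now?"""
        if self.state == "closed":
            return True
        if self.state == "open" and self._clock() - self._opened_at >= self.cooldown_secs:
            self.state = "half_open"  # admit exactly one probe build
            return True
        return False

    def record_success(self):
        self._failures.clear()  # any success resets the streak
        self.state = "closed"

    def record_failure(self):
        now = self._clock()
        if self.state == "half_open":
            self._trip(now)  # the probe failed, so re-open
            return
        self._failures = [t for t in self._failures if now - t < self.window_secs]
        self._failures.append(now)
        if len(self._failures) >= self.threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```

In this sketch the caller would check allow() before acquiring a slot and raise the pool's fast-fail error when it returns False.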

The bridge: exposing the live graph to non-financial consumers

The six financial deep-agent tools have always reached the live graph through FormulaGraphProvider — a lazy, per-instance memoized loader that deserializes the persisted dict on first .graph() call. But synthesis, footnotes, the Investor Packet, and readiness needed cell-level answers too, and most of them were reading only the flat formula_graph_summary dict. They couldn’t answer “where does this IRR come from?”

FormulaGraphBridgeService (src/intel/formula_graph/bridge_service.py) closes that gap. It wraps FormulaGraphProvider and exposes a narrow surface tuned to non-financial callers:

| Method | Returns | Used by |
| --- | --- | --- |
| is_available(deal_data) / tier(deal_data) | cheap dict-level pre-checks (no deserialize) | all consumers, gating |
| trace_dependencies(deal_data, metric_addr) | root-assumption cell addresses (bounded BFS) | synthesis numeric assertions |
| top_drivers(deal_data, n=5) | global top-N influence-ranked cells with labels | IP critical-assumptions |
| reachability_lookup(deal_data, src, dst) | does src reach dst? (bounded forward BFS) | impact analysis |
| formula_provenance(deal_data, metric_key) | one-call (output_cell, root_assumptions, depth, validator_signature) | footnote cell-level provenance |
| validator_signature(deal_data) | stable audit string FormulaGraphValidator.v{protocol_version}.{content_hash[:8]} | citation stability |
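
As an illustration of what a bounded BFS to root assumptions involves, here is a small standalone sketch. It does not reproduce trace_dependencies: the function name, the deps_of callable, and the max_nodes bound are hypothetical, and a root is taken to mean a cell with no dependencies of its own.

```python
from collections import deque


def trace_root_inputs(deps_of, metric_addr, max_nodes=10_000):
    """Walk dependencies breadth-first and collect cells that have no inputs."""
    seen = {metric_addr}
    queue = deque([metric_addr])
    roots = []
    visited = 0
    while queue and visited < max_nodes:  # the bound caps total work
        addr = queue.popleft()
        visited += 1
        deps = deps_of(addr)
        if not deps:
            roots.append(addr)  # nothing feeds this cell, so it is an input
            continue
        for dep in deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(roots)
```

The seen set ensures a cell shared by several paths is visited once, and hitting the bound yields a partial answer instead of an unbounded walk.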

Tenancy: two DI shapes, no cross-process facade

The bridge is wired differently in Canvas vs. web/worker, and that split is structural — not an accident to be “cleaned up” into one facade:

  • Canvas wires it as IntelService.formula_graph, passing canvas_cache=app_state.formula_graph_cache so the bridge shares the Canvas Ripple LRU. Each call still constructs a fresh FormulaGraphProvider, but the provider hits the shared cache on its first .graph().
  • Web/worker instantiates it directly in DI (next to the signals and patterns services) and threads it through the orchestrator into the final-editor and footnote subgraphs, with canvas_cache=None. Here each call pays a per-call deserialize (~200–300 ms cold), matching the existing financial-tools pattern.

The reason web/worker does not get an IntelService facade is that the facade’s readiness-derived services depend on DocumentReadinessService, which is Canvas-only. Wrapping a partial facade with optional readiness fields would create two contracts sharing one class — a consumer couldn’t tell “running in web/worker” from “wiring bug.” Web/worker takes the exact narrow slice it needs instead.

Canvas serves Metric Ripple previews on keystrokes, and deserializing a ~119k-node graph (~200–300 ms) on every edit would be unusable. FormulaGraphCache (src/canvas/services/formula_graph_cache.py) is an LRU keyed by (deal_id, content_hash). The content_hash in the key is what makes re-upload safe: a new workbook upload yields a new hash, so the stale entry is evicted automatically on the next access — no invalidation choreography. A threading.Lock keeps the LRU consistent under concurrent metric edits.
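
A content-hash-keyed LRU with a lock can be sketched as follows. GraphLRU is a hypothetical class, not FormulaGraphCache: the get_or_load signature, the max_entries default, and the choice to drop older hashes for the same deal when a new one is stored are all assumptions made for illustration.

```python
import threading
from collections import OrderedDict


class GraphLRU:
    """Thread-safe LRU keyed by (deal_id, content_hash)."""

    def __init__(self, max_entries=8):
        self._max = max_entries
        self._lock = threading.Lock()
        self._items = OrderedDict()

    def keys(self):
        with self._lock:
            return list(self._items)

    def get_or_load(self, deal_id, content_hash, loader):
        key = (deal_id, content_hash)
        with self._lock:
            if key in self._items:
                self._items.move_to_end(key)  # mark as most recently used
                return self._items[key]
        graph = loader()  # expensive deserialize runs outside the lock
        with self._lock:
            # A new hash for the same deal makes older entries unreachable.
            stale = [k for k in self._items if k[0] == deal_id and k != key]
            for old in stale:
                del self._items[old]
            self._items[key] = graph
            self._items.move_to_end(key)
            while len(self._items) > self._max:
                self._items.popitem(last=False)  # evict least recently used
        return graph
```

Because the hash is part of the key, a re-uploaded workbook can never be served from the old entry, whether or not that entry has been removed yet.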

The graph is built at a tier, and consumers degrade rather than fail when a richer tier isn’t available:

  • full — formula edges plus computed values; everything works.
  • formulas_only — formula edges without values.
  • values_only — no formula edges; circular-ref detection and dependency tracing return empty, because there’s nothing to walk.

When the graph is absent entirely, formula_graph_summary.unavailable_reason records the cause, and the bridge’s coverage_report returns a confidence=0.0 report naming every canonical CRE tag as missing, so consumers can gate their behaviour off a structured signal instead of a binary present/absent. A standard institutional workbook should reach full on the slot path; the in-process fallback, the per-sheet quarantine, and the parent-side serialize deferral exist so that even under memory pressure the deal gets some graph rather than none.

FormulaGraph is built during document processing, persisted into the deal’s data, and then read by:

The financial deep-agent tools consume it directly through FormulaGraphProvider for sensitivity analysis and driver-hierarchy construction.

  • src/processors/excel/formula_graph/graph_builder.py: _expand_range (@lru_cache(maxsize=32768), line 188), _expand_range_aware quarantine + 100k hard-ceiling guards, _QUARANTINED_SHEETS, _RANGE_EXPANSION_HARD_CEILING_CELLS.
  • src/processors/excel/formula_graph/contracts.py: FormulaGraph dataclass, CSR fields (_dependents_offsets/_dependents_neighbors/_dependencies_offsets/_dependencies_neighbors/_cross_sheet_pairs), __hash__ = object.__hash__, __setstate__, release(), dependencies_of/dependents_of, serialize_formula_graph_streaming.
  • src/processors/excel/model_quality/circular_ref_detector.py: iterative DFS with (addr, dep_idx, deps) stack; deps-on-stack invariant; values_only/None early returns.
  • src/config/formula_graph_pool_config.py: pool_size, queue_depth, slot_idle_rss_recycle_mb, slot_post_release_rss_refuse_threshold_mb, slot_watchdog_silent_kill_secs, per_sheet_timeout_secs, slot_peak_timeout_secs, protocol_version, subprocess_address_limit_mb, circuit_breaker_threshold, max_frame_size_bytes.
  • src/utils/formula_graph_pool.py: FormulaGraphPool, _watchdog_loop, FormulaGraphPoolCircuitBreaker, PoolBusy/PoolCircuitOpen.
  • src/utils/formula_graph_subprocess_worker.py: slot main loop, _run_build_full, graph-release-before-send, graph_serialize_strategy values, multi-buffer frame I/O, orphan self-exit.
  • src/utils/formula_graph_subprocess_client.py: per-slot client, _stderr_reader_loop heartbeat matcher (level-agnostic), sigkill_and_mark_unhealthy, FormulaGraphProtocolMismatch, graph_serialized deserialize.
  • src/processors/excel/tab_orchestrator.py: get_formula_graph_pool() consumer, slot acquire + build_full, in-process fallback, formula_graph_summary standardization.
  • src/intel/formula_graph/bridge_service.py: FormulaGraphBridgeService (is_available, tier, trace_dependencies, reachability_lookup, top_drivers, formula_provenance, validator_signature, coverage_report); never-raise contract; tenancy split.
  • src/langchain/workflows/tools/financial/deep_agent/tools/formula_graph_provider.py: FormulaGraphProvider lazy deserialize + memoize, is_available, graph, primary_cell_for.
  • src/canvas/services/formula_graph_cache.py: FormulaGraphCache LRU keyed by (deal_id, content_hash).
  • src/canvas/ripple/: Metric Ripple engine (engine.py, dependency_resolver.py) consuming the FormulaGraph.
  • Native memory: formula_graph_subprocess_ipc.md, formula_graph_bridge.md, formula_graph_range_expansion_cache.md, formula_graph_circular_ref_dfs_invariant.md, formula_graph_watchdog_heartbeat_invariant.md, formula_graph_subprocess_serialize_streaming.md, feedback_151_avenue_a_is_standard_size.md.