Skip to content

Retrieval Pipeline

Every section of a Memosa memo is grounded in evidence pulled from the deal’s own documents. That evidence comes from the CRE retrieval pipeline (CRE = “consolidated retrieval engine”) — the layer that turns a section’s query into a ranked, deduplicated, source-diverse pool of chunks for the synthesis agent to cite.

This page traces a single section’s retrieval from query construction through final chunk assembly. It names the real modules, states exactly what fires when, and explains the production lessons that shaped each stage. For how the resulting chunks are scored and ordered, see Reranking.

A retrieval call for one section runs through an ordered list of phases. The orchestrator (RetrievalPipelineOrchestrator in pipeline/orchestrator.py) sequences them and checks for early termination between each — it contains no business logic, only sequencing and deadline/cancel checks.

setup → validation → [structural_routing] → primary → chunk_retry → recovery → assembly
PhaseModuleResponsibility
setuppipeline/phases/setup_phase.pyQuery optimization, adaptive top-k, vector config
validationpipeline/phases/validation_phase.pyCircuit-breaker + namespace validation
structural_routingpipeline/phases/structural_routing_phase.pyOptional Phase 0.5 — document structure-tree lookup (only when a tree store is injected)
primarypipeline/phases/primary_phase.pyStage A — main vector query (hybrid dense+sparse)
chunk_retrypipeline/phases/chunk_retry_phase.pyDoc-filter retry on zero/sparse results
recoverypipeline/phases/recovery_phase.pyThe 4-tier recovery cascade for whole-empty pools
assemblypipeline/phases/assembly_phase.pyStage B + Stage C, reranking, fairness, diversity, final result

structural_routing is conditionally inserted — when no tree_store is injected, the phase list is exactly the six core phases and behavior is unchanged.

Between every phase the orchestrator does two safety checks:

  • Deadline check. ctx.check_deadline() — if the section’s wall budget is exhausted, it terminates early with deadline_exceeded_before_<phase>.
  • Cooperative soft-cancel. A synchronous poll of timeout_budget_manager.is_phase_aborting("vector_operations"). The budget watchdog raises this flag at 85% of the hard limit so the pipeline exits cleanly at a safe phase boundary before a hard task.cancel() fires — which would be a no-op against uncancellable thread-pool Pinecone IO. (This fix followed a May 2026 production cascade where a hard cancel left a 65s vector-operations overrun.) The poll is synchronous on purpose: awaiting here would introduce a new cancellation point.

The setup phase (RetrievalSetupBuilder in setup/retrieval_setup_builder.py) builds the query and decides how many chunks to fetch.

The number of chunks requested from Pinecone — the adaptive top-k — is derived per section from a SECTION_COMPLEXITY table, scaled by document count and clamped to a per-section cap. It is not a single global number.

  • Each section has a base complexity (e.g. financial_analysis = 28, risk_market = 11, market_comparables = 30).
  • A document-count scale factor applies: ≤2 docs → 1.75× (thin pools need more per-doc candidates, not fewer — a lesson re-learned on a 2-document deal whose pools were starving the citation injector), 3–10 docs → 1.0×, >10 docs → 1.5×.
  • The result is clamped to a floor of _MIN_TOP_K = 8 chunks (raised from 5 after risk_analysis at top_k=6 returned zero chunks) and a per-section ceiling (up to 50 for the comparables sections).

The per-section reference values live in src/config/retrieval_parameter_defaults.py (RETRIEVAL_DEFAULTS), where the configured top_k ranges from 8 (risk_dimension) to 30 (exit_strategy, market_comparables, rental_comparables, sales_comparables). The CRE pipeline’s actual Pinecone top-k is derived from SECTION_COMPLEXITY and the two must stay aligned (enforced by test_retrieval_config_consistency.py).

A namespace-health check can increase top-k further when the namespace is sparse or degraded (the “inverted logic” fix): weak signal means more retrieval attempts, not fewer — the opposite would create a death spiral of progressively sparser assessments.

The primary phase issues the main vector query. By default this is a hybrid query — dense (cosine) and sparse (BM25/lexical) executed together and fused with reciprocal rank fusion. Hybrid-primary is enabled globally (HYBRID_PRIMARY_ENABLED = True in config.py); the sparse query runs in parallel with the dense one, so latency is neutral while recall on keyword-heavy tabular Excel data improves 15–25% (tabular numbers have low cosine similarity to natural-language queries but strong BM25 signal). The sparse weight is 0.3 — i.e. 70% dense / 30% sparse, mirroring the fallback reranker’s internal blend.

Two server-side Pinecone metadata filters tighten Stage A at zero extra latency:

  • Quality floor. QUALITY_FLOOR_ENABLED = True, default 0.25, with per-section overrides (executive_summary 0.35 down to sponsor_background 0.15). Filters OCR noise, boilerplate, and malformed chunks out of the top-k slots before they are ever fetched.
  • Section-hint filter. SECTION_HINT_FILTER_ENABLED = True — eligible sections filter on a section_hint metadata field instead of doing a Phase 0.5 PageIndex tree lookup, saving ~5s of tree I/O per query.

Each section also carries a score floor — the minimum relevance score a chunk must clear. Floors are section-tuned in retrieval_parameter_defaults.py and range from 0.10 (sponsor_background, exit_strategy) to 0.18 (the system default and several PDF-primary sections). They are deliberately tuned per section, not derived from pool size — a sponsor’s name rarely self-identifies as “sponsor” and scores lower, so its floor is relaxed.

Stage B — section-aware source-type recovery

Section titled “Stage B — section-aware source-type recovery”

Stage A can return a healthy pool overall yet be missing a source type the section actually depends on — for example, a financial_analysis pool with plenty of PDF chunks but no Excel. Stage B (recovery/source_type_recovery.py, dispatched from assembly_phase._maybe_recover_missing_source) fixes exactly that case.

Stage B fires a single source-type-targeted recovery sub-query only when a section-required source type is absent — or under its per-section quantity floor — in a non-empty primary pool. The required/optional split is governed by a per-section policy table, SECTION_EXPECTED_TYPES:

Section familyExample sectionsRequired (Stage B recovers)Optional (never triggers)
CoStar-primaryrisk_market, market_analysis, market_comparablescostarexcel
Excel-primaryfinancial_analysis, key_metrics, risk_financialexcelcostar
Mixedinvestment_rationale, exit_strategyexcelcostar
PDF-primarysponsor_background, property_*, risk_operational, risk_regulatory(none)excel, costar

The distinction is load-bearing. Stage B does not fire for optional types regardless of whether they appear in allowed_indexes. Excel financial chunks legitimately do not match a risk_market market/demographics query — they live in Pinecone for financial_analysis and key_metrics, not risk_market. Treating Excel as universally recoverable produced nine months of false filter_returned_wrong_type alarms before the policy table replaced it.

Two source types are permanently excluded via _NEVER_RECOVERED:

  • pdf — the natural precedence floor; Stage C’s pre-rerank quota handles PDF monopolization without spending a recovery slot.
  • user — no ingest path produces document_type="user" chunks (only PDF, Excel, and CoStar builders exist). User-input precedence lives at synthesis-prompt assembly, not in Pinecone.

The recovery query is a raw Pinecone metadata-filtered query (direct_vector_query) keyed on the canonical source_type field — bypassing the CRE pipeline entirely (no reranker, no score floor, no chunk-retry filter-clearing). This is deliberate: the filter is then authoritative, and chunks returned always carry the requested type.

Filtering on source_type (not document_type) matters because specific Excel vectorizers write suffixed document_type values like excel_metrics / excel_debt while keeping source_type="excel" canonical — filtering on document_type=="excel" would silently miss every suffixed Excel chunk. The recovery has a 12-second per-call ceiling (SOURCE_RECOVERY_TIMEOUT_SECS = 12.0 ) and goes through the recovery semaphore so it bypasses primary-pool admission pressure. A per-(namespace, missing_type) diagnosis cache short-circuits repeat probes across sections so a genuinely-absent type is not re-probed five times per deal.

Recovered chunks are merged into the candidate pool and reranked alongside the primary chunks. Because the reranker can later cut recovered chunks below rerank_top_n, a post-rerank protection step re-admits up to 8 of them, and Excel-required sections additionally enforce a per-section Excel survivor floor (EXCEL_MIN_SURVIVORS_BY_SECTION, e.g. financial_analysis = 4) so a recovered-then-cut Excel pool cannot silently collapse back to PDF/CoStar.

Where Stage B recovers a missing type, Stage C reserves room for minority types that are already present but at risk of being squeezed out of the top-k slice. The over-fetch multiplier STAGE_C_OVER_FETCH = 1.5 causes Pinecone to return adaptive_top_k × 1.5 chunks; the pre-rerank quota at the primary phase trims back to adaptive_top_k while guaranteeing minority-source presence. The multiplier is bounded by each section’s cap so high-complexity sections never exceed their headroom — and it must not be raised above 1.6× without proving pool-pressure headroom first.

When the primary pool comes back whole-empty (zero chunks), the recovery phase runs a four-tier cascade (phases/recovery_cascade.py). Each tier is attempted in order with exponential backoff between stages (≈0.5s → 1s → 2s → 4s, with ±25% jitter) to avoid a retry storm against a latency-degraded Pinecone:

  1. Doc-filter relaxation — retry with the document-ID restriction dropped. (Financial analysis hit this: zero chunks despite 104 Excel chunks visible, because the doc-filter path was broken.)
  2. Namespace sweep — drop all filters and query the whole namespace. For sections that never send doc IDs (e.g. risk_analysis, sponsor_guarantor) this is the expected path, not a failure, so it logs at INFO.
  3. Deep index sweep — sweep across all candidate physical indexes. By this point the namespace sweep has proven the original index restriction empty, so it is dropped.
  4. Hybrid lexical (sparse/BM25) recovery — a final pure-lexical pass.

Each tier validates recovered chunks against a quality gate before accepting them (validate_recovered_chunk_quality); a tier whose chunks are all rejected is treated as failed and falls through to the next. If a structural routing phase identified relevant page ranges, a structure-scoped recovery is tried before the broad namespace sweep — and only short-circuits the sweep if it yields ≥3 quality-passing chunks. Every tier emits a recovery_tier_result JSON event; the terminal fallback_exhausted event marks a section that genuinely has no retrievable evidence.

Retrieval quality starts at ingest. Memosa embeds with the Voyage 4 family using asymmetric retrieval — a different model for documents than for queries:

RoleModelNote
Documentsvoyage-4-large MoE architecture, SOTA retrieval
Queriesvoyage-4 Asymmetric pairing within the Voyage 4 family

Vectors are Matryoshka-truncated from a native 2048 dimensions to 1024 output dimensions (output_dimensions = 1024 ) — the migration from 3072-dim vectors made Pinecone I/O cheaper and let the pipeline feed the reranker more candidates.

Documents are chunked first (the chunk_first mode), at 768 tokens per chunk with 128 tokens of overlap (rag_chunk_max_tokens = 768 , rag_chunk_overlap_tokens = 128 ), then each chunk is embedded independently. Before embedding, contextual retrieval prepends an Anthropic-LLM-generated context paragraph to each chunk (Claude Sonnet; 30 concurrent; 300-chunk budget ceiling) so a numbers-heavy chunk still embeds with the document context that explains it. (An alternative late_chunking mode leverages Voyage 4’s 32K window and skips the contextual-retrieval step; chunks from the two modes cannot be mixed in one namespace.)

All vectors live in Pinecone serverless (AWS us-west-2, cosine metric) across three physical indexes: a consolidated deals index (memobot-deals-1024 — PDF, Excel, and CoStar differentiated by source_type metadata, not separate indexes), a corpus index, and a multimodal index. Namespace isolation keeps every deal’s vectors separate; the namespace format is described in Namespacing and Isolation.

Concurrent Pinecone access is bounded by a connection pool and a pool-aware admission controller (src/vector_db/retrieval_executor.py). The controller is the primary gate — it checks real-time pool utilization before admitting a query — and the per-semaphore caps are a backstop.

The pool is sized at 96 connections (pool_size = 96 ). This was tuned through several production iterations (50 → 72 → 96) after cascading low-chunk-assembly failures at 82–93% utilization. Retrieval and ingest each get their own 16-slot semaphore (retrieval_max_parallel / ingest_max_parallel, both with an audit-mandated floor of 16) so heavy upserts stop competing with analysis-phase reads; the dataclass refuses construction if either drops below 16 or their sum exceeds the pool size.

The admission controller does graduated load-shedding by utilization:

ThresholdAction
88%+ (84/96 active)Hard reject (source-diversity-exempt) — lowered 95 → 88 after a production cascade saturated the pool 95→100% in one surge; 88% leaves 12 spare slots for the source-diversity exemption and recovery-priority bypass
90%+ with budget < 12sReject — short-budget queries would queue for a connection and time out before executing, wasting the slot
80%+ with budget < 2sReject — triage low-budget queries early
92%+Skip shielded recovery on cascade-cancelled sections — spawning a recovery task at this pressure would deepen starvation

These thresholds are centralized in PineconeConfig with __post_init__ invariants enforcing their relative ordering (triage < hard_reject ≤ short_budget_reject < shielded_recovery_skip), so they cannot be reordered into an incoherent ladder. They were tuned through 7+ iterations and must not be lowered without production-log evidence.

Embedding fan-out at ingest is bounded not just by a static cap but by a live RSS-pressure curve: effective concurrency = min(config, rss_derived), floored at 2. This closed the one Voyage-tensor path that previously had no admission control (the reranker and the retrieval pool already had theirs), and it root-caused a recurring class of batch_embeddings memory recoveries that had survived 11 prior reactive fixes — the real driver was unbounded in-flight embedding concurrency, not opaque vendor-SDK memory.

After assembly, the pipeline returns a deduplicated, reranked, source-diverse, fairness-balanced chunk pool. The ordering and scoring of that pool — Voyage rerank-2.5, the SemanticBM25 fallback, the circuit breaker, the section score floors, and the density-adaptive pre-rerank floor — are covered in Reranking. For how those cited chunks become a memo, see Synthesis and Footnotes.

  • src/utils/consolidated_retrieval/pipeline/orchestrator.py — phase sequencing, deadline + soft-cancel checks
  • src/utils/consolidated_retrieval/pipeline/phases/assembly_phase.py — Stage B dispatch, Stage C quota, reranking, fairness/diversity, recovered-chunk protection
  • src/utils/consolidated_retrieval/recovery/source_type_recovery.py — Stage B trigger semantics, SECTION_EXPECTED_TYPES, _NEVER_RECOVERED, detect_missing_high_precedence_types
  • src/utils/consolidated_retrieval/phases/recovery_cascade.py — the 4-tier whole-empty recovery cascade + backoff
  • src/utils/consolidated_retrieval/config.py — hybrid-primary, quality floors, section-hint filter, rerank top-N
  • src/utils/consolidated_retrieval/setup/retrieval_setup_builder.pySECTION_COMPLEXITY, adaptive top-k, _MIN_TOP_K, STAGE_C_OVER_FETCH
  • src/config/retrieval_parameter_defaults.py — per-section top_k / score_floor (0.10–0.18), EXCEL_MIN_SURVIVORS_BY_SECTION
  • src/config/pinecone_config.pypool_size=96, retrieval/ingest semaphores, admission thresholds
  • src/config/embedding_config.py — Voyage 4 asymmetric models, 2048→1024 Matryoshka, chunk_first vs late_chunking
  • src/config/chunking_config.py — 768-token chunks / 128 overlap, contextual retrieval prepend
  • memory/embedding_concurrency_memory_coupling.md — ingest concurrency ↔ RSS coupling fix