Namespacing and Tenant Isolation

Memosa is multi-tenant at the vector layer. Each deal’s chunks live in that deal’s own Pinecone namespace, and that namespace is the hard boundary that keeps one organization’s documents, and one deal’s documents, from leaking into another’s retrieval. No row-level filter enforces isolation after the fact; the isolation is structural, applied at write time and at every read. This page documents the namespace format, the one class that produces it, the invariants that keep it safe, and the precedence ladder Memosa uses when two sources disagree about a number.

A deal’s Pinecone namespace has one of two shapes, produced by NamespaceManager.get_deal_identifier() in src/utils/namespace_manager.py:

{org_id}_{normalized-deal-name}-{tts8} # org-scoped (current)
{normalized-deal-name}-{tts8} # legacy / no-org

Three parts, assembled left to right:

| Part | Source | Purpose |
| --- | --- | --- |
| org_id | The organization’s UUID string | Primary tenant boundary. Prepended with a trailing underscore so Pinecone list() prefix matching enumerates one org’s namespaces. Omitted entirely for legacy deals with no org. |
| normalized-deal-name | The user-supplied deal name, normalized | Human-legible deal identity inside the namespace. |
| tts8 | First 8 digits of the deal’s Slack thread_ts | Deterministic per-deal disambiguator: two deals with the same name in the same org never collide. |

The tts8 suffix is deterministic, not random. generate_unique_identifier() derives it by stripping the dot from the thread timestamp and taking the first eight characters:

# "1736608800.123456" -> "17366088"
thread_id = thread_ts.replace(".", "")[:8]

This determinism is load-bearing: the same conversation always resolves to the same namespace, so re-runs, recovery, and late-arriving documents all land in the same place. generate_unique_identifier() raises if no thread_ts is supplied — there is deliberately no random fallback, because a non-deterministic namespace would silently fragment a deal’s chunks across two namespaces and break retrieval.
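The derivation and its no-fallback rule can be sketched as a standalone function. This is a minimal illustration, not the code in namespace_manager.py: the source only says the method raises, so the ValueError type and the message text here are assumptions.

```python
from typing import Optional


def generate_unique_identifier(thread_ts: Optional[str]) -> str:
    """Derive the 8-character tts8 suffix from a Slack thread timestamp."""
    if not thread_ts:
        # Deliberately no random fallback: a non-deterministic suffix would
        # split one deal's chunks across two namespaces.
        # (ValueError is an assumed exception type for this sketch.)
        raise ValueError("thread_ts is required to derive a deal identifier")
    # "1736608800.123456" -> "1736608800123456" -> "17366088"
    return thread_ts.replace(".", "")[:8]
```

Calling it twice with the same thread_ts yields the same suffix, which is the property re-runs and recovery depend on.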

The organization UUID is prepended ({org_id}_…), not stored as metadata to be filtered on. This is intentional. Because Pinecone supports prefix matching on list(), an org-prefixed namespace lets Memosa enumerate exactly one tenant’s deals without scanning the whole index. The trailing underscore ({org_id.strip()}_) is the prefix delimiter — every namespace for org acme-uuid starts with acme-uuid_, and nothing else does.
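To see why the trailing underscore matters, here is a small sketch that filters a plain list of namespace names by prefix. It does not call Pinecone; namespaces_for_org and the sample names are hypothetical, and the list stands in for whatever a real prefix listing would return.

```python
from typing import Iterable, List


def namespaces_for_org(all_namespaces: Iterable[str], org_id: str) -> List[str]:
    """Return only the namespaces that belong to one org (hypothetical helper)."""
    # The underscore is the delimiter: "acme-uuid_" cannot match "acme-uuid-2_...".
    prefix = f"{org_id.strip()}_"
    return [ns for ns in all_namespaces if ns.startswith(prefix)]


sample = [
    "acme-uuid_151-avenue-a-17366088",   # org-scoped, org "acme-uuid"
    "acme-uuid-2_other-deal-17370000",   # a different org with a similar id
    "legacy-deal-17350000",              # legacy, no org prefix
]
print(namespaces_for_org(sample, "acme-uuid"))
```

Matching on the bare org_id without the underscore would also pick up the second entry. Legacy namespaces cannot imitate an org prefix either, because the normalization pipeline drops every character outside [a-z0-9-], underscores included.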

When org_id is absent (a legacy deal, or a deal created before org scoping), the namespace falls back to the bare {normalized-deal-name}-{tts8} form. Both shapes are valid; the org prefix is additive.

The deal name a user types (“151 Avenue A — Mixed Use”) is not safe to drop into a namespace verbatim. normalize_deal_name_for_namespace() runs a fixed pipeline:

  1. Collapse all runs of whitespace to a single space and strip the ends.
  2. Lowercase.
  3. Replace whitespace with hyphens.
  4. Drop every character that is not [a-z0-9-].
  5. Collapse consecutive hyphens to one.
  6. Strip leading and trailing hyphens.

So "My Deal Name" becomes my-deal-name. If normalization produces an empty string (a name that was all punctuation, say), the method raises ValueError rather than emitting an empty namespace — an empty namespace would be a silent isolation hole.
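The six steps above translate almost line for line into regular expressions. The following is a sketch of the pipeline as described, not the actual method body; the error message is an assumption.

```python
import re


def normalize_deal_name_for_namespace(name: str) -> str:
    """Normalize a user-supplied deal name into a namespace-safe slug."""
    s = re.sub(r"\s+", " ", name).strip()  # 1. collapse whitespace, strip ends
    s = s.lower()                          # 2. lowercase
    s = s.replace(" ", "-")                # 3. whitespace -> hyphens
    s = re.sub(r"[^a-z0-9-]", "", s)       # 4. drop everything outside [a-z0-9-]
    s = re.sub(r"-{2,}", "-", s)           # 5. collapse consecutive hyphens
    s = s.strip("-")                       # 6. strip leading/trailing hyphens
    if not s:
        # An empty namespace would be a silent isolation hole, so fail loudly.
        raise ValueError(f"deal name {name!r} normalizes to an empty string")
    return s


print(normalize_deal_name_for_namespace("My Deal Name"))  # my-deal-name
```

Step 5 matters because step 4 can leave adjacent hyphens behind when it removes punctuation that sat between two spaces.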

Pinecone namespaces have a hard ceiling of 512 characters. get_deal_identifier() takes max_length=512 and, if the assembled identifier exceeds it, truncates only the deal-name portion — the org_id prefix and the -{tts8} suffix are preserved in full, because those two parts carry the isolation and determinism guarantees. The truncation re-strips any trailing hyphen so the result is still well-formed. In practice deal names never approach this limit, but the truncation logic guarantees the namespace stays both valid and isolated even for pathological inputs.
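A sketch of the assembly and truncation logic follows. The signature is simplified for illustration (it takes an already-normalized name and a precomputed tts8, which the real get_deal_identifier() may not), and the error raised when no room is left for the name is an assumption.

```python
from typing import Optional


def get_deal_identifier(
    normalized_name: str,
    tts8: str,
    org_id: Optional[str] = None,
    max_length: int = 512,
) -> str:
    """Assemble {org_id}_{name}-{tts8}, truncating only the name portion."""
    prefix = f"{org_id.strip()}_" if org_id and org_id.strip() else ""
    suffix = f"-{tts8}"
    # Prefix and suffix carry isolation and determinism, so they are never cut.
    budget = max_length - len(prefix) - len(suffix)
    if budget < 1:
        raise ValueError("max_length leaves no room for the deal name")
    # Truncate the name, then re-strip a hyphen the cut may have exposed.
    name = normalized_name[:budget].rstrip("-")
    return f"{prefix}{name}{suffix}"


print(get_deal_identifier("my-deal-name", "17366088", org_id="acme-uuid"))
# acme-uuid_my-deal-name-17366088
```

With org_id omitted, the same call returns the legacy form my-deal-name-17366088.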

Idempotent resolution: generate once, then reuse

A deal’s namespace must be computed exactly once and then reused for the rest of that deal’s life. NamespaceManager.get_or_use_existing() enforces this:

  • It reads the conversation state for an existing pinecone_namespace.
  • If one is present and non-empty, it returns that (origin restored) — it does not recompute.
  • Otherwise it generates via get_deal_identifier(), persists the result back to conversation state, and returns it (origin generated).

Both branches log a NAMESPACE-INVARIANT line so the origin is auditable in production. This “restore-or-generate” shape is what makes namespacing safe across the asynchronous, retry-heavy lifecycle of a deal: the first writer wins, and everyone else binds to the same namespace.
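The restore-or-generate shape can be sketched with a plain dict standing in for conversation state. The function signature, the returned origin string, and the exact log format are assumptions made for this illustration; only the pinecone_namespace key, the NAMESPACE-INVARIANT marker, and the two origins come from the description above.

```python
import logging
from typing import Callable, MutableMapping, Tuple

logger = logging.getLogger("namespace_manager")


def get_or_use_existing(
    state: MutableMapping[str, str],
    generate: Callable[[], str],
) -> Tuple[str, str]:
    """Return (namespace, origin); generate at most once per conversation."""
    existing = state.get("pinecone_namespace")
    if existing:
        # Present and non-empty: bind to it, never recompute.
        logger.info("NAMESPACE-INVARIANT origin=restored namespace=%s", existing)
        return existing, "restored"
    namespace = generate()
    state["pinecone_namespace"] = namespace  # persist so later callers restore it
    logger.info("NAMESPACE-INVARIANT origin=generated namespace=%s", namespace)
    return namespace, "generated"
```

A second call against the same state returns the stored value with origin "restored" and never invokes the generator, which is what makes retries safe.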

normalize_namespace: defensive edge handling

get_deal_identifier() is the canonical producer of a deal namespace. A second, lower-level helper — normalize_namespace() — exists for the cases where some arbitrary value (a document ID, an internal identifier) needs to be coerced into a safe namespace-shaped string. Its edge handling is the safety net:

| Input | Result |
| --- | --- |
| None | "default" (logs a warning) |
| Empty / whitespace-only string | "default" (logs a warning) |
| A normal string | Run through the same normalize_deal_name_for_namespace() pipeline |
| A string that normalizes to empty | "default" (logs a warning) |

The deliberate choice here is that normalize_namespace() never raises for None/empty — it falls back to the literal "default" and logs. This is the opposite of normalize_deal_name_for_namespace(), which raises. The difference is intent: producing a deal namespace from a missing name is a bug worth failing on; coercing a stray internal value is best-effort, and a logged fallback to "default" is safer than a crash deep in a vector-write path. Crucially, normalize_namespace() reuses the exact same normalization core as deal names, which is what keeps a pinecone_namespace and any document-ID filter derived from it consistent — a mismatch there was historically a silent RAG failure mode.
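The edge handling in the table above can be sketched as follows. The private helper _normalize_core is a hypothetical stand-in for the shared normalization core, written compactly here so the block is self-contained; the warning messages are assumptions.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger("namespace_manager")


def _normalize_core(value: str) -> str:
    """Same pipeline as deal names, but returns "" instead of raising."""
    s = re.sub(r"\s+", "-", value.strip().lower())
    s = re.sub(r"[^a-z0-9-]", "", s)
    return re.sub(r"-{2,}", "-", s).strip("-")


def normalize_namespace(value: Optional[str]) -> str:
    """Best-effort coercion: never raises for None/empty, falls back to "default"."""
    if value is None or not value.strip():
        logger.warning("normalize_namespace: empty input, using 'default'")
        return "default"
    normalized = _normalize_core(value)
    if not normalized:
        logger.warning("normalize_namespace: %r normalized to empty, using 'default'", value)
        return "default"
    return normalized
```

For example, normalize_namespace("Doc ID 42") gives doc-id-42, while None, "   ", and "???" all give "default".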

Isolation is only as good as its weakest call site. In Memosa the namespace is a required positional argument threaded through the entire ingestion and retrieval surface — it is not optional, not defaulted, and not inferred. The processor entry points make this explicit in their signatures:

src/processors/pdf/pdf_processor.py
async def process_pdf(self, pinecone_namespace: str, thread_ts: str, pdf_path: str, ...)

The CoStar and Excel processors follow the same contract, and at the bottom of the stack every Pinecone upsert, query, and delete call passes namespace=… explicitly. There is no code path that writes to or reads from Pinecone without naming the namespace. That is the isolation invariant: a chunk physically cannot be written outside its deal’s namespace, and a query physically cannot see chunks from another namespace, because the namespace is supplied at the call boundary every single time.

This matters because the alternative, a shared index with a metadata filter applied after retrieval, fails open: forget the filter on one query and you leak. Namespacing fails closed: forget the namespace and the call either fails immediately (it’s a required argument, so Python raises a TypeError) or returns nothing (an unknown namespace is empty). It never returns another tenant’s data.
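The fail-closed behavior can be demonstrated with a toy in-memory store. InMemoryVectorStore is a hypothetical stand-in for illustration, not Memosa's vector layer or the Pinecone client; it only shows what a required namespace argument buys you.

```python
from collections import defaultdict
from typing import Dict, List


class InMemoryVectorStore:
    """Toy store in which every operation must name its namespace."""

    def __init__(self) -> None:
        self._chunks: Dict[str, List[str]] = defaultdict(list)

    def upsert(self, namespace: str, chunk: str) -> None:
        # No default value: a caller cannot write without choosing a namespace.
        self._chunks[namespace].append(chunk)

    def query(self, namespace: str) -> List[str]:
        # An unknown namespace is simply empty; other namespaces are never read.
        return list(self._chunks.get(namespace, []))


store = InMemoryVectorStore()
store.upsert("org-a_deal-one-17366088", "cap rate 5.2%")
print(store.query("org-b_deal-two-17370000"))  # []
```

Calling store.query() with no argument raises a TypeError at call time, so the mistake surfaces as an error rather than as another tenant's data.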

Physical index consolidation does not weaken isolation

Memosa’s abstract index names (pdf, excel, costar, multimodal, corpus) resolve to a smaller set of physical Pinecone indexes after the index-consolidation work. Isolation is unaffected: consolidation changes which index a chunk lands in, but the namespace within that index is still the per-deal boundary. A query scoped to a deal’s namespace returns only that deal’s chunks regardless of how many abstract indexes were folded into one physical index.
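One way to picture this is as a two-part storage key. In the sketch below the abstract names come from the paragraph above, but the physical index names and the mapping itself are invented purely for illustration; the real consolidation mapping is not documented here.

```python
from typing import Dict, Tuple

# Hypothetical mapping: the physical index names are made up for this sketch.
ABSTRACT_TO_PHYSICAL: Dict[str, str] = {
    "pdf": "physical-a",
    "excel": "physical-a",
    "costar": "physical-a",
    "multimodal": "physical-b",
    "corpus": "physical-b",
}


def storage_key(abstract_index: str, namespace: str) -> Tuple[str, str]:
    """Resolve where a chunk lives: (physical index, per-deal namespace)."""
    return ABSTRACT_TO_PHYSICAL[abstract_index], namespace
```

Two abstract indexes may resolve to the same physical index, but two different deals never share the second element of the key, so the per-deal boundary is unchanged.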

Isolation answers “whose data is this?” Precedence answers “when two of a deal’s own sources disagree, which wins?” A deal routinely carries the same metric in several places — a cap rate in the Excel model, the same cap rate in the PDF offering memorandum, perhaps a third value a user typed in chat. Memosa resolves these with a fixed precedence ladder defined in src/utils/data_source.py.

The canonical ordering is User > Excel > CoStar > PDF > Unknown, with explicit numeric weights:

| Source | Weight | Rationale |
| --- | --- | --- |
| USER | 100 | A human typed it deliberately; highest authority. |
| EXCEL | 90 | The underwriting model is the deal’s computed truth. |
| COSTAR | 70 | Third-party market data; authoritative for market facts. |
| PDF | 50 | Narrative sponsor document; useful but lowest of the real sources. |
| UNKNOWN | 10 | Source could not be inferred; a precedence floor, but visible to audits. |

These weights live in one dict, DATA_SOURCE_PRECEDENCE_WEIGHTS, and everything that needs a numeric precedence score (the BM25 reranker’s source weighting, diversity enforcement, conflict resolution) derives from it via get_numeric_precedence_score(). There is no second copy of these numbers to drift.
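A minimal sketch of the single-source-of-truth arrangement is shown below. The weights are the documented ones; the enum's string values, the dict being keyed by enum member, and deriving PRECEDENCE_ORDER from the weights are assumptions about how data_source.py might be laid out.

```python
from enum import Enum
from typing import Dict, List


class DataSource(Enum):
    USER = "user"
    EXCEL = "excel"
    COSTAR = "costar"
    PDF = "pdf"
    UNKNOWN = "unknown"


# The one place the numbers live; everything else derives from this dict.
DATA_SOURCE_PRECEDENCE_WEIGHTS: Dict[DataSource, int] = {
    DataSource.USER: 100,
    DataSource.EXCEL: 90,
    DataSource.COSTAR: 70,
    DataSource.PDF: 50,
    DataSource.UNKNOWN: 10,
}


def get_numeric_precedence_score(source: DataSource) -> int:
    """Look up the precedence weight for a canonical source."""
    return DATA_SOURCE_PRECEDENCE_WEIGHTS[source]


# Derived, not hand-written, so the ordering cannot drift from the weights.
PRECEDENCE_ORDER: List[DataSource] = sorted(
    DataSource, key=get_numeric_precedence_score, reverse=True
)
```

Because the ordering is computed from the weights, changing one number in the dict updates every consumer at once.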

Raw source-type strings are messy: "spreadsheet", "uw_spreadsheet", "xlsx", and a dozen suffixed Excel-vectorizer tags ("excel_rent_roll", "excel_waterfall", …) all mean Excel; "sponsor_om", "om", and "offering_memorandum" all mean PDF. SOURCE_TYPE_ALIASES is the single normalization table that collapses every variant to a canonical DataSource enum member, and classify_source_type() is the one function call sites use. Unrecognized inputs return DataSource.UNKNOWN rather than silently defaulting to PDF — so a misclassification is visible (it lands at weight 10 and trips provenance audits) instead of masquerading as a real source.
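The alias table and its lookup can be sketched like this. Only the alias strings quoted in the paragraph above are taken from the source; the canonical self-aliases ("user", "excel", "costar", "pdf"), the case-insensitive matching, and the handling of None are assumptions, and the real table is larger.

```python
from enum import Enum
from typing import Dict, Optional


class DataSource(Enum):
    USER = "user"
    EXCEL = "excel"
    COSTAR = "costar"
    PDF = "pdf"
    UNKNOWN = "unknown"


# Partial, illustrative alias table.
SOURCE_TYPE_ALIASES: Dict[str, DataSource] = {
    "user": DataSource.USER,
    "excel": DataSource.EXCEL,
    "spreadsheet": DataSource.EXCEL,
    "uw_spreadsheet": DataSource.EXCEL,
    "xlsx": DataSource.EXCEL,
    "excel_rent_roll": DataSource.EXCEL,
    "excel_waterfall": DataSource.EXCEL,
    "costar": DataSource.COSTAR,
    "pdf": DataSource.PDF,
    "sponsor_om": DataSource.PDF,
    "om": DataSource.PDF,
    "offering_memorandum": DataSource.PDF,
}


def classify_source_type(raw: Optional[str]) -> DataSource:
    """Map a raw source-type string to a canonical DataSource."""
    if not raw:
        return DataSource.UNKNOWN
    # Unrecognized strings become UNKNOWN, never a silent default to PDF.
    return SOURCE_TYPE_ALIASES.get(raw.strip().lower(), DataSource.UNKNOWN)
```

A typo such as "exel" therefore classifies as UNKNOWN and shows up at weight 10 rather than passing as a PDF value.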

When a higher-precedence source is unavailable and Memosa falls back to a lower one — Excel parsing failed, so a metric comes from the PDF — that substitution is not silent. SourceFallbackTracker records each fallback (metric, expected source, actual source, reason) and can emit disclosure footnotes, so the finished memo can state that a value came from the sponsor PDF because the model didn’t yield it. Precedence resolution itself (the typed conflict kernel) lives in src/intel/deal_graph/conflict_resolver.py; data_source.py owns the ordering, the aliases, the numeric weights, and the fallback tracking that feed it.
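A sketch of what such a tracker might look like follows. The four recorded fields come from the description above; the FallbackRecord type, the method names record() and footnotes(), and the footnote wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FallbackRecord:
    metric: str
    expected_source: str
    actual_source: str
    reason: str


@dataclass
class SourceFallbackTracker:
    """Collects every downgrade so the memo can disclose it."""

    records: List[FallbackRecord] = field(default_factory=list)

    def record(self, metric: str, expected_source: str,
               actual_source: str, reason: str) -> None:
        self.records.append(
            FallbackRecord(metric, expected_source, actual_source, reason)
        )

    def footnotes(self) -> List[str]:
        # One disclosure line per recorded fallback, in insertion order.
        return [
            f"{r.metric}: sourced from {r.actual_source} "
            f"instead of {r.expected_source} ({r.reason})."
            for r in self.records
        ]
```

Recording ("cap_rate", "EXCEL", "PDF", "Excel parsing failed") would yield the footnote "cap_rate: sourced from PDF instead of EXCEL (Excel parsing failed)."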

How precedence and isolation work together

The two boundaries are orthogonal and both necessary:

  • Isolation (the namespace) guarantees you only ever resolve precedence within a single deal’s own documents. Precedence never reaches across deals or orgs.
  • Precedence (the ladder) then decides which of that deal’s sources wins a disagreement.

A retrieval is scoped to exactly one namespace; the chunks that come back are all from one deal; and when those chunks carry conflicting values for the same metric, the precedence weights pick the winner and the fallback tracker discloses any downgrade. Cross-tenant safety is the namespace’s job; intra-deal truth is precedence’s job.
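The interplay can be sketched end to end with a toy index. This is a deliberate simplification: the real resolution logic is the typed conflict kernel in conflict_resolver.py, whereas the resolve_metric function, the tuple-shaped chunks, and the sample values below are hypothetical and only pick the highest documented weight.

```python
from typing import Dict, List, Optional, Tuple

WEIGHTS: Dict[str, int] = {"USER": 100, "EXCEL": 90, "COSTAR": 70, "PDF": 50, "UNKNOWN": 10}

Chunk = Tuple[str, str, float]  # (source, metric, value)


def resolve_metric(
    index: Dict[str, List[Chunk]], namespace: str, metric: str
) -> Optional[Chunk]:
    """Scope to one namespace first, then let precedence pick the winner."""
    # Isolation: only this deal's chunks are ever candidates.
    candidates = [c for c in index.get(namespace, []) if c[1] == metric]
    if not candidates:
        return None
    # Precedence: highest-weight source wins; unrecognized sources rank as UNKNOWN.
    return max(candidates, key=lambda c: WEIGHTS.get(c[0], WEIGHTS["UNKNOWN"]))


index = {
    "org-a_deal-one-17366088": [("PDF", "cap_rate", 0.055), ("EXCEL", "cap_rate", 0.052)],
    "org-b_deal-two-17370000": [("USER", "cap_rate", 0.060)],
}
print(resolve_metric(index, "org-a_deal-one-17366088", "cap_rate"))
# ('EXCEL', 'cap_rate', 0.052)
```

The USER value in the other org's namespace outranks everything by weight, yet it never enters the comparison, because precedence only runs over chunks already scoped to the queried namespace.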

  • src/utils/namespace_manager.py: NamespaceManager, with get_deal_identifier() (the {org_id}_{name}-{tts8} format + length truncation), generate_unique_identifier() (the 8-digit tts8 derivation, raises without thread_ts), normalize_deal_name_for_namespace() (the normalization pipeline, raises on empty), normalize_namespace() (None/empty → "default" edge handling), and get_or_use_existing() (idempotent restore-or-generate with the NAMESPACE-INVARIANT log).
  • src/utils/data_source.py: DataSource enum, PRECEDENCE_ORDER, DATA_SOURCE_PRECEDENCE_WEIGHTS (User 100 / Excel 90 / CoStar 70 / PDF 50 / Unknown 10), SOURCE_TYPE_ALIASES + classify_source_type(), get_numeric_precedence_score(), and SourceFallbackTracker.
  • src/processors/pdf/pdf_processor.py: process_pdf(self, pinecone_namespace: str, …), the required-namespace contract at the ingestion boundary (CoStar and Excel processors mirror it).
  • src/vector_db/ (pinecone_pool.py, multi_index_vector_store.py, quality_updater.py): every Pinecone upsert/query/delete passing an explicit namespace=.
  • CLAUDE.md: the namespace format and the Data Source Precedence ladder, governance.