Skip to content

Observability and Infrastructure

This page covers two things engineers reach for constantly: the analytical observability warehouse that every cross-deal read flows through, and the deployment substrate — the Railway services, the per-service Postgres pool budgets, and the Pinecone connection pool — that the whole system runs on.

Memosa has its own self-hosted lakehouse for telemetry: a DuckLake warehouse (DuckDB 1.5.2 engine). The catalog lives on the existing Postgres; the data files live in a native Railway Bucket (S3-compatible — not AWS). For local dev and tests, a local file catalog stands in. This replaced an earlier Parquet-glob design, and the replacement is total: the old ObservabilityReader, the read_parquet glob, and union_by_name are deleted.

Two halves: producers write, consumers read

Section titled “Two halves: producers write, consumers read”

The write path and the read path are deliberately separate.

Producers call get_observability_writer().append(table, row) from hot paths. append is synchronous, cheap, and NEVER raises — a production hot path can emit telemetry without any risk of breaking on a warehouse problem. The writer buffers rows (default flush at 100 rows or every 30 seconds ) and a single daemon flush thread serializes them into the catalog via insert_arrow. One flush thread is not an accident — serializing writes sidesteps a known multithreaded-Python DuckLake+Postgres deadlock.

Consumers read through ObservabilityLake — a process-wide singleton (get_observability_lake()), one lock-guarded, attach-once connection per process, with heal-on-error. There is also a read_only attach mode (used by the local telemetry MCP analyst surface) that attaches the catalog READ_ONLY and skips schema creation.

The ObservabilityLake singleton — why attach-once

Section titled “The ObservabilityLake singleton — why attach-once”

The load-bearing design constraint is connection budget. DuckDB’s Postgres extension opens its own connection pool per attached database. The deleted glob reader created a fresh in-memory connection on every query() call (15+ call sites) — which would exhaust the per-service Postgres budgets the moment cross-deal reads got busy. The fix is a single shared attached connection per process, with the catalog’s Postgres pool capped at pg_pool_max_connections = 2 so each service process holds ~2 catalog connections, not dozens.

The lake heals on duckdb.Error: any error disposes the connection and the next call re-attaches (and re-runs CREATE TABLE IF NOT EXISTS for every registered table, so reads are safe before the first write). Tables are month(ts)-partitioned on first appearance, guarded so a partition spec doesn’t churn a fresh snapshot on every attach.

The registry — 15 tables, the source of truth

Section titled “The registry — 15 tables, the source of truth”

TABLE_SCHEMAS in src/observability/schemas.py is the registry and the source of truth. It currently holds 15 tables — among them deal_finalizations, retrieval_events, agent_discourse_events, user_activity_events, llm_cost_events, readiness_snapshots, pinecone_upsert_events, formula_graph_events, sql_invariant_findings, conflict_resolution_events, excel_zero_chunks_attribution, semantic_critique_events, and checkpoint_size_events. The count drifts upward as tables are added; _attach iterates the registry dynamically, so a new table is created and partitioned on the next attach without a migration.

Product-facing services that need cross-deal / analytical reads use the warehouse, not a fresh round of Postgres aggregation. AuditTrailService is the architectural precedent; about eleven services follow the same shape. The pattern is fixed:

  1. Stateless class, constructor (reader: Optional[ObservabilityLake] = None) defaulting to get_observability_lake() — the process-wide singleton, never a per-request connect.
  2. Parameter-bound SQL via reader.query(sql, params). Table and column names are hardcoded; every external value is bound through a ? placeholder.
  3. A _safe_query wrapper catches (duckdb.Error, Exception) and returns [], logging at DEBUG. Observability reads must NEVER break a route.
  4. ISO-stringify timestamps before returning — DuckDB returns datetime objects; FastAPI needs JSON-serializable output.

Most consumers are route-backed Canvas dashboards (PortfolioInsightsService, UserActivityInsightsService, AuditTrailService, …), but the pattern also runs on hot paths. NamespaceCountService reads SUM(chunks_uploaded) from pinecone_upsert_events as a ground-truth count during source-type recovery, and RetrievalHealthService aggregates retrieval_events per section on the worker’s live memo-generation path. Both are synchronous, shared-singleton, _safe_query-guarded reads — the live-path precedent for treating the warehouse as an active input, not just a dashboard backend.

PortfolioInsightsService surfaces only aggregate statistics — medians, counts, ranges — never an individual deal’s name, ID, or raw metrics. The privacy floor is N ≥ 3 deals before any statistic surfaces, and that guard lives in the Python service layer, not in a SQL HAVING clause. Every aggregation query returns a single row, and the service maps a below-threshold count to None (if deal_count < _MIN_DEALS_FOR_AGGREGATE: return None, repeated across each aggregation method). _MIN_DEALS_FOR_AGGREGATE is 3 .

This is intentional on two counts: it keeps the SQL portable (no version-sensitive HAVING semantics), and it makes the privacy guard explicit and auditable in code rather than buried in a query. The universal base filter underneath every aggregation is finalization_status = 'success' — degraded finalizations are recorded but excluded, because their numeric fields (cap rate, LTV, …) are unreliable.

Schema changes are native ALTER TABLE obs.<table> ADD COLUMN (run by ObservabilityLake). This is safe because old snapshots surface NULL for the new column, and DuckDB’s MEDIAN and AVG ignore NULL naturally — so a historical row still contributes to COUNT(*) but not to a median, which is the right semantic for aggregate stats. A contract test (tests/unit/observability/test_schema_additivity.py) locks this by doing an ADD COLUMN round-trip and asserting the prior row surfaces NULL.

What is forbidden: removing a column, retyping a column, or reusing a name for a different meaning. If you must change semantics, bump a new table name instead.

Memosa runs on Railway across two environments — production (main branch, custom memosa.io domains) and staging (staging branch, Railway-generated URLs). Both run the same code; branch→environment mapping is via deployment triggers.

Production runs four application services; staging runs a fifth, Evergreen (the nightly autonomous audit worker — production is intentionally asymmetric at four). Plus two Railway-managed infrastructure services.

Dashboard namerailway.json keyRole
APIwebBackend HTTP API — health endpoint, image serving, REST endpoints for external callers
LangGraphworkerSlack bot + LangGraph deal-analysis workflows (deep agents, RAG) — the heavy AI tier
Web AppcanvasCanvas document editor — FastAPI + ProseMirror, JWT/RBAC, brushes; runs WEB_CONCURRENCY=4
Marketing SitelandingPublic landing page + waitlist — lightweight uvicorn, Resend email
Evergreen (staging only)evergreenNightly P-series audit worker (python -m src.evergreen_worker.main); APScheduler fires midnight Pacific
Redis-Checkpointer(infra)LangGraph AsyncRedisSaver; Canvas pub/sub bridge, edit locks, presence, caches
Postgres(infra)Structured data via PGStoreClient; also hosts the DuckLake observability catalog

There is also a daily DuckLake Maintenance cron service that runs scripts/ducklake_maintenance.py — flush-inlined / merge / expire / cleanup — to materialize inlined rows to Parquet and bound catalog growth.

Railway terminates idle Postgres connections at roughly 60 seconds, and its usable connection ceiling is 97 across all services. Earlier, every service claimed a full ~94-connection budget independently, giving a worst case of 4 × 97 = 388 connections against a 97 ceiling. It only ever worked because min_size=0 (on-demand) plus admission control held the real peak far below capacity.

The current design gives each service a budgeted slice, resolved at startup from RAILWAY_SERVICE_NAME (Railway-provided service identity — immutable per service, so it is identity, not a config knob, and does not violate the no-env-var rule). The budgets are sized so asyncpg pools + DuckLake catalog connections stay under the ceiling:

Serviceasyncpg max_sizeDuckLake catalog connsTotal
worker (LangGraph)38 2 (single process)40
canvas (Web App)20 8 (4 workers × 2)28
web (API)15015
landing (Marketing Site)505
evergreen (staging only)505
DuckLake Maintenance cron~02 (transient daily)2
Grand total95 (≤ 97)

The canvas row is the subtle one. The DuckLake reader is a per-process singleton, and the Web App runs WEB_CONCURRENCY=4 . So Canvas holds 4 workers × 2 catalog connections = 8, not 2 — a single-worker assumption here undershot by 6 and pushed the real worst case to 101. The asyncpg budget was lowered to 20 (20 + 8 = 28 ≤ 30) to absorb it. worker is single-process (slack_bot.py), so its writer and NamespaceCountService share one lake — 2 catalog connections.

A second mechanism keeps per-process pools bounded under scale-up: _calculate_pool_max_size() divides the service budget by max(RAILWAY_REPLICA_COUNT, 1) × max(WEB_CONCURRENCY, 1), with a floor of 4 per process.

Retrieval runs against Pinecone serverless through a connection pool sized in PineconeConfig (src/config/pinecone_config.py). The values are production-validated and the frozen dataclass refuses construction below their floors:

  • pool_size = 96 . Raised through 50 → 72 → 96, each bump a response to pool exhaustion during cascading low-chunk retrievals at 80–93% utilization. Do not lower below 96 without production log evidence; __post_init__ enforces 1 ≤ pool_size ≤ 100.
  • retrieval_max_parallel = 16 and ingest_max_parallel = 16 . P36 split the formerly-shared max_parallel_index_queries=24 into separate retrieval and ingest semaphores so heavy upserts stop competing with analysis-phase reads on one semaphore. Their combined ceiling of 32 stays well under the admission threshold against a 96-slot pool; __post_init__ enforces both ≥ 16 and retrieval + ingest ≤ pool_size.
  • connection_timeout_secs = 10.0 (must stay below the 30s query-timeout ceiling) and upsert_timeout_secs = 20.0 (per attempt, 3 retries with exponential backoff).

The primary gate is not the pool size — it is the PoolAwareAdmissionController in src/vector_db/retrieval_executor.py, which checks real-time pool utilization before admitting a query and does graduated load-shedding. The hard-reject threshold is 88% utilization (admission_hard_reject_pct = 0.88 , source-diversity exempt) — lowered from 95% after a production cascade where the pool saturated 95→100% in a single surge; at 88% (84/96 active) the remaining 12 slots cover the source-diversity exemption and the recovery-priority bypass. Above 92% utilization, shielded recovery on cascade-cancelled sections is skipped, since spawning a recovery task during pool exhaustion only deepens starvation. The ordering of these thresholds is invariant-checked at construction.

How the pool feeds the retrieval stages and reranking is covered on the Retrieval Pipeline and Reranking pages.

  • src/observability/ducklake.pyObservabilityLake (attach-once singleton, pg_pool_max_connections=2, heal-on-error, read_only mode, month(ts) partitioning, run_maintenance)
  • src/observability/writer.pyObservabilityWriter.append (sync, never-raises) + the single daemon flush thread; get_observability_writer()
  • src/observability/schemas.pyTABLE_SCHEMAS registry (15 tables; native ADD COLUMN discipline in the module docstring)
  • src/observability/finalize_hook.pyrecord_deal_finalization() canonical write path (success + degraded)
  • src/observability/namespace_count_service.py, src/observability/retrieval_health_service.py — live-path singleton consumers (_safe_query, _MIN_SAMPLE)
  • src/canvas/services/portfolio_insights_service.py + portfolio_insights_queries.py — N≥3 anonymity guard in Python (_MIN_DEALS_FOR_AGGREGATE), finalization_status='success' filter
  • src/utils/production_capacity_config.py — per-service Postgres budgets (POSTGRES_PER_SERVICE_MAX_SIZE, POSTGRES_DUCKLAKE_RESERVED, POSTGRES_RAILWAY_USABLE_CEILING, _RAILWAY_SERVICE_NAME_TO_KEY)
  • src/config/pinecone_config.pyPineconeConfig (pool size, retrieval/ingest semaphores, admission thresholds, __post_init__ invariants)
  • railway.json — service definitions (Web App, Marketing Site, Evergreen, DuckLake Maintenance, …)
  • .claude/rules/10-domains.md, .claude/rules/20-patterns.md — Railway service table, the observability-warehouse consumer pattern, the per-service pool budgets
  • memory/observability_ducklake_migration.md, memory/observability_warehouse_consumers.md, memory/postgres_pool_event_loop_invariants.md — DuckLake migration, the consumer census, and the pool event-loop invariants