# SAM Backend Technical Build
SAM's moat is not a chatbot. It is a proprietary Downtown Santa Monica knowledge system: recorded merchant interviews, structured local facts, source-grounded recommendations, and live city/beach/transit data behind one visitor-facing answer layer.
## 1. Backend system map

- **Capture layer:** Audio/video from Chad's 70 interviews, consent metadata, business profiles, owner preferences, raw files.
- **Processing layer:** Transcription, speaker cleanup, topic segmentation, fact extraction, quote extraction, human review queue.
- **Knowledge layer:** Canonical merchant/place/event records plus searchable chunks, embeddings, and source provenance.
- **Live data layer:** Connectors for Big Blue Bus GTFS, NOAA water/weather, city events, parking, EV charging, etc.
- **Answer layer:** Query router, retrieval, tool calls, answer composer, guardrails, confidence, citations.
- **Ops layer:** Admin dashboard, analytics, feedback, content freshness, data-source health checks, alerting.
## 2. Core data model

| Table | Purpose | Key fields |
|---|---|---|
| `businesses` | Canonical merchant/place entity | `name`, `slug`, `category`, `address`, `lat`/`lng`, `hours`, `phone`, `website`, `accessibility`, `price_band`, `tags`, `status` |
| `people` | Owners, chefs, managers, interviewees | `name`, `role`, `business_id`, `bio`, `permissions`, `contact_private` |
| `interviews` | Raw capture record | `business_id`, `people[]`, `recording_url`, `transcript_url`, `date`, `consent`, `interviewer`, `processing_status` |
| `transcript_segments` | Timestamped transcript chunks | `interview_id`, `start_sec`, `end_sec`, `speaker`, `text`, `topic_labels` |
| `knowledge_chunks` | Vector-searchable proprietary knowledge | `entity_type`, `entity_id`, `text`, `embedding`, `source_id`, `source_type`, `freshness`, `visibility`, `confidence` |
| `claims` | Structured facts extracted from interviews | `subject`, `predicate`, `object`, `qualifiers`, `source_segment_id`, `confidence`, `reviewed_by`, `expires_at` |
| `recommendation_edges` | Graph-like local relationships | `from_entity`, `to_entity`, `relation_type`, `reason`, `source_segment_id` |
| `live_observations` | Time-series cache from APIs | `source`, `observed_at`, `location`, `metric`, `value`, `units`, `raw_json` |
| `data_sources` | Connector registry and health | `name`, `url`, `refresh_interval`, `last_success`, `last_error`, `ttl_seconds` |
| `conversations` / `messages` | Usage analytics and QA | `session_id`, `query`, `answer`, `citations`, `tools_used`, `rating`, `created_at` |
## 3. Interview ingestion pipeline

- **Record:** Chad records the interview via phone, Zoom, or Granola. Store the raw file in Drive/S3/R2 with business-plus-date naming.
- **Transcribe:** Whisper/Groq/OpenAI; preserve timestamps and speaker labels where possible.
- **Normalize:** Clean filler words and segment by topic (menu, vibe, best time, hidden gems, founder story, kid-friendliness, dietary needs, locals' tips, cases to avoid).
- **Extract claims:** Turn conversation into structured facts and quotable source chunks.
- **Human review:** Lightweight approval screen for merchant-sensitive claims before public use.
- **Embed + index:** Chunk by semantic topic, not arbitrary token windows. Store each embedding with a source pointer.
- **Publish:** Mark approved chunks/claims as available to SAM, with a freshness date and source citation.
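The "chunk by semantic topic" step can be sketched as merging consecutive transcript segments that share a topic label, rather than cutting at arbitrary token windows. This is a simplified illustration (segment dicts mirror the `transcript_segments` fields; the sample text is invented):

```python
def chunk_by_topic(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments with the same leading topic label
    into one chunk, carrying the combined time span."""
    chunks: list[dict] = []
    for seg in segments:
        topic = seg["topic_labels"][0] if seg["topic_labels"] else "misc"
        if chunks and chunks[-1]["topic"] == topic:
            chunks[-1]["text"] += " " + seg["text"]
            chunks[-1]["end_sec"] = seg["end_sec"]
        else:
            chunks.append({
                "topic": topic,
                "text": seg["text"],
                "start_sec": seg["start_sec"],
                "end_sec": seg["end_sec"],
            })
    return chunks

segs = [
    {"topic_labels": ["menu"], "text": "Try the birria.", "start_sec": 0, "end_sec": 6},
    {"topic_labels": ["menu"], "text": "It sells out early.", "start_sec": 6, "end_sec": 11},
    {"topic_labels": ["locals_tip"], "text": "Come on a weekday.", "start_sec": 11, "end_sec": 15},
]
print(len(chunk_by_topic(segs)))  # 2  (one "menu" chunk, one "locals_tip" chunk)
```

A production version would also cap chunk length and split multi-label segments, but the principle is the same: chunk boundaries follow topics.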
## 4. Live data connectors found in research
| Data need | Likely source | Implementation note |
|---|---|---|
| City datasets | Santa Monica Open Data CKAN API | Portal exposes /api/3 and CKAN package/datastore endpoints. Useful datasets include Calendar Events 90401, Active Business Licenses, Parking Rates, EV Charging Stations. |
| Events | Calendar Events 90401 | CSV/datastore resource has title, start/end date, description, location, address, age groups, event types, detail URL. |
| Transit | Big Blue Bus GTFS + GTFS-Realtime | Static current.zip; realtime alerts.bin, tripupdates.bin, vehiclepositions.bin. Use GTFS parser and protobuf decoder. |
| Weather | National Weather Service API | No-key point API gives forecast and hourly forecast for DTSM coordinates. |
| Water temperature | NOAA station 9410840 / ICAC1 Santa Monica Pier | NOAA Tides & Currents latest water_temperature endpoint returned station metadata + 65.8°F sample. NDBC realtime text feed also available. |
| Beach advisories / water quality | LA County Public Health + Heal the Bay | Likely scrape/API investigation needed. Treat as high-value but not MVP-blocking unless DTSM wants beach safety answers. |
| Parking availability | ParkMe / city parking app / DTSM direct access | Public pages show rates and app links; realtime occupancy appears to be partner/vendor data. The contract should require DTSM/the city to provide official access, or accept a fallback to static parking guidance. |
## 5. Answer architecture

```text
Visitor asks question
        ↓
Intent router: recommendation | factual | live condition | navigation | event | safety | unknown
        ↓
Retrieval plan:
  - Proprietary merchant/interview chunks
  - Structured claims / business profiles
  - Live tool calls if time-sensitive
  - City/event/transit/parking cache
        ↓
Answer composer:
  - Cite source type: owner interview / city data / NOAA / BBB / DTSM
  - Include confidence and freshness when useful
  - Never invent live conditions
  - Escalate or say "unknown" when the source is missing
        ↓
Analytics + feedback log
```
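The intent router's first pass can be as simple as a keyword classifier that also flags whether the retrieval plan needs live tool calls. This is a hedged sketch with made-up keyword lists; a real router would likely use a cheap LLM call, with keywords only as a fast path:

```python
LIVE_INTENTS = {"live_condition", "navigation", "event"}

KEYWORDS = {
    "recommendation": ("recommend", "best", "where should"),
    "live_condition": ("right now", "water temp", "open now", "parking"),
    "navigation": ("how do i get", "bus", "route"),
    "event": ("tonight", "this weekend", "happening"),
    "factual": ("hours", "address", "phone"),
}

def route(query: str) -> tuple[str, bool]:
    """Return (intent, needs_live_tools). First matching
    category wins; unmatched queries fall through to 'unknown'."""
    q = query.lower()
    for intent, words in KEYWORDS.items():
        if any(w in q for w in words):
            return intent, intent in LIVE_INTENTS
    return "unknown", False

print(route("What's the water temp right now?"))     # ('live_condition', True)
print(route("Where should I take kids for lunch?"))  # ('recommendation', False)
```

The `needs_live_tools` flag is what gates tool calls: a pure recommendation query never touches the connector cache, which keeps most answers fast and cheap.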
## 6. MVP stack recommendation
- App/API: Next.js or FastAPI. If frontend is Next, API routes are fine for MVP; graduate to FastAPI if pipeline/workers grow.
- Database: Supabase Postgres with pgvector for one-system simplicity: entities, chunks, embeddings, analytics, admin.
- Object storage: Supabase Storage, S3, or Cloudflare R2 for recordings/transcripts.
- Workers: Inngest, Trigger.dev, or simple cron/queue for transcription, extraction, and live connector refreshes.
- Embeddings: OpenAI text-embedding-3-small/large or equivalent. Store embedding version per chunk.
- LLM: Route by task; use a cheap model for extraction/classification and a stronger model for final answer composition.
- Admin: Internal dashboard for interview status, claim review, source health, and top unanswered visitor questions.
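With pgvector, retrieval is a single `ORDER BY embedding <=> query_vec` query in Postgres. As an in-memory stand-in that makes the ranking logic visible (toy 3-d vectors, not real embeddings; names are illustrative):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, matching pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def top_k(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks nearest the query vector."""
    return sorted(chunks, key=lambda c: cosine_distance(query_vec, c["embedding"]))[:k]

chunks = [
    {"id": "owner-tip-1", "embedding": [0.9, 0.1, 0.0]},
    {"id": "city-fact-7", "embedding": [0.0, 1.0, 0.2]},
    {"id": "owner-tip-2", "embedding": [0.8, 0.2, 0.1]},
]
print([c["id"] for c in top_k([1.0, 0.0, 0.0], chunks)])
# ['owner-tip-1', 'owner-tip-2']
```

Storing the embedding version per chunk (as recommended above) matters here: distances are only meaningful between vectors from the same model, so a query must filter to chunks embedded with the same version.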
## 7. Moat design

The moat comes from turning interviews into a private, continuously maintained local knowledge graph. LLMs know generic Santa Monica. SAM knows that a specific chef recommends a specific order after 8pm, that a manager says Tuesdays are quiet, that a shop carries a hidden local item, and that DTSM has live event/parking/transit context today.

- **Proprietary input:** 70 recorded merchant interviews, not public web data.
- **Structured memory:** Facts, recommendations, constraints, and relationships extracted into durable tables.
- **Freshness loop:** Merchant updates, city APIs, and unanswered-question review keep SAM current.
- **Source trust:** Answers can say where they came from: owner interview, city feed, NOAA, BBB, DTSM.
## 8. Build sequence
- Week 1: schema, storage, admin skeleton, interview upload form, transcription pipeline.
- Week 2: extraction prompts, chunking/embedding, business profile pages, human review queue.
- Week 3: chat retrieval API, answer composer with citations, first live connectors: events, weather, NOAA water temp, BBB GTFS.
- Week 4: analytics, unanswered questions, source-health dashboard, pilot QR flow with 10–15 businesses.
- Weeks 5–8: scale interviews to 70, add parking if official access is available, harden guardrails, reporting dashboard for DTSM.
## 9. Immediate contract/data asks for DTSM
- Official permission to use DTSM/city branding and names where needed.
- Access path for parking realtime occupancy or vendor relationship with ParkMe/city parking systems.
- Event calendar source of truth and refresh expectations.
- Merchant intro list, consent language, and approval rules for publishing interview-derived tips.
- Clarify whether beach safety/water quality is in MVP or phase 2.
Research sources checked: Santa Monica Open Data portal/CKAN API, Big Blue Bus developer/GTFS pages, NOAA/NDBC Santa Monica Pier station, NWS point API, DTSM parking info, LA County/Heal the Bay search results.