Sunday Night AI: SAM / DTSM Backend Architecture

SAM Backend Technical Build

SAM's moat is not a chatbot. It is a proprietary Downtown Santa Monica knowledge system: recorded merchant interviews, structured local facts, source-grounded recommendations, and live city/beach/transit data behind one visitor-facing answer layer.

Recommended default: build this as a source-grounded RAG + tools system on Postgres/Supabase + pgvector first, with every answer backed by source citations and freshness metadata. Add Pinecone only if vector scale/query patterns justify it later.
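As a sketch of what "source-grounded" means concretely, every answer could travel in an envelope that carries its citations and freshness rather than as bare text. Field names here are assumptions for illustration, not a spec:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Citation:
    source_type: str     # e.g. "owner_interview" | "city_data" | "noaa" | "bbb" | "dtsm"
    source_id: str       # pointer back to the interview segment or dataset row
    freshness: datetime  # when the underlying fact was last confirmed

@dataclass
class Answer:
    text: str
    citations: list[Citation] = field(default_factory=list)
    confidence: float = 0.0

    def is_grounded(self) -> bool:
        # Guardrail: never ship an answer with no sources behind it.
        return len(self.citations) > 0
```

The point of the envelope is that "no citations" becomes a machine-checkable condition the answer layer can refuse on, instead of a style guideline.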

1. Backend system map

Capture Layer

Audio/video from Chad's 70 interviews, consent metadata, business profiles, owner preferences, raw files.

Processing Layer

Transcription, speaker cleanup, topic segmentation, fact extraction, quote extraction, human review queue.

Knowledge Layer

Canonical merchant/place/event records + searchable chunks + embeddings + source provenance.

Live Data Layer

Connectors for GTFS/Big Blue Bus, NOAA water/weather, city events, parking, EV charging, etc.

Answer Layer

Query router, retrieval, tool calls, answer composer, guardrails, confidence, citations.

Ops Layer

Admin dashboard, analytics, feedback, content freshness, data-source health checks, alerting.

2. Core data model

| Table | Purpose | Key fields |
| --- | --- | --- |
| businesses | Canonical merchant/place entity. | name, slug, category, address, lat/lng, hours, phone, website, accessibility, price_band, tags, status |
| people | Owners, chefs, managers, interviewees. | name, role, business_id, bio, permissions, contact_private |
| interviews | Raw capture record. | business_id, people[], recording_url, transcript_url, date, consent, interviewer, processing_status |
| transcript_segments | Timestamped transcript chunks. | interview_id, start_sec, end_sec, speaker, text, topic_labels |
| knowledge_chunks | Vector-searchable proprietary knowledge. | entity_type, entity_id, text, embedding, source_id, source_type, freshness, visibility, confidence |
| claims | Structured facts extracted from interviews. | subject, predicate, object, qualifiers, source_segment_id, confidence, reviewed_by, expires_at |
| recommendation_edges | Graph-like local relationships. | from_entity, to_entity, relation_type, reason, source_segment_id |
| live_observations | Time-series cache from APIs. | source, observed_at, location, metric, value, units, raw_json |
| data_sources | Connector registry and health. | name, url, refresh_interval, last_success, last_error, ttl_seconds |
| conversations / messages | Usage analytics and QA. | session_id, query, answer, citations, tools_used, rating, created_at |
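The claims table is the backbone of source-grounded answers: one owner sentence decomposes into a subject/predicate/object row with provenance and an expiry. The values below are illustrative, not real data, and the 0.7 publish threshold is an assumed policy:

```python
# Hypothetical decomposition of an owner quote into a claims-table row:
# "We're quietest on Tuesday afternoons" -> structured, expiring fact.
claim = {
    "subject": "business:hypothetical-cafe",  # made-up slug for illustration
    "predicate": "quiet_period",
    "object": "tuesday_afternoon",
    "qualifiers": {"season": "all"},
    "source_segment_id": "seg_0042",          # points back at the transcript segment
    "confidence": 0.8,                        # extraction confidence, pending review
    "reviewed_by": None,                      # human review queue fills this in
    "expires_at": "2026-01-01",               # forces re-confirmation with the owner
}

def is_publishable(c: dict) -> bool:
    # Only reviewed, sufficiently confident claims become visible to SAM.
    return c["reviewed_by"] is not None and c["confidence"] >= 0.7
```

Because every claim keeps its source_segment_id, any answer built on it can cite the exact interview moment it came from.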

3. Interview ingestion pipeline

  1. Record: Chad records the interview on phone/Zoom/Granola. Store the raw file in Drive/S3/R2 with business + date naming.
  2. Transcribe: Whisper/Groq/OpenAI; preserve timestamps and speaker labels if possible.
  3. Normalize: clean filler, then segment by topic (menu, vibe, best time, hidden gems, founder story, kid-friendly, dietary, locals tip, avoid/edge cases).
  4. Extract claims: turn conversation into structured facts and quotable source chunks.
  5. Human review: lightweight approval screen for merchant-sensitive claims before public use.
  6. Embed + index: chunk by semantic topic, not arbitrary token windows. Store embedding with source pointer.
  7. Publish: mark approved chunks/claims as available to SAM, with freshness date and source citation.
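Step 6 (chunk by semantic topic, not arbitrary token windows) can be sketched as merging consecutive transcript segments that share a topic label, keeping a provenance pointer on every chunk. Field names follow the data model above; the segment shape is an assumption:

```python
from itertools import groupby

def chunk_by_topic(segments: list[dict]) -> list[dict]:
    """Merge consecutive transcript segments sharing a topic label into one
    knowledge chunk, preserving the source pointer and time range."""
    chunks = []
    for topic, group in groupby(segments, key=lambda s: s["topic_labels"][0]):
        group = list(group)
        chunks.append({
            "text": " ".join(s["text"] for s in group),
            "topic": topic,
            # Provenance: which interview and time range this chunk came from,
            # so the published chunk can carry a citation back to the recording.
            "source_id": group[0]["interview_id"],
            "start_sec": group[0]["start_sec"],
            "end_sec": group[-1]["end_sec"],
        })
    return chunks
```

Each chunk dict would then be embedded and written to knowledge_chunks with its source pointer intact.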

4. Live data connectors found in research

| Data need | Likely source | Implementation note |
| --- | --- | --- |
| City datasets | Santa Monica Open Data CKAN API | Portal exposes /api/3 and CKAN package/datastore endpoints. Useful datasets include Calendar Events 90401, Active Business Licenses, Parking Rates, EV Charging Stations. |
| Events | Calendar Events 90401 | CSV/datastore resource has title, start/end date, description, location, address, age groups, event types, detail URL. |
| Transit | Big Blue Bus GTFS + GTFS-Realtime | Static current.zip; realtime alerts.bin, tripupdates.bin, vehiclepositions.bin. Use a GTFS parser and a protobuf decoder. |
| Weather | National Weather Service API | No-key point API gives forecast and hourly forecast for DTSM coordinates. |
| Water temperature | NOAA station 9410840 / ICAC1 (Santa Monica Pier) | NOAA Tides & Currents latest water_temperature endpoint returned station metadata plus a 65.8°F sample. NDBC realtime text feed also available. |
| Beach advisories / water quality | LA County Public Health + Heal the Bay | Likely needs scrape/API investigation. High-value but not MVP-blocking unless DTSM wants beach-safety answers. |
| Parking availability | ParkMe / city parking app / DTSM direct access | Public pages show rates and app links; realtime occupancy appears to be partner/vendor data. Contract should require DTSM/city to provide official access, or accept fallback to static parking guidance. |
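Every connector above should honor its data_sources TTL so SAM never quotes stale conditions. A minimal freshness check over a cached live_observations row, using the ttl_seconds field from the data model (the row shape is an assumption):

```python
from __future__ import annotations
from datetime import datetime, timedelta, timezone

def is_fresh(observed_at: datetime, ttl_seconds: int,
             now: datetime | None = None) -> bool:
    """True if a cached live_observations row is still within its source's TTL."""
    now = now or datetime.now(timezone.utc)
    return now - observed_at <= timedelta(seconds=ttl_seconds)

def answer_metric(observation: dict, ttl_seconds: int) -> str:
    # Guardrail: if the cache is stale, say so instead of inventing a reading.
    if not is_fresh(observation["observed_at"], ttl_seconds):
        return "No current reading available."
    return f'{observation["metric"]}: {observation["value"]} {observation["units"]}'
```

The same check doubles as the source-health signal for the ops dashboard: a source whose latest observation keeps failing is_fresh is a connector that needs attention.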

5. Answer architecture

Visitor asks question
  ↓
Intent router: recommendation | factual | live condition | navigation | event | safety | unknown
  ↓
Retrieval plan:
  - Proprietary merchant/interview chunks
  - Structured claims / business profiles
  - Live tool calls if time-sensitive
  - City/event/transit/parking cache
  ↓
Answer composer:
  - cite source type: owner interview / city data / NOAA / BBB / DTSM
  - include confidence and freshness when useful
  - never invent live conditions
  - escalate / say unknown when source is missing
  ↓
Analytics + feedback log
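The intent router at the top of this flow can start as simple keyword rules and be swapped for a classifier later. The keywords below are illustrative, and naive substring matching is only good enough for a first pass:

```python
# Minimal rule-based intent router. "unknown" is a first-class outcome so the
# composer can escalate or say "I don't know" instead of guessing.
INTENT_KEYWORDS = {
    "live condition": ["right now", "water temp", "open now", "crowded"],
    "navigation": ["how do i get", "bus", "parking", "directions"],
    "event": ["event", "tonight", "this weekend", "happening"],
    "recommendation": ["best", "recommend", "where should", "good"],
}

def route_intent(query: str) -> str:
    q = query.lower()
    for intent, words in INTENT_KEYWORDS.items():
        # Naive substring match; a real router would tokenize or classify.
        if any(w in q for w in words):
            return intent
    return "unknown"
```

Routing order matters: time-sensitive intents are checked first so "best beach right now" triggers a live tool call rather than a static recommendation.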

6. MVP stack recommendation
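Per the recommended default above (Postgres/Supabase + pgvector), core retrieval can start as one SQL query over knowledge_chunks. Column names come from the data model; `<=>` is pgvector's cosine-distance operator, and the parameter style assumes a driver like psycopg:

```python
# Sketch: pgvector similarity search over approved, public knowledge chunks.
# Returned columns carry the provenance the answer composer needs for citations.
RETRIEVAL_SQL = """
SELECT text, source_id, source_type, freshness, confidence
FROM knowledge_chunks
WHERE visibility = 'public'
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s;
"""
```

Keeping retrieval this boring is the point of starting on Postgres: one database holds canonical records, vectors, and provenance, and Pinecone only enters if scale forces it.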

7. Moat design

The moat comes from turning interviews into a private, continuously maintained local knowledge graph. LLMs know generic Santa Monica. SAM knows that a specific chef recommends a specific order after 8pm, that a manager says Tuesdays are quiet, that a shop carries a hidden local item, and that DTSM has live event/parking/transit context today.

Proprietary input

70 recorded merchant interviews, not public web data.

Structured memory

Facts, recommendations, constraints, and relationships extracted into durable tables.

Freshness loop

Merchant updates + city APIs + unanswered-question review keep SAM current.

Source trust

Answers can say where they came from: owner interview, city feed, NOAA, BBB, DTSM.

8. Build sequence

  1. Week 1: schema, storage, admin skeleton, interview upload form, transcription pipeline.
  2. Week 2: extraction prompts, chunking/embedding, business profile pages, human review queue.
  3. Week 3: chat retrieval API, answer composer with citations, first live connectors: events, weather, NOAA water temp, BBB GTFS.
  4. Week 4: analytics, unanswered questions, source-health dashboard, pilot QR flow with 10–15 businesses.
  5. Weeks 5–8: scale interviews to 70, add parking if official access is available, harden guardrails, reporting dashboard for DTSM.

9. Immediate contract/data asks for DTSM

Research sources checked: Santa Monica Open Data portal/CKAN API, Big Blue Bus developer/GTFS pages, NOAA/NDBC Santa Monica Pier station, NWS point API, DTSM parking info, LA County/Heal the Bay search results.