# SAM Backend Technical Build
SAM's moat is not a chatbot. It is a proprietary Downtown Santa Monica knowledge system: recorded merchant interviews, structured local facts, source-grounded recommendations, and live city/beach/transit data behind one visitor-facing answer layer.
## 1. Backend system map

- **Capture layer:** Audio/video from Chad's 70 interviews, consent metadata, business profiles, owner preferences, raw files.
- **Processing layer:** Transcription, speaker cleanup, topic segmentation, fact extraction, quote extraction, human review queue.
- **Knowledge layer:** Canonical merchant/place/event records plus searchable chunks, embeddings, and source provenance.
- **Live data layer:** Connectors for Big Blue Bus GTFS, NOAA water/weather, city events, parking, EV charging, etc.
- **Answer layer:** Query router, retrieval, tool calls, answer composer, guardrails, confidence, citations.
- **Ops layer:** Admin dashboard, analytics, feedback, content freshness, data-source health checks, alerting.
## 2. Core data model

| Table | Purpose | Key fields |
|---|---|---|
| `businesses` | Canonical merchant/place entity | `name`, `slug`, `category`, `address`, `lat`/`lng`, `hours`, `phone`, `website`, `accessibility`, `price_band`, `tags`, `status` |
| `people` | Owners, chefs, managers, interviewees | `name`, `role`, `business_id`, `bio`, `permissions`, `contact_private` |
| `interviews` | Raw capture record | `business_id`, `people[]`, `recording_url`, `transcript_url`, `date`, `consent`, `interviewer`, `processing_status` |
| `transcript_segments` | Timestamped transcript chunks | `interview_id`, `start_sec`, `end_sec`, `speaker`, `text`, `topic_labels` |
| `knowledge_chunks` | Vector-searchable proprietary knowledge | `entity_type`, `entity_id`, `text`, `embedding`, `source_id`, `source_type`, `freshness`, `visibility`, `confidence` |
| `claims` | Structured facts extracted from interviews | `subject`, `predicate`, `object`, `qualifiers`, `source_segment_id`, `confidence`, `reviewed_by`, `expires_at` |
| `recommendation_edges` | Graph-like local relationships | `from_entity`, `to_entity`, `relation_type`, `reason`, `source_segment_id` |
| `live_observations` | Time-series cache from APIs | `source`, `observed_at`, `location`, `metric`, `value`, `units`, `raw_json` |
| `data_sources` | Connector registry and health | `name`, `url`, `refresh_interval`, `last_success`, `last_error`, `ttl_seconds` |
| `conversations` / `messages` | Usage analytics and QA | `session_id`, `query`, `answer`, `citations`, `tools_used`, `rating`, `created_at` |
## 3. Interview ingestion pipeline

- **Record:** Chad records the interview via phone, Zoom, or Granola. Store the raw file in Drive/S3/R2 with business-plus-date naming.
- **Transcribe:** Whisper/Groq/OpenAI; preserve timestamps and speaker labels where possible.
- **Normalize:** Clean filler words and segment by topic (menu, vibe, best time, hidden gems, founder story, kid-friendliness, dietary needs, locals' tips, cases to avoid).
- **Extract claims:** Turn conversation into structured facts and quotable source chunks.
- **Human review:** Lightweight approval screen for merchant-sensitive claims before public use.
- **Embed + index:** Chunk by semantic topic, not arbitrary token windows. Store each embedding with a source pointer.
- **Publish:** Mark approved chunks/claims as available to SAM, with a freshness date and source citation.
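The "chunk by semantic topic" step can be sketched as merging consecutive transcript segments that share a topic label, rather than cutting at arbitrary token windows. This is a simplified illustration (segment dicts mirror the `transcript_segments` fields; the sample text is invented):

```python
def chunk_by_topic(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments with the same leading topic label
    into one chunk, carrying the combined time span."""
    chunks: list[dict] = []
    for seg in segments:
        topic = seg["topic_labels"][0] if seg["topic_labels"] else "misc"
        if chunks and chunks[-1]["topic"] == topic:
            chunks[-1]["text"] += " " + seg["text"]
            chunks[-1]["end_sec"] = seg["end_sec"]
        else:
            chunks.append({
                "topic": topic,
                "text": seg["text"],
                "start_sec": seg["start_sec"],
                "end_sec": seg["end_sec"],
            })
    return chunks

segs = [
    {"topic_labels": ["menu"], "text": "Try the birria.", "start_sec": 0, "end_sec": 6},
    {"topic_labels": ["menu"], "text": "It sells out early.", "start_sec": 6, "end_sec": 11},
    {"topic_labels": ["locals_tip"], "text": "Come on a weekday.", "start_sec": 11, "end_sec": 15},
]
print(len(chunk_by_topic(segs)))  # 2  (one "menu" chunk, one "locals_tip" chunk)
```

A production version would also cap chunk length and split multi-label segments, but the principle is the same: chunk boundaries follow topics.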
## 4. Live data connectors found in research
| Data need | Likely source | Implementation note |
|---|---|---|
| City datasets | Santa Monica Open Data CKAN API | Portal exposes /api/3 and CKAN package/datastore endpoints. Useful datasets include Calendar Events 90401, Active Business Licenses, Parking Rates, EV Charging Stations. |
| Events | Calendar Events 90401 | CSV/datastore resource has title, start/end date, description, location, address, age groups, event types, detail URL. |
| Transit | Big Blue Bus GTFS + GTFS-Realtime | Static current.zip; realtime alerts.bin, tripupdates.bin, vehiclepositions.bin. Use GTFS parser and protobuf decoder. |
| Weather | National Weather Service API | No-key point API gives forecast and hourly forecast for DTSM coordinates. |
| Water temperature | NOAA station 9410840 / ICAC1 Santa Monica Pier | NOAA Tides & Currents latest water_temperature endpoint returned station metadata + 65.8°F sample. NDBC realtime text feed also available. |
| Beach advisories / water quality | LA County Public Health + Heal the Bay | Likely scrape/API investigation needed. Treat as high-value but not MVP-blocking unless DTSM wants beach safety answers. |
| Parking availability | ParkMe / city parking app / DTSM direct access | Public pages show rates and app links; realtime occupancy appears to be partner/vendor data. The contract should require DTSM/the city to provide official access, or accept a fallback to static parking guidance. |
## 5. Answer architecture

```text
Visitor asks question
        ↓
Intent router: recommendation | factual | live condition | navigation | event | safety | unknown
        ↓
Retrieval plan:
  - Proprietary merchant/interview chunks
  - Structured claims / business profiles
  - Live tool calls if time-sensitive
  - City/event/transit/parking cache
        ↓
Answer composer:
  - Cite source type: owner interview / city data / NOAA / BBB / DTSM
  - Include confidence and freshness when useful
  - Never invent live conditions
  - Escalate or say "unknown" when the source is missing
        ↓
Analytics + feedback log
```
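The intent router's first pass can be as simple as a keyword classifier that also flags whether the retrieval plan needs live tool calls. This is a hedged sketch with made-up keyword lists; a real router would likely use a cheap LLM call, with keywords only as a fast path:

```python
LIVE_INTENTS = {"live_condition", "navigation", "event"}

KEYWORDS = {
    "recommendation": ("recommend", "best", "where should"),
    "live_condition": ("right now", "water temp", "open now", "parking"),
    "navigation": ("how do i get", "bus", "route"),
    "event": ("tonight", "this weekend", "happening"),
    "factual": ("hours", "address", "phone"),
}

def route(query: str) -> tuple[str, bool]:
    """Return (intent, needs_live_tools). First matching
    category wins; unmatched queries fall through to 'unknown'."""
    q = query.lower()
    for intent, words in KEYWORDS.items():
        if any(w in q for w in words):
            return intent, intent in LIVE_INTENTS
    return "unknown", False

print(route("What's the water temp right now?"))     # ('live_condition', True)
print(route("Where should I take kids for lunch?"))  # ('recommendation', False)
```

The `needs_live_tools` flag is what gates tool calls: a pure recommendation query never touches the connector cache, which keeps most answers fast and cheap.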
## 6. MVP stack recommendation
- App/API: Next.js or FastAPI. If frontend is Next, API routes are fine for MVP; graduate to FastAPI if pipeline/workers grow.
- Database: Supabase Postgres with pgvector for one-system simplicity: entities, chunks, embeddings, analytics, admin.
- Object storage: Supabase Storage, S3, or Cloudflare R2 for recordings/transcripts.
- Workers: Inngest, Trigger.dev, or simple cron/queue for transcription, extraction, and live connector refreshes.
- Embeddings: OpenAI text-embedding-3-small/large or equivalent. Store embedding version per chunk.
- LLM: Route by task; use a cheap model for extraction/classification and a stronger model for final answer composition.
- Admin: Internal dashboard for interview status, claim review, source health, and top unanswered visitor questions.
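With pgvector, retrieval is a single `ORDER BY embedding <=> query_vec` query in Postgres. As an in-memory stand-in that makes the ranking logic visible (toy 3-d vectors, not real embeddings; names are illustrative):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, matching pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def top_k(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks nearest the query vector."""
    return sorted(chunks, key=lambda c: cosine_distance(query_vec, c["embedding"]))[:k]

chunks = [
    {"id": "owner-tip-1", "embedding": [0.9, 0.1, 0.0]},
    {"id": "city-fact-7", "embedding": [0.0, 1.0, 0.2]},
    {"id": "owner-tip-2", "embedding": [0.8, 0.2, 0.1]},
]
print([c["id"] for c in top_k([1.0, 0.0, 0.0], chunks)])
# ['owner-tip-1', 'owner-tip-2']
```

Storing the embedding version per chunk (as recommended above) matters here: distances are only meaningful between vectors from the same model, so a query must filter to chunks embedded with the same version.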
## 7. Moat design

The moat comes from turning interviews into a private, continuously maintained local knowledge graph. LLMs know generic Santa Monica. SAM knows that a specific chef recommends a specific order after 8pm, that a manager says Tuesdays are quiet, that a shop carries a hidden local item, and that DTSM has live event/parking/transit context today.

- **Proprietary input:** 70 recorded merchant interviews, not public web data.
- **Structured memory:** Facts, recommendations, constraints, and relationships extracted into durable tables.
- **Freshness loop:** Merchant updates, city APIs, and unanswered-question review keep SAM current.
- **Source trust:** Answers can say where they came from: owner interview, city feed, NOAA, BBB, DTSM.
## 8. Build sequence
- Week 1: schema, storage, admin skeleton, interview upload form, transcription pipeline.
- Week 2: extraction prompts, chunking/embedding, business profile pages, human review queue.
- Week 3: chat retrieval API, answer composer with citations, first live connectors: events, weather, NOAA water temp, BBB GTFS.
- Week 4: analytics, unanswered questions, source-health dashboard, pilot QR flow with 10–15 businesses.
- Weeks 5–8: scale interviews to 70, add parking if official access is available, harden guardrails, reporting dashboard for DTSM.
## 9. Immediate contract/data asks for DTSM
- Official permission to use DTSM/city branding and names where needed.
- Access path for parking realtime occupancy or vendor relationship with ParkMe/city parking systems.
- Event calendar source of truth and refresh expectations.
- Merchant intro list, consent language, and approval rules for publishing interview-derived tips.
- Clarify whether beach safety/water quality is in MVP or phase 2.
Research sources checked: Santa Monica Open Data portal/CKAN API, Big Blue Bus developer/GTFS pages, NOAA/NDBC Santa Monica Pier station, NWS point API, DTSM parking info, LA County/Heal the Bay search results.