Production RAG for places that can’t afford to bluff.
A peer-to-peer note for the AI Dev 26 audience. We ship grounded, auditable AI for government — where a hallucinated answer is a liability event, not a tweet. Here’s how we approach it, minus the parts we’d rather not hand to a competitor.
The constraint shapes the system
A government assistant can’t treat “don’t know” as a failure state. Refusing to answer when the official sources don’t support a claim is the most important thing the system does. Everything else — retrieval quality, synthesis style, latency budget, ops — serves that one behavior.
We build managed Retrieval-Augmented Generation on Google Vertex AI Search and Gemini, deployed per-agency on Cloud Run. The interesting part isn’t the model choice. The interesting parts are how we structured retrieval, how we layered safety, how we treat audit logs, and how we draw tenant boundaries. Those four choices are what make the product defensible to a city attorney — and what we’d enjoy comparing notes on at the conference.
Tiered retrieval beats single-shot RAG
Our first production architecture used a managed end-to-end answer endpoint — one opaque call that did retrieval and synthesis together. It was easy to wire up. It was the wrong abstraction.
What we hit
- A throughput ceiling we couldn’t engineer around. The bundled answer endpoint enforced a project-wide LLM-requests-per-minute quota that sat well below our target concurrency. A staged load test failed at a small fraction of expected traffic.
- No way to express precedence. A city’s charter overrides its municipal code, which overrides council reports, which overrides the website. A flat retriever can’t represent that. Treating documents as a single pool quietly let informational sources dilute authoritative ones.
- Cost we couldn’t shape. We were paying synthesis prices for retrieval we couldn’t inspect.
What we replaced it with
A tiered pipeline that fans out parallel retrieval-only calls, one per source tier, and synthesizes a single precedence-aware answer with a small, fast Gemini model.
question
  ├─ search(charter)           ┐
  ├─ search(ordinances)        ├─ parallel, retrieval-only
  ├─ search(council records)   │
  └─ search(website)           ┘
                 │
   per-tier dedupe + adaptive
   relevance cutoff (score gap)
                 │
   Flash-Lite synthesis with a
   precedence-aware system prompt
                 │
   one answer, tiered citations

Three things made this work in practice:
- Per-tier dedupe. Multiple chunks from the same parent document collapse into one entry whose context is the union of its siblings. The synthesis prompt sees fewer, richer sources.
- Adaptive cutoff, not top-K. A hard floor drops confidently irrelevant chunks, then a gap detector cuts at the natural cliff in the relevance distribution. Top-K is a guess; gap detection is data (both steps are sketched after this list).
- Precedence in the prompt, not the retriever. The retriever returns tiered evidence; the synthesis model is told, in plain language, what overrides what. Citations are emitted with their tier, so the rendered answer can show authority visually.
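A minimal sketch of the dedupe and cutoff steps, assuming chunks arrive as dicts with parent_id, score, and text fields; the floor and gap values here are illustrative, not our production thresholds.

```python
from collections import defaultdict
from typing import Dict, List

def dedupe_by_parent(chunks: List[Dict]) -> List[Dict]:
    """Collapse sibling chunks from the same parent document into one entry
    whose context is the union of the siblings' text."""
    by_parent: Dict[str, List[Dict]] = defaultdict(list)
    for chunk in chunks:
        by_parent[chunk["parent_id"]].append(chunk)
    return [
        {
            "parent_id": parent_id,
            "score": max(c["score"] for c in siblings),
            "text": "\n\n".join(c["text"] for c in siblings),
        }
        for parent_id, siblings in by_parent.items()
    ]

def adaptive_cutoff(chunks: List[Dict], floor: float = 0.3, gap: float = 0.15) -> List[Dict]:
    """Drop confidently irrelevant chunks below a hard floor, then cut at the
    first large gap in the sorted score distribution instead of a fixed top-K."""
    kept = sorted((c for c in chunks if c["score"] >= floor),
                  key=lambda c: c["score"], reverse=True)
    for i in range(1, len(kept)):
        if kept[i - 1]["score"] - kept[i]["score"] >= gap:
            return kept[:i]   # cut at the natural cliff in relevance
    return kept               # no cliff: keep everything above the floor
```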
Net effect: parallel retrieval cleared the throughput ceiling, dedupe and cutoff cut token spend, and the precedence prompt made the answers feel correct in a way the single-shot pipeline never did.
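The fan-out itself is ordinary parallelism plus a synthesis prompt that states precedence in plain language. In this sketch the tier names, the search_tier stub, and the prompt wording are illustrative; the per-tier call is whatever retrieval-only endpoint you use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Source tiers in precedence order: charter overrides ordinances, which
# override council records, which override the website.
TIERS = ["charter", "ordinances", "council_records", "website"]

def search_tier(tier: str, question: str) -> List[Dict]:
    """Retrieval-only call against that tier's search datastore (wiring elided)."""
    raise NotImplementedError

def retrieve(question: str) -> Dict[str, List[Dict]]:
    """Fan out one retrieval-only call per tier in parallel; synthesis happens once, later."""
    with ThreadPoolExecutor(max_workers=len(TIERS)) as pool:
        futures = {tier: pool.submit(search_tier, tier, question) for tier in TIERS}
        return {tier: future.result() for tier, future in futures.items()}

def synthesis_prompt(question: str, evidence: Dict[str, List[Dict]]) -> str:
    """Precedence lives here, in plain language, not in the retriever."""
    lines = [
        "Answer using only the numbered sources below.",
        "If tiers conflict: the charter overrides ordinances, ordinances override",
        "council records, and council records override the website.",
        "Cite every claim with its source number and tier.",
        "If the sources do not support an answer, say so and refuse.",
        "",
        f"Question: {question}",
    ]
    n = 0
    for tier in TIERS:
        for source in evidence.get(tier, []):
            n += 1
            lines.append(f"[{n}] (tier: {tier}) {source['text']}")
    return "\n".join(lines)
```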
Defense in depth, not single-shot safety
Treating the model’s built-in safety as your safety layer is a category error. No single layer is sufficient: every layer misses something, every layer drifts, and the failure modes you care about (prompt injection, persona hijacking, self-harm signals) are easy to phrase in ways any one classifier will miss.
We compose four independent layers, in order, and a query blocked at any layer never reaches the next one:
- Regex input sanitizer. Sub-millisecond, deterministic, anchored to the small set of patterns where false positives are unacceptable. The trick is anchoring on first-person grammar so informational queries about hard topics aren’t blocked.
- Semantic guard. A pre-computed catalog of curated adversarial and crisis examples, embedded once at build time, baked into the image. At request time we embed the user query and compare it against the catalog by cosine similarity. No serving infrastructure, no numpy, no surprises (sketched after this list).
- Vertex’s adversarial classification. Catches what the static layers miss, with the inverse failure mode — non-deterministic, but trained on a much wider distribution.
- System-prompt hardening. The synthesis prompt is the last line of defense, not the first. It tells the model to ignore embedded instructions, refuse out-of-scope answers, and cite numerically with a fixed format. We trust this least and treat it as a backstop.
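To make the second layer concrete: at request time it reduces to plain-Python cosine similarity against a catalog baked into the image at build time. In this sketch the catalog path, entry shape, and threshold are assumptions, and embedding the incoming query is assumed to happen upstream.

```python
import json
import math
from typing import List, Optional

# Curated adversarial/crisis examples, embedded once at build time and baked
# into the container image. Path and entry shape are illustrative.
with open("/app/semantic_guard_catalog.json") as f:
    CATALOG: List[dict] = json.load(f)  # [{"label": "...", "vector": [...]}, ...]

def cosine(a: List[float], b: List[float]) -> float:
    """Plain-Python cosine similarity: no numpy, no serving infrastructure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_guard(query_vector: List[float], threshold: float = 0.80) -> Optional[str]:
    """Return the label of the closest catalog entry if it clears the threshold,
    which routes the query to a refusal; otherwise None and the next layer runs."""
    best_label, best_score = None, 0.0
    for entry in CATALOG:
        score = cosine(query_vector, entry["vector"])
        if score > best_score:
            best_label, best_score = entry["label"], score
    return best_label if best_score >= threshold else None
```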
The shape we keep coming back to: deterministic where it’s cheap, probabilistic where it must be, and never alone. Refusals and safety routes feed the same audit pipeline as normal answers, so “what we declined to answer” is a first-class observable, not a silent log line.
The audit log is the product
For us this is the part that’s most under-discussed at AI conferences. In a consumer chat product, logs are operational telemetry. In a public-sector chat product, every interaction is a public record from the moment it’s created. That single fact changes the whole shape of the system.
- Records-shaped, not log-shaped. Each interaction is structured for archive at write time: timestamp, agency, request and response language, the resident’s question and its translation, the assistant’s answer and its translation, the citations used, and any refusal or safety-route decision. No after-the-fact ETL.
- Ownership lives with the customer. Records land in a dedicated dataset in the agency’s own GCP project — not a shared multi-tenant store. The customer can SQL it directly.
- Delivery, not retention. Each month a structured report ships to the agency’s records officer, formatted for direct archive against their existing retention schedule. We hold a fixed two-month rolling window so a delivery can be re-issued, then permanently purge. There is no long-term archive on our side waiting to leak.
- Public transparency dashboard. The same audit log powers a public dashboard at {agency}.agencychat.ai/dashboard showing real volume, categories, citation rates, and refusal rates. Live proof beats a slide deck.
The refactor that unlocks this: stop thinking of logging as observability and start thinking of it as the primary write path of the system. The chat answer is one consumer of that record. The dashboard is another. The monthly delivery is a third. None of them are afterthoughts.
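A minimal sketch of that write path, assuming a BigQuery dataset in the agency's own project; the table name and exact field names are illustrative rather than our production schema.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional

from google.cloud import bigquery  # dataset lives in the agency's own GCP project

@dataclass
class InteractionRecord:
    """One interaction, structured for archive at write time, not by after-the-fact ETL."""
    timestamp: str
    agency: str
    request_language: str
    response_language: str
    question: str
    question_translation: Optional[str]
    answer: str
    answer_translation: Optional[str]
    citations: List[Dict]           # e.g. [{"source": "...", "tier": "charter"}]
    refusal_reason: Optional[str]   # populated on refusals and safety routes
    safety_route: Optional[str]

def write_record(client: bigquery.Client, table_id: str, record: InteractionRecord) -> None:
    """The primary write path: the chat response, the public dashboard, and the
    monthly records delivery are all downstream consumers of this one row."""
    errors = client.insert_rows_json(table_id, [asdict(record)])
    if errors:
        raise RuntimeError(f"audit record write failed: {errors}")
```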
Tenancy is GCP projects, not rows
We don’t do row-level multi-tenancy. Each agency gets its own GCP project — isolated IAM, isolated billing, isolated BigQuery datasets, isolated GCS buckets. The backend, worker, frontend, and search infrastructure for one agency never share runtime resources with another’s.
This sounds heavy and isn’t. Three reasons we’d make this trade again:
- The threat model is the marketing. Procurement keeps asking “is our data ever in the same database as another customer’s?” The honest answer is “no, never” and that’s easier than “yes, but our row scoping is correct.”
- Vendor exit is built in. The customer’s documents, indexed corpus, and audit log all live in their project. There is no proprietary data layer we can hold hostage. That changes the negotiation.
- Operational blast radius shrinks. A bad config in one agency’s project cannot affect another. Quotas, IAM, and budget alarms scope cleanly.
We pay for it with a unified-Makefile-and-overlays repo layout, a strict “vary by config, not by forking code” discipline, and a single backend image parameterized per deployment. That part is uninteresting and that’s the point.
Capacity control on stateless Cloud Run
Stateless serving is great for scaling and bad for fairness. A Firestore-backed token-bucket limiter throttles peak traffic into a configurable concurrency window; high-burst events are absorbed by an asynchronous Cloud Tasks queue that keeps the synchronous path responsive. The interesting design choice is that limit decisions are centralized in Firestore, so any Cloud Run instance making a decision sees the same state — no per-instance approximation, no “mostly correct” rate limiting.
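A minimal sketch of the centralized decision, assuming the Firestore Python client; the collection name, capacity, and refill rate are illustrative. The load-bearing property is that the read-modify-write happens inside a Firestore transaction, so every instance decides against the same state.

```python
import time
from google.cloud import firestore

db = firestore.Client()  # one shared store, not per-instance counters

@firestore.transactional
def try_acquire(transaction, bucket_ref, capacity: int, refill_per_sec: float) -> bool:
    """Token-bucket check as a transactional read-modify-write, so every
    Cloud Run instance makes its decision against the same bucket state."""
    snapshot = bucket_ref.get(transaction=transaction)
    now = time.time()
    if snapshot.exists:
        data = snapshot.to_dict()
        elapsed = now - data["updated_at"]
        tokens = min(capacity, data["tokens"] + elapsed * refill_per_sec)
    else:
        tokens = float(capacity)
    if tokens < 1:
        return False  # caller hands the request to the Cloud Tasks queue instead
    transaction.set(bucket_ref, {"tokens": tokens - 1, "updated_at": now})
    return True

def allow_request(agency: str, capacity: int = 20, refill_per_sec: float = 2.0) -> bool:
    bucket_ref = db.collection("rate_limits").document(agency)  # illustrative path
    return try_acquire(db.transaction(), bucket_ref, capacity, refill_per_sec)
```

On a False return the request is queued rather than dropped, which is what keeps the synchronous path responsive during bursts.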
Things we believe, so far
- Refusal is a feature. Build the refusal path before you tune answers.
- Don’t hide retrieval inside synthesis. Bundled answer endpoints are convenient until you need to inspect, cap, or shape what they retrieve.
- Use a small fast model on great retrieval over a big model on mediocre retrieval. Synthesis is downstream of retrieval quality. Spend the budget upstream.
- Evaluate at the boundary, not in the middle. The contract that matters is “is the answer supported by the cited passages?” Everything else is a proxy.
- Audit logs are a write path, not a side effect. If your records story is “we’ll add a logger,” you don’t have a records story.
- Tenancy lives in your cloud account boundary, not your code. Project isolation is cheaper than the trust you have to rebuild after a data-leak headline.
- Latency budget is shaped by parallelism, not model speed. Four parallel retrieval calls plus a fast synthesis beat one sequential mega-call almost every time.
For the architecture, safety, and records-handling detail your CIO or city attorney would ask for, see the security & architecture overview. For a live system you can throw real questions at, see the Oakland deployment in the CTA below.
Find us at AI Dev 26
We’d love to compare notes on production RAG, defense-in-depth safety, audit logging as a product, and per-tenant cloud isolation. Try the live City of Oakland deployment, then get in touch — or grab us at the conference.