From SELECT LIKE to RAG Search
Every search system answers the same question: given a query, which documents matter? The interesting part is how the answer has evolved – from a single SQL clause to a multi-stage pipeline involving inverted indexes, dense vectors, rerankers, and language models. Each layer earns its place by fixing a specific failure mode of the previous one.
SQL LIKE – the starting point
SELECT * FROM posts WHERE title LIKE '%cancel%';
Substring match. No ranking, no stemming, no synonyms. "How do I end my subscription?" returns nothing because the literal word "cancel" isn't there. The leading % kills the B-tree (Balanced Tree) index, so every query is a full scan. Fine for admin panels. Useless for product search.
OpenSearch and BM25
OpenSearch (and Elasticsearch, its closed-source twin) flips the model: build an inverted index mapping each term to the documents containing it.
After tokenizing and stemming three posts:
cancel → [doc1, doc2]
subscript → [doc1, doc3]
plan → [doc2]
A query "cancel subscription" becomes two hash lookups – sublinear, not a scan.
Ranking uses BM25 (Best Matching 25), a formula combining:
- TF (Term Frequency) – how often the term appears, with diminishing returns
- IDF (Inverse Document Frequency) – rare terms outweigh common ones
- Length normalization – short focused docs beat long meandering ones
from opensearchpy import OpenSearch
client = OpenSearch("https://localhost:9200")
client.search(index="posts", body={
"query": {"match": {"content": "cancel subscription"}}
})
Add synonym files, fuzzy matching, phrase queries – BM25 gets you a long way. But it still can't bridge "I want to stop paying you every month" to "cancel subscription." Zero lexical overlap, zero recall.
Hybrid – BM25 plus vectors
Embed each document into a dense vector with a model like voyage-3-large or text-embedding-3-large. Cosine similarity now captures meaning, not just words.
Run both retrievers in parallel and fuse with RRF (Reciprocal Rank Fusion):
RRF(d) = Σ 1 / (k + rank_i(d)) // k ≈ 60
BM25 catches ERR_4021. Vectors catch "stop paying you." A cross-encoder reranker (Cohere Rerank, BGE) reorders the top 50 by jointly scoring [query; doc] pairs. This is the modern retrieval baseline.
RAG – Retrieval-Augmented Generation
Hybrid search returns ten passages. RAG (Retrieval-Augmented Generation) takes those passages, hands them to an LLM (Large Language Model), and asks it to synthesize a grounded answer with citations. The summary card at the top of modern search – Google AI Overviews, Notion Q&A, Atlassian Rovo – is RAG.
The pipeline
query → rewrite → retrieve (hybrid) → rerank → assemble prompt → generate → verify → answer
Each stage earns its place. Skip rewriting and multi-hop questions fail. Skip reranking and the LLM drowns in noise. Skip verification and you ship hallucinations.
Indexing – building the corpus
Before retrieval, documents are chunked, embedded, and upserted. Run this offline, ideally as a Temporal workflow or Airflow DAG (Directed Acyclic Graph) triggered by CDC (Change Data Capture) from the source database.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
import psycopg
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
oai = OpenAI()
def index_post(post, conn):
# Prepend title to each chunk so context survives the split
chunks = [f"{post['title']}\n\n{c}" for c in splitter.split_text(post["content"])]
vectors = oai.embeddings.create(
model="text-embedding-3-large",
input=chunks,
).data
with conn.cursor() as cur:
# Idempotent: delete old chunks first, then insert fresh
cur.execute("DELETE FROM post_chunks WHERE post_id = %s", (post["id"],))
for chunk, v in zip(chunks, vectors):
cur.execute(
"INSERT INTO post_chunks (post_id, title, chunk, embedding) "
"VALUES (%s, %s, %s, %s)",
(post["id"], post["title"], chunk, v.embedding),
)
In production: batch the embedding calls (up to ~2048 inputs per request), wrap in retry-with-backoff, version your chunker so re-indexing is intentional, store a content hash per chunk to skip no-op updates.
Retrieval – hybrid plus rerank
import cohere
co = cohere.Client()
def retrieve(query: str, k: int = 5) -> list[dict]:
q_vec = oai.embeddings.create(
model="text-embedding-3-large", input=[query]
).data[0].embedding
# Run BM25 and vector search in parallel (use asyncio in real code)
bm25_hits = opensearch.search(index="posts", body={
"query": {"match": {"chunk": query}}, "size": 50
})["hits"]["hits"]
vec_hits = pg.execute("""
SELECT id, chunk, 1 - (embedding <=> %s::vector) AS score
FROM post_chunks
ORDER BY embedding <=> %s::vector
LIMIT 50
""", (q_vec, q_vec)).fetchall()
# RRF fusion
fused = {}
for rank, h in enumerate(bm25_hits):
fused[h["_id"]] = fused.get(h["_id"], 0) + 1 / (60 + rank)
for rank, h in enumerate(vec_hits):
fused[h["id"]] = fused.get(h["id"], 0) + 1 / (60 + rank)
top = sorted(fused.items(), key=lambda x: -x[1])[:50]
docs = [load_chunk(doc_id) for doc_id, _ in top]
# Cross-encoder rerank for final precision
reranked = co.rerank(
model="rerank-english-v3.0",
query=query, documents=[d["chunk"] for d in docs], top_n=k,
)
return [docs[r.index] for r in reranked.results]
Generation – grounded synthesis
import anthropic
client = anthropic.Anthropic()
SYSTEM = """Answer using ONLY the provided sources.
Cite each claim inline as [chunk_id].
If the sources don't contain the answer, say "I don't know."
Never invent facts, URLs, or citations."""
def answer(query: str) -> dict:
chunks = retrieve(query, k=5)
context = "\n\n".join(f"[{c['id']}] {c['chunk']}" for c in chunks)
resp = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=1024,
system=SYSTEM,
messages=[{"role": "user", "content": f"{context}\n\nQuestion: {query}"}],
)
return {"text": resp.content[0].text, "sources": chunks}
Production concerns
Caching. Embed-the-query and rerank are hot paths. Cache by normalized query hash with a 5-minute TTL (Time To Live) and you cut tail latency hard.
Access control. Every retrieval must filter by the calling user's permissions. Push the ACL (Access Control List) into the metadata filter, not into a post-filter – post-filtering after top-k means you may return zero results to a user who has access to the 11th hit.
Observability. Wrap every stage in OTel (OpenTelemetry) spans. Arize Phoenix ingests traces, runs LLM-as-judge evals (faithfulness, retrieval relevance, answer relevance) nightly, and shows per-cluster regression dashboards. Without this you ship blind.
Eval harness. Curate 200 canonical queries with expected sources. Re-run on every prompt or retriever change. Compare faithfulness, citation accuracy, p95 latency, cost-per-query. CI (Continuous Integration) fails the build on regressions.
Bad-query filtering. A small classifier in front of retrieval rejects toxicity, prompt injection, and off-scope questions before they reach the LLM. A confidence threshold on the top reranker score decides whether to summarize or fall back to a "here are some links" surface. A groundedness check after generation – a second cheap LLM call that asks "is every claim in the answer supported by the sources?" – is the last line of defense against hallucination.
Why this works
BM25 alone misses meaning. Vectors alone miss exact tokens. Hybrid fixes both. Reranking sharpens precision. Citations make answers verifiable. Observability makes the system improvable. Each layer is boring on its own – stacked, they're the difference between LIKE '%cancel%' and an answer good enough to put on a homepage.