From SELECT LIKE to RAG Search

2026-05-18

Every search system answers the same question: given a query, which documents matter? The interesting part is how the answer has evolved – from a single SQL clause to a multi-stage pipeline involving inverted indexes, dense vectors, rerankers, and language models. Each layer earns its place by fixing a specific failure mode of the previous one.

SQL LIKE – the starting point

SELECT * FROM posts WHERE title LIKE '%cancel%';

Substring match. No ranking, no stemming, no synonyms. "How do I end my subscription?" returns nothing because the literal word "cancel" isn't there. The leading % kills the B-tree (Balanced Tree) index, so every query is a full scan. Fine for admin panels. Useless for product search.

OpenSearch and BM25

OpenSearch (and Elasticsearch, its closed-source twin) flips the model: build an inverted index mapping each term to the documents containing it.

After tokenizing and stemming three posts:

cancel    → [doc1, doc2]
subscript → [doc1, doc3]
plan      → [doc2]

A query "cancel subscription" becomes two hash lookups – sublinear, not a scan.

Ranking uses BM25 (Best Matching 25), a formula combining:

TF (Term Frequency) – how often the term appears, with diminishing returns
IDF (Inverse Document Frequency) – rare terms outweigh common ones
Length normalization – short focused docs beat long meandering ones

from opensearchpy import OpenSearch
client = OpenSearch("https://localhost:9200")
client.search(index="posts", body={
    "query": {"match": {"content": "cancel subscription"}}
})

Add synonym files, fuzzy matching, phrase queries – BM25 gets you a long way. But it still can't bridge "I want to stop paying you every month" to "cancel subscription." Zero lexical overlap, zero recall.

Hybrid – BM25 plus vectors

Embed each document into a dense vector with a model like voyage-3-large or text-embedding-3-large. Cosine similarity now captures meaning, not just words.

Run both retrievers in parallel and fuse with RRF (Reciprocal Rank Fusion):

RRF(d) = Σ 1 / (k + rank_i(d))   // k ≈ 60

BM25 catches ERR_4021. Vectors catch "stop paying you." A cross-encoder reranker (Cohere Rerank, BGE) reorders the top 50 by jointly scoring [query; doc] pairs. This is the modern retrieval baseline.

RAG – Retrieval-Augmented Generation

Hybrid search returns ten passages. RAG (Retrieval-Augmented Generation) takes those passages, hands them to an LLM (Large Language Model), and asks it to synthesize a grounded answer with citations. The summary card at the top of modern search – Google AI Overviews, Notion Q&A, Atlassian Rovo – is RAG.

The pipeline

query → rewrite → retrieve (hybrid) → rerank → assemble prompt → generate → verify → answer

Each stage earns its place. Skip rewriting and multi-hop questions fail. Skip reranking and the LLM drowns in noise. Skip verification and you ship hallucinations.

Indexing – building the corpus

Before retrieval, documents are chunked, embedded, and upserted. Run this offline, ideally as a Temporal workflow or Airflow DAG (Directed Acyclic Graph) triggered by CDC (Change Data Capture) from the source database.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI
import psycopg

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
oai = OpenAI()

def index_post(post, conn):
    # Prepend title to each chunk so context survives the split
    chunks = [f"{post['title']}\n\n{c}" for c in splitter.split_text(post["content"])]

    vectors = oai.embeddings.create(
        model="text-embedding-3-large",
        input=chunks,
    ).data

    with conn.cursor() as cur:
        # Idempotent: delete old chunks first, then insert fresh
        cur.execute("DELETE FROM post_chunks WHERE post_id = %s", (post["id"],))
        for chunk, v in zip(chunks, vectors):
            cur.execute(
                "INSERT INTO post_chunks (post_id, title, chunk, embedding) "
                "VALUES (%s, %s, %s, %s)",
                (post["id"], post["title"], chunk, v.embedding),
            )

In production: batch the embedding calls (up to ~2048 inputs per request), wrap in retry-with-backoff, version your chunker so re-indexing is intentional, store a content hash per chunk to skip no-op updates.

Retrieval – hybrid plus rerank

import cohere
co = cohere.Client()

def retrieve(query: str, k: int = 5) -> list[dict]:
    q_vec = oai.embeddings.create(
        model="text-embedding-3-large", input=[query]
    ).data[0].embedding

    # Run BM25 and vector search in parallel (use asyncio in real code)
    bm25_hits = opensearch.search(index="posts", body={
        "query": {"match": {"chunk": query}}, "size": 50
    })["hits"]["hits"]

    vec_hits = pg.execute("""
        SELECT id, chunk, 1 - (embedding <=> %s::vector) AS score
        FROM post_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT 50
    """, (q_vec, q_vec)).fetchall()

    # RRF fusion
    fused = {}
    for rank, h in enumerate(bm25_hits):
        fused[h["_id"]] = fused.get(h["_id"], 0) + 1 / (60 + rank)
    for rank, h in enumerate(vec_hits):
        fused[h["id"]] = fused.get(h["id"], 0) + 1 / (60 + rank)

    top = sorted(fused.items(), key=lambda x: -x[1])[:50]
    docs = [load_chunk(doc_id) for doc_id, _ in top]

    # Cross-encoder rerank for final precision
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=query, documents=[d["chunk"] for d in docs], top_n=k,
    )
    return [docs[r.index] for r in reranked.results]

Generation – grounded synthesis

import anthropic
client = anthropic.Anthropic()

SYSTEM = """Answer using ONLY the provided sources.
Cite each claim inline as [chunk_id].
If the sources don't contain the answer, say "I don't know."
Never invent facts, URLs, or citations."""

def answer(query: str) -> dict:
    chunks = retrieve(query, k=5)
    context = "\n\n".join(f"[{c['id']}] {c['chunk']}" for c in chunks)

    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {query}"}],
    )
    return {"text": resp.content[0].text, "sources": chunks}

Production concerns

Caching. Embed-the-query and rerank are hot paths. Cache by normalized query hash with a 5-minute TTL (Time To Live) and you cut tail latency hard.

Access control. Every retrieval must filter by the calling user's permissions. Push the ACL (Access Control List) into the metadata filter, not into a post-filter – post-filtering after top-k means you may return zero results to a user who has access to the 11th hit.

Observability. Wrap every stage in OTel (OpenTelemetry) spans. Arize Phoenix ingests traces, runs LLM-as-judge evals (faithfulness, retrieval relevance, answer relevance) nightly, and shows per-cluster regression dashboards. Without this you ship blind.

Eval harness. Curate 200 canonical queries with expected sources. Re-run on every prompt or retriever change. Compare faithfulness, citation accuracy, p95 latency, cost-per-query. CI (Continuous Integration) fails the build on regressions.

Bad-query filtering. A small classifier in front of retrieval rejects toxicity, prompt injection, and off-scope questions before they reach the LLM. A confidence threshold on the top reranker score decides whether to summarize or fall back to a "here are some links" surface. A groundedness check after generation – a second cheap LLM call that asks "is every claim in the answer supported by the sources?" – is the last line of defense against hallucination.

Why this works

BM25 alone misses meaning. Vectors alone miss exact tokens. Hybrid fixes both. Reranking sharpens precision. Citations make answers verifiable. Observability makes the system improvable. Each layer is boring on its own – stacked, they're the difference between LIKE '%cancel%' and an answer good enough to put on a homepage.