Modern PostgreSQL Superstack Series
    Part 4 of 7

    ParadeDB for Full-Text Search and BM25

    Tantivy-backed search inside Postgres — the realistic Elasticsearch replacement for application search workloads.

    50 minutes
    4 GB+ recommended
    Prerequisites

    Part 1 + Part 2

    Time to Complete

    45–60 minutes

    The Case Against tsvector

    Postgres' built-in full-text search is fine for "does this column contain this word." It is not fine for application search:

    • • No BM25 ranking — ts_rank is a coverage metric, not a relevance metric
    • • Weak phrase matching, no slop control
    • • No fuzzy search out of the box (you can bolt on pg_trgm but the integration is awkward)
    • • No native faceting, no highlighting

    What pg_search Is

    ParadeDB's pg_search is a Postgres extension that embeds Tantivy (the Rust port of Lucene) as an index access method. You get the index quality of Lucene/Elasticsearch directly inside the database, with no external service to deploy and no replication pipeline to keep in sync.

    Install pg_search

    # Add the ParadeDB apt repo:
    curl -fsSL https://repo.paradedb.com/apt/paradedb.gpg \
      | sudo gpg --dearmor -o /usr/share/keyrings/paradedb.gpg
    
    echo "deb [signed-by=/usr/share/keyrings/paradedb.gpg] \
      https://repo.paradedb.com/apt $(lsb_release -cs) main" | \
      sudo tee /etc/apt/sources.list.d/paradedb.list
    
    sudo apt update
    sudo apt install -y postgresql-17-pg-search
    
    # Add to shared_preload_libraries (requires restart):
    echo "shared_preload_libraries = 'pg_search'" | \
      sudo tee -a /etc/postgresql/17/main/conf.d/20-paradedb.conf
    
    sudo systemctl restart postgresql@17-main
    sudo -u postgres psql -d appdb -c "CREATE EXTENSION pg_search;"

    Create a BM25 Index

    CREATE TABLE products (
      id          bigserial PRIMARY KEY,
      sku         text NOT NULL,
      name        text NOT NULL,
      description text NOT NULL,
      category    text NOT NULL,
      price_cents integer NOT NULL,
      in_stock    boolean NOT NULL DEFAULT true,
      created_at  timestamptz NOT NULL DEFAULT now()
    );
    
    CREATE INDEX products_search_idx ON products
    USING bm25 (id, name, description, category, price_cents, in_stock, created_at)
    WITH (key_field='id');

    The @@@ Operator

    -- Basic keyword search across the indexed fields:
    SELECT id, name, paradedb.score(id) AS score
    FROM products
    WHERE products @@@ 'wireless headphones'
    ORDER BY score DESC LIMIT 20;

    Field Configurations

    Per-field tuning lives in the index WITH clause. Specify text-field analyzers, JSON paths, numeric and datetime indexing:

    CREATE INDEX products_search_idx ON products
    USING bm25 (id, name, description, category, price_cents, in_stock, created_at)
    WITH (
      key_field='id',
      text_fields='{
        "name":        { "tokenizer": { "type": "default" }, "fast": true },
        "description": { "tokenizer": { "type": "en_stem" } },
        "category":    { "tokenizer": { "type": "raw" }, "fast": true }
      }',
      numeric_fields='{ "price_cents": { "fast": true } }',
      boolean_fields='{ "in_stock": { "fast": true } }',
      datetime_fields='{ "created_at": { "fast": true } }'
    );

    fast: true stores the column as a column-oriented "fast field" — required for fast faceting and sorting on that field.

    Tokenizers, Stemmers, Stopwords

    • default — splits on whitespace + punctuation, lowercases
    • whitespace — splits on whitespace only, preserves case
    • raw — single token, exact match (perfect for SKUs and category strings)
    • en_stem / de_stem / etc. — language-specific stemmers
    • ngram — substring matching for autocomplete
    • source_code — preserves identifiers, splits on camelCase boundaries

    Boolean Queries

    SELECT id, name FROM products
    WHERE products @@@ paradedb.boolean(
      must    => ARRAY[paradedb.term('description','wireless')],
      must_not=> ARRAY[paradedb.term('category','accessories')],
      should  => ARRAY[paradedb.term('description','noise-cancelling')]
    );

    Phrase, Fuzzy, and Regex Queries

    -- Phrase with slop:
    WHERE products @@@ paradedb.phrase('description',
            ARRAY['noise','cancelling','headphones'], slop => 2)
    
    -- Fuzzy (edit distance 2):
    WHERE products @@@ paradedb.fuzzy_term('name','heaphones', distance => 2)
    
    -- Regex:
    WHERE products @@@ paradedb.regex('sku','^WH-[0-9]+#x27;)

    Faceting and Aggregations

    -- Counts per category for matching docs:
    SELECT category, COUNT(*) AS hits
    FROM products
    WHERE products @@@ 'headphones'
    GROUP BY category
    ORDER BY hits DESC;
    
    -- Date histogram bucketed by month:
    SELECT date_trunc('month', created_at) AS bucket, COUNT(*)
    FROM products
    WHERE products @@@ 'headphones'
    GROUP BY 1 ORDER BY 1;

    Highlighting Matched Terms

    SELECT id,
           paradedb.snippet(description) AS snippet,
           paradedb.score(id) AS score
    FROM products
    WHERE products @@@ 'wireless noise-cancelling'
    ORDER BY score DESC LIMIT 10;

    Pagination at Scale

    Deep OFFSET is slow on every search backend. Use search-after with the score + id tiebreaker as the cursor:

    -- Page 1 (no cursor):
    SELECT id, name, paradedb.score(id) AS score
    FROM products
    WHERE products @@@ 'headphones'
    ORDER BY score DESC, id DESC
    LIMIT 25;
    
    -- Page N (cursor = last row's score and id):
    SELECT id, name, paradedb.score(id) AS score
    FROM products
    WHERE products @@@ 'headphones'
      AND (paradedb.score(id), id) < ($last_score, $last_id)
    ORDER BY score DESC, id DESC
    LIMIT 25;

    Hybrid BM25 + pgvector with RRF

    The retrieval pattern that consistently beats either signal alone:

    WITH lexical AS (
      SELECT id, ROW_NUMBER() OVER (ORDER BY paradedb.score(id) DESC) AS rank
      FROM products
      WHERE products @@@ $1
      LIMIT 100
    ),
    semantic AS (
      SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS rank
      FROM products
      ORDER BY embedding <=> $2 LIMIT 100
    )
    SELECT p.id, p.name,
           COALESCE(1.0/(60+l.rank), 0) + COALESCE(1.0/(60+s.rank), 0) AS rrf
    FROM products p
    LEFT JOIN lexical l USING (id)
    LEFT JOIN semantic s USING (id)
    WHERE l.id IS NOT NULL OR s.id IS NOT NULL
    ORDER BY rrf DESC LIMIT 20;

    When ParadeDB Replaces Elasticsearch — and When It Doesn't

    Replaces it: application search (products, articles, support content), search inside SaaS tenants, hybrid retrieval pipelines, anywhere your search corpus is small enough to live in your transactional database.

    Does not replace it: multi-tenant log search at petabyte scale, heavy multi-cluster ingest pipelines, deep ELK stack integrations. Use Elasticsearch or OpenSearch when those describe your workload.

    Operational Notes

    • • Index rebuild is non-trivial — schema changes that drop or re-add the BM25 index will pause writes briefly. Plan accordingly.
    • • BM25 indexes write-amplify your storage. Budget extra disk: typically 30–80% of the indexed text size.
    • • Monitor index size with SELECT pg_size_pretty(pg_relation_size('products_search_idx'));

    What's Next

    Part 5 brings columnar analytics into the same database with pg_duckdb — query Parquet on object storage and join it with your transactional tables in one SQL.