Part 4 of 7

ParadeDB for Full-Text Search and BM25

Tantivy-backed search inside Postgres — the realistic Elasticsearch replacement for application search workloads.

50 minutes

4 GB+ recommended

Prerequisites

Part 1 + Part 2

Time to Complete

45–60 minutes

Pairs With

Part 3 (pgvector)

The Case Against tsvector

Postgres' built-in full-text search is fine for "does this column contain this word." It is not fine for application search:

• No BM25 ranking — ts_rank is a coverage metric, not a relevance metric
• Weak phrase matching, no slop control
• No fuzzy search out of the box (you can bolt on pg_trgm but the integration is awkward)
• No native faceting, no highlighting

What pg_search Is

ParadeDB's pg_search is a Postgres extension that embeds Tantivy (the Rust port of Lucene) as an index access method. You get the index quality of Lucene/Elasticsearch directly inside the database, with no external service to deploy and no replication pipeline to keep in sync.

Install pg_search

# Add the ParadeDB apt repo:
curl -fsSL https://repo.paradedb.com/apt/paradedb.gpg \
  | sudo gpg --dearmor -o /usr/share/keyrings/paradedb.gpg

echo "deb [signed-by=/usr/share/keyrings/paradedb.gpg] \
  https://repo.paradedb.com/apt $(lsb_release -cs) main" | \
  sudo tee /etc/apt/sources.list.d/paradedb.list

sudo apt update
sudo apt install -y postgresql-17-pg-search

# Add to shared_preload_libraries (requires restart):
echo "shared_preload_libraries = 'pg_search'" | \
  sudo tee -a /etc/postgresql/17/main/conf.d/20-paradedb.conf

sudo systemctl restart postgresql@17-main
sudo -u postgres psql -d appdb -c "CREATE EXTENSION pg_search;"

Create a BM25 Index

CREATE TABLE products (
  id          bigserial PRIMARY KEY,
  sku         text NOT NULL,
  name        text NOT NULL,
  description text NOT NULL,
  category    text NOT NULL,
  price_cents integer NOT NULL,
  in_stock    boolean NOT NULL DEFAULT true,
  created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX products_search_idx ON products
USING bm25 (id, name, description, category, price_cents, in_stock, created_at)
WITH (key_field='id');

The @@@ Operator

-- Basic keyword search across the indexed fields:
SELECT id, name, paradedb.score(id) AS score
FROM products
WHERE products @@@ 'wireless headphones'
ORDER BY score DESC LIMIT 20;

Field Configurations

Per-field tuning lives in the index WITH clause. Specify text-field analyzers, JSON paths, numeric and datetime indexing:

CREATE INDEX products_search_idx ON products
USING bm25 (id, name, description, category, price_cents, in_stock, created_at)
WITH (
  key_field='id',
  text_fields='{
    "name":        { "tokenizer": { "type": "default" }, "fast": true },
    "description": { "tokenizer": { "type": "en_stem" } },
    "category":    { "tokenizer": { "type": "raw" }, "fast": true }
  }',
  numeric_fields='{ "price_cents": { "fast": true } }',
  boolean_fields='{ "in_stock": { "fast": true } }',
  datetime_fields='{ "created_at": { "fast": true } }'
);

fast: true stores the column as a column-oriented "fast field" — required for fast faceting and sorting on that field.

Tokenizers, Stemmers, Stopwords

• default — splits on whitespace + punctuation, lowercases
• whitespace — splits on whitespace only, preserves case
• raw — single token, exact match (perfect for SKUs and category strings)
• en_stem / de_stem / etc. — language-specific stemmers
• ngram — substring matching for autocomplete
• source_code — preserves identifiers, splits on camelCase boundaries

Boolean Queries

SELECT id, name FROM products
WHERE products @@@ paradedb.boolean(
  must    => ARRAY[paradedb.term('description','wireless')],
  must_not=> ARRAY[paradedb.term('category','accessories')],
  should  => ARRAY[paradedb.term('description','noise-cancelling')]
);

Phrase, Fuzzy, and Regex Queries

-- Phrase with slop:
WHERE products @@@ paradedb.phrase('description',
        ARRAY['noise','cancelling','headphones'], slop => 2)

-- Fuzzy (edit distance 2):
WHERE products @@@ paradedb.fuzzy_term('name','heaphones', distance => 2)

-- Regex:
WHERE products @@@ paradedb.regex('sku','^WH-[0-9]+#x27;)

Faceting and Aggregations

-- Counts per category for matching docs:
SELECT category, COUNT(*) AS hits
FROM products
WHERE products @@@ 'headphones'
GROUP BY category
ORDER BY hits DESC;

-- Date histogram bucketed by month:
SELECT date_trunc('month', created_at) AS bucket, COUNT(*)
FROM products
WHERE products @@@ 'headphones'
GROUP BY 1 ORDER BY 1;

Highlighting Matched Terms

SELECT id,
       paradedb.snippet(description) AS snippet,
       paradedb.score(id) AS score
FROM products
WHERE products @@@ 'wireless noise-cancelling'
ORDER BY score DESC LIMIT 10;

Pagination at Scale

Deep OFFSET is slow on every search backend. Use search-after with the score + id tiebreaker as the cursor:

-- Page 1 (no cursor):
SELECT id, name, paradedb.score(id) AS score
FROM products
WHERE products @@@ 'headphones'
ORDER BY score DESC, id DESC
LIMIT 25;

-- Page N (cursor = last row's score and id):
SELECT id, name, paradedb.score(id) AS score
FROM products
WHERE products @@@ 'headphones'
  AND (paradedb.score(id), id) < ($last_score, $last_id)
ORDER BY score DESC, id DESC
LIMIT 25;

Hybrid BM25 + pgvector with RRF

The retrieval pattern that consistently beats either signal alone:

WITH lexical AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY paradedb.score(id) DESC) AS rank
  FROM products
  WHERE products @@@ $1
  LIMIT 100
),
semantic AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS rank
  FROM products
  ORDER BY embedding <=> $2 LIMIT 100
)
SELECT p.id, p.name,
       COALESCE(1.0/(60+l.rank), 0) + COALESCE(1.0/(60+s.rank), 0) AS rrf
FROM products p
LEFT JOIN lexical l USING (id)
LEFT JOIN semantic s USING (id)
WHERE l.id IS NOT NULL OR s.id IS NOT NULL
ORDER BY rrf DESC LIMIT 20;

When ParadeDB Replaces Elasticsearch — and When It Doesn't

Replaces it: application search (products, articles, support content), search inside SaaS tenants, hybrid retrieval pipelines, anywhere your search corpus is small enough to live in your transactional database.

Does not replace it: multi-tenant log search at petabyte scale, heavy multi-cluster ingest pipelines, deep ELK stack integrations. Use Elasticsearch or OpenSearch when those describe your workload.

Operational Notes

• Index rebuild is non-trivial — schema changes that drop or re-add the BM25 index will pause writes briefly. Plan accordingly.
• BM25 indexes write-amplify your storage. Budget extra disk: typically 30–80% of the indexed text size.
• Monitor index size with SELECT pg_size_pretty(pg_relation_size('products_search_idx'));

What's Next

Part 5 brings columnar analytics into the same database with pg_duckdb — query Parquet on object storage and join it with your transactional tables in one SQL.

← Part 3: pgvector Part 5: pg_duckdb