ParadeDB for Full-Text Search and BM25
Tantivy-backed search inside Postgres — the realistic Elasticsearch replacement for application search workloads.
The Case Against tsvector
Postgres' built-in full-text search is fine for "does this column contain this word." It is not fine for application search:
- • No BM25 ranking —
ts_rankis a coverage metric, not a relevance metric - • Weak phrase matching, no slop control
- • No fuzzy search out of the box (you can bolt on
pg_trgmbut the integration is awkward) - • No native faceting, no highlighting
What pg_search Is
ParadeDB's pg_search is a Postgres extension that embeds Tantivy (the Rust port of Lucene) as an index access method. You get the index quality of Lucene/Elasticsearch directly inside the database, with no external service to deploy and no replication pipeline to keep in sync.
Install pg_search
# Add the ParadeDB apt repo:
curl -fsSL https://repo.paradedb.com/apt/paradedb.gpg \
| sudo gpg --dearmor -o /usr/share/keyrings/paradedb.gpg
echo "deb [signed-by=/usr/share/keyrings/paradedb.gpg] \
https://repo.paradedb.com/apt $(lsb_release -cs) main" | \
sudo tee /etc/apt/sources.list.d/paradedb.list
sudo apt update
sudo apt install -y postgresql-17-pg-search
# Add to shared_preload_libraries (requires restart):
echo "shared_preload_libraries = 'pg_search'" | \
sudo tee -a /etc/postgresql/17/main/conf.d/20-paradedb.conf
sudo systemctl restart postgresql@17-main
sudo -u postgres psql -d appdb -c "CREATE EXTENSION pg_search;"Create a BM25 Index
CREATE TABLE products (
id bigserial PRIMARY KEY,
sku text NOT NULL,
name text NOT NULL,
description text NOT NULL,
category text NOT NULL,
price_cents integer NOT NULL,
in_stock boolean NOT NULL DEFAULT true,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX products_search_idx ON products
USING bm25 (id, name, description, category, price_cents, in_stock, created_at)
WITH (key_field='id');The @@@ Operator
-- Basic keyword search across the indexed fields:
SELECT id, name, paradedb.score(id) AS score
FROM products
WHERE products @@@ 'wireless headphones'
ORDER BY score DESC LIMIT 20;Field Configurations
Per-field tuning lives in the index WITH clause. Specify text-field analyzers, JSON paths, numeric and datetime indexing:
CREATE INDEX products_search_idx ON products
USING bm25 (id, name, description, category, price_cents, in_stock, created_at)
WITH (
key_field='id',
text_fields='{
"name": { "tokenizer": { "type": "default" }, "fast": true },
"description": { "tokenizer": { "type": "en_stem" } },
"category": { "tokenizer": { "type": "raw" }, "fast": true }
}',
numeric_fields='{ "price_cents": { "fast": true } }',
boolean_fields='{ "in_stock": { "fast": true } }',
datetime_fields='{ "created_at": { "fast": true } }'
);fast: true stores the column as a column-oriented "fast field" — required for fast faceting and sorting on that field.
Tokenizers, Stemmers, Stopwords
- • default — splits on whitespace + punctuation, lowercases
- • whitespace — splits on whitespace only, preserves case
- • raw — single token, exact match (perfect for SKUs and category strings)
- • en_stem / de_stem / etc. — language-specific stemmers
- • ngram — substring matching for autocomplete
- • source_code — preserves identifiers, splits on camelCase boundaries
Boolean Queries
SELECT id, name FROM products
WHERE products @@@ paradedb.boolean(
must => ARRAY[paradedb.term('description','wireless')],
must_not=> ARRAY[paradedb.term('category','accessories')],
should => ARRAY[paradedb.term('description','noise-cancelling')]
);Phrase, Fuzzy, and Regex Queries
-- Phrase with slop:
WHERE products @@@ paradedb.phrase('description',
ARRAY['noise','cancelling','headphones'], slop => 2)
-- Fuzzy (edit distance 2):
WHERE products @@@ paradedb.fuzzy_term('name','heaphones', distance => 2)
-- Regex:
WHERE products @@@ paradedb.regex('sku','^WH-[0-9]+#x27;)Faceting and Aggregations
-- Counts per category for matching docs:
SELECT category, COUNT(*) AS hits
FROM products
WHERE products @@@ 'headphones'
GROUP BY category
ORDER BY hits DESC;
-- Date histogram bucketed by month:
SELECT date_trunc('month', created_at) AS bucket, COUNT(*)
FROM products
WHERE products @@@ 'headphones'
GROUP BY 1 ORDER BY 1;Highlighting Matched Terms
SELECT id,
paradedb.snippet(description) AS snippet,
paradedb.score(id) AS score
FROM products
WHERE products @@@ 'wireless noise-cancelling'
ORDER BY score DESC LIMIT 10;Pagination at Scale
Deep OFFSET is slow on every search backend. Use search-after with the score + id tiebreaker as the cursor:
-- Page 1 (no cursor):
SELECT id, name, paradedb.score(id) AS score
FROM products
WHERE products @@@ 'headphones'
ORDER BY score DESC, id DESC
LIMIT 25;
-- Page N (cursor = last row's score and id):
SELECT id, name, paradedb.score(id) AS score
FROM products
WHERE products @@@ 'headphones'
AND (paradedb.score(id), id) < ($last_score, $last_id)
ORDER BY score DESC, id DESC
LIMIT 25;Hybrid BM25 + pgvector with RRF
The retrieval pattern that consistently beats either signal alone:
WITH lexical AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY paradedb.score(id) DESC) AS rank
FROM products
WHERE products @@@ $1
LIMIT 100
),
semantic AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS rank
FROM products
ORDER BY embedding <=> $2 LIMIT 100
)
SELECT p.id, p.name,
COALESCE(1.0/(60+l.rank), 0) + COALESCE(1.0/(60+s.rank), 0) AS rrf
FROM products p
LEFT JOIN lexical l USING (id)
LEFT JOIN semantic s USING (id)
WHERE l.id IS NOT NULL OR s.id IS NOT NULL
ORDER BY rrf DESC LIMIT 20;When ParadeDB Replaces Elasticsearch — and When It Doesn't
Replaces it: application search (products, articles, support content), search inside SaaS tenants, hybrid retrieval pipelines, anywhere your search corpus is small enough to live in your transactional database.
Does not replace it: multi-tenant log search at petabyte scale, heavy multi-cluster ingest pipelines, deep ELK stack integrations. Use Elasticsearch or OpenSearch when those describe your workload.
Operational Notes
- • Index rebuild is non-trivial — schema changes that drop or re-add the BM25 index will pause writes briefly. Plan accordingly.
- • BM25 indexes write-amplify your storage. Budget extra disk: typically 30–80% of the indexed text size.
- • Monitor index size with
SELECT pg_size_pretty(pg_relation_size('products_search_idx'));
What's Next
Part 5 brings columnar analytics into the same database with pg_duckdb — query Parquet on object storage and join it with your transactional tables in one SQL.
