Part 3 of 7

pgvector for Embeddings, RAG, and Semantic Search

A self-hosted vector database that lives in your existing Postgres — joins, transactions, and one storage layer.

55 minutes

4 GB+ recommended

Prerequisites

Part 1 + Part 2

Time to Complete

50–65 minutes

Recommended Plan

4 GB+, ideally 8 GB

Why pgvector Beats a Separate Vector Database

Dedicated vector databases (Pinecone, Weaviate, Qdrant) solve a real problem at extreme scale. For most workloads they introduce more pain than they remove:

• One storage layer — your documents, metadata, and embeddings live in the same database. No two-system consistency to manage.
• Transactional consistency — insert a document and its embedding in the same transaction. Either both land or neither.
• Joins — filter by user, by tenant, by time, by tag, all in plain SQL.
• One backup — Part 7 (pgBackRest) covers vectors and documents in a single PITR-capable archive.

Install pgvector

# PGDG ships pgvector as a versioned package:
sudo apt install -y postgresql-17-pgvector

# Enable per-database:
sudo -u postgres psql -d appdb -c "CREATE EXTENSION vector;"

# Verify:
sudo -u postgres psql -d appdb -c "\dx vector"

Vector Types

• vector(N) — single-precision float, the default. 4 bytes per dimension.
• halfvec(N) — half-precision float (FP16). 2 bytes per dimension. Half the storage and the index size, with negligible recall loss for most embedding models.
• bit(N) — binary embeddings, 1 bit per dimension. Used with quantized models.
• sparsevec(N) — sparse representation for high-dimensional, mostly-zero vectors (e.g. SPLADE).

Distance Operators

Operator	Distance	Typical use
`<->`	L2 (Euclidean)	image embeddings, geometric data
`<=>`	Cosine	text embeddings (default for most LLM use)
`<#>`	Negative inner product	OpenAI embeddings (already normalized)
`<+>`	L1 (Manhattan)	specialized retrieval, sparse vectors

Choosing Dimensionality

Model	Dims	Notes
all-MiniLM-L6-v2	384	Fast, CPU-friendly, good baseline
BAAI/bge-base-en-v1.5	768	Strong open-source default
BAAI/bge-large-en-v1.5	1024	Best open recall; needs more RAM
OpenAI text-embedding-3-small	1536	API; reduce with Matryoshka if cost matters
OpenAI text-embedding-3-large	3072	High accuracy, expensive to store

HNSW vs IVFFlat — A Decision Flowchart

Are inserts continuous (RAG with growing corpus)?
  YES → HNSW. Updates are cheap; recall is high.
  NO  → Static dataset, build once?
        YES + < 1M rows → either works; HNSW is simpler.
        YES + > 1M rows → IVFFlat builds faster and uses less RAM.

Need highest recall at low query latency? → HNSW.
Need smallest index footprint? → IVFFlat.

HNSW Tuning Parameters

CREATE TABLE documents (
  id          bigserial PRIMARY KEY,
  tenant_id   uuid NOT NULL,
  body        text NOT NULL,
  embedding   vector(768) NOT NULL,
  created_at  timestamptz NOT NULL DEFAULT now()
);

-- Build-time params:
--   m = 16 default (graph connectivity; raise to 32 for hard datasets)
--   ef_construction = 64 default (build effort; 200 gives much better recall)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);

-- Query-time recall vs latency knob:
SET hnsw.ef_search = 100;   -- raise for higher recall, lower for speed

The halfvec Index Trick

Cut HNSW index size in half with negligible recall loss for most embedding models:

CREATE INDEX ON documents
  USING hnsw ((embedding::halfvec(768)) halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 200);

-- Query against the index by casting the probe vector:
SELECT id, body
FROM documents
ORDER BY embedding::halfvec(768) <=> $1::halfvec(768)
LIMIT 10;

Generating Embeddings Locally

CPU-only sentence-transformers inference is fast enough for many workloads and removes the API bill:

pip install sentence-transformers psycopg[binary]

ingest.py

from sentence_transformers import SentenceTransformer
import psycopg

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
conn = psycopg.connect("dbname=appdb user=appuser host=127.0.0.1 port=6432")

docs = [("doc-1", "Patroni provides leader election for PostgreSQL HA"),
        ("doc-2", "pgBackRest performs incremental block-level backups")]

with conn.cursor() as cur:
    for tid, body in docs:
        emb = model.encode(body, normalize_embeddings=True).tolist()
        cur.execute(
          "INSERT INTO documents (tenant_id, body, embedding) VALUES (%s,%s,%s)",
          (tid, body, emb))
    conn.commit()

Embeddings via Ollama or an External API

If you'd rather offload inference to a second VPS running Ollama, or use OpenAI for accuracy:

# Ollama on a separate RamNode instance:
curl http://10.0.0.20:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Patroni provides leader election for PostgreSQL HA"
}'

Hybrid Search: tsvector + pgvector with RRF

Combine lexical and semantic results with reciprocal rank fusion in a single SQL query:

WITH semantic AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS rank
  FROM documents WHERE tenant_id = $1
  ORDER BY embedding <=> $2 LIMIT 50
),
lexical AS (
  SELECT id, ROW_NUMBER() OVER (
    ORDER BY ts_rank(to_tsvector('english', body), plainto_tsquery($3)) DESC
  ) AS rank
  FROM documents
  WHERE tenant_id = $1
    AND to_tsvector('english', body) @@ plainto_tsquery($3)
  LIMIT 50
)
SELECT d.id, d.body,
       COALESCE(1.0/(60+s.rank), 0) + COALESCE(1.0/(60+l.rank), 0) AS rrf_score
FROM documents d
LEFT JOIN semantic s USING (id)
LEFT JOIN lexical  l USING (id)
WHERE s.id IS NOT NULL OR l.id IS NOT NULL
ORDER BY rrf_score DESC LIMIT 10;

Part 4 (ParadeDB) upgrades the lexical half from ts_rank to a real BM25 implementation.

Metadata Filtering with Partial Indexes

Multi-tenant SaaS pattern — one partial HNSW index per high-traffic tenant or per shard, leaving a global index for the long tail:

CREATE INDEX docs_tenant_a_hnsw ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200)
  WHERE tenant_id = '11111111-1111-1111-1111-111111111111';

Performance Numbers (4 vCPU, 8 GB)

Rows	Build time (m=16, ef=200)	p50 query	Recall@10
100k	~30s	1.2 ms	0.99
1M	~6 min	3 ms	0.98
10M	~75 min	10–18 ms	0.96

Numbers vary with embedding dimensionality and ef_search. These are starting points, not guarantees.

When pgvector Stops Being Enough

Beyond roughly 50M vectors at low latency, or with very high write throughput on the index, dedicated systems (Qdrant, Vespa, Milvus) start to pull ahead. Below that scale, the operational simplicity of one Postgres almost always wins.

What's Next

Part 4 replaces ts_rank with ParadeDB's pg_search — real BM25, fuzzy queries, faceting, highlighting, and a hybrid-with-RRF query that combines BM25 and pgvector cleanly.

← Part 2: Tuning Part 4: ParadeDB