Modern PostgreSQL Superstack Series
    Part 3 of 7

    pgvector for Embeddings, RAG, and Semantic Search

    A self-hosted vector database that lives in your existing Postgres — joins, transactions, and one storage layer.

    55 minutes
    4 GB+ recommended
    Prerequisites

    Part 1 + Part 2

    Time to Complete

    50–65 minutes

    Recommended Plan

    4 GB+, ideally 8 GB

    Why pgvector Beats a Separate Vector Database

    Dedicated vector databases (Pinecone, Weaviate, Qdrant) solve a real problem at extreme scale. For most workloads they introduce more pain than they remove:

    • One storage layer — your documents, metadata, and embeddings live in the same database. No two-system consistency to manage.
    • Transactional consistency — insert a document and its embedding in the same transaction. Either both land or neither.
    • Joins — filter by user, by tenant, by time, by tag, all in plain SQL.
    • One backup — Part 7 (pgBackRest) covers vectors and documents in a single PITR-capable archive.

    Install pgvector

    # PGDG ships pgvector as a versioned package:
    sudo apt install -y postgresql-17-pgvector
    
    # Enable per-database:
    sudo -u postgres psql -d appdb -c "CREATE EXTENSION vector;"
    
    # Verify:
    sudo -u postgres psql -d appdb -c "\dx vector"

    Vector Types

    • vector(N) — single-precision float, the default. 4 bytes per dimension.
    • halfvec(N) — half-precision float (FP16). 2 bytes per dimension. Half the storage and the index size, with negligible recall loss for most embedding models.
    • bit(N) — binary embeddings, 1 bit per dimension. Used with quantized models.
    • sparsevec(N) — sparse representation for high-dimensional, mostly-zero vectors (e.g. SPLADE).

    Distance Operators

    OperatorDistanceTypical use
    <->L2 (Euclidean)image embeddings, geometric data
    <=>Cosinetext embeddings (default for most LLM use)
    <#>Negative inner productOpenAI embeddings (already normalized)
    <+>L1 (Manhattan)specialized retrieval, sparse vectors

    Choosing Dimensionality

    ModelDimsNotes
    all-MiniLM-L6-v2384Fast, CPU-friendly, good baseline
    BAAI/bge-base-en-v1.5768Strong open-source default
    BAAI/bge-large-en-v1.51024Best open recall; needs more RAM
    OpenAI text-embedding-3-small1536API; reduce with Matryoshka if cost matters
    OpenAI text-embedding-3-large3072High accuracy, expensive to store

    HNSW vs IVFFlat — A Decision Flowchart

    Are inserts continuous (RAG with growing corpus)?
      YES → HNSW. Updates are cheap; recall is high.
      NO  → Static dataset, build once?
            YES + < 1M rows → either works; HNSW is simpler.
            YES + > 1M rows → IVFFlat builds faster and uses less RAM.
    
    Need highest recall at low query latency? → HNSW.
    Need smallest index footprint? → IVFFlat.

    HNSW Tuning Parameters

    CREATE TABLE documents (
      id          bigserial PRIMARY KEY,
      tenant_id   uuid NOT NULL,
      body        text NOT NULL,
      embedding   vector(768) NOT NULL,
      created_at  timestamptz NOT NULL DEFAULT now()
    );
    
    -- Build-time params:
    --   m = 16 default (graph connectivity; raise to 32 for hard datasets)
    --   ef_construction = 64 default (build effort; 200 gives much better recall)
    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 200);
    
    -- Query-time recall vs latency knob:
    SET hnsw.ef_search = 100;   -- raise for higher recall, lower for speed

    The halfvec Index Trick

    Cut HNSW index size in half with negligible recall loss for most embedding models:

    CREATE INDEX ON documents
      USING hnsw ((embedding::halfvec(768)) halfvec_cosine_ops)
      WITH (m = 16, ef_construction = 200);
    
    -- Query against the index by casting the probe vector:
    SELECT id, body
    FROM documents
    ORDER BY embedding::halfvec(768) <=> $1::halfvec(768)
    LIMIT 10;

    Generating Embeddings Locally

    CPU-only sentence-transformers inference is fast enough for many workloads and removes the API bill:

    pip install sentence-transformers psycopg[binary]
    ingest.py
    from sentence_transformers import SentenceTransformer
    import psycopg
    
    model = SentenceTransformer("BAAI/bge-base-en-v1.5")
    conn = psycopg.connect("dbname=appdb user=appuser host=127.0.0.1 port=6432")
    
    docs = [("doc-1", "Patroni provides leader election for PostgreSQL HA"),
            ("doc-2", "pgBackRest performs incremental block-level backups")]
    
    with conn.cursor() as cur:
        for tid, body in docs:
            emb = model.encode(body, normalize_embeddings=True).tolist()
            cur.execute(
              "INSERT INTO documents (tenant_id, body, embedding) VALUES (%s,%s,%s)",
              (tid, body, emb))
        conn.commit()

    Embeddings via Ollama or an External API

    If you'd rather offload inference to a second VPS running Ollama, or use OpenAI for accuracy:

    # Ollama on a separate RamNode instance:
    curl http://10.0.0.20:11434/api/embeddings -d '{
      "model": "nomic-embed-text",
      "prompt": "Patroni provides leader election for PostgreSQL HA"
    }'

    Hybrid Search: tsvector + pgvector with RRF

    Combine lexical and semantic results with reciprocal rank fusion in a single SQL query:

    WITH semantic AS (
      SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $2) AS rank
      FROM documents WHERE tenant_id = $1
      ORDER BY embedding <=> $2 LIMIT 50
    ),
    lexical AS (
      SELECT id, ROW_NUMBER() OVER (
        ORDER BY ts_rank(to_tsvector('english', body), plainto_tsquery($3)) DESC
      ) AS rank
      FROM documents
      WHERE tenant_id = $1
        AND to_tsvector('english', body) @@ plainto_tsquery($3)
      LIMIT 50
    )
    SELECT d.id, d.body,
           COALESCE(1.0/(60+s.rank), 0) + COALESCE(1.0/(60+l.rank), 0) AS rrf_score
    FROM documents d
    LEFT JOIN semantic s USING (id)
    LEFT JOIN lexical  l USING (id)
    WHERE s.id IS NOT NULL OR l.id IS NOT NULL
    ORDER BY rrf_score DESC LIMIT 10;

    Part 4 (ParadeDB) upgrades the lexical half from ts_rank to a real BM25 implementation.

    Metadata Filtering with Partial Indexes

    Multi-tenant SaaS pattern — one partial HNSW index per high-traffic tenant or per shard, leaving a global index for the long tail:

    CREATE INDEX docs_tenant_a_hnsw ON documents
      USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 200)
      WHERE tenant_id = '11111111-1111-1111-1111-111111111111';

    Performance Numbers (4 vCPU, 8 GB)

    RowsBuild time (m=16, ef=200)p50 queryRecall@10
    100k~30s1.2 ms0.99
    1M~6 min3 ms0.98
    10M~75 min10–18 ms0.96

    Numbers vary with embedding dimensionality and ef_search. These are starting points, not guarantees.

    When pgvector Stops Being Enough

    Beyond roughly 50M vectors at low latency, or with very high write throughput on the index, dedicated systems (Qdrant, Vespa, Milvus) start to pull ahead. Below that scale, the operational simplicity of one Postgres almost always wins.

    What's Next

    Part 4 replaces ts_rank with ParadeDB's pg_search — real BM25, fuzzy queries, faceting, highlighting, and a hybrid-with-RRF query that combines BM25 and pgvector cleanly.