pgvector Backend¶
Overview¶
The pgvector backend provides production-ready vector similarity search using PostgreSQL with the pgvector extension. It combines the reliability and features of PostgreSQL with efficient approximate nearest neighbor (ANN) search capabilities.
Features¶
- Production-ready - Built on battle-tested PostgreSQL infrastructure
- Multiple index types - HNSW and IVFFlat for different use cases
- Flexible configuration - Custom column mappings, schemas, and ID types
- Multi-tenant support - Domain-based isolation for SaaS applications
- Connection pooling - Built-in async connection pool management
- Auto table creation - Automatically creates tables and indexes
- Distributed-safe - Index management works correctly in multi-instance deployments
Installation¶
```bash
# Install the data package with asyncpg
pip install dataknobs-data asyncpg

# Or with the postgres extra
pip install "dataknobs-data[postgres]"
```
PostgreSQL Setup¶
Ensure pgvector is installed on your PostgreSQL server:
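```sql
-- Enable the extension in the target database
-- (the pgvector package must already be installed on the server)
CREATE EXTENSION IF NOT EXISTS vector;
```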
Quick Start¶
```python
from dataknobs_data.vector.stores import VectorStoreFactory
import numpy as np

# Create store using factory
factory = VectorStoreFactory()
store = factory.create(
    backend="pgvector",
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    dimensions=768,
    metric="cosine"
)

# Initialize connection
await store.initialize()

# Add vectors
vectors = np.random.rand(100, 768).astype(np.float32)
metadata = [{"source": f"doc_{i}"} for i in range(100)]
ids = await store.add_vectors(vectors, metadata=metadata)

# Search
query = np.random.rand(768).astype(np.float32)
results = await store.search(query, k=5)
for id, score, meta in results:
    print(f"ID: {id}, Score: {score:.4f}, Source: {meta['source']}")

# Cleanup
await store.close()
```
Configuration Options¶
Basic Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| `connection_string` | str | None | PostgreSQL connection URL. Falls back to the `DATABASE_URL` env var |
| `dimensions` | int | Required | Vector dimensions (e.g., 768 for sentence-transformers) |
| `metric` | str | `"cosine"` | Distance metric: `cosine`, `euclidean`, `inner_product` |
| `schema` | str | `"edubot"` | Database schema name |
| `table_name` | str | `"knowledge_embeddings"` | Table name for vectors |
Connection Pool Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| `pool_min_size` | int | 2 | Minimum connections in the pool |
| `pool_max_size` | int | 10 | Maximum connections in the pool |
Table Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| `auto_create_table` | bool | `True` | Create the table if it doesn't exist |
| `id_type` | str | `"uuid"` | ID column type: `uuid` or `text` |
| `columns` | dict | See below | Column name mappings |
Index Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| `index_type` | str | `"none"` | Index type: `none`, `hnsw`, or `ivfflat` |
| `auto_create_index` | bool | `False` | Automatically create the index when conditions are met |
| `min_rows_for_index` | int | 1000 | Minimum rows before auto-creating an IVFFlat index |
| `index_params` | dict | `{}` | Index-specific parameters |
Multi-tenant Configuration¶
| Option | Type | Default | Description |
|---|---|---|---|
| `domain_id` | str | None | Domain ID for filtering (multi-tenant isolation) |
Index Types¶
No Index (Default)¶
Best for small datasets (<1,000 vectors) where exact search is acceptable:
```python
store = factory.create(
    backend="pgvector",
    connection_string="postgresql://...",
    dimensions=768,
    index_type="none"  # Default; explicit for clarity
)
```
HNSW Index¶
Hierarchical Navigable Small World graphs. Best for most production use cases:
- Works with empty tables (can be created immediately)
- Good balance of speed and recall
- Higher memory usage
```python
store = factory.create(
    backend="pgvector",
    connection_string="postgresql://...",
    dimensions=768,
    index_type="hnsw",
    auto_create_index=True,
    index_params={
        "m": 16,                # Connections per layer (higher = better recall, more memory)
        "ef_construction": 64   # Build quality (higher = better recall, slower build)
    }
)
```
IVFFlat Index¶
Inverted File with Flat quantization. Best for very large datasets:
- Requires existing data for clustering
- Lower memory usage
- Faster index creation
```python
store = factory.create(
    backend="pgvector",
    connection_string="postgresql://...",
    dimensions=768,
    index_type="ivfflat",
    auto_create_index=True,
    min_rows_for_index=1000,  # Create the index when this threshold is reached
    index_params={
        "lists": 100  # Number of clusters (sqrt(n) is a good starting point)
    }
)
```
Explicit Index Creation¶
For more control, create indexes explicitly:
```python
# Initialize without auto-index
store = factory.create(
    backend="pgvector",
    connection_string="postgresql://...",
    dimensions=768,
    auto_create_index=False
)
await store.initialize()

# Add data first (required for IVFFlat)
await store.add_vectors(vectors, metadata=metadata)

# Create the index with custom parameters
await store.create_index(
    index_type="ivfflat",
    params={"lists": 200}
)
```
Column Mappings¶
Customize column names to match your existing schema:
```python
store = factory.create(
    backend="pgvector",
    connection_string="postgresql://...",
    dimensions=768,
    table_name="product_embeddings",
    columns={
        "id": "product_id",
        "embedding": "vector_data",
        "content": "description",
        "metadata": "attributes",
        "domain_id": "category",
        "document_id": "source_file",
        "chunk_index": "segment_num",
        "created_at": "indexed_at"
    },
    id_type="text",
    auto_create_table=True
)
```
Default column names:
| Logical Name | Default Column | Description |
|---|---|---|
| `id` | `id` | Primary key |
| `embedding` | `embedding` | Vector data |
| `content` | `content` | Source text |
| `metadata` | `metadata` | JSONB metadata |
| `domain_id` | `domain_id` | Multi-tenant domain |
| `document_id` | `document_id` | Source document reference |
| `chunk_index` | `chunk_index` | Chunk sequence number |
| `created_at` | `created_at` | Timestamp |
Multi-tenant Usage¶
Isolate data by domain for SaaS applications:
```python
# Create a store for a specific tenant
tenant_store = factory.create(
    backend="pgvector",
    connection_string="postgresql://...",
    dimensions=768,
    domain_id="tenant_123"  # All operations are filtered to this domain
)
await tenant_store.initialize()

# Add vectors (automatically tagged with domain_id)
await tenant_store.add_vectors(vectors, metadata=metadata)

# Search only returns vectors from this domain
results = await tenant_store.search(query, k=10)

# Count only vectors in this domain
count = await tenant_store.count()

# Clear only removes vectors in this domain
await tenant_store.clear()
```
Usage Examples¶
Basic CRUD Operations¶
```python
# Add vectors
vectors = np.random.rand(10, 768).astype(np.float32)
ids = await store.add_vectors(
    vectors,
    ids=[f"doc_{i + 1}" for i in range(10)],  # Optional; auto-generated if not provided
    metadata=[{"source": "file.pdf"} for _ in range(10)]
)

# Get vectors by ID
results = await store.get_vectors(["doc_1", "doc_2"], include_metadata=True)
for vec, meta in results:
    if vec is not None:
        print(f"Vector shape: {vec.shape}, Metadata: {meta}")

# Update metadata
await store.update_metadata(
    ["doc_1", "doc_2"],
    [{"source": "updated.pdf"}, {"source": "new.pdf"}]
)

# Delete vectors
deleted_count = await store.delete_vectors(["doc_1", "doc_2"])

# Count vectors
total = await store.count()
filtered = await store.count(filter={"category": "science"})

# Clear all vectors
await store.clear()
```
Search with Filters¶
```python
# Search with metadata filters
results = await store.search(
    query_vector,
    k=10,
    filter={"category": "technology", "year": "2024"},
    include_metadata=True
)

for id, score, metadata in results:
    print(f"ID: {id}")
    print(f"Score: {score:.4f}")
    print(f"Content: {metadata.get('content', '')[:100]}...")
```
Using with Context Manager¶
```python
from dataknobs_data.vector.stores.pgvector import PgVectorStore

config = {
    "connection_string": "postgresql://...",
    "dimensions": 768,
    "metric": "cosine"
}

store = PgVectorStore(config)
try:
    await store.initialize()
    # Use store...
finally:
    await store.close()
```
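If `PgVectorStore` does not itself implement the async context manager protocol, a thin wrapper provides the `async with` ergonomics that the try/finally pattern above emulates. A minimal sketch (the `open_store` helper is hypothetical, not part of the library):

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def open_store(config):
    # Hypothetical helper: wraps initialize()/close() so close()
    # is guaranteed to run even if the body raises.
    store = PgVectorStore(config)
    await store.initialize()
    try:
        yield store
    finally:
        await store.close()

# Usage:
# async with open_store(config) as store:
#     results = await store.search(query, k=5)
```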
Distance Metrics¶
| Metric | Operator | Use Case |
|---|---|---|
| `cosine` | `<=>` | Normalized embeddings, semantic similarity |
| `euclidean` | `<->` | Spatial data, when magnitude matters |
| `inner_product` | `<#>` | Dot-product similarity (MaxSim) |
The score returned by `search()` is converted to a similarity score:

- Cosine: `1 - distance` (0 = different, 1 = identical)
- Euclidean: `1 / (1 + distance)` (0 = far, 1 = close)
- Inner product: raw inner product value
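As a sketch, the conversion applied to each raw distance amounts to the following (illustrative only; the store performs the equivalent internally):

```python
def to_similarity(metric: str, distance: float) -> float:
    # Mirror of the conversion rules listed above
    if metric == "cosine":
        return 1.0 - distance          # 1 = identical vectors
    if metric == "euclidean":
        return 1.0 / (1.0 + distance)  # approaches 1 as distance -> 0
    return distance                    # inner_product: raw value
```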
Performance Optimization¶
Index Selection Guide¶
| Dataset Size | Recommended Index | Parameters |
|---|---|---|
| < 1,000 | None | Exact search is fast enough |
| 1,000 - 100,000 | HNSW | `m=16, ef_construction=64` |
| 100,000 - 1,000,000 | HNSW or IVFFlat | HNSW: `m=32, ef_construction=128` |
| > 1,000,000 | IVFFlat | `lists=sqrt(n)` |
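The `lists=sqrt(n)` rule of thumb is easy to compute when (re)building an IVFFlat index. A small helper (hypothetical, for illustration):

```python
import math

def suggest_lists(row_count: int) -> int:
    # sqrt(n) heuristic from the table above, with a floor of 100
    return max(100, int(math.sqrt(row_count)))

# suggest_lists(1_000_000) -> 1000
```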
Connection Pool Tuning¶
```python
# For high-throughput applications
store = factory.create(
    backend="pgvector",
    connection_string="postgresql://...",
    dimensions=768,
    pool_min_size=5,   # Keep more connections ready
    pool_max_size=20   # Allow more concurrent operations
)
```
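A larger pool pays off when queries run concurrently, for example when fanning out several searches at once. A sketch using the API shown above:

```python
import asyncio

# Run several searches concurrently; the pool serves them in parallel
queries = [np.random.rand(768).astype(np.float32) for _ in range(8)]
all_results = await asyncio.gather(
    *(store.search(q, k=5) for q in queries)
)
```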
Batch Operations¶
```python
# Add vectors in batches for better performance
batch_size = 1000
for i in range(0, len(all_vectors), batch_size):
    batch = all_vectors[i:i + batch_size]
    batch_meta = all_metadata[i:i + batch_size]
    await store.add_vectors(batch, metadata=batch_meta)
```
Production Considerations¶
Environment Variables¶
```python
# The store will use DATABASE_URL if connection_string is not provided
store = factory.create(
    backend="pgvector",
    dimensions=768
)
```
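For example, set the variable in the service environment (placeholder values):

```bash
export DATABASE_URL="postgresql://user:pass@localhost:5432/mydb"
```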
Index Management in Distributed Environments¶
The pgvector backend safely handles index creation in multi-instance deployments by querying the PostgreSQL `pg_indexes` catalog rather than relying on in-memory state:
```python
# Safe for AWS ECS, Kubernetes, etc.
store = factory.create(
    backend="pgvector",
    connection_string="postgresql://...",
    dimensions=768,
    index_type="ivfflat",
    auto_create_index=True,
    min_rows_for_index=1000
)
# Multiple instances can run safely - only one will create the index
```
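Conceptually, the existence check boils down to a catalog query like this (illustrative; the backend's exact SQL may differ):

```sql
-- Does an index already exist on this table?
SELECT indexname
FROM pg_indexes
WHERE schemaname = 'edubot'
  AND tablename = 'knowledge_embeddings';
```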
Schema Migration¶
The auto-created table schema:
```sql
CREATE TABLE IF NOT EXISTS {schema}.{table_name} (
    {id} UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    {domain_id} VARCHAR(100),
    {document_id} VARCHAR(255),
    {chunk_index} INTEGER,
    {content} TEXT,
    {embedding} vector({dimensions}),
    {metadata} JSONB DEFAULT '{}',
    {created_at} TIMESTAMP DEFAULT NOW()
);
```
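For reference, the HNSW and IVFFlat index types correspond to standard pgvector DDL. With the default schema, table, and column names and the cosine metric, the statements look along these lines (illustrative):

```sql
-- HNSW: can be built on an empty table
CREATE INDEX ON edubot.knowledge_embeddings
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- IVFFlat: build after data is loaded so clustering has samples
CREATE INDEX ON edubot.knowledge_embeddings
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```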
Troubleshooting¶
Common Issues¶
"pgvector extension not installed"
"asyncpg is not installed"
Poor search recall with IVFFlat
- Increase the `lists` parameter (try `sqrt(n)`, where n is the row count)
- Ensure the index was created after sufficient data was loaded
- Consider switching to HNSW for better recall
Slow index creation
- HNSW: lower `ef_construction` (faster build, lower recall)
- IVFFlat: use fewer `lists` (faster build, lower recall)