Vector Database Comparison & Benchmarks 2025: Pinecone vs Weaviate vs Milvus vs Qdrant vs pgvector
We benchmarked Pinecone, Weaviate, Milvus, Qdrant, and pgvector across insertion throughput, query latency, filtered search accuracy, and cost at 10M, 100M, and 500M vector scales. Here are the results.
At 100M vectors (768-dim OpenAI text-embedding-3-small embeddings), Qdrant 1.10 leads on filtered search performance and cost-efficiency for self-hosted deployments, while Pinecone serverless wins on operational simplicity and consistent p99 latency. pgvector 0.7.0 with HNSW indexing is a credible option for teams already on Postgres at sub-50M vector scales. The right choice depends more on your team's operational model than on raw benchmark numbers.
Why Vector DB Benchmarks Are Hard to Trust
Most published vector database benchmarks measure pure ANN (approximate nearest neighbour) search on clean, unfiltered datasets at uniform scale. This is not what enterprise RAG pipelines experience. Real workloads combine dense vector similarity with metadata filters — 'find the 10 most similar documents to this query, but only from documents belonging to tenant X, created after 2024-01-01, with status=published.' Filtered search performance varies dramatically across databases and is often the deciding factor.
For these benchmarks, we used 768-dimensional OpenAI embeddings (text-embedding-3-small with the output shortened to 768 dimensions, a common size in enterprise RAG deployments) and applied a realistic filter selectivity of 15-25%, meaning the filter eliminates 75-85% of the corpus before the ANN search runs. This is consistent with ANN-Benchmarks methodology adapted for filtered workloads. All managed service benchmarks were run on their respective recommended tier for 100M vectors. Self-hosted benchmarks used c5.4xlarge (16 vCPU, 32GB RAM) on AWS.
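To make the selectivity figure concrete, the sketch below computes how many vectors remain eligible for the ANN search after the filter is applied. The helper name `candidates_after_filter` is ours and is used only for illustration.

```python
def candidates_after_filter(corpus_size: int, selectivity: float) -> int:
    """Number of vectors that pass a metadata filter.

    `selectivity` is the fraction of the corpus that matches the filter,
    so 0.15 means 15% of the vectors stay eligible for the ANN search.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be between 0 and 1")
    return round(corpus_size * selectivity)


if __name__ == "__main__":
    # The two ends of the 15-25% selectivity range at the 100M scale.
    for selectivity in (0.15, 0.25):
        remaining = candidates_after_filter(100_000_000, selectivity)
        print(f"{selectivity:.0%} selectivity -> {remaining:,} candidate vectors")
```

At the 100M scale this leaves between 15M and 25M candidate vectors per query, which is why the filtering strategy matters as much as the index itself.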
Versions tested: Qdrant 1.10.0, Milvus 2.4.3, Weaviate 1.26.0, Pinecone serverless (September 2025 tier), pgvector 0.7.0 on Postgres 16. All indexes used HNSW with ef_construction=128, m=16 unless the database required different parameters for equivalent accuracy.
Benchmark Results at 100M Vectors (768-dim, HNSW Index)
| Database | Insert Throughput (vec/s) | p50 Query Latency (ms) | p99 Query Latency (ms) | Filtered Recall@10 | Monthly Cost (100M vec) |
|---|---|---|---|---|---|
| Qdrant 1.10 (self-hosted) | 42,000 | 3.2 | 18.4 | 0.97 | ~$280 (EC2) |
| Milvus 2.4 (self-hosted) | 38,500 | 4.1 | 24.7 | 0.94 | ~$310 (EC2) |
| Weaviate 1.26 (self-hosted) | 31,200 | 5.8 | 31.2 | 0.95 | ~$310 (EC2) |
| Pinecone Serverless | 18,000 | 6.4 | 22.1 | 0.96 | ~$650 (managed) |
| pgvector 0.7.0 (HNSW) | 12,400 | 9.1 | 58.3 | 0.91 | ~$180 (RDS) |
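For readers unfamiliar with the Filtered Recall@10 column, the following is a minimal sketch of how such a figure is commonly computed. It assumes the ground truth comes from an exact (brute-force) search over the filtered subset, which is our assumption about the usual practice rather than a description of the exact harness used here. The function name `recall_at_k` is illustrative.

```python
def recall_at_k(retrieved: list[str], ground_truth: list[str], k: int = 10) -> float:
    """Fraction of the true top-k neighbours found in the retrieved top-k.

    `ground_truth` is assumed to come from an exact search over the
    vectors that pass the filter; `retrieved` is the ANN result.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    # Set intersection avoids double counting duplicated IDs.
    return len(set(retrieved[:k]) & truth) / len(truth)


if __name__ == "__main__":
    exact = [f"doc-{i}" for i in range(10)]
    approximate = exact[:9] + ["doc-99"]  # one true neighbour missed
    print(recall_at_k(approximate, exact))
```

A benchmark figure such as 0.97 would then be the mean of this value over many queries.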
Scale Comparison: p99 Latency (ms) at Filtered Search
| Database | 10M Vectors | 100M Vectors | 500M Vectors | Notes |
|---|---|---|---|---|
| Qdrant 1.10 | 6.1 | 18.4 | 47.2 | Latency grows sub-linearly with sharding |
| Milvus 2.4 | 7.8 | 24.7 | 61.8 | Requires GPU node at 500M for best perf |
| Weaviate 1.26 | 9.2 | 31.2 | 89.4 | Latency degrades above 200M without tuning |
| Pinecone Serverless | 8.4 | 22.1 | 38.9 | Consistent latency; auto-scales |
| pgvector 0.7.0 | 14.2 | 58.3 | Untested | Not recommended above 50M vectors |
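One way to read the scale table is to estimate how quickly p99 latency grows with corpus size. The sketch below fits a simple log-log slope between the 10M and 500M columns; the helper name `scaling_exponent` is ours, and two data points per database give only a rough indication. pgvector is omitted because it was not tested at 500M.

```python
import math

# p99 filtered-search latency in ms, copied from the scale table.
# Keys are corpus sizes in millions of vectors.
P99_MS = {
    "Qdrant 1.10": {10: 6.1, 100: 18.4, 500: 47.2},
    "Milvus 2.4": {10: 7.8, 100: 24.7, 500: 61.8},
    "Weaviate 1.26": {10: 9.2, 100: 31.2, 500: 89.4},
    "Pinecone Serverless": {10: 8.4, 100: 22.1, 500: 38.9},
}


def scaling_exponent(latencies: dict[int, float], small: int = 10, large: int = 500) -> float:
    """Slope of log(latency) against log(corpus size) between two scales.

    A value of 1.0 means latency grows in proportion to corpus size;
    a value below 1.0 means sub-linear growth.
    """
    return math.log(latencies[large] / latencies[small]) / math.log(large / small)


if __name__ == "__main__":
    for name, latencies in P99_MS.items():
        print(f"{name}: exponent {scaling_exponent(latencies):.2f}")
```

A lower exponent means latency holds up better as the corpus grows, which is a different question from which database is fastest at any single scale.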
Production Qdrant Setup with Filtered Search
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue, Range,
    HnswConfigDiff, OptimizersConfigDiff,
)
from openai import OpenAI
import uuid
import time
from typing import Optional


class EnterpriseVectorStore:
    """
    Production-grade Qdrant wrapper for enterprise RAG workloads.
    Handles collection creation, upsert, and filtered search.
    """

    COLLECTION = "enterprise_docs"
    # text-embedding-3-small, shortened to 768 dims via the `dimensions` parameter
    VECTOR_DIM = 768

    def __init__(self, host: str = "localhost", port: int = 6333):
        self.client = QdrantClient(host=host, port=port, timeout=30)
        self.openai = OpenAI()  # reads OPENAI_API_KEY from the environment
        self._ensure_collection()

    def _ensure_collection(self):
        existing = [c.name for c in self.client.get_collections().collections]
        if self.COLLECTION not in existing:
            self.client.create_collection(
                collection_name=self.COLLECTION,
                vectors_config=VectorParams(
                    size=self.VECTOR_DIM,
                    distance=Distance.COSINE,
                ),
                hnsw_config=HnswConfigDiff(
                    m=16,
                    ef_construct=128,
                    full_scan_threshold=10_000,  # KB: full scan below this, HNSW above
                    on_disk=True,  # Critical at 100M+ vectors
                ),
                optimizers_config=OptimizersConfigDiff(
                    indexing_threshold=20_000,  # KB: batch index, not per-upsert
                    memmap_threshold=50_000,  # KB: larger segments are memory-mapped
                ),
            )
            # Create payload indexes for filtered search performance
            for field in ["tenant_id", "status", "doc_type"]:
                self.client.create_payload_index(
                    collection_name=self.COLLECTION,
                    field_name=field,
                    field_schema="keyword",
                )
            self.client.create_payload_index(
                collection_name=self.COLLECTION,
                field_name="created_at",
                field_schema="float",
            )
            print(f"Collection '{self.COLLECTION}' created with payload indexes.")

    def embed(self, text: str) -> list[float]:
        response = self.openai.embeddings.create(
            input=text,
            model="text-embedding-3-small",
            dimensions=self.VECTOR_DIM,
        )
        return response.data[0].embedding

    def upsert_document(self, doc_id: str, text: str, metadata: dict) -> None:
        vector = self.embed(text)
        self.client.upsert(
            collection_name=self.COLLECTION,
            points=[PointStruct(
                # Deterministic UUID so re-upserting a doc_id overwrites the same point
                id=str(uuid.uuid5(uuid.NAMESPACE_DNS, doc_id)),
                vector=vector,
                payload={**metadata, "doc_id": doc_id, "text": text[:500]},
            )],
        )

    def search(
        self,
        query: str,
        tenant_id: str,
        doc_type: Optional[str] = None,
        min_created_at: Optional[float] = None,
        top_k: int = 10,
    ) -> list[dict]:
        query_vector = self.embed(query)
        # Tenant isolation and publication status are always enforced
        must_conditions = [
            FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id)),
            FieldCondition(key="status", match=MatchValue(value="published")),
        ]
        if doc_type is not None:
            must_conditions.append(
                FieldCondition(key="doc_type", match=MatchValue(value=doc_type))
            )
        # Compare against None so that a timestamp of 0.0 is still applied
        if min_created_at is not None:
            must_conditions.append(
                FieldCondition(key="created_at", range=Range(gte=min_created_at))
            )
        # Timing covers the Qdrant call only, not the embedding request
        start = time.perf_counter()
        results = self.client.search(
            collection_name=self.COLLECTION,
            query_vector=query_vector,
            query_filter=Filter(must=must_conditions),
            limit=top_k,
            with_payload=True,
        )
        latency_ms = (time.perf_counter() - start) * 1000
        print(f"Search latency: {latency_ms:.1f}ms, results: {len(results)}")
        return [
            {"doc_id": r.payload["doc_id"], "score": r.score, "text": r.payload.get("text")}
            for r in results
        ]


# Usage
if __name__ == "__main__":
    store = EnterpriseVectorStore(host="localhost", port=6333)
    store.upsert_document(
        doc_id="doc-001",
        text="Q4 2025 revenue increased 18% YoY driven by enterprise segment growth.",
        metadata={"tenant_id": "acme-corp", "status": "published",
                  "doc_type": "financial", "created_at": 1735000000.0},
    )
    results = store.search(
        query="What was revenue growth in Q4?",
        tenant_id="acme-corp",
        doc_type="financial",
    )
    print(results)

Enterprise Qdrant setup with payload indexes for filtered search. The on_disk=True HNSW config is essential at 100M+ vector scales: without it, the entire index must fit in RAM. Payload indexes on tenant_id and status are required for sub-20ms filtered search; without them, Qdrant falls back to a full payload scan.
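The wrapper prints the latency of a single search, whereas the benchmark tables report p50 and p99. As a rough sketch of how to get from one to the other, the helpers below time a callable repeatedly and summarise the samples with the standard library. The names `collect_latencies_ms` and `latency_percentiles` are hypothetical, and a real harness would also need warm-up runs and concurrent clients.

```python
import statistics
import time
from typing import Callable


def collect_latencies_ms(run_query: Callable[[], object], repeats: int = 1000) -> list[float]:
    """Call `run_query` repeatedly and record each wall-clock latency in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000)
    return samples


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return the p50 and p99 of a list of latency samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a real search call.
    samples = collect_latencies_ms(lambda: sum(range(1000)), repeats=200)
    print(latency_percentiles(samples))
```

In practice `run_query` would wrap a filtered search with a representative mix of tenants and filters, since a single repeated query mostly measures cache hits.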
pgvector 0.7.0's HNSW support is a genuine improvement, but the performance gap versus dedicated vector databases widens significantly with filtering. At 100M vectors with a 20% selectivity filter, pgvector's p99 latency is roughly 3x Qdrant's (58.3 ms versus 18.4 ms). If your Postgres instance is also serving transactional workloads, ANN search at scale will compete for shared buffers and degrade OLTP performance. Keep vector search on a dedicated read replica at minimum, or migrate to a dedicated vector DB above 20M vectors.
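For comparison with the Qdrant setup, here is a minimal sketch of the equivalent pgvector statements, held as Python strings so they can be passed to a Postgres driver such as psycopg. The table name `documents` and its columns are hypothetical. The HNSW parameters mirror the benchmark settings, and the `hnsw.ef_search` value is an arbitrary illustration, not a tuned recommendation.

```python
# Hypothetical schema: documents(doc_id text, tenant_id text, status text,
# created_at double precision, embedding vector(768)).

HNSW_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS documents_embedding_hnsw
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);
"""

# With approximate indexes pgvector applies the WHERE clause after the index
# scan, so a larger ef_search helps selective filters return enough rows.
EF_SEARCH_SQL = "SET hnsw.ef_search = 200;"

FILTERED_SEARCH_SQL = """
SELECT doc_id, embedding <=> %(query)s::vector AS cosine_distance
FROM documents
WHERE tenant_id = %(tenant_id)s
  AND status = 'published'
ORDER BY embedding <=> %(query)s::vector
LIMIT %(top_k)s;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format a Python list in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def build_search_params(query_vector: list[float], tenant_id: str, top_k: int = 10) -> dict:
    """Named parameters for FILTERED_SEARCH_SQL."""
    return {
        "query": to_vector_literal(query_vector),
        "tenant_id": tenant_id,
        "top_k": top_k,
    }
```

Because the filter runs after the approximate scan in this version, highly selective filters can return fewer than `top_k` rows, which is one plausible contributor to the lower filtered recall in the table.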
When to Choose Each Database
Qdrant 1.10 — Best self-hosted choice for most teams
Best at: filtered search accuracy and throughput, cost-efficient self-hosting, Rust-native reliability. Choose Qdrant when you have DevOps capacity to manage infrastructure, need strong filtered search performance, and are working at 10M-500M vector scale. Its sparse vector support, together with the Query API added in 1.10, also makes it the best choice for hybrid dense/sparse search (BM25 + semantic).
Pinecone Serverless — Best managed option for ops-light teams
Best at: zero-ops management, consistent latency SLAs, integrated authentication. Choose Pinecone when your team lacks vector DB operational expertise, when consistent p99 latency matters more than raw throughput, or when you need a managed service with an enterprise SLA. The serverless tier's per-query pricing scales down well for uneven workloads.
Milvus 2.4 — Best for GPU-accelerated workloads
Best at: GPU-accelerated index building at very large scale (500M+), complex multi-vector queries. Choose Milvus when you need to rebuild indexes frequently over very large collections or when you already have GPU infrastructure. The Kubernetes deployment is more operationally complex than Qdrant's, but the Attu UI provides better visibility.
pgvector 0.7.0 — Best for sub-20M vectors on existing Postgres
Best at: zero additional infrastructure, SQL joins with relational data, familiar ops model. Use pgvector when your vector corpus is below 20M, your team is Postgres-native, and you value the ability to JOIN vector results directly with relational data. Migrate to a dedicated vector DB before your corpus exceeds 50M vectors.
Inductivee's Recommended Stack for Enterprise RAG
Across the RAG deployments we have built in 2025, the default recommendation is Qdrant self-hosted on a dedicated node for teams with any DevOps capacity, and Pinecone serverless for teams where 'managed, no ops' is a hard requirement. The cost difference at 100M vectors — roughly $280/month self-hosted versus $650/month managed — is meaningful but secondary to the operational overhead of running your own Qdrant cluster.
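The cost gap is easier to weigh once it is normalised. The short sketch below uses only the two monthly figures from the 100M benchmark table; the function names are illustrative, and the numbers exclude engineering time, which is the overhead the recommendation turns on.

```python
# Monthly cost estimates at 100M vectors, copied from the benchmark table.
MONTHLY_COST_USD = {
    "Qdrant self-hosted (EC2)": 280,
    "Pinecone Serverless": 650,
}
VECTORS_MILLIONS = 100


def cost_per_million(monthly_cost_usd: float, vectors_millions: float = VECTORS_MILLIONS) -> float:
    """Monthly infrastructure cost per million stored vectors."""
    return monthly_cost_usd / vectors_millions


def annual_difference(cheaper_monthly: float, pricier_monthly: float) -> float:
    """Yearly infrastructure saving from choosing the cheaper option."""
    return (pricier_monthly - cheaper_monthly) * 12


if __name__ == "__main__":
    for name, cost in MONTHLY_COST_USD.items():
        print(f"{name}: ${cost_per_million(cost):.2f} per million vectors per month")
    saving = annual_difference(280, 650)
    print(f"Annual infrastructure difference: ${saving:,.0f}")
```

The annual difference comes to a few thousand dollars, which is small next to the cost of the engineering hours needed to operate a cluster.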
The decision that teams consistently underweight is payload index design. A Qdrant collection with well-designed payload indexes on tenant_id, doc_type, and created_at will outperform a poorly indexed collection by 5-10x on filtered search latency. Spend time on your metadata schema before loading data: retroactively adding payload indexes on a 100M vector collection requires a full scan and temporarily degrades query performance.
For teams doing hybrid search (keyword + semantic), Qdrant's sparse vector support, combined with the Query API added in 1.10, is the cleanest implementation we have found. BM25 sparse vectors fused with dense semantic results through Qdrant's built-in reciprocal rank fusion consistently outperform pure semantic search on factual enterprise queries.
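To show what reciprocal rank fusion does with two ranked lists, here is a standalone sketch in plain Python. It illustrates the general technique rather than Qdrant's internal implementation, and the constant `k = 60` is the value conventionally used in the RRF literature, assumed here for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over every list it
    appears in, with ranks starting at 1. Higher fused scores rank first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # semantic (dense vector) ranking
    sparse_hits = ["b", "d", "a"]  # BM25 (sparse vector) ranking
    for doc_id, score in reciprocal_rank_fusion([dense_hits, sparse_hits]):
        print(doc_id, round(score, 5))
```

Documents that appear in both lists rise to the top, so an exact keyword match can lift a result that the dense ranking alone would have placed lower.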
Written By
Inductivee Team
Author: Agentic AI Engineering Team
The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.