
Vector Database Comparison & Benchmarks 2025: Pinecone vs Weaviate vs Milvus vs Qdrant vs pgvector

We benchmarked Pinecone, Weaviate, Milvus, Qdrant, and pgvector across insertion throughput, query latency, filtered search accuracy, and cost at 10M, 100M, and 500M vector scales. Here are the results.

Inductivee Team · AI Engineering · September 15, 2025 (updated April 15, 2026) · 11 min read

TL;DR

At 100M vectors (768-dim OpenAI embeddings), Qdrant 1.10 leads on filtered search performance and cost-efficiency for self-hosted deployments, while Pinecone serverless wins on operational simplicity and consistent p99 latency. pgvector 0.7.0 with HNSW indexing is a credible option for teams already on Postgres at sub-50M vector scales, and is most comfortable below 20M. The right choice depends more on your team's operational model than on raw benchmark numbers.

Why Vector DB Benchmarks Are Hard to Trust

Most published vector database benchmarks measure pure ANN (approximate nearest neighbour) search on clean, unfiltered datasets at uniform scale. This is not what enterprise RAG pipelines experience. Real workloads combine dense vector similarity with metadata filters — 'find the 10 most similar documents to this query, but only from documents belonging to tenant X, created after 2024-01-01, with status=published.' Filtered search performance varies dramatically across databases and is often the deciding factor.

For these benchmarks, we used 768-dimensional OpenAI embeddings, a common dimension in enterprise RAG deployments. (ada-002 itself emits 1536-dimensional vectors; 768 dimensions is what text-embedding-3-small returns when called with dimensions=768, as in the code later in this article.) We applied a realistic filter selectivity of 15-25%, meaning the filter excludes 75-85% of the corpus and results may only come from the remaining 15-25%. The setup follows the ANN-Benchmarks practice of reporting recall alongside latency, adapted for filtered workloads. All managed service benchmarks were run on the provider's recommended tier for 100M vectors. Self-hosted benchmarks used c5.4xlarge instances (16 vCPU, 32 GiB RAM) on AWS.
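
To make the selectivity figure concrete, the following is a minimal sketch of how selectivity and a brute-force filtered ground truth could be computed on a small sample. The helper names (`filter_selectivity`, `exact_filtered_top_k`) and the toy payloads are illustrative assumptions, not the harness that produced the numbers in this article.

```python
import math
from typing import Callable

Payload = dict[str, object]


def filter_selectivity(payloads: list[Payload], keep: Callable[[Payload], bool]) -> float:
    """Fraction of the corpus that survives the metadata filter."""
    if not payloads:
        return 0.0
    return sum(1 for p in payloads if keep(p)) / len(payloads)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_filtered_top_k(
    query: list[float],
    vectors: list[list[float]],
    payloads: list[Payload],
    keep: Callable[[Payload], bool],
    k: int = 10,
) -> list[int]:
    """Brute-force ground truth: apply the filter, then rank survivors by cosine."""
    candidates = [i for i, p in enumerate(payloads) if keep(p)]
    candidates.sort(key=lambda i: cosine(query, vectors[i]), reverse=True)
    return candidates[:k]


if __name__ == "__main__":
    # Toy corpus (hypothetical): five documents, two tenants.
    payloads = [
        {"tenant_id": "a", "status": "published"},
        {"tenant_id": "a", "status": "draft"},
        {"tenant_id": "b", "status": "published"},
        {"tenant_id": "a", "status": "published"},
        {"tenant_id": "b", "status": "draft"},
    ]
    vectors = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
    keep = lambda p: p["tenant_id"] == "a" and p["status"] == "published"
    print(filter_selectivity(payloads, keep))
    print(exact_filtered_top_k([1.0, 0.0], vectors, payloads, keep, k=2))
```

The brute-force result is what an approximate index's filtered output gets compared against when computing recall.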

Versions tested: Qdrant 1.10.0, Milvus 2.4.3, Weaviate 1.26.0, Pinecone serverless (September 2025 tier), pgvector 0.7.0 on Postgres 16. All indexes used HNSW with ef_construction=128, m=16 unless the database required different parameters for equivalent accuracy.

Benchmark Results at 100M Vectors (768-dim, HNSW Index)

Database	Insert Throughput (vec/s)	p50 Query Latency (ms)	p99 Query Latency (ms)	Filtered Recall@10	Monthly Cost (100M vec)
Qdrant 1.10 (self-hosted)	42,000	3.2	18.4	0.97	~$280 (EC2)
Milvus 2.4 (self-hosted)	38,500	4.1	24.7	0.94	~$310 (EC2)
Weaviate 1.26 (self-hosted)	31,200	5.8	31.2	0.95	~$310 (EC2)
Pinecone Serverless	18,000	6.4	22.1	0.96	~$650 (managed)
pgvector 0.7.0 (HNSW)	12,400	9.1	58.3	0.91	~$180 (RDS)
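
Two derived figures help when reading this table: the p99/p50 ratio (how heavy the latency tail is) and the monthly cost per million vectors. The sketch below recomputes both from the rows above; the `Result` class and function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    p50_ms: float
    p99_ms: float
    monthly_cost_usd: float


# Values copied from the 100M-vector results table above.
RESULTS_100M = [
    Result("Qdrant 1.10", 3.2, 18.4, 280),
    Result("Milvus 2.4", 4.1, 24.7, 310),
    Result("Weaviate 1.26", 5.8, 31.2, 310),
    Result("Pinecone Serverless", 6.4, 22.1, 650),
    Result("pgvector 0.7.0", 9.1, 58.3, 180),
]


def tail_ratio(r: Result) -> float:
    """p99 divided by p50: lower means a more predictable latency tail."""
    return r.p99_ms / r.p50_ms


def cost_per_million_vectors(r: Result, num_vectors: int = 100_000_000) -> float:
    """Monthly cost in USD per one million stored vectors."""
    return r.monthly_cost_usd / (num_vectors / 1_000_000)


if __name__ == "__main__":
    for r in sorted(RESULTS_100M, key=tail_ratio):
        print(f"{r.name:20s} p99/p50={tail_ratio(r):.2f}  "
              f"${cost_per_million_vectors(r):.2f} per 1M vectors")
```

On these numbers Pinecone has the tightest tail (about 3.5x), while the other four sit between roughly 5.4x and 6.4x.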

Scale Comparison: p99 Latency (ms) for Filtered Search

Database	10M Vectors	100M Vectors	500M Vectors	Notes
Qdrant 1.10	6.1	18.4	47.2	Linear scaling with sharding
Milvus 2.4	7.8	24.7	61.8	Requires GPU node at 500M for best perf
Weaviate 1.26	9.2	31.2	89.4	Latency degrades above 200M without tuning
Pinecone Serverless	8.4	22.1	38.9	Consistent latency; auto-scales
pgvector 0.7.0	14.2	58.3	Untested	Not recommended above 50M vectors
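
One way to read this table is to ask how fast latency grows relative to corpus size. The sketch below fits a simple power law, latency proportional to N^alpha, between the 10M and 500M columns. The power-law model and the function names are assumptions made for illustration; pgvector is left out because its 500M figure is untested.

```python
import math

# p99 latency in ms, keyed by corpus size in millions of vectors (from the table above).
P99_MS = {
    "Qdrant 1.10": {10: 6.1, 100: 18.4, 500: 47.2},
    "Milvus 2.4": {10: 7.8, 100: 24.7, 500: 61.8},
    "Weaviate 1.26": {10: 9.2, 100: 31.2, 500: 89.4},
    "Pinecone Serverless": {10: 8.4, 100: 22.1, 500: 38.9},
}


def growth_factor(series: dict[int, float], small: int = 10, large: int = 500) -> float:
    """How many times p99 latency grows between the two corpus sizes."""
    return series[large] / series[small]


def scaling_exponent(series: dict[int, float], small: int = 10, large: int = 500) -> float:
    """Exponent alpha in latency ~ N^alpha, estimated from two points."""
    return math.log(series[large] / series[small]) / math.log(large / small)


if __name__ == "__main__":
    for name, series in P99_MS.items():
        print(f"{name:20s} growth={growth_factor(series):.1f}x  "
              f"alpha={scaling_exponent(series):.2f}")
```

An exponent of 1.0 would mean latency grows in proportion to corpus size. With the figures above every engine comes out well below that: roughly 0.4 for Pinecone and 0.5 to 0.6 for the three self-hosted engines.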

Production Qdrant Setup with Filtered Search

python

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue, Range,
    HnswConfigDiff, OptimizersConfigDiff
)
from openai import OpenAI
import uuid
import time
from typing import Optional


class EnterpriseVectorStore:
    """
    Production-grade Qdrant wrapper for enterprise RAG workloads.
    Handles collection creation, upsert, and filtered search.
    """
    COLLECTION = "enterprise_docs"
    VECTOR_DIM = 768  # text-embedding-3-small shortened via dimensions=768 (ada-002 is fixed at 1536)

    def __init__(self, host: str = "localhost", port: int = 6333):
        self.client = QdrantClient(host=host, port=port, timeout=30)
        self.openai = OpenAI()
        self._ensure_collection()

    def _ensure_collection(self):
        existing = [c.name for c in self.client.get_collections().collections]
        if self.COLLECTION not in existing:
            self.client.create_collection(
                collection_name=self.COLLECTION,
                vectors_config=VectorParams(
                    size=self.VECTOR_DIM,
                    distance=Distance.COSINE,
                ),
                hnsw_config=HnswConfigDiff(
                    m=16,
                    ef_construct=128,
                    full_scan_threshold=10_000,  # KB; full scan below this size, HNSW above
                    on_disk=True,  # Critical at 100M+ vectors
                ),
                optimizers_config=OptimizersConfigDiff(
                    indexing_threshold=20_000,  # Batch index, not per-upsert
                    memmap_threshold=50_000,
                ),
            )
            # Create payload index for filtered search performance
            for field in ["tenant_id", "status", "doc_type"]:
                self.client.create_payload_index(
                    collection_name=self.COLLECTION,
                    field_name=field,
                    field_schema="keyword"
                )
            self.client.create_payload_index(
                collection_name=self.COLLECTION,
                field_name="created_at",
                field_schema="float"
            )
            print(f"Collection '{self.COLLECTION}' created with payload indexes.")

    def embed(self, text: str) -> list[float]:
        response = self.openai.embeddings.create(
            input=text,
            model="text-embedding-3-small",
            dimensions=self.VECTOR_DIM
        )
        return response.data[0].embedding

    def upsert_document(self, doc_id: str, text: str, metadata: dict) -> None:
        vector = self.embed(text)
        self.client.upsert(
            collection_name=self.COLLECTION,
            points=[PointStruct(
                id=str(uuid.uuid5(uuid.NAMESPACE_DNS, doc_id)),
                vector=vector,
                payload={**metadata, "doc_id": doc_id, "text": text[:500]}
            )]
        )

    def search(
        self,
        query: str,
        tenant_id: str,
        doc_type: Optional[str] = None,
        min_created_at: Optional[float] = None,
        top_k: int = 10,
    ) -> list[dict]:
        query_vector = self.embed(query)

        must_conditions = [
            FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id)),
            FieldCondition(key="status", match=MatchValue(value="published")),
        ]
        if doc_type:
            must_conditions.append(
                FieldCondition(key="doc_type", match=MatchValue(value=doc_type))
            )
        if min_created_at is not None:  # 0.0 is a valid timestamp bound
            must_conditions.append(
                FieldCondition(key="created_at", range=Range(gte=min_created_at))
            )

        start = time.perf_counter()
        results = self.client.search(
            collection_name=self.COLLECTION,
            query_vector=query_vector,
            query_filter=Filter(must=must_conditions),
            limit=top_k,
            with_payload=True,
        )
        latency_ms = (time.perf_counter() - start) * 1000
        print(f"Search latency: {latency_ms:.1f}ms, results: {len(results)}")

        return [
            {"doc_id": r.payload["doc_id"], "score": r.score, "text": r.payload.get("text")}
            for r in results
        ]


# Usage
if __name__ == "__main__":
    store = EnterpriseVectorStore(host="localhost", port=6333)
    store.upsert_document(
        doc_id="doc-001",
        text="Q4 2025 revenue increased 18% YoY driven by enterprise segment growth.",
        metadata={"tenant_id": "acme-corp", "status": "published",
                  "doc_type": "financial", "created_at": 1735000000.0}
    )
    results = store.search(
        query="What was revenue growth in Q4?",
        tenant_id="acme-corp",
        doc_type="financial",
    )
    print(results)

Enterprise Qdrant setup with payload indexes for filtered search. The on_disk=True HNSW config is essential at 100M+ vector scales — without it, the entire index must fit in RAM. Payload indexes on tenant_id and status are required for sub-20ms filtered search; without them, Qdrant falls back to full payload scan.
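
A back-of-the-envelope calculation shows why on-disk storage matters at this scale. The sketch below estimates raw float32 vector storage and HNSW link storage. The link estimate assumes layer 0 keeps up to 2*m neighbours per vector at 4 bytes each and ignores upper layers and payloads, so treat it as a rough lower bound rather than Qdrant's actual memory accounting.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_value


def hnsw_link_bytes(num_vectors: int, m: int, bytes_per_link: int = 4) -> int:
    """Rough layer-0 graph size, assuming up to 2*m links per vector (an assumption)."""
    return num_vectors * 2 * m * bytes_per_link


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


if __name__ == "__main__":
    n, dim, m = 100_000_000, 768, 16
    print(f"vectors: {to_gib(raw_vector_bytes(n, dim)):.1f} GiB")
    print(f"hnsw links: {to_gib(hnsw_link_bytes(n, m)):.1f} GiB")
```

Under these assumptions the vectors alone come to 307.2 GB and the graph links to 12.8 GB, against 32 GiB of RAM on the c5.4xlarge used for the self-hosted runs.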

Warning

pgvector 0.7.0's HNSW support is a genuine improvement, but the performance gap versus dedicated vector databases widens significantly with filtering. At 100M vectors with a 15-25% selectivity filter, pgvector's p99 latency is roughly 3x Qdrant's (58.3 ms versus 18.4 ms in our results). If your Postgres instance is also serving transactional workloads, ANN search at scale will compete for the shared buffer pool and degrade OLTP performance. Keep vector search on a dedicated read replica at minimum, and start planning a move to a dedicated vector DB once the corpus passes 20M vectors.

When to Choose Each Database

Qdrant 1.10 — Best self-hosted choice for most teams

Best at: filtered search accuracy and throughput, cost-efficient self-hosting, Rust-native reliability. Choose Qdrant when you have DevOps capacity to manage infrastructure, need strong filtered search performance, and are working at 10M-500M vector scale. Its sparse vector support, together with the query-time fusion available in the 1.10 release, also makes it the best choice for hybrid dense/sparse search (BM25 + semantic).

Pinecone Serverless — Best managed option for ops-light teams

Best at: zero-ops management, consistent latency SLAs, integrated authentication. Choose Pinecone when your team lacks vector DB operational expertise, when consistent p99 latency matters more than raw throughput, or when you need a managed service with an enterprise SLA. The serverless tier's per-query pricing scales down well for uneven workloads.

Milvus 2.4 — Best for GPU-accelerated workloads

Best at: GPU-accelerated index building at very large scale (500M+), complex multi-vector queries. Choose Milvus when you need to rebuild indexes frequently over very large collections or when you already have GPU infrastructure. The Kubernetes deployment is more operationally complex than Qdrant but the Attu UI provides better visibility.

pgvector 0.7.0 — Best for sub-20M vectors on existing Postgres

Best at: zero additional infrastructure, SQL joins with relational data, familiar ops model. Use pgvector when your vector corpus is below 20M, your team is Postgres-native, and you value the ability to JOIN vector results directly with relational data. Migrate to a dedicated vector DB before your corpus exceeds 50M vectors.
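
The guidance in this section can be condensed into a small decision helper. This is a sketch of the rules of thumb above, not a substitute for benchmarking your own workload; the function name and parameters are hypothetical, and Weaviate is absent only because this section gives no selection rule for it.

```python
def recommend_vector_db(
    num_vectors: int,
    has_devops_capacity: bool,
    already_on_postgres: bool = False,
    has_gpu_infra: bool = False,
) -> str:
    """Encode the article's selection guidance as ordered rules."""
    # Small corpus on an existing Postgres estate: stay on pgvector.
    if already_on_postgres and num_vectors < 20_000_000:
        return "pgvector"
    # No capacity to run infrastructure: pick the managed service.
    if not has_devops_capacity:
        return "Pinecone Serverless"
    # Very large collections with GPUs available: Milvus.
    if has_gpu_infra and num_vectors >= 500_000_000:
        return "Milvus"
    # Default self-hosted choice.
    return "Qdrant"


if __name__ == "__main__":
    print(recommend_vector_db(5_000_000, True, already_on_postgres=True))
    print(recommend_vector_db(100_000_000, False))
    print(recommend_vector_db(100_000_000, True))
```

The rule order matters: the pgvector check comes first so that a small Postgres-native team is not pushed to a managed service purely because it lacks vector DB operations experience.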

Inductivee's Recommended Stack for Enterprise RAG

Across the RAG deployments we have built in 2025, the default recommendation is Qdrant self-hosted on a dedicated node for teams with any DevOps capacity, and Pinecone serverless for teams where 'managed, no ops' is a hard requirement. The cost difference at 100M vectors — roughly $280/month self-hosted versus $650/month managed — is meaningful but secondary to the operational overhead of running your own Qdrant cluster.

The decision that teams consistently underweight is payload index design. A Qdrant collection with well-designed payload indexes on tenant_id, document_type, and date will outperform a poorly indexed collection by 5-10x on filtered search latency. Spend time on your metadata schema before loading data — retroactively adding payload indexes on a 100M vector collection requires a full scan and temporarily degrades query performance.
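
One lightweight way to enforce a metadata schema before loading data is to validate every payload against the fields you intend to index. The sketch below mirrors the four indexed fields from the Qdrant example earlier in this article; `PAYLOAD_SCHEMA` and `validate_payload` are hypothetical names, and the check is deliberately minimal.

```python
# Fields that will carry payload indexes, with the Python type each must have.
PAYLOAD_SCHEMA: dict[str, type] = {
    "tenant_id": str,
    "status": str,
    "doc_type": str,
    "created_at": float,
}


def validate_payload(payload: dict[str, object]) -> list[str]:
    """Return a list of problems; an empty list means the payload is acceptable."""
    problems: list[str] = []
    for field, expected in PAYLOAD_SCHEMA.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field} should be {expected.__name__}, got {actual}")
    return problems


if __name__ == "__main__":
    good = {"tenant_id": "acme-corp", "status": "published",
            "doc_type": "financial", "created_at": 1735000000.0}
    bad = {"tenant_id": "acme-corp", "doc_type": "financial", "created_at": 1735000000}
    print(validate_payload(good))
    print(validate_payload(bad))
```

Rejecting malformed payloads at ingestion time is cheaper than discovering after a 100M-vector load that a filter field is missing or inconsistently typed.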

For teams doing hybrid search (keyword + semantic), Qdrant 1.10's sparse vector support is the cleanest implementation we have found. Combining BM25 sparse vectors with dense semantic search through Qdrant's built-in reciprocal rank fusion consistently outperforms pure semantic search on factual enterprise queries.
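
For readers unfamiliar with reciprocal rank fusion, here is a minimal standalone sketch of the idea: each document scores 1 / (k + rank) in every ranked list it appears in, and the scores are summed. The constant k = 60 is a conventional default and an assumption here; Qdrant's built-in implementation may use different constants and tie-breaking.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # hypothetical dense (semantic) results
    sparse_hits = ["b", "d", "a"]  # hypothetical sparse (BM25) results
    for doc_id, score in reciprocal_rank_fusion([dense_hits, sparse_hits]):
        print(doc_id, round(score, 5))
```

Because only ranks are used, the dense and sparse scores never need to be normalised onto a common scale, which is the main practical appeal of this fusion method.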

Frequently Asked Questions

Which vector database is fastest for filtered search in 2025?

At 100M vectors with 15-25% filter selectivity, self-hosted Qdrant 1.10 achieves the best filtered search performance, with a p99 latency of 18.4 ms. Pinecone serverless follows at 22.1 ms with the advantage of consistent managed SLAs. The key factor is indexing the filter fields: every database we tested needs explicit indexes on those fields (payload indexes, in Qdrant's terms) to achieve sub-30ms performance at scale.

Should I use pgvector or a dedicated vector database?

pgvector 0.7.0 with HNSW indexing is viable below 20M vectors when you want to avoid additional infrastructure. Above 50M vectors, the p99 latency gap versus dedicated vector databases (roughly 2-3x at 100M vectors with filtering) and the risk of ANN search competing with OLTP workloads for shared resources make migration to a dedicated vector DB the right call.

How much does a vector database cost at 100M vectors?

Self-hosted Qdrant on a c5.4xlarge EC2 instance runs approximately $280/month at 100M vectors. Pinecone serverless costs approximately $650/month at the same scale, including query costs at moderate throughput. pgvector on RDS is the cheapest at roughly $180/month, but keeping query latency acceptable at this scale requires a larger instance class, which raises that figure.

What embedding dimension should I use for enterprise RAG?

768 dimensions is a common choice for enterprise RAG. It is not the native output of either OpenAI model mentioned in this article: text-embedding-3-small returns 1536 dimensions by default and can be shortened to 768 through the dimensions parameter, while ada-002 always returns 1536. At 768 dimensions you balance retrieval quality against storage cost. Using 1536 dimensions doubles storage and memory requirements with marginal recall improvement on most enterprise document retrieval tasks.
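
If embeddings were already generated at a larger size, OpenAI's documentation describes shortening text-embedding-3 vectors after the fact by truncating and then re-normalising them. A minimal sketch of that step follows; `shorten_embedding` is a hypothetical helper, and we would not assume the technique carries over to ada-002 or to other model families without testing recall.

```python
import math


def shorten_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale to unit L2 length."""
    if dim <= 0 or dim > len(vector):
        raise ValueError(f"dim must be between 1 and {len(vector)}")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalise
    return [x / norm for x in head]


if __name__ == "__main__":
    # Tiny stand-in for a real embedding.
    print(shorten_embedding([3.0, 4.0, 12.0], 2))
```

Re-normalising matters because cosine and dot-product scores are only comparable across documents when every stored vector has the same length.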

How do I benchmark my own vector database workload?

Use the ANN-Benchmarks framework as a baseline but add a filtered search layer that matches your production filter selectivity (typically 10-30% of corpus). Measure recall@10 alongside p50/p99 latency — databases that win on latency by sacrificing recall are not genuinely faster. Run benchmarks on your actual embedding model output, not synthetic vectors, as real embeddings cluster differently and stress index structures differently.
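
The two measurements named above are straightforward to compute yourself. The sketch below shows recall@k against a ground-truth list and a nearest-rank percentile for latency samples. The function names are illustrative, and nearest-rank is one of several percentile definitions, chosen here for simplicity.

```python
import math


def recall_at_k(retrieved: list[int], ground_truth: list[int], k: int = 10) -> float:
    """Share of the true top-k neighbours that appear in the retrieved top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    return len(set(retrieved[:k]) & truth) / len(truth)


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for p99."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    print(recall_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=4))
    latencies_ms = [float(i) for i in range(1, 101)]  # synthetic samples
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))
```

Reporting recall next to each latency figure keeps an index configuration that trades accuracy for speed from looking like a win.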

Written By

Inductivee Team

Author

Agentic AI Engineering Team

The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.

