RAG & Retrieval

Semantic Search for Enterprise Knowledge Bases: Engineering Beyond Full-Text

Full-text search returns documents that contain your query terms. Semantic search returns documents that address your intent. The engineering gap between these two — embeddings, hybrid retrieval, re-ranking — is where enterprise knowledge management is being rebuilt.

Inductivee Team · AI Engineering · December 4, 2025 (updated April 15, 2026) · 12 min read

TL;DR

Semantic search replaces keyword matching with intent matching using dense vector embeddings. Enterprise deployments almost always require hybrid retrieval — combining BM25 for lexical precision with vector search for semantic recall — fused via Reciprocal Rank Fusion (RRF). A cross-encoder re-ranker applied on the top-K candidates then closes the gap to production-grade precision.

Why Full-Text Search Fails Enterprise Knowledge Bases

BM25 is a remarkably durable algorithm. Built on term frequency and inverse document frequency, it powers Elasticsearch, OpenSearch, and Solr deployments that handle billions of queries daily. For exact-match retrieval (SKUs, error codes, named entity lookups) it remains hard to beat. But most queries against an enterprise knowledge base are not exact-match lookups.

When a product manager searches for "what is our policy on customer data retention for EMEA users," BM25 is looking for documents containing the tokens "customer," "data," "retention," and "EMEA." It will miss the compliance team's document titled "GDPR obligations for EU data subjects" even if that document directly answers the question. The semantic gap between query intent and document vocabulary is the core problem that full-text search cannot solve.

Enterprise knowledge compounds this problem. Policies, runbooks, design documents, and Confluence pages are written by dozens of authors with inconsistent terminology. The same concept appears under different names across departments. Sales calls it "churn," finance calls it "customer attrition," engineering calls it "account deletion." BM25 cannot bridge these synonymy gaps. Embedding models trained on large corpora can, because they encode meaning rather than tokens.

Embedding Model Comparison for Enterprise Deployment

Model	Dimensions	Max Tokens	Multilingual	Cost (per 1M tokens)	Best For
OpenAI text-embedding-3-large	3072 (reducible)	8191	No (EN-focused)	$0.13	High-accuracy EN retrieval, OpenAI stack
OpenAI text-embedding-3-small	1536 (reducible)	8191	No (EN-focused)	$0.02	Cost-sensitive EN workloads
Cohere embed-v3-english	1024	512	No	$0.10	Enterprise search, rerank pairing
Cohere embed-v3-multilingual	1024	512	Yes (100+ langs)	$0.10	Global enterprise, mixed-language corpora
BGE-M3 (BAAI)	1024	8192	Yes (100+ langs)	Self-hosted	Long-doc, multilingual, open-source deployments
E5-mistral-7b-instruct	4096	4096 (recommended)	Partial	Self-hosted	Long-context, instruction-following retrieval
GTE-large (Alibaba)	1024	512	Partial	Self-hosted	Strong MTEB scores, efficient self-hosted option

The Hybrid Retrieval Stack

No single retrieval method dominates across all query types. The enterprise production pattern is a four-stage pipeline: two retrievers for broad recall (BM25 and dense vectors), rank fusion to merge their results, then precision filtering with re-ranking.

Stage 1: BM25 Sparse Retrieval

Run a BM25 query against your document corpus to retrieve the top-50 candidates by lexical relevance. This stage catches exact terminology, product codes, and proper nouns that embedding models can blur together. In Elasticsearch or OpenSearch, BM25 is the default text scorer. For standalone BM25 without a full search engine, rank-bm25 provides a clean Python implementation over preprocessed corpora.

Stage 2: Dense Vector Retrieval

Run an approximate nearest-neighbor (ANN) query against your vector store to retrieve the top-50 candidates by semantic similarity. Qdrant, Weaviate, Pinecone, and pgvector are the common enterprise choices. ANN indices (HNSW is standard) make this sub-100ms at scale. Embed the query with the same model that embedded your corpus at index time; a mismatch between the two is a common production mistake when models are updated.

Stage 3: Reciprocal Rank Fusion (RRF)

Merge the two candidate lists using RRF: for each document, compute score = sum(1 / (k + rank_i)) across the retrieval lists it appears in. k is typically 60 (empirically validated). RRF does not require score normalization, which is its primary advantage over linear interpolation methods. The merged top-K list (typically 20-30 documents) is passed to the re-ranker.
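To make the arithmetic concrete, here is a minimal sketch of RRF over two toy ranked lists. The function name rrf_fuse and the document IDs are illustrative and not taken from any library; ranks are assumed to start at 1.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion (ranks start at 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for deterministic output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical lexical results
vector_ranking = ["doc_c", "doc_a", "doc_d"]  # hypothetical semantic results
fused = rrf_fuse([bm25_ranking, vector_ranking])

for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

In this toy case doc_a comes out on top (1/61 + 1/62) because it ranks near the top of both lists, ahead of doc_c (1/63 + 1/61), which leads only the vector list. Documents that appear in a single list still receive a score, just a smaller one.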

Stage 4: Cross-Encoder Re-Ranking

A cross-encoder model takes (query, document) pairs as input and produces a relevance score by attending to both jointly — unlike bi-encoders that encode query and document independently. This is computationally expensive (runs inference for each candidate pair), which is why it is applied only to the top-K from fusion rather than the full corpus. Cohere Rerank 3, cross-encoder/ms-marco-MiniLM-L-6-v2 (open-source), and BGE-reranker-large are the standard choices. Expect 15-40% precision improvement over hybrid retrieval alone.

Query Expansion with HyDE

Hypothetical Document Embeddings (HyDE) address a fundamental asymmetry: queries are short, documents are long, and their embedding spaces don't overlap perfectly. HyDE uses an LLM to generate a hypothetical document that would answer the query, then embeds that hypothetical document instead of the raw query. The hypothetical document occupies a denser region of the document embedding space. Empirically, HyDE improves recall by 8-18% on knowledge-heavy corpora. The cost is one extra LLM call per query — use a fast, cheap model (GPT-4o-mini, Claude Haiku) for the generation step.
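The HyDE flow can be sketched in a few lines. This is an illustration under stated assumptions: hyde_query_vector, the prompt wording, and the two stand-in functions are hypothetical, and the LLM and embedding calls are injected as plain callables so you can plug in whichever clients you already use.

```python
from typing import Callable, Sequence


def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed an LLM-written hypothetical answer instead of the raw query."""
    prompt = (
        "Write a short passage from an internal knowledge base that answers "
        f"the question below.\n\nQuestion: {query}\n\nPassage:"
    )
    hypothetical_doc = generate(prompt)
    # Fall back to the raw query if the model returns nothing usable
    text_to_embed = hypothetical_doc.strip() or query
    return embed(text_to_embed)


# Stand-ins so the sketch runs offline; replace with real LLM and embedding calls
PLACEHOLDER_PASSAGE = "Placeholder passage about retention obligations for EU data subjects."


def fake_generate(prompt: str) -> str:
    return PLACEHOLDER_PASSAGE


def fake_embed(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]


vector = hyde_query_vector("data retention policy for EMEA users", fake_generate, fake_embed)
```

The returned vector goes to the same vector search you would run with a raw query embedding. Only the query side changes; the corpus is not re-indexed.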

Chunking Strategy for Enterprise Documents

The quality of your retrieval pipeline is bounded by the quality of your chunks. Poor chunking — especially fixed 512-token splits that cut through logical units — is the most common source of retrieval failure in production RAG systems.

Fixed-Size Chunking

Split documents every N tokens with M-token overlap. Simple to implement, predictable index size, but semantically incoherent. A 512-token chunk that starts mid-sentence and ends mid-paragraph contains partial context that misleads both the embedding model and the downstream LLM. Use only for homogeneous, structured documents (data tables, transcripts) where semantic boundaries are less critical.
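A minimal sketch of the sliding window described above. The function name and defaults are illustrative; the "tokens" in the demo are placeholder strings, so in practice you would pass the output of the tokenizer that matches your embedding model.

```python
def fixed_size_chunks(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, which is the overlap at work. Note that nothing in this function looks at sentence or section boundaries, which is exactly the weakness described above.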

Semantic Chunking

Embed each sentence, compute cosine similarity between adjacent sentences, and split on similarity drops that exceed a threshold. This preserves topical coherence: a section on GDPR obligations stays together rather than being split across chunk boundaries. LangChain's SemanticChunker and LlamaIndex's SemanticSplitterNodeParser implement this. Expect 20-35% larger index compared to fixed chunking — plan storage accordingly.
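The core loop is short enough to sketch. This version is a simplification and not how either library implements it: it splits wherever adjacent-sentence similarity falls below a fixed floor, where production implementations typically derive the breakpoint from the distribution of similarities. The semantic_chunks name, the threshold value, and the toy 2-D vectors are all assumptions for illustration.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return chunks


# Toy 2-D "embeddings" so the sketch runs offline; use a real model in practice
toy_vectors = {
    "GDPR applies to EU data subjects.": [1.0, 0.1],
    "Retention periods must be documented.": [0.9, 0.2],
    "The deploy script restarts the API pods.": [0.1, 1.0],
}
chunks = semantic_chunks(list(toy_vectors), toy_vectors.__getitem__)
```

With these toy vectors the two compliance sentences stay together and the deployment sentence starts a new chunk. The threshold needs tuning per embedding model, since similarity ranges differ between models.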

Hierarchical / Parent-Child Chunking

Index small child chunks (256 tokens) for precise retrieval, but return the parent chunk (1024 tokens) as context to the LLM. This combines retrieval precision with context richness. The child chunk's embedding is tight around a specific claim; the parent chunk gives the LLM the surrounding reasoning. LlamaIndex's HierarchicalNodeParser (paired with AutoMergingRetriever) and LangChain's ParentDocumentRetriever implement variants of this pattern. This is the default pattern for enterprise knowledge bases with long-form documents like policies and technical specifications.
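The bookkeeping behind parent-child retrieval is a mapping from each child to its parent. The sketch below is library-independent and makes some simplifying assumptions: the names ChildChunk, build_parent_child, and expand_to_parents are hypothetical, sizes are counted in whitespace-separated words instead of model tokens, and there is no overlap between chunks.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    child_id: str
    parent_id: str
    text: str


def build_parent_child(
    doc_id: str, words: list[str], parent_size: int = 1024, child_size: int = 256
) -> tuple[dict[str, str], list[ChildChunk]]:
    """Split a document into parent chunks, then split each parent into children."""
    parents: dict[str, str] = {}
    children: list[ChildChunk] = []
    for p, p_start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = f"{doc_id}:p{p}"
        parents[parent_id] = " ".join(parent_words)
        for c, c_start in enumerate(range(0, len(parent_words), child_size)):
            children.append(ChildChunk(
                child_id=f"{parent_id}:c{c}",
                parent_id=parent_id,
                text=" ".join(parent_words[c_start:c_start + child_size]),
            ))
    return parents, children


def expand_to_parents(hits: list[ChildChunk], parents: dict[str, str]) -> list[str]:
    """Map retrieved children to parent texts, de-duplicated in hit order."""
    seen: dict[str, None] = {}
    for hit in hits:
        seen.setdefault(hit.parent_id, None)
    return [parents[pid] for pid in seen]


words = [f"w{i}" for i in range(10)]
parents, children = build_parent_child("policy", words, parent_size=4, child_size=2)
context = expand_to_parents([children[1], children[0], children[4]], parents)
```

Only the children are embedded and indexed. At query time the retrieved children are swapped for their parents, and two children from the same parent collapse into a single context block so the LLM does not see the same passage twice.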

Hybrid Retrieval Pipeline: BM25 + Qdrant + Cross-Encoder Re-Ranking

python

import os
from dataclasses import dataclass
from typing import Optional

from rank_bm25 import BM25Okapi
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from openai import OpenAI
from sentence_transformers import CrossEncoder
import numpy as np


@dataclass
class RetrievedChunk:
    doc_id: str
    content: str
    bm25_rank: Optional[int]
    vector_rank: Optional[int]
    rrf_score: float
    rerank_score: Optional[float] = None  # filled in after cross-encoder scoring


class HybridRetrievalPipeline:
    """BM25 + Qdrant vector search fused via RRF, re-ranked by cross-encoder."""

    RRF_K = 60
    COLLECTION_NAME = "enterprise_kb"
    EMBEDDING_MODEL = "text-embedding-3-large"
    EMBEDDING_DIMS = 3072

    def __init__(self, qdrant_url: str = "http://localhost:6333"):
        self.openai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        self.qdrant = QdrantClient(url=qdrant_url)
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        self._corpus: list[dict] = []  # [{doc_id, content, tokens}]
        self._bm25: Optional[BM25Okapi] = None

    def build_index(self, documents: list[dict]) -> None:
        """Index documents into both BM25 and Qdrant. documents: [{id, content}]"""
        self._corpus = [
            {"doc_id": d["id"], "content": d["content"], "tokens": d["content"].lower().split()}
            for d in documents
        ]
        self._bm25 = BM25Okapi([c["tokens"] for c in self._corpus])

        # Ensure Qdrant collection exists
        existing = [c.name for c in self.qdrant.get_collections().collections]
        if self.COLLECTION_NAME not in existing:
            self.qdrant.create_collection(
                self.COLLECTION_NAME,
                vectors_config=VectorParams(size=self.EMBEDDING_DIMS, distance=Distance.COSINE),
            )

        # Embed and upsert in batches; the embeddings endpoint caps inputs per request
        batch_size = 256
        for start in range(0, len(documents), batch_size):
            batch = documents[start:start + batch_size]
            response = self.openai.embeddings.create(
                model=self.EMBEDDING_MODEL, input=[d["content"] for d in batch]
            )
            points = [
                PointStruct(
                    id=start + i,
                    vector=r.embedding,
                    payload={"doc_id": batch[i]["id"], "content": batch[i]["content"]},
                )
                for i, r in enumerate(response.data)
            ]
            self.qdrant.upsert(collection_name=self.COLLECTION_NAME, points=points)
        print(f"Indexed {len(documents)} documents.")

    def _embed_query(self, query: str) -> list[float]:
        response = self.openai.embeddings.create(model=self.EMBEDDING_MODEL, input=[query])
        return response.data[0].embedding

    def _rrf_fuse(
        self,
        bm25_hits: list[dict],   # [{doc_id, content, rank}]
        vector_hits: list[dict], # [{doc_id, content, rank}]
    ) -> list[RetrievedChunk]:
        scores: dict[str, dict] = {}
        for hit in bm25_hits:
            doc_id = hit["doc_id"]
            scores.setdefault(doc_id, {"content": hit["content"], "bm25_rank": None, "vector_rank": None})
            scores[doc_id]["bm25_rank"] = hit["rank"]
        for hit in vector_hits:
            doc_id = hit["doc_id"]
            scores.setdefault(doc_id, {"content": hit["content"], "bm25_rank": None, "vector_rank": None})
            scores[doc_id]["vector_rank"] = hit["rank"]

        results = []
        for doc_id, meta in scores.items():
            rrf = 0.0
            if meta["bm25_rank"] is not None:
                rrf += 1.0 / (self.RRF_K + meta["bm25_rank"])
            if meta["vector_rank"] is not None:
                rrf += 1.0 / (self.RRF_K + meta["vector_rank"])
            results.append(RetrievedChunk(
                doc_id=doc_id,
                content=meta["content"],
                bm25_rank=meta["bm25_rank"],
                vector_rank=meta["vector_rank"],
                rrf_score=rrf,
            ))
        return sorted(results, key=lambda x: x.rrf_score, reverse=True)

    def retrieve(
        self,
        query: str,
        top_k_retrieval: int = 50,
        top_k_rerank: int = 10,
        use_reranker: bool = True,
    ) -> list[RetrievedChunk]:
        if self._bm25 is None:
            raise RuntimeError("Index not built. Call build_index() first.")

        # BM25 retrieval
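        bm25_scores = self._bm25.get_scores(query.lower().split())
        # Keep only documents that matched at least one query term; otherwise
        # zero-score documents would earn a BM25 rank and distort the fusion.
        top_bm25_indices = [
            int(i) for i in np.argsort(bm25_scores)[::-1][:top_k_retrieval]
            if bm25_scores[i] > 0
        ]
        bm25_hits = [
            {"doc_id": self._corpus[i]["doc_id"], "content": self._corpus[i]["content"], "rank": rank + 1}
            for rank, i in enumerate(top_bm25_indices)
        ]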

        # Vector retrieval
        query_vector = self._embed_query(query)
        # query_points is the current qdrant-client query API (search() is deprecated)
        qdrant_results = self.qdrant.query_points(
            collection_name=self.COLLECTION_NAME,
            query=query_vector,
            limit=top_k_retrieval,
        ).points
        vector_hits = [
            {"doc_id": r.payload["doc_id"], "content": r.payload["content"], "rank": rank + 1}
            for rank, r in enumerate(qdrant_results)
        ]

        # RRF fusion
        fused = self._rrf_fuse(bm25_hits, vector_hits)
        candidates = fused[:top_k_rerank * 2]  # Re-rank over 2x the desired output

        if not use_reranker:
            return candidates[:top_k_rerank]

        # Cross-encoder re-ranking
        pairs = [[query, c.content] for c in candidates]
        rerank_scores = self.reranker.predict(pairs)
        for chunk, score in zip(candidates, rerank_scores):
            chunk.rerank_score = float(score)

        return sorted(candidates, key=lambda x: x.rerank_score or 0.0, reverse=True)[:top_k_rerank]

Full hybrid pipeline: BM25 via rank-bm25, dense retrieval via Qdrant, fusion via RRF (k=60), precision re-ranking via a MiniLM cross-encoder. For a self-hosted alternative, replace the OpenAI embedding calls with BGE-M3 and set EMBEDDING_DIMS to 1024 to match.

Warning

Embedding model updates can break your index silently. If you change the embedding model without re-indexing your entire corpus, your query vectors and document vectors will be in different embedding spaces and cosine similarity scores will be meaningless. A switch from text-embedding-3-small to text-embedding-3-large at default dimensions will usually fail loudly, because the vector sizes differ (1536 vs 3072) and the vector store rejects the query; the silent case is a change between models or versions that share a dimension. Always re-embed and re-index the full corpus on any model change, and version your embedding model alongside your vector index. Pin model versions in your infrastructure config and never use 'latest' aliases for embedding endpoints.
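One way to turn this warning into a hard failure is to record the embedding model next to the index and compare it before every query. The sketch below assumes you persist that record somewhere yourself (a config file, a metadata table, or similar); EmbeddingSpec, EmbeddingMismatchError, and assert_same_embedding_space are hypothetical names, not part of any vector store's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    dimensions: int


class EmbeddingMismatchError(RuntimeError):
    """Raised when query and index were embedded with different models."""


def assert_same_embedding_space(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Refuse to query an index that was built with a different embedding model."""
    if index_spec != query_spec:
        raise EmbeddingMismatchError(
            f"index built with {index_spec}, query embedded with {query_spec}; "
            "re-index before querying"
        )


# index_spec would be loaded from wherever the index metadata is stored (assumption)
index_spec = EmbeddingSpec("text-embedding-3-small", 1536)
query_spec = EmbeddingSpec("text-embedding-3-large", 3072)

try:
    assert_same_embedding_space(index_spec, query_spec)
except EmbeddingMismatchError as exc:
    print(f"blocked: {exc}")
```

Comparing the model name as well as the dimension matters: two models with the same vector size pass a dimension check but still live in different embedding spaces.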

Production Checklist: Enterprise Semantic Search

Use hierarchical (parent-child) chunking for long-form enterprise documents — small chunks for retrieval, large chunks returned to the LLM as context.
Pin your embedding model version in infrastructure config. A model change without full re-indexing will silently degrade retrieval quality.
Start with RRF k=60 for fusion. Plain RRF treats both retrievers equally; to tune the BM25 vs vector balance for your query distribution, use a weighted variant that multiplies each list's 1/(k + rank) term by a per-retriever weight. Lexical-heavy corpora (technical docs, code) typically warrant a higher BM25 weight.
Apply cross-encoder re-ranking only to the top 20-50 fused candidates; running it over the full corpus is cost-prohibitive. The latency budget is typically 200-400ms for the re-ranking step alone.
Implement HyDE for knowledge-heavy corpora where query vocabulary diverges from document vocabulary. Measure recall@10 with and without HyDE on your golden dataset before committing.
Track retrieval metrics in production: mean reciprocal rank (MRR), recall@K, and latency per stage. Retrieval quality degrades as corpora grow and documents become stale.
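The two quality metrics in the last item are simple to compute once you have a golden dataset of queries with known relevant documents. A minimal sketch, with hypothetical function names and made-up document IDs:

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    """Average both metrics over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return {"mrr": 0.0, f"recall@{k}": 0.0}
    n = len(runs)
    return {
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
    }


# Toy golden set: (retrieved doc IDs in rank order, set of relevant doc IDs)
golden_runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d9", "d4"], {"d4", "d5"}),
]
metrics = evaluate(golden_runs, k=3)
print(metrics)
```

Running the same golden set against each pipeline variant (with and without HyDE, with and without the re-ranker) gives you a like-for-like comparison before you commit to a change.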

How Inductivee Deploys Semantic Search

Across the enterprise knowledge base deployments we have built, the hybrid retrieval pattern — BM25 + vector search fused via RRF, followed by cross-encoder re-ranking — is the production baseline we start with on every project. Full-text-only deployments consistently underperform by 20-35% on recall@5 benchmarks against realistic enterprise query distributions.

The engineering work is not primarily in the retrieval layer itself — it is in chunking strategy and embedding model selection for the specific corpus. A compliance document corpus with dense legal language behaves differently from a technical runbook corpus. We spend significant time building query-specific golden datasets and measuring recall before moving to production.

For enterprises with multilingual knowledge bases, BGE-M3 as the embedding backbone plus a multilingual re-ranker eliminates the need to maintain separate language-specific pipelines. The unified vector space handles cross-lingual retrieval — a query in French can retrieve a relevant document in German — which is a capability gap that full-text search cannot address at any engineering cost.

Frequently Asked Questions

What is the difference between semantic search and full-text search?

Full-text search (BM25) matches documents based on whether they contain the query's exact tokens, weighted by frequency and rarity. Semantic search uses dense vector embeddings to match documents by meaning — a query about "employee turnover" can retrieve documents about "staff attrition" even with no shared tokens. Enterprise deployments typically use both in a hybrid pipeline.

What is Reciprocal Rank Fusion (RRF) and why is it used in hybrid search?

RRF is a rank fusion algorithm that combines multiple ranked lists by computing 1/(k + rank) for each document across all lists and summing the scores. It is preferred over linear score interpolation because it does not require normalizing scores across different retrieval systems, which have incompatible score scales. The constant k=60 is empirically well-validated across retrieval benchmarks.

When should I use a cross-encoder re-ranker in a RAG pipeline?

Use a cross-encoder re-ranker when retrieval precision is critical and you have latency budget for an additional 200-400ms inference step. Cross-encoders attend to both the query and document jointly, producing more accurate relevance scores than bi-encoder similarity. Apply re-ranking only to the top-K candidates (20-50) from your initial retrieval stage — running it over the full corpus is computationally infeasible.

What is HyDE (Hypothetical Document Embeddings) and when does it help?

HyDE generates a hypothetical document that would answer the query using an LLM, then embeds that hypothetical document instead of the raw query. This works because the hypothetical document occupies a denser region of the document embedding space, improving recall by 8-18% on knowledge-heavy corpora. It adds one LLM call per query — use a fast, low-cost model like GPT-4o-mini for the generation step.

What is the best chunking strategy for enterprise knowledge base documents?

Hierarchical parent-child chunking is the production standard for enterprise knowledge bases with long-form documents. Small child chunks (256 tokens) are embedded and retrieved with high precision, while the parent chunk (1024 tokens) is returned to the LLM as context, providing surrounding reasoning. Semantic chunking (split on embedding similarity drops) is preferred over fixed-size chunking for preserving topical coherence within chunks.

Written By

Inductivee Team

Author

Agentic AI Engineering Team

The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.


Inductivee is a remote-first agentic AI engineering firm with 40+ production deployments across 25+ enterprises since 2012. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
