Hybrid Search

Why choose between keyword precision and conceptual meaning when you can have both? Hybrid Search is the most powerful mode in Blindata, intelligently combining the strengths of Full-Text Search (FTS) and Semantic Search into a single, unified query.

It is designed to deliver the most comprehensive and relevant results, ensuring you find exactly what you’re looking for, even when your query is complex or ambiguous. For most users, Hybrid Search is the recommended default for the best discovery experience.


The Limits of a Single Search Approach

Relying on only one search strategy means accepting compromises:

  • Full-Text Search (FTS) is precise but lacks context. A search for “customer lifetime value” will miss an asset named CLV_analysis_prod if the exact keywords aren’t in the description. It understands words, not intent.
  • Semantic Search understands context but can lack precision. A search for “customer lifetime value” might find conceptually similar assets about “customer retention” or “user engagement,” which could be too broad if you need the specific CLV metric.

Hybrid Search reduces this trade-off. It retrieves assets that match either the literal terms or the underlying meaning of your query, and it ranks highest the assets that match both.


How Does Hybrid Search Work? An Analogy

Imagine you’re in a massive library and need to find information on “customer churn analysis.” You have two experts at your disposal:

  1. The Meticulous Librarian (Full-Text Search): This expert is a master of keywords. You ask for “customer churn analysis,” and they will bring you every document with those exact words in the title or text. They are incredibly precise and literal. However, they might miss a brilliant report titled “Understanding Client Attrition” because it doesn’t use your exact phrasing.

  2. The Subject Matter Expert (Semantic Search): This expert understands the concepts and intent behind your query. They know that “customer churn” is related to “client attrition,” “user retention,” and “subscriber loyalty.” They will bring you a broader set of conceptually relevant documents, but some might be too general if you need a specific metric.

Relying on just one expert means you might miss something important. The librarian is too literal, and the expert might be too broad. Hybrid Search acts as a moderator that consults both experts. It takes the ranked list of results from the librarian and the separate ranked list from the subject matter expert, then combines them into a single, unified list that gives you the best of both worlds: the precision of keywords and the context of meaning.

The process involves two main steps: executing two searches in parallel and then fusing their results.

When you submit a query, the system sends it to two independent search systems simultaneously:

  • Lexical Search System (Full-Text Search): This system uses a traditional inverted index and a scoring algorithm called BM25 to find documents based on keyword overlap. The BM25 score reflects how well the document’s text matches the query terms, taking into account term frequency, how rare each term is across the index, and document length.

  • Vector Search System (Semantic Search): This system first converts your query into a numerical representation called a vector embedding. It then uses a vector index and a k-Nearest Neighbor (k-NN) algorithm to find the most similar document vectors based on cosine similarity or another distance metric.

Each system returns its own ranked list of results with its own scoring system.
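The parallel step can be sketched in a few lines of Python. This is a toy illustration, not Blindata’s implementation: the corpus, the three-dimensional embeddings, and the functions lexical_search, vector_search, and run_both are all hypothetical, and the lexical scorer simply counts shared terms rather than computing BM25.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy corpus: asset name -> (description, made-up 3-d embedding).
CORPUS = {
    "customer_churn_analysis_report": ("customer churn analysis report", [0.9, 0.1, 0.0]),
    "user_retention_metrics_dashboard": ("user retention metrics dashboard", [0.8, 0.3, 0.1]),
    "warehouse_inventory_levels": ("warehouse inventory levels", [0.0, 0.2, 0.9]),
}

def lexical_search(query_terms):
    """Stand-in for FTS: rank assets by number of shared terms (not real BM25)."""
    overlap = {
        name: len(set(query_terms) & set(text.split()))
        for name, (text, _) in CORPUS.items()
    }
    hits = [name for name, count in overlap.items() if count > 0]
    return sorted(hits, key=lambda name: -overlap[name])

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vector, top_k=2):
    """Stand-in for semantic search: brute-force nearest neighbours by cosine."""
    ranked = sorted(CORPUS, key=lambda name: -cosine(query_vector, CORPUS[name][1]))
    return ranked[:top_k]

def run_both(query_terms, query_vector):
    """Submit both searches at once and wait for the two ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical = pool.submit(lexical_search, query_terms)
        semantic = pool.submit(vector_search, query_vector)
        return lexical.result(), semantic.result()

fts_ranked, semantic_ranked = run_both(["customer", "churn"], [1.0, 0.2, 0.0])
print(fts_ranked)
print(semantic_ranked)
```

In this sketch the lexical list contains only the asset whose description shares terms with the query, while the vector list also surfaces the retention dashboard. Each list carries ranks only; the underlying scores stay inside each search function.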

Fusing Results with Reciprocal Rank Fusion (RRF)

A major challenge is that the scores from these two systems are not compatible. A BM25 score might be 35.4, while a cosine similarity score is 0.91. They live on different scales and can’t be meaningfully added or directly compared.

To solve this, we use an elegant and effective algorithm called Reciprocal Rank Fusion (RRF). Instead of trying to normalize the incompatible scores, RRF focuses on the rank of each result.

The core idea is simple: an asset is considered more relevant if both search engines rank it highly.

The formula to calculate the final RRF score for any given document d is:

RRF Score(d) = Σ (1 / (k + rank_i(d)))

Let’s break that down:

  • rank_i(d): This is the rank (e.g., 1, 2, 3…) of document d in the result list from search system i (either FTS or Semantic).
  • k: This is a constant used for smoothing (Blindata uses 60). It narrows the gap between adjacent ranks, so a top position in a single list cannot have an outsized impact on the fused score.
  • Σ (Sigma): This means we sum the scores for each search system. For a document d, we calculate its score from the FTS list and add it to its score from the Semantic list. If a document doesn’t appear in one of the lists, its rank is considered infinite, and its score contribution from that list is effectively zero.
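The formula translates almost directly into code. The following is a minimal sketch, assuming each ranked list is a Python list of asset identifiers ordered best first, with each identifier appearing at most once per list; the function name rrf_fuse and the sample identifiers are hypothetical.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of ids; return (id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1; a document missing from a list adds nothing for it.
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# "b" appears in both lists, so it collects two contributions.
fused = rrf_fuse([["a", "b", "c"], ["b", "d"]])
for doc, score in fused:
    print(f"{doc}: {score:.4f}")
```

Here “b” comes out first because it is the only identifier with a contribution from both lists, even though it is not ranked #1 in the first list.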

This “voting system” promotes results that both search engines agree are important, giving you a final list that is more reliable and relevant than what either search method could produce on its own. The table below highlights the main ideas that explain why this method works so well and how its scoring function captures the intuition behind ranking relevance.

Concept | Why It Matters
--- | ---
Reciprocal Ranking | Gives more weight to higher-ranked items. A #1 spot always contributes more than a #10 spot.
Diminishing Returns | The further down the list an item is, the less its rank matters, preventing irrelevant results from having much influence.
Evidence Combination | By adding scores, we combine evidence from both search types, creating a more reliable and balanced final rank.
The k Factor | Prevents any single top result from completely dominating the list, ensuring a more balanced outcome.
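A quick calculation shows the effect of the k factor. This sketch compares how much a single list contributes at rank 1 versus rank 10; the helper name contribution is hypothetical, and k = 0 is included only as a contrast, not as a setting Blindata uses.

```python
def contribution(rank, k):
    """Score that a single list adds for a document at the given rank."""
    return 1.0 / (k + rank)

# How much more a #1 spot is worth than a #10 spot, for two values of k.
ratios = {k: contribution(1, k) / contribution(10, k) for k in (0, 60)}
for k, ratio in ratios.items():
    print(f"k={k}: rank 1 contributes {ratio:.2f}x as much as rank 10")
```

With k = 0, a #1 spot is worth ten times a #10 spot, so one list’s top hit would dominate. With k = 60 the ratio drops to about 1.15, which means appearing in both lists matters more than holding the very top position in one.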

A Practical Example

Let’s say you search for “analysis of customer churn”.

Rank | Full-Text Search Results (Finds keywords) | Semantic Search Results (Finds concepts)
--- | --- | ---
1 | customer_churn_analysis_report | user_retention_metrics_dashboard
2 | analysis_of_new_customers | customer_churn_analysis_report
3 | churn_predictions_dataset | client_attrition_rate_kpi

Here’s how Hybrid Search with RRF would interpret this:

  • customer_churn_analysis_report is #1 on the FTS list and #2 on the Semantic list. Since it ranked highly on both, RRF promotes it to the top.
  • user_retention_metrics_dashboard is #1 on the Semantic list but doesn’t contain all the keywords, so it didn’t rank high on the FTS list. It’s still highly relevant and will appear near the top.
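  • analysis_of_new_customers ranked on the FTS list because it contains “analysis” and “customers,” but the semantic engine does not place it among its top results because it is not about churn. With little or no contribution from the Semantic list, it gets a lower final ranking.

The final Hybrid Search results would look something like this, giving you the best of both worlds (the exact order of assets that appear in only one top-three list depends on their ranks further down the other list):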

  1. customer_churn_analysis_report (High on both)
  2. user_retention_metrics_dashboard (High conceptual match)
  3. churn_predictions_dataset (Good keyword match)
  4. client_attrition_rate_kpi (Strong conceptual match)
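To make the arithmetic concrete, here is a small sketch that applies RRF with k = 60 to the two top-three lists from the example. It assumes nothing beyond those six entries, so any asset that appears in only one list gets a single contribution. In a real query the lists are longer, and ranks beyond the top three can change the order among those single-list assets, so the tail of this output is not expected to match the illustrative list exactly.

```python
# Top-three lists copied from the example; K follows the value stated for Blindata.
fts = [
    "customer_churn_analysis_report",
    "analysis_of_new_customers",
    "churn_predictions_dataset",
]
semantic = [
    "user_retention_metrics_dashboard",
    "customer_churn_analysis_report",
    "client_attrition_rate_kpi",
]
K = 60

scores = {}
for ranking in (fts, semantic):
    for rank, asset in enumerate(ranking, start=1):
        scores[asset] = scores.get(asset, 0.0) + 1.0 / (K + rank)

# Highest fused score first; equal scores keep their insertion order.
ordered = sorted(scores, key=lambda asset: -scores[asset])
for asset in ordered:
    print(f"{scores[asset]:.4f}  {asset}")
```

The report that appears in both lists scores 1/61 + 1/62, roughly double any single-list asset, which is why it leads the fused ranking. The two assets ranked #3 in their respective lists tie at 1/63 when no further ranks are known.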

Strengths and Trade-offs

Hybrid Search is the go-to strategy for almost any search, especially when you need comprehensive results.

  • For General Searches: It eliminates the need to guess which search mode is best.
  • For Complex Queries: It effectively balances keywords with intent (e.g., “Find me PII data related to German customers”).
  • For Exploratory Discovery: It provides the most interesting and relevant results by combining both expected and unexpected findings.

Strengths | Trade-offs & Considerations
--- | ---
Most Comprehensive Results: Reduces the chance of missing a relevant asset. | Higher Computational Overhead: Runs two search queries instead of one.
Intelligent Ranking: Promotes results that are both literally and conceptually relevant. | Dependency: Relies on both FTS and Semantic indexes being well-maintained.
Best User Experience: Adapts to any query style, from simple keywords to complex questions. |

Blindata’s Hybrid Search ensures that data discovery is never limited by how a query is phrased. By fusing text-based precision with meaning-based intelligence, it delivers a balanced and resilient approach that adapts to the complexity of real-world questions. It is the smartest way to ensure the results you get are the results that truly matter.