Hybrid Search

Why choose between keyword precision and conceptual meaning when you can have both? Hybrid Search is the most powerful mode in Blindata, intelligently combining the strengths of Full-Text Search (FTS) and Semantic Search into a single, unified query.

It is designed to deliver the most comprehensive and relevant results, ensuring you find exactly what you’re looking for, even when your query is complex or ambiguous. For most users, Hybrid Search is the recommended default for the best discovery experience.


Why Not Just One? The Limits of a Single Approach

Relying on only one search strategy means accepting compromises:

  • Full-Text Search (FTS) is precise but lacks context. A search for "customer lifetime value" will miss an asset named CLV_analysis_prod if the exact keywords aren’t in the description. It understands words, not intent.
  • Semantic Search understands context but can lack precision. A search for "customer lifetime value" might find conceptually similar assets about “customer retention” or “user engagement,” which could be too broad if you need the specific CLV metric.

Hybrid Search largely removes this trade-off. It surfaces assets that match the literal terms, the underlying meaning, or both, and it ranks highest those that match on both counts.


How It Works: A Two-Pronged Attack

When you run a Hybrid Search, Blindata executes two searches in parallel:

  1. The FTS Query: It performs a keyword-based search, looking for your exact terms, handling typos, and finding partial matches. This is the “precision” engine.
  2. The Semantic Query: Simultaneously, it converts your query into a vector and searches for conceptually similar assets. This is the “meaning” engine.

The real magic happens in the final step: merging the two result lists into one.

The Dual-Query Architecture

Technically, when a hybrid search query is received, it’s sent to two independent retrieval systems simultaneously:

  1. Lexical Search System: This is our Full-Text Search (FTS). It uses an inverted index and a scoring algorithm like BM25 to find documents based on keyword overlap. The score reflects how well the document’s text matches the query terms.
  2. Vector Search System: This is our Semantic Search. It first converts the query into a vector embedding. Then, it uses a vector index and an algorithm like k-Nearest Neighbor (k-NN) to find the most similar document vectors based on cosine similarity or another distance metric.

Each system returns a ranked list of document IDs and a relevance score.
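The dual-query flow can be sketched in a few lines of Python. This is a toy illustration, not Blindata's implementation: the corpus, the three-dimensional vectors, and the function names (`lexical_search`, `vector_search`, `dual_query`) are all made up, term overlap stands in for BM25, and the query embedding is passed in directly because the embedding model is out of scope here.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical corpus: asset name -> (description text, made-up embedding).
ASSETS = {
    "customer_churn_analysis_report": ("customer churn analysis report", [0.9, 0.1, 0.0]),
    "user_retention_metrics_dashboard": ("user retention metrics dashboard", [0.8, 0.3, 0.1]),
    "sales_by_region": ("sales by region", [0.0, 0.2, 0.9]),
}

def lexical_search(query):
    """Rank assets by the number of shared terms (a crude stand-in for BM25)."""
    terms = set(query.lower().split())
    hits = []
    for name, (text, _) in ASSETS.items():
        overlap = len(terms & set(text.split()))
        if overlap > 0:
            hits.append((name, float(overlap)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vec, top_k=2):
    """Brute-force k-NN: score every asset and keep the top_k most similar."""
    scored = [(name, cosine(query_vec, vec)) for name, (_, vec) in ASSETS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

def dual_query(query, query_vec):
    """Run both retrievers concurrently and return their ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fts_future = pool.submit(lexical_search, query)
        sem_future = pool.submit(vector_search, query_vec)
        return fts_future.result(), sem_future.result()

fts_hits, semantic_hits = dual_query("customer churn", [1.0, 0.2, 0.0])
print(fts_hits)
print(semantic_hits)
```

Each list pairs an asset ID with a score on its own scale: term counts on the lexical side, cosine similarities on the vector side.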

The Merging Magic: Reciprocal Rank Fusion (RRF)

A major challenge is that the scores from these two systems are not compatible. A BM25 score might be 35.4, while a cosine similarity score is 0.91. These numbers live on completely different scales and can’t be meaningfully added or directly compared. Normalizing them onto a common range can be unreliable too, because the normalized value of a result depends on the score distribution of the particular result set it appears in.
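A small sketch of why normalization alone is fragile. It uses min-max scaling as one common choice; the text does not say which normalization it has in mind, so treat that as an assumption. The score lists are invented for illustration (only 35.4 comes from the text).

```python
def min_max(scores):
    """Rescale scores linearly so the lowest becomes 0.0 and the highest 1.0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# The same asset has BM25 score 30.0 and rank 2 in both result sets;
# only the score of the last-ranked result differs.
tight_results = [35.4, 30.0, 28.0]
spread_results = [35.4, 30.0, 5.0]

norm_in_tight = min_max(tight_results)[1]
norm_in_spread = min_max(spread_results)[1]

print(f"normalized score in tight set:  {norm_in_tight:.2f}")
print(f"normalized score in spread set: {norm_in_spread:.2f}")
```

The asset holds rank 2 in both cases, yet its normalized score is roughly 0.27 in one result set and 0.82 in the other. A rank-based method sees the same input both times.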

To create a single, unified ranking, Hybrid Search uses an algorithm called Reciprocal Rank Fusion (RRF). Instead of adding scores together or favoring one search method over another, RRF ignores the raw scores and looks only at each asset’s position in each result list, acting like a fair voting system.

The core idea is simple: an asset is considered more relevant if both search engines rank it highly.

An asset that appears at the top of both the FTS results and the Semantic results will be promoted to the very top of the final Hybrid list. This method is highly effective at surfacing the most relevant results while pushing down items that are only a weak match in one of the search modes.

The formula to calculate the final RRF score for any given document d is:

RRF Score(d) = Σ (1 / (k + rank_i(d)))

Let’s break that down:

  • rank_i(d): This is the rank (e.g., 1, 2, 3…) of document d in the result list from search system i (either FTS or Semantic).
  • k: This is a constant used for smoothing (it is unrelated to the k in k-NN). It flattens the differences between adjacent ranks, so that a single first-place result in one list cannot dominate the combined score, and it stabilizes the formula. A common value, and the one used by Blindata, is 60.
  • Σ (Sigma): This means we sum the scores for each search system. For a document d, we calculate its score from the FTS list and add it to its score from the Semantic list. If a document doesn’t appear in one of the lists, its rank is considered infinite, and its score contribution from that list is effectively zero.
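The formula translates into a short function. This is a minimal sketch, assuming each input is a list of document IDs ordered best first with no duplicates inside a list; the name `rrf_fuse` and the sample IDs are illustrative, and tie handling (first-seen order) is a choice made here, not something the formula specifies.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in.
    A document missing from a list simply gets no contribution from it.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # sorted() is stable, so documents with equal scores keep first-seen order.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

fts_ranking = ["a", "b", "c"]
semantic_ranking = ["b", "d"]

fused = rrf_fuse([fts_ranking, semantic_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Document `b` appears in both lists (ranks 2 and 1), so it collects 1/62 + 1/61 and moves ahead of `a`, which holds first place in only one list.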

Reciprocal Rank Fusion (RRF) is widely used because it combines rankings in a way that balances simplicity with effectiveness.
The table below highlights the main ideas that explain why this method works so well and how its scoring function captures the intuition behind ranking relevance.

| Concept | Explanation |
| --- | --- |
| Reciprocal Ranking | The function 1 / (k + rank) gives higher weight to items near the top of a ranking list, so documents ranked highly by any retriever strongly influence the combined result. |
| Diminishing Returns | As the rank increases, the contribution to the score decreases, and each further step down the list changes the score less than the step before it. Top positions therefore carry more weight than lower ones. |
| Evidence Combination | Summing reciprocal scores across retrievers merges multiple signals, making the final ranking less sensitive to the quirks of any single method. |
| Role of k | The parameter k smooths the scoring so that no single top result dominates, creating a more balanced distribution across ranks. |
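To put numbers on the role of k, the sketch below compares how much more a first-place rank contributes than a tenth-place rank for a few values of k. The helper names and the choice of rank 10 as the comparison point are illustrative assumptions; only k = 60 comes from the text.

```python
def contribution(rank, k):
    """Score contribution of a single list position under RRF."""
    return 1.0 / (k + rank)

def top_heaviness(k, low_rank=10):
    """How many times more rank 1 contributes than low_rank for a given k."""
    return contribution(1, k) / contribution(low_rank, k)

for k in (0, 10, 60):
    print(f"k={k:>2}: rank 1 counts {top_heaviness(k):.2f}x as much as rank 10")
```

With k = 0 the top result counts ten times as much as the tenth. With k = 60 the ratio falls to about 1.15, so agreement between the two lists matters more than a single first place.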

A Practical Example

Let’s say you search for “analysis of customer churn”.

| Rank | Full-Text Search Results (finds keywords) | Semantic Search Results (finds concepts) |
| --- | --- | --- |
| 1 | customer_churn_analysis_report | user_retention_metrics_dashboard |
| 2 | analysis_of_new_customers | customer_churn_analysis_report |
| 3 | churn_predictions_dataset | client_attrition_rate_kpi |

Here’s how Hybrid Search with RRF would interpret this:

  • customer_churn_analysis_report is #1 on the FTS list and #2 on the Semantic list. Since it ranked highly on both, RRF promotes it to the top.
  • user_retention_metrics_dashboard is #1 on the Semantic list but doesn’t contain all the keywords, so it didn’t rank high on the FTS list. It’s still highly relevant and will appear near the top.
  • analysis_of_new_customers ranked on the FTS list because it contains “analysis” and “customers,” but the semantic engine recognizes that it is not about churn and does not rank it highly. With support from only one list, it gets a lower final ranking.

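The final Hybrid Search results would look something like this, giving you the best of both worlds. The exact order of the lower entries depends on the full result lists, of which only the top three are shown here: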

  1. customer_churn_analysis_report (High on both)
  2. user_retention_metrics_dashboard (High conceptual match)
  3. churn_predictions_dataset (Good keyword match)
  4. client_attrition_rate_kpi (Strong conceptual match)
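To make the arithmetic concrete, the sketch below applies the RRF formula with k = 60 to the three ranks shown in the table. It assumes, as a simplification, that the two lists contain nothing beyond those top-three entries; the variable names are illustrative.

```python
K = 60  # smoothing constant stated in the text

fts = [
    "customer_churn_analysis_report",
    "analysis_of_new_customers",
    "churn_predictions_dataset",
]
semantic = [
    "user_retention_metrics_dashboard",
    "customer_churn_analysis_report",
    "client_attrition_rate_kpi",
]

# Sum 1 / (K + rank) over every list in which an asset appears.
scores = {}
for ranking in (fts, semantic):
    for rank, asset in enumerate(ranking, start=1):
        scores[asset] = scores.get(asset, 0.0) + 1.0 / (K + rank)

# Stable sort: assets with equal scores keep first-seen order.
fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
for position, (asset, score) in enumerate(fused, start=1):
    print(f"{position}. {asset}  {score:.4f}")
```

Under that assumption customer_churn_analysis_report scores 1/61 + 1/62 ≈ 0.0325 and leads clearly, followed by user_retention_metrics_dashboard at 1/61 ≈ 0.0164. analysis_of_new_customers comes next with 1/62, because a single second-place rank is all it has. In a real search the lists are longer, and an asset that also appears further down the other list picks up a second contribution and can overtake it, which is one way to arrive at an ordering like the illustrative one above.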

Strengths and Trade-offs

Hybrid Search is the go-to strategy for almost any search, especially when you need comprehensive results.

  • For General Searches: It eliminates the need to guess which search mode is best.
  • For Complex Queries: It effectively balances keywords with intent (e.g., “Find me PII data related to German customers”).
  • For Exploratory Discovery: It provides the most interesting and relevant results by combining both expected and unexpected findings.

| Strengths | Trade-offs & Considerations |
| --- | --- |
| Most Comprehensive Results: Reduces the chance of missing a relevant asset. | Higher Computational Overhead: Runs two search queries instead of one. |
| Intelligent Ranking: Promotes results that are both literally and conceptually relevant. | Dependency: Relies on both FTS and Semantic indexes being well-maintained. |
| Best User Experience: Adapts to any query style, from simple keywords to complex questions. | |

Blindata’s Hybrid Search ensures that data discovery is never limited by how a query is phrased. By fusing text-based precision with meaning-based intelligence, it delivers a balanced and resilient approach that adapts to the complexity of real-world questions. It is the smartest way to ensure the results you get are the results that truly matter.