Semantic Search

While traditional search finds words, Semantic Search finds meaning. It moves beyond literal text matching to interpret your intent, connect business concepts to their technical counterparts, and uncover valuable relationships that keyword searches would miss.

This powerful capability allows you to ask questions in plain English, like “which datasets track user satisfaction?”, and get back relevant results like customer_feedback_scores or nps_ratings, even if your exact words don’t appear anywhere in the metadata.


The difference is fundamental. Keyword search is about matching strings, while semantic search is about matching concepts.

| Aspect | Keyword Search (FTS) | Semantic Search |
| --- | --- | --- |
| Matching Logic | Matches exact words, stems, or fuzzy variations. | Matches conceptual meaning and contextual similarity. |
| Query Style | Best for technical terms, IDs, or specific phrases. | Ideal for natural language and business questions. |
| Context Awareness | Limited. Doesn’t know “revenue” and “sales” are related. | High. Understands synonyms, context, and intent. |
| Discovery Power | Finds what you tell it to find. | Uncovers related assets you didn’t know to ask for. |

A Simple Example: A search for "chocolate milk" versus "milk chocolate":

  • Keyword search would likely return many of the same results for both, as they contain the same words.
  • Semantic search understands these are entirely different concepts and will return distinct, highly relevant results for each query.
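
The keyword side of this example can be illustrated with a minimal sketch. It assumes the simplest possible keyword matcher, one that treats a query as an unordered set of lowercase words; real full-text engines are more sophisticated, and `keyword_tokens` is a hypothetical helper, not part of Blindata.

```python
def keyword_tokens(query: str) -> frozenset[str]:
    """Lowercase the query and split it into an unordered set of words."""
    return frozenset(query.lower().split())


tokens_a = keyword_tokens("chocolate milk")
tokens_b = keyword_tokens("milk chocolate")

# Word order is lost, so both queries look identical to this matcher.
same_keywords = tokens_a == tokens_b
print(same_keywords)
```

Because both queries reduce to the same set of words, a matcher of this kind has no basis for telling them apart. An embedding model encodes the whole phrase, so the two queries can map to different vectors.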

How It Works: The Technology Behind the Magic

Semantic search is powered by two core technologies: Embeddings and Vector Search.

Step 1: Creating Embeddings (The Indexing Phase)

Before you can search, Blindata must learn the “meaning” of your data. It does this by converting all searchable text (asset names, descriptions, business glossaries) into embeddings using a sophisticated AI model.

An embedding is a numerical representation of text—a long list of numbers called a vector. Think of this vector as a set of coordinates that places the text on a vast “map of meaning.” Words and phrases with similar meanings are placed close together on this map, while unrelated concepts are far apart.

For example, the vectors for “revenue,” “income,” and “sales figures” would be located very close to each other in this high-dimensional space. This process is computationally intensive and is done ahead of time, whenever your data catalog is updated.
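
As a rough illustration of "close together on the map," the sketch below compares a few hand-written 3-dimensional vectors using cosine similarity. Both choices are assumptions made for readability: real embeddings have over a thousand dimensions and come from the model, and the source does not state which distance measure Blindata uses.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for this example; they are NOT real model output.
toy_vectors = {
    "revenue": [0.90, 0.10, 0.05],
    "income": [0.85, 0.15, 0.10],
    "sales figures": [0.80, 0.20, 0.05],
    "user session duration": [0.05, 0.10, 0.95],
}

sim_related = cosine_similarity(toy_vectors["revenue"], toy_vectors["income"])
sim_unrelated = cosine_similarity(
    toy_vectors["revenue"], toy_vectors["user session duration"]
)
print(f"revenue vs income: {sim_related:.3f}")
print(f"revenue vs user session duration: {sim_unrelated:.3f}")
```

A score near 1 means the two vectors point in almost the same direction, while a score near 0 means they are largely unrelated.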

Step 2: Querying and Searching (The Real-Time Phase)

When you type a query into the search bar, the following steps run in real time:

  1. Query Embedding: Your search query (e.g., "money earned per quarter") is sent to the same AI model and converted into a vector.
  2. Vector Search: The system then takes your query’s vector and searches the pre-indexed vector database for the vectors that are closest to it on the map of meaning. This is typically done using an algorithm like k-Nearest Neighbor (k-NN).
  3. Context-Aware Ranking: The assets corresponding to the closest vectors are returned to you, ranked by their conceptual relevance to your original query.

This is how a search for “money earned” can instantly find a technical asset named quarterly_revenue_fct, connecting business language to data reality.
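
The three steps can be sketched as a brute-force k-NN search over a tiny in-memory index. This is an illustration under stated assumptions, not Blindata's implementation: the vectors are invented by hand, `toy_query_vector` stands in for the model's embedding of "money earned per quarter", and cosine similarity is assumed as the closeness measure.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def k_nearest(query_vector, index, k):
    """Score every indexed asset against the query and return the top k names."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical pre-indexed asset vectors (hand-written, not model output).
toy_index = {
    "quarterly_revenue_fct": [0.90, 0.10, 0.05],
    "nps_ratings": [0.10, 0.90, 0.10],
    "usr_ses_dur_sec": [0.05, 0.10, 0.95],
}

# Step 1 stand-in: pretend this is the embedding of the user's query.
toy_query_vector = [0.85, 0.20, 0.10]

# Steps 2 and 3: find the closest vectors and return them ranked.
results = k_nearest(toy_query_vector, toy_index, k=2)
print(results)
```

Scanning every vector works for a handful of assets. Larger catalogs typically rely on a vector index that finds approximate nearest neighbors without comparing against every entry.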


Configuration and Supported Models

To power this process, Blindata leverages state-of-the-art embedding models from OpenAI. High-quality embeddings are critical for capturing nuanced meaning, and OpenAI’s models are industry leaders in this domain.

Note

To enable Semantic Search, you must configure an OpenAI API key in your Blindata instance.

Supported OpenAI Models

Blindata is optimized to work with OpenAI’s leading embedding models. While multiple models are available, we recommend the latest generations for the best performance and cost-effectiveness.

  • text-embedding-ada-002: A highly performant and widely used model that generates embeddings with 1536 dimensions. It provides an excellent balance of quality and speed.
  • text-embedding-3-small: A newer, highly efficient model that also produces 1536-dimension vectors but at a significantly lower cost than its predecessors. It is the recommended default for most use cases.
  • text-embedding-3-large: OpenAI’s most powerful embedding model, creating highly detailed vectors with 3072 dimensions. This can offer superior nuance for very complex or specialized domains, but comes with higher computational and cost overhead.

The “dimensions” of a vector are the number of values it contains, which bounds how much semantic detail it can encode. While more dimensions can capture more detail, a model like text-embedding-3-small provides more than enough quality for nearly all data cataloging scenarios.
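
Dimension count also drives storage, since every dimension is one stored number. The back-of-the-envelope sketch below assumes each value is kept as a 32-bit float and uses a hypothetical catalog of 100,000 indexed assets; neither figure comes from Blindata, and index overhead is ignored.

```python
BYTES_PER_FLOAT32 = 4  # assumption: one 32-bit float per dimension


def index_size_mib(num_assets: int, dimensions: int) -> float:
    """Estimate raw vector storage in MiB, ignoring any index overhead."""
    return num_assets * dimensions * BYTES_PER_FLOAT32 / (1024 * 1024)


# Hypothetical catalog size, chosen only for illustration.
size_small = index_size_mib(100_000, 1536)  # text-embedding-3-small
size_large = index_size_mib(100_000, 3072)  # text-embedding-3-large
print(f"1536 dimensions: {size_small:.1f} MiB")
print(f"3072 dimensions: {size_large:.1f} MiB")
```

Under these assumptions, doubling the dimensions doubles the raw storage, which is part of the overhead noted for text-embedding-3-large.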

The default model in Blindata is text-embedding-3-small, chosen for its balance of performance and cost. Additional models can be supported on request.
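
For readers curious what an embedding call looks like, here is a minimal sketch of a request to OpenAI's public embeddings endpoint using only the Python standard library. It is not how Blindata is configured (that happens in your Blindata instance); `build_embedding_request` is a hypothetical helper, and the key shown is a placeholder. The request is built but not sent, so the example runs without network access.

```python
import json
import os
import urllib.request

OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(text, model="text-embedding-3-small", api_key=None):
    """Build (but do not send) a POST request asking for the embedding of text."""
    key = api_key or os.environ.get("OPENAI_API_KEY", "")
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        OPENAI_EMBEDDINGS_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "sk-placeholder" is a dummy value, not a working key.
request = build_embedding_request("money earned per quarter", api_key="sk-placeholder")

# Sending the request needs a valid key and network access:
# with urllib.request.urlopen(request) as response:
#     vector = json.load(response)["data"][0]["embedding"]
```

Switching models is a matter of changing the `model` argument, for example to text-embedding-3-large.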


Strengths and Trade-offs

This search mode is a game-changer for bridging the business-technical divide.

  • Empowering Business Users: Analysts can find data using their own terminology without needing to know technical jargon or cryptic table names.
  • Solving Non-Intuitive Naming: A cryptically named column like usr_ses_dur_sec can be easily discovered by searching for user session duration.
  • Accelerating Exploratory Analysis: By revealing hidden relationships between concepts, semantic search helps you discover datasets you didn’t even know existed.

| Strengths | Trade-offs & Considerations |
| --- | --- |
| Understands Natural Language: Enables intuitive, conversational queries. | Higher Computational Cost: Embedding generation requires processing power. |
| Discovers Hidden Relationships: Uncovers conceptually related assets. | Dependency: Requires an active OpenAI API key and subscription. |
| Bridges Terminology Gaps: Unites business and technical vocabularies. | Embedding Maintenance: The index must be updated as your data catalog changes. |
| Highly Resilient: Does not depend on exact keywords, spelling, or phrasing. | |

By shifting the focus from how data is named to what data means, Semantic Search fundamentally transforms the data discovery experience. It makes your data catalog more accessible, intuitive, and powerful, ensuring that every user can find the insights they need, regardless of their technical expertise. In Blindata, this is the capability that truly unites your metadata with its meaning.