
Semantic search

How semantic understanding powers search at Alkymi

What is semantic search?

Semantic search is a search methodology where the semantic meaning of words is used to retrieve relevant content in document collections or data sets. This differs from keyword-based search, where documents are retrieved by matching keywords. Semantic search allows for effectively retrieving content that shares the same meaning as a user’s query, despite potentially using different words.

This white paper provides an overview of semantic search, beginning with a description of traditional keyword-based search. It then discusses word embeddings, what they are, how they’re learned, and how they can be used to build powerful search applications with the help of large language models (LLMs). Lastly, it describes how semantic search is used to power Alkymi’s Generative AI products.

Keyword search

Historically, search engines have worked by matching the keywords in a user’s query to occurrences of those words in large collections of documents. As an example, if a user were to search for “asset manager,” the search engine would identify documents in its collection that contained the terms “asset” and “manager” and would do so efficiently through the use of specialized data structures, such as an inverted index.

An inverted index is a data structure that maps words to the documents that contain them (Manning et al., 2008). For example, an inverted index would have one list of all the documents that contain the word “asset” and a separate list of all the documents that contain the word “manager.” At search time, instead of scanning every document in the collection for the words “asset” and “manager” (a process that could take a long time for large document collections), the search engine simply retrieves the list of documents containing “asset” and the list of documents containing “manager” and combines them into a single result list based on predefined logic. This process of creating a mapping from words to the documents that contain them is known as indexing.
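To make this concrete, the following is a minimal sketch of an inverted index in Python. The documents, the tokenization, and the AND-based combination logic are illustrative assumptions rather than a description of any particular search engine.

```python
from collections import defaultdict

# A toy document collection (contents invented for illustration).
documents = {
    1: "the asset manager reported strong returns",
    2: "the fund manager raised a new fund",
    3: "asset allocation drives portfolio risk",
}

# Indexing: map each word to the set of document IDs that contain it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

def search(query: str) -> set:
    """Retrieve documents that contain ALL of the query words (AND logic)."""
    postings = [inverted_index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(inverted_index["asset"])    # {1, 3}
print(inverted_index["manager"])  # {1, 2}
print(search("asset manager"))    # {1} -- the only document containing both words
```

At search time, only the two postings lists are touched, rather than every document in the collection, which is what makes this structure efficient at scale.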

Figure: An example of an inverted index.

This approach to search and search indexing is known as lexical search, as it relies on directly matching words in a user’s query to their occurrences in documents. Lexical search does allow for minor variations in the spelling or formation of words through the use of stemming and lemmatization, e.g., management/managed/manager could all be considered equivalent; however, these approaches are limited in their ability to identify the equivalence of different words. This is where semantic word representations come into play.
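As an illustration of how stemming collapses these related word forms, the short sketch below uses the Porter stemmer from the NLTK library; this is one possible implementation choice, and a lemmatizer would serve a similar purpose.

```python
from nltk.stem import PorterStemmer  # assumes the nltk package is installed

stemmer = PorterStemmer()

# Related word forms reduce to a shared stem, so a lexical search engine
# can treat them as equivalent at both index time and query time.
for word in ["management", "managed", "manager"]:
    print(word, "->", stemmer.stem(word))  # all three print the stem "manag"
```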

Semantic word representations: Embeddings

The inverted index approach is highly effective at retrieving exact or very similar variants of a given word. However, it is unable to capture semantically equivalent words. In this context, semantic equivalence refers to words or sentences that differ in form but have the same or a similar meaning. For instance, “fund manager” has a similar meaning to “asset manager,” but an inverted index data structure is unaware of this equivalence. When a user searches for “fund,” the search engine will only identify documents that contain the word “fund,” and when a user searches for “manager,” it will only identify documents that contain the word “manager.”

One potential solution to this problem is to create an index of synonyms that can be used to identify similar versions of words. A shortcoming of this approach, however, is that it requires an exhaustive list of synonyms, which are generally only well-defined for individual words or phrases and not semantically equivalent sentences. Furthermore, whether words are synonyms may be dependent on the context in which they occur.

Embeddings are a way of representing words and sentences in a document in a way that captures their semantic meaning. These representations take the form of a list of real-valued numbers known as a vector, where words or sentences with similar meanings occur close to each other in the vector space. These vectors can contain anywhere from a few dozen to thousands of real values.

Embeddings take the form of a list of numbers known as a vector, with potentially hundreds of values representing different dimensions. For example, a 5-dimensional embedding for the word “asset” might look like: [0.95, 4.01, 0.11, 0.03, -1.57]
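To make the notion of closeness in a vector space concrete, the sketch below compares a few hypothetical 5-dimensional embeddings using NumPy; the numeric values are invented for illustration and do not come from a trained model.

```python
import numpy as np

# Hypothetical 5-dimensional embeddings (values invented for illustration).
embeddings = {
    "asset": np.array([0.95, 4.01, 0.11, 0.03, -1.57]),
    "fund":  np.array([1.02, 3.88, 0.09, 0.10, -1.60]),
    "cat":   np.array([-2.40, 0.15, 3.70, 1.92, 0.44]),
}

def distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(a - b))

# Words with similar meanings sit close together in the vector space...
print(distance(embeddings["asset"], embeddings["fund"]))  # small distance
# ...while unrelated words sit far apart.
print(distance(embeddings["asset"], embeddings["cat"]))   # much larger distance
```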

Figure: An example visualization of embeddings represented in a 2D vector space.

Embedding representations are learned during the training of machine learning (ML) systems, which use them for many Natural Language Processing (NLP) tasks. For instance, if an NLP system were being trained to extract the phrases “asset manager” and “fund manager” from financial documents, the system would identify that the two phrases occur in very similar contexts and, given enough data, would recognize that they can often be used interchangeably. As a result, it would learn very similar embeddings (features) for the words “fund” and “asset.” The idea that similar words occur in similar contexts is fundamental to the way that embeddings work, as it allows for the learning of representations of words (the embeddings) that capture their semantic meaning independent of their surface forms (1). This idea that the meaning of a word is based on its context was succinctly captured by John Firth in a 1957 paper, where he wrote, “You shall know a word by the company it keeps” (Firth, 1957).

Embeddings can also be used to represent sequences of text, such as phrases or sentences (Reimers & Gurevych, 2019). The core idea remains the same: phrases or sentences with semantically equivalent meanings occur in similar contexts, and thus a machine learning system will learn to represent them with similar embeddings. For example, the sentences “Who is the asset manager?” and “What is the name of the fund manager?” are semantically equivalent, so a machine learning system will learn similar embeddings for them. Embeddings can also represent other types of media, such as images, and multimodal embeddings can capture the equivalence of these different media types. For example, multimodal embeddings might represent the word “cat” and an image of a cat with very similar embeddings. The use of multimodal embeddings allows for the development of applications such as visual question answering and video captioning (Gan et al., 2022).
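As a brief example of sentence embeddings in practice, the sketch below uses the sentence-transformers library introduced by Reimers & Gurevych (2019); the specific model name is an illustrative choice and not necessarily the model used in any particular product.

```python
from sentence_transformers import SentenceTransformer, util

# An off-the-shelf sentence-embedding model (illustrative choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Who is the asset manager?",
    "What is the name of the fund manager?",
    "The weather in Boston was sunny today.",
]
embeddings = model.encode(sentences)  # one embedding vector per sentence

# The two semantically equivalent questions receive similar embeddings...
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high similarity
# ...while the unrelated sentence scores much lower.
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low similarity
```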

It is important to note that learning these embeddings, also known as representation learning (Bengio et al., 2013), requires extremely large amounts of data, since the machine learning system must observe the many contexts in which words are used in order to capture their equivalence. Embeddings are learned from the data used to train them; therefore, it is common to see different embedding models learned for different domains, e.g., financial and medical, where these domain-specific models are specialized to capture the meaning of words in those domains.

(1) The surface form of a word is the way that it appears in a text.

Semantic search with embeddings

Word embeddings capture the semantic meaning of words. Instead of searching for occurrences of specific words in documents, semantic search systems search for words or phrases whose embeddings are similar to the embedding of a user’s query. Semantic search systems make use of vector databases, such as Pinecone or Chroma, which are optimized for the efficient storage and retrieval of embeddings. When documents are added to a semantic search system, embeddings are generated for them. Embeddings may be generated for words, sentences, paragraphs, tables, or even images, depending on how the vector database and semantic search engine are configured.
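As a minimal sketch of this indexing and retrieval flow, the example below uses the Chroma Python client with its default embedding function; the collection name and documents are invented for illustration, and Pinecone or other vector databases expose similar operations.

```python
import chromadb  # assumes the chromadb package is installed

# An in-memory vector database; Chroma's default embedding function
# converts each document into an embedding when it is added.
client = chromadb.Client()
collection = client.create_collection(name="financial_docs")

# Indexing: each document is embedded and stored alongside its ID.
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "The fund manager oversees the portfolio's asset allocation.",
        "Quarterly statements are mailed to investors in January.",
    ],
)

# Querying: the query text is embedded and scored against the stored embeddings.
results = collection.query(query_texts=["Who is the asset manager?"], n_results=1)
print(results["documents"])  # the fund-manager document is returned first
```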

When a user searches with a semantic search engine, their query is embedded (converted to an embedding) and then scored against the other embeddings in the vector database. Typically, this scoring is based on calculating the distance between the query embedding and the embeddings of the documents in the vector database. Because embeddings are learned so that content with similar meanings lies close together in the vector space, the distance between two embeddings is a good measure of their similarity: closer means more similar, and similar embeddings imply similar meaning. The most common measure used to compare embeddings is cosine similarity.
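The scoring step itself reduces to a few lines of linear algebra. The sketch below ranks a handful of hypothetical document embeddings against a query embedding by cosine similarity; in practice, the vector database performs this comparison (often approximately) over millions of stored vectors.

```python
import numpy as np

def cosine_similarities(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of `docs`."""
    query_norm = query / np.linalg.norm(query)
    docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs_norm @ query_norm

# Hypothetical embeddings (in practice these come from an embedding model).
doc_embeddings = np.array([
    [0.10, 0.95, 0.20],  # "the fund manager oversees the portfolio"
    [0.90, 0.05, 0.30],  # "quarterly statements are mailed in January"
    [0.12, 0.88, 0.25],  # "the asset manager reported strong returns"
])
query_embedding = np.array([0.11, 0.90, 0.22])  # "Who is the asset manager?"

scores = cosine_similarities(query_embedding, doc_embeddings)
ranking = np.argsort(-scores)  # indices of documents, best match first
print(scores)
print(ranking)  # the two manager-related documents rank ahead of the third
```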

Modern semantic search systems are built using embeddings generated by state-of-the-art large language models (LLMs), such as OpenAI’s GPT models. These models leverage the surrounding context of a word or sentence to produce context-aware embeddings, and some can also produce multimodal embeddings.
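As an illustration of how such an embedding might be obtained from a hosted model, the sketch below uses the OpenAI Python client with the text-embedding-3-small model; both the client usage and the model name are illustrative assumptions and require an API key configured in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.embeddings.create(
    model="text-embedding-3-small",  # illustrative choice of embedding model
    input=["Who is the asset manager?"],
)
query_embedding = response.data[0].embedding  # a list of floating-point values
print(len(query_embedding))  # the dimensionality of the embedding vector
```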

Semantic search at Alkymi

At Alkymi, we use semantic search to power many of our Generative AI products and features. Not only do we generate embeddings for the text in documents, we also generate embeddings for structured data, such as tables, thereby allowing them to be fully queried using semantic technology.

When using Alkymi Alpha, the search bar for your data sets is powered by LLM-generated embeddings that allow you to quickly and efficiently search for semantic content in your documents. Alkymi’s Answer Tool and Document Chat both leverage semantic search and Retrieval Augmented Generation to identify content in documents that helps the LLM better understand user queries and intent and lets it reason over the information in those documents.

Author

Kyle Williams

Kyle Williams is a Data Science Manager at Alkymi. He has over 10 years of experience using data science, machine learning, and NLP to build products used by millions of users in diverse domains, such as finance, healthcare, academia, and productivity. He received his Ph.D. from The Pennsylvania State University and has published over 50 peer-reviewed papers in NLP and information retrieval.

References

Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828.

Firth, J. (1957). A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis, 10-32.

Gan, Z., Li, L., Li, C., Wang, L., Liu, Z., & Gao, J. (2022). Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision, 14(3–4), 163-352.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Schedule a demo of Alkymi

Interested in learning how Alkymi can help you go from unstructured data to instantly actionable insights? Schedule a personalized product demo with our team today!