
Retrieval Augmented Generation

How LLMs can reason over your data

What is Retrieval Augmented Generation?

Retrieval Augmented Generation (RAG) is a method for supplementing large language models (LLMs) with relevant contextual information that they can use for reasoning. It allows the responses generated by an LLM to be tailored to a specific collection of data, without needing to modify the underlying model.

This white paper describes what RAG is and how it can be used to provide a personalized experience when using an LLM with your data.

What LLMs know

LLMs are trained on data that exists at a point in time from a specific collection of documents. For publicly available LLMs, such as OpenAI’s GPT-3.5, this collection typically contains large amounts of public data from the Web. However, an LLM is not aware of anything that has occurred outside of the data that it was trained with: an LLM trained in 2022 will not be aware of data that was created in 2023. Similarly, because most publicly accessible LLMs, like OpenAI’s GPT-3.5 (which ChatGPT uses) and Meta’s Llama 2, are trained on public data, they are not aware of the contents of private data collections and cannot provide answers or insight into information beyond what was present in their training data. For these LLMs to be used with data in private document collections, they need some way of being exposed to that data.

Making LLMs aware of your data: Fine-tuning

One way to make LLMs aware of your data is through fine-tuning. Fine-tuning is not specific to LLMs; it refers to the process of taking a machine-learning model that has been trained for one task and modifying it through additional training to perform well on a different but related task. For instance, a machine-learning model trained to identify cats and dogs in pictures could be fine-tuned to also identify rabbits. The key idea behind fine-tuning is that the original model has foundational knowledge, acquired by being trained on a very large amount of data, and that knowledge can be leveraged to perform specialized tasks that the original model was not trained for. By leveraging its existing knowledge, the fine-tuned model should perform better at some downstream tasks than a new machine-learning model trained from scratch.

Fine-tuning for LLMs refers to taking a general-purpose LLM, such as Llama 2, and specializing it through exposure to specific data, prompts, and instructions. 

While fine-tuning allows for the creation of an LLM that is aware of a specific collection of data, there are several challenges to fine-tuning LLMs:

  • Catastrophic forgetting can occur, whereby the LLM forgets some of its foundational knowledge and instead over-specializes (Ramasesh et al., 2021). When catastrophic forgetting happens, the LLM performs very well on the data that it was fine-tuned on but loses its ability to generalize outside of that data.
  • Many publicly available LLMs, such as those made available by OpenAI, are only accessible via an API and thus cannot be fine-tuned (1).
  • Fine-tuning is a highly technical process that leverages approaches such as Reinforcement Learning from Human Feedback (RLHF) (Bai et al., 2022). It can be tricky to do without access to significant resources.
  • For continuously changing data, fine-tuning will need to be repeated with each data refresh, which can be expensive.
  • Fine-tuning requires enough data to generalize, which may not always be available.

Retrieval Augmented Generation provides an alternative way to make LLMs aware of your data without the need to fine-tune.

(1) Azure OpenAI provides fine-tuning capabilities for a limited subset of OpenAI models under certain conditions.

Retrieval Augmented Generation: An alternative to fine-tuning

LLMs are very powerful when it comes to reasoning over information that they are provided with. They can leverage their internal knowledge and combine it with user-provided context to generate responses that are relevant given the context. Retrieval Augmented Generation (RAG) is a method for providing LLMs with relevant context to assist them in generating a response. It is an alternative to fine-tuning where, instead of fine-tuning an LLM ahead of time, you provide the LLM with relevant information when it is needed. The LLM then leverages this information to provide domain-specific responses. This has several benefits over traditional fine-tuning:

  • Fewer requirements in terms of data, people, and hardware, which translates to lower costs.
  • Can be used with state-of-the-art (SOTA) LLMs that might only be accessible via an API.
  • No need to retrain the LLM with every data refresh.

The figure below shows an overview of the steps involved in Retrieval Augmented Generation and how it can be used to work with your data. RAG makes extensive use of semantic search and vector databases. Each step in the process is described in more detail below.

Retrieval Augmented Generation data flow
Figure: A diagram of the flow of data in Retrieval Augmented Generation.
  1. Generate document embeddings and store in vector database
    The first step in RAG is to process the documents that will be made available to the LLM for reasoning. The goal is to be able to quickly identify relevant information to be made available to the LLM, and this is done by leveraging embeddings and semantic search. Embeddings are generated for all documents that could provide context to the LLM. Embeddings may be generated for the full document, individual pages, paragraphs, tables, or even images. These embeddings are stored in a vector database, a database that is specially built for storing and retrieving embeddings. This database is not static; new documents can be added at any time and immediately become available to the LLM for reasoning.

  2. Embed user prompt
    Semantic search works by using embeddings of a user query or prompt to search for relevant documents that contain information that could be useful in answering that query or prompt. Therefore, to search for relevant documents in the vector database, we also need to embed the user prompt. Examples of this prompt might be, “What is the Net Asset Value for this account?” or “What is this document about?”

  3. Identify relevant documents in vector database
    Using the embedded user prompt, the next step involves identifying and retrieving relevant documents that contain information that may be useful in responding to that prompt. For instance, for the user prompt, “What is the Net Asset Value?” we might identify tables and text paragraphs in a document or collection of documents that talk about Net Asset Value and related concepts. In retrieving these tables and text paragraphs, we leverage the full capabilities of semantic search and vector databases, which allows us to efficiently and effectively identify relevant content.

  4. Generate new prompt that includes relevant context
    After the relevant context has been identified in the vector database, an internal system prompt is constructed. For instance, the prompt might change as follows:

    User prompt

    What is the Net Asset Value of ABC?

    [Semantic Search is used to retrieve additional context and generate a new internal system prompt with the added context]

    Internal system prompt

    Use the context provided below to answer the question. If you don’t know the answer, say “I don’t know.”

    Context: ABC is an asset management firm that manages…The Net Asset Value of all assets that they manage is XYZ.

    Question: What is the Net Asset Value of ABC?

    As the example above shows, the LLM is provided with context that it can use in its reasoning. Internally, the LLM may not know anything about ABC; however, by retrieving relevant context from the vector store and including it in the prompt, the LLM is given the context it needs to answer the question.

  5. LLM reasoning and response
    The final step in the RAG paradigm is to allow the LLM to reason over the provided information and generate a response.
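The five steps above can be sketched end-to-end in a short Python example. This is a minimal illustration, not a production implementation: the bag-of-words vectors and in-memory list below stand in for a learned embedding model and a real vector database, and the toy corpus and the `embed` and `rag_prompt` names are invented for the example.

```python
from collections import Counter
import math

# Toy corpus standing in for a private document collection.
DOCUMENTS = [
    "ABC is an asset management firm. The Net Asset Value of all assets they manage is XYZ.",
    "DEF Corp reported quarterly earnings growth of 12 percent.",
    "A rabbit is a small mammal often kept as a pet.",
]

def embed(text: str) -> Counter:
    """Stand-in embedding: a sparse bag-of-words vector.
    A real RAG system would use a learned embedding model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: embed the documents and store them (a list plays the role of the vector database).
vector_db = [(doc, embed(doc)) for doc in DOCUMENTS]

def rag_prompt(user_prompt: str, top_k: int = 1) -> str:
    # Step 2: embed the user prompt.
    query_vec = embed(user_prompt)
    # Step 3: identify the most relevant documents via semantic search.
    ranked = sorted(vector_db, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    context = " ".join(doc for doc, _ in ranked[:top_k])
    # Step 4: construct the internal system prompt with the added context.
    return (
        "Use the context provided below to answer the question. "
        'If you don\'t know the answer, say "I don\'t know."\n\n'
        f"Context: {context}\n\n"
        f"Question: {user_prompt}"
    )

# Step 5 would send this prompt to the LLM for reasoning and response generation.
print(rag_prompt("What is the Net Asset Value of ABC?"))
```

Note that only the retrieved context, not the whole corpus, ends up in the prompt; swapping in a real embedding model and vector database changes the components but not this flow.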

As suggested in the description above, the quality of RAG is strongly influenced by the power of the semantic search system and the context that it can provide to the LLM. It is important to carefully consider the way that data is indexed in the vector store, e.g., at the document-, page-, paragraph-, or table-level. Additionally, how the semantic search system ranks content is important as you want the most relevant content to be provided to the LLM as context.
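As one illustration of indexing granularity, paragraph-level chunking might look like the naive sketch below. The `chunk_paragraphs` helper is hypothetical; real pipelines typically split on sentence boundaries and add overlap between chunks rather than cutting at a fixed character count.

```python
def chunk_paragraphs(document: str, max_chars: int = 500) -> list[str]:
    """Split a document into paragraph-level chunks for embedding.
    Paragraphs longer than max_chars are split further so that each
    chunk fits comfortably within the embedding model's input limit."""
    chunks = []
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Naive fixed-width split for oversized paragraphs.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        chunks.append(paragraph)
    return chunks
```

Each returned chunk would then be embedded and stored in the vector database as its own entry, so that retrieval can surface a single relevant paragraph rather than a whole document.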

LLMs and large contexts

One might wonder if it makes sense to provide an LLM with as much context as possible to allow it to have as much information as possible to use for reasoning. While intuitively this may make sense, there are several reasons why it may be preferable to limit the context provided to an LLM.

  1. Limited context window size
    All LLMs are limited in the amount of context that they can consider. For instance, GPT-type models might allow for context sizes of 4,000-16,000 tokens (2). There is therefore a limit to how much context they can be provided with: if one tried to provide all the documents in a database and that limit was reached, only the documents that fit within the limit would be included in the context sent to the model.

    How RAG helps: Given an initial user prompt, RAG allows for only the most relevant information to be identified and sent to the LLM.

  2. LLM reasoning capability with long contexts
    Despite the capability of LLMs to support long contexts, a recent paper found that LLMs perform best when relevant information appears at either the start or the end of the provided context, and that performance significantly degrades when they need to access information in the middle of it (Liu et al., 2023). The authors also found that performance degrades as LLMs are provided with longer contexts.

    How RAG helps: RAG helps address these issues with long contexts by a) limiting the amount of context provided to the LLM, b) ensuring that the context provided to the LLM is relevant, and c) ensuring that the most relevant context is provided first, due to the relevance ranking provided by semantic search.

  3. Cost
    The cost of using an LLM service is usually based on the number of tokens in the prompt and the response. For instance, in August 2023, OpenAI’s GPT-3.5 Turbo model cost $0.0015 per 1,000 tokens in the prompt and $0.002 per 1,000 tokens in the response. Sending large amounts of contextual information to an LLM therefore incurs higher costs.

    How RAG helps: RAG helps limit this cost burden by ensuring that only relevant information is sent to the LLM, thus reducing the cost of unnecessary and irrelevant context.
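As a concrete illustration of the cost point, the arithmetic below uses the August 2023 GPT-3.5 Turbo prices quoted above; the token counts themselves are hypothetical figures chosen for the example.

```python
# August 2023 prices for GPT-3.5 Turbo, in USD per 1,000 tokens.
PROMPT_PRICE_PER_1K = 0.0015
RESPONSE_PRICE_PER_1K = 0.002

def request_cost(prompt_tokens: int, response_tokens: int) -> float:
    """Estimated cost in USD of a single LLM call."""
    return (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K + \
           (response_tokens / 1000) * RESPONSE_PRICE_PER_1K

# Filling a 16,000-token context window versus sending a few relevant chunks:
full_context = request_cost(prompt_tokens=16_000, response_tokens=500)  # 0.025 USD
rag_context = request_cost(prompt_tokens=1_500, response_tokens=500)    # 0.00325 USD
print(f"full: ${full_context:.5f}, RAG: ${rag_context:.5f}")
```

At these illustrative counts, the RAG-style request costs roughly an eighth of the full-context request; over many requests per day, that difference compounds quickly.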

(2) Tokens are the units that LLMs process. They may be words or subwords and are created using a tokenizer.

Retrieval Augmented Generation at Alkymi

Retrieval Augmented Generation provides a means for personalizing the performance of LLMs based on custom data that they were not trained with. It can be used to help LLMs generate better responses at a lower cost. At Alkymi, we use Retrieval Augmented Generation to power our generative AI products, such as the Answer Tool and Document Chat, available through Alkymi Alpha. RAG allows our customers to enjoy a personalized LLM experience based on their relevant data without the overhead and expense of traditional fine-tuning.


Kyle Williams

Kyle Williams is a Data Scientist Manager at Alkymi. He has over 10 years of experience using data science, machine learning, and NLP to build products used by millions of users in diverse domains, such as finance, healthcare, academia, and productivity. He received his Ph.D. from The Pennsylvania State University and has published over 50 peer-reviewed papers in NLP and information retrieval.


Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., ... & Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.

Ramasesh, V. V., Lewkowycz, A., & Dyer, E. (2021, October). Effect of scale on catastrophic forgetting in neural networks. In International Conference on Learning Representations.

Schedule a demo of Alkymi

Interested in learning how Alkymi can help you go from unstructured data to instantly actionable insights? Schedule a personalized product demo with our team today!