Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a technique used in question-answering systems to augment a Large Language Model (LLM) with a domain-specific knowledge base. It has two pre-processing steps: (1) the knowledge base is segmented into chunks of suitable size; (2) an embedding is computed for each chunk and stored, along with the chunk's text, in a vector database, which is indexed for faster retrieval. During deployment, there are three steps: (1) an embedding is computed for the query; (2) the chunks whose embeddings are most similar to the query's embedding are retrieved from the vector database; (3) the retrieved text is appended to the original query as "context", and this "augmented query" is fed into the LLM. This is roughly the algorithm behind many "talk to your document(s)" applications that the availability of LLMs (or Foundation Models for text) has made possible.
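The five steps can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a reference implementation: the "embedding" is a toy bag-of-words count vector standing in for a learned embedding model, the "vector database" is a plain in-memory list scanned linearly (no index), and the final LLM call is left out, so the sketch stops at building the augmented query. All function names (`chunk_text`, `embed`, `build_index`, `retrieve`, `augment`) and the sample knowledge base are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Pre-processing step 1: split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): sparse word-count vector, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(knowledge_base, max_words=40):
    """Pre-processing step 2: store (embedding, chunk text) pairs.

    A real vector database would also build an index for fast search;
    this list is scanned in full on every query.
    """
    return [(embed(chunk), chunk) for chunk in chunk_text(knowledge_base, max_words)]


def retrieve(index, query, k=2):
    """Deployment steps 1 and 2: embed the query, return the k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def augment(query, context_chunks):
    """Deployment step 3: append the retrieved text to the query as context."""
    context = "\n".join(context_chunks)
    return f"{query}\n\nContext:\n{context}"


# Hypothetical three-sentence knowledge base; max_words=8 yields one chunk per sentence.
kb = (
    "The warranty covers battery defects for two years. "
    "Returns are accepted within thirty days of purchase. "
    "Shipping to Canada takes five business days."
)
index = build_index(kb, max_words=8)
question = "How long is the warranty?"
augmented_query = augment(question, retrieve(index, question, k=1))
print(augmented_query)  # this string is what would be sent to the LLM
```

With this sample data, the only chunk sharing any words with the question is the warranty sentence, so it is the one retrieved and appended. In a realistic system, each stand-in would be swapped for the real component: a learned embedding model in place of `embed`, and an indexed vector store in place of the list scan, so that retrieval matches on meaning rather than on shared words.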