Build an Advanced Reranking-RAG System Using Llama-Index, Llama 3 and Qdrant

Original article: Plaban Nayak, Build an Advanced Reranking-RAG System Using Llama-Index, Llama 3 and Qdrant

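Introduction

Although LLMs (large language models) can generate meaningful, grammatically correct text, they suffer from a problem known as hallucination. Hallucination in LLMs is the tendency to confidently produce wrong answers, yielding false information that can nonetheless appear convincing. The problem has existed since LLMs first appeared, and it frequently leads to inaccurate, factually incorrect output.

Fact-checking is essential to addressing hallucination. When prototyping an LLM application that must be factually grounded, there are three approaches:

Prompt engineering

Retrieval-augmented generation (RAG)

Fine-tuning

Here, we use RAG to mitigate the hallucination problem.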

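What is RAG?

RAG = Dense Vector Retrieval (R) + In-Context Learning (AG)

Retrieval: find references for the question asked.

Augmentation: add the references to the prompt.

Generation: improve the answer to the question asked.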
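
The augmentation step can be illustrated with a minimal sketch. The function name `build_augmented_prompt`, the prompt wording, and the sample passages are all assumptions made for illustration; they are not taken from the article or from Llama-Index.

```python
def build_augmented_prompt(question: str, references: list[str]) -> str:
    """Prepend numbered reference passages to the user's question."""
    context = "\n".join(f"[{i}] {ref}" for i, ref in enumerate(references, start=1))
    return (
        "Answer the question using only the references below.\n\n"
        f"References:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved passages, for illustration only
references = [
    "BERT is pre-trained on unlabeled text with a masked language model objective.",
    "Fine-tuning BERT updates all parameters end-to-end for a downstream task.",
]
prompt = build_augmented_prompt("How is BERT fine-tuned?", references)
print(prompt)
```

The LLM then generates its answer from this combined prompt, so the references it sees can later be checked against the response.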

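In RAG, we process a collection of text documents or document chunks by encoding them into numerical representations called vector embeddings. Each vector embedding corresponds to a single document chunk and is stored in a database called a vector store. The models that encode chunks into embeddings are called encoder models or bi-encoders. They are trained on large datasets, which lets them produce a strong representation of a document chunk in a single vector embedding. To avoid hallucination, RAG draws on a source of factual knowledge that is kept separate from the LLM's reasoning ability. This knowledge is stored externally and can be easily accessed and updated.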
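
As a rough sketch of the encode-once, compare-at-query-time idea, the toy example below uses word counts over a fixed vocabulary as a stand-in for a trained encoder model. The `embed` function, the sample chunks, and the in-memory `vector_store` list are assumptions for illustration; a real bi-encoder produces dense learned vectors instead.

```python
import math
from collections import Counter


def embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy stand-in for an encoder model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


chunks = [
    "qdrant stores vectors and payloads",
    "llama 3 generates the final answer",
    "the reranker reorders retrieved chunks",
]
vocabulary = sorted({word for chunk in chunks for word in chunk.split()})

# Index time: encode every chunk once and keep the vectors (the "vector store")
vector_store = [(chunk, embed(chunk, vocabulary)) for chunk in chunks]


def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Query time: encode the query and rank stored chunks by cosine similarity."""
    query_vector = embed(query, vocabulary)
    ranked = sorted(
        vector_store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]


print(retrieve("where are vectors stored"))
```

Because the chunk vectors are computed ahead of time, only the query has to be encoded when a question arrives.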

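There are two types of knowledge sources:

Parametric knowledge: acquired during training and stored implicitly in the weights of the neural network.

Non-parametric knowledge: stored in an external source, such as a vector database.

Why use RAG before fine-tuning (order of operations)?

Cheaper: no additional training is required.

Easier to update with the latest information.

More trustworthy, because the references can be fact-checked.

The optimization workflow summarizes which method to use based on two factors:

Context optimization: what the model needs to know.

LLM optimization: how the model needs to act.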

RAG Data Stack

Load language data

Process language data

Embed language data

Load vectors into a database
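
The four steps above can be sketched end to end in a few lines. Everything here is a simplified assumption: the hard-coded string stands in for a file loader, chunks are measured in words rather than tokens, `toy_embed` is a placeholder and not a real embedding model, and the "database" is a plain Python list.

```python
def split_into_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(chunk: str) -> list[float]:
    """Placeholder embedding (NOT a real model): character count and word count."""
    return [float(len(chunk)), float(len(chunk.split()))]


# 1. Load: a hard-coded string stands in for a file reader
raw_documents = {"doc-1": "one two three four five six seven eight nine ten"}

# 2-4. Process, embed, and load the vectors into a simple in-memory "database"
database = []
for doc_id, text in raw_documents.items():
    for position, chunk in enumerate(split_into_chunks(text, chunk_size=4, overlap=1)):
        database.append(
            {"doc_id": doc_id, "position": position, "text": chunk, "vector": toy_embed(chunk)}
        )

for record in database:
    print(record["position"], record["text"])
```

Overlapping chunks are one common way to avoid cutting a sentence's context exactly at a chunk boundary; the sizes used here are arbitrary.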

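Stages of RAG

The stages of RAG are:

Data loading: retrieving data from various sources (text files, PDFs, websites, databases, or APIs) and bringing it into the pipeline. Llama Hub provides a variety of connectors for this purpose.

Indexing: creating a structured format that the data can be queried against. For LLMs, indexing usually means generating vector embeddings, which are numerical representations of the meaning of the data, along with other metadata strategies that enable accurate, contextually relevant retrieval.

Storage: after indexing, the index and its associated metadata are usually stored so that the data does not have to be re-indexed later.

Querying: there are many ways to query with LLMs and Llama-Index data structures, including sub-queries, multi-step queries, and hybrid strategies, depending on the chosen indexing strategy.

Evaluation: this step is critical for assessing how effective the pipeline is, whether in comparison with alternative strategies or when changes are made. Evaluation provides objective measures of the accuracy, faithfulness, and speed of query responses.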
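
To make the evaluation stage more concrete, here is a minimal sketch of two retrieval metrics that are commonly used for this purpose, hit rate and mean reciprocal rank (MRR). The article does not name specific metrics, so this choice, the function names, and the made-up chunk ids are assumptions for illustration.

```python
def hit_rate(retrieved: list[list[str]], expected: list[str]) -> float:
    """Fraction of queries whose expected chunk id appears among the retrieved ids."""
    hits = sum(1 for ids, target in zip(retrieved, expected) if target in ids)
    return hits / len(expected)


def mean_reciprocal_rank(retrieved: list[list[str]], expected: list[str]) -> float:
    """Average of 1/rank of the expected chunk id (counts as 0 when it was not retrieved)."""
    total = 0.0
    for ids, target in zip(retrieved, expected):
        if target in ids:
            total += 1.0 / (ids.index(target) + 1)
    return total / len(expected)


# Hypothetical retrieval results for three queries (chunk ids are made up)
retrieved = [["c1", "c7", "c3"], ["c4", "c2", "c9"], ["c5", "c6", "c8"]]
expected = ["c1", "c2", "c0"]

print(hit_rate(retrieved, expected))
print(mean_reciprocal_rank(retrieved, expected))
```

Hit rate only asks whether the right chunk was retrieved at all, while MRR also rewards placing it near the top, which is the property a reranker is meant to improve.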

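Our RAG stack is built with Llama-Index, Qdrant, and Llama 3.

What is Llama-Index?

Llama-Index is a framework for building context-augmented LLM applications. Context augmentation means using LLMs over your private or domain-specific data.

Some popular uses of the framework include:

Question-answering chatbots (commonly known as RAG systems, for "retrieval-augmented generation")

Document understanding and extraction

Autonomous agents that can perform research and take actions

Llama-Index provides a comprehensive set of tools, from initial prototype to production-ready solution, to support building these applications. The tools cover data ingestion and processing as well as complex query workflows that combine data access with LLM-based prompting.

Here, we use llama-index >= v0.10.

Source: https://www.llamaindex.ai/blog/llamaindex-v0-10-838e735948f8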

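Main enhancements

ServiceContext is deprecated: every LlamaIndex user is familiar with ServiceContext, but it had become an outdated and cumbersome way to manage LLMs, embeddings, chunk sizes, callbacks, and other features. LlamaIndex has therefore fully deprecated it; you can now specify parameters directly or set defaults.

Restructured folder layout:

llama-index-core: contains all of the core Llama-Index abstractions.

llama-index-integrations: contains third-party integrations spanning 19 Llama-Index abstractions, including data loaders, LLMs, embedding models, vector stores, and more.

llama-index-packs: contains the collection of 50+ LlamaPacks, templates designed to kickstart a user's application.

LlamaHub will serve as the central hub for all integrations.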

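Llama 3

Meta's Llama 3 is, at the time of writing, the latest release in the open-access Llama family, and it is available through Hugging Face. Here it serves as the language model for response synthesis. Llama 3 comes in two sizes: 8B, for streamlined deployment and development on consumer-grade GPUs, and 70B, for large-scale AI applications. Each size is offered in base and instruction-tuned variants. In addition, a new version of Llama Guard fine-tuned on Llama 3 8B has been released as Llama Guard 2.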

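What is Qdrant?

Qdrant is a vector similarity search engine that provides a production-ready service with a convenient API. It is built for storing, searching, and managing points (vectors) together with additional payload information, and it is optimized for storing and querying high-dimensional vectors efficiently. Vector databases such as Qdrant rely on specialized data structures and indexing techniques, such as HNSW (Hierarchical Navigable Small World) for approximate nearest-neighbor search, and product quantization, among others. These optimizations make fast similarity and semantic search possible, letting users find the vectors closest to a given query vector under a specified distance metric. Distance metrics commonly supported by Qdrant include Euclidean distance, cosine similarity, and dot product.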
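
The three distance metrics can be compared directly with a small, self-contained sketch. This is plain Python rather than Qdrant's own implementation, and the two-dimensional vectors are made up so that the metrics visibly disagree.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


query = [1.0, 0.0]
points = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [3.0, 0.0]}

for name, vector in points.items():
    print(
        name,
        round(euclidean_distance(query, vector), 3),
        round(cosine_similarity(query, vector), 3),
        round(dot_product(query, vector), 3),
    )

# Smaller is better for Euclidean distance; larger is better for the other two
nearest_by_euclidean = min(points, key=lambda n: euclidean_distance(query, points[n]))
best_by_cosine = max(points, key=lambda n: cosine_similarity(query, points[n]))
best_by_dot = max(points, key=lambda n: dot_product(query, points[n]))
```

Point `c` points in exactly the same direction as the query but is three times longer, so cosine similarity and dot product prefer it while Euclidean distance prefers `a`. When all vectors are normalized to unit length, the three metrics produce the same ranking.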

Technology stack used

Application framework: Llama-index

Embedding model: BAAI/bge-small-en-v1.5

LLM: Meta-Llama-3

Vector store: Qdrant

Code implementation

Install the required libraries

%%writefile requirements.txt
llama-index
llama-index-llms-huggingface
llama-index-embeddings-fastembed
fastembed
unstructured[md]
qdrant-client
llama-index-vector-stores-qdrant
einops
accelerate
sentence-transformers

Then, in a separate cell, install them:

!pip install -r requirements.txt

Versions installed at the time of writing:

accelerate==0.29.3
einops==0.7.0
sentence-transformers==2.7.0
transformers==4.39.3
qdrant-client==1.9.0
llama-index==0.10.32
llama-index-agent-openai==0.2.3
llama-index-cli==0.1.12
llama-index-core==0.10.32
llama-index-embeddings-fastembed==0.1.4
llama-index-legacy==0.9.48
llama-index-llms-huggingface==0.1.4
llama-index-vector-stores-qdrant==0.2.8

Download the dataset

# Download the BERT paper (arXiv:1810.04805) as the sample document
!mkdir -p Data
!wget "https://arxiv.org/pdf/1810.04805.pdf" -O Data/arxiv.pdf

Load the documents

from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("/content/Data").load_data()

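Instantiate the embedding model

from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.core import Settings

# Use a small English BGE model, served through FastEmbed, for all embeddings
embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.embed_model = embed_model

# Split documents into chunks of up to 512 tokens before embedding
Settings.chunk_size = 512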

Define the system prompt

from llama_index.core import PromptTemplate

system_prompt = (
    "You are a Q&A assistant. Your goal is to answer questions as accurately "
    "as possible based on the instructions and context provided."
)

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<|USER|>{query_str}<|ASSISTANT|>")

Instantiate the LLM

Because we use Llama 3 as the LLM, we need to do the following:

Generate a HuggingFace access token

Request access to the model

from huggingface_hub import notebook_login

notebook_login()

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Stop generating at either the EOS token or Llama 3's end-of-turn token
stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

llm = HuggingFaceLLM(
    context_window=8192,
    max_new_tokens=256,
    # Greedy decoding; temperature only takes effect when do_sample=True, so it is omitted
    generate_kwargs={"do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    stopping_ids=stopping_ids,
    tokenizer_kwargs={"max_length": 4096},
    # float16 weights reduce memory usage on a CUDA GPU
    model_kwargs={"torch_dtype": torch.float16},
)

Settings.llm = llm
Settings.chunk_size = 512

Instantiate the vector store and load the vector embeddings

import qdrant_client
from IPython.display import Markdown, display
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(
    # You can use :memory: mode for fast, lightweight experiments.
    # It does not require Qdrant to be deployed anywhere,
    # but it requires qdrant-client >= 1.1.1
    location=":memory:"
    # Otherwise, set the Qdrant instance address with:
    # url="http://<host>:<port>"
    # Or set the Qdrant instance with host and port:
    # host="localhost",
    # port=6333
    # Set the API key for Qdrant Cloud:
    # api_key="<YOUR API KEY>"
)

# Store the embeddings in a Qdrant collection named "test" and index the loaded documents
vector_store = QdrantVectorStore(client=client, collection_name="test")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

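Instantiate the reranker module

Source: llama-index

The retrieval model fetches the top-k documents by embedding similarity to the query. Embedding-based retrieval has several benefits:

Dot products are very fast to compute, and no per-document model calls are needed at query time.

Although not perfect, embeddings encode the semantics of documents and queries reasonably well. For a subset of queries, embedding-based retrieval returns highly relevant results.

Despite these advantages, embedding-based retrieval can sometimes be imprecise and return context that is irrelevant to the query. That in turn lowers the overall quality of the RAG system, regardless of the quality of the LLM.

This approach uses a two-stage retrieval process.

The first stage uses embedding-based retrieval with a high top-k value to prioritize recall, even at the cost of lower precision.

The second stage then uses a somewhat more computationally expensive process that favors precision over recall. It aims to "rerank" the initially retrieved candidates, improving the quality of the final results.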
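
The two-stage idea can be sketched with toy scoring functions. Both scorers below are simple word-overlap stand-ins chosen for illustration: in a real system the first stage is embedding similarity and the second stage is a model that scores each (query, document) pair, such as a cross-encoder. The function names, the stopword list, and the sample documents are assumptions.

```python
STOPWORDS = {"what", "is", "the", "in", "a", "of"}


def first_stage_score(query: str, document: str) -> int:
    """Cheap, recall-oriented score: number of distinct shared words."""
    return len(set(query.split()) & set(document.split()))


def second_stage_score(query: str, document: str) -> int:
    """Stand-in for a slower, more precise scorer: shared words, ignoring stopwords."""
    query_terms = set(query.split()) - STOPWORDS
    return len(query_terms & set(document.split()))


def retrieve_then_rerank(query: str, documents: list[str], top_k: int, top_n: int) -> list[str]:
    # Stage 1: keep a generous candidate set (high top_k favors recall)
    candidates = sorted(
        documents, key=lambda d: first_stage_score(query, d), reverse=True
    )[:top_k]
    # Stage 2: re-score only the candidates and keep the best top_n (favors precision)
    reranked = sorted(
        candidates, key=lambda d: second_stage_score(query, d), reverse=True
    )
    return reranked[:top_n]


documents = [
    "the paper is in the appendix",
    "bert fine-tuning updates all parameters",
    "the feature-based approach extracts bert activations",
    "qdrant stores vectors",
]
query = "what is the fine-tuning procedure in bert"

results = retrieve_then_rerank(query, documents, top_k=3, top_n=2)
for document in results:
    print(document)
```

In this toy data the first stage ranks the stopword-heavy document highest, and the second stage pushes it out of the final results. The expensive scorer only runs on the `top_k` candidates, which is what keeps the two-stage design affordable.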

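from llama_index.core.postprocessor import SentenceTransformerRerank

# Cross-encoder reranker that keeps the 3 highest-scoring nodes
rerank = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-2-v2", top_n=3)

Instantiate the query engine

import time

# Retrieve 10 candidates by embedding similarity, then let the reranker narrow them down
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[rerank])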

Ask question 1

now = time.time()
response = query_engine.query("What is instruction finetuning?",)
print(f"Response Generated: {response}")
print(f"Elapsed: {round(time.time() - now, 2)}s")

Response synthesized by the generation module in RAG

Response Generated: Instruction fine-tuning is not
explicitly mentioned in the provided context. However,
based on the text, it can be inferred that fine-tuning is a
process where a pre-trained model like BERT is adapted to a
specific task by swapping out the appropriate inputs and
outputs. This process is described as "straightforward" and
allows BERT to model many downstream tasks by fine-tuning
all the parameters end-to-end.
Elapsed: 7.32s

Ask question 2

now = time.time()
response = query_engine.query("Describe the Feature-based Approach with BERT??",)
print(f"Response Generated: {response}")
print(f"Elapsed: {round(time.time() - now, 2)}s")

Response synthesized by the generation module in RAG

Response Generated: According to the text, the
Feature-based Approach with BERT involves extracting the
activations from one or more layers of BERT without
fine-tuning any parameters of BERT. These contextual
embeddings are then used as input to a randomly initialized
two-layer 768-dimensional BiLSTM before the classification
layer. This approach is used to ablate the fine-tuning
approach and demonstrate the effectiveness of BERT for both
fine-tuning and feature-based approaches.
Elapsed: 6.78s

Ask question 3

now = time.time()
response = query_engine.query("What is SQuADv2.0?",)
print(f"Response Generated: {response}")
print(f"Elapsed: {round(time.time() - now, 2)}s")

Response synthesized by the generation module in RAG

Response Generated: According to the provided context,
SQuAD v2.0 is an extension of the SQuAD 1.1 problem
definition, allowing for the possibility that no short
answer exists in the provided paragraph, making the problem
more realistic.
Elapsed: 4.15s

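Conclusion

Here we built an advanced RAG question-answering system that runs over private data. We incorporated LlamaIndex's reranking to prioritize the most relevant context among what the retriever returned. This approach helps keep the generated responses factually accurate.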