Fine-tune Llama 2 with LoRA: Customizing a large language model for question-answering

Source: ROCm Blogs (amd.com)

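In this blog, we show you how to fine-tune Llama 2 on an AMD GPU with ROCm. We use Low-Rank Adaptation of Large Language Models (LoRA) to overcome memory and compute limitations and to make open-source large language models (LLMs) more accessible. We also show you how to fine-tune the model and upload it to Hugging Face.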

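Introduction

In the dynamic realm of generative AI (GenAI), fine-tuning LLMs such as Llama 2 poses distinct challenges because of its substantial compute and memory requirements. LoRA offers a compelling solution: it allows state-of-the-art LLMs to be fine-tuned quickly and cost-effectively, which both shortens the tuning process and lowers its cost.
To explore these benefits, we provide a comprehensive tutorial on fine-tuning Llama 2 with LoRA, tailored to question-answering (QA) tasks on an AMD GPU.
Before starting, let's briefly review the three components that form the foundation of our discussion:
• Llama 2: Meta's advanced language model, with variants of up to 70 billion parameters.
• Fine-tuning: the process of refining an LLM for a specialized task to optimize its performance.
• LoRA: the algorithm used here to fine-tune Llama 2, which adapts the model to the specialized task efficiently.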

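Llama 2

Llama 2 is the second-generation collection of open-source LLMs released by Meta; it comes with a commercial license. Llama 2 is designed to handle a wide range of natural language processing (NLP) tasks, with models ranging from 7 billion to 70 billion parameters.
Llama 2 Chat, which is optimized for dialogue, performs comparably to popular closed-source models such as ChatGPT and PaLM. You can improve its performance further by fine-tuning it on a high-quality conversational dataset. In this blog post, we take a closer look at refining the Llama 2 Chat model with a QA dataset.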

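Fine-tuning a model

Fine-tuning in machine learning is the process of adjusting the weights and parameters of a pre-trained model on new data to improve its performance on a specific task. It involves updating the model's weights using a new dataset that is specific to the task at hand. Fine-tuning LLMs on consumer hardware is usually not feasible because of insufficient memory and compute. In this tutorial, we use LoRA to overcome these challenges.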

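LoRA

LoRA is a technique developed by researchers at Microsoft to address the challenges of fine-tuning LLMs. It significantly reduces the number of parameters that need to be fine-tuned (by up to 10,000 times) and substantially lowers the GPU memory requirement. To learn more about the fundamentals of LoRA, refer to Using LoRA for efficient fine-tuning: Fundamental principles.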
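As a rough illustration of the idea, the sketch below implements a toy LoRA layer in plain Python. The frozen weight matrix W is left untouched, and a low-rank product B·A, scaled by alpha/r, is added to its output. The sizes, values, and helper names (`matvec`, `lora_forward`) are made up for illustration and are not taken from the `peft` library.

```python
# Toy LoRA layer in plain Python (no ML framework); illustrative only.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x). W is frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d, k, r, alpha = 8, 8, 2, 2   # toy sizes; real layers are far larger

# Frozen pre-trained weights (d x k); an identity matrix keeps the example easy to check.
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]
A = [[0.5] * k for _ in range(r)]   # trainable, shape r x k
B = [[0.0] * r for _ in range(d)]   # trainable, shape d x r, initialized to zero

x = [float(i) for i in range(1, k + 1)]
y = lora_forward(W, A, B, x, alpha, r)

full_params = d * k          # parameters a full fine-tune would update
lora_params = r * (d + k)    # parameters LoRA updates instead
print(y)
print(f"full: {full_params}, LoRA: {lora_params}")
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly; training then moves only A and B. The saving grows with layer size, since the LoRA parameter count scales with r·(d + k) rather than d·k.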

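Fine-tuning Llama 2 step by step

Standard (full-parameter) fine-tuning updates all of the model's parameters. It requires significant compute and memory to manage the gradients and optimizer states. As a result, the memory footprint is typically about four times the size of the model itself. For example, loading a 7-billion-parameter model such as Llama 2 in FP32 (4 bytes per parameter) takes about 28 GB of GPU memory, while fine-tuning it takes about 28*4=112 GB. Note that the 112 GB figure is empirical; factors such as batch size, data precision, and gradient accumulation all affect total memory use.
To overcome this memory limitation, you can use a parameter-efficient fine-tuning (PEFT) technique such as LoRA.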
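The memory estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name and the default 4x overhead factor are illustrative, taken from the rule of thumb in the text rather than from any measurement tool.

```python
def full_finetune_memory_gb(num_params, bytes_per_param=4, overhead_factor=4):
    """Estimate GPU memory (in GB) to load a model and to fully fine-tune it.

    overhead_factor is the empirical "about four times the model size"
    rule of thumb; real usage varies with batch size, precision, and
    gradient accumulation.
    """
    weights_gb = num_params * bytes_per_param / 1e9
    training_gb = weights_gb * overhead_factor
    return weights_gb, training_gb

weights_gb, training_gb = full_finetune_memory_gb(7_000_000_000)
print(f"load in FP32: ~{weights_gb:.0f} GB, full fine-tuning: ~{training_gb:.0f} GB")
```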
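This example uses the two GCDs (Graphics Compute Dies) of an AMD MI250 GPU, each equipped with 64 GB of VRAM. With this setup, we can explore different settings for fine-tuning the Llama-2-7b weights with and without LoRA.
Our setup:
• Hardware and OS: see this link for the list of hardware and operating systems that ROCm supports.
• Software:
◦ ROCm 6.1.0+
◦ PyTorch 2.0.1+
• Libraries: `transformers`, `accelerate`, `peft`, `trl`, `bitsandbytes`, `scipy`

For this blog, we ran our experiments on a single MI250 GPU using the Docker image rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2.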

Step 1: Getting started

First, confirm that the GPU is available.

!rocm-smi --showproductname

Your output should look like this:

========================= ROCm System Management Interface =========================
=================================== Product Info ===================================
GPU[0]      : Card series:      AMD INSTINCT MI250 (MCM) OAM AC MBA
GPU[0]      : Card model:      0x0b0c
GPU[0]      : Card vendor:      Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]      : Card SKU:      D65209
GPU[1]      : Card series:      AMD INSTINCT MI250 (MCM) OAM AC MBA
GPU[1]      : Card model:      0x0b0c
GPU[1]      : Card vendor:      Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1]      : Card SKU:      D65209
====================================================================================
=============================== End of ROCm SMI Log ================================

Next, install the required libraries.

!pip install -q pandas peft==0.9.0 transformers==4.31.0 trl==0.4.7 accelerate scipy

Install bitsandbytes

1. Install bitsandbytes using the following code.

git clone --recurse https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm_enabled
pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=hip -S . #Use -DBNB_ROCM_ARCH="gfx90a;gfx942" to target specific gpu arch
make
pip install .

2. Check the bitsandbytes version.

At the time of writing this blog, the version is 0.43.0.

%%bash
pip list | grep bitsandbytes

Import the required packages

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer

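Step 2: Configuring the model and data

You can access Meta's official Llama-2 model from Hugging Face after submitting a request, which can take a couple of days. To avoid the wait, we use the base model Llama-2-7b-chat-hf provided by NousResearch (it is the same as the original, but quicker to access).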

# Model and tokenizer names
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
new_model_name = "llama-2-7b-enhanced" # You can give your fine-tuned model your own name

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto"
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

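After you have the base model, you can start fine-tuning. We fine-tune the base model for a question-answering task using a small dataset called mlabonne/guanaco-llama2-1k, a subset (1,000 samples) of the timdettmers/openassistant-guanaco dataset. The source data is a human-generated, human-annotated assistant-style conversation corpus of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.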

# Dataset
data_name = "mlabonne/guanaco-llama2-1k"
training_data = load_dataset(data_name, split="train")
# Check the data
print(training_data.shape)
# Sample #11 is a QA example in English
print(training_data[11])

(1000, 1)
{'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts...</s>'}
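Each record is a single string in the Llama 2 chat prompt format, as the sample above shows. If you want to add your own question-answer pairs in the same shape, a small helper along these lines may be a starting point. The function name is hypothetical, and the sketch covers single-turn examples only (no system prompt, no multi-turn history).

```python
def format_llama2_example(prompt, response):
    """Wrap one prompt/response pair in the single-turn Llama 2 chat template
    used by the 'text' field of the dataset sample."""
    return f"<s>[INST] {prompt.strip()} [/INST] {response.strip()}</s>"

# Illustrative pair, not taken from the dataset.
example = {"text": format_llama2_example(
    "What is LoRA?",
    "LoRA is a parameter-efficient fine-tuning technique.",
)}
print(example["text"])
```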

## Training requires an additional dependency
!pip install tensorboardX

Step 3: Start fine-tuning

Use the following code to set your training parameters:

# Training parameters
train_params = TrainingArguments(
    output_dir="./results_modified",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=50,
    learning_rate=4e-5,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
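To get a feel for how these settings translate into training length, the sketch below estimates the number of optimizer steps from the dataset size, batch size, and gradient accumulation. It is an approximation of what the trainer does internally, not its actual code. It also assumes an effective data-parallel size of 1, on the assumption that `device_map="auto"` shards one copy of the model across the two GCDs rather than replicating it.

```python
import math

def optimizer_steps(num_samples, per_device_batch_size, grad_accum_steps,
                    epochs, num_devices=1):
    """Approximate the total number of optimizer steps for a training run."""
    samples_per_step = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / samples_per_step)
    return steps_per_epoch * epochs

# Values from this tutorial: 1,000 samples, batch size 4, no accumulation, 1 epoch.
steps = optimizer_steps(num_samples=1000, per_device_batch_size=4,
                        grad_accum_steps=1, epochs=1)
print(steps)
```

With `logging_steps=50` and `save_steps=50`, this corresponds to five logged loss values and five checkpoints per run.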

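Training with LoRA configuration

Now you can integrate LoRA into the base model and assess its additional parameters. LoRA adds pairs of rank-decomposition weight matrices (called update matrices) alongside the existing weights, and trains only those newly added weights.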

from peft import get_peft_model
# LoRA configuration
peft_parameters = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, peft_parameters)
model.print_trainable_parameters()

The output looks like this:

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
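The trainable-parameter count can be reproduced by hand. The sketch below assumes the Llama-2-7b architecture has a hidden size of 4096 and 32 decoder layers, and that LoRA is applied to two square projection matrices per layer (the query and value projections, which is the usual default for Llama models when no target modules are specified). These architectural values are assumptions, not something printed by the code above.

```python
hidden_size = 4096       # assumed Llama-2-7b hidden dimension
num_layers = 32          # assumed number of decoder layers
adapted_per_layer = 2    # assumed: query and value projections
r = 8                    # LoRA rank from the configuration above

# Each adapted (hidden x hidden) matrix gets A (r x hidden) and B (hidden x r).
params_per_module = r * (hidden_size + hidden_size)
lora_params = params_per_module * adapted_per_layer * num_layers

all_params = 6_742_609_920               # "all params" from the printed output
base_params = all_params - lora_params   # size of the model without adapters
trainable_pct = 100 * lora_params / all_params

print(f"{lora_params:,} trainable of {all_params:,} ({trainable_pct:.4f}%)")
```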

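Note that the parameters LoRA adds amount to only 0.062% of the original model; this is the fraction we update through fine-tuning, as shown below.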

# Trainer with LoRA configuration
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning.train()

The output looks like this:

[250/250 07:59, Epoch 1/1]
Step     Training Loss
50       1.976400
100      1.613500
150      1.409100
200      1.391500
250      1.377300

TrainOutput(global_step=250, training_loss=1.5535581665039062, metrics={'train_runtime': 484.7942, 'train_samples_per_second': 2.063, 'train_steps_per_second': 0.516, 'total_flos': 1.701064079130624e+16, 'train_loss': 1.5535581665039062, 'epoch': 1.0})

To save your model, run this code:

# Save the model
fine_tuning.model.save_pretrained(new_model_name)

Checking memory usage during training with LoRA

======================= ROCm System Management Interface ====================
=============================== Concise Info ================================
GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    52.0c           179.0W  1700Mhz  1600Mhz  0%   auto  300.0W   65%   100%
1    52.0c           171.0W  1650Mhz  1600Mhz  0%   auto  300.0W   66%   100%
=============================================================================
============================ End of ROCm SMI Log ============================

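To make it easier to compare fine-tuning with and without LoRA, the next phase is a full fine-tuning of the base model, which updates all of its parameters. We then analyze the differences in memory usage, training speed, training loss, and other relevant metrics.

Training without LoRA configuration

For this section, you need to restart the kernel and skip the 'Training with LoRA configuration' section.
To compare the models directly under the same criteria, we keep the train_params settings unchanged during full-parameter fine-tuning, apart from the lower learning rate set in the code further down.

To check the trainable parameters in the base model, use the following code.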

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

print_trainable_parameters(base_model)

The output looks like this:

trainable params: 6738415616 || all params: 6738415616 || trainable%: 100.00

Continue with the following code:

# Set a lower learning rate for full-parameter fine-tuning
train_params.learning_rate = 4e-7
print(train_params.learning_rate)
# Trainer without LoRA configuration

fine_tuning_full = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning_full.train()

The output looks like this:

[250/250 3:02:12, Epoch 1/1]
Step     Training Loss
50       1.712300
100      1.487000
150      1.363800
200      1.371100
250      1.368300

TrainOutput(global_step=250, training_loss=1.4604909362792968, metrics={'train_runtime': 10993.7995, 'train_samples_per_second': 0.091, 'train_steps_per_second': 0.023, 'total_flos': 1.6999849383985152e+16, 'train_loss': 1.4604909362792968, 'epoch': 1.0})

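Checking memory usage during training without LoRA

During training, you can check memory usage by running the rocm-smi command in a terminal.
The command produces the following output: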

======================= ROCm System Management Interface ====================
=============================== Concise Info ================================
GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    40.0c           44.0W   800Mhz   1600Mhz  0%   auto  300.0W   100%  89%
1    39.0c           50.0W   1700Mhz  1600Mhz  0%   auto  300.0W   100%  85%
=============================================================================
============================ End of ROCm SMI Log ============================
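If you want to log these numbers rather than read them by eye, the concise table can be parsed with a few lines of Python. The sketch below works on a captured copy of the output above. It assumes the column layout shown here (VRAM% and GPU% as the last two columns), which may differ between ROCm versions, and the function name is hypothetical.

```python
SAMPLE_OUTPUT = """\
======================= ROCm System Management Interface ====================
=============================== Concise Info ================================
GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    40.0c           44.0W   800Mhz   1600Mhz  0%   auto  300.0W   100%  89%
1    39.0c           50.0W   1700Mhz  1600Mhz  0%   auto  300.0W   100%  85%
=============================================================================
============================ End of ROCm SMI Log ============================
"""

def parse_concise_usage(text):
    """Return {gpu_index: {'vram_pct': int, 'gpu_pct': int}} from the concise table.

    Assumes data rows start with the GPU index and end with the VRAM% and GPU% columns.
    """
    usage = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip banners, the header row, and blank lines
        usage[int(fields[0])] = {
            "vram_pct": int(fields[-2].rstrip("%")),
            "gpu_pct": int(fields[-1].rstrip("%")),
        }
    return usage

usage = parse_concise_usage(SAMPLE_OUTPUT)
print(usage)
```

In a live session, the same function could be fed the text captured from running `rocm-smi` through `subprocess.run`.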

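Step 4: Comparison between fine-tuning with LoRA and full-parameter fine-tuning

Comparing the results of the 'Training with LoRA configuration' and 'Training without LoRA configuration' sections, note the following:
• Memory usage:
    ◦ In full-parameter fine-tuning there are 6,738,415,616 trainable parameters, which leads to significant memory consumption during the back-propagation stage of training.
    ◦ LoRA introduces only 4,194,304 trainable parameters, which is just 0.062% of the trainable parameters in full-parameter fine-tuning.
    ◦ Monitoring memory usage during training with and without LoRA shows that fine-tuning with LoRA uses only about 65% of the memory consumed by full-parameter fine-tuning. This gives us room to increase the batch size and maximum sequence length, and to train on larger datasets, with limited hardware resources.

• Training speed:
    ◦ The results show that full-parameter fine-tuning takes hours to complete (about three hours in our run), while fine-tuning with LoRA finishes in under 9 minutes. Several factors contribute to this speed-up:
        ▪︎ Fewer trainable parameters in LoRA mean fewer derivative calculations and less memory needed to store and update the weights.
        ▪︎ Full-parameter fine-tuning is more prone to being memory-bound, with data movement becoming the training bottleneck. This shows up as lower GPU utilization. Adjusting the training settings can alleviate this, but it may require more resources (additional GPUs) and a smaller batch size.

• Accuracy:
    ◦ Both training runs showed a notable reduction in training loss, and the two methods reached nearly the same final training loss: 1.368 for full-parameter fine-tuning and 1.377 for fine-tuning with LoRA. If you are interested in how LoRA affects fine-tuning performance, refer to LoRA: Low-Rank Adaptation of Large Language Models.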
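The figures quoted in this comparison can be recomputed directly from the two training logs. The sketch below only uses numbers printed earlier in this post, and the dictionary layout is just for illustration. Keep in mind that the two runs used different learning rates, so the loss comparison is indicative rather than strictly like-for-like.

```python
# Values copied from the two training runs reported above.
lora = {"trainable_params": 4_194_304, "runtime_s": 484.7942, "final_loss": 1.377300}
full = {"trainable_params": 6_738_415_616, "runtime_s": 10993.7995, "final_loss": 1.368300}

param_pct = 100 * lora["trainable_params"] / full["trainable_params"]
speedup = full["runtime_s"] / lora["runtime_s"]
loss_gap = lora["final_loss"] - full["final_loss"]

print(f"LoRA trains {param_pct:.3f}% of the parameters")
print(f"LoRA run is about {speedup:.1f}x faster in wall-clock time")
print(f"final training loss differs by {loss_gap:.3f}")
```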

Step 5: Test the model fine-tuned with LoRA

To test your model, run the following code:

# Reload the model in FP16 and merge it with the LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
from peft import LoraConfig, PeftModel
model = PeftModel.from_pretrained(base_model, new_model_name)
model = model.merge_and_unload()

# Reload the tokenizer so it can be saved
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The output looks like this:

    Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.34s/it]
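Conceptually, merging folds the low-rank update into the base weights, so that the merged model needs no adapter at inference time. The toy example below, in plain Python with made-up 3x3 values, checks that a merged weight matrix W + (alpha/r)·B·A gives the same output as running the base layer and the adapter separately. It is a sketch of the idea only, not the `peft` implementation.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(M, x):
    """Multiply a matrix by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A as a single dense matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]

# Toy values chosen so every result is exactly representable as a float.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weights
A = [[0.5, 0.25, 0.0]]                                    # r x k, with r = 1
B = [[1.0], [0.0], [2.0]]                                 # d x r
alpha, r = 2, 1
x = [1.0, 2.0, 4.0]

# Base layer plus adapter, computed separately.
adapter_out = [b + (alpha / r) * u
               for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Single merged matrix.
merged_out = matvec(merge_lora(W, A, B, alpha, r), x)
print(adapter_out, merged_out)
```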

Uploading the model to Hugging Face lets you run subsequent tests or share your model with others (to proceed with this step, you need an active Hugging Face account).

from huggingface_hub import login
# You need to use your Hugging Face access token
login("xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
# Push the model to Hugging Face. This takes a few minutes, depending on the model size and your network speed.
model.push_to_hub(new_model_name, use_temp_dir=False)
tokenizer.push_to_hub(new_model_name, use_temp_dir=False)
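If you prefer not to paste the access token into a notebook cell, one option is to read it from an environment variable. The sketch below assumes the token is stored in a variable named `HF_TOKEN`; both helper names are hypothetical, and `huggingface_hub` is imported lazily so the token check can run without it.

```python
import os

def read_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in the given environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token

def login_from_env(env_var="HF_TOKEN"):
    """Log in to the Hugging Face Hub using a token taken from the environment."""
    from huggingface_hub import login  # imported here so read_hf_token works without it
    login(token=read_hf_token(env_var))
```

Calling `login_from_env()` would then replace the `login("xxx...")` line above.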

You can now test with the base model (original) and your fine-tuned model.
• Base model:

# Generate text using base model
query = "What do you think is the most important part of building an AI chatbot?"
text_gen = pipeline(task="text-generation", model=base_model_name, tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

# Outputs:
<s>[INST] What do you think is the most important part of building an AI chatbot? [/INST]  There are several important aspects to consider when building an AI chatbot, but here are some of the most critical elements:

1. Natural Language Processing (NLP): A chatbot's ability to understand and interpret human language is crucial for effective communication. NLP is the foundation of any chatbot, and it involves training the AI model to recognize patterns in language, interpret meaning, and generate responses.
2. Conversational Flow: A chatbot's conversational flow refers to the way it interacts with users. A well-designed conversational flow should be intuitive, easy to follow, and adaptable to different user scenarios. This involves creating a dialogue flowchart that guides the conversation and ensures the chatbot responds appropriately to user inputs.
3. Domain Knowledge: A chat

• Fine-tuned model:

# Generate text using fine-tuned model
query = "What do you think is the most important part of building an AI chatbot?"
text_gen = pipeline(task="text-generation", model=new_model_name, tokenizer=llama_tokenizer, max_length=200)
output = text_gen(f"<s>[INST] {query} [/INST]")
print(output[0]['generated_text'])

# Outputs:
<s>[INST] What do you think is the most important part of building an AI chatbot? [/INST] The most important part of building an AI chatbot is to ensure that it is able to understand and respond to user input in a way that is both accurate and natural-sounding. This requires a combination of natural language processing (NLP) capabilities and a well-designed conversational flow.

Here are some key factors to consider when building an AI chatbot:

1. Natural Language Processing (NLP): The chatbot must be able to understand and interpret user input, including both text and voice commands. This requires a robust NLP engine that can handle a wide range of language and dialects.
2. Conversational Flow: The chatbot must be able to respond to user input in a way that is both natural and intuitive. This requires a well-designed conversational flow that can handle a wide range

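You can observe the two models' outputs for the given query. Because the fine-tuning process changed the model weights, the outputs differ slightly.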
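When comparing outputs side by side, it can help to strip the echoed prompt and keep only the model's answer. The helpers below are a minimal sketch with hypothetical names. They assume the single-turn `[INST] ... [/INST]` prompt shape used in the generation calls above.

```python
def build_prompt(query):
    """Wrap a user query in the single-turn Llama 2 chat prompt used above."""
    return f"<s>[INST] {query} [/INST]"

def extract_answer(generated_text):
    """Return the text after the last [/INST] marker, or the whole text if absent."""
    _, found, answer = generated_text.rpartition("[/INST]")
    return answer.strip() if found else generated_text.strip()

generated = build_prompt("What is LoRA?") + "  LoRA is a fine-tuning technique."
print(extract_answer(generated))
```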

Summary