微调 Code Llama 完整指南

一、前言

今天这篇文章将向大家详细介绍如何对 Code Llama 进行微调，让它变成适合 SQL 开发的有利工具。对于编程开发任务，经过适当微调后的 Code Llama 的性能通常都会比普通的 Llama 强很多，特别是当我们针对具体任务进行优化时:

使用b-mc2/sql-create-context这个文本查询及其对应的SQL查询集合进行训练

使用Lora方法，将基础模型的权重量化为int8，冻结权重，仅对适配器进行训练

本文大多参考了alpaca-lora项目，同时也进行了一定的改进与优化

通过上述几点方法，相信我们能使Code Llama专注于SQL开发领域，获得更好的效果。如果按照本指南步骤进行指导，相信您也能掌握微调的奥妙。

二、微调 Code Llama

2.1、安装依赖

我使用了一台配置了 Python 3.10 和 Cuda 11.8 的 A100 GPU 服务器来运行本文中的代码。大约运行了一个小时。(为了验证可移植性，我还试验在Colab上运行代码，效果都很好。)

!pip install git+https://github.com/huggingface/transformers.git@main bitsandbytes accelerate==0.20.3  # we need latest transformers for this
!pip install git+https://github.com/huggingface/peft.git@e536616888d51b453ed354a6f1e243fecb02ea08
!pip install datasets==2.10.1
import locale # colab workaround
locale.getpreferredencoding = lambda: "UTF-8" # colab workaround
!pip install wandb

2.2、加载库

from datetime import datetime
import os
import sys

import torch
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
    set_peft_model_state_dict,
)
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq

（如果出现导入错误，请尝试重新启动 Jupyter 内核）

2.3、加载数据集

这将从 Huggingface Hub 中提取数据集，并将其中的 10% 分成评估集，以检查模型在训练中的表现如何：

from datasets import load_dataset
dataset = load_dataset("b-mc2/sql-create-context", split="train")
train_dataset = dataset.train_test_split(test_size=0.1)["train"]
eval_dataset = dataset.train_test_split(test_size=0.1)["test"]

如果您想加载自己的数据集，请执行以下操作：

train_dataset = load_dataset('json', data_files='train_set.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation_set.jsonl', split='train')

如果您想查看数据集中的任何样本，只需执行以下操作：

print(train_dataset[3])

2.4、加载模型

我从 Huggingface 加载代码 llama int8（Lora 的标准）：

base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

torch_dtype=torch.float16 表示使用 float16 表示形式执行计算，即使值本身是 8 位整数。

如果出现错误“ValueError：Tokenizer 类 CodeLlamaTokenizer 不存在或当前未导入。”确保你的 Transformer 版本是 4.33.0.dev0 并且accelerate是 >=0.20.3。

2.5、检查基础型号

检查模型是否已经可以做我们想要它做的事情：

eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

输出结果：

SELECT * FROM table_name_12 WHERE class > 91.5 AND city_of_license = 'hyannis, nebraska'

如果输入只要求类，那么这显然是错误的，因此请继续进行微调！

2.6、Tokenization

设置一些标记化设置，例如左填充，因为它使训练使用更少的内存：

tokenizer.add_eos_token = True
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"

设置 tokenize 函数以使 labels 和 input_ids 相同。这基本上就是自我监督微调：

def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding=False,
        return_tensors=None,
    )

    # "self-supervised learning" means the labels are also the inputs:
    result["labels"] = result["input_ids"].copy()

    return result

并运行将每个 data_point 转换为我在网上找到的效果很好的提示：

def generate_and_tokenize_prompt(data_point):
    full_prompt =f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.

### Input:
{data_point["question"]}

### Context:
{data_point["context"]}

### Response:
{data_point["answer"]}
"""
    return tokenize(full_prompt)

重新格式化以提示并将每个样本标记为我们的标记化训练和评估数据集：

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

2.7、设置 LoRA

置标准 Lora 配置并将其附加到基本模型：

model.train() # put model back into training mode
model = prepare_model_for_int8_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=[
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

要从检查点恢复，请将resumefromcheckpoint 设置为要从中恢复的adapter_model.bin 的路径：

resume_from_checkpoint = "" # set this to the adapter_model.bin file you want to resume from

if resume_from_checkpoint:
    if os.path.exists(resume_from_checkpoint):
        print(f"Restarting from {resume_from_checkpoint}")
        adapters_weights = torch.load(resume_from_checkpoint)
        set_peft_model_state_dict(model, adapters_weights)
    else:
        print(f"Checkpoint {resume_from_checkpoint} not found")

设置权重和偏差以查看训练图的可选内容：

wandb_project = "sql-try2-coder"
if len(wandb_project) > 0:
    os.environ["WANDB_PROJECT"] = wandb_project

if torch.cuda.device_count() > 1:
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    model.is_parallelizable = True
    model.model_parallel = True

2.8、模型训练

如果 GPU 内存不足，请更改 perdevicetrainbatchsize。 gradientaccumulationsteps 变量应确保这不会影响训练运行期间的批量动态。所有其他变量都是标准的东西，不用设置：

batch_size = 128
per_device_train_batch_size = 32
gradient_accumulation_steps = batch_size // per_device_train_batch_size
output_dir = "sql-code-llama"

training_args = TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=100,
        max_steps=400,
        learning_rate=3e-4,
        fp16=True,
        logging_steps=10,
        optim="adamw_torch",
        evaluation_strategy="steps", # if val_set_size > 0 else "no",
        save_strategy="steps",
        eval_steps=20,
        save_steps=20,
        output_dir=output_dir,
        load_best_model_at_end=False,
        group_by_length=True, # group sequences of roughly the same length together to speed up training
        report_to="wandb", # if use_wandb else "none",
        run_name=f"codellama-{datetime.now().strftime('%Y-%m-%d-%H-%M')}", # if use_wandb else None,
    )

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=training_args,
    data_collator=DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True
    ),
)

然后我们进行一些与 pytorch 相关的优化，这只是使训练更快，但不影响准确性：

model.config.use_cache = False

old_state_dict = model.state_dict
model.state_dict = (lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())).__get__(
    model, type(model)
)
if torch.__version__ >= "2" and sys.platform != "win32":
    print("compiling the model")
    model = torch.compile(model)

trainer.train()

此 ^ 将在 A100 上运行大约 1 小时。

2.9、加载最终检查点

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

base_model = "codellama/CodeLlama-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

要加载经过微调的 Lora/Qlora 适配器，请使用 PeftModel.frompretrained。 output_dir 应该是包含adapterconfig.json和adapter_model.bin的东西：

from peft import PeftModel
model = PeftModel.from_pretrained(model, output_dir)

尝试与之前相同的提示：

eval_prompt = """You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question and context regarding one or more tables.

You must output the SQL query that answers the question.
### Input:
Which Class has a Frequency MHz larger than 91.5, and a City of license of hyannis, nebraska?

### Context:
CREATE TABLE table_name_12 (class VARCHAR, frequency_mhz VARCHAR, city_of_license VARCHAR)

### Response:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

模型输出：

SELECT class FROM table_name_12 WHERE frequency_mhz > 91.5 AND city_of_license = "hyannis, nebraska"

从运行结果可以看到微调是有效果的！也可以将此适配器转换为 Llama.cpp 模型以在本地运行。

Jupyter Notebook 的完整代码：

https://github.com/Crossme0809/frenzyTechAI/blob/main/fine-tune-code-llama/finetunecode_llama.ipynb

三、References

[1]. Alpaca-LoRA：

https://github.com/tloen/alpaca-lora

[2]. LoRA Paper:

https://arxiv.org/abs/2106.09685

[3]. Sql-Create-Context：

https://huggingface.co/datasets/b-mc2/sql-create-context

如果你对这篇文章感兴趣，而且你想要了解更多关于AI领域的实战技巧，可以关注「技术狂潮AI」公众号。在这里，你可以看到最新最热的AIGC领域的干货文章和案例实战教程。