Function Calling: Fine-Tuning Llama 3 on xLAM
Fast and memory-efficient thanks to QLoRA

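Recent large language models (LLMs) perform well on most language generation tasks. However, because they operate by next-token prediction, they often struggle to carry out mathematical operations accurately. In addition, because of their knowledge cutoff, they may lack the information needed to answer some queries accurately.

One way to mitigate these issues is function calling. Function calling lets an LLM connect reliably to external tools and interact with external APIs. For example, by connecting an LLM to a web search engine and a calculator, it can retrieve information from the internet and perform mathematical operations through function calls.

In this article, we will see how to fine-tune an LLM for function calling. I use xLAM, a dataset of 60k function-calling entries released by Salesforce, to fine-tune Llama 3. We will see how to format the dataset and how to use the fine-tuned adapter for function calling.

I also made a notebook that implements the fine-tuning code described in this article, along with some inference examples: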

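Get the notebook (#89)

Function Calling with LLMs: How Does It Work?

If you prompt a standard LLM with "Give me the square root of 3342398", it will generate the answer token by token rather than computing it, which can be very inaccurate. Let's try it with Llama 3 Instruct: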

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a calculator."},
    {"role": "user", "content": "Give me the square root of 3342398"},
]
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1])

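Note: I used the inference code provided by Meta and only changed the prompt.

Result:

The square root of 3342398 is 1833.13

The answer is close, but wrong. The correct answer is "1828.222634144977". LLMs cannot do math accurately. They can reason about math problems and produce approximate results, but they are not as accurate as a calculator. They are not designed for that. However, they can call a calculator. We only need the model to "understand" that we are asking for the square root of the number 3342398.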

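To do this, we can fine-tune the LLM on a function-calling dataset. A function-calling dataset usually has at least two columns:

Query (or prompt): a standard prompt, e.g., "Give me the square root of 3342398"
Tool call: usually in JSON format, e.g.: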

{
    "name": "math.sqrt",
    "description": "Python square root function",
    "tool_id": 20,
    "arguments": {
        "1": 3342398
    }
}

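At inference time, we want the model to generate the "tool call" given the "query". With the example above, the model would return a JSON object whose tool ID is 20. After parsing the LLM's output, we can retrieve the ID of the tool and its arguments.

Then we can pass the "arguments" to the selected tool. For the example above, the tool would return an accurate square root of the requested number, computed rather than approximated by the model.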
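The parse-then-dispatch step described above can be sketched as follows. This is only a minimal illustration under my own assumptions: the `TOOLS` registry and the `run_tool_call` helper are hypothetical names, and the tool call follows the illustrative format shown earlier, not a format required by any library.

```python
import json
import math

# Hypothetical registry mapping tool IDs to Python callables.
TOOLS = {20: math.sqrt}

def run_tool_call(raw_output: str) -> float:
    """Parse a JSON tool call and run the selected tool with its arguments."""
    call = json.loads(raw_output)
    tool = TOOLS[call["tool_id"]]
    # The example format numbers its arguments "1", "2", ...; pass them in that order.
    arguments = call["arguments"]
    ordered_args = [arguments[key] for key in sorted(arguments, key=int)]
    return tool(*ordered_args)

raw_output = (
    '{"name": "math.sqrt", "description": "Python square root function", '
    '"tool_id": 20, "arguments": {"1": 3342398}}'
)
result = run_tool_call(raw_output)
print(result)
```

The model only has to produce the tool call; the number itself comes from `math.sqrt`, which is why the result no longer depends on the model's arithmetic.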

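Nevertheless, note that function calling is a very difficult task for LLMs. Many things can go wrong:

The model may fail to generate valid JSON for the tool call.
The model may not correctly "understand" the query, e.g., it may call the tool with ID 19 instead of 20, or fail to set the right arguments.

Function calling should always be backed by control code, for example to make sure the JSON object is valid before using it and to check that the arguments can be converted to the format the tool expects. For example, suppose the model generates the following tool call: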

{
    "name": "math.sqrt",
    "description": "Python square root function",
    "tool_id": 20,
    "arguments": {
        "1": "sqrt of 3342398"
    }
}

The code should reject the call, because the argument "sqrt of 3342398" cannot be converted to the float type required by the square root function.
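Such control code could look like the sketch below. It is one possible way to do the check, not a definitive implementation: `validate_sqrt_call` is a hypothetical helper, and it assumes the illustrative tool-call format used above, with the single argument stored under the key "1".

```python
import json

def validate_sqrt_call(raw_output: str):
    """Return the argument as a float if the call is usable, otherwise None."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # the model did not produce valid JSON
    if not isinstance(call, dict):
        return None
    arguments = call.get("arguments")
    if not isinstance(arguments, dict) or "1" not in arguments:
        return None  # missing or malformed arguments
    try:
        value = float(arguments["1"])
    except (TypeError, ValueError):
        return None  # e.g. "sqrt of 3342398" cannot be converted to float
    if value < 0:
        return None  # math.sqrt would raise an error on negative input
    return value

good = '{"name": "math.sqrt", "tool_id": 20, "arguments": {"1": 3342398}}'
bad = '{"name": "math.sqrt", "tool_id": 20, "arguments": {"1": "sqrt of 3342398"}}'
truncated = '{"name": "math.sqrt"'

print(validate_sqrt_call(good))       # accepted
print(validate_sqrt_call(bad))        # rejected: argument is not a number
print(validate_sqrt_call(truncated))  # rejected: invalid JSON
```

Returning `None` instead of raising keeps the calling code simple: it can fall back to a plain text answer or ask the model to retry.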

xLAM: A Large Dataset for Function Calling

xLAM is a dataset for fine-tuning LLMs for function calling. It is available on the Hugging Face Hub:

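Salesforce/xlam-function-calling-60k (CC-BY license)

The dataset is gated. You need a Hugging Face account and an access token, and you must agree to Salesforce's terms (the form on the dataset card).

The dataset has only one split (train) and 4 columns:

id: the identifier of the example, e.g., "5,633"
query: a user prompt that requires function-calling capabilities, e.g.,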

What is the smallest number that both 12 and 15 can divide into without leaving a remainder?

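tools: the list of tools or functions that can be called to answer the query. Each tool is a JSON object with several fields, including a description of the expected parameters, e.g.,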

[{"name": "least_common_multiple", "description": "Computes the least common multiple (LCM) of two positive integers.", "parameters": {"a": {"description": "The first positive integer.", "type": "int"}, "b": {"description": "The second positive integer.", "type": "int"}}}]

answers: the list of tool and function calls derived from the query, with their arguments, e.g.,

[{"name": "least_common_multiple", "arguments": {"a": 12, "b": 15}}]

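An LLM fine-tuned for function calling should generate the "answers". Then, for the example above, we have to parse the model's answer to retrieve the name of the tool, "least_common_multiple", and its arguments, {"a": 12, "b": 15}.

We can imagine a Python program that simply checks whether we have this tool and whether the arguments are correct (number of arguments, types, etc.). A simple example could be: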

if tool == "least_common_multiple":
    check = check_args_least_common_multiple(args)
    if check:
        return least_common_multiple(args["a"], args["b"])

We would then pass the value returned by this code back to the user.
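To make the snippet above concrete, here is a self-contained sketch. The function names `least_common_multiple` and `check_args_least_common_multiple` come from the snippet, but their bodies and the `execute` helper are my own assumptions about what such a program could look like.

```python
import json
import math

def least_common_multiple(a: int, b: int) -> int:
    """Compute the LCM of two positive integers."""
    return a * b // math.gcd(a, b)

def check_args_least_common_multiple(args) -> bool:
    """Check that exactly "a" and "b" are present and are positive integers."""
    if not isinstance(args, dict) or set(args) != {"a", "b"}:
        return False
    return all(
        isinstance(v, int) and not isinstance(v, bool) and v > 0
        for v in args.values()
    )

def execute(answers: str) -> list:
    """Run every call in an xLAM-style "answers" string; None marks a rejected call."""
    results = []
    for call in json.loads(answers):
        tool, args = call["name"], call["arguments"]
        if tool == "least_common_multiple" and check_args_least_common_multiple(args):
            results.append(least_common_multiple(args["a"], args["b"]))
        else:
            results.append(None)  # unknown tool or invalid arguments
    return results

answers = '[{"name": "least_common_multiple", "arguments": {"a": 12, "b": 15}}]'
print(execute(answers))  # [60]
```

With more tools, the `if` chain would typically become a dictionary that maps each tool name to its checker and its function.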

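xLAM contains 60k training examples. This is more than enough.

Fine-Tuning Llama 3 on xLAM for Function Calling

Preprocessing xLAM

Fine-tuning an LLM for function calling is not very different from standard fine-tuning. The code is mostly the same. The main difference is in the data preprocessing. But we first have to decide what we want the LLM to generate given a user query.

For example, you may only want your LLM to generate the name of the tool to call and its arguments. This is easy for an LLM to learn.

In this tutorial, I trained the model to generate both the tools and the answers. I made this template: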

<user>{query}</user>
<tools>{tools}</tools>
<calls>{answers}</calls><|end_of_text|>

For example, it produces:

<user>Where can I find live giveaways for beta access and games?</user>
<tools>{'name': 'live_giveaways_by_type', 'description': 'Retrieve live giveaways from the GamerPower API based on the specified type.', 'parameters': {'type': {'description': 'The type of giveaways to retrieve (e.g., game, loot, beta).', 'type': 'str', 'default': 'game'}}}</tools>
<calls>{'name': 'live_giveaways_by_type', 'arguments': {'type': 'beta'}}
{'name': 'live_giveaways_by_type', 'arguments': {'type': 'game'}}</calls><|end_of_text|>

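Because a lot of JSON-style syntax has to be generated, this task is quite difficult for an LLM. To preprocess xLAM with this template, I used the following code: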

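# Assumes json, multiprocessing, load_dataset, and the tokenizer are already defined (see the QLoRA section)
ds = load_dataset("Salesforce/xlam-function-calling-60k", split="train")
# Wrap each field in its tag and append the EOS token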
def process(row):
    row["query"] = "<user>"+row["query"]+"</user>\n\n"
    tools = []
    for t in json.loads(row["tools"]):
      tools.append(str(t))
    answers = []
    for a in json.loads(row["answers"]):
      answers.append(str(a))
    row["tools"] = "<tools>"+"\n".join(tools)+"</tools>\n\n"
    row["answers"] = "<calls>"+"\n".join(answers)+"</calls>"
    row["text"] = row["query"]+row["tools"]+row["answers"]+tokenizer.eos_token
    return row
ds = ds.map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

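The training examples are now in a new column named "text".

QLoRA Fine-Tuning

I use QLoRA fine-tuning to reduce memory consumption.

Install the following libraries: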

pip install --upgrade bitsandbytes transformers peft accelerate datasets trl

Then, import:

import torch, os, multiprocessing, json
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from trl import SFTTrainer, SFTConfig

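Next, we check whether the GPU supports bfloat16; the code uses this check as a proxy for FlashAttention compatibility as well. If it is supported, we use both. We also set the padding token: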

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  os.system('pip install flash_attn')
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "meta-llama/Meta-Llama-3-8B"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = "<|eot_id|>"
tokenizer.pad_token_id = 128009
tokenizer.padding_side = 'left'

I wrapped the QLoRA fine-tuning in a function, to which we only need to pass the preprocessed xLAM:

def QLoRA(ds):
  bnb_config = BitsAndBytesConfig(
          load_in_4bit=True,
          bnb_4bit_quant_type="nf4",
          bnb_4bit_compute_dtype=compute_dtype,
          bnb_4bit_use_double_quant=True,
  )
  model = AutoModelForCausalLM.from_pretrained(
            model_name, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
  )
  model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})
  #Configure the pad token in the model
  model.config.pad_token_id = tokenizer.pad_token_id
  model.config.use_cache = False # Gradient checkpointing is used by default but not compatible with caching
  peft_config = LoraConfig(
          lora_alpha=16,
          lora_dropout=0.05,
          r=16,
          bias="none",
          task_type="CAUSAL_LM",
          target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
  )
  training_arguments = SFTConfig(
          output_dir="./Llama3_8b_xLAM",
          optim="adamw_8bit",
          per_device_train_batch_size=8,
          gradient_accumulation_steps=4,
          log_level="debug",
          save_steps=250,
          logging_steps=10,
          learning_rate=1e-4,
          fp16 = not torch.cuda.is_bf16_supported(),
          bf16 = torch.cuda.is_bf16_supported(),
          max_steps=1000,
          warmup_ratio=0.1,
          lr_scheduler_type="linear",
          dataset_text_field="text",
          max_seq_length=512,
  )
  trainer = SFTTrainer(
          model=model,
          train_dataset=ds,
          peft_config=peft_config,
          tokenizer=tokenizer,
          args=training_arguments,
  )
  trainer.train()

QLoRA(ds)

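I only trained for 1,000 steps and did not search for better hyperparameters. I recommend training for at least 1 epoch and setting "max_seq_length" to 1024 so that training examples are not truncated too often. These changes should greatly improve the adapter, but they also increase memory consumption, and training would take more than a day on consumer hardware. A lower learning rate may also help.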
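To see how far 1,000 steps are from a full epoch, here is a quick back-of-the-envelope calculation. It takes the batch settings from the training arguments above and assumes a single GPU and the round figure of 60,000 examples; `num_gpus` is my assumption, not a value from the configuration.

```python
import math

num_examples = 60_000          # approximate size of xLAM, as stated above
per_device_batch_size = 8      # per_device_train_batch_size
grad_accum_steps = 4           # gradient_accumulation_steps
max_steps = 1000               # max_steps
num_gpus = 1                   # assumption: single-GPU training

# Number of sequences that contribute to each optimizer step.
effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
# Number of training examples consumed after max_steps.
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / num_examples
# Steps needed to go through the whole dataset once.
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({epoch_fraction:.0%} of one epoch)")
print(f"steps for one epoch: {steps_per_epoch}")
```

Under these assumptions, 1,000 steps cover 32,000 examples, a little over half an epoch, and a full epoch takes 1,875 steps.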

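Once training is complete, you can try the fine-tuned adapter with the following code. First, we load the model and the adapter with the same quantization configuration used during fine-tuning: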

import torch,os
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  os.system('pip install flash_attn')
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'
quantization_config=BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
adapter= "./Llama3_8b_xLAM/checkpoint-1000"
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
print(f"Starting to load the model {model_name} into memory")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    torch_dtype=compute_dtype,
    device_map={"": 0},
    attn_implementation=attn_implementation,
)
print(model)
model = PeftModel.from_pretrained(model, adapter)

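Then, we can try some queries. Note: we do not want the model to be creative with function calls, but to output the most probable answer. In other words, we should deactivate sampling and set the temperature to 0.0.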

prompt = "<user>Check if the numbers 8 and 1233 are powers of two.</user>\n\n<tools>"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, do_sample=False, temperature=0.0, max_new_tokens=150)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

It generates:

<user>Check if the numbers 8 and 1233 are powers of two.</user>
<tools>is_power_of_two</tools>
<calls>{'name': 'is_power_of_two', 'arguments': {'num': 8}}
{'name': 'is_power_of_two', 'arguments': {'num': 1233}}</calls>

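For only 1,000 training steps, this is not too bad. However, the model did not generate a good tool description between <tools> and </tools>: it is not JSON and is very different from what we have in the training data. Longer training should fix this.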
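Before any tool can be executed, the calls have to be extracted from the generated text. The sketch below shows one way to do it. Since the template serializes each call with Python's `str()`, the calls use single quotes and are not valid JSON, so I parse them with `ast.literal_eval` instead of `json.loads`. The `extract_calls` helper is a hypothetical name, and it assumes one call per line, as in the template.

```python
import ast
import re

def extract_calls(generated: str) -> list:
    """Return the list of call dicts found between <calls> and </calls>."""
    match = re.search(r"<calls>(.*?)</calls>", generated, flags=re.DOTALL)
    if match is None:
        return []  # the model never produced a calls section
    calls = []
    for line in match.group(1).splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            # literal_eval only evaluates Python literals, never arbitrary code.
            call = ast.literal_eval(line)
        except (ValueError, SyntaxError, TypeError):
            continue  # skip malformed calls instead of failing
        if isinstance(call, dict) and "name" in call and "arguments" in call:
            calls.append(call)
    return calls

generated = """<user>Check if the numbers 8 and 1233 are powers of two.</user>

<tools>is_power_of_two</tools>

<calls>{'name': 'is_power_of_two', 'arguments': {'num': 8}}
{'name': 'is_power_of_two', 'arguments': {'num': 1233}}</calls>"""

calls = extract_calls(generated)
for call in calls:
    print(call["name"], call["arguments"])
```

Each extracted call should still go through the argument checks discussed earlier before the tool is actually run.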

I also trained an adapter for Llama 3 8B with LoRA. You can find it on the Hugging Face Hub:

kaitchup/Meta-Llama-3-8B-xLAM-Adapter (CC-BY license)

Note: This adapter is also undertrained. I provide it mainly for testing purposes.

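Conclusion

Fine-tuning an LLM for function calling enables it to perform accurate mathematical operations, retrieve up-to-date information, and call APIs to carry out various tasks.

With the xLAM dataset and its 60,000 function-calling entries, models such as Llama 3 can be trained to generate tool calls. However, as we saw, we may need to train the model for a long time before it properly learns to generate the expected format. Simplifying the format (e.g., switching from JSON to YAML) and removing the tool descriptions from the training data might help the model learn function calling.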
