vLLM Chat Templates

Contents: Background; How to use a chat template; generation prompt & add_generation_prompt; Extra inputs to chat templates; Tool use / function calling; How chat templates work; Multiple templates

Background

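I recently started using vLLM to run large language models. With the example code from the documentation (shown below), the model merely continued my text, like a base model, even though I was using an instruction-tuned model with chat ability. Looking back at the Hugging Face example code, I noticed a method called apply_chat_template, and that conversations usually carry roles such as "user" or "system". This made me realize that using a model's chat capability is not simply a matter of feeding it the raw input, so it is worth understanding the details behind it. Code that converts prompts into chat messages is available at: https://github.com/JinFish/EasyChatTemplating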

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="../../pretrained_models/llama3-chat")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
>>> Prompt: 'Hello, my name is', Generated text: ' Helen and I am a 35 year old mother of two. I am a'
>>> Prompt: 'The president of the United States is', Generated text: ' the head of the executive branch of the federal government, and is the highest-ranking'
>>> Prompt: 'The capital of France is', Generated text: ' Paris, and it is also the largest city in the country. It is situated'
>>> Prompt: 'The future of AI is', Generated text: ' full of endless possibilities, but it also poses significant challenges and risks. As AI'
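The completions above look like base-model output because the raw strings reach the model unchanged. A minimal sketch of the first step toward a fix is to wrap each prompt as a single-turn conversation. The helper name prompts_to_messages and the system text are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional

Message = Dict[str, str]


def prompts_to_messages(prompts: List[str], system: Optional[str] = None) -> List[List[Message]]:
    """Wrap each raw prompt as a single-turn conversation (hypothetical helper)."""
    conversations = []
    for prompt in prompts:
        messages: List[Message] = []
        if system is not None:
            # Optional system message placed before the user turn.
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": prompt})
        conversations.append(messages)
    return conversations


conversations = prompts_to_messages(
    ["The capital of France is"], system="You are a helpful assistant."
)
print(conversations[0])
# Each conversation could then be rendered with
# tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# and the resulting strings passed to llm.generate in place of the raw prompts.
```

The sketch only builds the message lists; rendering them still requires the tokenizer of the model being served.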

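Current large models are usually decoder-only. Whether a conversation is single-turn or multi-turn, it is fed to the model as one flat sequence, so special markers are needed to separate the roles and the turns. For example, what the user sees is: user: I had fried rice noodles this morning. assistant: Fried rice noodles are a common breakfast in Guangdong, but they are oily, so have them only occasionally. What is fed to the model is instead: <s><intp>I had fried rice noodles this morning.</intp> [ASST] Fried rice noodles are a common breakfast in Guangdong, but they are oily, so have them only occasionally.[/ASST] eos_token. Here <intp> and </intp> mark the user's input, [ASST] and [/ASST] mark the model's reply, and eos_token marks the end of the conversation.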
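The flattening described above can be sketched in plain Python. This is an illustration only: the function name render_toy_chat is hypothetical, and the markers are the illustrative ones from the paragraph, not the format of any particular model.

```python
from typing import Dict, List


def render_toy_chat(messages: List[Dict[str, str]], bos: str = "<s>", eos: str = "</s>") -> str:
    """Flatten a message list into one string using the illustrative markers."""
    parts = [bos]
    for message in messages:
        if message["role"] == "user":
            # User turns are wrapped in <intp> ... </intp>.
            parts.append(f"<intp>{message['content']}</intp>")
        elif message["role"] == "assistant":
            # Assistant turns are wrapped in [ASST] ... [/ASST].
            parts.append(f" [ASST] {message['content']}[/ASST]")
        else:
            raise ValueError(f"unsupported role: {message['role']}")
    # The end-of-sequence token closes the conversation.
    parts.append(f" {eos}")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]
print(render_toy_chat(chat))  # <s><intp>Hi</intp> [ASST] Hello[/ASST] </s>
```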

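In addition, the most common application of large models today is "chat". In a chat context, the language model no longer continues a single text string as it traditionally would. Instead it continues **a conversation made up of one or more "messages"**, where each message has a **"role"**, such as "user" or "assistant", together with the message content.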

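Just as different models have different tokenization, special tokens, and formats, different models also have different chat templates. The chat template is part of the tokenizer. It specifies how to convert a conversation, given as a list of messages, into a single tokenizable string in the format the model expects. Take mistralai/Mistral-7B-Instruct-v0.1 as an example: it uses <s> to mark the start of a conversation and </s> to mark the end of a turn, that is, the turn boundary, and it uses [INST] and [/INST] to wrap the user's input: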

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

tokenizer.apply_chat_template(chat, tokenize=False)
"<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

How to use a chat template

As the previous example shows, using a chat template is straightforward: define a list of messages with "role" and "content" keys, then pass that list to the tokenizer's apply_chat_template method.

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # You may want to use bfloat16 and/or move to GPU here

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))

Running the code above prints the conversation in the corresponding chat template format:

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s> 
<|user|>
How many helicopters can a human eat in one sitting?</s> 
<|assistant|>

generation prompt & add_generation_prompt

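The apply_chat_template call in the example above takes an add_generation_prompt argument, which is either True or False. When it is True, the template appends a fixed piece of text, the generation prompt, which marks the start of the assistant's reply. For example, take the following conversation: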

messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"}
]

With the argument set to False, the output in chat template format is:

tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
"""

With the argument set to True, the output is:

tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
"""<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""

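Compared with the previous example, the output now ends with an extra <|im_start|>assistant, the marker that precedes the model's reply. This ensures that when the model generates text it writes the assistant's reply rather than doing something unexpected, such as continuing the user's message. Note: **chat models are still just language models. They are trained to continue text, and to them a chat is just a special kind of text. We need to guide them with the appropriate control tokens so that they know what they are supposed to do.** Writing <|im_start|>assistant in advance tells the model that its output should be a reply, not a continuation of the user's input.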

However, not every model supports this feature. Some models have no special marker before the assistant's reply, so add_generation_prompt has no effect on them.
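The effect of add_generation_prompt can be reproduced with a small hand-written renderer for the <|im_start|> / <|im_end|> format shown above. This is a sketch for illustration, not the template shipped with any tokenizer; the function name render_chatml is an assumption.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Render messages in the <|im_start|>role ... <|im_end|> layout."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with a reply.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]
print(render_chatml(messages, add_generation_prompt=True))
```

The only difference between the two settings is the trailing assistant header, which is why the flag does nothing for formats that have no such header.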

Extra inputs to chat templates

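Usually it is enough to pass messages to apply_chat_template, but additional arguments can also be passed for the template to access. For example, passing tools enables tool use, and passing documents enables RAG.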

Tool use / function calling

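"Tool use" means the LLM can choose to call functions as external tools before generating its answer. To pass tools to a tool-capable model, simply pass a list of functions to the tools argument: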

from datetime import datetime

def current_time():
    """Get the current local time as a string."""
    return str(datetime.now())

def multiply(a: float, b: float):
    """
    A function that multiplies two numbers
    
    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b

tools = [current_time, multiply]

model_input = tokenizer.apply_chat_template(
    messages,
    tools=tools
)

For tools to work correctly, functions should be written in the format above so that they can be parsed as tools. Specifically, follow these rules:

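The function should have a descriptive name.

Every argument must have a type hint.

The function must have a docstring in the standard Google style (in other words, after the initial function description there must be an Args: block describing the arguments, unless the function has no arguments).

Do not include types in the Args: block. In other words, write a: The first number to multiply, not a (int): The first number to multiply. Type hints belong in the function header.

The function can have a return type and a Returns: block in the docstring. These are optional, however, because most tool-use models ignore them.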
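The rules above can be checked mechanically before a function is passed as a tool. The following is a minimal sketch using only the standard library; check_tool_function is a hypothetical helper, not part of transformers, and it covers only the docstring and type hint rules.

```python
import inspect
from typing import Callable, List


def check_tool_function(func: Callable) -> List[str]:
    """Return a list of rule violations for a candidate tool function."""
    problems = []
    doc = inspect.getdoc(func) or ""
    if not doc:
        problems.append("missing docstring")
    params = inspect.signature(func).parameters
    for name, param in params.items():
        # Every argument needs a type hint in the function header.
        if param.annotation is inspect.Parameter.empty:
            problems.append(f"argument '{name}' has no type hint")
    # Functions with arguments need an Args: block in the docstring.
    if params and "Args:" not in doc:
        problems.append("docstring has no Args: block")
    return problems


def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b


def untyped(a, b):
    return a + b


print(check_tool_function(multiply))  # []
print(check_tool_function(untyped))
```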

What happens when the model actually calls a tool?

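When the model generates a response, its output may contain a request to call a specific tool. This is usually expressed in a format predefined by the model or with special tokens; for example, the model might generate a statement like call tool <tool_name> with arg1=value1, arg2=value2.

The request has to be parsed: the tool name and the arguments needed to call it are extracted from the model output.

The model's tool call request is added to the conversation history to keep the context coherent. This makes the tool call part of the conversation, so later inference steps can refer to it.

The corresponding tool function is then actually called with the parsed name and arguments. This may be an external API call, running a piece of code, a database query, or any other kind of operation.

The result returned by the tool is added back to the conversation history, so that in later inference the model can access it and generate a richer, more specific response based on the information the tool provided.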

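The whole process forms a closed loop: the model states a need (a tool call), the application executes the tool, and the tool's result is fed back to the model so that it can use the result to continue the conversation meaningfully. This is very useful when building more complex applications, such as querying a database, performing calculations, or calling a weather API.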
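The parse and execute steps of this loop can be sketched as follows. The output format is an assumption: the sketch supposes the model wraps a JSON object with "name" and "arguments" keys in <tool_call> tags, and other models use other formats. The stub tool and the TOOLS registry are hypothetical.

```python
import json
import re
from typing import Any, Callable, Dict, Tuple

# Assumed format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_PATTERN = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_call(model_output: str) -> Tuple[str, Dict[str, Any]]:
    """Extract the tool name and arguments from the model output."""
    match = TOOL_CALL_PATTERN.search(model_output)
    if match is None:
        raise ValueError("no <tool_call> block found")
    payload = json.loads(match.group(1))
    return payload["name"], payload["arguments"]


def get_current_temperature(location: str, unit: str) -> float:
    return 22.0  # Stub standing in for a real temperature lookup.


# Registry mapping tool names to the Python callables that implement them.
TOOLS: Dict[str, Callable[..., Any]] = {"get_current_temperature": get_current_temperature}

output = (
    '<tool_call>\n{"arguments": {"location": "Paris, France", "unit": "celsius"}, '
    '"name": "get_current_temperature"}\n</tool_call><|im_end|>'
)
name, arguments = parse_tool_call(output)
result = TOOLS[name](**arguments)  # Dispatch to the matching function.
print(name, arguments, result)
```

Looking the function up in an explicit registry, rather than evaluating the name, keeps the model from calling anything that was not deliberately exposed as a tool.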

Below is a concrete example of a model calling tools, using the 8B Hermes-2-Pro model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "NousResearch/Hermes-2-Pro-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, revision="pr/13")
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto")

def get_current_temperature(location: str, unit: str) -> float:
    """
    Get the current temperature at a location.
    
    Args:
        location: The location to get the temperature for, in the format "City, Country"
        unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"])
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

def get_current_wind_speed(location: str) -> float:
    """
    Get the current wind speed in km/h at a given location.
    
    Args:
        location: The location to get the wind speed for, in the format "City, Country"
    Returns:
        The current wind speed at the given location in km/h, as a float.
    """
    return 6.  # A real function should probably actually get the wind speed!

tools = [get_current_temperature, get_current_wind_speed]

messages = [
  {"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."},
  {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

inputs = tokenizer.apply_chat_template(messages, chat_template="tool_use", tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))

After the user's messages and the functions have been processed by the tokenizer, feeding the result to the model gives:

<tool_call>
{"arguments": {"location": "Paris, France", "unit": "celsius"}, "name": "get_current_temperature"}
</tool_call><|im_end|>

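As we can see, the model's output is a tool call instruction in a structured format that tells the system to call a specific tool or API:

<tool_call> and </tool_call> are tags that mark the start and end of the tool call in the model output.

The content inside {...} is a JSON object containing the information needed to call the tool.

"name": "get_current_temperature" gives the name of the tool or function the model wants to call, here the one that gets the current temperature.

The "arguments" field contains the arguments passed to the tool. It is itself a JSON object containing:

"location": "Paris, France", which specifies Paris, France as the location to query.

"unit": "celsius", which indicates that the temperature should be in degrees Celsius.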

Next, the tool call has to be appended to the conversation so that the model knows which tool it called and what the result was:

First, generate a random tool call ID that uniquely identifies this tool call within the conversation, and record the tool's type, name, and arguments:

tool_call_id = "vAHdf3"  # Random ID, should be unique for each tool call
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France", "unit": "celsius"}}
messages.append({"role": "assistant", "tool_calls": [{"id": tool_call_id, "type": "function", "function": tool_call}]})

Then add the tool's return value to the conversation:

messages.append({"role": "tool", "tool_call_id": tool_call_id, "name": "get_current_temperature", "content": "22.0"})
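The two appends above could be wrapped in a small helper so that the ID stays consistent between the assistant message and the tool message. This is a sketch only: record_tool_exchange is a hypothetical name, and using uuid to generate the ID is an assumption, since any scheme that gives unique IDs would do.

```python
import uuid
from typing import Any, Dict, List


def record_tool_exchange(
    messages: List[Dict[str, Any]], name: str, arguments: Dict[str, Any], result: Any
) -> str:
    """Append the assistant's tool call and the tool's result to the conversation."""
    tool_call_id = uuid.uuid4().hex[:8]  # Short random ID, unique per tool call.
    tool_call = {"name": name, "arguments": arguments}
    messages.append(
        {
            "role": "assistant",
            "tool_calls": [{"id": tool_call_id, "type": "function", "function": tool_call}],
        }
    )
    # The tool result is stored as a string, matching the message format above.
    messages.append(
        {"role": "tool", "tool_call_id": tool_call_id, "name": name, "content": str(result)}
    )
    return tool_call_id


history: List[Dict[str, Any]] = []
call_id = record_tool_exchange(
    history, "get_current_temperature", {"location": "Paris, France", "unit": "celsius"}, 22.0
)
print(history)
```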

Finally, let the model read this information and continue generating the reply to the user:

inputs = tokenizer.apply_chat_template(messages, chat_template="tool_use", tools=tools, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))

"The current temperature in Paris, France is 22.0 ° Celsius.<|im_end|>"

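In fact, the model never reads the code inside the function. What it cares about is the function definition, the arguments to pass, and their descriptions: what the tool does and how to use it, not how it works. The function is therefore converted to a JSON schema for the model to read: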

from transformers.utils import get_json_schema

def multiply(a: float, b: float):
    """
    A function that multiplies two numbers
    
    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b

schema = get_json_schema(multiply)
print(schema)
# As shown, the result contains only descriptive information about the function, not its implementation or source code
{
  "type": "function", 
  "function": {
    "name": "multiply", 
    "description": "A function that multiplies two numbers", 
    "parameters": {
      "type": "object", 
      "properties": {
        "a": {
          "type": "number", 
          "description": "The first number to multiply"
        }, 
        "b": {
          "type": "number",
          "description": "The second number to multiply"
        }
      }, 
      "required": ["a", "b"]
    }
  }
}
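To make it concrete where each schema field comes from, here is a simplified stand-in written with the standard library only. It is not the transformers implementation: simple_json_schema is a hypothetical name, only four Python types are mapped, and it assumes every argument has a type hint and a one-line entry in the Args: block.

```python
import inspect
import re
from typing import Any, Callable, Dict, get_type_hints

# Illustrative subset of the Python-to-JSON type mapping.
JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def simple_json_schema(func: Callable) -> Dict[str, Any]:
    """Build a tool schema from the signature and Google-style docstring."""
    doc = inspect.getdoc(func) or ""
    description, _, args_block = doc.partition("Args:")
    # Collect lines such as "a: The first number to multiply".
    arg_docs = dict(re.findall(r"^[ \t]*(\w+):[ \t]*(.+)$", args_block, flags=re.MULTILINE))
    hints = get_type_hints(func)
    properties = {}
    for name in inspect.signature(func).parameters:
        properties[name] = {
            "type": JSON_TYPES[hints[name]],
            "description": arg_docs[name].strip(),
        }
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": description.strip(),
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),
            },
        },
    }


def multiply(a: float, b: float):
    """
    A function that multiplies two numbers

    Args:
        a: The first number to multiply
        b: The second number to multiply
    """
    return a * b


print(simple_json_schema(multiply))
```

Nothing in the function body is inspected: the name comes from the definition, the types from the header, and the descriptions from the docstring.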

How chat templates work

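Converting "messages" into a format the model understands is the job of the chat template. Chat templates are usually stored under the chat_template key in the model's tokenizer_config.json and are assigned to the tokenizer's chat_template attribute when the tokenizer is loaded. The chat template of blenderbot-400M-distill looks like this: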

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

tokenizer.default_chat_template
"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ '  ' }}{% endif %}{% endfor %}{{ eos_token }}"

Chat templates are usually written in Jinja2. Here is the same template in a more readable layout:

{%- for message in messages %}
    {%- if message['role'] == 'user' %}
        {{- ' ' }}
    {%- endif %}
    {{- message['content'] }}
    {%- if not loop.last %}
        {{- '  ' }}
    {%- endif %}
{%- endfor %}
{{- eos_token }}

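Because Jinja2 preserves the indentation and newlines around tags in a template, a - is added inside the control statements and expressions to strip the whitespace before them. Translated into Python, the template above looks roughly like the following. Each print here stands for appending text to the output; unlike Python's print, the template adds no newline: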

for idx, message in enumerate(messages):
    if message['role'] == 'user':
        print(' ')
    print(message['content'])
    if not idx == len(messages) - 1:  # Check for the last message in the conversation
        print('  ')
print(eos_token)

In summary, the template above does the following:

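It iterates over the messages; if a message's role is user, it first adds a space.

It adds the message content.

If the message is not the last one, it adds two spaces.

After the last message, it adds eos_token.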
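The same steps can be written as a function that builds the string directly, which makes the exact spacing visible. This is a hand-written sketch of the template's logic, not code taken from the tokenizer; the name render_blenderbot and the default eos_token value are assumptions.

```python
from typing import Dict, List


def render_blenderbot(messages: List[Dict[str, str]], eos_token: str = "</s>") -> str:
    """Concatenate messages the way the template above does."""
    pieces = []
    for index, message in enumerate(messages):
        if message["role"] == "user":
            pieces.append(" ")  # One space before each user message.
        pieces.append(message["content"])
        if index != len(messages) - 1:
            pieces.append("  ")  # Two spaces between messages.
    pieces.append(eos_token)  # End-of-sequence token after the last message.
    return "".join(pieces)


chat = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
]
print(repr(render_blenderbot(chat)))  # ' Hello  Hi</s>'
```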

Multiple templates

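Some models use different templates for different situations: one for chat, another for tool use or RAG. In that case the tokenizer's chat_template attribute is a dictionary containing several templates, where each key is a template name and there may be a key named default that is used for most cases. This can cause ambiguity or confusion, so where possible it is best to use a single template for all situations.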
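How a template might be picked from such a dictionary can be sketched as below. The selection rule is an assumption made for illustration (explicit name first, then tool_use when tools are passed, then default) and should not be read as the library's exact logic; select_template is a hypothetical helper.

```python
from typing import Dict, Optional, Union


def select_template(
    chat_template: Union[str, Dict[str, str]],
    name: Optional[str] = None,
    has_tools: bool = False,
) -> str:
    """Pick one template string from a single template or a dict of named templates."""
    if isinstance(chat_template, str):
        return chat_template  # Only one template exists, nothing to choose.
    if name is not None:
        if name not in chat_template:
            raise KeyError(f"no template named {name!r}")
        return chat_template[name]
    if has_tools and "tool_use" in chat_template:
        return chat_template["tool_use"]
    if "default" in chat_template:
        return chat_template["default"]
    raise ValueError("no template name given and no 'default' template available")


templates = {"default": "<chat template>", "tool_use": "<tool template>"}
print(select_template(templates))
print(select_template(templates, has_tools=True))
```

Every extra branch here is a place where the wrong template can be chosen silently, which illustrates why a single template is preferable when possible.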

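Before the more flexible chat template mechanism was introduced, a model's chat handling was constrained by its internal implementation. Different models could therefore behave differently on chat input, depending mainly on their architecture and training data. For backward compatibility, when the new chat template mechanism was introduced, the old model-class-based handling was kept as the "default" template.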

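If a model has no chat template set explicitly but its model class defines a "default" template, TextGenerationPipeline and other related methods use this class-level template to process chat conversations. This ensures that the model can still handle conversations in some preset way even if the developer has not configured a chat template.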

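To find out what a particular model's default template is, inspect the tokenizer's default_chat_template attribute, which describes the model's default way of handling chat. Note that newer transformers releases have deprecated and removed class-level default templates, so this attribute may not exist in your version.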

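Although "default" templates are convenient for backward compatibility, the official documentation strongly recommends not relying on them and instead setting the chat_template attribute explicitly. This has several benefits: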

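It makes explicit that the model has been correctly configured for chat.

It gives more flexibility and control, since the template can be adjusted for the specific application.

It reduces the risk posed by default templates being removed in future updates.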

This article is based on https://huggingface.co/docs/transformers/chat_templating

Summary