A Hands-On Guide to Extending the LLaMA Tokenizer with Chinese Tokens

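Large language models are growing explosively, and a large share of them are built on the LLaMA family. The original LLaMA tokenizer, however, handles Chinese poorly. This article walks through how to extend its vocabulary so that Chinese text is tokenized more efficiently.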

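The most common approach today is to train a Chinese vocabulary with sentencepiece, which is installed with pip install sentencepiece. Next we prepare a corpus; here it is the text of the novel 《斗破苍穹》 (Battle Through the Heavens).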

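# Read the raw novel text and split it into lines.
with open("data.txt", "r", encoding="utf-8") as fp:
    data = fp.read().strip().split("\n")

sentences = []
for d in data:
    d = d.strip()
    # Skip separator lines, empty lines, and the source header line.
    if "===" in d or len(d) == 0 or d == "《斗破苍穹》来自:":
        continue
    sentences.append(d)

# Write the cleaned corpus, one sentence per line.
with open("corpus.txt", "w", encoding="utf-8") as fp:
    fp.write("\n".join(sentences))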
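The training step in the next section passes a max_sentence_length argument. To my understanding SentencePiece measures this limit in bytes and skips lines that exceed it, so it can be worth checking the longest line of the corpus first. The helper name `corpus_line_stats` is hypothetical and the sample lines are made up for illustration; this is a minimal sketch, not part of the original workflow.

```python
def corpus_line_stats(lines):
    """Return (number of lines, longest line length in UTF-8 bytes)."""
    byte_lengths = [len(line.encode("utf-8")) for line in lines]
    return len(byte_lengths), max(byte_lengths, default=0)


# Illustrative sample; in practice pass the lines read from corpus.txt.
sample = ["你好", "麒麟，是中国古代神话中的一种瑞兽"]
print(corpus_line_stats(sample))  # (2, 48)
```

Each of the Chinese characters in the sample takes 3 bytes in UTF-8, so a line's byte length is roughly three times its character count.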

!pip install sentencepiece

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: sentencepiece in e:\miniconda3\lib\site-packages (0.1.99)



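Now train the model. A few parameters deserve attention: model_type selects the segmentation algorithm and is set to bpe, and split_digits and byte_fallback are both set to True to match LLaMA. max_sentence_length is set to a fairly large value. For more on the parameters, see https://zhuanlan.zhihu.com/p/655281268 and https://zhuanlan.zhihu.com/p/639144223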

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    input_format='text',
    model_prefix='tokenizer',
    vocab_size=10000,
    character_coverage=0.9995,
    model_type="bpe",
    num_threads=32,
    split_digits=True,
    byte_fallback=True,
    max_sentence_length=24000
)

Training takes roughly 30 seconds and produces two files in the current directory, tokenizer.model and tokenizer.vocab. Let's look at how the trained model segments text:

import sentencepiece as spm

# Load the model trained above.
sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load('tokenizer.model')

print(sp_bpe.encode_as_pieces('The excellence of a translation can only be judged by noting'))
print('Number of tokens:', len(sp_bpe.encode_as_pieces('The excellence of a translation can only be judged by noting')))
print(sp_bpe.encode_as_pieces('麒麟，是中国古代神话中的一种瑞兽'))
print('Number of tokens:', len(sp_bpe.encode_as_pieces('麒麟，是中国古代神话中的一种瑞兽')))

['▁', '<0x54>', 'h', 'e', '▁', 'e', 'x', 'c', 'e', 'l', 'l', 'e', 'n', 'c', 'e', '▁', 'o', 'f', '▁', 'a', '▁', 't', 'r', 'a', 'n', 's', 'l', 'a', 't', 'i', 'o', 'n', '▁', 'c', 'a', 'n', '▁', 'o', 'n', 'l', 'y', '▁', 'b', 'e', '▁', 'j', 'u', 'd', 'g', 'e', 'd', '▁', 'b', 'y', '▁', 'n', 'o', 't', 'i', 'n', 'g']
Number of tokens: 61
['▁', '<0xE9>', '<0xBA>', '<0x92>', '麟', ',', '是', '中', '国', '古', '代', '神', '话', '中', '的一种', '<0xE7>', '<0x91>', '<0x9E>', '兽']
Number of tokens: 19
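The pieces of the form <0xE9> in the output come from byte_fallback=True: a character that is not in the vocabulary is emitted as its UTF-8 bytes, one piece per byte. The sketch below decodes such a run of byte pieces back into text; the helper name `decode_byte_pieces` is hypothetical and it only handles pieces written in the <0xNN> form shown above.

```python
import re

# Matches byte-fallback pieces such as '<0xE9>'.
BYTE_PIECE = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def decode_byte_pieces(pieces):
    """Decode a run of byte-fallback pieces into the text they encode."""
    values = []
    for piece in pieces:
        match = BYTE_PIECE.match(piece)
        if match is None:
            raise ValueError(f"not a byte-fallback piece: {piece!r}")
        values.append(int(match.group(1), 16))
    return bytes(values).decode("utf-8")


print(decode_byte_pieces(["<0xE9>", "<0xBA>", "<0x92>"]))  # 麒
print(decode_byte_pieces(["<0xE7>", "<0x91>", "<0x9E>"]))  # 瑞
```

So the two three-piece runs in the Chinese output are the characters 麒 and 瑞, which apparently did not make it into the trained vocabulary.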

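The output shows that, because the training corpus is almost entirely Chinese, Chinese is segmented better than English: some common Chinese words become single tokens, while English is split into very small pieces. Next, we merge this vocabulary with the original LLaMA vocabulary.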

import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]="python"
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm

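# Paths
llama_tokenizer_dir = "llama2-7b-hf"  # LLaMA 2 model directory
chinese_sp_model_file = "tokenizer.model"  # the model trained above

# Load both tokenizers and parse them into protobuf ModelProto objects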
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_file)
llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

# Print both vocabulary sizes and the special tokens of the original LLaMA tokenizer
print(f'LLaMA 2 vocabulary size: {len(llama_tokenizer)}, newly trained vocabulary size: {len(chinese_sp_model)}')
print(llama_tokenizer.all_special_tokens)  # special tokens
print(llama_tokenizer.all_special_ids)  # ids of the special tokens
print(llama_tokenizer.special_tokens_map)

LLaMA 2 vocabulary size: 32000, newly trained vocabulary size: 10000
['<s>', '</s>', '<unk>']
[1, 2, 0]
{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}

Now add the new pieces to the LLaMA vocabulary. At this point you can also add words of your own directly, such as domain-specific terms.

llama_spm_tokens_set = set(p.piece for p in llama_spm.pieces)

print(f"Vocabulary size before merging: {len(llama_spm_tokens_set)}")
for p in chinese_spm.pieces:
    piece = p.piece
    # Only append pieces that the LLaMA vocabulary does not already contain.
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        llama_spm.pieces.append(new_p)

print(f"Vocabulary size after merging: {len(llama_spm.pieces)}")

Vocabulary size before merging: 32000
Vocabulary size after merging: 41013

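41013 - 32000 = 9013 pieces were added, fewer than the 10,000 we trained. The difference comes from the membership check in the loop above: the 987 trained pieces that already exist in the LLaMA vocabulary (for example the byte-fallback pieces and special tokens that both models share) are skipped, so only 9013 new pieces are appended.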
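The same selection logic can be tried on plain Python lists, which makes the arithmetic easy to see. This is an illustrative sketch with a hypothetical helper name (`pieces_to_add`) and made-up toy vocabularies; unlike the loop above, it also guards against duplicates inside the candidate list, which is useful when adding hand-picked words.

```python
def pieces_to_add(existing_pieces, candidate_pieces):
    """Return the candidates absent from existing_pieces, in original order."""
    seen = set(existing_pieces)
    added = []
    for piece in candidate_pieces:
        if piece not in seen:
            seen.add(piece)  # also skips repeats among the candidates
            added.append(piece)
    return added


# Toy vocabularies, not the real LLaMA or trained vocabularies.
existing = ["<unk>", "<s>", "</s>", "<0xE9>", "中", "国"]
candidates = ["<unk>", "<s>", "</s>", "<0xE9>", "中", "中的", "一种"]

new = pieces_to_add(existing, candidates)
print(new)                         # ['中的', '一种']
print(len(candidates) - len(new))  # 5
```

In the toy example 5 of the 7 candidates are already present, just as 987 of the 10,000 trained pieces were already present in the real run.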

# Save the merged model
output_sp_dir = 'merged_tokenizer_sp_test'
output_hf_dir = 'merged_tokenizer_hf_test'

os.makedirs(output_sp_dir,exist_ok=True)
os.makedirs(output_hf_dir,exist_ok=True)
with open(output_sp_dir+'/chinese_llama.model', 'wb') as f:
    f.write(llama_spm.SerializeToString())
tokenizer = LlamaTokenizer(vocab_file=output_sp_dir+'/chinese_llama.model')

tokenizer.save_pretrained(output_hf_dir)
print(f"Chinese-LLaMA tokenizer has been saved to {output_hf_dir}")

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Chinese-LLaMA tokenizer has been saved to merged_tokenizer_hf_test
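Because the new pieces are appended to the end of the piece list, the ids of the original 32000 pieces should not change, which matters if the tokenizer is later used with weights trained against those ids. A quick way to confirm this is to compare piece lists. The function name `existing_ids_unchanged` is hypothetical, and the lists would have to be collected separately (for example as [p.piece for p in model.pieces] from the original and merged models); the sketch below uses toy lists.

```python
def existing_ids_unchanged(original_pieces, merged_pieces):
    """True if merged_pieces starts with original_pieces in the same order."""
    return list(merged_pieces[: len(original_pieces)]) == list(original_pieces)


# Toy lists: the index of a piece in the list is its id.
original = ["<unk>", "<s>", "</s>", "▁The"]
merged = ["<unk>", "<s>", "</s>", "▁The", "中的", "一种"]

print(existing_ids_unchanged(original, merged))        # True
print(existing_ids_unchanged(original, merged[::-1]))  # False
```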

# Compare the original and the merged tokenizer
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)

text = "The excellence of a translation can only be judged by noting"
print("Original text:", text)
print(f"LLaMA tokens: {llama_tokenizer.tokenize(text)}")
print(f"LLaMA token count: {len(llama_tokenizer.tokenize(text))}")
print(f"Merged tokenizer tokens: {chinese_llama_tokenizer.tokenize(text)}")
print(f"Merged tokenizer token count: {len(chinese_llama_tokenizer.tokenize(text))}")

Original text: The excellence of a translation can only be judged by noting
LLaMA tokens: ['▁The', '▁excell', 'ence', '▁of', '▁a', '▁translation', '▁can', '▁only', '▁be', '▁jud', 'ged', '▁by', '▁not', 'ing']
LLaMA token count: 14
Merged tokenizer tokens: ['The', '▁excell', 'ence', '▁of', '▁a', '▁translation', '▁can', '▁only', '▁be', '▁jud', 'ged', '▁by', '▁not', 'ing']
Merged tokenizer token count: 14

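On English text the result is essentially unchanged: both tokenizers produce 14 tokens, and the only difference in the output above is that the first piece from the merged tokenizer lacks the leading ▁ marker.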

text = "麒麟，是中国古代神话中的一种瑞兽"
print("Original text:", text)
print(f"LLaMA tokens: {llama_tokenizer.tokenize(text)}")
print(f"LLaMA token count: {len(llama_tokenizer.tokenize(text))}")
print(f"Merged tokenizer tokens: {chinese_llama_tokenizer.tokenize(text)}")
print(f"Merged tokenizer token count: {len(chinese_llama_tokenizer.tokenize(text))}")

Original text: 麒麟，是中国古代神话中的一种瑞兽
LLaMA tokens: ['▁', '<0xE9>', '<0xBA>', '<0x92>', '<0xE9>', '<0xBA>', '<0x9F>', '，', '是', '中', '国', '古', '代', '神', '话', '中', '的', '一', '种', '<0xE7>', '<0x91>', '<0x9E>', '<0xE5>', '<0x85>', '<0xBD>']
LLaMA token count: 25
Merged tokenizer tokens: ['<0xE9>', '<0xBA>', '<0x92>', '麟', '，', '是', '中', '国', '古', '代', '神', '话', '中的', '一种', '<0xE7>', '<0x91>', '<0x9E>', '兽']
Merged tokenizer token count: 18
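The effect of the merge on the two sample sentences can be summarized as a relative token reduction. The helper below is a small illustrative sketch (the name `token_reduction` is hypothetical); the inputs are the token counts printed above.

```python
def token_reduction(before, after):
    """Fraction of tokens saved when going from `before` to `after` tokens."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


print(f"Chinese sample: {token_reduction(25, 18):.0%}")  # Chinese sample: 28%
print(f"English sample: {token_reduction(14, 14):.0%}")  # English sample: 0%
```

A single sentence is only an indication; measuring over a larger held-out text would give a more reliable figure.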

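This completes the extension of the LLaMA vocabulary with Chinese. Extending it for a vertical domain works the same way: prepare a training corpus for that domain, preferably mixed with general-domain text.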
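One simple way to mix a domain corpus with general-domain text before training is to sample general lines in proportion to the domain lines and shuffle the result. The article does not specify a mixing ratio, so the `general_per_domain` parameter, its default of 1.0, and the function name `mix_corpora` are all assumptions made for this sketch.

```python
import random


def mix_corpora(domain_lines, general_lines, general_per_domain=1.0, seed=0):
    """Combine all domain lines with a random sample of general lines."""
    rng = random.Random(seed)
    # Number of general lines to draw, capped by what is available.
    n_general = min(len(general_lines), int(len(domain_lines) * general_per_domain))
    mixed = list(domain_lines) + rng.sample(list(general_lines), n_general)
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for real corpus lines.
domain = ["domain line 1", "domain line 2"]
general = ["general line 1", "general line 2", "general line 3", "general line 4"]

mixed = mix_corpora(domain, general)
print(len(mixed))  # 4
```

The mixed lines can then be written to a file, one per line, and passed to the training step in place of corpus.txt.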

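### Summary
With the explosive growth of large language models, models based on the LLaMA family have come to hold an important position. The original LLaMA model, however, does not handle Chinese well. To improve this, the article describes how to extend the vocabulary so that Chinese text is tokenized better.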
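#### Main steps:
1. **Choose and install the tool**:
- Use the `sentencepiece` library to train a Chinese vocabulary; install it with `pip install sentencepiece`.
2. **Prepare the corpus**:
- Use the novel 《斗破苍穹》 as the training corpus, preprocess it by removing empty lines and marker lines, and save the result as `corpus.txt`.
3. **Train the vocabulary**:
- Train with `SentencePieceTrainer.train` from `sentencepiece`, with parameters such as `model_type="bpe"`, `split_digits=True`, and `byte_fallback=True` chosen to match LLaMA.
- Training produces the files `tokenizer.model` and `tokenizer.vocab`.
4. **Check the segmentation**:
- Load the trained model and test it on Chinese and English text; Chinese is segmented better than English.
5. **Merge the vocabularies**:
- Load the original LLaMA vocabulary and the newly trained Chinese vocabulary.
- Append the Chinese pieces to the LLaMA vocabulary, skipping pieces that are already present.
- Save the merged vocabulary as a new model file.
6. **Test the merged vocabulary**:
- Tokenize Chinese and English text with the merged tokenizer to verify the improvement on Chinese.
- In the example, the Chinese sentence goes from 25 tokens to 18, while the English sentence stays at 14.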
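#### Notes:
- When training the vocabulary, keep the parameter settings consistent with LLaMA to ensure compatibility.
- When merging, skip pieces that already exist so the vocabulary does not grow with duplicate entries.
- When extending the vocabulary for a vertical domain, mix the domain corpus with general-domain text so that the result generalizes better.
With these steps, the LLaMA vocabulary has been extended with Chinese, giving better support for Chinese text.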