LLaMA and Chinese Alpaca Model Deployment Test

Environment:

Xeon E5-2680v4 16C

40G RAM

Windows Server 2019 Standard Edition

Python 3.10

Dependencies:

accelerate==0.18.0
anyio==3.5.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.5
attrs==22.1.0
Babel==2.11.0
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==4.1.0
brotlipy==0.7.0
certifi==2022.12.7
cffi==1.15.1
chardet==5.1.0
charset-normalizer==3.1.0
colorama==0.4.6
comm==0.1.2
cryptography==39.0.1
debugpy==1.5.1
decorator==5.1.1
defusedxml==0.7.1
entrypoints==0.4
executing==0.8.3
fastjsonschema==2.16.2
filelock==3.12.0
fsspec==2023.4.0
huggingface-hub==0.14.0
idna==3.4
importlib-metadata==6.0.0
importlib-resources==5.2.0
ipykernel==6.19.2
ipython==8.12.0
ipython-genutils==0.2.0
jedi==0.18.1
Jinja2==3.1.2
json5==0.9.6
jsonschema==4.17.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter-server==1.23.4
jupyterlab==3.5.3
jupyterlab-pygments==0.1.2
jupyterlab_server==2.22.0
lxml==4.9.2
MarkupSafe==2.1.2
matplotlib-inline==0.1.6
mistune==0.8.4
mpmath==1.3.0
nbclassic==0.5.5
nbclient==0.5.13
nbconvert==6.5.4
nbformat==5.7.0
nest-asyncio==1.5.6
networkx==3.1
notebook==6.5.4
notebook_shim==0.2.2
numpy==1.24.3
packaging==23.1
pandocfilters==1.5.0
parso==0.8.3
peft==0.2.0
pickleshare==0.7.5
pip==23.0.1
pkgutil_resolve_name==1.3.10
platformdirs==2.5.2
prometheus-client==0.14.1
prompt-toolkit==3.0.36
protobuf==3.19.0
psutil==5.9.4
pure-eval==0.2.2
pycparser==2.21
Pygments==2.11.2
pyOpenSSL==23.0.0
pyrsistent==0.18.0
PySocks==1.7.1
python-dateutil==2.8.2
pytz==2022.7
pywin32==305.1
pywinpty==2.0.10
PyYAML==6.0
pyzmq==23.2.0
regex==2023.3.23
requests==2.28.2
Send2Trash==1.8.0
sentencepiece==0.1.98
setuptools==66.0.0
six==1.16.0
sniffio==1.2.0
soupsieve==2.4
stack-data==0.2.0
sympy==1.11.1
terminado==0.17.1
tinycss2==1.2.1
tokenizers==0.13.3
tomli==2.0.1
torch==2.0.0
tornado==6.2
tqdm==4.65.0
traitlets==5.7.1
transformers==4.29.0.dev0
typing_extensions==4.5.0
urllib3==1.26.15
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==0.58.0
wheel==0.38.4
win-inet-pton==1.1.0
zipp==3.11.0
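
Since the merge and conversion scripts are sensitive to library versions, it can help to compare the installed packages against the pins above before starting. The following is a minimal sketch, not part of the original setup: the helper names (installed_version, find_mismatches) and the choice of which packages to check are assumptions made for illustration.

```python
from importlib import metadata

# A handful of pins copied from the dependency list above.
PINNED = {
    "torch": "2.0.0",
    "transformers": "4.29.0.dev0",
    "peft": "0.2.0",
    "sentencepiece": "0.1.98",
    "accelerate": "0.18.0",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

The comparison is a strict string match, so a build that reports a local version suffix would be listed as a mismatch even if it is functionally the same release.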

C++ build tools: w64devkit-1.16.1

Pretrained model:

Original LLaMA 13B

Download link:

LoRA:

Chinese LLaMA 13B LoRA (later switched to Chinese Alpaca 13B LoRA)

Download link:

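Test deployment process:

1. Prepare the original LLaMA

Convert the model format: use the convert_llama_weights_to_hf.py script provided by transformers to convert the original LLaMA model to HuggingFace format. Place the original LLaMA tokenizer.model in the directory given by --input_dir, and the remaining files under ${input_dir}/${model_size}. After running the following command, --output_dir will contain the converted HF weights.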

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir path_to_original_llama_root_dir \
    --model_size 13B \
    --output_dir path_to_original_llama_hf_dir
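
The directory layout described above is easy to get wrong, so a quick check before running the conversion can save time. This is a minimal sketch under stated assumptions: the function name missing_llama_files is hypothetical, and the file names params.json and consolidated.*.pth are assumed from the original LLaMA release rather than taken from this write-up.

```python
from pathlib import Path


def missing_llama_files(input_dir, model_size):
    """List what is missing from the layout the conversion step expects."""
    root = Path(input_dir)
    model_dir = root / model_size
    missing = []
    # tokenizer.model sits directly in the input directory.
    if not (root / "tokenizer.model").is_file():
        missing.append("tokenizer.model")
    # Assumed file names for the per-size weight directory.
    if not (model_dir / "params.json").is_file():
        missing.append(f"{model_size}/params.json")
    shards = list(model_dir.glob("consolidated.*.pth")) if model_dir.is_dir() else []
    if not shards:
        missing.append(f"{model_size}/consolidated.*.pth")
    return missing
```

An empty list from missing_llama_files("path_to_original_llama_root_dir", "13B") would suggest the layout matches the description; anything else names what still needs to be moved into place.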

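2. Merge the LoRA weights:

This step extends LLaMA with the Chinese vocabulary and merges the LoRA weights to produce full model weights. You can choose to generate either PyTorch weights (.pth) or HuggingFace weights (.bin).

The .pth files are used for quantized deployment with llama.cpp.

The .bin files are used for: a) inference with Transformers; b) model deployment with text-generation-webui.

Merge with the script merge_llama_with_chinese_lora.py: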

python scripts/merge_llama_with_chinese_lora.py \
    --base_model path_to_original_llama_hf_dir \
    --lora_model path_to_chinese_llama_or_alpaca_lora \
    --output_type [pth|huggingface] \
    --output_dir path_to_output_dir

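Parameters:

    --base_model: directory containing the HF-format LLaMA weights and config files (generated in step 1)
    --lora_model: directory containing the unpacked Chinese LLaMA/Alpaca LoRA files; separate multiple LoRA directories with commas
    --output_type: output format, either pth or huggingface; defaults to pth if not specified
    --output_dir: directory in which to save the full model weights; defaults to ./
    (optional) --offload_dir: offload cache path, needed for low-memory users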

In this project, the LoRA order is the LLaMA LoRA first, then the Alpaca LoRA.
⚠️ The order of the models must not be changed ⚠️
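
Because the LoRA order is easy to get wrong when typing the command by hand, one option is to assemble the arguments in code. The sketch below is an illustration only: build_merge_command is a hypothetical helper, and it uses just the flags described in the parameter list above.

```python
import sys


def build_merge_command(base_model, lora_models, output_dir,
                        output_type="pth", offload_dir=None):
    """Build the argument list for merge_llama_with_chinese_lora.py."""
    if output_type not in ("pth", "huggingface"):
        raise ValueError(f"unsupported output type: {output_type}")
    command = [
        sys.executable, "scripts/merge_llama_with_chinese_lora.py",
        "--base_model", str(base_model),
        # Order matters: the list is joined exactly as given.
        "--lora_model", ",".join(str(p) for p in lora_models),
        "--output_type", output_type,
        "--output_dir", str(output_dir),
    ]
    if offload_dir is not None:
        command += ["--offload_dir", str(offload_dir)]
    return command
```

Passing the LLaMA LoRA directory first and the Alpaca LoRA directory second, then handing the result to subprocess.run(command, check=True), would reproduce the command shown above with the required order.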

While running, the script uses about 28 GB of memory and merges the weights layer by layer.
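
Given that figure, a rough pre-flight check can indicate whether --offload_dir is worth setting. This is a sketch under assumptions: the 28 GB threshold is simply the usage observed in this test with the 13B model, and the helper names are hypothetical. psutil is taken from the dependency list above.

```python
REQUIRED_GB = 28  # approximate usage observed in this test with the 13B model


def should_offload(available_gb, required_gb=REQUIRED_GB):
    """Return True when free RAM is below the observed requirement."""
    return available_gb < required_gb


def available_ram_gb():
    """Read free RAM via psutil (listed in the dependencies above)."""
    import psutil  # imported lazily so should_offload() works without it
    return psutil.virtual_memory().available / 1024 ** 3
```

Calling should_offload(available_ram_gb()) before the merge gives a simple yes/no hint; it does not account for memory used by other processes that start later.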

When the script finishes, it produces the model files with the merged weights.
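
To confirm that the output directory actually contains weight files of the expected type, the files can be tallied by extension. This is a minimal sketch; summarize_output is a hypothetical helper and makes no assumption about the exact file names the merge script writes.

```python
from collections import defaultdict
from pathlib import Path


def summarize_output(output_dir):
    """Return {file suffix: total size in bytes} for files directly in output_dir."""
    totals = defaultdict(int)
    for path in Path(output_dir).iterdir():
        if path.is_file():
            totals[path.suffix] += path.stat().st_size
    return dict(totals)
```

For a pth merge one would expect a large total under ".pth", and for a huggingface merge under ".bin"; a missing or near-empty entry suggests the run did not complete.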
