[Study notes] Ubuntu 22: deploying large language models with the quantization tool llama.cpp (CPU + GPU)

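Preface
1 Download and build llama.cpp
1.1 Clone the llama.cpp source with git
1.2 Build the source (make)
1.2.1 Option 1: run on CPU only
1.2.2 Option 2: use the GPU, building with cuBLAS
2 Quantize the model
2.1 Prepare the model
2.2 Generate the quantized model
3 Load the model
3.1 CPU
3.2 GPU
4 llama-cpp-python
4.1 Install llama-cpp-python
4.2 API
References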

Preface

Official repositories:
llama.cpp
llama-cpp-python
Environment:
CUDA Version: 12.2
Torch: 2.1.1
Python: 3.9

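1 Download and build llama.cpp

1.1 Clone the llama.cpp source with git

The server cannot reach GitHub, so I cloned the repository on a local machine first and then uploaded it to the server, including the hidden .git directory.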

git clone https://github.com/ggerganov/llama.cpp

1.2 Build the source (make)

The build produces binaries such as ./main and ./quantize.

cd llama.cpp

1.2.1 Option 1: run on CPU only

make

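1.2.2 Option 2: use the GPU, building with cuBLAS

This build uses the CUDA kernels of an Nvidia GPU for BLAS acceleration. Make sure the machine has both a GPU and CUDA installed.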

make LLAMA_CUBLAS=1

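2 Quantize the model

2.1 Prepare the model

llama.cpp can convert models in the PyTorch .pth format, the Hugging Face .safetensors format, and the ggmlv3 format that llama.cpp used previously.
Find a model in a suitable format on Hugging Face and download it into the models directory of llama.cpp,
or upload a model you have already downloaded locally to the models directory.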
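
Before converting, it can help to confirm that the model directory actually holds weights in one of the formats listed above. A minimal sketch, assuming only the .pth and .safetensors extensions named here; the helper name find_weight_files and the directory path in the comment are hypothetical and not part of llama.cpp:

```python
from pathlib import Path

# Extensions taken from the formats listed above (assumption: the extension
# alone identifies the format; real checkpoints may use other names as well).
SUPPORTED_SUFFIXES = {".pth", ".safetensors"}


def find_weight_files(model_dir):
    """Return the sorted weight files found directly inside model_dir."""
    root = Path(model_dir)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix in SUPPORTED_SUFFIXES
    )

# Example call (the path is illustrative):
# find_weight_files("./models/chinese-alpaca-2-7b-hf")
```

An empty result suggests the download is incomplete or the model is stored in a format not covered by this check.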

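2.2 Generate the quantized model

quantize offers quantization at a range of precisions. Quantization loses some accuracy (see the last part of the WIKI and the LLM quantization notes).
If quality close to Q4_0 or Q4_1 is enough (model size around 3.5 to 3.9G), Q4_K_M is a good choice: compared with standard Q4_1 the file is smaller (3.9G -> 3.8G), and the perplexity (ppl) increase relative to the F16 model is also smaller (0.1846 -> 0.0535).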
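
The sizes quoted above follow roughly from the number of bits stored per weight. A back-of-the-envelope sketch, assuming about 6.74 billion parameters for a 7B LLaMA model and the block layouts I understand Q4_0 and Q4_1 to use (32 weights per block, 18 and 20 bytes per block respectively). It ignores file metadata and any tensors kept at higher precision, so treat the output as an estimate only:

```python
GIB = 1024 ** 3


def bits_per_weight(block_bytes, weights_per_block=32):
    """Average storage cost of one weight in a block-quantized format."""
    return block_bytes * 8 / weights_per_block


def estimate_size_gib(n_params, bpw):
    """Approximate file size in GiB for n_params weights at bpw bits each."""
    return n_params * bpw / 8 / GIB


N_PARAMS_7B = 6.74e9  # assumption: approximate parameter count of a 7B LLaMA model

# Assumed block layouts: 16 bytes of packed 4-bit values plus one fp16 scale
# (Q4_0), or plus an fp16 scale and an fp16 minimum (Q4_1).
q4_0_bpw = bits_per_weight(18)  # 4.5 bits per weight
q4_1_bpw = bits_per_weight(20)  # 5.0 bits per weight

print(f"Q4_0: {estimate_size_gib(N_PARAMS_7B, q4_0_bpw):.2f} GiB")
print(f"Q4_1: {estimate_size_gib(N_PARAMS_7B, q4_1_bpw):.2f} GiB")
print(f"F16:  {estimate_size_gib(N_PARAMS_7B, 16):.2f} GiB")
```

Under these assumptions the 4-bit estimates come out near 3.5 GiB and 3.9 GiB, in line with the range quoted above.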

First convert the model to GGUF in FP16 format

python3 convert.py ./models/chinese-alpaca-2-7b-hf/

Then quantize the FP16 model to 4 bits

./quantize ./models/chinese-alpaca-2-7b-hf/ggml-model-f16.gguf ./models/chinese-alpaca-2-7b-hf/ggml-model-q4_0.gguf Q4_0
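
To see why 4-bit quantization loses accuracy, here is a deliberately simplified block quantizer in plain Python. It illustrates the general idea (one shared scale per block of 32 weights, 4-bit signed integers) and is not the exact scheme that ./quantize implements; the function names are my own:

```python
import random

BLOCK_SIZE = 32  # weights per block (assumption for this toy example)


def quantize_block(values):
    """Map a block of floats to integers in [-8, 7] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the scale and the 4-bit integers."""
    return [q * scale for q in quants]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)

# Rounding moves each weight by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.6f}  max abs error={max_error:.6f}")
```

Each restored weight differs from the original by up to half a quantization step, which is the kind of error that shows up as a perplexity increase in the quantized model.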

3 Load the model

(Covers a few usage modes once the model is loaded.)

3.1 CPU

./main -m ./models/chinese-alpaca-2-7b-hf/ggml-model-q4_0.gguf -n 128 --prompt "Once upon a time"

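Some parameters of main:

-m: the model to load
-ins: interactive mode; supports multi-turn conversation and keeps the context
-c: context length; larger values let the model refer to a longer conversation history (default: 512)
-n: maximum length of the generated reply (default: 128)
--temp: temperature; lower values make replies less random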
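
When driving ./main from a script, the parameters above can be assembled into an argument list. A minimal sketch using only the flags described in this section; build_main_command is a hypothetical helper, and the binary path assumes the script runs from the llama.cpp directory:

```python
import shlex


def build_main_command(model, prompt=None, n_ctx=512, n_predict=128,
                       temp=None, interactive=False, binary="./main"):
    """Build an argument list for ./main from the options described above."""
    cmd = [binary, "-m", model, "-c", str(n_ctx), "-n", str(n_predict)]
    if temp is not None:
        cmd += ["--temp", str(temp)]
    if interactive:
        cmd.append("-ins")
    if prompt is not None:
        cmd += ["--prompt", prompt]
    return cmd


cmd = build_main_command(
    "./models/chinese-alpaca-2-7b-hf/ggml-model-q4_0.gguf",
    prompt="Once upon a time",
    n_ctx=2048,
    temp=0.2,
)
# shlex.join quotes the prompt so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The list form can also be passed directly to subprocess.run(cmd) without going through a shell.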

# Interactive chat
./main -m ./models/chinese-alpaca-2-7b-hf/ggml-model-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.3
#chat with bob
./main -m ./models/chinese-alpaca-2-7b-hf/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

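3.2 GPU

CPU inference is slow, so the GPU can be used to speed it up. Adding the n_gpu_layers parameter offloads some layers to the GPU and speeds up inference; the right value depends on the model and on the available GPU memory.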

./main -m ./models/chinese-alpaca-2-7b-hf/ggml-model-q4_0.gguf -n 128 --n_gpu_layers 40 --prompt "Once upon a time"
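
One rough way to pick a starting value is to divide the memory you are willing to give the model by the approximate size of one layer. This is a heuristic of my own, not something llama.cpp provides: it assumes all layers are the same size and ignores the KV cache and scratch buffers, so leave some headroom. The 32-layer count is what I understand a 7B LLaMA model to have, in which case a larger value such as the 40 used above should simply offload every layer.

```python
def estimate_gpu_layers(model_size_mib, n_layers, vram_budget_mib):
    """Estimate how many layers fit in the given VRAM budget.

    Assumes every layer takes model_size_mib / n_layers and ignores the
    KV cache and other buffers, so the result is an upper bound to tune from.
    """
    per_layer_mib = model_size_mib / n_layers
    fit = int(vram_budget_mib // per_layer_mib)
    return max(0, min(n_layers, fit))


# Illustrative numbers: a ~3.5 GiB Q4_0 file with 32 layers (assumed).
print(estimate_gpu_layers(3584, 32, 2048))  # partial offload
print(estimate_gpu_layers(3584, 32, 8192))  # everything fits
```

In practice, start from the estimate, watch GPU memory usage while the model loads, and lower the value if loading fails.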

For other detailed examples, see the official documentation.

4 llama-cpp-python

With the llama-cpp-python API you can write Python programs that read files and run text generation tasks.

4.1 Install llama-cpp-python

pip install llama-cpp-python

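Note: for GPU acceleration, set CMAKE_ARGS="-DLLAMA_CUBLAS=on" in the environment before installing, as in the command below. If acceleration still does not work, see the reinstall steps at the end of section 4.2.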

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

4.2 API

For high-level API usage, see the High-level API section of the repository directly. To use GPU acceleration, just change the n_gpu_layers value.

from llama_cpp import Llama
llm = Llama(
      model_path="./models/7B/llama-model.gguf",
      # n_gpu_layers=-1, # Uncomment to use GPU acceleration
      # seed=1337, # Uncomment to set a specific seed
      # n_ctx=2048, # Uncomment to increase the context window
)
output = llm(
      "Q: Name the planets in the solar system? A: ", # Prompt
      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
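
The call returns a dictionary in the OpenAI completion style, so the generated text sits under choices[0]["text"]. The sketch below pulls out the useful fields. The sample dictionary is hand-written to mirror the shape shown in the llama-cpp-python README (its values are illustrative, not real model output), and summarize_completion is a hypothetical helper:

```python
def summarize_completion(output):
    """Extract text, finish reason and token usage from a completion dict."""
    choice = output["choices"][0]
    usage = output.get("usage", {})
    return {
        "text": choice["text"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Hand-written sample in the assumed response shape (values are illustrative).
sample_output = {
    "id": "cmpl-xxxxxxxx",
    "object": "text_completion",
    "created": 1679561337,
    "model": "./models/7B/llama-model.gguf",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 6, "total_tokens": 20},
}

summary = summarize_completion(sample_output)
print(summary["text"])
```

Because echo=True is set in the call above, the returned text starts with the prompt itself; with echo=False only the completion would be present.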

If GPU acceleration does not work after installation, reinstall following this solution:

conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit cuda-nvcc -y --copy
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
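
If the reinstall needs to be scripted from Python (for example inside a setup script), the same environment variables and pip flags can be passed through subprocess. A sketch that only builds the command and environment and leaves the actual call commented out; the variable names are mine, and the flags are copied from the command above:

```python
import os
import sys

# Environment for the build: copy the current one and add the two variables
# used in the shell command above.
build_env = dict(os.environ)
build_env["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=on"
build_env["FORCE_CMAKE"] = "1"

# Use the current interpreter's pip so the package lands in this environment.
pip_cmd = [
    sys.executable, "-m", "pip", "install",
    "--upgrade", "--force-reinstall", "--no-cache-dir",
    "llama-cpp-python",
]

print(" ".join(pip_cmd))
# To actually run the reinstall (this triggers a long native build):
# import subprocess
# subprocess.run(pip_cmd, env=build_env, check=True)
```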

References

[1] llama.cpp quantized deployment
[2] llama.cpp, a deployment tool for large models
[3] Installing llama-cpp-python with GPU Support

(Personal study notes for reference; please let me know if anything is wrong.)