llama.cpp制作GGUF文件及使用

llama.cpp的介绍

llama.cpp是一个开源项目，由Georgi Gerganov开发，旨在提供一个高性能的推理工具，专为在各种硬件平台上运行大型语言模型（LLMs）而设计。这个项目的重点在于优化推理过程中的性能问题，特别是针对CPU环境。以下是关于llama.cpp的几个关键特性：

高性能推理引擎：llama.cpp使用C语言编写的机器学习张量库ggml，这使得它能够高效地处理大规模的张量运算，从而加速模型推理。

模型量化工具：项目包含模型量化的功能，允许用户将原本的32位浮点数模型参数量化为16位浮点数，甚至是更低精度的8位或4位整数，从而减少模型大小并显著提高推理速度，这对于在资源受限的设备上运行大模型尤其重要。

跨平台兼容性：除了支持CPU推理外，llama.cpp还支持CUDA和OpenCL，这意味着它能够在包括桌面计算机、服务器乃至某些移动设备上的GPU上运行，提供了广泛的硬件兼容性。

易于部署：由于其优化的C++实现，llama.cpp使得在本地CPU上部署大型语言模型变得更加容易，即便是配置较低的设备也能运行这类模型，降低了部署大型AI应用的门槛。

代码可读性和教育价值：尽管功能强大，llama.cpp的代码结构相对直观且可读性强，适合开发者通过阅读源码来学习大型语言模型的推理技术和底层实现细节。项目文件数量不多，但每个都是精心设计的，便于理解和修改。

社区支持和活跃度：在GitHub上，该项目拥有大量的stars，表明了其在开发者社区中的高关注度和活跃度。这通常意味着更好的文档、示例以及持续的维护更新。

综上所述，llama.cpp是一个专为性能优化和广泛兼容性设计的工具，它不仅能够帮助研究人员和开发者在不同类型的硬件上高效运行大型语言模型，同时也是学习现代语言模型推理技术的一个优秀资源。

GGUF文件的制作

设备环境如下：Ubuntu20.04、NVIDIA-A800、CUDA Version: 12.0、python 3.10

#代码准备
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

#编译
make

# 获取官方模型权重并将其放入./models中
ls ./models
llama-2-7b tokenizer_checklist.chk tokenizer.model
# [可选] 对于使用 BPE 分词器的模型
ls ./models
<folder containing weights and tokenizer json> vocab.json
# [可选] 适用于 Mistral-7B 等 PyTorch.bin 模型
ls ./models
<folder containing weights and tokenizer json>

# 安装Python依赖项
python3 -m pip install -r requirements.txt

# 将模型转换为ggml FP16格式
python3 convert.py models/mymodel/

# [可选] 对于使用 BPE 分词器的模型
python convert.py models/mymodel/ --vocab-type bpe

# 将模型量化为 4 位（使用 Q4_K_M 方法）
./quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

# 如果现在不支持旧版本，请将 gguf 文件类型更新为当前版本
./quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY

运行量化模型

# 开始推理 gguf 模型
./main -m ./models/mymodel/ggml-model-Q4_K_M.gguf -p "你好" -n 128

# 交互式使用模型
./main -m models/openbuddy-llama3-8b-v21.1-8k/openbuddy-llama3-8b-Q4_K_M.gguf -n 999 -cml

# 启动兼容openai api 的HTTP server
./server -m models/openbuddy-llama3-8b-v21.1-8k/openbuddy-llama3-8b-Q4_K_M.gguf -c 4096 --host 0.0.0.0 --port 7861

命令行选项可见官方文档

模型量化的精度

根据自己的硬件配置来选择合适的精度

(llamacpp) root@9pp562fqj4j6n-0:/1_11_test/hhh/llama.cpp# ./quantize
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quatized model in the same shards as input  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 13.00G              @ 7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing

安卓上运行GGUF

手机配置

型号 Mi 9T Android11 运行内存 6GB

工具准备
手机端安装termux，官网
电脑端准备好Android NDK，将其解压至某个文件夹

使用Android NDK构建llama.cpp项目

# 代码准备，我担心影响上边编译好的，重新拉了一份代码，其实是不影响的
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

#开始构建
mkdir build-android
cd build-android

# 查看你的ndk文件夹路径
export NDK=<your_ndk_directory>

cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..

make

上面代码执行完以后，目录下会生成bin文件夹，如下图所示

将构建后的文件和模型文件移动到手机端
我这边采用的usb文件传输，直接复制过去了

模型文件记得也要传输过去

手机端工作准备
手机打开termux
运行以下命令开启SD卡访问权限

termux-setup-storage

执行完这个命令后，手机会弹出是否允许访问权限的，一定要点允许。
移动bin文件夹和模型文件到termux的根目录下

文件启动
文件移动完成后，进入到bin文件夹，执行以下命令给所有的文件添加可执行权限

chmod +x ./*

使用以下命令启动模型

./main -m ../openbuddy-llama3-8b-Q2_K.gguf -n 128 -cml