Using llama.cpp - Artificial Intelligence

The llama.cpp GitHub repository is ggerganov/llama.cpp: LLM inference in C/C++ (github.com). For actual usage, defer to the official instructions.

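Introduction

llama.cpp aims to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, both locally and in the cloud.

Specifically, llama.cpp can convert a trained model to a general-purpose format such as gguf, quantize it, and run it either as a server or from the command line.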

Installation

# install the build tools first
apt install make cmake gcc
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp/
mkdir build
cd build
cmake ..
cmake --build . --config Release
make install

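Basic usage (using a fine-tuned model as the example)

Note: the library is updated continuously, so defer to the official repository's instructions. Many tutorials online are based on older versions. After an update on June 12, 2024, the executables were renamed, so the quantize, main, and server commands used in many tutorials can no longer be found. In the current version (as of July 20, 2024) they are named llama-quantize, llama-cli, and llama-server respectively.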
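When following an older tutorial, the rename can be handled with a small lookup. The sketch below is only an illustration: the mapping contains just the three renames mentioned above, and the helper names `current_tool_name` and `find_tool` are hypothetical, not part of llama.cpp.

```python
import shutil
from typing import Optional

# Old executable names mentioned above, mapped to their current names.
RENAMED_TOOLS = {
    "quantize": "llama-quantize",
    "main": "llama-cli",
    "server": "llama-server",
}


def current_tool_name(name: str) -> str:
    """Return the post-rename name; names that are not in the table pass through."""
    return RENAMED_TOOLS.get(name, name)


def find_tool(name: str) -> Optional[str]:
    """Look up the (possibly renamed) executable on PATH; None if it is absent."""
    return shutil.which(current_tool_name(name))


if __name__ == "__main__":
    for old in RENAMED_TOOLS:
        location = find_tool(old) or "not found on PATH"
        print(f"{old} -> {current_tool_name(old)}: {location}")
```

Running it prints, for each old name, the new name and where (if anywhere) that executable was found on PATH.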

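After installation, you can create symbolic links to the commands in /usr/bin or /usr/local/bin so that llama.cpp can be used from any directory.

For example, from within /usr/bin: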

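# use an absolute path as the link target; a relative target is resolved relative to the link's own directory
ln -s /your/path/to/llama.cpp/build/bin/llama-quantize llama-quantize
ln -s /your/path/to/llama.cpp/build/bin/llama-server llama-server
ln -s /your/path/to/llama.cpp/build/bin/llama-cli llama-cli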

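Model conversion and quantization

See the official documentation for the models the library supports. The following uses the qwen1.5-14b-chat model as an example.

After running the swift and infer scripts to fine-tune, validate, and merge the model (see the ModelScope community's model training documentation for that process), convert the model to gguf format and quantize it. From the llama.cpp directory: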

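# 1. qwen models must first be converted with convert-hf-to-gguf.py, and only then reduced in precision
# issue: KeyError: 'transformer.h.0.attn.c_attn.bias' #4304
# Converting the 14b model to f16 produces a file of roughly 28G
# My trained model is located at /root/autodl-tmp/swift/examples/pytorch/llm/output/qwen-14b-chat/v2-20240514-122025/checkpoint-93-merged; replace this with your own path
# Note: newer versions of the repository name this script convert_hf_to_gguf.py; check which one your checkout contains
python convert-hf-to-gguf.py /root/autodl-tmp/swift/examples/pytorch/llm/output/qwen-14b-chat/v2-20240514-122025/checkpoint-93-merged --outfile my-qwen-14b.gguf --outtype f16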

# 2. Use llama-quantize to change the precision
# Run llama-quantize --help to see the supported precisions and further usage
llama-quantize ./my-qwen-14b.gguf my-qwen-14b-q8_0.gguf q8_0
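The "roughly 28G" figure for the f16 file can be checked with simple arithmetic: f16 stores 2 bytes per weight. The sketch below also estimates the q8_0 size, using the fact that q8_0 stores each block of 32 weights as 32 8-bit values plus one 16-bit scale, which works out to 8.5 bits per weight. The round parameter count of 14e9 is an assumption, and the estimate ignores file metadata and any tensors kept at a different precision, so treat the results as ballpark numbers only.

```python
# Bits per weight for the two formats used above.
# f16:  16 bits per weight.
# q8_0: 32 int8 values + one 16-bit scale per block of 32 weights.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per weight
}


def estimate_size_gb(n_params: float, outtype: str) -> float:
    """Rough file size in GB (10^9 bytes) for n_params weights stored as outtype."""
    return n_params * BITS_PER_WEIGHT[outtype] / 8 / 1e9


if __name__ == "__main__":
    n_params = 14e9  # assumed round figure for a "14b" model
    for outtype in ("f16", "q8_0"):
        print(f"{outtype}: ~{estimate_size_gb(n_params, outtype):.1f} GB")
```

Under these assumptions the f16 file comes out at about 28 GB and the q8_0 file at a little over half of that.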

Running the model

The model can be run with llama-cli or llama-server.

llama-cli:

llama-cli -m my-qwen-14b-q8_0.gguf -p "you are a helpful assistant" -cnv -ngl 24

# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!

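Where:

-m is followed by the model file to run
-cnv runs the model in conversation mode
-ngl: when llama.cpp is compiled with GPU support, this option offloads the given number of layers to the GPU for computation, which generally improves performance

For the other parameters, see the official documentation: llama.cpp/examples/main/README.md at master · ggerganov/llama.cpp (github.com)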
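If llama-cli is launched from a script, the same command line can be assembled programmatically. This is a minimal sketch that uses only the flags described above (-m, -p, -cnv, -ngl); the function name `build_llama_cli_args` and the choice to omit -ngl when no layers are offloaded are illustrative assumptions, not llama.cpp conventions.

```python
import shlex


def build_llama_cli_args(model_path, system_prompt, n_gpu_layers=0):
    """Assemble the llama-cli argument list for conversation mode."""
    args = ["llama-cli", "-m", model_path, "-p", system_prompt, "-cnv"]
    if n_gpu_layers > 0:
        # Only meaningful when llama.cpp was compiled with GPU support.
        args += ["-ngl", str(n_gpu_layers)]
    return args


if __name__ == "__main__":
    args = build_llama_cli_args(
        "my-qwen-14b-q8_0.gguf", "you are a helpful assistant", n_gpu_layers=24
    )
    # Print the shell-quoted command; to launch it, pass args to subprocess.run().
    print(shlex.join(args))
```

Passing the arguments as a list avoids shell-quoting problems when the prompt contains spaces or quotes.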

llama-server:

./llama-server -m /mnt/workspace/qwen2-7b-instruct-q8_0.gguf -ngl 28

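This starts a process that behaves like a web server, listening on port 8080 by default. It can be accessed through its web page, through the OpenAI API, and so on.

Access through the OpenAI API: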

import openai

client = openai.OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key = "sk-no-key-required"
)

completion = client.chat.completions.create(
    model="qwen", # model name can be chosen arbitrarily
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "tell me something about michael jordan"}
    ]
)
print(completion.choices[0].message.content)
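If the openai package is not installed, the same endpoint can be reached with the Python standard library alone. The sketch below assumes the server is running on the default address used above; the helper names `build_chat_request` and `chat` are hypothetical, and the response is read using the same `choices[0].message.content` shape as the client example.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # default llama-server port, as above


def build_chat_request(messages, model="qwen", base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-no-key-required",
        },
        method="POST",
    )


def chat(messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(messages, **kwargs)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Usage (requires a running llama-server):
# print(chat([{"role": "user", "content": "tell me something about michael jordan"}]))
```

Separating request construction from sending makes it possible to inspect the URL and payload without a server running.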

See also the ModelScope community article "GGUF模型怎么玩！看这篇就够了" (roughly, "How to work with GGUF models: this one article is all you need").

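Summary

### Article summary: `llama.cpp`
#### Repository information
- **GitHub address**: [ggerganov/llama.cpp](http://github.com/ggerganov/llama.cpp)
- **Goal**: LLM inference on a wide range of hardware with minimal setup and state-of-the-art performance.
#### Main features
- **Model conversion and quantization**: converts trained models to the `gguf` format and quantizes them.
- **Running models**: offers two ways to run a model, the command line (`llama-cli`) and a server (`llama-server`).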
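#### Installation steps
1. **Clone the repository**: clone the `llama.cpp` GitHub repository with Git.
2. **Build the project**: create a `build` folder in the project directory, install the required dependencies (`make`, `cmake`, `gcc`), then build and install the project.
3. **Create symbolic links** (optional): create symbolic links to the executables in `/usr/bin` or `/usr/local/bin` so they can be called from anywhere.
#### Model processing
- **Model conversion**: use the `convert-hf-to-gguf.py` script to convert the model to the `gguf` format, specifying the output type and path.
- **Model quantization**: use the `llama-quantize` tool to lower the model's precision, which shrinks the file and generally speeds up inference.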
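#### Basic usage
- **Command line** (`llama-cli`): for interacting with the model directly from the command line.
  - Use the `-m` parameter to specify the model file.
  - The `-cnv` flag runs the model in conversation mode.
  - The `-ngl` parameter (when compiled with GPU support) offloads the given number of layers to the GPU to speed up computation.
- **Server** (`llama-server`): starts a process that behaves like a web server and can be accessed through its web page, the OpenAI API, and so on.
  - When starting the server, specify the model file with the `-m` parameter.
  - The default port is 8080; the OpenAI API can be used to interact with it (the correct `base_url` must be configured).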
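#### Notes
- Because `llama.cpp` is updated continuously, defer to the latest official documentation and instructions, especially for the names of the command-line tools (`quantize`, `main`, and `server` have been renamed to `llama-quantize`, `llama-cli`, and `llama-server`).
- For the exact steps and parameters of model conversion and quantization, consult the official documentation.
#### Reference resources
- For detailed steps and example code for model conversion, quantization, and running, see the official `llama.cpp` documentation and related community discussions.
- The ModelScope community's tutorials on using GGUF models are also a valuable resource.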