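Preface
Large models have been wildly popular lately, so it is worth learning a complete, easy-to-use workflow from dataset --> train --> deploy, if only for the bragging rights. Without further ado, let's get started.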
Training framework
There are plenty of good options today, such as BELLE and ChatGLM; the one I recommend here is LLaMA-Factory.
Environment setup
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
conda create -n llama_factory python=3.10
conda activate llama_factory
pip install -e ".[torch,metrics]"
That's right, you read that correctly: those few steps are all it takes to set up the environment.
Dataset
The dataset uses a common instruction/input/output format, as follows:
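[
  {
    "instruction": "Hello",
    "input": "",
    "output": "Hello, I am the XX large model, an AI assistant developed by XXX. Nice to meet you. What can I do for you?"
  },
  {
    "instruction": "Hello",
    "input": "",
    "output": "Hello, I am the XX large model, an AI assistant built by XXX. How can I help you?"
  }
]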
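For a text-to-text task, put the prompt in instruction, the source text in input, and the target text in output.
Note: many other data formats are supported; see the project's README for details.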
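As an illustration of the format above, here is a minimal sketch that builds records with the three fields, checks them, and writes the JSON file. The helper names (make_record, validate_records, save_dataset) are hypothetical and not part of LLaMA-Factory; only the field names come from the example above.

```python
import json
from pathlib import Path

# Field names taken from the dataset example above
REQUIRED_KEYS = ("instruction", "input", "output")


def make_record(instruction: str, output: str, input_text: str = "") -> dict:
    """Build one record with the three fields of the format."""
    return {"instruction": instruction, "input": input_text, "output": output}


def validate_records(records: list) -> None:
    """Raise ValueError if a record lacks a field or holds a non-string value."""
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            if not isinstance(rec.get(key), str):
                raise ValueError(f"record {i}: field {key!r} must be a string")


def save_dataset(records: list, path: str) -> None:
    """Validate the records and write them as a UTF-8 JSON array."""
    validate_records(records)
    # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the file
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

For example, save_dataset([make_record("Hello", "Hi, how can I help?")], "data/mydataset.json") would produce a one-record file in the same shape as the example.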
Once your custom dataset is ready, put it in the project's data directory, then add a description of it to data/dataset_info.json, like the mydataset entry below:
{
  "starcoder_python": {
    "hf_hub_url": "bigcode/starcoderdata",
    "ms_hub_url": "AI-ModelScope/starcoderdata",
    "columns": {
      "prompt": "content"
    },
    "folder": "python"
  },
  "mydataset": {
    "file_name": "mydataset.json",
    "file_sha1": "535e1a88e1d480f80eca38d50216ea3a5dbedfa9"
  }
}
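mydataset is the name you choose for the dataset, and file_name is the name of the dataset JSON file in the data folder. file_sha1 is the SHA-1 hash of that file (a checksum, not encryption), which you can compute with the following code: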
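import hashlib

def calculate_sha1(file_path):
    """Return the SHA-1 hex digest of a file, or None if the file does not exist."""
    sha1 = hashlib.sha1()
    try:
        with open(file_path, 'rb') as file:
            while True:
                data = file.read(8192)  # Read in chunks to handle large files
                if not data:
                    break
                sha1.update(data)
        return sha1.hexdigest()
    except FileNotFoundError:
        # Return None rather than a message, so the result is never mistaken for a hash
        return None

# Usage example
file_path = 'mydataset.json'  # Replace with the path to your file
sha1_hash = calculate_sha1(file_path)
print("SHA-1 Hash:", sha1_hash if sha1_hash else "File not found.")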
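The two manual steps above (computing the hash and editing data/dataset_info.json) can also be combined. The following is a sketch only: register_dataset and sha1_of_file are hypothetical helpers, and it assumes dataset_info.json already exists and needs just the file_name and file_sha1 keys shown in the example.

```python
import hashlib
import json
from pathlib import Path


def sha1_of_file(path: Path) -> str:
    """Return the SHA-1 hex digest of a file, read in 8 KiB chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_dataset(data_dir: str, name: str, file_name: str) -> dict:
    """Add or overwrite the entry `name` in <data_dir>/dataset_info.json."""
    data_path = Path(data_dir)
    info_path = data_path / "dataset_info.json"
    info = json.loads(info_path.read_text(encoding="utf-8"))
    entry = {
        "file_name": file_name,
        "file_sha1": sha1_of_file(data_path / file_name),
    }
    info[name] = entry  # existing entries such as starcoder_python are kept
    info_path.write_text(
        json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return entry
```

Calling register_dataset("data", "mydataset", "mydataset.json") from the project root would then write an entry equivalent to the one added by hand above.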
Train & Inference & Export
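Whether to use a single GPU or several, and LoRA or full-parameter fine-tuning, is covered in the original project's documentation, which is concise and easy to follow.
The command I used is: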
bash examples/full_multi_gpu/single_node.sh
Contents of single_node.sh:
#!/bin/bash
NPROC_PER_NODE=4 # number of GPUs; must match the number of devices in CUDA_VISIBLE_DEVICES below, otherwise an error is raised
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
--nproc_per_node $NPROC_PER_NODE \
--nnodes $NNODES \
--node_rank $RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
src/train.py examples/full_multi_gpu/llama3_full_sft.yaml
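Then edit your SFT YAML config file (llama3_full_sft.yaml in the script above).
Training, inference, and model export each take a single command; just change the parameters in the corresponding YAML config file.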
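The comment in single_node.sh says that NPROC_PER_NODE has to match the number of devices in CUDA_VISIBLE_DEVICES. A small sketch of that check, done before launching, is shown below; build_torchrun_command and count_visible_devices are hypothetical helpers that only assemble the same torchrun arguments as the script and do not launch anything.

```python
def count_visible_devices(cuda_visible_devices: str) -> int:
    """Count the entries in a CUDA_VISIBLE_DEVICES value such as "0,1,2,3"."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def build_torchrun_command(
    nproc_per_node: int,
    cuda_visible_devices: str,
    config: str,
    nnodes: int = 1,
    node_rank: int = 0,
    master_addr: str = "127.0.0.1",
    master_port: int = 29500,
) -> list:
    """Return the torchrun argument list from the script, after a GPU-count check."""
    visible = count_visible_devices(cuda_visible_devices)
    if visible != nproc_per_node:
        raise ValueError(
            f"NPROC_PER_NODE={nproc_per_node} but CUDA_VISIBLE_DEVICES "
            f"lists {visible} device(s)"
        )
    return [
        "torchrun",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "src/train.py", config,
    ]
```

The returned list could be passed to subprocess.run together with an environment that sets CUDA_VISIBLE_DEVICES to the same string; a mismatch is then caught before torchrun starts.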
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
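Ollama deployment
The llamafactory-cli chat command above can of course run inference too, but it is not very convenient; this is where Ollama comes in.
Exporting the llama model
The model trained with LLaMA-Factory in the previous step cannot be used by Ollama directly; its format has to be converted first.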
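If you fine-tuned with LoRA, first use llamafactory-cli export to merge the adapter into the base model and export a single model;
with full-parameter fine-tuning there is only one model, so no extra step is needed.
The conversion commands are as follows: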
git clone https://github.com/ollama/ollama.git
cd ollama
# and then fetch its llama.cpp submodule:
git submodule init
git submodule update llm/llama.cpp
# conda
conda create -n ollama python=3.11
conda activate ollama
pip install -r llm/llama.cpp/requirements.txt
# Then build the quantize tool:
make -C llm/llama.cpp quantize
# convert model
python llm/llama.cpp/convert-hf-to-gguf.py path-to-your-trained-model --outtype f16 --outfile converted.bin
If anything is still unclear, see the Importing section of the original documentation.
Quantization, creating the Ollama model, and Docker deployment
Quantize the model (optional)
llm/llama.cpp/quantize converted.bin quantized.bin q4_0
Create a Modelfile for your model:
FROM quantized.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
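If you prefer to generate the Modelfile from a script, for instance when trying several quantized files, a minimal sketch could look like this. render_modelfile and write_modelfile are hypothetical helpers; they only emit the FROM and TEMPLATE lines shown above and deliberately reject templates containing double quotes instead of guessing at escaping rules.

```python
from pathlib import Path


def render_modelfile(weights_file: str, template: str) -> str:
    """Return Modelfile text with the FROM and TEMPLATE lines shown above."""
    if '"' in template:
        # Assumption of this sketch: no quote escaping is attempted
        raise ValueError("template must not contain double quotes")
    return f'FROM {weights_file}\nTEMPLATE "{template}"\n'


def write_modelfile(folder: str, weights_file: str, template: str) -> Path:
    """Write the Modelfile into `folder`, creating the folder if needed."""
    out = Path(folder) / "Modelfile"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(render_modelfile(weights_file, template), encoding="utf-8")
    return out
```

write_modelfile("ollama_file", "quantized.bin", "[INST] {{ .Prompt }} [/INST]") reproduces the two-line Modelfile above. The template should match the prompt format the model was fine-tuned with.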
Use Docker to convert the model into a format Ollama can use:
Move quantized.bin and Modelfile into the ollama_file folder (create it if it does not exist).
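# Run from the directory that contains ollama_file; an absolute path is needed for a bind mount (a bare name would create an empty named volume instead)
docker run -itd --gpus=1 -v "$(pwd)/ollama_file":/root/.ollama -p 11434:11434 --name ollama ollama/ollama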
This pulls the image automatically. Once the container has been created, enter it and convert the model:
# enter the container
docker exec -it ollama /bin/bash
cd /root/.ollama
# create a model in the format Ollama can use
ollama create name-to-your-ollama-model -f Modelfile
# check the result of the conversion
ollama list
Test directly inside the container:
ollama run name-to-your-ollama-model
Stop the test:
/bye
Exit the container:
exit
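For installing Docker, see my other blog post, "Essential software and configuration for a fresh Ubuntu install".
Calling from a server or over the LAN
You can call the model with curl or with a POST request from Python.
With curl: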
curl http://localhost:11434/api/generate -d '{
"model": "name-to-your-ollama-model",
"prompt":"hello",
"stream": false
}'
With a POST request from Python:
import json
import requests

# Generate a response for a given prompt with a given model. The request is sent with
# stream=True, so the reply arrives as a series of JSON lines. The final line
# (done == true) carries statistics and the conversation context.
def generate(model_name, prompt, system=None):
    try:
        url = "http://localhost:11434/api/generate"
        payload = {"model": model_name, "prompt": prompt, "system": system}
        # Remove keys with None values
        payload = {k: v for k, v in payload.items() if v is not None}
        with requests.post(url, json=payload, stream=True) as response:
            response.raise_for_status()
            # Holds the context returned with the final chunk
            final_context = None
            # Holds the concatenated response text
            full_response = ""
            # Iterate over the response line by line
            for line in response.iter_lines():
                if line:
                    # Parse each line (one JSON chunk) and extract the details
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        response_piece = chunk.get("response", "")
                        full_response += response_piece
                        # print(response_piece, end="", flush=True)
                    # Check whether it is the last chunk (done is true)
                    if chunk.get("done"):
                        final_context = chunk.get("context")
            # Return the full response and the final context
            return full_response, final_context
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None, None

if __name__ == '__main__':
    model = 'name-to-your-ollama-model'
    SYS_PROMPT = 'hello'
    USER_PROMPT = "What is your favorite movie?"
    response1, _ = generate(model_name=model, system=SYS_PROMPT, prompt=USER_PROMPT)
    print(response1)
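Both examples above talk to localhost. For a call from another machine on the LAN, the same endpoint can be reached through the server's address, since the container publishes port 11434. The sketch below mirrors the curl call (non-streaming) using only the standard library; build_generate_request and generate_once are hypothetical names, and the host you pass in is whatever address your server actually has.

```python
import json
import urllib.request


def build_generate_request(
    host: str, model_name: str, prompt: str, port: int = 11434
) -> urllib.request.Request:
    """Build a non-streaming POST to /api/generate on the given host."""
    payload = {"model": model_name, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"http://{host}:{port}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_once(
    host: str, model_name: str, prompt: str, timeout: float = 120.0
) -> str:
    """Send the request and return the "response" field of the single JSON reply."""
    request = build_generate_request(host, model_name, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

For example, generate_once("192.168.1.10", "name-to-your-ollama-model", "hello") would query a server at that (made-up) LAN address. With "stream" set to false the reply is a single JSON object, so no line-by-line parsing is needed.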
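LangChain deployment
The Ollama deployment above is already fairly convenient, but it still lacks some configuration. This configuration is also worth a look.
Next comes LangChain deployment; for what LangChain is, see here.
To be updated.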