1. Compiling llama.cpp
Clone the llama.cpp repository, then build it with cuBLAS support, pointing LLAMA_CUDA_NVCC at the nvcc of your CUDA installation:

cd llama.cpp
make LLAMA_CUBLAS=1 LLAMA_CUDA_NVCC=/usr/local/cuda/bin/nvcc

Bug: the build fails.
Error message:
nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
make: *** [Makefile:171: ggml-cuda.o] Error 1
make: *** Waiting for unfinished jobs....
Fix:
Older nvcc releases do not understand 'native' as a gpu-architecture value, so an explicit architecture must be supplied via the CUDA_DOCKER_ARCH variable. Try CUDA_DOCKER_ARCH=all first; if that does not resolve it, pick the value matching your CUDA version and GPU, e.g. compute_75 (see the sketch after the successful build output below). Valid values:
'all', 'all-major', 'compute_35', 'compute_37', 'compute_50', 'compute_52', 'compute_53', 'compute_60', 'compute_61', 'compute_62', 'compute_70', 'compute_72', 'compute_75', 'compute_80', 'compute_86', 'compute_87', 'lto_35', 'lto_37', 'lto_50', 'lto_52', 'lto_53', 'lto_60', 'lto_61', 'lto_62', 'lto_70', 'lto_72', 'lto_75', 'lto_80', 'lto_86', 'lto_87', 'sm_35', 'sm_37', 'sm_50', 'sm_52', 'sm_53', 'sm_60', 'sm_61', 'sm_62', 'sm_70', 'sm_72', 'sm_75', 'sm_80', 'sm_86', 'sm_87'.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_75 LLAMA_CUDA_NVCC=/usr/local/cuda-11.4/bin/nvcc
Successful build output:
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c tests/test-c.c -o tests/test-c.o
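To pick the right architecture value, you can query the GPU's compute capability directly. A minimal sketch, assuming a reasonably recent driver (the compute_cap query field is missing from very old nvidia-smi versions) and the toolkit path used above:

# Query the GPU's compute capability; e.g. 7.5 maps to compute_75 / sm_75
nvidia-smi --query-gpu=name,compute_cap --format=csv
# List the architectures this nvcc build can actually target
/usr/local/cuda/bin/nvcc --list-gpu-arch

Pick a value that appears in both: your card's capability and the toolkit's supported list.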
2. Converting the model weights to f16 GGUF format
Place the model weights (including the tokenizer) in the models folder, then run:

python convert.py /Data/linai/llama.cpp/models/chinese-alpaca-2-13b-hf

On success it prints:

Wrote /Data/linai/llama.cpp/models/chinese-alpaca-2-13b-hf/ggml-model-f16.gguf

Since our hardware has enough resources, we do not quantize any further.
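Before moving on, a quick sanity check of the output file. This sketch relies only on the fact that valid GGUF files begin with the ASCII magic bytes "GGUF":

# Print the first four bytes of the converted file; a valid GGUF file
# starts with the ASCII string "GGUF"
head -c 4 /Data/linai/llama.cpp/models/chinese-alpaca-2-13b-hf/ggml-model-f16.gguf; echo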
If you do need to quantize (e.g. limited VRAM), the method is:
./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-q4_0.gguf q4_0
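If you would rather compare several quantization levels than commit to q4_0 up front, a simple loop works. A sketch; q4_0, q5_0, and q8_0 are all standard llama.cpp quantization types:

# Produce several quantization levels from the same f16 file to compare
# file size and quality; lower q means smaller files and more quality loss
for q in q4_0 q5_0 q8_0; do
    ./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-$q.gguf $q
done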
3. Launch!
The Chinese-LLaMA official wiki already covers this in detail, so I won't repeat it here:
llamacpp_zh · ymcui/Chinese-LLaMA-Alpaca-2 Wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/llamacpp_zh
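For completeness, a minimal launch sketch using the llama.cpp CLI of this era, run from the llama.cpp directory with the f16 file produced in step 2. The prompt and the GPU layer count (-ngl) are placeholder values; the wiki above documents the recommended settings and chat template for Chinese-Alpaca-2:

# -m: model file, -p: prompt, -n: max tokens to generate,
# -ngl: number of layers to offload to the GPU (placeholder value)
./main -m ./models/chinese-alpaca-2-13b-hf/ggml-model-f16.gguf \
       -p "你好" -n 256 -ngl 40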