1. Compiling llama.cpp
Clone the llama.cpp repository, then build it with cuBLAS support, pointing LLAMA_CUDA_NVCC at the nvcc of your CUDA installation:

cd llama.cpp
make LLAMA_CUBLAS=1 LLAMA_CUDA_NVCC=/usr/local/cuda/bin/nvcc

Bug: the build fails.
Error message:
nvcc fatal : Value 'native' is not defined for option 'gpu-architecture'
make: *** [Makefile:171: ggml-cuda.o] Error 1
make: *** Waiting for unfinished jobs....
Fix:
Older nvcc releases do not understand 'native' as a gpu-architecture value, so an explicit architecture must be supplied via the CUDA_DOCKER_ARCH variable. Try CUDA_DOCKER_ARCH=all first; if that does not resolve it, pick the value matching your CUDA version and GPU, e.g. compute_75 (see the sketch after the successful build output below). Valid values:
'all', 'all-major', 'compute_35', 'compute_37', 'compute_50', 'compute_52', 'compute_53', 'compute_60', 'compute_61', 'compute_62', 'compute_70', 'compute_72', 'compute_75', 'compute_80', 'compute_86', 'compute_87', 'lto_35', 'lto_37', 'lto_50', 'lto_52', 'lto_53', 'lto_60', 'lto_61', 'lto_62', 'lto_70', 'lto_72', 'lto_75', 'lto_80', 'lto_86', 'lto_87', 'sm_35', 'sm_37', 'sm_50', 'sm_52', 'sm_53', 'sm_60', 'sm_61', 'sm_62', 'sm_70', 'sm_72', 'sm_75', 'sm_80', 'sm_86', 'sm_87'.
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_75 LLAMA_CUDA_NVCC=/usr/local/cuda-11.4/bin/nvcc
Successful build output:
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native -c tests/test-c.c -o tests/test-c.o
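To pick the right architecture value, you can query the GPU's compute capability directly. A minimal sketch, assuming a reasonably recent driver (the compute_cap query field is missing from very old nvidia-smi versions) and the toolkit path used above:

# Query the GPU's compute capability; e.g. 7.5 maps to compute_75 / sm_75
nvidia-smi --query-gpu=name,compute_cap --format=csv
# List the architectures this nvcc build can actually target
/usr/local/cuda/bin/nvcc --list-gpu-arch

Pick a value that appears in both: your card's capability and the toolkit's supported list.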
2. Converting the model weights to f16 GGUF format
Place the model weights (including the tokenizer) in the models folder, then run:

python convert.py /Data/linai/llama.cpp/models/chinese-alpaca-2-13b-hf

On success it prints:

Wrote /Data/linai/llama.cpp/models/chinese-alpaca-2-13b-hf/ggml-model-f16.gguf

Since our hardware has enough resources, we do not quantize any further.
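Before moving on, a quick sanity check of the output file. This sketch relies only on the fact that valid GGUF files begin with the ASCII magic bytes "GGUF":

# Print the first four bytes of the converted file; a valid GGUF file
# starts with the ASCII string "GGUF"
head -c 4 /Data/linai/llama.cpp/models/chinese-alpaca-2-13b-hf/ggml-model-f16.gguf; echo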
If you do need to quantize (e.g. limited VRAM), the method is:
./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-q4_0.gguf q4_0
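If you would rather compare several quantization levels than commit to q4_0 up front, a simple loop works. A sketch; q4_0, q5_0, and q8_0 are all standard llama.cpp quantization types:

# Produce several quantization levels from the same f16 file to compare
# file size and quality; lower q means smaller files and more quality loss
for q in q4_0 q5_0 q8_0; do
    ./quantize ./zh-models/7B/ggml-model-f16.gguf ./zh-models/7B/ggml-model-$q.gguf $q
done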
3. Launch!
The Chinese-LLaMA official wiki already covers this in detail, so I won't repeat it here:
llamacpp_zh · ymcui/Chinese-LLaMA-Alpaca-2 Wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/llamacpp_zh
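For completeness, a minimal launch sketch using the llama.cpp CLI of this era, run from the llama.cpp directory with the f16 file produced in step 2. The prompt and the GPU layer count (-ngl) are placeholder values; the wiki above documents the recommended settings and chat template for Chinese-Alpaca-2:

# -m: model file, -p: prompt, -n: max tokens to generate,
# -ngl: number of layers to offload to the GPU (placeholder value)
./main -m ./models/chinese-alpaca-2-13b-hf/ggml-model-f16.gguf \
       -p "你好" -n 256 -ngl 40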