Fine-Tuning and Running Inference on LLaMA 65B with LoRA

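A few days ago, Meta released LIMA, a model built on LLaMA-65B and fine-tuned, without RLHF, on only 1,000 carefully curated samples. Its authors report that it rivals GPT-4 in a sizable share of human preference comparisons. That sparked my interest in exploring LLaMA 65B.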

The earlier articles in this series all fine-tuned the LLaMA 7B/13B models. This article uses LoRA to fine-tune the LLaMA 30B and 65B models. The accompanying code is on GitHub: llm-action.

Environment Setup

The base environment is configured as follows:

Operating system: CentOS 7
CPUs: a single node with Intel CPUs and 1 TB of memory; 64 physical CPUs with 16 cores each
GPUs: 8 x A800 80GB
Python: 3.10 (upgrade OpenSSL to 1.1.1t first, then compile and install Python)
NVIDIA driver: 515.65.01 (choose the driver that matches your GPU model)
CUDA Toolkit: 11.7
NCCL: nccl_2.14.3-1+cuda11.7
cuDNN: 8.8.1.3_cuda11
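As a quick sanity check of the Python build, the sketch below prints the interpreter version and the OpenSSL library it was linked against, using only the standard library. The 3.10 and 1.1.1 thresholds come from the list above; the helper name check_python_build is made up for illustration.

```python
import ssl
import sys


def check_python_build(min_python=(3, 10), min_openssl=(1, 1, 1)):
    """Return (python_ok, openssl_ok) for the running interpreter."""
    python_ok = sys.version_info[:2] >= min_python
    # OPENSSL_VERSION_INFO is a tuple such as (1, 1, 1, 20, 15).
    openssl_ok = ssl.OPENSSL_VERSION_INFO[:3] >= min_openssl
    return python_ok, openssl_ok


if __name__ == "__main__":
    print("Python :", sys.version.split()[0])
    print("OpenSSL:", ssl.OPENSSL_VERSION)
    print("OK     :", check_python_build())
```

Running it inside the activated virtual environment should show whether the interpreter was actually compiled against the upgraded OpenSSL.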

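The experimental environment is the same as in the earlier article "Impressive Enough: Fine-Tuning LLaMA (7B) with Alpaca-LoRA in Twenty Minutes, with Results on Par with Stanford Alpaca", so the setup steps are not repeated here.

Activate the virtual environment directly: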

source /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/bin/activate

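Preparing the Dataset

For the dataset, use alpaca_data.json, alpaca_data_cleaned_archive.json, or alpaca_data_gpt4.json from the alpaca-lora project. In addition, the GPT-4-LLM project provides 52K Chinese instruction-following examples, produced by translating the Alpaca prompts into Chinese and generating the responses with GPT-4.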
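To make the data format concrete, here is a minimal sketch of how one Alpaca-style record (fields instruction, input, output) can be turned into a training prompt. The template strings are reproduced from memory of the "alpaca" prompt template and should be checked against the alpaca-lora repository; the function name build_prompt and the sample record are illustrative only.

```python
# Template text as I recall it from the "alpaca" prompt template (an assumption).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(record):
    """Format one Alpaca-style record as prompt text followed by the answer."""
    instruction = record["instruction"]
    context = record.get("input", "")
    if context:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return prompt + record.get("output", "")


# Made-up record for illustration; not taken from the dataset.
example = {"instruction": "Add the two numbers.", "input": "1 and 2", "output": "3"}
print(build_prompt(example))
```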

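Converting the Model Format

First, convert the original LLaMA 30B/65B weights to the Hugging Face (HF) format. For the detailed steps, see the earlier article "Reproducing Stanford Alpaca 7B from Scratch".

Original LLaMA 65B weights: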

> tree llama-model/65B/
llama-model/65B/
├── checklist.chk
├── consolidated.00.pth
...
├── consolidated.07.pth
└── params.json

0 directories, 10 files

LLaMA 65B weights after conversion to HF format:

ls -al hf-llama-model/llama-65b/ hf-llama-model/tokenizer/
hf-llama-model/llama-65b/:
total 127511452
drwxrwxr-x 1 nobody nobody          0 Mar 27 20:44 .
drwxrwxr-x 1 nobody nobody          0 Mar 27 20:35 ..
-rw-rw-r-- 1 nobody nobody        426 Mar 27 20:44 config.json
-rw-rw-r-- 1 nobody nobody        124 Mar 27 20:44 generation_config.json
-rw-rw-r-- 1 nobody nobody 1619037191 Mar 27 20:38 pytorch_model-00001-of-00081.bin
...
-rw-rw-r-- 1 nobody nobody 1048593571 Mar 27 20:44 pytorch_model-00081-of-00081.bin
-rw-rw-r-- 1 nobody nobody      63494 Mar 27 20:44 pytorch_model.bin.index.json

hf-llama-model/tokenizer/:
total 500
drwxrwxr-x 1 nobody nobody      0 Mar 30 10:53 .
drwxrwxr-x 1 nobody nobody      0 Mar 27 20:35 ..
-rw-rw-r-- 1 nobody nobody      2 Mar 30 10:53 special_tokens_map.json
-rw-rw-r-- 1 nobody nobody    141 Mar 30 10:53 tokenizer_config.json
-rw-rw-r-- 1 nobody nobody 499723 Mar 30 10:53 tokenizer.model

Then copy the files in the tokenizer directory into the llama-65b directory.

cp hf-llama-model/tokenizer/* hf-llama-model/llama-65b/

The conversion of LLaMA 30B is similar and is not repeated here.
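Before starting a long training job, it can help to confirm that the converted directory is complete. The sketch below checks for a few expected files and for every shard named in the weight_map of pytorch_model.bin.index.json. The function name missing_files and the list of required files are assumptions for illustration.

```python
import json
from pathlib import Path

REQUIRED = ("config.json", "tokenizer.model", "pytorch_model.bin.index.json")


def missing_files(model_dir, required=REQUIRED):
    """List required files and weight shards that are absent from model_dir."""
    model_dir = Path(model_dir)
    missing = [name for name in required if not (model_dir / name).is_file()]
    index_path = model_dir / "pytorch_model.bin.index.json"
    if index_path.is_file():
        # The index maps every tensor name to the shard file that stores it.
        weight_map = json.loads(index_path.read_text())["weight_map"]
        for shard in sorted(set(weight_map.values())):
            if not (model_dir / shard).is_file():
                missing.append(shard)
    return missing


if __name__ == "__main__":
    # Path taken from the listing above; adjust to your own layout.
    print(missing_files("hf-llama-model/llama-65b"))
```

An empty list means the tokenizer files were copied and every shard referenced by the index is present.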

Fine-Tuning

LLaMA-30B

First, fine-tune LLaMA 30B. At 16-bit precision, the 30B weights take roughly 60 GB. On A800 GPUs, a micro_batch_size of 6 makes good use of the available GPU memory.
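The size estimates in this article follow from simple arithmetic on the parameter count. The sketch below derives the base parameter counts from the "all params" and "trainable params" values printed in the training logs in this article, and assumes weights only (no activations, gradients, or optimizer state). The names weight_gib and BASE_PARAMS are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# "all params" minus "trainable params" from the training logs in this article.
BASE_PARAMS = {
    "LLaMA-30B": 32_541_723_136 - 12_779_520,
    "LLaMA-65B": 65_306_632_192 - 20_971_520,
}


def weight_gib(num_params, dtype):
    """Approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for name, count in BASE_PARAMS.items():
    sizes = ", ".join(f"{d}: {weight_gib(count, d):.1f} GiB" for d in BYTES_PER_PARAM)
    print(f"{name}: {sizes}")
```

The 16-bit figures come out near 60 GiB and 120 GiB, matching the estimates quoted for the two models, while the 8-bit figure for 65B is about half of that.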

Training run:

torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
--base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-30b' \
--data_path '/data/nfs/guodong.li/data/alpaca_data_cleaned.json' \
--output_dir '/home/guodong.li/output/alpaca-lora-30b-dp' \
--batch_size 96 \
--micro_batch_size 6 \
--num_epochs 2

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: /data/nfs/guodong.li/pretrain/hf-llama-model/llama-30b
data_path: /data/nfs/guodong.li/data/alpaca_data_cleaned.json
output_dir: /home/guodong.li/output/alpaca-lora-30b-dp
batch_size: 96
micro_batch_size: 6
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca

...

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61/61 [02:11<00:00,  2.16s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61/61 [02:12<00:00,  2.17s/it]
Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 187.05it/s]
trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777

...

Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
Map:   4%|█████▍                                                                                                                                        | 1904/49942 [00:01<00:38, 1244.61 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 193.31it/s]
trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777
Map:   9%|████████████▊                                                                                                                                 | 4513/49942 [00:03<00:32, 1402.69 examples/s]Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
Map:  66%|█████████████████████████████████████████████████████████████████████████████████████████████▌                                               | 33152/49942 [00:24<00:12, 1340.03 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 561.56it/s]
trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
Map:  67%|██████████████████████████████████████████████████████████████████████████████████████████████▍                                              | 33433/49942 [00:24<00:12, 1371.96 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 627.33it/s]
Map:  40%|█████████████████████████████████████████████████████████                                                                                    | 20222/49942 [00:16<00:26, 1104.62 examples/s]trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
{'loss': 2.0954, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.02}
{'loss': 1.984, 'learning_rate': 5.6999999999999996e-05, 'epoch': 0.04}
{'loss': 1.7062, 'learning_rate': 8.4e-05, 'epoch': 0.06}
{'loss': 1.3441, 'learning_rate': 0.00011399999999999999, 'epoch': 0.08}
{'loss': 1.1435, 'learning_rate': 0.00014099999999999998, 'epoch': 0.1}
{'loss': 0.9968, 'learning_rate': 0.00017099999999999998, 'epoch': 0.12}
{'loss': 0.9275, 'learning_rate': 0.000201, 'epoch': 0.13}
...
{'loss': 0.812, 'learning_rate': 0.00026904255319148935, 'epoch': 0.38}
{'eval_loss': 0.8141900897026062, 'eval_runtime': 28.5046, 'eval_samples_per_second': 70.164, 'eval_steps_per_second': 1.123, 'epoch': 0.38}
{'loss': 0.8016, 'learning_rate': 0.0002658510638297872, 'epoch': 0.4}
{'loss': 0.8024, 'learning_rate': 0.0002626595744680851, 'epoch': 0.42}
{'loss': 0.7938, 'learning_rate': 0.000259468085106383, 'epoch': 0.44}
...
{'loss': 0.793, 'learning_rate': 0.00021478723404255316, 'epoch': 0.71}
{'loss': 0.7884, 'learning_rate': 0.00021159574468085105, 'epoch': 0.73}
{'loss': 0.7748, 'learning_rate': 0.00020840425531914894, 'epoch': 0.75}
{'loss': 0.7869, 'learning_rate': 0.00020521276595744677, 'epoch': 0.77}
{'eval_loss': 0.8041278719902039, 'eval_runtime': 28.2371, 'eval_samples_per_second': 70.829, 'eval_steps_per_second': 1.133, 'epoch': 0.77}
{'loss': 0.7846, 'learning_rate': 0.00020202127659574466, 'epoch': 0.79}
{'loss': 0.791, 'learning_rate': 0.00019882978723404255, 'epoch': 0.81}
{'loss': 0.7923, 'learning_rate': 0.00019563829787234039, 'epoch': 0.83}
...
{'loss': 0.7775, 'learning_rate': 0.0001573404255319149, 'epoch': 1.06}
{'loss': 0.7883, 'learning_rate': 0.00015414893617021278, 'epoch': 1.08}
{'loss': 0.7805, 'learning_rate': 0.0001509574468085106, 'epoch': 1.1}
{'loss': 0.7955, 'learning_rate': 0.0001477659574468085, 'epoch': 1.11}
{'loss': 0.7801, 'learning_rate': 0.00014457446808510636, 'epoch': 1.13}
{'loss': 0.7933, 'learning_rate': 0.00014138297872340425, 'epoch': 1.15}
{'eval_loss': 0.8008487820625305, 'eval_runtime': 28.9576, 'eval_samples_per_second': 69.066, 'eval_steps_per_second': 1.105, 'epoch': 1.15}
{'loss': 0.785, 'learning_rate': 0.0001381914893617021, 'epoch': 1.17}
{'loss': 0.7686, 'learning_rate': 0.000135, 'epoch': 1.19}
{'loss': 0.7717, 'learning_rate': 0.00013180851063829786, 'epoch': 1.21}
...
{'loss': 0.7688, 'learning_rate': 8.393617021276595e-05, 'epoch': 1.5}
{'loss': 0.7785, 'learning_rate': 8.074468085106383e-05, 'epoch': 1.52}
{'loss': 0.7767, 'learning_rate': 7.75531914893617e-05, 'epoch': 1.54}
{'eval_loss': 0.7986326813697815, 'eval_runtime': 28.3196, 'eval_samples_per_second': 70.622, 'eval_steps_per_second': 1.13, 'epoch': 1.54}
{'loss': 0.7907, 'learning_rate': 7.436170212765956e-05, 'epoch': 1.56}
{'loss': 0.7691, 'learning_rate': 7.117021276595744e-05, 'epoch': 1.58}
...
{'loss': 0.7649, 'learning_rate': 1.6914893617021273e-05, 'epoch': 1.9}
{'loss': 0.7624, 'learning_rate': 1.3723404255319146e-05, 'epoch': 1.92}
{'eval_loss': 0.7973329424858093, 'eval_runtime': 29.2014, 'eval_samples_per_second': 68.49, 'eval_steps_per_second': 1.096, 'epoch': 1.92}
{'loss': 0.7824, 'learning_rate': 1.0531914893617022e-05, 'epoch': 1.94}
{'loss': 0.7772, 'learning_rate': 7.3404255319148934e-06, 'epoch': 1.96}
{'loss': 0.7762, 'learning_rate': 4.148936170212765e-06, 'epoch': 1.98}
{'loss': 0.7572, 'learning_rate': 9.574468085106382e-07, 'epoch': 2.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1040/1040 [1:18:34<00:00,  4.44s/it]
{'train_runtime': 4716.2302, 'train_samples_per_second': 21.179, 'train_steps_per_second': 0.221, 'train_loss': 0.8336130522764646, 'epoch': 2.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1040/1040 [1:18:34<00:00,  4.53s/it]
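
The learning_rate values in the log are consistent with a linear warmup followed by a linear decay to zero. The sketch below reproduces that shape. The peak of 3e-4 and the 1,040 total steps come from the log; the 100 warmup steps are an assumption, since the warmup length is not printed. After warmup, the logged values line up with a scheduler that is a few steps behind the logged step (for example 197 rather than 200 at epoch 0.38), possibly because some early updates were skipped under mixed precision.

```python
def linear_warmup_decay(step, total_steps, warmup_steps=100, peak_lr=3e-4):
    """Learning rate at a given scheduler step: linear ramp up, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (10, 100, 197, 1040):
    print(step, linear_warmup_decay(step, total_steps=1040))
```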

Model weight files:

> tree -h  /home/guodong.li/output/alpaca-lora-30b-dp
/home/guodong.li/output/alpaca-lora-30b-dp
├── [ 424]  adapter_config.json
├── [ 49M]  adapter_model.bin
└── [4.0K]  checkpoint-1000
    ├── [ 98M]  optimizer.pt
    ├── [ 49M]  pytorch_model.bin
    ├── [ 14K]  rng_state_0.pth
    ├── [ 14K]  rng_state_1.pth
    ├── [ 14K]  rng_state_2.pth
    ├── [ 14K]  rng_state_3.pth
    ├── [ 14K]  rng_state_4.pth
    ├── [ 14K]  rng_state_5.pth
    ├── [ 14K]  rng_state_6.pth
    ├── [ 14K]  rng_state_7.pth
    ├── [ 557]  scaler.pt
    ├── [ 627]  scheduler.pt
    ├── [ 13K]  trainer_state.json
    └── [3.5K]  training_args.bin

1 directory, 16 files
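
The trainable parameter counts in the logs can be reproduced by hand. With lora_r = 8 applied to q_proj and v_proj, each adapted projection gains two small matrices, A (r x hidden) and B (hidden x r). The sketch below assumes the published LLaMA shapes (hidden size 6656 with 60 layers for 30B, 8192 with 80 layers for 65B) and 4 bytes per saved parameter; the function name is illustrative.

```python
def lora_trainable_params(hidden_size, num_layers, r=8, num_target_modules=2):
    """Count LoRA parameters when adapting square hidden x hidden projections."""
    per_module = r * (hidden_size + hidden_size)  # A is r x hidden, B is hidden x r
    return per_module * num_target_modules * num_layers


# (hidden size, number of layers); assumed from the published LLaMA configurations.
MODELS = {"LLaMA-30B": (6656, 60), "LLaMA-65B": (8192, 80)}

for name, (hidden, layers) in MODELS.items():
    count = lora_trainable_params(hidden, layers)
    print(f"{name}: {count} params, about {count * 4 / 2**20:.1f} MiB at 4 bytes each")
```

The counts match the "trainable params" lines in the logs (12,779,520 for 30B and 20,971,520 for 65B, the latter shown in the next section), and the estimated sizes are close to the 49M and 80M adapter_model.bin files. The optimizer.pt files are about twice as large, which would fit an optimizer that keeps two state tensors per trainable parameter.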

As shown above, on A800 GPUs with 8-way data parallelism and about 50,000 training samples, one epoch takes roughly 40 minutes.

LLaMA-65B

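Next, fine-tune LLaMA 65B. At 16-bit precision, the 65B weights take roughly 120 GB. So that each A800 can hold a full copy of the 65B model, micro_batch_size is set to 1 here. The bitsandbytes messages in the log suggest the base model is loaded in 8-bit, which would explain why a full replica fits in 80 GB.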

Training run:

torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
--base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b' \
--data_path '/data/nfs/guodong.li/data/alpaca_data_cleaned.json' \
--output_dir '/home/guodong.li/output/alpaca-lora-65b-dp' \
--batch_size 8 \
--micro_batch_size 1 \
--num_epochs 1
...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: /data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b
data_path: /data/nfs/guodong.li/data/alpaca_data_cleaned.json
output_dir: /home/guodong.li/output/alpaca-lora-65b-dp
batch_size: 8
micro_batch_size: 1
num_epochs: 1
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [02:06<00:00,  1.56s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [02:20<00:00,  1.74s/it]
...
Map:  13%|█████████████████▉                                                                                                                            | 6312/49942 [00:04<00:30, 1410.98 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 196.47it/s]
trainable params: 20971520 || all params: 65306632192 || trainable%: 0.03211238934867168
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
{'loss': 2.1086, 'learning_rate': 2.3999999999999997e-05, 'epoch': 0.0}
{'loss': 2.0261, 'learning_rate': 5.399999999999999e-05, 'epoch': 0.0}
{'loss': 1.7054, 'learning_rate': 8.4e-05, 'epoch': 0.0}
{'loss': 1.2423, 'learning_rate': 0.00011099999999999999, 'epoch': 0.01}
{'loss': 0.9976, 'learning_rate': 0.00013199999999999998, 'epoch': 0.01}
{'loss': 0.801, 'learning_rate': 0.000162, 'epoch': 0.01}
{'loss': 0.839, 'learning_rate': 0.00019199999999999998, 'epoch': 0.01}
{'loss': 0.8134, 'learning_rate': 0.00022199999999999998, 'epoch': 0.01}
{'loss': 0.7575, 'learning_rate': 0.00025199999999999995, 'epoch': 0.01}
...
{'loss': 0.769, 'learning_rate': 0.0001992023441315318, 'epoch': 0.35}
{'loss': 0.7393, 'learning_rate': 0.00019871398339573498, 'epoch': 0.35}
{'loss': 0.7269, 'learning_rate': 0.0001982256226599381, 'epoch': 0.35}
{'loss': 0.6783, 'learning_rate': 0.00019773726192414128, 'epoch': 0.35}
{'eval_loss': 0.7974867820739746, 'eval_runtime': 48.5181, 'eval_samples_per_second': 41.222, 'eval_steps_per_second': 0.66, 'epoch': 0.35}
{'loss': 0.6891, 'learning_rate': 0.00019724890118834445, 'epoch': 0.35}
{'loss': 0.7216, 'learning_rate': 0.0001967605404525476, 'epoch': 0.36}
{'loss': 0.7114, 'learning_rate': 0.00019627217971675075, 'epoch': 0.36}
{'loss': 0.7089, 'learning_rate': 0.0001957838189809539, 'epoch': 0.36}
...
{'loss': 0.6985, 'learning_rate': 5.323132020185577e-06, 'epoch': 0.98}
{'loss': 0.7167, 'learning_rate': 4.834771284388734e-06, 'epoch': 0.99}
{'loss': 0.7433, 'learning_rate': 4.346410548591893e-06, 'epoch': 0.99}
{'loss': 0.6875, 'learning_rate': 3.8580498127950505e-06, 'epoch': 0.99}
{'loss': 0.7104, 'learning_rate': 3.369689076998209e-06, 'epoch': 0.99}
{'loss': 0.7346, 'learning_rate': 2.881328341201367e-06, 'epoch': 0.99}
{'loss': 0.7062, 'learning_rate': 2.3929676054045255e-06, 'epoch': 0.99}
{'eval_loss': 0.787121593952179, 'eval_runtime': 48.4232, 'eval_samples_per_second': 41.303, 'eval_steps_per_second': 0.661, 'epoch': 0.99}
{'loss': 0.701, 'learning_rate': 1.9046068696076832e-06, 'epoch': 0.99}
{'loss': 0.7169, 'learning_rate': 1.4162461338108414e-06, 'epoch': 1.0}
{'loss': 0.763, 'learning_rate': 9.278853980139996e-07, 'epoch': 1.0}
{'loss': 0.6903, 'learning_rate': 4.3952466221715773e-07, 'epoch': 1.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6243/6243 [4:36:50<00:00,  2.42s/it]
{'train_runtime': 16612.2434, 'train_samples_per_second': 3.006, 'train_steps_per_second': 0.376, 'train_loss': 0.7368283385404043, 'epoch': 1.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6243/6243 [4:36:50<00:00,  2.66s/it]

GPU memory usage:

Tue May 23 17:05:37 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   67C    P0   296W / 300W |  78543MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:35:00.0 Off |                    0 |
| N/A   69C    P0   303W / 300W |  78577MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80G...  Off  | 00000000:36:00.0 Off |                    0 |
| N/A   70C    P0   300W / 300W |  78657MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80G...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   72C    P0   297W / 300W |  78577MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A800 80G...  Off  | 00000000:9B:00.0 Off |                    0 |
| N/A   71C    P0   292W / 300W |  78641MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A800 80G...  Off  | 00000000:9C:00.0 Off |                    0 |
| N/A   71C    P0   305W / 300W |  78629MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A800 80G...  Off  | 00000000:9D:00.0 Off |                    0 |
| N/A   68C    P0   296W / 300W |  78625MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   68C    P0   298W / 300W |  78799MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     33369      C   ...nv-py310-cu117/bin/python    78541MiB |
|    1   N/A  N/A     33370      C   ...nv-py310-cu117/bin/python    78575MiB |
|    2   N/A  N/A     33371      C   ...nv-py310-cu117/bin/python    78655MiB |
|    3   N/A  N/A     33372      C   ...nv-py310-cu117/bin/python    78575MiB |
|    4   N/A  N/A     33373      C   ...nv-py310-cu117/bin/python    78639MiB |
|    5   N/A  N/A     33374      C   ...nv-py310-cu117/bin/python    78627MiB |
|    6   N/A  N/A     33375      C   ...nv-py310-cu117/bin/python    78623MiB |
|    7   N/A  N/A     33376      C   ...nv-py310-cu117/bin/python    78797MiB |
+-----------------------------------------------------------------------------+

Model weights:

> tree -h /home/guodong.li/output/alpaca-lora-65b-dp
/home/guodong.li/output/alpaca-lora-65b-dp
├── [ 424]  adapter_config.json
├── [ 80M]  adapter_model.bin
└── [4.0K]  checkpoint-6200
    ├── [160M]  optimizer.pt
    ├── [ 80M]  pytorch_model.bin
    ├── [ 14K]  rng_state_0.pth
    ├── [ 14K]  rng_state_1.pth
    ├── [ 14K]  rng_state_2.pth
    ├── [ 14K]  rng_state_3.pth
    ├── [ 14K]  rng_state_4.pth
    ├── [ 14K]  rng_state_5.pth
    ├── [ 14K]  rng_state_6.pth
    ├── [ 14K]  rng_state_7.pth
    ├── [ 557]  scaler.pt
    ├── [ 627]  scheduler.pt
    ├── [ 80K]  trainer_state.json
    └── [3.5K]  training_args.bin

1 directory, 16 files

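As shown above, on A800 GPUs with 8-way data parallelism and about 50,000 training samples, one epoch takes roughly 4.6 hours (4:36:50 in the log above).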
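
The step counts and per-epoch times of both runs can be checked against the logs. The sketch below assumes that the global batch_size is split into micro-batches per GPU with gradient accumulation making up the difference, and it approximates how the Hugging Face Trainer counts optimizer steps. The 49,942 training samples and the runtimes are taken from the logs; the function name is illustrative.

```python
import math


def optimizer_steps(num_samples, batch_size, micro_batch_size, world_size, epochs):
    """Approximate number of optimizer updates for a data-parallel run."""
    grad_accum = max(batch_size // micro_batch_size // world_size, 1)
    micro_steps = math.ceil(num_samples / (micro_batch_size * world_size))
    updates_per_epoch = max(micro_steps // grad_accum, 1)
    return updates_per_epoch * epochs


TRAIN_SAMPLES = 49_942  # size of the training split shown in the Map progress bars
RUNS = {
    "LLaMA-30B": {"batch_size": 96, "micro_batch_size": 6, "epochs": 2, "runtime_s": 4716.2302},
    "LLaMA-65B": {"batch_size": 8, "micro_batch_size": 1, "epochs": 1, "runtime_s": 16612.2434},
}

for name, cfg in RUNS.items():
    steps = optimizer_steps(
        TRAIN_SAMPLES, cfg["batch_size"], cfg["micro_batch_size"], 8, cfg["epochs"]
    )
    minutes_per_epoch = cfg["runtime_s"] / cfg["epochs"] / 60
    print(f"{name}: {steps} steps, {cfg['runtime_s'] / steps:.2f} s/step, "
          f"{minutes_per_epoch:.1f} min/epoch")
```

The step counts agree with the progress bars (1040 and 6243). The 65B run performs about six times as many optimizer steps per epoch because its global batch is 8 instead of 96, which accounts for much of the difference in epoch time.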

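Merging the LoRA Weights Back into the Base Model

Next, merge the LoRA weights back into the base model to make inference easier. For details, see the article "Impressive Enough: Fine-Tuning LLaMA (7B) with Alpaca-LoRA in Twenty Minutes, with Results on Par with Stanford Alpaca" and modify export_hf_checkpoint.py accordingly.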
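
Conceptually, merging folds the low-rank update into each adapted weight matrix: W' = W + (lora_alpha / r) * B A. The toy example below illustrates that arithmetic in plain Python with tiny matrices. It is not the code in export_hf_checkpoint.py, and the helper names and numbers are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha=16, r=8):
    """Return W + (lora_alpha / r) * (B @ A) for one weight matrix."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: a 2 x 2 weight with a rank-1 adapter and a scale of 2.
W = [[1, 0], [0, 1]]
A = [[1, 2]]    # r x in
B = [[1], [3]]  # out x r
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

After merging, a single matrix multiply with the merged weight gives the same result as the base projection plus the scaled adapter path, so inference no longer needs the separate adapter.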

Weight merge run:

BASE_MODEL=/data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b \
LORA_MODEL=/home/guodong.li/output/alpaca-lora-65b-dp \
HF_CHECKPOINT=/home/guodong.li/output/hf_65b_ckpt \
python export_hf_checkpoint.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst')}
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [01:15<00:00,  1.08it/s]

Merged weight files:

> tree -h hf_65b_ckpt
hf_65b_ckpt
├── [ 580]  config.json
├── [ 137]  generation_config.json
├── [ 537]  pytorch_model-00001-of-00403.bin
├── [500M]  pytorch_model-00002-of-00403.bin
├── [256M]  pytorch_model-00003-of-00403.bin
├── [256M]  pytorch_model-00004-of-00403.bin
├── [344M]  pytorch_model-00005-of-00403.bin
├── [344M]  pytorch_model-00006-of-00403.bin
├── [344M]  pytorch_model-00007-of-00403.bin
...
├── [344M]  pytorch_model-00400-of-00403.bin
├── [344M]  pytorch_model-00401-of-00403.bin
├── [344M]  pytorch_model-00402-of-00403.bin
├── [500M]  pytorch_model-00403-of-00403.bin
└── [ 65K]  pytorch_model.bin.index.json

0 directories, 406 files

Inference

Next, run inference with the merged model weights. The inference code (inference.py) is shown below:

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

tokenizer_path = "/data/nfs/guodong.li/pretrain/hf-llama-model/tokenizer"
model_path = "/home/guodong.li/output/hf_65b_ckpt"  # Path to the merged local model

# load_in_8bit=True quantizes the weights to 8-bit via bitsandbytes;
# device_map="auto" lets accelerate decide where each layer is placed.
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

print("Human:")
line = input()
while line:
    inputs = 'Human: ' + line.strip() + '\n\nAssistant:'
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids
    # Send the prompt to the device holding the model's first parameters
    # instead of a hard-coded GPU index.
    input_ids = input_ids.to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=500, do_sample=True, top_k=30, top_p=0.85, temperature=0.5, repetition_penalty=1.0, eos_token_id=2, bos_token_id=1, pad_token_id=0)
    rets = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    # Remove the echoed prompt so that only the generated answer is printed.
    print("Assistant:\n" + rets[0].strip().replace(inputs, ""))
    print("\n------------------------------------------------\nHuman:")
    line = input()

Run the inference code:

> python inference.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst')}
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 402/402 [01:48<00:00,  3.72it/s]
Human:
What are the five characteristics of a good argument?
Assistant:
 A good argument should be clear, concise, logical, supported by evidence, and respectful of the opposing view.

------------------------------------------------
Human:
Generate a list of ten common idioms related to animals.
Assistant:
 1. "Like a fish out of water" 2. "Birds of a feather flock together" 3. "Let the cat out of the bag" 4. "Herding cats" 5. "Barking up the wrong tree" 6. "Sly as a fox" 7. "A lion's share" 8. "A bird in the hand is worth two in the bush" 9. "A wolf in sheep's clothing" 10. "A wild goose chase".

------------------------------------------------
Human:
Evaluate the following expression: (6+2)*(2-2).
Assistant:
 10.

------------------------------------------------
Human:
Compute the derivative of 3x^3 + 10x.
Assistant:
 The derivative of 3x^3 + 10x is 9x^2 + 10.

------------------------------------------------
Human:

GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   44C    P0    69W / 300W |  66927MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
...
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A800 80G...  Off  | 00000000:9E:00.0 Off |                    0 |
| N/A   47C    P0    71W / 300W |   7224MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     43499      C   python                          66925MiB |
|    1   N/A  N/A     43499      C   python                            949MiB |
...
|    7   N/A  N/A     43499      C   python                            949MiB |
+-----------------------------------------------------------------------------+

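As shown above, even though the weights are loaded in 8-bit (the script passes load_in_8bit=True alongside torch_dtype=torch.float16), a single GPU still uses more than 60 GB of memory. If hardware resources are limited, consider model-parallel inference; see the tensor_parallel and FasterTransformer projects, both of which can run LLaMA with model parallelism. For models with tens of billions of parameters or more, model parallelism is also worth using simply to improve inference speed and throughput.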
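
As a simpler alternative to the tensor-parallel projects mentioned above, the layers can be spread across several GPUs by passing a per-device memory budget to from_pretrained together with device_map="auto". This is layer-wise placement, not tensor parallelism: it lowers per-GPU memory but the GPUs still work one after another. The helper below only builds the budget dictionary; its name and the 12 GiB figure are illustrative assumptions, and the commented call is untested here.

```python
def build_max_memory(num_gpus, per_gpu_gib, cpu_gib=None):
    """Build a per-device memory budget keyed by GPU index (and optionally "cpu")."""
    budget = {gpu: f"{per_gpu_gib}GiB" for gpu in range(num_gpus)}
    if cpu_gib is not None:
        budget["cpu"] = f"{cpu_gib}GiB"
    return budget


# 8 GPUs x 12 GiB leaves headroom over the roughly 61 GiB of 8-bit 65B weights.
max_memory = build_max_memory(num_gpus=8, per_gpu_gib=12)
print(max_memory)

# Hypothetical usage (needs transformers, accelerate, and bitsandbytes installed):
# model = LlamaForCausalLM.from_pretrained(
#     model_path, load_in_8bit=True, device_map="auto", max_memory=max_memory
# )
```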

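Conclusion

This article described how to use LoRA, a parameter-efficient fine-tuning technique, to train LLaMA 30B/65B and run inference with the result. I hope it is helpful.

References:

Reproducing Stanford Alpaca 7B from Scratch
Impressive Enough: Fine-Tuning LLaMA (7B) with Alpaca-LoRA in Twenty Minutes, with Results on Par with Stanford Alpaca
Alpaca-LoRA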