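A few days ago, Meta released LIMA, a model fine-tuned from LLaMA-65B without RLHF on only 1,000 carefully curated examples. In the paper's human evaluation, its responses were judged comparable to or better than GPT-4's in a substantial share of cases. That sparked my interest in exploring the LLaMA 65B model.
The earlier articles in this series all fine-tuned the LLaMA 7B/13B models. This article uses LoRA to fine-tune the LLaMA 30B/65B models. The code is available on GitHub: llm-action.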
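Environment Setup
The base environment is configured as follows:
Operating system: CentOS 7
CPUs: Intel CPUs on a single node with 1 TB of memory; 64 physical CPUs with 16 cores each
GPUs: 8 x A800 80GB
Python: 3.10 (upgrade OpenSSL to 1.1.1t first, then compile and install Python)
NVIDIA driver version: 515.65.01 (choose the driver that matches your GPU model)
CUDA toolkit: 11.7
NCCL: nccl_2.14.3-1+cuda11.7
cuDNN: 8.8.1.3_cuda11
The experimental environment is the same as in the article "Impressive Enough: Fine-Tuning LLaMA (7B) with Alpaca-LoRA in Twenty Minutes, with Results Comparable to Stanford Alpaca", so the setup steps are not repeated here.
Activate the virtual environment directly: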
source /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/bin/activate
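Dataset Preparation
Use alpaca_data.json, alpaca_data_cleaned_archive.json, or alpaca_data_gpt4.json from the alpaca-lora project directly. Alternatively, see the GPT-4-LLM project, which also provides 52K Chinese instruction-following examples, produced by translating the Alpaca prompts into Chinese and generating the responses with GPT-4.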
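To make the data format concrete, here is a minimal sketch of how one Alpaca-style record (instruction / input / output) could be rendered into training text. It assumes the standard Alpaca prompt template (the training logs later in this article report `prompt template: alpaca`); the `build_prompt` helper and the example record are hypothetical and are not taken from the alpaca-lora code.

```python
# Standard Alpaca prompt templates (assumed; see the lead-in above).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Render one Alpaca-style record as a single training string."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    # The expected answer is appended after the "### Response:" marker.
    return prompt + record["output"]


# Hypothetical record, for illustration only.
example = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(build_prompt(example))
```

Records with an empty `input` field fall back to the shorter template, so both kinds of example can live in the same JSON file.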
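Model Format Conversion
First, convert the original LLaMA 30B/65B weights to the Hugging Face (HF) format. For the detailed steps, see the earlier article "Reproducing Stanford Alpaca 7B from Scratch".
Original LLaMA 65B model weights: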
> tree llama-model/65B/
llama-model/65B/
├── checklist.chk
├── consolidated.00.pth
...
├── consolidated.07.pth
└── params.json
0 directories, 10 files
LLaMA 65B model weights after conversion to HF format:
ls -al hf-llama-model/llama-65b/ hf-llama-model/tokenizer/
hf-llama-model/llama-65b/:
total 127511452
drwxrwxr-x 1 nobody nobody 0 Mar 27 20:44 .
drwxrwxr-x 1 nobody nobody 0 Mar 27 20:35 ..
-rw-rw-r-- 1 nobody nobody 426 Mar 27 20:44 config.json
-rw-rw-r-- 1 nobody nobody 124 Mar 27 20:44 generation_config.json
-rw-rw-r-- 1 nobody nobody 1619037191 Mar 27 20:38 pytorch_model-00001-of-00081.bin
...
-rw-rw-r-- 1 nobody nobody 1048593571 Mar 27 20:44 pytorch_model-00081-of-00081.bin
-rw-rw-r-- 1 nobody nobody 63494 Mar 27 20:44 pytorch_model.bin.index.json
hf-llama-model/tokenizer/:
total 500
drwxrwxr-x 1 nobody nobody 0 Mar 30 10:53 .
drwxrwxr-x 1 nobody nobody 0 Mar 27 20:35 ..
-rw-rw-r-- 1 nobody nobody 2 Mar 30 10:53 special_tokens_map.json
-rw-rw-r-- 1 nobody nobody 141 Mar 30 10:53 tokenizer_config.json
-rw-rw-r-- 1 nobody nobody 499723 Mar 30 10:53 tokenizer.model
Then copy the files in the tokenizer directory into the llama-65b directory.
cp hf-llama-model/tokenizer/* hf-llama-model/llama-65b/
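The conversion for LLaMA 30B is analogous and is not repeated here.
Model Fine-Tuning
LLaMA-30B
First, fine-tune LLaMA 30B. The 30B model's weights are about 60 GB. On the A800, a micro_batch_size of 6 makes full use of GPU memory.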
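The size figures quoted in this article can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below uses the `all params` values printed in this article's training logs. The helper name is hypothetical, and the estimate covers weights only; activations, LoRA gradients, and optimizer state come on top.

```python
GIB = 1024 ** 3


def weight_size_gib(n_params: int, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / GIB


# "all params" values as reported by the training logs in this article.
PARAMS = {"LLaMA-30B": 32_541_723_136, "LLaMA-65B": 65_306_632_192}

for name, n_params in PARAMS.items():
    fp16 = weight_size_gib(n_params, 2)  # 2 bytes per parameter
    int8 = weight_size_gib(n_params, 1)  # 1 byte per parameter
    print(f"{name}: fp16 ~ {fp16:.1f} GiB, int8 ~ {int8:.1f} GiB")
```

The FP16 figures line up with the "about 60 GB" and "about 120 GB" quoted for the two checkpoints, and the 8-bit figure for 65B shows why one replica can fit on an 80 GB card at all.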
Training run:
torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
> --base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-30b' \
> --data_path '/data/nfs/guodong.li/data/alpaca_data_cleaned.json' \
> --output_dir '/home/guodong.li/output/alpaca-lora-30b-dp' \
> --batch_size 96 \
> --micro_batch_size 6 \
> --num_epochs 2
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: /data/nfs/guodong.li/pretrain/hf-llama-model/llama-30b
data_path: /data/nfs/guodong.li/data/alpaca_data_cleaned.json
output_dir: /home/guodong.li/output/alpaca-lora-30b-dp
batch_size: 96
micro_batch_size: 6
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61/61 [02:11<00:00, 2.16s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61/61 [02:12<00:00, 2.17s/it]
Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 187.05it/s]
trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777
...
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
Map: 4%|█████▍ | 1904/49942 [00:01<00:38, 1244.61 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 193.31it/s]
trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777
Map: 9%|████████████▊ | 4513/49942 [00:03<00:32, 1402.69 examples/s]Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
Map: 66%|█████████████████████████████████████████████████████████████████████████████████████████████▌ | 33152/49942 [00:24<00:12, 1340.03 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 561.56it/s]
trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
Map: 67%|██████████████████████████████████████████████████████████████████████████████████████████████▍ | 33433/49942 [00:24<00:12, 1371.96 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 627.33it/s]
Map: 40%|█████████████████████████████████████████████████████████ | 20222/49942 [00:16<00:26, 1104.62 examples/s]trainable params: 12779520 || all params: 32541723136 || trainable%: 0.03927118409369777
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
{'loss': 2.0954, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.02}
{'loss': 1.984, 'learning_rate': 5.6999999999999996e-05, 'epoch': 0.04}
{'loss': 1.7062, 'learning_rate': 8.4e-05, 'epoch': 0.06}
{'loss': 1.3441, 'learning_rate': 0.00011399999999999999, 'epoch': 0.08}
{'loss': 1.1435, 'learning_rate': 0.00014099999999999998, 'epoch': 0.1}
{'loss': 0.9968, 'learning_rate': 0.00017099999999999998, 'epoch': 0.12}
{'loss': 0.9275, 'learning_rate': 0.000201, 'epoch': 0.13}
...
{'loss': 0.812, 'learning_rate': 0.00026904255319148935, 'epoch': 0.38}
{'eval_loss': 0.8141900897026062, 'eval_runtime': 28.5046, 'eval_samples_per_second': 70.164, 'eval_steps_per_second': 1.123, 'epoch': 0.38}
{'loss': 0.8016, 'learning_rate': 0.0002658510638297872, 'epoch': 0.4}
{'loss': 0.8024, 'learning_rate': 0.0002626595744680851, 'epoch': 0.42}
{'loss': 0.7938, 'learning_rate': 0.000259468085106383, 'epoch': 0.44}
...
{'loss': 0.793, 'learning_rate': 0.00021478723404255316, 'epoch': 0.71}
{'loss': 0.7884, 'learning_rate': 0.00021159574468085105, 'epoch': 0.73}
{'loss': 0.7748, 'learning_rate': 0.00020840425531914894, 'epoch': 0.75}
{'loss': 0.7869, 'learning_rate': 0.00020521276595744677, 'epoch': 0.77}
{'eval_loss': 0.8041278719902039, 'eval_runtime': 28.2371, 'eval_samples_per_second': 70.829, 'eval_steps_per_second': 1.133, 'epoch': 0.77}
{'loss': 0.7846, 'learning_rate': 0.00020202127659574466, 'epoch': 0.79}
{'loss': 0.791, 'learning_rate': 0.00019882978723404255, 'epoch': 0.81}
{'loss': 0.7923, 'learning_rate': 0.00019563829787234039, 'epoch': 0.83}
...
{'loss': 0.7775, 'learning_rate': 0.0001573404255319149, 'epoch': 1.06}
{'loss': 0.7883, 'learning_rate': 0.00015414893617021278, 'epoch': 1.08}
{'loss': 0.7805, 'learning_rate': 0.0001509574468085106, 'epoch': 1.1}
{'loss': 0.7955, 'learning_rate': 0.0001477659574468085, 'epoch': 1.11}
{'loss': 0.7801, 'learning_rate': 0.00014457446808510636, 'epoch': 1.13}
{'loss': 0.7933, 'learning_rate': 0.00014138297872340425, 'epoch': 1.15}
{'eval_loss': 0.8008487820625305, 'eval_runtime': 28.9576, 'eval_samples_per_second': 69.066, 'eval_steps_per_second': 1.105, 'epoch': 1.15}
{'loss': 0.785, 'learning_rate': 0.0001381914893617021, 'epoch': 1.17}
{'loss': 0.7686, 'learning_rate': 0.000135, 'epoch': 1.19}
{'loss': 0.7717, 'learning_rate': 0.00013180851063829786, 'epoch': 1.21}
...
{'loss': 0.7688, 'learning_rate': 8.393617021276595e-05, 'epoch': 1.5}
{'loss': 0.7785, 'learning_rate': 8.074468085106383e-05, 'epoch': 1.52}
{'loss': 0.7767, 'learning_rate': 7.75531914893617e-05, 'epoch': 1.54}
{'eval_loss': 0.7986326813697815, 'eval_runtime': 28.3196, 'eval_samples_per_second': 70.622, 'eval_steps_per_second': 1.13, 'epoch': 1.54}
{'loss': 0.7907, 'learning_rate': 7.436170212765956e-05, 'epoch': 1.56}
{'loss': 0.7691, 'learning_rate': 7.117021276595744e-05, 'epoch': 1.58}
...
{'loss': 0.7649, 'learning_rate': 1.6914893617021273e-05, 'epoch': 1.9}
{'loss': 0.7624, 'learning_rate': 1.3723404255319146e-05, 'epoch': 1.92}
{'eval_loss': 0.7973329424858093, 'eval_runtime': 29.2014, 'eval_samples_per_second': 68.49, 'eval_steps_per_second': 1.096, 'epoch': 1.92}
{'loss': 0.7824, 'learning_rate': 1.0531914893617022e-05, 'epoch': 1.94}
{'loss': 0.7772, 'learning_rate': 7.3404255319148934e-06, 'epoch': 1.96}
{'loss': 0.7762, 'learning_rate': 4.148936170212765e-06, 'epoch': 1.98}
{'loss': 0.7572, 'learning_rate': 9.574468085106382e-07, 'epoch': 2.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1040/1040 [1:18:34<00:00, 4.44s/it]
{'train_runtime': 4716.2302, 'train_samples_per_second': 21.179, 'train_steps_per_second': 0.221, 'train_loss': 0.8336130522764646, 'epoch': 2.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1040/1040 [1:18:34<00:00, 4.53s/it]
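The `learning_rate` values in the log above rise for the first steps and then fall steadily. That shape matches a linear warmup followed by a linear decay. The sketch below reproduces it under two assumptions that are not stated in this article: a warmup of 100 steps (which I believe is the value hard-coded in alpaca-lora's finetune.py) and decay to zero at the final step. The function name is hypothetical.

```python
def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4, warmup_steps: int = 100) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# The 30B run above has 1040 optimizer steps and logs every 10 steps.
for step in (10, 100, 520, 1040):
    print(step, lr_at(step, total_steps=1040))
```

The first logged value (about 3e-05 at step 10) agrees with this curve. Later logged values trail it by a few steps, which would be consistent with a handful of optimizer steps being skipped, for example by FP16 loss scaling; that explanation is a guess, not something the log confirms.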
Model weight files:
> tree -h /home/guodong.li/output/alpaca-lora-30b-dp
/home/guodong.li/output/alpaca-lora-30b-dp
├── [ 424] adapter_config.json
├── [ 49M] adapter_model.bin
└── [4.0K] checkpoint-1000
├── [ 98M] optimizer.pt
├── [ 49M] pytorch_model.bin
├── [ 14K] rng_state_0.pth
├── [ 14K] rng_state_1.pth
├── [ 14K] rng_state_2.pth
├── [ 14K] rng_state_3.pth
├── [ 14K] rng_state_4.pth
├── [ 14K] rng_state_5.pth
├── [ 14K] rng_state_6.pth
├── [ 14K] rng_state_7.pth
├── [ 557] scaler.pt
├── [ 627] scheduler.pt
├── [ 13K] trainer_state.json
└── [3.5K] training_args.bin
1 directory, 16 files
As shown above, on A800 GPUs with 8-way data parallelism and roughly 50,000 training examples, one epoch takes about 40 minutes.
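The step count in the progress bar (1040) follows from the batch settings. The sketch below is an approximation of how the total is derived, assuming alpaca-lora computes gradient accumulation as `batch_size // micro_batch_size` divided by the number of GPUs, and that the data is sharded evenly across ranks. The function name is hypothetical; 49,942 is the training-set size shown in the `Map:` progress lines.

```python
import math


def optimizer_steps(n_train: int, world_size: int, batch_size: int,
                    micro_batch_size: int, num_epochs: int) -> int:
    """Approximate total optimizer steps for a data-parallel run."""
    # Accumulation is derived from the global batch, then split across GPUs (assumed).
    grad_accum = max(1, batch_size // micro_batch_size // world_size)
    # Each rank sees an equal share of the data, rounded up.
    samples_per_rank = math.ceil(n_train / world_size)
    micro_batches_per_rank = math.ceil(samples_per_rank / micro_batch_size)
    steps_per_epoch = max(1, micro_batches_per_rank // grad_accum)
    return steps_per_epoch * num_epochs


# 30B run: --batch_size 96 --micro_batch_size 6 --num_epochs 2 on 8 GPUs.
steps = optimizer_steps(49_942, world_size=8, batch_size=96, micro_batch_size=6, num_epochs=2)
print(steps)

# Time per epoch from the logged train_runtime (seconds) over 2 epochs.
print(f"{4716.2302 / 2 / 60:.1f} minutes per epoch")
```

With these settings each GPU processes 6 examples per forward pass and accumulates 2 passes before every update, so the effective global batch is 6 x 2 x 8 = 96.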
LLaMA-65B
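Next, fine-tune LLaMA 65B. The 65B model's weights are about 120 GB. So that a single A800 can hold the 65B model, micro_batch_size is set to 1 here. Note that alpaca-lora's finetune.py loads the base model in 8-bit through bitsandbytes, which is what lets each data-parallel replica fit on one 80 GB card.
Training run: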
torchrun --nproc_per_node=8 --master_port=29005 finetune.py \
> --base_model '/data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b' \
> --data_path '/data/nfs/guodong.li/data/alpaca_data_cleaned.json' \
> --output_dir '/home/guodong.li/output/alpaca-lora-65b-dp' \
> --batch_size 8 \
> --micro_batch_size 1 \
> --num_epochs 1
...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Training Alpaca-LoRA model with params:
base_model: /data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b
data_path: /data/nfs/guodong.li/data/alpaca_data_cleaned.json
output_dir: /home/guodong.li/output/alpaca-lora-65b-dp
batch_size: 8
micro_batch_size: 1
num_epochs: 1
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [02:06<00:00, 1.56s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [02:20<00:00, 1.74s/it]
...
Map: 13%|█████████████████▉ | 6312/49942 [00:04<00:30, 1410.98 examples/s]Found cached dataset json (/home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 196.47it/s]
trainable params: 20971520 || all params: 65306632192 || trainable%: 0.03211238934867168
Loading cached split indices for dataset at /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-d8c5d7ac95d53860.arrow and /home/guodong.li/.cache/huggingface/datasets/json/default-2dab63d15cf49261/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e/cache-4a34b0c9feb19e72.arrow
{'loss': 2.1086, 'learning_rate': 2.3999999999999997e-05, 'epoch': 0.0}
{'loss': 2.0261, 'learning_rate': 5.399999999999999e-05, 'epoch': 0.0}
{'loss': 1.7054, 'learning_rate': 8.4e-05, 'epoch': 0.0}
{'loss': 1.2423, 'learning_rate': 0.00011099999999999999, 'epoch': 0.01}
{'loss': 0.9976, 'learning_rate': 0.00013199999999999998, 'epoch': 0.01}
{'loss': 0.801, 'learning_rate': 0.000162, 'epoch': 0.01}
{'loss': 0.839, 'learning_rate': 0.00019199999999999998, 'epoch': 0.01}
{'loss': 0.8134, 'learning_rate': 0.00022199999999999998, 'epoch': 0.01}
{'loss': 0.7575, 'learning_rate': 0.00025199999999999995, 'epoch': 0.01}
...
{'loss': 0.769, 'learning_rate': 0.0001992023441315318, 'epoch': 0.35}
{'loss': 0.7393, 'learning_rate': 0.00019871398339573498, 'epoch': 0.35}
{'loss': 0.7269, 'learning_rate': 0.0001982256226599381, 'epoch': 0.35}
{'loss': 0.6783, 'learning_rate': 0.00019773726192414128, 'epoch': 0.35}
{'eval_loss': 0.7974867820739746, 'eval_runtime': 48.5181, 'eval_samples_per_second': 41.222, 'eval_steps_per_second': 0.66, 'epoch': 0.35}
{'loss': 0.6891, 'learning_rate': 0.00019724890118834445, 'epoch': 0.35}
{'loss': 0.7216, 'learning_rate': 0.0001967605404525476, 'epoch': 0.36}
{'loss': 0.7114, 'learning_rate': 0.00019627217971675075, 'epoch': 0.36}
{'loss': 0.7089, 'learning_rate': 0.0001957838189809539, 'epoch': 0.36}
...
{'loss': 0.6985, 'learning_rate': 5.323132020185577e-06, 'epoch': 0.98}
{'loss': 0.7167, 'learning_rate': 4.834771284388734e-06, 'epoch': 0.99}
{'loss': 0.7433, 'learning_rate': 4.346410548591893e-06, 'epoch': 0.99}
{'loss': 0.6875, 'learning_rate': 3.8580498127950505e-06, 'epoch': 0.99}
{'loss': 0.7104, 'learning_rate': 3.369689076998209e-06, 'epoch': 0.99}
{'loss': 0.7346, 'learning_rate': 2.881328341201367e-06, 'epoch': 0.99}
{'loss': 0.7062, 'learning_rate': 2.3929676054045255e-06, 'epoch': 0.99}
{'eval_loss': 0.787121593952179, 'eval_runtime': 48.4232, 'eval_samples_per_second': 41.303, 'eval_steps_per_second': 0.661, 'epoch': 0.99}
{'loss': 0.701, 'learning_rate': 1.9046068696076832e-06, 'epoch': 0.99}
{'loss': 0.7169, 'learning_rate': 1.4162461338108414e-06, 'epoch': 1.0}
{'loss': 0.763, 'learning_rate': 9.278853980139996e-07, 'epoch': 1.0}
{'loss': 0.6903, 'learning_rate': 4.3952466221715773e-07, 'epoch': 1.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6243/6243 [4:36:50<00:00, 2.42s/it]
{'train_runtime': 16612.2434, 'train_samples_per_second': 3.006, 'train_steps_per_second': 0.376, 'train_loss': 0.7368283385404043, 'epoch': 1.0}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6243/6243 [4:36:50<00:00, 2.66s/it]
GPU memory usage:
Tue May 23 17:05:37 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... Off | 00000000:34:00.0 Off | 0 |
| N/A 67C P0 296W / 300W | 78543MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A800 80G... Off | 00000000:35:00.0 Off | 0 |
| N/A 69C P0 303W / 300W | 78577MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A800 80G... Off | 00000000:36:00.0 Off | 0 |
| N/A 70C P0 300W / 300W | 78657MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A800 80G... Off | 00000000:37:00.0 Off | 0 |
| N/A 72C P0 297W / 300W | 78577MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A800 80G... Off | 00000000:9B:00.0 Off | 0 |
| N/A 71C P0 292W / 300W | 78641MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A800 80G... Off | 00000000:9C:00.0 Off | 0 |
| N/A 71C P0 305W / 300W | 78629MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A800 80G... Off | 00000000:9D:00.0 Off | 0 |
| N/A 68C P0 296W / 300W | 78625MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A800 80G... Off | 00000000:9E:00.0 Off | 0 |
| N/A 68C P0 298W / 300W | 78799MiB / 81920MiB | 100% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 33369 C ...nv-py310-cu117/bin/python 78541MiB |
| 1 N/A N/A 33370 C ...nv-py310-cu117/bin/python 78575MiB |
| 2 N/A N/A 33371 C ...nv-py310-cu117/bin/python 78655MiB |
| 3 N/A N/A 33372 C ...nv-py310-cu117/bin/python 78575MiB |
| 4 N/A N/A 33373 C ...nv-py310-cu117/bin/python 78639MiB |
| 5 N/A N/A 33374 C ...nv-py310-cu117/bin/python 78627MiB |
| 6 N/A N/A 33375 C ...nv-py310-cu117/bin/python 78623MiB |
| 7 N/A N/A 33376 C ...nv-py310-cu117/bin/python 78797MiB |
+-----------------------------------------------------------------------------+
Model weights:
> tree -h /home/guodong.li/output/alpaca-lora-65b-dp
/home/guodong.li/output/alpaca-lora-65b-dp
├── [ 424] adapter_config.json
├── [ 80M] adapter_model.bin
└── [4.0K] checkpoint-6200
├── [160M] optimizer.pt
├── [ 80M] pytorch_model.bin
├── [ 14K] rng_state_0.pth
├── [ 14K] rng_state_1.pth
├── [ 14K] rng_state_2.pth
├── [ 14K] rng_state_3.pth
├── [ 14K] rng_state_4.pth
├── [ 14K] rng_state_5.pth
├── [ 14K] rng_state_6.pth
├── [ 14K] rng_state_7.pth
├── [ 557] scaler.pt
├── [ 627] scheduler.pt
├── [ 80K] trainer_state.json
└── [3.5K] training_args.bin
1 directory, 16 files
As shown above, on A800 GPUs with 8-way data parallelism and roughly 50,000 training examples, one epoch takes about 4.6 hours.
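The `trainable params` numbers in both logs can be reproduced by hand. With LoRA rank r on a square d x d projection, the adapter adds two matrices, A (r x d) and B (d x r), so 2 * r * d parameters per adapted module. The sketch below assumes the published LLaMA architecture sizes (hidden size 6656 with 60 layers for 30B, 8192 with 80 layers for 65B); these are not stated in this article, and the function name is hypothetical.

```python
def lora_trainable_params(hidden_size: int, n_layers: int, r: int = 8,
                          n_target_modules: int = 2) -> int:
    """Count LoRA parameters when r-rank adapters wrap square projections in every layer."""
    per_module = 2 * r * hidden_size  # A is (r x d), B is (d x r)
    return per_module * n_target_modules * n_layers


# (hidden_size, n_layers): published LLaMA values, assumed here.
CONFIGS = {"LLaMA-30B": (6656, 60), "LLaMA-65B": (8192, 80)}

for name, (hidden_size, n_layers) in CONFIGS.items():
    n = lora_trainable_params(hidden_size, n_layers)  # r=8 on q_proj and v_proj
    adapter_mib = n * 4 / 2 ** 20  # size if the adapter is stored in fp32
    print(f"{name}: {n} trainable params, adapter ~ {adapter_mib:.0f} MiB")
```

The counts match the logs, and the fp32 sizes agree with the 49M and 80M `adapter_model.bin` files in the directory listings, which is why the LoRA checkpoints are so small compared with the base model.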
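Merging the LoRA Weights Back into the Base Model
Next, merge the LoRA weights back into the base model so that it can be used for inference. For details on modifying export_hf_checkpoint.py, see the article "Impressive Enough: Fine-Tuning LLaMA (7B) with Alpaca-LoRA in Twenty Minutes, with Results Comparable to Stanford Alpaca".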
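Conceptually, merging folds the low-rank update into the frozen weight: W' = W + (alpha / r) * B A. The toy sketch below illustrates that identity with plain Python lists. It is not the code of export_hf_checkpoint.py (which is not shown in this article); the matrices are made-up values, and alpha / r = 2 is chosen to mirror the lora_alpha=16, lora_r=8 setting used above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy example: d = 3, r = 2 (values are illustrative only).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]    # r x d
lora_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # d x r
merged = merge_lora(w, lora_a, lora_b, alpha=4, r=2)

# The merged matrix gives the same output as base path + adapter path.
x = [[1.0], [2.0], [3.0]]
base_plus_adapter = [
    [b[0] + 2.0 * a[0]]
    for b, a in zip(matmul(w, x), matmul(lora_b, matmul(lora_a, x)))
]
print(merged)
print(matmul(merged, x) == base_plus_adapter)
```

After merging, inference needs only the single matrix W', so the adapter adds no extra latency at serving time.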
Merge run:
BASE_MODEL=/data/nfs/guodong.li/pretrain/hf-llama-model/llama-65b \
> LORA_MODEL=/home/guodong.li/output/alpaca-lora-65b-dp \
> HF_CHECKPOINT=/home/guodong.li/output/hf_65b_ckpt \
> python export_hf_checkpoint.py
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81/81 [01:15<00:00, 1.08it/s]
Merged weight files:
> tree -h hf_65b_ckpt
hf_65b_ckpt
├── [ 580] config.json
├── [ 137] generation_config.json
├── [ 537] pytorch_model-00001-of-00403.bin
├── [500M] pytorch_model-00002-of-00403.bin
├── [256M] pytorch_model-00003-of-00403.bin
├── [256M] pytorch_model-00004-of-00403.bin
├── [344M] pytorch_model-00005-of-00403.bin
├── [344M] pytorch_model-00006-of-00403.bin
├── [344M] pytorch_model-00007-of-00403.bin
...
├── [344M] pytorch_model-00400-of-00403.bin
├── [344M] pytorch_model-00401-of-00403.bin
├── [344M] pytorch_model-00402-of-00403.bin
├── [500M] pytorch_model-00403-of-00403.bin
└── [ 65K] pytorch_model.bin.index.json
0 directories, 406 files
Model Inference
Next, run inference with the merged weights. The inference code (inference.py) is shown below:
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

tokenizer_path = "/data/nfs/guodong.li/pretrain/hf-llama-model/tokenizer"
model_path = "/home/guodong.li/output/hf_65b_ckpt"  # path to the merged local model

# Load the merged weights in 8-bit; device_map="auto" lets accelerate place the layers on the available GPUs.
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
# Send inputs to the device holding the model's first parameters instead of a hard-coded GPU index.
device = model.device

print("Human:")
line = input()
while line:  # an empty line ends the session
    inputs = 'Human: ' + line.strip() + '\n\nAssistant:'
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids
    input_ids = input_ids.to(device)
    outputs = model.generate(input_ids, max_new_tokens=500, do_sample=True, top_k=30, top_p=0.85, temperature=0.5, repetition_penalty=1., eos_token_id=2, bos_token_id=1, pad_token_id=0)
    rets = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    # Strip the echoed prompt so that only the model's answer is printed.
    print("Assistant:\n" + rets[0].strip().replace(inputs, ""))
    print("\n------------------------------------------------\nHuman:")
    line = input()
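The sampling arguments passed to `generate` (temperature=0.5, top_k=30, top_p=0.85) shape which tokens can be drawn at each step. The following is a simplified, pure-Python sketch of what those three knobs mean; it is not the transformers implementation, and the function names and the toy logits are hypothetical.

```python
import math
import random


def filter_probs(logits, temperature=0.5, top_k=30, top_p=0.85):
    """Return {token_id: prob} after temperature scaling, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (shifted by the max for numerical stability).
    top = scaled[ranked[0]]
    weights = {i: math.exp(scaled[i] - top) for i in ranked}
    total = sum(weights.values())
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = {}, 0.0
    for i in ranked:
        kept[i] = weights[i] / total
        cumulative += kept[i]
        if cumulative >= top_p:
            break
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}


def sample_token(logits, **kwargs):
    """Draw one token id from the filtered distribution."""
    probs = filter_probs(logits, **kwargs)
    return random.choices(list(probs), weights=list(probs.values()))[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
print(filter_probs(toy_logits))
print(sample_token(toy_logits, top_p=0.95))
```

With a low temperature and a moderate top_p, most of the probability mass sits on one or two tokens, which is why the answers in the transcript below read as fairly deterministic.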
Run the inference code:
> python inference.py
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/rh/devtoolset-9/root/usr/lib/dyninst'), PosixPath('/opt/rh/devtoolset-7/root/usr/lib/dyninst')}
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/guodong.li/virtual-venv/alpara-lora-venv-py310-cu117/lib/python3.10/site-packages/bitsandbytes-0.37.2-py3.10.egg/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 402/402 [01:48<00:00, 3.72it/s]
Human:
What are the five characteristics of a good argument?
Assistant:
A good argument should be clear, concise, logical, supported by evidence, and respectful of the opposing view.
------------------------------------------------
Human:
Generate a list of ten common idioms related to animals.
Assistant:
1. "Like a fish out of water" 2. "Birds of a feather flock together" 3. "Let the cat out of the bag" 4. "Herding cats" 5. "Barking up the wrong tree" 6. "Sly as a fox" 7. "A lion's share" 8. "A bird in the hand is worth two in the bush" 9. "A wolf in sheep's clothing" 10. "A wild goose chase".
------------------------------------------------
Human:
Evaluate the following expression: (6+2)*(2-2).
Assistant:
10.
------------------------------------------------
Human:
Compute the derivative of 3x^3 + 10x.
Assistant:
The derivative of 3x^3 + 10x is 9x^2 + 10.
------------------------------------------------
Human:
GPU memory usage:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A800 80G... Off | 00000000:34:00.0 Off | 0 |
| N/A 44C P0 69W / 300W | 66927MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
...
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A800 80G... Off | 00000000:9E:00.0 Off | 0 |
| N/A 47C P0 71W / 300W | 7224MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 43499 C python 66925MiB |
| 1 N/A N/A 43499 C python 949MiB |
...
| 7 N/A N/A 43499 C python 949MiB |
+-----------------------------------------------------------------------------+
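As shown above, even though the model is loaded in 8-bit (load_in_8bit=True), a single GPU still uses more than 60 GB of memory. If hardware resources are limited, consider model-parallel inference; see the tensor_parallel and FasterTransformer projects for running LLaMA inference with model parallelism. Model parallelism is also worth using for models with tens of billions of parameters or more simply to improve inference speed and throughput.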
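As a toy illustration of the idea behind model (tensor) parallelism, the sketch below splits the rows of one weight matrix across several "devices", computes each slice of the output independently, and concatenates the results. It only demonstrates the arithmetic; it is not how tensor_parallel or FasterTransformer are implemented, and all names and values are hypothetical.

```python
def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in w]


def split_rows(w, n_shards):
    """Split the output rows of w into n_shards contiguous pieces of near-equal size."""
    base, extra = divmod(len(w), n_shards)
    shards, start = [], 0
    for shard_index in range(n_shards):
        size = base + (1 if shard_index < extra else 0)
        shards.append(w[start:start + size])
        start += size
    return shards


def parallel_matvec(w, x, n_shards):
    """Compute w @ x shard by shard; each shard would live on its own GPU."""
    out = []
    for shard in split_rows(w, n_shards):
        out.extend(matvec(shard, x))  # concatenation stands in for an all-gather
    return out


w = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0]
print(parallel_matvec(w, x, n_shards=2) == matvec(w, x))
```

Because each device holds only its own slice of the weights, the per-GPU memory footprint shrinks roughly in proportion to the number of shards, at the cost of communication to reassemble the output.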
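Conclusion
This article covered training and inference for LLaMA 30B/65B using LoRA, a parameter-efficient fine-tuning technique. I hope you find it helpful.
References:
Reproducing Stanford Alpaca 7B from Scratch
Impressive Enough: Fine-Tuning LLaMA (7B) with Alpaca-LoRA in Twenty Minutes, with Results Comparable to Stanford Alpaca
Alpaca-LoRA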