Experiment notes: LLaMA-Factory on Huawei Ascend NPUs

How to tell whether the model you have chosen is supported
LLaMA-Factory/src/llamafactory/data/template.py
This file in the project contains the chat templates for the supported models.

Here I use the two models I work with most as examples. The first is Zhipu's glm4-9B; the second, qwen, is shown further below.

_register_template(
    name="glm4",
    format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>"]),
    format_assistant=StringFormatter(slots=["\n{{content}}"]),
    format_system=StringFormatter(slots=["<|system|>\n{{content}}"]),
    format_function=FunctionFormatter(slots=["{{name}}\n{{arguments}}"]),
    format_observation=StringFormatter(slots=["<|observation|>\n{{content}}<|assistant|>"]),
    format_tools=ToolFormatter(tool_format="glm4"),
    format_prefix=EmptyFormatter(slots=["[gMASK]<sop>"]),
    stop_words=["<|user|>", "<|observation|>"],
    efficient_eos=True,
)

This code registers a chat template named "glm4". Each parameter is explained below.

`_register_template(...)`

This is the function that registers the template under the name "glm4".

Parameters:

name="glm4":

The name of the template, i.e. the value used for `template` in the training config.

format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>"]):

format_user formats a user turn. {{content}} is replaced by the user's message, which is placed after the <|user|> marker and followed by the <|assistant|> marker that opens the model's reply.

format_assistant=StringFormatter(slots=["\n{{content}}"]):

format_assistant formats the assistant's reply: a newline followed by the reply text.

format_system=StringFormatter(slots=["<|system|>\n{{content}}"]):

format_system formats the system prompt, which is placed after the <|system|> marker.

format_function=FunctionFormatter(slots=["{{name}}\n{{arguments}}"]):

format_function formats a function (tool) call: the {{name}} slot holds the function name and the {{arguments}} slot holds its arguments.

format_observation=StringFormatter(slots=["<|observation|>\n{{content}}<|assistant|>"]):

format_observation formats an observation (a tool result returned to the model), placed between the <|observation|> and <|assistant|> markers.

format_tools=ToolFormatter(tool_format="glm4"):

format_tools formats the tool descriptions, using the "glm4" tool format.

format_prefix=EmptyFormatter(slots=["[gMASK]<sop>"]):

format_prefix is the fixed prefix placed at the start of the sequence, here "[gMASK]<sop>".

stop_words=["<|user|>", "<|observation|>"]:

stop_words lists the tokens at which generation should stop. The list is not empty: it contains <|user|> and <|observation|>, which matches the line "Add <|user|>,<|observation|> to stop words." in the training log further down.

efficient_eos=True:

efficient_eos is a boolean flag. EOS means end of sequence (not end of sentence); the flag controls how the EOS token is appended to the formatted sequences. The exact behavior is defined in template.py.

In short, this code defines the "glm4" template: a set of formatters and settings for the user, assistant, system, function-call, observation, tool and prefix segments, so that every conversation is laid out the same way for training and inference.
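
To make the slot mechanism concrete, here is a minimal sketch that fills the glm4 slot strings by plain string replacement. It is a simplified stand-in written for these notes, not LLaMA-Factory's formatter classes; the helper names `fill` and `render_glm4` are made up for the illustration.

```python
# Simplified illustration of how the glm4 slots combine into one prompt.
# Not LLaMA-Factory code: fill() and render_glm4() are hypothetical helpers.

PREFIX = "[gMASK]<sop>"
USER_SLOT = "<|user|>\n{{content}}<|assistant|>"
ASSISTANT_SLOT = "\n{{content}}"
SYSTEM_SLOT = "<|system|>\n{{content}}"


def fill(slot, content):
    """Replace the {{content}} placeholder in a slot string."""
    return slot.replace("{{content}}", content)


def render_glm4(messages, system=None):
    """Render a list of (role, text) pairs using the glm4 slot strings."""
    parts = [PREFIX]
    if system:
        parts.append(fill(SYSTEM_SLOT, system))
    for role, text in messages:
        slot = USER_SLOT if role == "user" else ASSISTANT_SLOT
        parts.append(fill(slot, text))
    return "".join(parts)


if __name__ == "__main__":
    print(repr(render_glm4([("user", "hi"), ("assistant", "Hello!")])))
```

The rendered segments appear in the same order as the decoded `inputs` in the training log later in these notes: prefix, user marker, user text, assistant marker, reply.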

_register_template(
    name="qwen",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    format_observation=StringFormatter(slots=["<|im_start|>tool\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_separator=EmptyFormatter(slots=["\n"]),
    default_system="You are a helpful assistant.",
    stop_words=["<|im_end|>"],
    replace_eos=True,
)

As far as I can tell, all qwen models in LLaMA-Factory currently use this one template.

The data structures I currently use in each of the three stages, reduced to their simplest form
Pre-training data structure

{"text":""}

The matching entry to add to dataset_info.json:

"pre_dataset_name": {
  "file_name": "path of the pre-training data file under the data directory",
  "columns": {
    "prompt": "text"
  }
}
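
As a rough sketch of how such a file and its dataset entry could be produced, the snippet below writes one {"text": ...} object per line and builds an entry with the same shape as the one above. The file name `pretrain_corpus.jsonl` and both helper names are assumptions made for this illustration.

```python
import json
import os
import tempfile


def write_pretrain_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def make_pretrain_entry(dataset_name, file_name):
    """Build a dataset entry with the same shape as the one shown above."""
    return {dataset_name: {"file_name": file_name, "columns": {"prompt": "text"}}}


if __name__ == "__main__":
    # Hypothetical file name, written to a temporary directory for the demo.
    path = os.path.join(tempfile.mkdtemp(), "pretrain_corpus.jsonl")
    write_pretrain_jsonl(["first document", "second document"], path)
    entry = make_pretrain_entry("pre_dataset_name", "pretrain_corpus.jsonl")
    print(json.dumps(entry, indent=2, ensure_ascii=False))
```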

Fine-tuning data structure

{"input_colum": "根据:TWY:滑行道;BTN:在……之间;TWY:滑行道;AND:与;TWY:滑行道;AVBL:可供使用;FOR:为了;OPS.:作业、运行、经营、操作、运转;DRG:在……期间;FLW:如下，以下;TWY:滑行道;FOR:为了;ACFT:航空器;ACFT:航空器;IN:在;APN:停机坪;FOR:为了;ACFT:航空器;ONLY.:只能;AND:与;ACFT:航空器;ON:在;RWY:跑道，逐词翻译：PORTIONOFTWYMBTNTWYLINK31ANDTWYLINK32NOTAVBLFOROPS.\nDRGTHISPERIODFLWRESTRICTIONSSHALLAPPLY:\n1.COMPATIBILITYOFTWYKRESTRICTEDFORACFTUPTOWINGSPAN68.40M.\n2.ACFTSTAND265INCARGOAPNDOWNGRADEDFORACFTUPTOWINGSPAN68.40MONLY.\n3.MOVEMENTOFA388ANDAN124ACFTONRWY10/28NOTPERMITED.","output_colum": "<部分:PORTION:0> <的:OF:1> <滑行道:TWY:2> <M:M:3> <在:BTN:4.1> <之间:BTN:4.2> <滑行道:TWY:5> <连接:LINK:6> <31:31:7> <与:AND:8> <滑行道:TWY:9> <连接:LINK:10> <32:32:11> <不可用:NOT AVBL:12> <因为:FOR:13> <运行:OPS:14> <.:.:15> <在……期间:DRG:16> <这个:THIS:17> <时期:PERIOD:18> <如下，以下:FLW:19> <限制:RESTRICTIONS:20> <应该:SHALL:21> <适用:APPLY:22> <::::23> <1:1:24> <.:.:25> <兼容:COMPATIBILITY:26> <的:OF:27> <滑行道:TWY:28> <K:K:29> <被限制:RESTRICTED:30> <对于:FOR:31> <航空器:ACFT:32> <到:UPTO:33> <翼展:WINGSPAN:34> <68.40M:68.40M:35> <.:.:36> <2:2:37> <.:.:38> <航空器:ACFT:39> <停在:STAND:40> <265:265:41> <在:IN:42> <货物:CARGO:43> <停机坪:APN:44> <降级:DOWNGRADED:45> <对于:FOR:46> <航空器的:ACFT:47> <到:UPTO:48> <翼展:WING SPAN:49> <68.40M:68.40M:50> <只能:ONLY:51> <.:.:52> <3:3:53> <.:.:54> <移动:MOVEMENT:55> <的:OF:56> <A388:A388:57> <与:AND:58> <AN124:AN124:59> <航空器:ACFT:60> <在:ON:61> <跑道:RWY:62> <10/28:10/28:63> <不:NOT:64> <被允许:PERMITED:65> <.:.:66> "}

The matching entry in dataset_info.json:

"sft_dataset_name": {
  "file_name": "path of the fine-tuning data file under the data directory",
  "columns": {
    "query": "input_colum",
    "response": "output_colum"
  }
}

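Because the dataset is fairly large I use JSONL; with this much data, a single JSON file led to errors during the run.
Compared with older versions of LLaMA-Factory, the new version adds parallel tokenization (the config below sets preprocessing_num_workers: 16), so preprocessing is faster.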
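
If the fine-tuning data starts out as one large JSON array, converting it to JSONL is straightforward. The sketch below is one possible way to do it; it also checks that every record carries the two columns mapped above. The function name and the paths passed to it are hypothetical.

```python
import json


def json_array_to_jsonl(src_path, dst_path, required=("input_colum", "output_colum")):
    """Convert a JSON array file to JSONL, checking the required columns.

    Returns the number of records written and raises KeyError on a record
    that lacks one of the required columns.
    """
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)  # the array is loaded once, only for conversion
    count = 0
    with open(dst_path, "w", encoding="utf-8") as out:
        for i, rec in enumerate(records):
            missing = [k for k in required if k not in rec]
            if missing:
                raise KeyError(f"record {i} is missing columns: {missing}")
            out.write(json.dumps(rec, ensure_ascii=False) + "\n")
            count += 1
    return count
```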

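With the data prepared, we move on to the training command. Note one detail: the pre-training dataset is named pre_dataset_name and the fine-tuning dataset is named sft_dataset_name. My environment is in mainland China, so we need one setting that makes the model download go through the ModelScope community.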

export USE_MODELSCOPE_HUB=1 # On Windows use `set USE_MODELSCOPE_HUB=1`

After enabling ModelScope, remember to install the modelscope package:

pip install modelscope -U

Here we use Huawei's NPU in a fairly old-fashioned way: through the torch-npu module.
Before training, here is a short explanation of the training modes that LLaMA-Factory supports:

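1. dpo (preference training): short for Direct Preference Optimization. The model is trained directly on preference pairs (a chosen and a rejected response) without a separate reward model or reinforcement-learning loop.

2. kto (preference training): short for Kahneman-Tversky Optimization. A preference-alignment method that learns from single responses labeled as desirable or undesirable instead of from pairs.

3. ppo (reinforcement training): short for Proximal Policy Optimization. A reinforcement-learning algorithm that optimizes the expected reward of a policy; in RLHF it updates the language model against a reward model.

4. pt (pre-training): short for Pre-training. The model is trained on a large corpus so that it learns general language representations, which can then be fine-tuned for downstream tasks.

5. rm (reward model training): short for Reward Modeling. A reward model is trained on preference data to score responses; that score is the feedback signal used later by ppo.

6. sft (fine-tuning): short for Supervised Fine-Tuning. A pre-trained model is fine-tuned on labeled data for a specific task to improve its performance on that task.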
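
For quick reference when writing a config, the six stage values can be kept in a small lookup table. This is only a convenience sketch; `describe_stage` is a hypothetical helper and not part of LLaMA-Factory.

```python
# Stage values used in the `stage` field of the training config.
STAGES = {
    "pt": "pre-training",
    "sft": "supervised fine-tuning",
    "rm": "reward modeling",
    "ppo": "proximal policy optimization",
    "dpo": "direct preference optimization",
    "kto": "Kahneman-Tversky optimization",
}


def describe_stage(stage):
    """Return a readable description, or raise ValueError for an unknown stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(STAGES)}")
    return f"{stage}: {STAGES[stage]}"


if __name__ == "__main__":
    print(describe_stage("pt"))
```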

Step one: set up the config file for pre-training. I recommend glm4 here.

### model
model_name_or_path: ZhipuAI/glm-4-9b

### method
stage: pt
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: pre_dataset_name
template: glm4
cutoff_len: 4096
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/glm-4-9b/full/pt
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

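Here the training stage is set to pt, i.e. pre-training. On the OpenI platform you can select at most four 910 cards for training, i.e. 4 × 32 GB of device memory, which I expected to be enough for pre-training.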

If you need better pre-training results, you can tune the following parameters.

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1

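This part of the config file sets the concrete training parameters:

train

per_device_train_batch_size: the batch size on each training device (GPU, TPU, or in this case NPU). It is set to 1, so each device processes one sample per training step. A smaller batch size lowers memory use but may need more training steps to converge.

gradient_accumulation_steps: the number of steps over which gradients are accumulated before a weight update. It is set to 2, so the weights are updated after two steps of accumulated gradients. This simulates a larger batch size without increasing memory use.

learning_rate: the learning rate determines how fast the model parameters are updated. It is set to 1.0e-4 (0.0001), a common initial value. Too high a learning rate can make training unstable; too low a learning rate makes training slow.

num_train_epochs: the number of full passes over the training data. It is set to 3.0, so the model sees the whole training set three times. More epochs can improve performance but can also lead to overfitting.

lr_scheduler_type: the learning-rate scheduler adjusts the learning rate during training. It is set to "cosine", so the learning rate follows a cosine curve: relatively high early in training and decreasing gradually afterwards.

warmup_ratio: the fraction of training used for learning-rate warmup. It is set to 0.1, so during the first 10% of training the learning rate rises from 0 to the initial learning rate. Warmup helps stabilize learning early in training.

Together these parameters determine the efficiency and quality of training. Tuning them helps optimize model performance while keeping the training process effective and stable.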
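
To see how these values interact, the sketch below computes the effective batch size and the learning rate at a given step under linear warmup followed by cosine decay. It illustrates the standard schedule shape and is not the Trainer's exact code; the function names and the choice of 4 devices are assumptions based on the four-card setup mentioned above.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices):
    """Number of samples that contribute to one optimizer step."""
    return per_device * grad_accum * num_devices


def lr_at_step(step, total_steps, base_lr=1.0e-4, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size(1, 2, 4))
    for s in (0, 50, 100, 550, 1000):
        print(s, lr_at_step(s, total_steps=1000))
```

With per_device_train_batch_size 1, gradient_accumulation_steps 2 and four cards, each optimizer step sees 1 × 2 × 4 = 8 samples.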
Now we install the LLaMA-Factory training framework for the NPU.
Step one: install the NPU variant of LLaMA-Factory

pip install -e '.[torch-npu,metrics]'

Step two: install the NPU environment

# Replace the URLs with the ones matching your CANN version and device model
# Install the CANN Toolkit
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install

# Install the CANN Kernels
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install

# Set the environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh

Step three: I hit a small bug during installation. I do not have root access on the cloud platform, so I used conda to install the missing dependency.

conda install -c conda-forge libsndfile

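On a single machine with multiple cards, DeepSpeed ZeRO-3 is more efficient than plain multi-card data parallelism, mainly in memory: it shards parameters, gradients and optimizer states across the cards.
Step four: install deepspeed.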

pip install deepspeed
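
Before launching training it can save time to confirm that the packages installed in the steps above are visible to the Python interpreter. The sketch below only checks importability; it does not verify that the NPU driver or CANN actually work. `check_packages` is a hypothetical helper name.

```python
import importlib.util


def check_packages(names=("torch", "torch_npu", "deepspeed", "modelscope")):
    """Return {package_name: bool}, True when the package can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```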

Next, run the command to start training:

 llamafactory-cli train LLaMA-Factory/examples/train_full/glm4_full_pt_ds3.yaml

This is what the log looks like when training starts up successfully:

[2024-07-09 09:10:46,149] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to npu (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
07/09/2024 09:11:02 - INFO - llamafactory.hparams.parser - Process rank: 0, device: npu:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.35k/1.35k [00:00<00:00, 4.44kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36.0/36.0 [00:00<00:00, 86.4B/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.21k/2.21k [00:00<00:00, 5.46kB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 205/205 [00:00<00:00, 451B/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.34k/6.34k [00:00<00:00, 19.5kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.81G/1.81G [01:23<00:00, 23.2MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [01:36<00:00, 18.8MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [01:36<00:00, 18.8MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.83G/1.83G [01:18<00:00, 25.1MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.80G/1.80G [01:19<00:00, 24.1MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [01:15<00:00, 24.0MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.83G/1.83G [01:22<00:00, 24.0MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.80G/1.80G [01:12<00:00, 26.4MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [01:03<00:00, 28.6MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.83G/1.83G [01:10<00:00, 27.8MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.54G/1.54G [01:00<00:00, 27.1MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.4k/28.4k [00:00<00:00, 65.7kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57.1k/57.1k [00:00<00:00, 100kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.34k/3.34k [00:00<00:00, 11.8kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.78k/3.78k [00:00<00:00, 12.2kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15.3k/15.3k [00:00<00:00, 28.9kB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.50M/2.50M [00:00<00:00, 3.07MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.12k/3.12k [00:00<00:00, 9.51kB/s]
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 09:24:48,669 >> loading file tokenizer.json
[WARNING|logging.py:313] 2024-07-09 09:24:49,392 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/09/2024 09:24:49 - INFO - llamafactory.data.template - Add <|user|>,<|observation|> to stop words.
07/09/2024 09:24:49 - INFO - llamafactory.data.loader - Loading dataset identity.json...
Generating train split: 91 examples [00:00, 1770.27 examples/s]
Converting format of dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 91/91 [00:00<00:00, 187.32 examples/s]
07/09/2024 09:25:01 - INFO - llamafactory.data.loader - Loading dataset alpaca_en_demo.json...
Generating train split: 1000 examples [00:00, 19614.77 examples/s]
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2385.01 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1091/1091 [00:43<00:00, 24.98 examples/s]
input_ids:
[151331, 151333, 151336, 198, 6023, 151337, 198, 9703, 0, 358, 1079, 5867, 606, 37953, 458, 15223, 17821, 7881, 553, 5867, 3094, 3417, 13, 2585, 646, 358, 7789, 498, 3351, 30, 151329]
inputs:
[gMASK] <sop> <|user|> 
hi <|assistant|> 
Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today? <|endoftext|>
label_ids:
[-100, -100, -100, -100, -100, -100, 198, 9703, 0, 358, 1079, 5867, 606, 37953, 458, 15223, 17821, 7881, 553, 5867, 3094, 3417, 13, 2585, 646, 358, 7789, 498, 3351, 30, 151329]
labels:

Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today? <|endoftext|>
[INFO|configuration_utils.py:731] 2024-07-09 09:26:01,831 >> loading configuration file /root/.cache/modelscope/hub/ZhipuAI/glm-4-9b/config.json
[INFO|configuration_utils.py:731] 2024-07-09 09:26:01,844 >> loading configuration file /root/.cache/modelscope/hub/ZhipuAI/glm-4-9b/config.json
[INFO|configuration_utils.py:800] 2024-07-09 09:26:01,846 >> Model config ChatGLMConfig {
  "_name_or_path": "/root/.cache/modelscope/hub/ZhipuAI/glm-4-9b",
  "add_bias_linear": false,
  "add_qkv_bias": true,
  "apply_query_key_layer_scaling": true,
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "ChatGLMModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForCausalLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSequenceClassification": "modeling_chatglm.ChatGLMForSequenceClassification"
  },
  "bias_dropout_fusion": true,
  "classifier_dropout": null,
  "eos_token_id": [
    151329,
    151336,
    151338
  ],
  "ffn_hidden_size": 13696,
  "fp32_residual_connection": false,
  "hidden_dropout": 0.0,
  "hidden_size": 4096,
  "kv_channels": 128,
  "layernorm_epsilon": 1.5625e-07,
  "model_type": "chatglm",
  "multi_query_attention": true,
  "multi_query_group_num": 2,
  "num_attention_heads": 32,
  "num_layers": 40,
  "original_rope": true,
  "pad_token_id": 151329,
  "padded_vocab_size": 151552,
  "post_layer_norm": true,
  "rmsnorm": true,
  "rope_ratio": 1,
  "seq_length": 8192,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.3",
  "use_cache": true,
  "vocab_size": 151552
}

[INFO|modeling_utils.py:3553] 2024-07-09 09:26:01,975 >> loading weights file /root/.cache/modelscope/hub/ZhipuAI/glm-4-9b/model.safetensors.index.json
[INFO|modeling_utils.py:3698] 2024-07-09 09:26:01,976 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[2024-07-09 09:26:01,979] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-09 09:26:01,979] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...

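Reading the log
The log records the startup of a model training run. Going through it part by part:

INFO and WARNING lines:

Setting ds_accelerator to npu (auto detect): DeepSpeed auto-detected the NPU (neural processing unit) and uses it as the accelerator. async_io requires the dev libaio .so object and headers but these were not found.: a warning that the libaio library is missing, which may affect asynchronous I/O.

If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.: a hint that, if libaio is already installed, setting the CFLAGS and LDFLAGS environment variables lets it be located.

Downloading and dataset processing:

The many Downloading lines are the model files being fetched (the config is loaded afterwards from the ModelScope cache directory). The Converting format of dataset and Running tokenizer on dataset lines show the datasets being converted and tokenized. Both are routine steps before training.

Model config and loading:

The model config lists the model's settings, such as the number of layers and the hidden size. loading weights file /root/.cache/modelscope/hub/ZhipuAI/glm-4-9b/model.safetensors.index.json means the model weights are being loaded. Detected DeepSpeed ZeRO-3: activating zero.init() for this model means DeepSpeed ZeRO-3 is in use, a technique that reduces memory use during training.

MPI environment detection:

Not using the DeepSpeed or dist launchers, attempting to detect MPI environment... means the run was not started with the DeepSpeed or distributed launcher, so DeepSpeed tries to detect an MPI environment.

In summary, the log records the start of a training run on an NPU, covering the model download, dataset processing, model config and loading, plus a few environment warnings and hints.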
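
The log also prints `input_ids` next to `label_ids`. The first six labels are -100, the value that the loss skips, so only the assistant's reply contributes to the loss. A minimal sketch of that masking follows; `mask_prompt` is a hypothetical helper, and the token ids are copied from the log above.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def mask_prompt(input_ids, prompt_len):
    """Copy input_ids into labels, masking the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


if __name__ == "__main__":
    # Token ids taken from the log; the first six cover the prefix,
    # the user turn and the assistant marker.
    input_ids = [151331, 151333, 151336, 198, 6023, 151337, 198, 9703, 0, 358,
                 1079, 5867, 606, 37953, 458, 15223, 17821, 7881, 553, 5867,
                 3094, 3417, 13, 2585, 646, 358, 7789, 498, 3351, 30, 151329]
    print(mask_prompt(input_ids, prompt_len=6))
```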
An error was reported here: detecting the MPI environment failed, so I installed MPI manually.

conda install -c conda-forge mpi4py openmpi

The installation printed the following message. Let's read through it.

On Linux, Open MPI is built with UCX support but it is disabled by default.                                                                                                              
To enable it, first install UCX (conda install -c conda-forge ucx).                                                                                                                      
Afterwards, set the environment variables                                                                                                                                                
OMPI_MCA_pml=ucx OMPI_MCA_osc=ucx                                                                                                                                                        
before launching your MPI processes.                                                                                                                                                     
Equivalently, you can set the MCA parameters in the command line:
mpiexec --mca pml ucx --mca osc ucx ...


On Linux, Open MPI is built with CUDA awareness but it is disabled by default.
To enable it, please set the environment variable
OMPI_MCA_opal_cuda_support=true
before launching your MPI processes.
Equivalently, you can set the MCA parameter in the command line:
mpiexec --mca opal_cuda_support 1 ...
Note that you might also need to set UCX_MEMTYPE_CACHE=n for CUDA awareness via
UCX. Please consult UCX documentation for further details.


done

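This message explains how to enable UCX (Unified Communication X) support and CUDA (Compute Unified Device Architecture) awareness in Open MPI on Linux. UCX is a high-performance communication library that supports several transports (InfiniBand, RoCE, TCP/IP and so on), and CUDA is NVIDIA's parallel computing platform and programming model.
The message breaks down as follows:

Enabling UCX support: Open MPI on Linux is built with UCX support, but it is disabled by default. To enable it, first install UCX, for example with conda install -c conda-forge ucx. Then, before launching the MPI processes, set the environment variables OMPI_MCA_pml=ucx and OMPI_MCA_osc=ucx. Alternatively, pass the MCA parameters on the command line: mpiexec --mca pml ucx --mca osc ucx ....

Enabling CUDA awareness: Open MPI on Linux is also built with CUDA awareness, again disabled by default. To enable it, set the environment variable OMPI_MCA_opal_cuda_support=true, or pass the MCA parameter on the command line: mpiexec --mca opal_cuda_support 1 .... When CUDA awareness goes through UCX, UCX_MEMTYPE_CACHE=n may also be needed; see the UCX documentation for details. CUDA awareness concerns NVIDIA GPUs, so it does not apply to the NPU setup used here.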

I thought I could wrap up early. Heh, a new problem popped up.

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/MindSpore/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/tmp/code/LLaMA-Factory/src/llamafactory/cli.py", line 110, in main
    run_exp()
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/loader.py", line 151, in load_model
    model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3710, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 928, in __init__
    self.transformer = ChatGLMModel(config, empty_init=empty_init, device=device)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 852, in __init__
    self.rotary_pos_emb = RotaryEmbedding(rotary_dim // 2, rope_ratio=config.rope_ratio,
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
    f(module, *args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b/modeling_chatglm.py", line 96, in __init__
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
RuntimeError: call aclnnCast failed, detail:EZ1001: 2024-07-09-09:38:52.309.843 The param dtype not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT16,DT_FLOAT,DT_DOUBLE,DT_INT8,DT_UINT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT16,DT_UINT32,DT_UINT64,DT_BOOL,DT_COMPLEX64,DT_COMPLEX128,].

[ERROR] 2024-07-09-09:38:52 (PID:17196, Device:0, RankID:0) ERR01005 OPS internal error

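First, reading the exception.
This is a Python traceback from loading the model under DeepSpeed ZeRO-3 with PyTorch and torch_npu (MindSpore in the paths is only the name of the conda environment, not the framework in use). A runtime error during model initialization prevents the model from being created.
The failure happens while the RotaryEmbedding object is constructed in modeling_chatglm.py: the line that computes inv_freq casts a tensor with .to(dtype=dtype), and the NPU operator rejects that cast.
The key part of the error is: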

RuntimeError: call aclnnCast failed, detail:EZ1001: 2024-07-09-09:38:52.309.843 The param dtype not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT16,DT_FLOAT,DT_DOUBLE,DT_INT8,DT_UINT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT16,DT_UINT32,DT_UINT64,DT_BOOL,DT_COMPLEX64,DT_COMPLEX128,].

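The message says that the aclnnCast operator on this device does not implement DT_BFLOAT16; the list shows the dtypes it does support, which include DT_FLOAT16 and DT_FLOAT. The model config printed earlier declares "torch_dtype": "bfloat16", which is the likely source of the bf16 request. [ERROR] 2024-07-09-09:38:52 (PID:17196, Device:0, RankID:0) ERR01005 OPS internal error
is the generic error reported by the NPU operator layer for the failed call.
Possible ways to resolve it: make sure nothing in the setup requests bf16 (model dtype, training arguments, DeepSpeed config) and use fp16 or fp32 instead; check the CANN and torch_npu documentation to see whether the device and software version support bf16.
Because this depends on the specific code and library configuration, the most direct route is to ask the project developers or the community, who may have a more specific fix or workaround.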
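
Since the operator rejects DT_BFLOAT16, one option is to make sure the DeepSpeed config does not request bf16. As a sketch, assuming the config uses the standard `bf16` and `fp16` sections with an `enabled` key, a config dict could be patched as below. `force_fp16` is a hypothetical helper, and this change alone may not be enough because the model's own config also declares bfloat16.

```python
import copy


def force_fp16(ds_config):
    """Return a copy of a DeepSpeed config dict with bf16 off and fp16 on."""
    cfg = copy.deepcopy(ds_config)
    cfg.setdefault("bf16", {})["enabled"] = False
    cfg.setdefault("fp16", {})["enabled"] = True
    return cfg


if __name__ == "__main__":
    # Hypothetical minimal config used only for the demo.
    original = {"bf16": {"enabled": "auto"}, "zero_optimization": {"stage": 3}}
    print(force_fp16(original))
```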

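Well, this version of the 910 chip does not support DT_BFLOAT16, so we have to change the deepspeed config file. Then came the most discouraging part: near the end of writing this up, I ran into a problem I could not get past.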

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/MindSpore/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/tmp/code/LLaMA-Factory/src/llamafactory/cli.py", line 110, in main
    run_exp()
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 88, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 3307, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/trainer.py", line 3338, in compute_loss
    outputs = model(**inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1221, in forward
    outputs = self.model(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1012, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/model_utils/checkpointing.py", line 65, in custom_gradient_checkpointing_func
    return gradient_checkpointing_func(func, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 763, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 257, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py", line 111, in zero3_linear_wrap
    return LinearFunctionForZeroStage3.apply(input, weight, bias)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch_npu/npu/amp/autocast_mode.py", line 113, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/deepspeed/runtime/zero/linear.py", line 59, in forward
    output += bias
RuntimeError: call aclnnInplaceAdd failed, detail:EZ1001: 2024-07-09-10:40:00.116.800 the size of tensor selfRef [1,120] must match the size of tensor other [0].
        TraceBack (most recent call last):
        120 and 0 cannot broadcast.
        the size of tensor selfRef [1,120] must match the size of tensor other [0].

[ERROR] 2024-07-09-10:40:00 (PID:21727, Device:0, RankID:0) ERR01005 OPS internal error

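After switching to qwen2-7B for fine-tuning, a tensor mismatch appeared. Reading the exception log:
The error comes from a shape mismatch in a tensor operation. The message the size of tensor selfRef [1,120] must match the size of tensor other [0] says that in the aclnnInplaceAdd call one tensor has shape [1,120] and the other has shape [0], so they cannot be broadcast. The failing line is output += bias in DeepSpeed's ZeRO-3 linear wrapper (deepspeed/runtime/zero/linear.py), so the tensor of shape [0] is the bias. Under ZeRO-3, parameters are partitioned and an empty placeholder remains until they are gathered, so an empty bias here points to the parameter not being gathered on this backend rather than to the input data; this is a likely explanation, not a confirmed one. A detailed reading and possible fixes follow: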

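Reading the traceback

Main entry point:

sys.exit(main())

Execution starts from the main function.

Running the experiment:

run_exp()

The experiment is run inside run_exp.

Running SFT (Supervised Fine-Tuning):

run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)

run_sft performs supervised fine-tuning of the model.

Training the model:

train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)

trainer.train runs the training and can resume from a checkpoint.

Training step:

tr_loss_step = self.training_step(model, inputs)

training_step performs a single training step.

Computing the loss:

loss = self.compute_loss(model, inputs)

The model's loss is computed.

Model forward pass:

outputs = model(**inputs)

The model's forward pass is executed.

Internal library calls:
From here the trace passes through several internal calls (gradient checkpointing, the attention layer's q_proj, DeepSpeed's ZeRO-3 linear wrapper) and finally fails in aclnnInplaceAdd: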

RuntimeError: call aclnnInplaceAdd failed, detail:EZ1001: 2024-07-09-10:40:00.116.800 the size of tensor selfRef [1,120] must match the size of tensor other [0].

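Possible fixes

Check the input data:

Make sure the shape and size of the input data are correct. In particular, confirm in the preprocessing step that no data is lost and no shapes are mismatched.

Check the model config:

Check the model config, especially whether the input and output dimensions of the linear layers (such as self.q_proj) match the data.

Check custom functions:

If a custom gradient checkpointing function such as custom_gradient_checkpointing_func is involved, make sure it is implemented correctly and does not change the shape of the inputs.

Update libraries and frameworks:

Make sure the libraries in use (transformers, torch, deepspeed and so on) are up to date, since newer versions may include bug fixes and improvements.

Debug output:

Add debug output at the key steps of the forward pass and print tensor shapes to locate where and why the error happens.

For this particular error, start by checking the shape of hidden_states, the input to self.q_proj, and print the shapes of the tensors involved just before the failure to confirm that their dimensions match. If the problem persists, simplify the code further and debug step by step to find the exact cause.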
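
The rule that fails here is ordinary broadcasting: comparing shapes from the last dimension backwards, each pair of sizes must be equal or one of them must be 1. A small pure-Python sketch of that rule (`can_broadcast` is a hypothetical helper) shows why a tensor of shape [0] cannot be added to one of shape [1,120]:

```python
def can_broadcast(shape_a, shape_b):
    """Check whether two shapes are compatible under broadcasting rules."""
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True


if __name__ == "__main__":
    print(can_broadcast((1, 120), (0,)))    # the failing case from the traceback
    print(can_broadcast((1, 120), (120,)))  # a bias with the expected size
```

The first call returns False because 120 and 0 differ and neither is 1; the second returns True.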
Next, I removed the deepspeed entry from the config.
The following exception occurred:

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/MindSpore/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/tmp/code/LLaMA-Factory/src/llamafactory/cli.py", line 110, in main
    run_exp()
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/loader.py", line 160, in load_model
    model = init_adapter(config, model, model_args, finetuning_args, is_trainable)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/adapter.py", line 306, in init_adapter
    _setup_full_tuning(model, model_args, finetuning_args, is_trainable, cast_trainable_params_to_fp32)
  File "/tmp/code/LLaMA-Factory/src/llamafactory/model/adapter.py", line 59, in _setup_full_tuning
    param.data = param.data.to(torch.float32)
RuntimeError: NPU out of memory. Tried to allocate 2.03 GiB (NPU 0; 32.00 GiB total capacity; 29.19 GiB already allocated; 29.19 GiB current active; 412.09 MiB free; 30.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
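
The traceback shows the failure happening while the trainable parameters are cast to float32 on a single 32 GiB card. A back-of-the-envelope estimate makes the cause visible. It assumes roughly 7.6 billion parameters for the 7B-class model from the previous step, which is an assumption and not a figure taken from the log; `param_memory_gib` is a hypothetical helper.

```python
GIB = 1024 ** 3


def param_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the parameters alone, in GiB."""
    return num_params * bytes_per_param / GIB


if __name__ == "__main__":
    n = 7.6e9  # assumed parameter count for a 7B-class model
    print(f"fp16 weights: {param_memory_gib(n, 2):.1f} GiB")
    print(f"fp32 weights: {param_memory_gib(n, 4):.1f} GiB")
```

Under that assumption the fp32 weights alone take about 28 GiB, before gradients and optimizer states, which is in line with the roughly 29 GiB already allocated on the 32 GiB card when the allocation failed.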

Out of time for now. I need to rethink the approach; this code has to run today.

Summary