Automated "alchemy" (hyperparameter tuning) built on LLaMA-Factory: study notes

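I have recently been fine-tuning LLMs with LLaMA-Factory, but changing basic hyperparameters such as the learning rate and the number of epochs by hand, one run at a time, is tedious. While looking for a way to tune them automatically, I saw on GitHub that LLaMA-Factory supports logging experiment data to Weights & Biases. I tried to use that for automatic tuning, but as a beginner I could not get it working (*꒦ິ⌓꒦ີ). Besides, what I want is not just a low loss: the predictions themselves have to meet a certain condition. So I decided to add something on top of LLaMA-Factory myself to reach that goal. If anything here is wrong, I would be grateful for corrections.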

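Requirements:

I need to use LLaMA-Factory to fine-tune a llama3-8B model with suitable hyperparameters. According to the LLaMA-Factory README, I need two commands. Command 1: supervised fine-tuning (SFT)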

llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

Command 2: batch prediction plus BLEU and ROUGE scoring

llamafactory-cli train examples/train_lora/llama3_lora_predict.yaml

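The plan is roughly this:

1: Modify the parameter file for supervised fine-tuning, then run command 1 to train the model.

2: Use the trained model to predict, and collect the predictions.

3: Score the predictions, record the good results, and save the training parameters that produced them.

Then repeat steps 1, 2, and 3 in a loop.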
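The loop described above can be sketched as a small driver that takes the three steps as callables. This is only a skeleton under assumptions: `search`, `train`, `predict`, and `score` are hypothetical names standing in for the functions developed in the steps below, and the toy stand-ins in the usage part are made up so that the sketch runs without a GPU.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple


def search(
    epochs_grid: Iterable[float],
    lr_grid: Iterable[float],
    train: Callable[[float, float], None],
    predict: Callable[[], None],
    score: Callable[[], float],
) -> List[Tuple[float, float, float]]:
    """Try every (epochs, lr) pair and return (score, epochs, lr) tuples, best first."""
    results = []
    for epochs, lr in product(epochs_grid, lr_grid):
        train(epochs, lr)                      # step 1: write the YAML and train
        predict()                              # step 2: batch-predict with the trained model
        results.append((score(), epochs, lr))  # step 3: score and record the parameters
    return sorted(results, key=lambda r: r[0], reverse=True)


# Toy stand-ins (hypothetical): the "score" simply prefers epochs=7.0 and lr=5e-5.
state = {}

def fake_train(epochs: float, lr: float) -> None:
    state["params"] = (epochs, lr)

def fake_predict() -> None:
    pass

def fake_score() -> float:
    epochs, lr = state["params"]
    return -abs(epochs - 7.0) - abs(lr - 5e-5) * 1e4

ranking = search([5.0, 7.0, 9.0], [1e-4, 5e-5], fake_train, fake_predict, fake_score)
print(ranking[0])  # best (score, epochs, lr) tuple
```

Passing the steps in as functions keeps the search loop independent of how training and scoring are actually done, so the same loop works whether the score comes from an API or from a local metric.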

———————————————————————————————————————————

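Getting started:

Step 1: Modify the training YAML parameters

My training YAML file is shown below. I only change learning_rate and num_train_epochs; the plan is to find suitable values for these two first.

For a description of the parameters, see: LLaMA-Factory YAML configuration parameters: study notes (CSDN blog)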

### model

model_name_or_path: LLM-Research/Meta-Llama-3-8B # mine is the ModelScope directory

### method

stage: sft # fine-tuning stage (supervised fine-tuning)

do_train: true

finetuning_type: lora

lora_target: all

### dataset

dataset: validationself # must be defined in dataset_info.json under the data directory

template: llama3

cutoff_len: 1024

max_samples: 1000

overwrite_cache: true

preprocessing_num_workers: 16

### output

output_dir: saves/test # output directory

logging_steps: 10

save_steps: 500

plot_loss: true

overwrite_output_dir: true

### train

per_device_train_batch_size: 1

gradient_accumulation_steps: 8

learning_rate: 1.0e-4 # the learning rate I want to tune

num_train_epochs: 3.0 # number of training epochs

lr_scheduler_type: cosine

warmup_ratio: 0.1

bf16: true

ddp_timeout: 180000000

### eval

val_size: 0.1

per_device_eval_batch_size: 1

eval_strategy: steps

eval_steps: 500

First, here is the loop that batch-modifies learning_rate and num_train_epochs:

Admittedly this loop is crude: it is a brute-force search.

import yaml

# Load the base YAML config
with open('trainyear.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Parameter ranges
num_train_epochs_range = list(range(50, 101))  # 50 to 100, i.e. 5.0 to 10.0 epochs in steps of 0.1
# 0.0001, 0.001, 0.01, 0.1 plus 0.001 to 0.090 in steps of 0.001; the set removes the duplicated 0.001 and 0.01
learning_rate_range = sorted({round(0.0001 * (10 ** x), 4) for x in range(4)} | {round(0.001 * (1 + y), 4) for y in range(90)})

# Loop over every combination and overwrite the parameters in the source file
for epochs in num_train_epochs_range:
    num_train_epochs = epochs / 10.0  # convert to float
    for lr in learning_rate_range:
        # Update the parameters in the config
        config['learning_rate'] = lr
        config['num_train_epochs'] = num_train_epochs

        # Write the YAML file back, replacing the previous values
        with open('trainyear.yaml', 'w') as file:
            yaml.safe_dump(config, file)

        print(f'Updated trainyear.yaml with epochs={num_train_epochs} and lr={lr}')
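Because every combination means a full training run, the brute-force grid above (several thousand combinations) is far too large to finish in practice. One common way to shrink it, shown here only as a sketch and not as what this post actually ran, is to space the learning rates evenly on a log scale and to use a coarser step for the epochs. The helper `log_spaced` and the concrete bounds are assumptions chosen for illustration.

```python
import math
from typing import List


def log_spaced(low: float, high: float, num: int) -> List[float]:
    """Return `num` values evenly spaced on a log10 scale from low to high (inclusive)."""
    if num < 2:
        return [low]
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (num - 1)
    return [10 ** (log_low + i * step) for i in range(num)]


# Illustrative bounds (assumed): 5 learning rates between 1e-5 and 1e-3
learning_rates = log_spaced(1e-5, 1e-3, 5)
# Epochs from 5.0 to 10.0 in steps of 1.0 instead of 0.1
epoch_values = [e / 10.0 for e in range(50, 101, 10)]

grid = [(epochs, lr) for epochs in epoch_values for lr in learning_rates]
print(len(grid), "combinations")
for epochs, lr in grid[:3]:
    print(f"epochs={epochs}, lr={lr:.2e}")
```

A coarse grid like this can be run first, and a finer grid can then be placed around the best result.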

Step 2: Write a function that runs the original llamafactory-cli train <yaml> command and shows its output.

This implements command 1.

import subprocess

def run_command_train():
    """Run command 1 (training) and print its output line by line."""
    command = "llamafactory-cli train autoFinetuning/trainyear.yaml"

    # Merge stderr into stdout so that an unread, full stderr pipe cannot block the child process
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, bufsize=1, universal_newlines=True)
    for output in process.stdout:
        print(output.rstrip())

    # Wait for the process to exit and report a failed run
    return_code = process.wait()
    if return_code != 0:
        print(f"Training command exited with code {return_code}")

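At this point the model has been trained. Next I use it to predict on the validation set and collect the predictions.

Step 3: Predict on the validation set

Prediction function: this implements command 2.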

import subprocess

def run_command_predict():
    """Run command 2 (batch prediction) and print its output line by line."""
    command = "llamafactory-cli train autoFinetuning/predictyear.yaml"

    # Merge stderr into stdout so that an unread, full stderr pipe cannot block the child process
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, bufsize=1, universal_newlines=True)
    for output in process.stdout:
        print(output.rstrip())

    # Wait for the process to exit and report a failed run
    return_code = process.wait()
    if return_code != 0:
        print(f"Prediction command exited with code {return_code}")
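The training and prediction functions differ only in the YAML path, so they could be folded into one helper. The sketch below is one possible shape, not the code used in this post: `run_command` and `run_llamafactory` are hypothetical names, the command is passed as an argument list instead of a shell string, and the exit code is returned so the caller can skip scoring when a run fails. The demonstration at the end launches the Python interpreter rather than llamafactory-cli so that it runs anywhere.

```python
import subprocess
import sys
from typing import List


def run_command(args: List[str]) -> int:
    """Run a command, print its merged stdout/stderr line by line, return the exit code."""
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # a single pipe, so a full stderr buffer cannot stall the child
        text=True,
        bufsize=1,
    )
    for line in process.stdout:
        print(line.rstrip())
    return process.wait()


def run_llamafactory(yaml_path: str) -> int:
    """Hypothetical wrapper: trains or predicts depending on the YAML that is passed in."""
    return run_command(["llamafactory-cli", "train", yaml_path])


# Demonstration with a child Python process instead of llamafactory-cli
exit_code = run_command([sys.executable, "-c", "print('hello from the child process')"])
print("exit code:", exit_code)
```

With this helper, step 1 would become run_llamafactory("autoFinetuning/trainyear.yaml") and step 2 run_llamafactory("autoFinetuning/predictyear.yaml").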

This step needs a prediction YAML file, which I wrote with reference to the command preview in the web UI:

The prediction YAML is as follows:

### model

model_name_or_path: LLM-Research/Meta-Llama-3-8B # Llama path from the ModelScope community

adapter_name_or_path: saves/test

### method

stage: sft

do_predict: true

finetuning_type: lora

### dataset

eval_dataset: validationself

dataset_dir: data

template: llama3

cutoff_len: 1024

max_samples: 50 # set this according to the size of your validation set

overwrite_cache: true

preprocessing_num_workers: 16

### output

output_dir: saves/testwandb/lora/predict

overwrite_output_dir: true

### eval

per_device_eval_batch_size: 1

predict_with_generate: true

ddp_timeout: 180000000

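Note: the reference script provided by LLaMA-Factory does two things: batch prediction, and computing BLEU and ROUGE scores. I only need the batch prediction. As a beginner, borrowing command 2 was the only way I found to batch-predict.

The predictions are saved in the file generated_predictions.jsonl, under the directory set by output_dir in the prediction YAML (output_dir: saves/testwandb/lora/predict).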
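As a quick check that prediction worked, the file can be loaded into memory. The sketch below assumes that each line of generated_predictions.jsonl is one JSON object with a `predict` field, which is the field the processing code in this post reads. `load_predictions` is a hypothetical helper, and the two demo records are made up so that the example runs without a real prediction file.

```python
import json
import os
import tempfile
from typing import List


def load_predictions(jsonl_path: str) -> List[str]:
    """Return the 'predict' field of every non-empty line in a JSONL file."""
    predictions = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            predictions.append(json.loads(line).get("predict", ""))
    return predictions


# Demo with a tiny made-up file (the records are invented for illustration)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "generated_predictions.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"label": "reference 1", "predict": "first answer"}) + "\n")
        f.write(json.dumps({"label": "reference 2", "predict": "second answer"}) + "\n")
    demo = load_predictions(path)

print(demo)
```

For the real file, the call would be load_predictions("saves/testwandb/lora/predict/generated_predictions.jsonl").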

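Reference: LLaMA-Factory/examples/README_zh.md at main · hiyouga/LLaMA-Factory (github.com)

Step 4: Extract the predictions and call an API to score them

Next, read the batch predictions from generated_predictions.jsonl and call an API to get a score for each prediction.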

import json
import os

import requests

def process_jsonl_file(input_file, output_dir):
    """
    Process a JSONL file and save the 'predict' field of each JSON object as a separate text file.
    """
    os.makedirs(output_dir, exist_ok=True)  # make sure the output directory exists
    index = 1  # file index, starting at 1
    with open(input_file, 'r', encoding='utf-8') as file:  # open the input JSONL file
        for line in file:  # iterate over every line
            if not line.strip():  # skip blank lines
                continue
            item = json.loads(line)  # parse the JSONL line into a dict
            predict_text = item.get('predict', '')  # value of the 'predict' field
            filename = f"{output_dir}/{index}.txt"  # output file name
            index += 1
            with open(filename, 'w', encoding='utf-8') as out_file:
                out_file.write(predict_text)  # write the prediction text
    print("All text files saved.")

def send_requests_and_save_results(text_files_dir, json_files_dir, api_url, headers):
    """
    Send each text file to the API and save the response as a JSON file.
    """
    os.makedirs(json_files_dir, exist_ok=True)  # make sure the output directory exists
    for i in range(1, 21):  # files 1.txt to 20.txt
        file_name = f"{text_files_dir}/{i}.txt"  # input file name
        json_file_name = f"{json_files_dir}/{i}.json"  # output file name

        with open(file_name, 'r', encoding='utf-8') as file:
            input_text = file.read()

        payload = {"input_text": input_text}  # request body
        response = requests.post(api_url, json=payload, headers=headers, timeout=60)  # send the POST request
        response.raise_for_status()  # stop on HTTP errors instead of saving an error page
        result = response.json()  # JSON data from the response body

        with open(json_file_name, 'w', encoding='utf-8') as file:
            json.dump(result, file, ensure_ascii=False, indent=4)  # save the JSON data

def calculate_average_and_sort_values(json_files_dir):
    """
    Compute the average of the 'is_human_written' field over all JSON files.
    Returns None when no value could be read.
    """
    is_human_written_values = []  # collected 'is_human_written' values
    for i in range(1, 21):  # files 1.json to 20.json, matching send_requests_and_save_results
        file_name = f"{json_files_dir}/{i}.json"
        try:
            with open(file_name, 'r', encoding='utf-8') as file:
                data = json.load(file)
                is_human_written_values.append(data['data']['is_human_written'])
        except FileNotFoundError:  # the file does not exist
            print(f"File {file_name} does not exist.")
        except json.JSONDecodeError:  # the file is not valid JSON
            print(f"File {file_name} is not valid JSON.")
        except KeyError:  # the 'is_human_written' field is missing
            print(f"File {file_name} has no 'is_human_written' field.")

    if is_human_written_values:
        average_value = sum(is_human_written_values) / len(is_human_written_values)
        return average_value
    print("Not enough data to compute an average.")
    return None
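The averaging step can also be written as a pure function over response dictionaries that are already in memory, which makes it easy to test without calling the API. This is a sketch under assumptions: `average_score` is a hypothetical name, the `data` / `is_human_written` layout is taken from the code above, and the sample responses (including the error entry) are made up.

```python
from typing import Iterable, Optional


def average_score(responses: Iterable[dict], field: str = "is_human_written") -> Optional[float]:
    """Average response['data'][field] over all responses that contain a usable value."""
    values = []
    for response in responses:
        try:
            values.append(float(response["data"][field]))
        except (KeyError, TypeError, ValueError):
            continue  # skip responses without a usable score
    if not values:
        return None
    return sum(values) / len(values)


# Made-up responses: two valid scores and one failed request
responses = [
    {"data": {"is_human_written": 0.8}},
    {"data": {"is_human_written": 0.6}},
    {"error": "request failed"},
]
print(average_score(responses))
```

Returning None for "no data" lets the search loop tell a failed run apart from a run that genuinely scored 0.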

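Step 5: Always keep the best scores and their parameters

Rewrite the loop from step 1: initialize a list that holds the best scores, and record the parameters that go with them.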

import yaml

if __name__ == "__main__":
    # List of (maxnum, num_train_epochs, lr) tuples for the best runs
    top_5_results = []

    # input_file, output_dir, json_files_dir, api_url and headers must be defined before this point

    # Load the base YAML config
    with open('/mnt/workspace/LLaMA-Factory/autoFinetuning/trainyear.yaml', 'r') as file:
        config = yaml.safe_load(file)

    # Parameter ranges (epochs are stored times ten, so 73 means 7.3)
    num_train_epochs_range = list(range(73, 76))  # 7.3 to 7.5; use range(50, 101) for the full 5.0 to 10.0 sweep
    # learning_rate_range = [0.0001 * (10 ** x) for x in range(4)] + [0.001 * (1 + y) for y in range(90)]  # 0.0001 to 0.1, uneven steps covering the whole range
    learning_rate_range = [5e-5]  # other candidate: 3e-5

    for epochs in num_train_epochs_range:
        num_train_epochs = epochs / 10.0  # convert to float
        for lr in learning_rate_range:
            # Update the parameters in the config
            config['learning_rate'] = lr
            config['num_train_epochs'] = num_train_epochs

            # Write the YAML file back, replacing the previous values
            with open('/mnt/workspace/LLaMA-Factory/autoFinetuning/trainyear.yaml', 'w') as file:
                yaml.safe_dump(config, file)

            print(f'Updated trainyear.yaml with epochs={num_train_epochs} and lr={lr}')
            # Step 1: train the model with the updated YAML
            run_command_train()
            print("Step1 done.............................................................Step1")
            # Step 2: batch-predict with the trained model
            run_command_predict()
            print("Step2 done.............................................................Step2")

            # Step 3: split the prediction JSONL file into text files
            process_jsonl_file(input_file, output_dir)
            print("Step3 done.............................................................Step3")

            # Step 4: send the text files to the API and save the results as JSON files
            send_requests_and_save_results(output_dir, json_files_dir, api_url, headers)
            print("Step4 done.............................................................Step4")

            # Step 5: average the 'is_human_written' values in the JSON files
            maxnum = calculate_average_and_sort_values(json_files_dir)
            print("Step5 done.............................................................Step5")

            if maxnum is None:  # scoring failed for this run, nothing to record
                continue

            # Keep only the five highest scores, best first
            top_5_results.append((maxnum, num_train_epochs, lr))
            top_5_results.sort(key=lambda x: x[0], reverse=True)
            del top_5_results[5:]

    # Write top_5_results to a file
    with open('top_5_results.txt', 'w') as f:
        for result in top_5_results:
            f.write(f'maxnum: {result[0]}, num_train_epochs: {result[1]}, lr: {result[2]}\n')
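The "keep the best five" bookkeeping can also be done with a min-heap from the standard library, which avoids re-sorting the list on every run. This is an alternative sketch, not the method used above: `update_top_k` is a hypothetical helper, and the scores in the demonstration are made up.

```python
import heapq
from typing import List, Optional, Tuple

Result = Tuple[Optional[float], float, float]  # (score, num_train_epochs, lr)


def update_top_k(heap: List[Result], result: Result, k: int = 5) -> None:
    """Keep only the k highest-scoring results in a min-heap (smallest score at heap[0])."""
    if result[0] is None:
        return  # scoring failed for this run, nothing to record
    if len(heap) < k:
        heapq.heappush(heap, result)
    elif result > heap[0]:
        heapq.heapreplace(heap, result)  # drop the current worst, insert the new result


# Demonstration with made-up scores, keeping the best three
heap: List[Result] = []
runs = [(0.4, 5.0, 1e-4), (0.9, 7.3, 5e-5), (None, 7.4, 1e-4), (0.7, 7.4, 5e-5), (0.2, 7.5, 3e-5)]
for run in runs:
    update_top_k(heap, run, k=3)

best_first = sorted(heap, reverse=True)
for score, epochs, lr in best_first:
    print(f"maxnum: {score}, num_train_epochs: {epochs}, lr: {lr}")
```

For a handful of runs the difference is negligible; the heap mainly pays off when the search covers many combinations.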

Summary:

To be completed.

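**Article summary**
This article describes how to automate hyperparameter tuning on top of `LLaMA-Factory` when fine-tuning a `llama3-8B` model, in order to arrive at a parameter configuration that suits the project. The problem the author faces is that adjusting the learning rate and the number of epochs by hand is tedious, and that a good result requires not only a low loss but also predictions that meet a certain condition. The author therefore automates an iterative search over the learning rate and the number of training epochs.
**Process overview**:
1. **Preparation**:
- The author first states the requirement: fine-tune a `Llama3-8B` model.
- The author identifies the available commands and the config file format for fine-tuning (supervised instruction tuning) and for batch prediction.
2. **Development**:
- The author edits the training `YAML` file, specifically the `learning_rate` and `num_train_epochs` parameters, in preparation for automated tuning.
- A loop generates the different parameter combinations and applies them to the training config one at a time.
- Scripts automate the training run, the prediction run, and the extraction and scoring of the predictions.
- The `LLaMA-Factory` command-line interface is used for both training (`llamafactory-cli train`) and prediction (the same command with a prediction `YAML`).
- To evaluate the predictions, the author calls an external API. The BLEU and ROUGE scores are not used; what matters is whether the output reads as human-written (the `is_human_written` field in the API response).
- The best scores and the parameters that produced them are saved and updated.
3. **Problems encountered**:
- Not enough experience to integrate Weights & Biases for logging and tuning.
- The scattered code still needs cleanup and optimization.
4. **Future work**:
- Finish and optimize the current automated pipeline.
- Consider how to integrate automatic tuning tools and services more effectively.
- Possibly study the effect of other model parameters.
**Conclusion**:
This is a write-up on automatically tuning the training parameters for fine-tuning `Llama3-8B` and recording the best-scoring combinations. It covers the author's goal, plan, code, and test of the automated pipeline. Although the first implementation ran into some difficulties, it shows how useful automation is for tuning model parameters.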