Llama3-Tutorial（Llama 3 超级课堂）作业

1.Llama 3 Web Demo 部署

cd ~

git clone https://github.com/SmartFlowAI/Llama3-Tutorial

安装 XTuner 时会自动安装其他依赖

cd ~

git clone -b v0.1.18 https://github.com/InternLM/XTuner

cd XTuner

pip install -e .

运行 web_demo.py

streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py \

  ~/model/Meta-Llama-3-8B-Instruct

此时点击URL并不能在本地浏览器直接访问，需用到1.4内容，对8501这个端口进行配置，然后在本地浏览器上直接访问http://localhost:8501

2.XTuner 小助手认知微调

2.1自我认知训练数据集准备

cd ~/Llama3-Tutorial
python tools/gdata.py

2.2模型训练

cd ~/Llama3-Tutorial

# 开始训练,使用 deepspeed 加速，A100 40G显存耗时24分钟

xtuner train configs/assistant/llama3_8b_instruct_qlora_assistant.py --work-dir /root/llama3_pth

# Adapter PTH 转 HF 格式

xtuner convert pth_to_hf /root/llama3_pth/llama3_8b_instruct_qlora_assistant.py \

/root/llama3_pth/iter_500.pth \

/root/llama3_hf_adapter

# 模型合并

export MKL_SERVICE_FORCE_INTEL=1

xtuner convert merge /root/model/Meta-Llama-3-8B-Instruct \

/root/llama3_hf_adapter\

/root/llama3_hf_merged

2.3推理验证

streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py \

/root/llama3_hf_merged

这里同样需要对端口进行映射

3LMDeploy 部署 Llama 3 模型

3.1环境配置

安装lmdeploy最新版。

pip install -U lmdeploy[all]

3.2LMDeploy Chat CLI 工具

直接在终端运行

conda activate lmdeploy
lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct

3.3LMDeploy模型量化(lite)

下面，改变--cache-max-entry-count参数，设为0.5。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct/ --cache-max-entry-count 0.5

面来一波“极限”，把--cache-max-entry-count参数设置为0.01，约等于禁止KV Cache占用显存。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct/ --cache-max-entry-count 0.01

3.4W4A16量化

执行一条命令完成模型量化工作。

lmdeploy lite auto_awq \
   /root/model/Meta-Llama-3-8B-Instruct \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/model/Meta-Llama-3-8B-Instruct_4bit

运行时间较长，请耐心等待。量化工作结束后，新的HF模型被保存到Meta-Llama-3-8B-Instruct_4bit目录。

下面使用Chat功能运行W4A16量化后的模型。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct_4bit --model-format awq

为了更加明显体会到W4A16的作用，我们将KV Cache比例再次调为0.01，查看显存占用情况。

lmdeploy chat /root/model/Meta-Llama-3-8B-Instruct_4bit --model-format awq --cache-max-entry-count 0.01

可以看到，显存占用变为6738MB，明显降低。

3.4在线量化

通过以下命令启动API服务器，推理Meta-Llama-3-8B-Instruct模型：

lmdeploy serve api_server \
    /root/model/Meta-Llama-3-8B-Instruct \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

其中，model-format、quant-policy这些参数是与第三章中量化推理模型一致的；server-name和server-port表示API服务器的服务IP与服务端口；tp参数表示并行数量（GPU数量）。通过运行以上指令，我们成功启动了API服务器，请勿关闭该窗口，后面我们要新建客户端连接该服务。你也可以直接打开http://{host}:23333查看接口的具体使用说明，如下图所示。

这一步由于Server在远程服务器上，所以本地需要做一下ssh转发才能直接访问。在你本地打开一个cmd窗口，输入命令如下：

3.5命令行客户端连接API服务器

在“4.1”中，我们在终端里新开了一个API服务器。本节中，我们要新建一个命令行客户端去连接API服务器。首先通过VS Code新建一个终端：激活conda环境

conda activate llm

运行命令行客户端：

lmdeploy serve api_client http://localhost:23333

运行后，可以通过命令行窗口直接与模型对话

3.6网页客户端连接API服务器

关闭刚刚的VSCode终端，但服务器端的终端不要关闭。运行之前确保自己的gradio版本低于4.0.0。

pip install gradio==3.50.2

新建一个VSCode终端，激活conda环境。

conda activate lmdeploy

使用Gradio作为前端，启动网页客户端。

lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006

打开浏览器，访问地址http://127.0.0.1:6006 然后就可以与模型进行对话了！

3.7推理速度

使用 LMDeploy 在 A100（80G）推理 Llama3，每秒请求处理数（RPS）高达 25，是 vLLM 推理效率的 1.8+ 倍。

克隆仓库

cd ~
git clone https://github.com/InternLM/lmdeploy.git

下载测试数据

cd /root/lmdeploy
wget https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

执行 benchmark 命令(如果你的显存较小，可以调低--cache-max-entry-count

python benchmark/profile_throughput.py \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    /root/model/Meta-Llama-3-8B-Instruct \
    --cache-max-entry-count 0.8 \
    --concurrency 256 \
    --model-format hf \
    --quant-policy 0 \
    --num-prompts 10000

结果是：

3.8使用LMDeploy运行视觉多模态大模型Llava-Llama-3

安装依赖

pip install git+https://github.com/haotian-liu/LLaVA.git

运行模型

运行touch /root/pipeline_llava.py 新建一个文件夹，复制下列代码进去

from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image
pipe = pipeline('xtuner/llava-llama-3-8b-v1_1-hf',
                chat_template_config=ChatTemplateConfig(model_name='llama3'))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response.text)

4.Llama 3 图片理解能力微调（XTuner+LLaVA 版）

安装 XTuner

cd ~
git clone -b v0.1.18 https://github.com/InternLM/XTuner
cd XTuner
pip install -e .[all]

准备Visual Encoder权重

准备 Llava 所需要的 openai/clip-vit-large-patch14-336，权重，即 Visual Encoder 权重。

mkdir -p ~/model

cd ~/model

ln -s /root/share/new_models/openai/clip-vit-large-patch14-336 .

也可以访问 https://huggingface.co/openai/clip-vit-large-patch14-336 以进行下载。

准备Image Projector权重

mkdir -p ~/model

cd ~/model

ln -s /root/share/new_models/xtuner/llama3-llava-iter_2181.pth .

相关权重可以访问：https://huggingface.co/xtuner/llava-llama-3-8b 以及 https://huggingface.co/xtuner/llava-llama-3-8b-v1_1 。（已经过微调，并非 Pretrain 阶段的 Image Projector）

数据准备

Tutorial/xtuner/llava/xtuner_llava.md at camp2 · InternLM/Tutorial · GitHub 中的教程来准备微调数据。选择使用过拟合的方式快速实现。可以执行以下代码：

cd ~

git clone https://github.com/InternLM/tutorial -b camp2

python ~/tutorial/xtuner/llava/llava_data/repeat.py \

-i ~/tutorial/xtuner/llava/llava_data/unique_data.json \

-o ~/tutorial/xtuner/llava/llava_data/repeated_data.json \

-n 200

启动训练

准备好了可以一键启动的配置文件，主要是修改好了模型路径、对话模板以及数据路径。使用如下指令以启动训练：

xtuner train ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py --work-dir ~/llama3_llava_pth --deepspeed deepspeed_zero2

训练过程所需显存约为44447 MiB，在单卡 A100 上训练所需时间为30分钟。

在训练好之后，我们将原始 image projector 和我们微调得到的 image projector 都转换为 HuggingFace 格式，为了下面的效果体验做准备。

xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py \

~/model/llama3-llava-iter_2181.pth \

~/llama3_llava_pth/pretrain_iter_2181_hf

xtuner convert pth_to_hf ~/Llama3-Tutorial/configs/llama3-llava/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_lora_e1_finetune.py \

~/llama3_llava_pth/iter_1200.pth \

~/llama3_llava_pth/iter_1200_hf

在转换完成后，我们就可以在命令行简单体验一下微调后模型的效果了。

问题1：Describe this image. 问题2：What is the equipment in the image?

Pretrain 模型

export MKL_SERVICE_FORCE_INTEL=1

xtuner chat /root/model/Meta-Llama-3-8B-Instruct \

--visual-encoder /root/model/clip-vit-large-patch14-336 \

--llava /root/llama3_llava_pth/pretrain_iter_2181_hf \

--prompt-template llama3_chat \

--image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

此时可以看到，Pretrain 模型只会为图片打标签，并不能回答问题。

Finetune 后模型

export MKL_SERVICE_FORCE_INTEL=1
xtuner chat /root/model/Meta-Llama-3-8B-Instruct \
  --visual-encoder /root/model/clip-vit-large-patch14-336 \
  --llava /root/llama3_llava_pth/iter_1200_hf \
  --prompt-template llama3_chat \
  --image /root/tutorial/xtuner/llava/llava_data/test_img/oph.jpg

经过 Finetune 后，我们可以发现，模型已经可以根据图片回答我们的问题了。

5.优秀学员任务

不太清楚最后一条“Llama3 工具调用能力训练”的具体任务是什么，但是六节课的所有内容全部实现完毕，下面放剩下几节课的结果图。

OpenCompass

Llama 3 Agent 能力体验与微调