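Quick Start: Private Deployment and Efficient Fine-Tuning of Llama3 with LLaMA-Factory (Sugon SCNet Supercomputing Internet Platform, DCU Heterogeneous Accelerator)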

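Preface

This article uses LLaMA-Factory on the SCNet supercomputing internet platform, with the "Heterogeneous Accelerator AI, 64 GB memory, PCIe" card, to deploy the Llama3 model privately and then run LoRA fine-tuning, inference, and merging on the Llama3-8B-Instruct model.

For the basic quick-start version, see the companion blog post: "Quick Start: Fine-Tuning LLaMA3 (Sugon Supercomputing Internet Platform, Domestic Heterogeneous Accelerator DCU)"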

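I. References

GitHub repository: LLaMA-Factory, using the latest code version: v0.8.3

SCNet supercomputing internet platform

Heterogeneous Accelerator AI, 64 GB memory, PCIe

"Getting Started with Local Deployment and Efficient Fine-Tuning of Llama3"

"Quick Start: Fine-Tuning LLaMA3 (Sugon Supercomputing Internet Platform, Domestic Heterogeneous Accelerator DCU)"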

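II. Environment Setup

1. System image

The heterogeneous AI accelerator is a Chinese-made card built on the DTK software stack (the counterpart of NVIDIA's CUDA). Choose an image based on dtk24.04.

This article uses the jupyterlab-pytorch:2.1.0-ubuntu20.04-dtk24.04-py310 image as the example.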
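Since DTK plays the role that CUDA plays on NVIDIA hardware, a quick sanity check is to ask PyTorch which devices it can see. The sketch below assumes that the DTK build of PyTorch exposes the accelerator through the usual `torch.cuda` namespace, as ROCm-style builds do; that is an assumption, not something the image documentation quoted here confirms. The helper name `describe_accelerators` is hypothetical.

```python
def describe_accelerators():
    """Return one description line per visible device, or [] if none are visible."""
    try:
        import torch  # provided by the platform image; absent elsewhere
    except ImportError:
        return []
    # Assumption: the DTK build reports DCU cards via torch.cuda.
    if not torch.cuda.is_available():
        return []
    lines = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        memory_gib = props.total_memory / 1024**3
        lines.append(f"{index}: {props.name}, {memory_gib:.1f} GiB")
    return lines


if __name__ == "__main__":
    devices = describe_accelerators()
    print("\n".join(devices) if devices else "no accelerator visible to PyTorch")
```

An empty result inside the notebook container would suggest that the environment is not using the DTK build of torch.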

2. Software and hardware dependencies

Important: the minimum required versions are transformers 4.41.2 and vllm 0.4.3.

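| Mandatory    | Minimum | Recommended |
| ------------ | ------- | ----------- |
| python       | 3.8     | 3.11        |
| torch        | 1.13.1  | 2.3.0       |
| transformers | 4.41.2  | 4.41.2      |
| datasets     | 2.16.0  | 2.19.2      |
| accelerate   | 0.30.1  | 0.30.1      |
| peft         | 0.11.1  | 0.11.1      |
| trl          | 0.8.6   | 0.9.4       |

| Optional     | Minimum | Recommended |
| ------------ | ------- | ----------- |
| CUDA         | 11.6    | 12.2        |
| deepspeed    | 0.10.0  | 0.14.0      |
| bitsandbytes | 0.39.0  | 0.43.1      |
| vllm         | 0.4.3   | 0.4.3       |
| flash-attn   | 2.3.0   | 2.5.9       |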
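Whether an environment meets these minimums can be checked programmatically. The following is a minimal sketch and not part of LLaMA-Factory: the helper names `parse_version`, `meets_minimum`, and `report` are hypothetical, and the comparison only looks at the leading numeric components, so local build suffixes such as `+git00661e0.abi0.dtk2404` are ignored.

```python
import re
from importlib import metadata

# Minimum versions copied from the mandatory entries listed above.
MINIMUMS = {
    "torch": "1.13.1",
    "transformers": "4.41.2",
    "datasets": "2.16.0",
    "accelerate": "0.30.1",
    "peft": "0.11.1",
    "trl": "0.8.6",
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare two version strings by their numeric components only."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))  # pad so that 2.1 equals 2.1.0
    need += (0,) * (width - len(need))
    return have >= need


def report(minimums=MINIMUMS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    result = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = "missing"
            continue
        result[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return result


if __name__ == "__main__":
    for package, status in report().items():
        print(f"{package}: {status}")
```

Running it inside the activated environment prints one status line per package; anything reported as "too old" or "missing" needs attention before training.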

3. Clone the base environment

root@notebook-1813389960667746306-scnlbe5oi5-20553:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# conda create -n llama_factory_torch --clone base
Retrieving notices: ...working... done
Source:      /opt/conda
Destination: /opt/conda/envs/llama_factory_torch
The following packages cannot be cloned out of the root environment:
 - https://repo.anaconda.com/pkgs/main/linux-64::conda-23.7.4-py310h06a4308_0
Packages: 44
Files: 53489

Downloading and Extracting Packages


Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate llama_factory_torch
#
# To deactivate an active environment, use
#
#     $ conda deactivate

4. Install LLaMA Factory

root@notebook-1813389960667746306-scnlbe5oi5-20553:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# source activate llama_factory_torch
(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-20553:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# pip install -e ".[torch,metrics]"
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
  ...
  Checking if build backend supports build_editable ... done
Building wheels for collected packages: llamafactory, jieba
  Building editable for llamafactory (pyproject.toml) ... done
  Created wheel for llamafactory: filename=llamafactory-0.8.4.dev0-0.editable-py3-none-any.whl size=20781 sha256=85d430f487d58b0358c4332a21e07f316114d6c9997c8b6e1d88d3831a826b54
  Stored in directory: /tmp/pip-ephem-wheel-cache-gmhnwt2w/wheels/e9/b4/89/f13e921e37904ee0c839434aad2d7b2951c2c68e596667c7ef
  Building wheel for jieba (setup.py) ... done
  Created wheel for jieba: filename=jieba-0.42.1-py3-none-any.whl size=19314459 sha256=49056fbfc2f07fc007494b9193edaa4c8f47119f1790b6ef8b769b7c994b7bf4
  Stored in directory: /root/.cache/pip/wheels/b2/9b/80/7537177f75993c29af08e0d00c753724c7f06c646352be50a3
Successfully built llamafactory jieba
DEPRECATION: lmdeploy 0.1.0-git782048c.abi0.dtk2404.torch2.1. has a non-standard version number. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of lmdeploy or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
DEPRECATION: mmcv 2.0.1-gitc0ccf15.abi0.dtk2404.torch2.1. has a non-standard version number. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of mmcv or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: pydub, jieba, urllib3, tomlkit, shtab, semantic-version, scipy, ruff, rouge-chinese, joblib, importlib-resources, ffmpy, docstring-parser, aiofiles, nltk, tyro, sse-starlette, tokenizers, gradio-client, transformers, trl, peft, gradio, llamafactory
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.13
    Uninstalling urllib3-1.26.13:
      Successfully uninstalled urllib3-1.26.13
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.15.0
    Uninstalling tokenizers-0.15.0:
      Successfully uninstalled tokenizers-0.15.0
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.0
    Uninstalling transformers-4.38.0:
      Successfully uninstalled transformers-4.38.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lmdeploy 0.1.0-git782048c.abi0.dtk2404.torch2.1. requires transformers==4.33.2, but you have transformers 4.43.3 which is incompatible.
Successfully installed aiofiles-23.2.1 docstring-parser-0.16 ffmpy-0.4.0 gradio-4.40.0 gradio-client-1.2.0 importlib-resources-6.4.0 jieba-0.42.1 joblib-1.4.2 llamafactory-0.8.4.dev0 nltk-3.8.1 peft-0.12.0 pydub-0.25.1 rouge-chinese-1.0.3 ruff-0.5.5 scipy-1.14.0 semantic-version-2.10.0 shtab-1.7.1 sse-starlette-2.1.3 tokenizers-0.19.1 tomlkit-0.12.0 transformers-4.43.3 trl-0.9.6 tyro-0.8.5 urllib3-2.2.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip

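5. Resolve dependency conflicts

The previous install reported that lmdeploy requires transformers==4.33.2. Reinstalling LLaMA Factory with --no-deps leaves the versions of its already-installed dependencies unchanged.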

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-20553:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# pip install --no-deps -e .
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: llamafactory
  Building editable for llamafactory (pyproject.toml) ... done
  Created wheel for llamafactory: filename=llamafactory-0.8.4.dev0-0.editable-py3-none-any.whl size=20781 sha256=3a2c017ba41af3a1cd3f2482bbc71cae820fe1d2708a0b642899c64a8a3b3461
  Stored in directory: /tmp/pip-ephem-wheel-cache-_abegaoi/wheels/e9/b4/89/f13e921e37904ee0c839434aad2d7b2951c2c68e596667c7ef
Successfully built llamafactory
Installing collected packages: llamafactory
  Attempting uninstall: llamafactory
    Found existing installation: llamafactory 0.8.4.dev0
    Uninstalling llamafactory-0.8.4.dev0:
      Successfully uninstalled llamafactory-0.8.4.dev0
Successfully installed llamafactory-0.8.4.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip

6. Install `vllm 0.4.3`

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-20553:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# pip install --no-dependencies vllm==0.4.3
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting vllm==0.4.3
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1a/1e/10bcb6566f4fa8b95ff85bddfd1675ff7db33ba861f59bd70aa3b92a46b7/vllm-0.4.3-cp310-cp310-manylinux1_x86_64.whl (131.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 131.1/131.1 MB 28.9 MB/s eta 0:00:00
Installing collected packages: vllm
  Attempting uninstall: vllm
    Found existing installation: vllm 0.3.3+git3380931.abi0.dtk2404.torch2.1
    Uninstalling vllm-0.3.3+git3380931.abi0.dtk2404.torch2.1:
      Successfully uninstalled vllm-0.3.3+git3380931.abi0.dtk2404.torch2.1
Successfully installed vllm-0.4.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip

7. `requirements.txt`

accelerate==0.32.1
addict==2.4.0
aiofiles==23.2.1
aiohttp==3.9.5
aiosignal==1.3.1
aliyun-python-sdk-core==2.15.1
aliyun-python-sdk-kms==2.16.3
annotated-types==0.7.0
anyio==4.4.0
apex @ https://cancon.hpccube.com:65024/directlink/4/apex/DAS1.0/apex-1.1.0+das1.0+0dd7f68.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=fdeb7c8a0b354a6a2faa61ae2055b2c2e7deb07bfa4aa7811068c5e02455ee1e
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
beautifulsoup4==4.12.3
bitsandbytes @ https://cancon.hpccube.com:65024/directlink/4/bitsandbytes/DAS1.0/bitsandbytes-0.37.0+das1.0+gitd3d888f.abi0.dtk2404.torch2.1-py3-none-any.whl#sha256=c46eb3f1555f2153424c3c0297e6645c0881cb76965cf5f3d11f77b52d80c19c
bleach==6.1.0
boltons @ file:///croot/boltons_1677628692245/work
brotlipy==0.7.0
certifi @ file:///croot/certifi_1707229174982/work/certifi
cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.1.7
coloredlogs==15.0.1
comm==0.2.2
conda-content-trust @ file:///tmp/abs_5952f1c8-355c-4855-ad2e-538535021ba5h26t22e5/croots/recipe/conda-content-trust_1658126371814/work
conda-package-handling @ file:///croot/conda-package-handling_1666940373510/work
contourpy==1.2.1
crcmod==1.7
cryptography @ file:///croot/cryptography_1665612644927/work
cycler==0.12.1
datasets==2.19.2
debugpy==1.8.1
decorator==5.1.1
deepspeed @ https://cancon.hpccube.com:65024/directlink/4/deepspeed/DAS1.0/deepspeed-0.12.3+das1.0+gita724046.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=726d64f73ab2ed7bcd716dcb2af53bb3c790ab4a24180b1b9319e7a7ab2cc569
defusedxml==0.7.1
diffusers==0.29.2
dill==0.3.8
dnspython==2.6.1
docstring_parser==0.16
einops==0.8.0
email_validator==2.1.1
exceptiongroup==1.2.1
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.19.1
ffmpy==0.4.0
filelock==3.14.0
fire==0.6.0
flash-attn @ https://cancon.hpccube.com:65024/directlink/4/flash_attn/DAS1.0/flash_attn-2.0.4+das1.0+82379d7.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=2facc1831d95b55bf1bca88c7f23163751f4c749e4f7fc9256d8311ddbb5d399
flatbuffers==24.3.25
fonttools==4.52.4
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.3.1
gradio==4.40.0
gradio_client==1.2.0
h11==0.14.0
hjson==3.1.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface==0.0.1
huggingface-hub==0.23.2
humanfriendly==10.0
hypothesis==5.35.1
idna @ file:///croot/idna_1666125576474/work
importlib_metadata==7.1.0
importlib_resources==6.4.0
invisible-watermark==0.2.0
ipykernel==6.29.4
ipython==8.24.0
ipywidgets==8.1.3
isoduration==20.11.0
jedi==0.19.1
jieba==0.42.1
Jinja2==3.1.4
jmespath==0.10.0
joblib==1.4.2
json5==0.9.25
jsonpatch @ file:///croot/jsonpatch_1714483231291/work
jsonpointer==2.1
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_ext_dataset==0.1.0
jupyter_ext_logo==0.1.0
jupyter_server==2.14.0
jupyter_server_terminals==0.5.3
jupyterlab==4.2.1
jupyterlab-language-pack-zh-CN==4.0.post6
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.11
kiwisolver==1.4.5
lightop @ https://cancon.hpccube.com:65024/directlink/4/lightop/DAS1.0/lightop-0.3+das1.0+837dbb7.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=7f4eb1190a570c05a63a4aade326c87367c4e5ccf6ff82ad5e92220790817e5c
-e git+https://github.com/hiyouga/LLaMA-Factory.git@668654b5adae3f897d5291b81410226e1304eff9#egg=llamafactory
lmdeploy @ https://cancon.hpccube.com:65024/directlink/4/lmdeploy/DAS1.0/lmdeploy-0.1.0_das1.0+git782048c.abi0.dtk2404.torch2.1.-cp310-cp310-manylinux2014_x86_64.whl#sha256=499940e022de16b3f1211a52c2daa3a603b109a015487499c9e11a53c6d5ad2c
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
mmcv @ https://cancon.hpccube.com:65024/directlink/4/mmcv/DAS1.0/mmcv-2.0.1_das1.0+gitc0ccf15.abi0.dtk2404.torch2.1.-cp310-cp310-manylinux2014_x86_64.whl#sha256=4fc5ff39d232e5ca1efebf7cfdfcf9bc0675308cf40e5f17237c4f2eec66f210
mmengine==0.10.4
mmengine-lite==0.10.4
modelscope==1.17.0
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
nltk==3.8.1
notebook_shim==0.2.4
numpy==1.24.3
onnxruntime @ https://cancon.hpccube.com:65024/directlink/4/onnxruntime/DAS1.0/onnxruntime-1.15.0+das1.0+gita9ca438.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=509446b41adb89e7507700482cb99e2c399ab3164bc9ea6d9a50e11f84a2406e
opencv-python==4.9.0.80
orjson==3.10.3
oss2==2.18.6
overrides==7.7.0
packaging @ file:///croot/packaging_1710807400464/work
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
peft==0.12.0
pexpect==4.9.0
pillow==10.3.0
platformdirs==4.2.2
pluggy @ file:///tmp/build/80754af9/pluggy_1648024709248/work
prometheus_client==0.20.0
prompt_toolkit==3.0.45
protobuf==5.27.0
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==16.1.0
pyarrow-hotfix==0.6
pycosat @ file:///croot/pycosat_1666805502580/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pycryptodome==3.20.0
pydantic==2.7.2
pydantic_core==2.18.3
pydub==0.25.1
Pygments==2.18.0
pynvml==11.5.0
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
pyparsing==3.1.2
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
pytz==2024.1
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==26.0.3
ray==2.9.3
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rouge-chinese==1.0.3
rpds-py==0.18.1
ruamel.yaml @ file:///croot/ruamel.yaml_1666304550667/work
ruamel.yaml.clib @ file:///croot/ruamel.yaml.clib_1666302247304/work
ruff==0.5.5
safetensors==0.4.3
scipy==1.14.0
semantic-version==2.10.0
Send2Trash==1.8.3
sentencepiece==0.2.0
shellingham==1.5.4
shtab==1.7.1
six @ file:///tmp/build/80754af9/six_1644875935023/work
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
sse-starlette==2.1.3
stack-data==0.6.3
starlette==0.37.2
sympy==1.12.1
termcolor==2.4.0
terminado==0.18.1
tiktoken==0.7.0
tinycss2==1.3.0
tokenizers==0.19.1
tomli==2.0.1
tomlkit==0.12.0
toolz @ file:///croot/toolz_1667464077321/work
torch @ https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.0/torch-2.1.0+das1.0+git00661e0.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=0b5f4be74ffdd6fe7540a844bf4f02e432b7d267b5e9fdd7f9448192d93bf3b6
torchaudio @ https://cancon.hpccube.com:65024/directlink/4/torchaudio/DAS1.0/torchaudio-2.1.2+das1.0+253903e.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=2a7b3bbe8b558f48784f302900fd1dff3ff9d10a3c139e00f2b136a76d6d7f1c
torchvision @ https://cancon.hpccube.com:65024/directlink/4/vision/DAS1.0/torchvision-0.16.0+das1.0+gitc9e7141.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=4d5e5071e89892cccb24c3ee0216cd79b3c22bc5cf1eb0eb49c2792d9f49fb62
tornado==6.4
tqdm @ file:///opt/conda/conda-bld/tqdm_1664392687731/work
traitlets==5.14.3
transformers==4.43.3
triton @ https://cancon.hpccube.com:65024/directlink/4/triton/DAS1.0/triton-2.1.0+das1.0+git3841f975.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=0dda810eb171af0b3f5cf90a1a4b2f41c9ef0ef08453762a798c86dd01fe976f
trl==0.9.6
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.0
tyro==0.8.5
tzdata==2024.1
ujson==5.10.0
uri-template==1.3.0
urllib3==2.2.2
uvicorn==0.30.0
uvloop==0.19.0
vllm==0.4.3
watchfiles==0.22.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
widgetsnbextension==4.0.11
xformers @ https://cancon.hpccube.com:65024/directlink/4/xformers/DAS1.0/xformers-0.0.25+das1.0+gitd11e899.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=b086d1bd50bd19c82ca44c424fe193dfcdd48bdd6695d3e6a58f53764c64f428
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zipp==3.19.0

8. `envs.yaml`

name: llama_factory_torch
channels:
  - https://repo.anaconda.com/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - boltons=23.0.0=py310h06a4308_0
  - brotlipy=0.7.0=py310h7f8727e_1002
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2024.3.11=h06a4308_0
  - certifi=2024.2.2=py310h06a4308_0
  - cffi=1.15.1=py310h74dc2b5_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - conda-content-trust=0.1.3=py310h06a4308_0
  - conda-package-handling=1.9.0=py310h5eee18b_1
  - cryptography=38.0.1=py310h9ce1e76_0
  - idna=3.4=py310h06a4308_0
  - jsonpatch=1.33=py310h06a4308_1
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.3=h5eee18b_3
  - openssl=1.1.1w=h7f8727e_0
  - pluggy=1.0.0=py310h06a4308_1
  - pycosat=0.6.4=py310h5eee18b_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pyopenssl=22.0.0=pyhd3eb1b0_0
  - pysocks=1.7.1=py310h06a4308_0
  - python=3.10.8=haa1d7c7_0
  - readline=8.2=h5eee18b_0
  - ruamel.yaml=0.17.21=py310h5eee18b_0
  - ruamel.yaml.clib=0.2.6=py310h5eee18b_1
  - setuptools=65.5.0=py310h06a4308_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.40.0=h5082296_0
  - tk=8.6.12=h1ccaba5_0
  - toolz=0.12.0=py310h06a4308_0
  - tqdm=4.64.1=py310h06a4308_0
  - wheel=0.37.1=pyhd3eb1b0_0
  - xz=5.2.8=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
      - accelerate==0.32.1
      - addict==2.4.0
      - aiofiles==23.2.1
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - aliyun-python-sdk-core==2.15.1
      - aliyun-python-sdk-kms==2.16.3
      - annotated-types==0.7.0
      - anyio==4.4.0
      - apex==1.1.0+0dd7f68.abi0.dtk2404.torch2.1
      - argon2-cffi==23.1.0
      - argon2-cffi-bindings==21.2.0
      - arrow==1.3.0
      - asttokens==2.4.1
      - async-lru==2.0.4
      - async-timeout==4.0.3
      - attrs==23.2.0
      - babel==2.15.0
      - beautifulsoup4==4.12.3
      - bitsandbytes==0.37.0+gitd3d888f.abi0.dtk2404.torch2.1
      - bleach==6.1.0
      - click==8.1.7
      - coloredlogs==15.0.1
      - comm==0.2.2
      - contourpy==1.2.1
      - crcmod==1.7
      - cycler==0.12.1
      - datasets==2.19.2
      - debugpy==1.8.1
      - decorator==5.1.1
      - deepspeed==0.12.3+gita724046.abi0.dtk2404.torch2.1.0
      - defusedxml==0.7.1
      - diffusers==0.29.2
      - dill==0.3.8
      - dnspython==2.6.1
      - docstring-parser==0.16
      - einops==0.8.0
      - email-validator==2.1.1
      - exceptiongroup==1.2.1
      - executing==2.0.1
      - fastapi==0.111.0
      - fastapi-cli==0.0.4
      - fastjsonschema==2.19.1
      - ffmpy==0.4.0
      - filelock==3.14.0
      - fire==0.6.0
      - flash-attn==2.0.4+82379d7.abi0.dtk2404.torch2.1
      - flatbuffers==24.3.25
      - fonttools==4.52.4
      - fqdn==1.5.1
      - frozenlist==1.4.1
      - fsspec==2024.3.1
      - gradio==4.40.0
      - gradio-client==1.2.0
      - h11==0.14.0
      - hjson==3.1.0
      - httpcore==1.0.5
      - httptools==0.6.1
      - httpx==0.27.0
      - huggingface==0.0.1
      - huggingface-hub==0.23.2
      - humanfriendly==10.0
      - hypothesis==5.35.1
      - importlib-metadata==7.1.0
      - importlib-resources==6.4.0
      - invisible-watermark==0.2.0
      - ipykernel==6.29.4
      - ipython==8.24.0
      - ipywidgets==8.1.3
      - isoduration==20.11.0
      - jedi==0.19.1
      - jieba==0.42.1
      - jinja2==3.1.4
      - jmespath==0.10.0
      - joblib==1.4.2
      - json5==0.9.25
      - jsonpointer==2.4
      - jsonschema==4.22.0
      - jsonschema-specifications==2023.12.1
      - jupyter-client==8.6.2
      - jupyter-core==5.7.2
      - jupyter-events==0.10.0
      - jupyter-ext-dataset==0.1.0
      - jupyter-ext-logo==0.1.0
      - jupyter-lsp==2.2.5
      - jupyter-server==2.14.0
      - jupyter-server-terminals==0.5.3
      - jupyterlab==4.2.1
      - jupyterlab-language-pack-zh-cn==4.0.post6
      - jupyterlab-pygments==0.3.0
      - jupyterlab-server==2.27.2
      - jupyterlab-widgets==3.0.11
      - kiwisolver==1.4.5
      - lightop==0.3+837dbb7.abi0.dtk2404.torch2.1
      - llamafactory==0.8.4.dev0
      - lmdeploy==0.1.0-git782048c.abi0.dtk2404.torch2.1.
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - matplotlib==3.9.0
      - matplotlib-inline==0.1.7
      - mdurl==0.1.2
      - mistune==3.0.2
      - mmcv==2.0.1-gitc0ccf15.abi0.dtk2404.torch2.1.
      - mmengine==0.10.4
      - mmengine-lite==0.10.4
      - modelscope==1.17.0
      - mpmath==1.3.0
      - msgpack==1.0.8
      - multidict==6.0.5
      - multiprocess==0.70.16
      - nbclient==0.10.0
      - nbconvert==7.16.4
      - nbformat==5.10.4
      - nest-asyncio==1.6.0
      - networkx==3.3
      - ninja==1.11.1.1
      - nltk==3.8.1
      - notebook-shim==0.2.4
      - numpy==1.24.3
      - onnxruntime==1.15.0+gita9ca438.abi0.dtk2404
      - opencv-python==4.9.0.80
      - orjson==3.10.3
      - oss2==2.18.6
      - overrides==7.7.0
      - packaging==24.0
      - pandas==2.2.2
      - pandocfilters==1.5.1
      - parso==0.8.4
      - peft==0.12.0
      - pexpect==4.9.0
      - pillow==10.3.0
      - pip==24.0
      - platformdirs==4.2.2
      - prometheus-client==0.20.0
      - prompt-toolkit==3.0.45
      - protobuf==5.27.0
      - psutil==5.9.8
      - ptyprocess==0.7.0
      - pure-eval==0.2.2
      - py-cpuinfo==9.0.0
      - pyarrow==16.1.0
      - pyarrow-hotfix==0.6
      - pycryptodome==3.20.0
      - pydantic==2.7.2
      - pydantic-core==2.18.3
      - pydub==0.25.1
      - pygments==2.18.0
      - pynvml==11.5.0
      - pyparsing==3.1.2
      - python-dateutil==2.9.0.post0
      - python-dotenv==1.0.1
      - python-json-logger==2.0.7
      - python-multipart==0.0.9
      - pytz==2024.1
      - pywavelets==1.6.0
      - pyyaml==6.0.1
      - pyzmq==26.0.3
      - ray==2.9.3
      - referencing==0.35.1
      - regex==2024.5.15
      - requests==2.32.3
      - rfc3339-validator==0.1.4
      - rfc3986-validator==0.1.1
      - rich==13.7.1
      - rouge-chinese==1.0.3
      - rpds-py==0.18.1
      - ruff==0.5.5
      - safetensors==0.4.3
      - scipy==1.14.0
      - semantic-version==2.10.0
      - send2trash==1.8.3
      - sentencepiece==0.2.0
      - shellingham==1.5.4
      - shtab==1.7.1
      - sniffio==1.3.1
      - sortedcontainers==2.4.0
      - soupsieve==2.5
      - sse-starlette==2.1.3
      - stack-data==0.6.3
      - starlette==0.37.2
      - sympy==1.12.1
      - termcolor==2.4.0
      - terminado==0.18.1
      - tiktoken==0.7.0
      - tinycss2==1.3.0
      - tokenizers==0.19.1
      - tomli==2.0.1
      - tomlkit==0.12.0
      - torch==2.1.0+git00661e0.abi0.dtk2404
      - torchaudio==2.1.2+253903e.abi0.dtk2404.torch2.1.0
      - torchvision==0.16.0+gitc9e7141.abi0.dtk2404.torch2.1
      - tornado==6.4
      - traitlets==5.14.3
      - transformers==4.43.3
      - triton==2.1.0+git3841f975.abi0.dtk2404
      - trl==0.9.6
      - typer==0.12.3
      - types-python-dateutil==2.9.0.20240316
      - typing-extensions==4.12.0
      - tyro==0.8.5
      - tzdata==2024.1
      - ujson==5.10.0
      - uri-template==1.3.0
      - urllib3==2.2.2
      - uvicorn==0.30.0
      - uvloop==0.19.0
      - vllm==0.4.3
      - watchfiles==0.22.0
      - wcwidth==0.2.13
      - webcolors==1.13
      - webencodings==0.5.1
      - websocket-client==1.8.0
      - websockets==12.0
      - widgetsnbextension==4.0.11
      - xformers==0.0.25+gitd11e899.abi0.dtk2404.torch2.1
      - xxhash==3.4.1
      - yapf==0.40.2
      - yarl==1.9.4
      - zipp==3.19.0
prefix: /opt/conda/envs/llama_factory_torch

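III. Server Information

For the server specifications, see the companion blog post: "SCNet, the Sugon Supercomputing Internet Platform: The Domestic Heterogeneous Accelerator DCU"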

1. CPU

root@notebook-1819288410202615810-scnlbe5oi5-43560:~# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-254
Off-line CPU(s) list:            255
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       HygonGenuine
CPU family:                      24
Model:                           4
Model name:                      Hygon C86 7490 64-core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         1600.000
CPU max MHz:                     2700.0000
CPU min MHz:                     1600.0000
BogoMIPS:                        5400.11
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-254
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; Load fences, __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full retpoline
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht
                                 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc ext
                                 d_apicid amd_dcm aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe
                                 popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dno
                                 wprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb hw_pstate sme retp
                                 oline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni
                                  xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_cle
                                 an flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov su
                                 ccor smca

2. Accelerator cards

Single-card environment

Multi-card environment

3. Memory

root@notebook-1819288410202615810-scnlbe5oi5-43560:~# free -h
              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        79Gi        20Gi        11Mi       907Gi       926Gi
Swap:            0B          0B          0B

4. Disk

root@notebook-1819288410202615810-scnlbe5oi5-43560:~# df -h
Filesystem               Size  Used Avail Use% Mounted on
overlay                   11T  2.2T  7.8T  22% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                    504G     0  504G   0% /sys/fs/cgroup
ks_p300s_public           53P   37P   16P  71% /etc/sugon_motd
/dev/md0                  11T  2.2T  7.8T  22% /etc/hosts
/dev/mapper/centos-root  3.5T   20G  3.5T   1% /etc/tmp
tmpfs                    330G   32K  330G   1% /dev/shm
tmpfs                    330G   12K  330G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                    504G     0  504G   0% /proc/acpi
tmpfs                    504G     0  504G   0% /proc/scsi
tmpfs                    504G     0  504G   0% /sys/firmware

5. System information

root@notebook-1819288410202615810-scnlbe5oi5-43560:~# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.6 LTS"
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

IV. Private Deployment of the Llama 3 Model

Official announcement: Meet Your New Assistant: Meta AI, Built With Llama 3 | Meta

GitHub repository: https://github.com/meta-llama/llama3

Documentation: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3

Online demo: https://llama.meta.com/llama3/

1. Environment preparation

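# Create a virtual environment
conda create -n llama_factory_torch python=3.10

# Activate the virtual environment
conda activate llama_factory_torch
# If that fails, run the following instead
source activate llama_factory_torch

# Download the LLaMA-Factory source code and enter the repository
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

# Upgrade pip
# Upgrading pip before installing the project's dependencies is recommended: an old pip may fail to install some recent packages or to resolve dependencies correctly.
python -m pip install --upgrade pip
# If that fails, run the following instead
pip install --upgrade pip

# Install the packages listed in requirements.txt
pip install -r requirements.txt --index-url https://mirrors.huaweicloud.com/repository/pypi/simple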

2. Download the Llama3 model

mkdir models
cd models

# Using Llama3-8B as the example
git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3-8B-Instruct.git

Hugging Face page: meta-llama/Meta-Llama-3-8B-Instruct
GitHub page: llama3
ModelScope page: LLM-Research/Meta-Llama-3-8B-Instruct
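Large weight files in such repositories are typically stored with Git LFS, so a clone made without LFS support can leave small pointer stubs in place of the real weights. The sketch below is one way to detect that, under the assumption that the repository uses standard Git LFS pointer files; the helper name `find_lfs_pointers` is hypothetical.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return names of files in model_dir that are LFS pointer stubs, not real data."""
    stubs = []
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_HEADER)) == LFS_HEADER:
                stubs.append(path.name)
    return stubs


if __name__ == "__main__":
    # Path relative to the models directory created above.
    model_dir = Path("Meta-Llama-3-8B-Instruct")
    if model_dir.is_dir():
        print("pointer stubs:", find_lfs_pointers(model_dir) or "none")
```

If any stubs are listed, running `git lfs pull` inside the model directory should fetch the real files, provided Git LFS is installed.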

3. Start the web service

python src/webui.py \
    --model_name_or_path "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct" \
    --template llama3 \
    --infer_backend vllm \
    --vllm_enforce_eager

# Run in the background
nohup python src/webui.py \
    --model_name_or_path "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/" \
    --template llama3 \
    --infer_backend vllm \
    --vllm_enforce_eager \
    > out.log 2>&1 &

The service starts successfully and prints the following output:

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-50216:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# python src/webui.py     --model_name_or_path "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/"     --template llama3     --infer_backend vllm     --vllm_enforce_eager
gradio_share: False
Running on local URL:  http://0.0.0.0:7860
Running on public URL: https://36dfc90e71a7e8c548.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
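When the service runs in the background, the startup log is not visible on screen, so it can help to probe the port directly. The sketch below checks whether anything is listening on port 7860, the port shown in the "Running on local URL" line above; the helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("web UI reachable:", is_port_open("127.0.0.1", 7860))
```

A successful connection only shows that a process is listening; out.log remains the place to look for model-loading errors.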

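4. Access the service

Open http://127.0.0.1:7860, the port reported in the startup log.

These steps complete the private deployment of the Llama 3 model with LLaMA-Factory.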

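V. Fine-tune the Llama 3 model

GitHub repository: LLaMA-Factory

1. Introduction

LLaMA Factory is an open-source project on GitHub that positions itself as an easy-to-use fine-tuning framework for large language models (LLMs), supporting architectures such as LLaMA, Baichuan, Qwen, and ChatGLM. More specifically, it provides open-source fine-tuning solutions that cover pre-training, instruction fine-tuning, and the RLHF stage. As of July 30, 2024, it supports more than 20 different models and ships with more than 60 built-in datasets, all wrapped in an efficient, easy-to-use developer workflow. Its most popular feature is LLaMA Board, a zero-code, visual, one-stop web interface for fine-tuning: the hyperparameters of a run can be set through the Web UI, and the training progress is updated in the Web UI in real time.

Put simply, with this project we only need to download the model and prepare a fine-tuning dataset in the format the project requires, and fine-tuning can start right away. This injects domain-specific knowledge into a general-purpose model and strengthens its understanding of that domain, achieving a "rapid shift from a general model to a vertical model".

The Llama 3 model deployed in the previous chapter does not yet have Chinese-language capability; this chapter fine-tunes it on a Chinese dataset.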

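2. Overall workflow

The process of fine-tuning Llama3 for Chinese with LLaMA-Factory is shown in the figure below:

3. Download the dataset

LLaMA-Factory/data/dataset_info.json is the dataset registry file and lists all available datasets. Look up the entry for the Chinese dataset: alpaca_zh.

Download the dataset: alpaca_zh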

git clone https://www.modelscope.cn/datasets/llamafactory/alpaca_zh.git

# Copy into the data directory (git clone places the file inside alpaca_zh/)
cp alpaca_zh/alpaca_data_zh_51k.json ./data
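Before registering the file, it can help to confirm that it really is in the Alpaca layout. The sketch below assumes the file is a JSON list of objects with instruction, input, and output string fields, with input optional; that is an assumption about alpaca_data_zh_51k.json, not something stated in this article. The helper names are hypothetical.

```python
import json

REQUIRED_KEYS = ("instruction", "output")  # "input" is treated as optional here


def find_bad_records(records: list) -> list:
    """Return the indices of records that are not usable Alpaca-style samples."""
    bad = []
    for i, rec in enumerate(records):
        ok = isinstance(rec, dict) and all(
            isinstance(rec.get(k), str) and rec[k].strip() for k in REQUIRED_KEYS
        )
        if not ok:
            bad.append(i)
    return bad


def check_file(path: str) -> list:
    """Load a JSON list from `path` and return the indices of malformed records."""
    with open(path, encoding="utf-8") as f:
        return find_bad_records(json.load(f))
```

Calling check_file("data/alpaca_data_zh_51k.json") should return an empty list if every record has a non-empty instruction and output.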

Modify the data/dataset_info.json file:

"alpaca_zh": {
    "hf_hub_url": "llamafactory/alpaca_zh",
    "ms_hub_url": "llamafactory/alpaca_zh"
}
change it to
"alpaca_zh": {
    "file_name": "alpaca_data_zh_51k.json"
},
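The same edit can be scripted instead of done by hand. This is a minimal sketch under the assumption that dataset_info.json is a single JSON object keyed by dataset name, as the fragment above suggests; use_local_file and rewrite_dataset_info are hypothetical helper names.

```python
import json


def use_local_file(dataset_info: dict, name: str, file_name: str) -> dict:
    """Return a copy of dataset_info in which `name` points at a local file under data/."""
    updated = dict(dataset_info)
    updated[name] = {"file_name": file_name}
    return updated


def rewrite_dataset_info(path: str, name: str, file_name: str) -> None:
    """Apply use_local_file to the JSON registry stored at `path`."""
    with open(path, encoding="utf-8") as f:
        info = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps any non-ASCII text in the registry readable.
        json.dump(use_local_file(info, name, file_name), f, ensure_ascii=False, indent=2)
        f.write("\n")
```

For this walkthrough the call would be rewrite_dataset_info("data/dataset_info.json", "alpaca_zh", "alpaca_data_zh_51k.json"). Keeping a backup of the original file first is a sensible precaution.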

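4. Create the fine-tuning script

An efficient fine-tuning framework can be understood as a tool library in which many functions are wrapped at a high level. To fine-tune a large model with these tools, we write scripts (that is, sets of commands the operating system can execute) that call them.

Switch to the ./LLaMA-Factory directory and create a script file named single_lora_llama3.sh (the name is arbitrary).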

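#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

# For pre-training, use                  --stage pt
# For supervised instruction fine-tuning --stage sft
# For reward model training, use         --stage rm
# Adding --quantization_bit 4 gives 4-bit QLoRA fine-tuning; without it, plain LoRA is used.

# The options are collected in a bash array so that each one can carry a comment.
args=(
  --stage sft                                          # pt (pre-training) or sft (supervised instruction fine-tuning)
  --do_train True                                      # run training
  --model_name_or_path models/Meta-Llama-3-8B-Instruct # path where the model is stored
  --dataset alpaca_zh                                  # dataset name registered in data/dataset_info.json; the file lives under LLaMA-Factory/data
  --template llama3                                    # use the llama3 template
  --lora_target all                                    # apply LoRA to all linear modules
  --output_dir saves/llama3-8b/lora/sft                # where the fine-tuned weights are saved
  --overwrite_cache                                    # ignore and overwrite existing cached data
  --per_device_train_batch_size 2                      # training batch size per device; set it according to GPU memory
  --gradient_accumulation_steps 8                      # number of gradient accumulation steps
  --lr_scheduler_type cosine                           # learning-rate scheduler type
  --logging_steps 5                                    # log loss, learning rate, and other metrics every 5 steps
  --save_steps 100                                     # save a checkpoint every 100 steps
  --learning_rate 5.0e-5                               # learning rate
  --num_train_epochs 1.0                               # number of passes over the whole training set
  --finetuning_type lora                               # fine-tune with LoRA (Low-Rank Adaptation)
  --fp16                                               # enable half-precision (fp16) training
  --lora_rank 4                                        # rank of the LoRA adapter matrices
)

# Single-card run
python src/train.py "${args[@]}"

Note: in the actual script file, comments must not follow a line-continuation backslash (\). Bash would end the command at that line and the run would fail, which is why the script above collects the options in an array instead.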
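The two batch options in the script work together: the number of samples behind each optimizer update is the per-device batch size times the gradient accumulation steps times the number of data-parallel devices. A small worked example, where the function name is hypothetical and the 4-card case is only an illustration:

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


single_card = effective_batch_size(2, 8)                # values from the script above
four_cards = effective_batch_size(2, 8, num_devices=4)  # hypothetical 4-card data-parallel run
print(single_card, four_cards)  # prints: 16 64
```

If memory forces a smaller per-device batch, raising the accumulation steps by the same factor keeps the effective batch size unchanged.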

In addition, to be safe, convert the script's line endings so that it runs on Ubuntu (this is generally needed when the file was copied over from Windows):

sed -i 's/\r$//' ./single_lora_llama3.sh
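For reference, the sed command removes the carriage return that Windows places before each newline. The sketch below does the same on a byte string and adds a check for whether a file still contains CRLF line endings; both function names are hypothetical.

```python
def strip_carriage_returns(data: bytes) -> bytes:
    """Remove one trailing CR from every line, like sed 's/\\r$//'."""
    return b"\n".join(
        line[:-1] if line.endswith(b"\r") else line for line in data.split(b"\n")
    )


def has_crlf(data: bytes) -> bool:
    """True if the data contains at least one CRLF line ending."""
    return b"\r\n" in data
```

Reading the script with open(path, "rb").read() and passing the result to has_crlf shows whether the conversion is needed at all.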

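5. Run the fine-tuning script

Run the fine-tuning script to obtain the fine-tuned model weights.

5.1 Single-card run

# Change permissions
chmod +x ./single_lora_llama3.sh

# Run the script
./single_lora_llama3.sh

A single card does not have enough memory, so loading the model fails; see the FAQ below.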

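5.2 Multi-card run (single node, multiple cards)

Use llamafactory-cli to launch the DeepSpeed engine for single-node multi-card training.

# Make a backup copy
cp examples/train_lora/llama3_lora_sft.yaml examples/train_lora/llama3_lora_sft.yaml.bak

Following the contents of the single_lora_llama3.sh script, modify LLaMA-Factory/examples/train_lora/llama3_lora_sft.yaml. To start the DeepSpeed engine, the deepspeed parameter in the configuration file specifies the path to the DeepSpeed configuration file.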

### model
# model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
model_name_or_path: models/Meta-Llama-3-8B-Instruct

### deepspeed
deepspeed: examples/deepspeed/ds_z3_config.json

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: alpaca_zh
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 12

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 5
save_steps: 100
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
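With lora_target: all, LoRA attaches a pair of low-rank matrices to each of the seven linear projections in every layer, so the number of trainable parameters can be estimated from the model dimensions. The sketch below uses the dimensions printed in the model config of the training log further down, and it assumes rank 8 because the YAML above does not set lora_rank (that default is an assumption). The function names are hypothetical.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to a d_in -> d_out linear layer."""
    return rank * (d_in + d_out)


def lora_trainable_params(hidden: int, intermediate: int, layers: int,
                          heads: int, kv_heads: int, rank: int) -> int:
    """Trainable parameters when LoRA is applied to all seven linear projections per layer."""
    kv_dim = hidden // heads * kv_heads  # grouped-query attention shrinks k_proj and v_proj
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes.values())
    return per_layer * layers


# Dimensions as printed in the log's model config; rank 8 is an assumed default.
total = lora_trainable_params(hidden=8192, intermediate=28672, layers=80,
                              heads=64, kv_heads=8, rank=8)
print(f"{total:,}")  # prints: 103,546,880
```

The result agrees with the "trainable params: 103,546,880" line in the training log, which suggests the rank assumption holds for that run.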

Run the fine-tuning

FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
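As a rough sanity check on memory, ZeRO stage 3 partitions the model parameters across ranks, so each card holds about one share of them. The sketch below is a back-of-the-envelope estimate only: it takes the element count (70.55B) and the four ranks from the log below, assumes 2 bytes per bf16 element, and ignores LoRA weights, activations, and optimizer state. The function name is hypothetical.

```python
def zero3_param_gib_per_rank(num_params: float, bytes_per_param: int, world_size: int) -> float:
    """Approximate GiB of partitioned parameters that each rank holds under ZeRO stage 3."""
    return num_params * bytes_per_param / world_size / 2**30


# 70.55e9 elements and 4 ranks are the values reported in the log below.
estimate = zero3_param_gib_per_rank(70.55e9, bytes_per_param=2, world_size=4)
print(f"{estimate:.1f} GiB per rank")
```

The estimate comes out just under 33 GiB, which is in line with the MA values of about 33 GB that DeepSpeed prints in the log.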

Output

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-08-05 11:56:48,638] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/05/2024 11:56:51 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:23095
[2024-08-05 11:56:53,841] torch.distributed.run: [WARNING]
[2024-08-05 11:56:53,841] torch.distributed.run: [WARNING] *****************************************
[2024-08-05 11:56:53,841] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-05 11:56:53,841] torch.distributed.run: [WARNING] *****************************************
[2024-08-05 11:56:58,301] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 11:56:58,342] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 11:56:58,613] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 11:56:58,710] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-05 11:57:01,430] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0805 11:57:01.432113 13332 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=381559744
08/05/2024 11:57:01 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/05/2024 11:57:01 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[2024-08-05 11:57:01,477] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0805 11:57:01.478453 13334 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=368429888
08/05/2024 11:57:01 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/05/2024 11:57:01 - INFO - llamafactory.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[2024-08-05 11:57:01,772] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-05 11:57:01,773] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0805 11:57:01.774361 13331 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=356681040
08/05/2024 11:57:01 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/05/2024 11:57:01 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_auto.py:682] 2024-08-05 11:57:01,782 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:731] 2024-08-05 11:57:01,783 >> loading configuration file models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-05 11:57:01,784 >> Model config LlamaConfig {
  "_name_or_path": "models/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2287] 2024-08-05 11:57:01,785 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-08-05 11:57:01,786 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-05 11:57:01,786 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-05 11:57:01,786 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-05 11:57:01,786 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:731] 2024-08-05 11:57:01,786 >> loading configuration file models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-05 11:57:01,787 >> Model config LlamaConfig {
  "_name_or_path": "models/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[WARNING|logging.py:328] 2024-08-05 11:57:01,909 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
08/05/2024 11:57:01 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/05/2024 11:57:01 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/05/2024 11:57:02 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/05/2024 11:57:02 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
[2024-08-05 11:57:02,152] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0805 11:57:02.154500 13333 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=366632656
08/05/2024 11:57:02 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/05/2024 11:57:02 - INFO - llamafactory.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2533] 2024-08-05 11:57:02,177 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/05/2024 11:57:02 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/05/2024 11:57:02 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/05/2024 11:57:02 - INFO - llamafactory.data.loader - Loading dataset alpaca_data_zh_51k.json...
I0805 11:57:02.293980 13332 ProcessGroupNCCL.cpp:2780] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
I0805 11:57:02.387957 13334 ProcessGroupNCCL.cpp:2780] Rank 3 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
08/05/2024 11:57:02 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/05/2024 11:57:02 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
I0805 11:57:02.918886 13333 ProcessGroupNCCL.cpp:2780] Rank 2 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Converting format of dataset (num_proc=12): 100%|█████████████████████| 1000/1000 [00:00<00:00, 4863.55 examples/s]
I0805 11:57:04.194933 13331 ProcessGroupNCCL.cpp:2780] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0805 11:57:04.821501 13331 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
08/05/2024 11:57:04 - INFO - llamafactory.data.loader - Loading dataset alpaca_data_zh_51k.json...
08/05/2024 11:57:04 - INFO - llamafactory.data.loader - Loading dataset alpaca_data_zh_51k.json...
08/05/2024 11:57:04 - INFO - llamafactory.data.loader - Loading dataset alpaca_data_zh_51k.json...
Running tokenizer on dataset (num_proc=12): 100%|██████████████████████| 1000/1000 [00:02<00:00, 342.83 examples/s]
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 98739, 109425, 19000, 9080, 40053, 104654, 16325, 111689, 83747, 11883, 53610, 11571, 128009, 128006, 78191, 128007, 271, 16, 13, 86758, 56602, 53610, 123128, 116051, 56602, 53610, 85315, 233, 118959, 83601, 115, 65455, 34208, 53610, 104432, 65455, 1811, 720, 17, 13, 86758, 53610, 74396, 58291, 53610, 44559, 114, 51109, 43167, 118006, 123641, 53610, 3922, 78657, 108914, 81802, 245, 34208, 108914, 118959, 1811, 720, 18, 13, 74662, 117481, 16325, 115890, 56602, 53610, 126966, 1811, 720, 19, 13, 111909, 222, 33976, 53610, 36651, 34208, 66285, 234, 36117, 231, 73548, 9554, 118487, 53610, 106041, 91495, 82317, 13646, 23951, 59464, 127150, 1811, 720, 20, 13, 107934, 245, 105444, 94, 21082, 118504, 106649, 3922, 38129, 103167, 111155, 85315, 233, 118959, 65455, 56602, 95337, 11883, 53610, 1811, 720, 21, 13, 114524, 43167, 106786, 53610, 3922, 114593, 106143, 108309, 58291, 93994, 66776, 120522, 11883, 113173, 1811, 720, 22, 13, 20033, 115, 109895, 58291, 125405, 46034, 13646, 30356, 111110, 53610, 104432, 65455, 1811, 720, 23, 13, 66827, 237, 83747, 27699, 229, 53610, 105301, 117027, 9554, 21082, 1811, 720, 24, 13, 59330, 121, 88367, 43240, 30590, 30358, 59464, 38129, 118954, 53610, 10110, 124714, 108914, 105060, 33748, 5486, 118959, 102452, 53610, 101837, 121, 34208, 85315, 233, 118959, 9554, 53610, 75376, 720, 605, 13, 127609, 125025, 126369, 60455, 96455, 45736, 9554, 108914, 81802, 245, 33748, 34208, 108914, 105060, 33748, 1811, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

我们如何在日常生活中减少用水？<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. 使用节水装置，如节水淋浴喷头和水龙头。
2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。
3. 在社区中提高节水意识。
4. 检查水管和灌溉系统的漏水情况，并及时修复它们。
5. 洗澡时间缩短，使用低流量淋浴头节约用水。
6. 收集雨水，用于园艺或其他非饮用目的。
7. 刷牙或擦手时关掉水龙头。
8. 减少浇水草坪的时间。
9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。
10. 只购买能源效率高的洗碗机和洗衣机。<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 16, 13, 86758, 56602, 53610, 123128, 116051, 56602, 53610, 85315, 233, 118959, 83601, 115, 65455, 34208, 53610, 104432, 65455, 1811, 720, 17, 13, 86758, 53610, 74396, 58291, 53610, 44559, 114, 51109, 43167, 118006, 123641, 53610, 3922, 78657, 108914, 81802, 245, 34208, 108914, 118959, 1811, 720, 18, 13, 74662, 117481, 16325, 115890, 56602, 53610, 126966, 1811, 720, 19, 13, 111909, 222, 33976, 53610, 36651, 34208, 66285, 234, 36117, 231, 73548, 9554, 118487, 53610, 106041, 91495, 82317, 13646, 23951, 59464, 127150, 1811, 720, 20, 13, 107934, 245, 105444, 94, 21082, 118504, 106649, 3922, 38129, 103167, 111155, 85315, 233, 118959, 65455, 56602, 95337, 11883, 53610, 1811, 720, 21, 13, 114524, 43167, 106786, 53610, 3922, 114593, 106143, 108309, 58291, 93994, 66776, 120522, 11883, 113173, 1811, 720, 22, 13, 20033, 115, 109895, 58291, 125405, 46034, 13646, 30356, 111110, 53610, 104432, 65455, 1811, 720, 23, 13, 66827, 237, 83747, 27699, 229, 53610, 105301, 117027, 9554, 21082, 1811, 720, 24, 13, 59330, 121, 88367, 43240, 30590, 30358, 59464, 38129, 118954, 53610, 10110, 124714, 108914, 105060, 33748, 5486, 118959, 102452, 53610, 101837, 121, 34208, 85315, 233, 118959, 9554, 53610, 75376, 720, 605, 13, 127609, 125025, 126369, 60455, 96455, 45736, 9554, 108914, 81802, 245, 33748, 34208, 108914, 105060, 33748, 1811, 128009]
labels:
1. 使用节水装置，如节水淋浴喷头和水龙头。
2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。
3. 在社区中提高节水意识。
4. 检查水管和灌溉系统的漏水情况，并及时修复它们。
5. 洗澡时间缩短，使用低流量淋浴头节约用水。
6. 收集雨水，用于园艺或其他非饮用目的。
7. 刷牙或擦手时关掉水龙头。
8. 减少浇水草坪的时间。
9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。
10. 只购买能源效率高的洗碗机和洗衣机。<|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-05 11:57:08,282 >> loading configuration file models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-05 11:57:08,283 >> Model config LlamaConfig {
  "_name_or_path": "models/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3631] 2024-08-05 11:57:08,316 >> loading weights file models/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:3776] 2024-08-05 11:57:08,317 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:1038] 2024-08-05 11:57:08,327 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

[2024-08-05 11:57:31,778] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 723, num_elems = 70.55B
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 30/30 [09:03<00:00, 18.13s/it]
08/05/2024 12:06:35 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/05/2024 12:06:35 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/05/2024 12:06:35 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
08/05/2024 12:06:35 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/05/2024 12:06:35 - INFO - llamafactory.model.model_utils.misc - Found linear modules: gate_proj,up_proj,o_proj,down_proj,q_proj,k_proj,v_proj
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 30/30 [09:17<00:00, 18.58s/it]
08/05/2024 12:06:49 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/05/2024 12:06:49 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/05/2024 12:06:49 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
08/05/2024 12:06:49 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/05/2024 12:06:49 - INFO - llamafactory.model.model_utils.misc - Found linear modules: up_proj,k_proj,q_proj,o_proj,gate_proj,down_proj,v_proj
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 30/30 [09:17<00:00, 18.58s/it]
08/05/2024 12:06:49 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/05/2024 12:06:49 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/05/2024 12:06:49 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
08/05/2024 12:06:49 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/05/2024 12:06:49 - INFO - llamafactory.model.model_utils.misc - Found linear modules: gate_proj,down_proj,v_proj,up_proj,k_proj,q_proj,o_proj
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 30/30 [09:20<00:00, 18.69s/it]
[INFO|modeling_utils.py:4463] 2024-08-05 12:06:52,779 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4471] 2024-08-05 12:06:52,779 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at models/Meta-Llama-3-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-05 12:06:52,788 >> loading configuration file models/Meta-Llama-3-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-05 12:06:52,788 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

08/05/2024 12:06:52 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/05/2024 12:06:52 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/05/2024 12:06:52 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
08/05/2024 12:06:52 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/05/2024 12:06:52 - INFO - llamafactory.model.model_utils.misc - Found linear modules: q_proj,gate_proj,o_proj,v_proj,k_proj,up_proj,down_proj
08/05/2024 12:06:54 - INFO - llamafactory.model.loader - trainable params: 103,546,880 || all params: 70,657,253,376 || trainable%: 0.1465
08/05/2024 12:06:54 - INFO - llamafactory.model.loader - trainable params: 103,546,880 || all params: 70,657,253,376 || trainable%: 0.1465
08/05/2024 12:06:54 - INFO - llamafactory.model.loader - trainable params: 103,546,880 || all params: 70,657,253,376 || trainable%: 0.1465
08/05/2024 12:06:54 - INFO - llamafactory.model.loader - trainable params: 103,546,880 || all params: 70,657,253,376 || trainable%: 0.1465
I0805 12:06:54.580374 13333 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=428523376
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:648] 2024-08-05 12:06:54,586 >> Using auto half precision backend
I0805 12:06:54.596994 13332 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=454694432
I0805 12:06:54.624678 13334 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=431493824
[2024-08-05 12:06:55,048] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.3, git-hash=a724046, git-branch=HEAD
I0805 12:06:55.155179 13331 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=418727664
[2024-08-05 12:06:55,155] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-08-05 12:06:55,166] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-08-05 12:06:55,166] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-08-05 12:06:55,376] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-08-05 12:06:55,377] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-08-05 12:06:55,377] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-08-05 12:06:55,377] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-08-05 12:06:55,603] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning
[2024-08-05 12:06:55,603] [INFO] [utils.py:803:see_memory_usage] MA 33.05 GB         Max_MA 38.24 GB         CA 36.45 GB         Max_CA 44 GB
[2024-08-05 12:06:55,606] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.46 GB, percent = 11.1%
[2024-08-05 12:06:55,628] [INFO] [stage3.py:127:__init__] Reduce bucket size 67108864
[2024-08-05 12:06:55,628] [INFO] [stage3.py:128:__init__] Prefetch bucket size 60397977
[2024-08-05 12:06:55,853] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-08-05 12:06:55,853] [INFO] [utils.py:803:see_memory_usage] MA 33.05 GB         Max_MA 33.05 GB         CA 36.45 GB         Max_CA 36 GB
[2024-08-05 12:06:55,859] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.48 GB, percent = 11.1%
Parameter Offload: Total persistent parameters: 49815552 in 1041 params
[2024-08-05 12:06:57,084] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-08-05 12:06:57,086] [INFO] [utils.py:803:see_memory_usage] MA 32.9 GB         Max_MA 33.05 GB         CA 36.45 GB         Max_CA 36 GB
[2024-08-05 12:06:57,089] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.55 GB, percent = 11.1%
[2024-08-05 12:06:57,405] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions
[2024-08-05 12:06:57,406] [INFO] [utils.py:803:see_memory_usage] MA 32.9 GB         Max_MA 32.9 GB         CA 36.45 GB         Max_CA 36 GB
[2024-08-05 12:06:57,408] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.52 GB, percent = 11.1%
[2024-08-05 12:06:58,057] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 1
[2024-08-05 12:06:58,059] [INFO] [utils.py:803:see_memory_usage] MA 32.9 GB         Max_MA 32.9 GB         CA 36.06 GB         Max_CA 36 GB
[2024-08-05 12:06:58,061] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.52 GB, percent = 11.1%
[2024-08-05 12:06:58,332] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions
[2024-08-05 12:06:58,332] [INFO] [utils.py:803:see_memory_usage] MA 32.9 GB         Max_MA 32.9 GB         CA 36.06 GB         Max_CA 36 GB
[2024-08-05 12:06:58,335] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.52 GB, percent = 11.1%
[2024-08-05 12:06:58,605] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions
[2024-08-05 12:06:58,606] [INFO] [utils.py:803:see_memory_usage] MA 33.0 GB         Max_MA 33.05 GB         CA 36.06 GB         Max_CA 36 GB
[2024-08-05 12:06:58,608] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.51 GB, percent = 11.1%
[2024-08-05 12:06:58,880] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states
[2024-08-05 12:06:58,881] [INFO] [utils.py:803:see_memory_usage] MA 33.0 GB         Max_MA 33.0 GB         CA 36.06 GB         Max_CA 36 GB
[2024-08-05 12:06:58,883] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.67 GB, percent = 11.1%
[2024-08-05 12:06:59,223] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states
[2024-08-05 12:06:59,224] [INFO] [utils.py:803:see_memory_usage] MA 33.19 GB         Max_MA 33.38 GB         CA 36.06 GB         Max_CA 36 GB
[2024-08-05 12:06:59,227] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.7 GB, percent = 11.1%
[2024-08-05 12:06:59,227] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
[2024-08-05 12:07:00,101] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer
[2024-08-05 12:07:00,103] [INFO] [utils.py:803:see_memory_usage] MA 33.36 GB         Max_MA 33.37 GB         CA 36.06 GB         Max_CA 36 GB
[2024-08-05 12:07:00,105] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory:  used = 111.94 GB, percent = 11.1%
[2024-08-05 12:07:00,105] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-08-05 12:07:00,106] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-08-05 12:07:00,106] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-08-05 12:07:00,106] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
[2024-08-05 12:07:00,121] [INFO] [config.py:974:print] DeepSpeedEngine configuration:
[2024-08-05 12:07:00,122] [INFO] [config.py:978:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2024-08-05 12:07:00,122] [INFO] [config.py:978:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-08-05 12:07:00,123] [INFO] [config.py:978:print]   amp_enabled .................. False
[2024-08-05 12:07:00,123] [INFO] [config.py:978:print]   amp_params ................... False
[2024-08-05 12:07:00,123] [INFO] [config.py:978:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   bfloat16_enabled ............. True
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   checkpoint_parallel_write_pipeline  False
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   checkpoint_tag_validation_enabled  True
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   checkpoint_tag_validation_fail  False
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fd68cb2f460>
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   communication_data_type ...... None
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   curriculum_enabled_legacy .... False
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   curriculum_params_legacy ..... False
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   data_efficiency_enabled ...... False
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   dataloader_drop_last ......... False
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   disable_allgather ............ False
[2024-08-05 12:07:00,124] [INFO] [config.py:978:print]   dump_state ................... False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   dynamic_loss_scale_args ...... None
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   eigenvalue_enabled ........... False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   eigenvalue_gas_boundary_resolution  1
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   eigenvalue_layer_num ......... 0
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   eigenvalue_max_iter .......... 100
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   eigenvalue_stability ......... 1e-06
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   eigenvalue_tol ............... 0.01
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   eigenvalue_verbose ........... False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   elasticity_enabled ........... False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   fp16_auto_cast ............... None
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   fp16_enabled ................. False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   fp16_master_weights_and_gradients  False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   global_rank .................. 0
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   grad_accum_dtype ............. None
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   gradient_accumulation_steps .. 8
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   gradient_clipping ............ 1.0
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   gradient_predivide_factor .... 1.0
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   initial_dynamic_scale ........ 1
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   load_universal_checkpoint .... False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   loss_scale ................... 1.0
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   memory_breakdown ............. False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   mics_hierarchial_params_gather  False
[2024-08-05 12:07:00,125] [INFO] [config.py:978:print]   mics_shard_size .............. -1
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   optimizer_legacy_fusion ...... False
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   optimizer_name ............... None
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   optimizer_params ............. None
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   pld_enabled .................. False
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   pld_params ................... False
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   prescale_gradients ........... False
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   scheduler_name ............... None
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   scheduler_params ............. None
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   seq_parallel_communication_data_type  torch.float32
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   sparse_attention ............. None
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   sparse_gradients_enabled ..... False
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   steps_per_print .............. inf
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   train_batch_size ............. 64
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   train_micro_batch_size_per_gpu  2
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   use_node_local_storage ....... False
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   wall_clock_breakdown ......... False
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   weight_quantization_config ... None
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   world_size ................... 4
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   zero_allow_untested_optimizer  True
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=67108864 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=60397977 param_persistence_threshold=81920 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-08-05 12:07:00,126] [INFO] [config.py:978:print]   zero_enabled ................. True
[2024-08-05 12:07:00,127] [INFO] [config.py:978:print]   zero_force_ds_cpu_optimizer .. True
[2024-08-05 12:07:00,127] [INFO] [config.py:978:print]   zero_optimization_stage ...... 3
[2024-08-05 12:07:00,128] [INFO] [config.py:964:print_user_config]   json = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+09,
        "reduce_bucket_size": 6.710886e+07,
        "stage3_prefetch_bucket_size": 6.039798e+07,
        "stage3_param_persistence_threshold": 8.192000e+04,
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "steps_per_print": inf
}
[INFO|trainer.py:2134] 2024-08-05 12:07:00,129 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-08-05 12:07:00,129 >>   Num examples = 900
[INFO|trainer.py:2136] 2024-08-05 12:07:00,129 >>   Num Epochs = 1
[INFO|trainer.py:2137] 2024-08-05 12:07:00,129 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2140] 2024-08-05 12:07:00,129 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:2141] 2024-08-05 12:07:00,129 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-08-05 12:07:00,129 >>   Total optimization steps = 14
[INFO|trainer.py:2143] 2024-08-05 12:07:00,147 >>   Number of trainable parameters = 103,546,880
  0%|                                                                                       | 0/14 [00:00<?, ?it/s]I0805 12:08:02.956094 24377 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
{'loss': 2.5017, 'grad_norm': 0.7846836546663523, 'learning_rate': 4.267766952966369e-05, 'epoch': 0.35}
{'loss': 2.3575, 'grad_norm': 1.0236646094475954, 'learning_rate': 1.2500000000000006e-05, 'epoch': 0.71}
100%|███████████████████████████████████████████████████████████████████████████| 14/14 [1:29:15<00:00, 379.93s/it][INFO|trainer.py:3503] 2024-08-05 13:37:45,022 >> Saving model checkpoint to saves/llama3-8b/lora/sft/checkpoint-14
[INFO|configuration_utils.py:731] 2024-08-05 13:37:45,096 >> loading configuration file models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-05 13:37:45,097 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2702] 2024-08-05 13:37:45,397 >> tokenizer config file saved in saves/llama3-8b/lora/sft/checkpoint-14/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-05 13:37:45,399 >> Special tokens file saved in saves/llama3-8b/lora/sft/checkpoint-14/special_tokens_map.json
[2024-08-05 13:37:45,688] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step14 is about to be saved!
/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2024-08-05 13:37:45,767] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: saves/llama3-8b/lora/sft/checkpoint-14/global_step14/zero_pp_rank_0_mp_rank_00_model_states.pt
[2024-08-05 13:37:45,768] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving saves/llama3-8b/lora/sft/checkpoint-14/global_step14/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2024-08-05 13:37:45,831] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved saves/llama3-8b/lora/sft/checkpoint-14/global_step14/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2024-08-05 13:37:45,835] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving saves/llama3-8b/lora/sft/checkpoint-14/global_step14/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-08-05 13:37:46,167] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved saves/llama3-8b/lora/sft/checkpoint-14/global_step14/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-08-05 13:37:46,169] [INFO] [engine.py:3417:_save_zero_checkpoint] zero checkpoint saved saves/llama3-8b/lora/sft/checkpoint-14/global_step14/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-08-05 13:37:46,381] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step14 is ready now!
[INFO|trainer.py:2394] 2024-08-05 13:37:46,402 >>

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 5446.2551, 'train_samples_per_second': 0.165, 'train_steps_per_second': 0.003, 'train_loss': 2.343491962977818, 'epoch': 0.99}
100%|███████████████████████████████████████████████████████████████████████████| 14/14 [1:30:46<00:00, 389.02s/it]
[INFO|trainer.py:3503] 2024-08-05 13:39:31,900 >> Saving model checkpoint to saves/llama3-8b/lora/sft
[INFO|configuration_utils.py:731] 2024-08-05 13:39:31,937 >> loading configuration file models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-05 13:39:31,938 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2702] 2024-08-05 13:39:32,327 >> tokenizer config file saved in saves/llama3-8b/lora/sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-05 13:39:32,328 >> Special tokens file saved in saves/llama3-8b/lora/sft/special_tokens_map.json
***** train metrics *****
  epoch                    =     0.9912
  total_flos               =    32611GF
  train_loss               =     2.3435
  train_runtime            = 1:30:46.25
  train_samples_per_second =      0.165
  train_steps_per_second   =      0.003
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
08/05/2024 13:39:42 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
08/05/2024 13:39:42 - WARNING - llamafactory.extras.ploting - No metric eval_accuracy to plot.
[INFO|trainer.py:3819] 2024-08-05 13:39:42,289 >>
***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-05 13:39:42,290 >>   Num examples = 100
[INFO|trainer.py:3824] 2024-08-05 13:39:42,290 >>   Batch size = 1
100%|██████████████████████████████████████████████████████████████████████████████| 25/25 [09:40<00:00, 23.20s/it]
***** eval metrics *****
  epoch                   =     0.9912
  eval_loss               =     2.1066
  eval_runtime            = 0:10:04.47
  eval_samples_per_second =      0.165
  eval_steps_per_second   =      0.041
[INFO|modelcard.py:449] 2024-08-05 13:49:46,769 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
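
The headline numbers in the training log above can be checked by hand. The following is a minimal sketch, not code from LLaMA-Factory: it assumes Trainer-style step bookkeeping (each rank takes an equal share of the samples, and a trailing partial accumulation window is dropped) and assumes LoRA rank 8 applied to all seven linear projections per layer. Neither assumption is stated in this section, but together they reproduce the logged values. The model dimensions are taken from the LlamaConfig printed in the log.

```python
import math

# Values copied from the training log above.
num_examples = 900
per_device_batch = 2      # "Instantaneous batch size per device"
world_size = 4            # DeepSpeed "world_size"
grad_accum = 8            # "Gradient Accumulation steps"

# Effective batch size per optimizer step.
total_batch = per_device_batch * world_size * grad_accum

# Assumed bookkeeping: each rank sees ceil(N / world_size) samples, split into
# micro-batches; an incomplete final accumulation window yields no optimizer step.
samples_per_rank = math.ceil(num_examples / world_size)
micro_batches_per_rank = math.ceil(samples_per_rank / per_device_batch)
optimizer_steps = micro_batches_per_rank // grad_accum
epoch_fraction = optimizer_steps * grad_accum / micro_batches_per_rank


def lora_param_count(hidden, kv_dim, intermediate, layers, rank):
    """LoRA adds rank * (in_features + out_features) parameters per adapted linear layer."""
    shapes = [
        (hidden, hidden),        # q_proj
        (hidden, kv_dim),        # k_proj
        (hidden, kv_dim),        # v_proj
        (hidden, hidden),        # o_proj
        (hidden, intermediate),  # gate_proj
        (hidden, intermediate),  # up_proj
        (intermediate, hidden),  # down_proj
    ]
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes)
    return per_layer * layers


# Dimensions from the logged LlamaConfig (hidden_size, num_attention_heads,
# num_key_value_heads, intermediate_size, num_hidden_layers). rank=8 is an assumption.
head_dim = 8192 // 64
trainable = lora_param_count(
    hidden=8192, kv_dim=head_dim * 8, intermediate=28672, layers=80, rank=8
)

print(total_batch, optimizer_steps, round(epoch_fraction, 4), f"{trainable:,}")
# prints: 64 14 0.9912 103,546,880
```

These match "Total train batch size = 64", "Total optimization steps = 14", "epoch = 0.9912" and "Number of trainable parameters = 103,546,880" in the log.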

After fine-tuning completes, the new model weight files are written to the output directory:

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# tree -L 3 saves/llama3-8b/lora/sft/
saves/llama3-8b/lora/sft/
|-- README.md
|-- adapter_config.json
|-- adapter_model.safetensors
|-- all_results.json
|-- checkpoint-14
|   |-- README.md
|   |-- adapter_config.json
|   |-- adapter_model.safetensors
|   |-- global_step14
|   |   |-- bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
|   |   |-- bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
|   |   |-- bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
|   |   |-- bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
|   |   |-- zero_pp_rank_0_mp_rank_00_model_states.pt
|   |   |-- zero_pp_rank_1_mp_rank_00_model_states.pt
|   |   |-- zero_pp_rank_2_mp_rank_00_model_states.pt
|   |   `-- zero_pp_rank_3_mp_rank_00_model_states.pt
|   |-- latest
|   |-- rng_state_0.pth
|   |-- rng_state_1.pth
|   |-- rng_state_2.pth
|   |-- rng_state_3.pth
|   |-- scheduler.pt
|   |-- special_tokens_map.json
|   |-- tokenizer.json
|   |-- tokenizer_config.json
|   |-- trainer_state.json
|   |-- training_args.bin
|   `-- zero_to_fp32.py
|-- eval_results.json
|-- special_tokens_map.json
|-- tokenizer.json
|-- tokenizer_config.json
|-- train_results.json
|-- trainer_log.jsonl
|-- trainer_state.json
|-- training_args.bin
`-- training_loss.png
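
The metrics printed at the end of training are also saved in the JSON files listed above. As a rough sketch, the snippet below reads all_results.json and converts the evaluation loss to perplexity. It assumes the file uses the same key names as the printed metrics (train_loss, eval_loss); the function name summarize_run is hypothetical.

```python
import json
import math
from pathlib import Path


def summarize_run(output_dir):
    """Read all_results.json from a run directory and derive perplexity from eval_loss."""
    metrics = json.loads((Path(output_dir) / "all_results.json").read_text())
    summary = {
        "train_loss": metrics.get("train_loss"),
        "eval_loss": metrics.get("eval_loss"),
    }
    if summary["eval_loss"] is not None:
        # Perplexity is exp(mean cross-entropy loss, in nats per token).
        summary["eval_perplexity"] = math.exp(summary["eval_loss"])
    return summary


if __name__ == "__main__":
    run_dir = Path("saves/llama3-8b/lora/sft")
    if (run_dir / "all_results.json").exists():
        print(summarize_run(run_dir))
```

For the eval_loss of 2.1066 reported above, this gives a perplexity of about 8.22.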

Resource usage while fine-tuning runs:

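6. Merge model weights

The merge runs on the CPU. It combines the fine-tuned weights (the LoRA adapter) with the original model weights to produce the final fine-tuned model.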
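
Conceptually, merging a LoRA adapter folds the low-rank update into each adapted weight matrix: W' = W + (alpha / r) * B * A. The toy sketch below illustrates that formula in plain Python with made-up 2x3 matrices. It is not LLaMA-Factory's implementation, and the helper names matmul and merge_lora are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: out_features=2, in_features=3, rank=1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # rank x in_features
B = [[0.5], [-1.0]]     # out_features x rank

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
# prints: [[2.0, 2.0, 5.0], [-2.0, -3.0, -6.0]]
```

After the merge, the adapter matrices are no longer needed at inference time, because their effect is already contained in the merged weights.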

cp examples/merge_lora/llama3_lora_sft.yaml examples/merge_lora/llama3_lora_sft.yaml.bak

Edit examples/merge_lora/llama3_lora_sft.yaml:

### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
# model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
# model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct
model_name_or_path: models/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
template: llama3
finetuning_type: lora

### export
export_dir: output/llama3_lora_sft
export_size: 2
export_device: cpu
export_legacy_format: false

Run the merge:

llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
[2024-08-04 18:32:12,686] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[INFO|tokenization_auto.py:682] 2024-08-04 18:32:16,059 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:731] 2024-08-04 18:32:16,063 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-04 18:32:16,064 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2287] 2024-08-04 18:32:16,066 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-08-04 18:32:16,066 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 18:32:16,066 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 18:32:16,066 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 18:32:16,066 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:731] 2024-08-04 18:32:16,067 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-04 18:32:16,067 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[WARNING|logging.py:328] 2024-08-04 18:32:16,209 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[INFO|tokenization_utils_base.py:2533] 2024-08-04 18:32:16,498 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/04/2024 18:32:16 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 18:32:16 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-04 18:32:16,519 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-04 18:32:16,520 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

08/04/2024 18:32:16 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
[INFO|modeling_utils.py:3631] 2024-08-04 18:32:16,548 >> loading weights file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-04 18:32:16,553 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-04 18:32:16,554 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:38<00:00,  1.29s/it]
[INFO|modeling_utils.py:4463] 2024-08-04 18:32:55,856 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4471] 2024-08-04 18:32:55,856 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-04 18:32:55,868 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-04 18:32:55,869 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

08/04/2024 18:32:55 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/04/2024 18:35:43 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
08/04/2024 18:35:43 - INFO - llamafactory.model.adapter - Loaded adapter(s): /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/saves/llama3-8b/lora/sft/
08/04/2024 18:35:43 - INFO - llamafactory.model.loader - all params: 70,553,706,496
08/04/2024 18:35:43 - INFO - llamafactory.train.tuner - Convert model dtype to: torch.bfloat16.
[INFO|configuration_utils.py:472] 2024-08-04 18:35:43,295 >> Configuration saved in /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/output/llama3_lora_sft/config.json
[INFO|configuration_utils.py:807] 2024-08-04 18:35:43,298 >> Configuration saved in /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/output/llama3_lora_sft/generation_config.json
[INFO|modeling_utils.py:2763] 2024-08-04 18:43:36,708 >> The model is bigger than the maximum size per checkpoint (2GB) and is going to be split in 82 checkpoint shards. You can find where each parameters has been saved in the index located at /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/output/llama3_lora_sft/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2702] 2024-08-04 18:43:36,717 >> tokenizer config file saved in /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/output/llama3_lora_sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-04 18:43:36,719 >> Special tokens file saved in /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/output/llama3_lora_sft/special_tokens_map.json

After the merge, the new model weight files are:

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# tree -L 6 models/llama3_lora_sft/
models/llama3_lora_sft/
|-- config.json
|-- generation_config.json
|-- model-00001-of-00009.safetensors
|-- model-00002-of-00009.safetensors
|-- model-00003-of-00009.safetensors
|-- model-00004-of-00009.safetensors
|-- model-00005-of-00009.safetensors
|-- model-00006-of-00009.safetensors
|-- model-00007-of-00009.safetensors
|-- model-00008-of-00009.safetensors
|-- model-00009-of-00009.safetensors
|-- model.safetensors.index.json
|-- special_tokens_map.json
|-- tokenizer.json
`-- tokenizer_config.json
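
To sanity-check an exported directory like the one above, the index file can be inspected directly. This is a small sketch that assumes the usual layout of model.safetensors.index.json (a weight_map from tensor name to shard file, plus an optional metadata.total_size); the function name describe_sharded_checkpoint is hypothetical.

```python
import json
from pathlib import Path


def describe_sharded_checkpoint(model_dir):
    """Summarize a sharded safetensors checkpoint from its index file."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    weight_map = index["weight_map"]            # tensor name -> shard file name
    shards = sorted(set(weight_map.values()))
    # Shards referenced by the index but absent on disk indicate an incomplete export.
    missing = [name for name in shards if not (model_dir / name).exists()]
    return {
        "num_tensors": len(weight_map),
        "num_shards": len(shards),
        "total_size_bytes": index.get("metadata", {}).get("total_size"),
        "missing_shards": missing,
    }


if __name__ == "__main__":
    export_dir = Path("output/llama3_lora_sft")
    if (export_dir / "model.safetensors.index.json").exists():
        print(describe_sharded_checkpoint(export_dir))
```

An empty missing_shards list and a shard count that matches the files on disk suggest the export finished cleanly.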

Resource usage while the merge runs:

7. Model inference

cp examples/inference/llama3_lora_sft.yaml examples/inference/llama3_lora_sft.yaml.bak

Edit examples/inference/llama3_lora_sft.yaml:

model_name_or_path: output/llama3_lora_sft   # path to the merged model
adapter_name_or_path: saves/llama3-8b/lora/sft  # path to the fine-tuned LoRA adapter
template: llama3
finetuning_type: lora
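
The template: llama3 setting tells the CLI to wrap each conversation in the Llama 3 Instruct chat format, which ends every turn with <|eot_id|>. The sketch below shows that format as published for Llama 3 Instruct. It is an illustration only; the function name build_llama3_prompt is hypothetical, and LLaMA-Factory's own template code may differ in details such as system prompts or whitespace handling.

```python
def build_llama3_prompt(messages):
    """Render a list of {"role", "content"} dicts in the Llama 3 Instruct chat format."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    # Leave an open assistant header so the model generates the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = build_llama3_prompt([{"role": "user", "content": "Hello"}])
print(prompt)
```

Generation stops when the model emits <|eot_id|>, which is why the log reports that token as the replacement eos token.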

Run inference:

llamafactory-cli chat examples/inference/llama3_lora_sft.yaml

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
[2024-08-05 14:35:13,263] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[INFO|tokenization_utils_base.py:2287] 2024-08-05 14:35:17,727 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-08-05 14:35:17,727 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-05 14:35:17,728 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-05 14:35:17,728 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-05 14:35:17,728 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2024-08-05 14:35:18,040 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/05/2024 14:35:18 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-05 14:35:18,058 >> loading configuration file output/llama3_lora_sft/config.json
[INFO|configuration_utils.py:800] 2024-08-05 14:35:18,060 >> Model config LlamaConfig {
  "_name_or_path": "output/llama3_lora_sft",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

08/05/2024 14:35:18 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
[INFO|modeling_utils.py:3631] 2024-08-05 14:35:18,091 >> loading weights file output/llama3_lora_sft/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-05 14:35:18,112 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-05 14:35:18,114 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 82/82 [04:33<00:00,  3.34s/it]
[INFO|modeling_utils.py:4463] 2024-08-05 14:39:53,074 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4471] 2024-08-05 14:39:53,074 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at output/llama3_lora_sft.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-05 14:39:53,097 >> loading configuration file output/llama3_lora_sft/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-05 14:39:53,098 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

08/05/2024 14:39:53 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/05/2024 14:40:06 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
08/05/2024 14:40:06 - INFO - llamafactory.model.adapter - Loaded adapter(s): saves/llama3-8b/lora/sft
08/05/2024 14:40:06 - INFO - llamafactory.model.loader - all params: 70,553,706,496
Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.

User: 中国深圳有哪些旅游景点
Assistant: 深圳是一个旅游目的地，拥有许多旅游景点，以下是一些主要的旅游景点：

1. ** Window of the World**（世界之窗）：是一个主题公园，展示了世界各地的著名景点和文化遗产。
2. **Splendid China Folk Village**（锦绣中华民俗文化村）：是一个展示中国传统文化和民俗的主题公园。
3. **Shenzhen Bay Sports Center**（深圳湾体育中心）：是一个多功能的体育场馆，举办过多场国际性比赛和演唱会。
4. **Lianhua Mountain Park**（莲花山公园）：是一个大型公园，拥有美丽的自然景色和多种娱乐设施。
5. **Dafen Oil Painting Village**（大芬油画村）：是一个艺术家聚集的村庄，展示了中国油画艺术的发展和成果。
6. **Shenzhen Museum**（深圳博物馆）：是一个展示深圳历史和文化的博物馆。
7. **Xiaomeisha Beach**（小梅沙海滩）：是一个popular的海滩，拥有美丽的海景和多种水上娱乐设施。
8. **Qianhai Bay**（前海湾）：是一个新的旅游区，拥有现代化的建筑和多种娱乐设施。
9. **Meridian View Center**（子午线观景台）：是一个观景台，提供了深圳市区的全景视图。
10. **Shenzhen Tower**（深圳电视塔）：是一个观景台，提供了深圳市区的全景视图。
11. **Fairy Lake Botanical Garden**（仙湖植物园）：是一个植物园，拥有多种植物和美丽的自然景色。
12. **Wutong Mountain**（梧桐山）：是一个自然保护区，拥有多种植物和动物。
13. **Dapeng Peninsula**（大鹏半岛）：是一个自然保护区，拥有美丽的海景和多种水上娱乐设施。
14. **Shenzhen Bay Park**（深圳湾公园）：是一个大型公园，拥有美丽的自然景色和多种娱乐设施。
15. **Luohu Commercial City**（罗湖商业城）：是一个大型购物中心，拥有多种商品和娱乐设施。

这些只是深圳的部分旅游景点，深圳还有许多其他的旅游景点和娱乐设施，等待您来探索。

User: 中国广州有哪些旅游景点
Assistant: 广州是一个历史悠久的城市，拥有许多旅游景点，以下是一些主要的旅游景点：

1. **Canton Tower**（广州电视塔）：是一个观景台，提供了广州市区的全景视图。
2. **Chimelong Tourist Resort**（长隆旅游度假区）：是一个大型旅游度假区，拥有主题公园、水上乐园、动物园等多种娱乐设施。
3. **Yuexiu Park**（越秀公园）：是一个大型公园，拥有美丽的自然景色、湖泊、瀑布等多种景观。
4. **Shamian Island**（沙面岛）：是一个历史悠久的小岛，拥有许多欧式建筑和传统的广州文化。
5. **Guangzhou Museum**（广州博物馆）：是一个展示广州历史和文化的博物馆。
6. **Temple of the Six Banyan Trees**（六榕寺）：是一个佛寺，拥有历史悠久的建筑和文化遗产。
7. **Baiyun Mountain**（白云山）：是一个自然保护区，拥有多种植物和动物。
8. **Liwan Lake**（荔湾湖）：是一个人工湖泊，拥有美丽的自然景色和多种娱乐设施。
9. **Qingping Market**（清平市场）：是一个传统的市场，出售各种传统的广州特产和手工艺品。
10. **Haizhu Square**（海珠广场）：是一个大型广场，拥有美丽的自然景色和多种娱乐设施。
11. **Guangzhou Opera House**（广州大剧院）：是一个现代化的剧院，举办各种演出和音乐会。
12. **Huacheng Square**（华城广场）：是一个大型广场，拥有美丽的自然景色和多种娱乐设施。
13. **Guangdong Museum**（广东博物馆）：是一个展示广东省历史和文化的博物馆。
14. **Yuexiu Hill**（越秀山）：是一个历史悠久的山丘，拥有多种植物和动物。
15. **Shangxiajiu Pedestrian Street**（上下九步行街）：是一个繁华的步行街，出售各种商品和美食。

这些只是广州的部分旅游景点，广州还有许多其他的旅游景点和娱乐设施，等待您来探索。

此外，广州还有一些特色景点，例如：

* **Cantonese cuisine**（广州菜）：广州是中国最著名的美食城市之一，拥有多种特色美食。
* **Traditional Cantonese architecture**（传统广州建筑）：广州拥有许多传统的广州建筑，例如欧式建筑、中国古典建筑等。
* **Pearl River**（珠江）：广州的母亲河，提供了美丽的河景和多种水上娱乐设施。
* **Night cruise**（夜游船）：可以欣赏广州的夜景和河景。

[Figure: resource usage at runtime]

VI. FAQ

Q: `ImportError: libcuda.so.1: cannot open shared object file: No such file or directory`

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-20553:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# python src/webui.py \
>     --model_name_or_path "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct" \
>     --template llama3 \
>     --infer_backend vllm \
>     --vllm_enforce_eager
No ROCm runtime is found, using ROCM_HOME='/opt/dtk'
/opt/conda/envs/llama3/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_hip.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[2024-08-02 14:02:58,860] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/webui.py", line 17, in <module>
    from llamafactory.webui.interface import create_ui
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/__init__.py", line 38, in <module>
    from .cli import VERSION
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/cli.py", line 22, in <module>
    from .api.app import run_api
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/api/app.py", line 21, in <module>
    from ..chat import ChatModel
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/chat/__init__.py", line 16, in <module>
    from .chat_model import ChatModel
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/chat/chat_model.py", line 26, in <module>
    from .vllm_engine import VllmEngine
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/chat/vllm_engine.py", line 29, in <module>
    from vllm import AsyncEngineArgs, AsyncLLMEngine, RequestOutput, SamplingParams
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/vllm/__init__.py", line 4, in <module>
    from vllm.engine.async_llm_engine import AsyncLLMEngine
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 12, in <module>
    from vllm.engine.llm_engine import LLMEngine
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 16, in <module>
    from vllm.model_executor.model_loader import get_architecture_class_name
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 10, in <module>
    from vllm.model_executor.models.llava import LlavaForConditionalGeneration
  File "/opt/conda/envs/llama3/lib/python3.10/site-packages/vllm/model_executor/models/llava.py", line 11, in <module>
    from vllm.model_executor.layers.activation import get_act_fn
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/vllm/model_executor/layers/activation.py", line 9, in <module>
    from vllm._C import ops
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-20553:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# python
Python 3.10.8 (main, Nov  4 2022, 13:48:29) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

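Cause: the installed PyTorch build does not support the DCU, so `torch.cuda.is_available()` returns False.

Fix: find, download, and install a DCU build of PyTorch from the 光合 (Guanghe) community. For example, try installing torch 2.1.0 from the wheel torch-2.1.0+das1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.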
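
Before and after installing the wheel, two quick checks may help: whether the wheel's Python tag matches the interpreter of the active conda environment, and what the installed torch reports about its backend. The sketch below is a minimal illustration. The helper names are hypothetical, the wheel name is the one quoted above, and the assumption that `torch.version.hip` is populated on the DTK build is inferred from the "HIP out of memory" messages elsewhere in this post rather than confirmed.

```python
import importlib.util
import sys


def parse_wheel_name(name: str) -> dict:
    """Split a wheel file name into its standard dash-separated fields."""
    stem = name[:-4] if name.endswith(".whl") else name
    parts = stem.split("-")
    if len(parts) not in (5, 6):  # an optional build tag gives 6 fields
        raise ValueError(f"not a wheel name: {name!r}")
    return {
        "distribution": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }


def interpreter_tag(version_info=None) -> str:
    """CPython tag of an interpreter, e.g. (3, 10) -> 'cp310'."""
    major, minor = (version_info or sys.version_info)[:2]
    return f"cp{major}{minor}"


def torch_backend_report() -> dict:
    """Report what the installed torch (if any) says about its backend."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch
    return {
        "installed": True,
        "version": torch.__version__,
        "hip": getattr(torch.version, "hip", None),  # assumed set on HIP-based builds
        "device_available": torch.cuda.is_available(),
    }


wheel = parse_wheel_name(
    "torch-2.1.0+das1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64"
)
print(wheel["version"], wheel["python_tag"], "interpreter:", interpreter_tag())
print(torch_backend_report())
```

If `python_tag` differs from `interpreter_tag()`, pip will refuse the wheel. If `device_available` is still False after installation, the environment is probably still importing a different torch build.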

Q: `ValueError: When localhost is not accessible, a shareable link must be created. Please set share=True or check your proxy settings to allow access to localhost.`

Running on local URL:  http://0.0.0.0:7860
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/LLaMA-Factory/src/webui.py", line 27, in <module>
    main()
  File "/public/home/scnlbe5oi5/LLaMA-Factory/src/webui.py", line 23, in main
    create_ui().queue().launch(share=gradio_share, server_name=server_name, inbrowser=True)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.11/site-packages/gradio/blocks.py", line 2446, in launch
    raise ValueError(
ValueError: When localhost is not accessible, a shareable link must be created. Please set share=True or check your proxy settings to allow access to localhost.

# Fix: in src/webui.py, change
create_ui().queue().launch(share=gradio_share, server_name=server_name, inbrowser=True)
# to
create_ui().queue().launch(share=True, server_name=server_name, inbrowser=True)
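
An alternative that avoids editing the source may be available. The `share=gradio_share` argument in the traceback shows that the flag is computed before `launch`, and recent LLaMA-Factory versions appear to derive it from a `GRADIO_SHARE` environment variable. That is an assumption about the checked-out version, so check `src/webui.py` before relying on it. The sketch below only illustrates that style of parsing; `share_flag` is a hypothetical helper, not LLaMA-Factory code.

```python
import os


def share_flag(environ=None) -> bool:
    """Interpret GRADIO_SHARE as a '1'/'true' style switch (assumed convention)."""
    env = os.environ if environ is None else environ
    return env.get("GRADIO_SHARE", "0").lower() in ("true", "1")


print(share_flag({"GRADIO_SHARE": "1"}))
print(share_flag({}))
```

If the variable is honored, launching with `GRADIO_SHARE=1 python src/webui.py` would have the same effect as hard-coding `share=True`.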

Q: Out of memory on a single card

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-20553:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# ./single_lora_llama3.sh
[2024-08-02 15:22:58,956] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/02/2024 15:23:02 - WARNING - llamafactory.hparams.parser - We recommend enable mixed precision training.
08/02/2024 15:23:02 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: None
[INFO|tokenization_auto.py:682] 2024-08-02 15:23:02,665 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:731] 2024-08-02 15:23:02,690 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-02 15:23:02,692 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2287] 2024-08-02 15:23:02,698 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-08-02 15:23:02,698 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-02 15:23:02,698 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-02 15:23:02,698 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-02 15:23:02,698 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:731] 2024-08-02 15:23:02,699 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-02 15:23:02,699 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[WARNING|logging.py:328] 2024-08-02 15:23:02,907 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[INFO|tokenization_utils_base.py:2533] 2024-08-02 15:23:03,240 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/02/2024 15:23:03 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/02/2024 15:23:03 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/02/2024 15:23:03 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 507/507 [00:00<00:00, 1.98MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 18.7M/18.7M [00:52<00:00, 354kB/s]
Generating train split: 51155 examples [00:00, 81737.09 examples/s]
Converting format of dataset: 100%|█████████████████████████████████████████████████████████████████| 51155/51155 [00:00<00:00, 87369.27 examples/s]
Running tokenizer on dataset: 100%|██████████████████████████████████████████████████████████████████| 51155/51155 [00:17<00:00, 2914.66 examples/s]
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 98739, 109425, 19000, 9080, 40053, 104654, 16325, 111689, 83747, 11883, 53610, 11571, 128009, 128006, 78191, 128007, 271, 16, 13, 86758, 56602, 53610, 123128, 116051, 56602, 53610, 85315, 233, 118959, 83601, 115, 65455, 34208, 53610, 104432, 65455, 1811, 720, 17, 13, 86758, 53610, 74396, 58291, 53610, 44559, 114, 51109, 43167, 118006, 123641, 53610, 3922, 78657, 108914, 81802, 245, 34208, 108914, 118959, 1811, 720, 18, 13, 74662, 117481, 16325, 115890, 56602, 53610, 126966, 1811, 720, 19, 13, 111909, 222, 33976, 53610, 36651, 34208, 66285, 234, 36117, 231, 73548, 9554, 118487, 53610, 106041, 91495, 82317, 13646, 23951, 59464, 127150, 1811, 720, 20, 13, 107934, 245, 105444, 94, 21082, 118504, 106649, 3922, 38129, 103167, 111155, 85315, 233, 118959, 65455, 56602, 95337, 11883, 53610, 1811, 720, 21, 13, 114524, 43167, 106786, 53610, 3922, 114593, 106143, 108309, 58291, 93994, 66776, 120522, 11883, 113173, 1811, 720, 22, 13, 20033, 115, 109895, 58291, 125405, 46034, 13646, 30356, 111110, 53610, 104432, 65455, 1811, 720, 23, 13, 66827, 237, 83747, 27699, 229, 53610, 105301, 117027, 9554, 21082, 1811, 720, 24, 13, 59330, 121, 88367, 43240, 30590, 30358, 59464, 38129, 118954, 53610, 10110, 124714, 108914, 105060, 33748, 5486, 118959, 102452, 53610, 101837, 121, 34208, 85315, 233, 118959, 9554, 53610, 75376, 720, 605, 13, 127609, 125025, 126369, 60455, 96455, 45736, 9554, 108914, 81802, 245, 33748, 34208, 108914, 105060, 33748, 1811, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

我们如何在日常生活中减少用水？<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. 使用节水装置，如节水淋浴喷头和水龙头。
2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。
3. 在社区中提高节水意识。
4. 检查水管和灌溉系统的漏水情况，并及时修复它们。
5. 洗澡时间缩短，使用低流量淋浴头节约用水。
6. 收集雨水，用于园艺或其他非饮用目的。
7. 刷牙或擦手时关掉水龙头。
8. 减少浇水草坪的时间。
9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。
10. 只购买能源效率高的洗碗机和洗衣机。<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 16, 13, 86758, 56602, 53610, 123128, 116051, 56602, 53610, 85315, 233, 118959, 83601, 115, 65455, 34208, 53610, 104432, 65455, 1811, 720, 17, 13, 86758, 53610, 74396, 58291, 53610, 44559, 114, 51109, 43167, 118006, 123641, 53610, 3922, 78657, 108914, 81802, 245, 34208, 108914, 118959, 1811, 720, 18, 13, 74662, 117481, 16325, 115890, 56602, 53610, 126966, 1811, 720, 19, 13, 111909, 222, 33976, 53610, 36651, 34208, 66285, 234, 36117, 231, 73548, 9554, 118487, 53610, 106041, 91495, 82317, 13646, 23951, 59464, 127150, 1811, 720, 20, 13, 107934, 245, 105444, 94, 21082, 118504, 106649, 3922, 38129, 103167, 111155, 85315, 233, 118959, 65455, 56602, 95337, 11883, 53610, 1811, 720, 21, 13, 114524, 43167, 106786, 53610, 3922, 114593, 106143, 108309, 58291, 93994, 66776, 120522, 11883, 113173, 1811, 720, 22, 13, 20033, 115, 109895, 58291, 125405, 46034, 13646, 30356, 111110, 53610, 104432, 65455, 1811, 720, 23, 13, 66827, 237, 83747, 27699, 229, 53610, 105301, 117027, 9554, 21082, 1811, 720, 24, 13, 59330, 121, 88367, 43240, 30590, 30358, 59464, 38129, 118954, 53610, 10110, 124714, 108914, 105060, 33748, 5486, 118959, 102452, 53610, 101837, 121, 34208, 85315, 233, 118959, 9554, 53610, 75376, 720, 605, 13, 127609, 125025, 126369, 60455, 96455, 45736, 9554, 108914, 81802, 245, 33748, 34208, 108914, 105060, 33748, 1811, 128009]
labels:
1. 使用节水装置，如节水淋浴喷头和水龙头。
2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。
3. 在社区中提高节水意识。
4. 检查水管和灌溉系统的漏水情况，并及时修复它们。
5. 洗澡时间缩短，使用低流量淋浴头节约用水。
6. 收集雨水，用于园艺或其他非饮用目的。
7. 刷牙或擦手时关掉水龙头。
8. 减少浇水草坪的时间。
9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。
10. 只购买能源效率高的洗碗机和洗衣机。<|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-02 15:24:20,727 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-02 15:24:20,728 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3631] 2024-08-02 15:24:20,758 >> loading weights file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-02 15:24:20,824 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-02 15:24:20,825 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

Loading checkpoint shards:  47%|███████████████████████████████████████▏                                            | 14/30 [03:14<03:41, 13.87s/it]
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/train.py", line 28, in <module>
    main()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/train.py", line 19, in main
    run_exp()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 47, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/model/loader.py", line 153, in load_model
    model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3916, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4390, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 936, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 404, in set_module_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacty of 63.98 GiB of which 0 bytes is free. Of the allocated memory 63.40 GiB is allocated by PyTorch, and 816.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

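Cause: the accelerator card ran out of memory, so the model failed to load. The error is raised inside from_pretrained after 14 of 30 checkpoint shards, before any training step, with 63.40 GiB of the card's 63.98 GiB already allocated.

Fix: reduce batch_size, or increase the number of accelerator cards. Because the failure here happens during model loading rather than during a training step, a smaller batch_size alone is unlikely to resolve this particular error.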
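
A rough estimate shows why one card is not enough. The sketch below multiplies the parameter count reported in the inference log (`all params: 70,553,706,496`) by 2 bytes per bfloat16 weight and compares the result with the 63.98 GiB capacity from the error message. It is a lower bound only: it ignores activations, gradients, optimizer state, and the LoRA parameters, and the card count assumes the weights are split evenly across cards, which plain data-parallel training (each process loading a full copy) does not do.

```python
import math


def weights_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GiB (bfloat16/float16 = 2 bytes)."""
    return num_params * bytes_per_param / 2**30


def min_cards_for_weights(num_params: int, card_gib: float, bytes_per_param: int = 2) -> int:
    """Lower bound on cards if the weights are split evenly across them."""
    return math.ceil(weights_gib(num_params, bytes_per_param) / card_gib)


params = 70_553_706_496   # "all params" from the inference log
card = 63.98              # GiB, total capacity from the OOM message

need = weights_gib(params)
print(f"weights: {need:.1f} GiB, one card: {card} GiB, fits: {need < card}")
print("cards needed for weights alone:", min_cards_for_weights(params, card))
```

The same two functions can be reused with a different parameter count or `bytes_per_param` (for example 1 for 8-bit weights) to see how the estimate changes.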

Q: Out of memory on multiple cards

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# torchrun  --standalone --nnodes=1 --nproc-per-node=4  src/train.py \
>   --stage sft \
>   --do_train True \
>   --model_name_or_path /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/ \
>   --dataset alpaca_zh \
>   --template llama3 \
>   --lora_target q_proj,v_proj \
>   --output_dir /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/output \
>   --overwrite_cache \
>   --per_device_train_batch_size 2 \
>   --gradient_accumulation_steps 64 \
>   --lr_scheduler_type cosine \
>   --logging_steps 5 \
>   --save_steps 100 \
>   --learning_rate 5e-5 \
>   --num_train_epochs 1.0 \
>   --finetuning_type lora \
>   --lora_rank 4
[2024-08-03 07:36:41,589] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-08-03 07:36:41,590] torch.distributed.run: [WARNING]
[2024-08-03 07:36:41,590] torch.distributed.run: [WARNING] *****************************************
[2024-08-03 07:36:41,590] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-03 07:36:41,590] torch.distributed.run: [WARNING] *****************************************
[2024-08-03 07:36:46,569] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-03 07:36:46,647] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-03 07:36:46,649] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-03 07:36:46,656] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0803 07:36:50.128130 169256 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=369040736
08/03/2024 07:36:50 - WARNING - llamafactory.hparams.parser - We recommend enable mixed precision training.
08/03/2024 07:36:50 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/03/2024 07:36:50 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: None
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
08/03/2024 07:36:50 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/03/2024 07:36:50 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
I0803 07:36:50.887167 169256 ProcessGroupNCCL.cpp:2780] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0803 07:36:50.981253 169258 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=371077680
08/03/2024 07:36:50 - WARNING - llamafactory.hparams.parser - We recommend enable mixed precision training.
08/03/2024 07:36:50 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/03/2024 07:36:50 - INFO - llamafactory.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0803 07:36:50.989648 169257 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=360123488
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0803 07:36:50.994580 169255 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=380578768
08/03/2024 07:36:50 - WARNING - llamafactory.hparams.parser - We recommend enable mixed precision training.
08/03/2024 07:36:50 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/03/2024 07:36:50 - INFO - llamafactory.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: None
08/03/2024 07:36:51 - WARNING - llamafactory.hparams.parser - We recommend enable mixed precision training.
08/03/2024 07:36:51 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/03/2024 07:36:51 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: None
[INFO|tokenization_auto.py:682] 2024-08-03 07:36:51,003 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:731] 2024-08-03 07:36:51,003 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-03 07:36:51,004 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2287] 2024-08-03 07:36:51,008 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-08-03 07:36:51,008 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-03 07:36:51,008 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-03 07:36:51,008 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-03 07:36:51,008 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:731] 2024-08-03 07:36:51,008 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-03 07:36:51,009 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[WARNING|logging.py:328] 2024-08-03 07:36:51,135 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
08/03/2024 07:36:51 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/03/2024 07:36:51 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
[INFO|tokenization_utils_base.py:2533] 2024-08-03 07:36:51,399 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/03/2024 07:36:51 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/03/2024 07:36:51 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/03/2024 07:36:51 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/03/2024 07:36:51 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/03/2024 07:36:51 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
I0803 07:36:51.659898 169258 ProcessGroupNCCL.cpp:2780] Rank 3 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0803 07:36:51.715169 169257 ProcessGroupNCCL.cpp:2780] Rank 2 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Converting format of dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 51155/51155 [00:01<00:00, 45265.18 examples/s]
I0803 07:36:57.874289 169255 ProcessGroupNCCL.cpp:2780] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0803 07:36:58.349363 169255 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
08/03/2024 07:36:58 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
08/03/2024 07:36:58 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
08/03/2024 07:36:58 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
Running tokenizer on dataset: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 51155/51155 [00:17<00:00, 2973.65 examples/s]
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 98739, 109425, 19000, 9080, 40053, 104654, 16325, 111689, 83747, 11883, 53610, 11571, 128009, 128006, 78191, 128007, 271, 16, 13, 86758, 56602, 53610, 123128, 116051, 56602, 53610, 85315, 233, 118959, 83601, 115, 65455, 34208, 53610, 104432, 65455, 1811, 720, 17, 13, 86758, 53610, 74396, 58291, 53610, 44559, 114, 51109, 43167, 118006, 123641, 53610, 3922, 78657, 108914, 81802, 245, 34208, 108914, 118959, 1811, 720, 18, 13, 74662, 117481, 16325, 115890, 56602, 53610, 126966, 1811, 720, 19, 13, 111909, 222, 33976, 53610, 36651, 34208, 66285, 234, 36117, 231, 73548, 9554, 118487, 53610, 106041, 91495, 82317, 13646, 23951, 59464, 127150, 1811, 720, 20, 13, 107934, 245, 105444, 94, 21082, 118504, 106649, 3922, 38129, 103167, 111155, 85315, 233, 118959, 65455, 56602, 95337, 11883, 53610, 1811, 720, 21, 13, 114524, 43167, 106786, 53610, 3922, 114593, 106143, 108309, 58291, 93994, 66776, 120522, 11883, 113173, 1811, 720, 22, 13, 20033, 115, 109895, 58291, 125405, 46034, 13646, 30356, 111110, 53610, 104432, 65455, 1811, 720, 23, 13, 66827, 237, 83747, 27699, 229, 53610, 105301, 117027, 9554, 21082, 1811, 720, 24, 13, 59330, 121, 88367, 43240, 30590, 30358, 59464, 38129, 118954, 53610, 10110, 124714, 108914, 105060, 33748, 5486, 118959, 102452, 53610, 101837, 121, 34208, 85315, 233, 118959, 9554, 53610, 75376, 720, 605, 13, 127609, 125025, 126369, 60455, 96455, 45736, 9554, 108914, 81802, 245, 33748, 34208, 108914, 105060, 33748, 1811, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

我们如何在日常生活中减少用水？<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. 使用节水装置，如节水淋浴喷头和水龙头。
2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。
3. 在社区中提高节水意识。
4. 检查水管和灌溉系统的漏水情况，并及时修复它们。
5. 洗澡时间缩短，使用低流量淋浴头节约用水。
6. 收集雨水，用于园艺或其他非饮用目的。
7. 刷牙或擦手时关掉水龙头。
8. 减少浇水草坪的时间。
9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。
10. 只购买能源效率高的洗碗机和洗衣机。<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 16, 13, 86758, 56602, 53610, 123128, 116051, 56602, 53610, 85315, 233, 118959, 83601, 115, 65455, 34208, 53610, 104432, 65455, 1811, 720, 17, 13, 86758, 53610, 74396, 58291, 53610, 44559, 114, 51109, 43167, 118006, 123641, 53610, 3922, 78657, 108914, 81802, 245, 34208, 108914, 118959, 1811, 720, 18, 13, 74662, 117481, 16325, 115890, 56602, 53610, 126966, 1811, 720, 19, 13, 111909, 222, 33976, 53610, 36651, 34208, 66285, 234, 36117, 231, 73548, 9554, 118487, 53610, 106041, 91495, 82317, 13646, 23951, 59464, 127150, 1811, 720, 20, 13, 107934, 245, 105444, 94, 21082, 118504, 106649, 3922, 38129, 103167, 111155, 85315, 233, 118959, 65455, 56602, 95337, 11883, 53610, 1811, 720, 21, 13, 114524, 43167, 106786, 53610, 3922, 114593, 106143, 108309, 58291, 93994, 66776, 120522, 11883, 113173, 1811, 720, 22, 13, 20033, 115, 109895, 58291, 125405, 46034, 13646, 30356, 111110, 53610, 104432, 65455, 1811, 720, 23, 13, 66827, 237, 83747, 27699, 229, 53610, 105301, 117027, 9554, 21082, 1811, 720, 24, 13, 59330, 121, 88367, 43240, 30590, 30358, 59464, 38129, 118954, 53610, 10110, 124714, 108914, 105060, 33748, 5486, 118959, 102452, 53610, 101837, 121, 34208, 85315, 233, 118959, 9554, 53610, 75376, 720, 605, 13, 127609, 125025, 126369, 60455, 96455, 45736, 9554, 108914, 81802, 245, 33748, 34208, 108914, 105060, 33748, 1811, 128009]
labels:
1. 使用节水装置，如节水淋浴喷头和水龙头。
2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。
3. 在社区中提高节水意识。
4. 检查水管和灌溉系统的漏水情况，并及时修复它们。
5. 洗澡时间缩短，使用低流量淋浴头节约用水。
6. 收集雨水，用于园艺或其他非饮用目的。
7. 刷牙或擦手时关掉水龙头。
8. 减少浇水草坪的时间。
9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。
10. 只购买能源效率高的洗碗机和洗衣机。<|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-03 07:37:15,793 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-03 07:37:15,794 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3631] 2024-08-03 07:37:15,822 >> loading weights file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-03 07:37:15,823 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-03 07:37:15,824 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

Loading checkpoint shards:  47%|██████████████████████████████████████████████████████████▎                                                                  | 14/30 [06:13<07:07, 26.71s/it]
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/train.py", line 28, in <module>
    main()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/train.py", line 19, in main
    run_exp()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 47, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/model/loader.py", line 153, in load_model
    model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3916, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4390, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/modeling_utils.py", line 936, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 404, in set_module_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 448.00 MiB. GPU 3 has a total capacty of 63.98 GiB of which 0 bytes is free. Of the allocated memory 63.40 GiB is allocated by PyTorch, and 815.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 448.00 MiB. GPU 1 has a total capacty of 63.98 GiB of which 0 bytes is free. Of the allocated memory 63.40 GiB is allocated by PyTorch, and 815.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 448.00 MiB. GPU 2 has a total capacty of 63.98 GiB of which 0 bytes is free. Of the allocated memory 63.40 GiB is allocated by PyTorch, and 815.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacty of 63.98 GiB of which 0 bytes is free. Of the allocated memory 63.40 GiB is allocated by PyTorch, and 815.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF
[2024-08-03 07:43:37,156] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 169255) of binary: /opt/conda/envs/llama_factory_torch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/llama_factory_torch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-03_07:43:37
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 169256)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-08-03_07:43:37
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 169257)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-08-03_07:43:37
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 169258)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-03_07:43:37
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 169255)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

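Cause: LLaMA-Factory supports three distributed engines: DDP, DeepSpeed, and FSDP. DDP does not shard the model, so every card loads a full copy of the weights and runs out of memory. In the log above, all four cards fail with 63.40 GiB of 63.98 GiB allocated when only 14 of the 30 checkpoint shards have been loaded.

| Engine | Data sharding | Model sharding | Optimizer sharding | Parameter offload |
| --- | --- | --- | --- | --- |
| DDP | Supported | Not supported | Not supported | Not supported |
| DeepSpeed | Supported | Supported | Supported | Supported |
| FSDP | Supported | Supported | Supported | Supported |

Solution: use the DeepSpeed or FSDP engine for distributed training.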
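
The out-of-memory failure can be checked with a rough back-of-the-envelope estimate. The sketch below is not part of LLaMA-Factory; it only counts the weights implied by the `LlamaConfig` values printed in the log above and compares the bfloat16 footprint with the 63.98 GiB card capacity reported in the error message. The helper names (`llama_param_count`, `weights_gib`) are made up for this illustration, and the sharded figure assumes the weights are split evenly across the four cards, ignoring activations, LoRA adapters, optimizer state, and framework overhead.

```python
# Values copied from the LlamaConfig dump printed in the log above.
LOGGED_CONFIG = {
    "hidden_size": 8192,
    "intermediate_size": 28672,
    "num_attention_heads": 64,
    "num_hidden_layers": 80,
    "num_key_value_heads": 8,
    "tie_word_embeddings": False,
    "vocab_size": 128256,
}

BYTES_PER_BF16 = 2   # torch_dtype is bfloat16 in the log
GIB = 2 ** 30
CARD_GIB = 63.98     # total capacity reported in the OOM message
NUM_CARDS = 4        # ranks 0..3 appear in the log


def llama_param_count(cfg: dict) -> int:
    """Count the weights of a Llama-style decoder without bias terms."""
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = head_dim * cfg["num_key_value_heads"]

    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o + k, v
    mlp = 3 * hidden * cfg["intermediate_size"]             # gate, up, down
    norms = 2 * hidden                                      # two RMSNorms
    per_layer = attention + mlp + norms

    embedding = cfg["vocab_size"] * hidden
    lm_head = 0 if cfg["tie_word_embeddings"] else embedding
    final_norm = hidden
    return cfg["num_hidden_layers"] * per_layer + embedding + lm_head + final_norm


def weights_gib(cfg: dict, num_shards: int = 1) -> float:
    """bfloat16 weight footprint per card, in GiB, for an even split."""
    return llama_param_count(cfg) * BYTES_PER_BF16 / GIB / num_shards


replicated = weights_gib(LOGGED_CONFIG)           # DDP: full copy on every card
sharded = weights_gib(LOGGED_CONFIG, NUM_CARDS)   # model sharding: 1/4 per card

print(f"parameters        : {llama_param_count(LOGGED_CONFIG):,}")
print(f"full copy per card: {replicated:.1f} GiB (card has {CARD_GIB} GiB)")
print(f"sharded over {NUM_CARDS}    : {sharded:.1f} GiB per card")
```

With the logged values, a full copy of the weights is far larger than a single card, while an even four-way split fits with room to spare. This matches the table: only an engine that shards the model can load this checkpoint on these cards.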

Q：`ValueError: Please launch distributed training with llamafactory-cli or torchrun.`

(llama_factory_torch) root@notebook-1819291427828183041-scnlbe5oi5-51898:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# ./single_lora_llama3.sh
[2024-08-02 16:55:03,538] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/train.py", line 28, in <module>
    main()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/train.py", line 19, in main
    run_exp()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 45, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/hparams/parser.py", line 195, in get_train_args
    raise ValueError("Please launch distributed training with `llamafactory-cli` or `torchrun`.")
ValueError: Please launch distributed training with `llamafactory-cli` or `torchrun`.

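Cause: on a multi-card server, all cards are used by default, but this script is not launched through `llamafactory-cli` or `torchrun`, so it does not support multi-card runs.

Solution: run on a single card by exposing only one device to the script.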

CUDA_VISIBLE_DEVICES=0 ./single_lora_llama3.sh
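
The same single-card restriction can be applied from Python when the script is started by another program. This is only a minimal sketch: `single_device_env` and `run_on_single_device` are hypothetical helper names, and it assumes the script honours `CUDA_VISIBLE_DEVICES` in the same way as the shell command above.

```python
import os
import subprocess
from typing import Mapping, Optional


def single_device_env(device_id: int,
                      base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment that exposes only one device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(device_id)
    return env


def run_on_single_device(script: str, device_id: int = 0) -> int:
    """Run a script with a single visible device and return its exit code."""
    completed = subprocess.run([script], env=single_device_env(device_id))
    return completed.returncode


# Show the variable the child process would see (no script is started here).
print(single_device_env(0, {"CUDA_VISIBLE_DEVICES": "0,1,2,3"})["CUDA_VISIBLE_DEVICES"])
```

Calling `run_on_single_device("./single_lora_llama3.sh")` would then be the equivalent of the command above. Because only the child's environment is changed, other processes on the server keep access to all cards.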

Q：`RuntimeError: Failed to import modelscope.msdatasets because of the following error (look up to see its traceback): No module named 'oss2'`

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-08-04 09:24:42,724] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/04/2024 09:24:45 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:24626
[2024-08-04 09:24:47,796] torch.distributed.run: [WARNING]
[2024-08-04 09:24:47,796] torch.distributed.run: [WARNING] *****************************************
[2024-08-04 09:24:47,796] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-04 09:24:47,796] torch.distributed.run: [WARNING] *****************************************
[2024-08-04 09:24:52,280] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-04 09:24:52,298] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-04 09:24:52,462] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-04 09:24:52,612] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-04 09:24:55,341] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0804 09:24:55.342823 38022 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=374876544
08/04/2024 09:24:55 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/04/2024 09:24:55 - INFO - llamafactory.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[2024-08-04 09:24:55,492] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0804 09:24:55.493716 38021 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=357873744
08/04/2024 09:24:55 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/04/2024 09:24:55 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[2024-08-04 09:24:55,594] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0804 09:24:55.595580 38023 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=381227184
08/04/2024 09:24:55 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/04/2024 09:24:55 - INFO - llamafactory.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[2024-08-04 09:24:55,753] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-04 09:24:55,753] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0804 09:24:55.755142 38020 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=372711472
08/04/2024 09:24:55 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/04/2024 09:24:55 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
2024-08-04 09:24:55,854 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
2024-08-04 09:24:55,960 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
2024-08-04 09:24:56,206 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
2024-08-04 09:24:56,383 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
08/04/2024 09:24:56 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 09:24:56 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
[INFO|tokenization_utils_base.py:2287] 2024-08-04 09:24:56,643 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 09:24:56,643 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 09:24:56,643 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 09:24:56,643 >> loading file tokenizer_config.json
08/04/2024 09:24:56 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 09:24:56 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/04/2024 09:24:56 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 09:24:56 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
I0804 09:24:56.879868 38021 ProcessGroupNCCL.cpp:2780] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[INFO|tokenization_utils_base.py:2533] 2024-08-04 09:24:56,942 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/04/2024 09:24:56 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 09:24:56 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/04/2024 09:24:56 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
I0804 09:24:56.954972 38022 ProcessGroupNCCL.cpp:2780] Rank 2 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0804 09:24:57.078207 38023 ProcessGroupNCCL.cpp:2780] Rank 3 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0804 09:24:57.246373 38020 ProcessGroupNCCL.cpp:2780] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0804 09:24:57.975442 38020 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
08/04/2024 09:24:58 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
08/04/2024 09:24:58 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
08/04/2024 09:24:58 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
Traceback (most recent call last):
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/utils/import_utils.py", line 451, in _get_module
    return importlib.import_module('.' + module_name, self.__name__)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/msdatasets/__init__.py", line 2, in <module>
    from modelscope.msdatasets.ms_dataset import MsDataset
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/msdatasets/ms_dataset.py", line 16, in <module>
    from modelscope.msdatasets.data_loader.data_loader_manager import (
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/msdatasets/data_loader/data_loader_manager.py", line 12, in <module>
    from modelscope.msdatasets.data_loader.data_loader import OssDownloader
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/msdatasets/data_loader/data_loader.py", line 15, in <module>
    from modelscope.msdatasets.data_files.data_files_manager import \
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/msdatasets/data_files/data_files_manager.py", line 11, in <module>
    from modelscope.msdatasets.download.dataset_builder import (
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/msdatasets/download/dataset_builder.py", line 24, in <module>
    from modelscope.msdatasets.download.download_manager import \
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/msdatasets/download/download_manager.py", line 9, in <module>
    from modelscope.msdatasets.utils.oss_utils import OssUtilities
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/msdatasets/utils/oss_utils.py", line 7, in <module>
    import oss2
ModuleNotFoundError: No module named 'oss2'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 46, in run_sft
    dataset_module = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/data/loader.py", line 233, in get_dataset
    dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/data/loader.py", line 153, in _get_merged_dataset
    datasets.append(_load_single_dataset(dataset_attr, model_args, data_args, training_args))
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/data/loader.py", line 87, in _load_single_dataset
    from modelscope import MsDataset
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/utils/import_utils.py", line 434, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/modelscope/utils/import_utils.py", line 453, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import modelscope.msdatasets because of the following error (look up to see its traceback):
No module named 'oss2'
I0804 09:24:58.887567 38023 ProcessGroupNCCL.cpp:874] [Rank 3] Destroyed 1communicators on CUDA device 3
I0804 09:24:58.909525 38022 ProcessGroupNCCL.cpp:874] [Rank 2] Destroyed 1communicators on CUDA device 2
I0804 09:24:58.909849 38020 ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0
I0804 09:24:59.381202 38021 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
[2024-08-04 09:25:02,912] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 38020 closing signal SIGTERM
[2024-08-04 09:25:02,912] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 38021 closing signal SIGTERM
[2024-08-04 09:25:03,077] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 38022) of binary: /opt/conda/envs/llama_factory_torch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/llama_factory_torch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-04_09:25:02
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 38023)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-04_09:25:02
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 38022)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

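Cause: the oss2 package is not installed. modelscope's msdatasets module imports oss2, so importing MsDataset fails and the dataset cannot be loaded.

Solution: install the oss2 package.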

How to handle the “No module named oss2” error when you use OSS by using Python

pip install --no-dependencies oss2
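
Before relaunching the training job, it can help to confirm that the package is now visible in the active conda environment. The following is a minimal sketch, not part of LLaMA-Factory: `missing_modules` is a hypothetical helper name, and the module list is an assumption taken from the traceback above (modelscope's msdatasets imports oss2).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: these are the modules the failing import chain needs.
    required = ["oss2", "modelscope"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same interpreter that torchrun uses (here, the one in the llama_factory_torch environment); a check run under a different Python may report a different result.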

Q: `if not 0.0 <= lr: TypeError: '<=' not supported between instances of 'float' and 'str'`

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-08-04 10:30:00,673] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/04/2024 10:30:03 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:27599
[2024-08-04 10:30:05,637] torch.distributed.run: [WARNING]
[2024-08-04 10:30:05,637] torch.distributed.run: [WARNING] *****************************************
[2024-08-04 10:30:05,637] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-04 10:30:05,637] torch.distributed.run: [WARNING] *****************************************
[2024-08-04 10:30:10,197] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-04 10:30:10,239] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-04 10:30:10,390] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-04 10:30:10,435] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-04 10:30:13,600] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-04 10:30:13,600] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0804 10:30:13.601742 108226 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=365008080
08/04/2024 10:30:13 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/04/2024 10:30:13 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_auto.py:682] 2024-08-04 10:30:13,610 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
[INFO|configuration_utils.py:731] 2024-08-04 10:30:13,611 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-04 10:30:13,612 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2287] 2024-08-04 10:30:13,622 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2287] 2024-08-04 10:30:13,622 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 10:30:13,622 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 10:30:13,622 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-04 10:30:13,622 >> loading file tokenizer_config.json
[2024-08-04 10:30:13,638] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0804 10:30:13.639712 108230 ProcessGroupNCCL.cpp:686] [Rank 3] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=376601696
08/04/2024 10:30:13 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/04/2024 10:30:13 - INFO - llamafactory.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[2024-08-04 10:30:13,690] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0804 10:30:13.691934 108229 ProcessGroupNCCL.cpp:686] [Rank 2] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=382577952
08/04/2024 10:30:13 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/04/2024 10:30:13 - INFO - llamafactory.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[WARNING|logging.py:328] 2024-08-04 10:30:13,762 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
[2024-08-04 10:30:13,822] [INFO] [comm.py:637:init_distributed] cdb=None
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0804 10:30:13.824091 108227 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 180000000000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=372943824
08/04/2024 10:30:13 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/04/2024 10:30:13 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2533] 2024-08-04 10:30:14,052 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/04/2024 10:30:14 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 10:30:14 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/04/2024 10:30:14 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 10:30:14 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/04/2024 10:30:14 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
08/04/2024 10:30:14 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 10:30:14 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/04/2024 10:30:14 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/04/2024 10:30:14 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
I0804 10:30:14.382220 108230 ProcessGroupNCCL.cpp:2780] Rank 3 using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0804 10:30:14.445211 108229 ProcessGroupNCCL.cpp:2780] Rank 2 using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0804 10:30:14.611828 108227 ProcessGroupNCCL.cpp:2780] Rank 1 using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/datasets/builder.py:883: FutureWarning: 'try_from_hf_gcs' was deprecated in version 2.16.0 and will be removed in 3.0.0.
  warnings.warn(
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 4131.94 examples/s]
I0804 10:30:19.983676 108226 ProcessGroupNCCL.cpp:2780] Rank 0 using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
I0804 10:30:20.622058 108226 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
08/04/2024 10:30:20 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
08/04/2024 10:30:20 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
08/04/2024 10:30:20 - INFO - llamafactory.data.loader - Loading dataset llamafactory/alpaca_zh...
Running tokenizer on dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 284.38 examples/s]
/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/datasets/builder.py:883: FutureWarning: 'try_from_hf_gcs' was deprecated in version 2.16.0 and will be removed in 3.0.0.
  warnings.warn(
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 98739, 109425, 19000, 9080, 40053, 104654, 16325, 111689, 83747, 11883, 53610, 11571, 128009, 128006, 78191, 128007, 271, 16, 13, 86758, 56602, 53610, 123128, 116051, 56602, 53610, 85315, 233, 118959, 83601, 115, 65455, 34208, 53610, 104432, 65455, 1811, 720, 17, 13, 86758, 53610, 74396, 58291, 53610, 44559, 114, 51109, 43167, 118006, 123641, 53610, 3922, 78657, 108914, 81802, 245, 34208, 108914, 118959, 1811, 720, 18, 13, 74662, 117481, 16325, 115890, 56602, 53610, 126966, 1811, 720, 19, 13, 111909, 222, 33976, 53610, 36651, 34208, 66285, 234, 36117, 231, 73548, 9554, 118487, 53610, 106041, 91495, 82317, 13646, 23951, 59464, 127150, 1811, 720, 20, 13, 107934, 245, 105444, 94, 21082, 118504, 106649, 3922, 38129, 103167, 111155, 85315, 233, 118959, 65455, 56602, 95337, 11883, 53610, 1811, 720, 21, 13, 114524, 43167, 106786, 53610, 3922, 114593, 106143, 108309, 58291, 93994, 66776, 120522, 11883, 113173, 1811, 720, 22, 13, 20033, 115, 109895, 58291, 125405, 46034, 13646, 30356, 111110, 53610, 104432, 65455, 1811, 720, 23, 13, 66827, 237, 83747, 27699, 229, 53610, 105301, 117027, 9554, 21082, 1811, 720, 24, 13, 59330, 121, 88367, 43240, 30590, 30358, 59464, 38129, 118954, 53610, 10110, 124714, 108914, 105060, 33748, 5486, 118959, 102452, 53610, 101837, 121, 34208, 85315, 233, 118959, 9554, 53610, 75376, 720, 605, 13, 127609, 125025, 126369, 60455, 96455, 45736, 9554, 108914, 81802, 245, 33748, 34208, 108914, 105060, 33748, 1811, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

我们如何在日常生活中减少用水？<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. 使用节水装置，如节水淋浴喷头和水龙头。
2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。
3. 在社区中提高节水意识。
4. 检查水管和灌溉系统的漏水情况，并及时修复它们。
5. 洗澡时间缩短，使用低流量淋浴头节约用水。
6. 收集雨水，用于园艺或其他非饮用目的。
7. 刷牙或擦手时关掉水龙头。
8. 减少浇水草坪的时间。
9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。
10. 只购买能源效率高的洗碗机和洗衣机。<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 16, 13, 86758, 56602, 53610, 123128, 116051, 56602, 53610, 85315, 233, 118959, 83601, 115, 65455, 34208, 53610, 104432, 65455, 1811, 720, 17, 13, 86758, 53610, 74396, 58291, 53610, 44559, 114, 51109, 43167, 118006, 123641, 53610, 3922, 78657, 108914, 81802, 245, 34208, 108914, 118959, 1811, 720, 18, 13, 74662, 117481, 16325, 115890, 56602, 53610, 126966, 1811, 720, 19, 13, 111909, 222, 33976, 53610, 36651, 34208, 66285, 234, 36117, 231, 73548, 9554, 118487, 53610, 106041, 91495, 82317, 13646, 23951, 59464, 127150, 1811, 720, 20, 13, 107934, 245, 105444, 94, 21082, 118504, 106649, 3922, 38129, 103167, 111155, 85315, 233, 118959, 65455, 56602, 95337, 11883, 53610, 1811, 720, 21, 13, 114524, 43167, 106786, 53610, 3922, 114593, 106143, 108309, 58291, 93994, 66776, 120522, 11883, 113173, 1811, 720, 22, 13, 20033, 115, 109895, 58291, 125405, 46034, 13646, 30356, 111110, 53610, 104432, 65455, 1811, 720, 23, 13, 66827, 237, 83747, 27699, 229, 53610, 105301, 117027, 9554, 21082, 1811, 720, 24, 13, 59330, 121, 88367, 43240, 30590, 30358, 59464, 38129, 118954, 53610, 10110, 124714, 108914, 105060, 33748, 5486, 118959, 102452, 53610, 101837, 121, 34208, 85315, 233, 118959, 9554, 53610, 75376, 720, 605, 13, 127609, 125025, 126369, 60455, 96455, 45736, 9554, 108914, 81802, 245, 33748, 34208, 108914, 105060, 33748, 1811, 128009]
labels:
1. 使用节水装置，如节水淋浴喷头和水龙头。
2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。
3. 在社区中提高节水意识。
4. 检查水管和灌溉系统的漏水情况，并及时修复它们。
5. 洗澡时间缩短，使用低流量淋浴头节约用水。
6. 收集雨水，用于园艺或其他非饮用目的。
7. 刷牙或擦手时关掉水龙头。
8. 减少浇水草坪的时间。
9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。
10. 只购买能源效率高的洗碗机和洗衣机。<|eot_id|>
/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/datasets/builder.py:883: FutureWarning: 'try_from_hf_gcs' was deprecated in version 2.16.0 and will be removed in 3.0.0.
  warnings.warn(
/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/datasets/builder.py:883: FutureWarning: 'try_from_hf_gcs' was deprecated in version 2.16.0 and will be removed in 3.0.0.
  warnings.warn(
[INFO|configuration_utils.py:731] 2024-08-04 10:30:25,948 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-04 10:30:25,950 >> Model config LlamaConfig {
  "_name_or_path": "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3631] 2024-08-04 10:30:25,987 >> loading weights file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:3776] 2024-08-04 10:30:25,989 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:1038] 2024-08-04 10:30:26,011 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

[2024-08-04 10:33:34,680] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 723, num_elems = 70.55B
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [08:37<00:00, 17.26s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [08:38<00:00, 17.27s/it]
08/04/2024 10:42:12 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/04/2024 10:42:12 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/04/2024 10:42:12 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
08/04/2024 10:42:12 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/04/2024 10:42:12 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/04/2024 10:42:12 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/04/2024 10:42:12 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
08/04/2024 10:42:12 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [08:38<00:00, 17.27s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [08:38<00:00, 17.27s/it]
[INFO|modeling_utils.py:4463] 2024-08-04 10:42:12,932 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4471] 2024-08-04 10:42:12,932 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-04 10:42:12,939 >> loading configuration file /public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/models/Meta-Llama-3-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-04 10:42:12,940 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128001
}

08/04/2024 10:42:12 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/04/2024 10:42:12 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/04/2024 10:42:12 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
08/04/2024 10:42:12 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/04/2024 10:42:12 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/04/2024 10:42:12 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/04/2024 10:42:12 - INFO - llamafactory.model.adapter - ZeRO3 / FSDP detected, remaining trainable params in float32.
08/04/2024 10:42:12 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/04/2024 10:42:13 - INFO - llamafactory.model.loader - trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
08/04/2024 10:42:13 - INFO - llamafactory.model.loader - trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
08/04/2024 10:42:13 - INFO - llamafactory.model.loader - trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
08/04/2024 10:42:13 - INFO - llamafactory.model.loader - trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:648] 2024-08-04 10:42:13,263 >> Using auto half precision backend
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 2036, in _inner_training_loop
    self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 411, in deepspeed_init
    optimizer, lr_scheduler = deepspeed_optim_sched(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 336, in deepspeed_optim_sched
    optimizer = trainer.create_optimizer()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 70, in create_optimizer
    return super().create_optimizer()
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 1099, in create_optimizer
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/optim/adamw.py", line 28, in __init__
    if not 0.0 <= lr:
TypeError: '<=' not supported between instances of 'float' and 'str'
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 2036, in _inner_training_loop
    self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 411, in deepspeed_init
    optimizer, lr_scheduler = deepspeed_optim_sched(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 336, in deepspeed_optim_sched
    optimizer = trainer.create_optimizer()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 70, in create_optimizer
    return super().create_optimizer()
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 1099, in create_optimizer
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/optim/adamw.py", line 28, in __init__
    if not 0.0 <= lr:
TypeError: '<=' not supported between instances of 'float' and 'str'
[INFO|deepspeed.py:329] 2024-08-04 10:42:13,513 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 2036, in _inner_training_loop
    self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 411, in deepspeed_init
    optimizer, lr_scheduler = deepspeed_optim_sched(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 336, in deepspeed_optim_sched
    optimizer = trainer.create_optimizer()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 70, in create_optimizer
    return super().create_optimizer()
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 1099, in create_optimizer
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/optim/adamw.py", line 28, in __init__
    if not 0.0 <= lr:
TypeError: '<=' not supported between instances of 'float' and 'str'
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 2036, in _inner_training_loop
    self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 411, in deepspeed_init
    optimizer, lr_scheduler = deepspeed_optim_sched(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 336, in deepspeed_optim_sched
    optimizer = trainer.create_optimizer()
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 70, in create_optimizer
    return super().create_optimizer()
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/trainer.py", line 1099, in create_optimizer
    self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/optim/adamw.py", line 28, in __init__
    if not 0.0 <= lr:
TypeError: '<=' not supported between instances of 'float' and 'str'
[2024-08-04 10:42:21,484] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 108226) of binary: /opt/conda/envs/llama_factory_torch/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/llama_factory_torch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-04_10:42:21
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 108227)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-08-04_10:42:21
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 108229)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-08-04_10:42:21
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 108230)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-04_10:42:21
  host      : notebook-1813389960667746306-scnlbe5oi5-12495
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 108226)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

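Related issue: "The learning rate in YAML configs currently does not support scientific notation" #4014

Related issue: "TypeError: '<=' not supported between instances of 'float' and 'str'" #4149

Cause: the YAML loader does not read `5e-5` as a number. Under the YAML 1.1 rules that PyYAML follows, a float written in scientific notation needs a decimal point, so `5e-5` is loaded as the string "5e-5". The optimizer then fails on its `0.0 <= lr` check, which is the TypeError shown in the traceback above.

Solution: write the learning rate with an explicit decimal point in the YAML config. Change

learning_rate: 5e-5

to

learning_rate: 5.0e-5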
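The failure can be reproduced without launching a training run. The sketch below uses only the Python standard library and rests on assumptions: `YAML11_FLOAT` is a simplified pattern modelled on the YAML 1.1 float rule (not the loader's actual code), `check_lr` only mirrors the comparison from the traceback, and `coerce_learning_rate` is a hypothetical helper, not part of LLaMA-Factory or transformers.

```python
import re

# Simplified pattern modelled on the YAML 1.1 float rule (assumption: the
# config loader follows YAML 1.1). A decimal point is required, and an
# exponent, if present, must carry an explicit sign.
YAML11_FLOAT = re.compile(r"^[-+]?[0-9][0-9_]*\.[0-9_]*(?:[eE][-+][0-9]+)?$")


def looks_like_yaml11_float(text: str) -> bool:
    """Return True if `text` matches the simplified float rule above."""
    return YAML11_FLOAT.match(text) is not None


def check_lr(lr) -> bool:
    """Mirror the comparison that fails in the traceback (`0.0 <= lr`)."""
    return 0.0 <= lr


def coerce_learning_rate(value) -> float:
    """Hypothetical helper: turn a float or a numeric string into a float."""
    return float(value)


if __name__ == "__main__":
    for text in ("5e-5", "5.0e-5"):
        print(text, "-> float" if looks_like_yaml11_float(text) else "-> string")

    try:
        check_lr("5e-5")  # what the optimizer sees when the value stays a string
    except TypeError as exc:
        print("TypeError:", exc)

    # After coercion the same check passes.
    print(check_lr(coerce_learning_rate("5e-5")))
```

Editing the YAML value to `5.0e-5` is still the simpler fix, because it needs no code change; the coercion helper only illustrates why the string form breaks the comparison.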

Summary
