【国产异构加速卡】基于llama.cpp实现Llama3模型的guff格式转换、4bit量化以及推理加速

重要说明：本文从网上资料整理而来，仅记录博主学习相关知识点的过程，侵删。

序言

本文使用llama.cpp框架，对 Llama3-8B-Instruct 模型进行gguf格式转换，8bit量化，并在CPU和GPU上对8bit模型进行推理。

测试平台：超算互联网平台SCNet

GPU：异构加速卡AI 显存64GB PCIE（基于ROCm平台的GPGPU）

测试服务器的详细配置，请参考：超算互联网平台SCNet之国产异构加速卡

一、参考资料

llama.cpp 代码仓库：https://github.com/ggerganov/llama.cpp

Tutorial: How to convert HuggingFace model to GGUF format #2948

llamacpp_zh

使用llama.cpp实现LLM大模型的格式转换、量化、推理、部署

【学习笔记】：Ubuntu 22 使用模型量化工具llama.cpp部署大模型 CPU+GPU

llama3 微调教程之 llama factory 的安装部署与模型微调过程，模型量化和gguf转换。

二、llama.cpp相关介绍

1. `llama.cpp`简介

llama.cpp 是一个C++库，用于在本地或云端高效地执行大型语言模型（LLM）的推理任务。该库是一个纯C/C++实现，不依赖任何外部库，并且针对x86架构提供了AVX、AVX2和AVX512加速支持。此外，它还提供了2、3、4、5、6以及8位量化功能，以加快推理速度并减少内存占用。对于大于总VRAM容量的大规模模型，该库还支持CPU+GPU混合推理模式进行部分加速。

与传统的基于 Python 的实现相比，llama.cpp 通过直接在 C/C++ 环境中运行，减少了对解释器的依赖，从而可能提高性能并降低资源消耗。此外，llama.cpp 支持跨平台，可以在多种操作系统上编译和运行，包括但不限于 macOS、Linux、Windows，以及通过 Docker 容器化部署。

2. `llama.cpp` 优势

选择llama.cpp作为LLM推理的平台，有几个显著优势：

无依赖实现：llama.cpp不依赖Python、PyTorch或TensorFlow等框架，可以直接在C/C++环境中运行，减少了复杂性和潜在的性能瓶颈。跨平台支持：从支持苹果硅片到各种GPU和CPU，llama.cpp优化了多种硬件的性能，确保在不同系统上都能获得最佳性能。灵活的性能配置：用户可以通过设置不同的位深（1.5位至8位）来量化模型，这有助于在保持推理速度的同时减少内存使用。

3. llama.cpp目标

llama.cpp 出现之后，在 GitHub 上狂砍 63.2k star（截止到2024年8月8日），比 stable diffusion 还要夸张，堪称 “star rocket”。这背后是 llama.cpp 切中了 “AI at the edge” 这一方向。“AI at the edge“ 中的 edge 可以理解为与 cloud 相对的概念。不管是个人的 laptop，gaming PC，手机，甚至树莓派，都可以称为 edge。

4. GGUT格式

探索GGUF：利用llama.cpp高效运行大型语言模型

4.1 引言

在人工智能领域，大型语言模型的发展日新月异，它们在自然语言处理、机器翻译、智能助手等多个领域展现出了前所未有的能力。然而，随着模型规模的不断扩大，这些庞大的神经网络模型在存储、传输和加载上面临着一系列挑战。传统的文件格式在处理这些庞大的数据集时显得力不从心，不仅效率低下，而且兼容性和扩展性也难以满足日益增长的需求。

在这样的背景下，开发者Georgi Gerganov提出GGUF格式，该模型格式可以对模型进行高效的压缩，减少模型的大小与内存占用，从而提升模型的推理速度和效率。

4.2 GGUT简介

GGUF（Georgi Gerganov’s Universal Format），即 Georgi Gerganov 通用格式，是 llama.cpp 项目中提出的一种创新模型文件格式。GGUF格式是专为大型语言模型设计的二进制文件格式，旨在解决当前大模型在实际应用中遇到的存储效率、加载速度、兼容性和扩展性等问题。GGUF通过优化数据结构和编码方式，显著提升了模型文件的存储效率，同时保证了快速的加载性能。此外，它的设计考虑了跨平台和跨框架的兼容性，使得模型能够无缝地在不同的硬件和软件环境中运行，极大地促进了大型模型的广泛应用和进一步发展。当前，GGUF格式广泛应用于各类大模型的部署和分享，特别是在Hugging Face等开源社区中广受欢迎。

关于 GGUF 的更多信息可以参考：2398#issuecomment-1682837610。

4.3 GGUT优势

GGUF格式模型在实际使用中体现出的主要特点和优势包括：

高效存储：GGUF格式优化了数据的存储方式，减少了存储空间的占用，这对于大型模型尤为重要。快速加载：GGUF格式支持快速加载模型数据，这对于需要即时响应的应用场景非常有用，比如在线聊天机器人或实时翻译系统。高效推理：GGUF 格式对模型数据进行了优化，以实现更快的加载时间和推理速度，这对于需要快速响应的应用场景至关重要。内存优化：通过精心设计的数据结构和存储方案，GGUF 减少了模型在运行时的内存占用，使得在资源受限的设备上部署大型语言模型成为可能。复杂令牌化支持：GGUF 支持复杂的令牌化过程，包括对特殊令牌的识别和处理，这使得模型能够更准确地理解和生成语言文本。跨平台兼容性：作为一种统一的格式，GGUF 格式的模型文件可以在多种硬件和操作系统上使用，确保了模型的广泛适用性。灵活性和扩展性：GGUF 格式设计考虑了未来的扩展，可以适应不同语言模型的需求，包括自定义词汇和特殊操作。量化支持：GGUF 支持多种量化技术，允许模型在不同精度级别上运行，从而在性能和模型大小之间取得平衡。

通过这些创新，GGUF 格式成为了 llama.cpp 高效运行大型语言模型的关键因素，为开发者提供了一个强大的工具，以在各种环境中部署和使用先进的自然语言处理能力。

5. GGML

ggml.ai 官网：http://ggml.ai/

ggml 代码仓库：https://github.com/ggerganov/ggml

llama.cpp 代码仓库：https://github.com/ggerganov/llama.cpp

whisper.cpp 代码仓库：https://github.com/ggerganov/whisper.cpp

解开封印！加倍 LLM 推理吞吐: ggml.ai 与 llama.cpp

5.1 ggml简介

5.2 ggml目标

6. llama-cpp-python

llama-cpp-python 代码仓库：https://github.com/abetlen/llama-cpp-python

llama-cpp-python 文档：https://llama-cpp-python.readthedocs.io/en/latest/

Installing llama-cpp-python with GPU Support

llama-cpp-python 是 llama-cpp的python高级API。

三、快速体验llama.cpp

经过测试，tag=b3045 亲测有效。

llama.cpp/tree/b3045 代码仓库：https://github.com/ggerganov/llama.cpp/tree/b3045

1. 准备环境

测试环境，仅供参考。

`requirements.txt`

accelerate==0.32.1
addict==2.4.0
aiohttp==3.9.5
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
apex @ https://cancon.hpccube.com:65024/directlink/4/apex/DAS1.0/apex-1.1.0+das1.0+0dd7f68.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=fdeb7c8a0b354a6a2faa61ae2055b2c2e7deb07bfa4aa7811068c5e02455ee1e
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.15.0
beautifulsoup4==4.12.3
bitsandbytes @ https://cancon.hpccube.com:65024/directlink/4/bitsandbytes/DAS1.0/bitsandbytes-0.37.0+das1.0+gitd3d888f.abi0.dtk2404.torch2.1-py3-none-any.whl#sha256=c46eb3f1555f2153424c3c0297e6645c0881cb76965cf5f3d11f77b52d80c19c
bleach==6.1.0
boltons @ file:///croot/boltons_1677628692245/work
brotlipy==0.7.0
certifi @ file:///croot/certifi_1707229174982/work/certifi
cffi @ file:///tmp/abs_98z5h56wf8/croots/recipe/cffi_1659598650955/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.1.7
coloredlogs==15.0.1
comm==0.2.2
conda-content-trust @ file:///tmp/abs_5952f1c8-355c-4855-ad2e-538535021ba5h26t22e5/croots/recipe/conda-content-trust_1658126371814/work
conda-package-handling @ file:///croot/conda-package-handling_1666940373510/work
contourpy==1.2.1
cryptography @ file:///croot/cryptography_1665612644927/work
cycler==0.12.1
datasets==2.19.2
debugpy==1.8.1
decorator==5.1.1
deepspeed @ https://cancon.hpccube.com:65024/directlink/4/deepspeed/DAS1.0/deepspeed-0.12.3+das1.0+gita724046.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=726d64f73ab2ed7bcd716dcb2af53bb3c790ab4a24180b1b9319e7a7ab2cc569
defusedxml==0.7.1
diffusers==0.29.2
dill==0.3.8
dnspython==2.6.1
einops==0.8.0
email_validator==2.1.1
exceptiongroup==1.2.1
executing==2.0.1
fastapi==0.111.0
fastapi-cli==0.0.4
fastjsonschema==2.19.1
filelock==3.14.0
fire==0.6.0
flash-attn @ https://cancon.hpccube.com:65024/directlink/4/flash_attn/DAS1.0/flash_attn-2.0.4+das1.0+82379d7.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=2facc1831d95b55bf1bca88c7f23163751f4c749e4f7fc9256d8311ddbb5d399
flatbuffers==24.3.25
fonttools==4.52.4
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.3.1
h11==0.14.0
hf_transfer==0.1.8
hjson==3.1.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface==0.0.1
huggingface-hub==0.24.5
humanfriendly==10.0
hypothesis==5.35.1
idna @ file:///croot/idna_1666125576474/work
importlib_metadata==7.1.0
invisible-watermark==0.2.0
ipykernel==6.29.4
ipython==8.24.0
ipywidgets==8.1.3
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
json5==0.9.25
jsonpatch @ file:///croot/jsonpatch_1714483231291/work
jsonpointer==2.1
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_ext_dataset==0.1.0
jupyter_ext_logo==0.1.0
jupyter_server==2.14.0
jupyter_server_terminals==0.5.3
jupyterlab==4.2.1
jupyterlab-language-pack-zh-CN==4.0.post6
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.11
kiwisolver==1.4.5
lightop @ https://cancon.hpccube.com:65024/directlink/4/lightop/DAS1.0/lightop-0.3+das1.0+837dbb7.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=7f4eb1190a570c05a63a4aade326c87367c4e5ccf6ff82ad5e92220790817e5c
lmdeploy @ https://cancon.hpccube.com:65024/directlink/4/lmdeploy/DAS1.0/lmdeploy-0.1.0_das1.0+git782048c.abi0.dtk2404.torch2.1.-cp310-cp310-manylinux2014_x86_64.whl#sha256=499940e022de16b3f1211a52c2daa3a603b109a015487499c9e11a53c6d5ad2c
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
mmcv @ https://cancon.hpccube.com:65024/directlink/4/mmcv/DAS1.0/mmcv-2.0.1_das1.0+gitc0ccf15.abi0.dtk2404.torch2.1.-cp310-cp310-manylinux2014_x86_64.whl#sha256=4fc5ff39d232e5ca1efebf7cfdfcf9bc0675308cf40e5f17237c4f2eec66f210
mmengine==0.10.4
mmengine-lite==0.10.4
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1.1
notebook_shim==0.2.4
numpy==1.24.3
onnxruntime @ https://cancon.hpccube.com:65024/directlink/4/onnxruntime/DAS1.0/onnxruntime-1.15.0+das1.0+gita9ca438.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=509446b41adb89e7507700482cb99e2c399ab3164bc9ea6d9a50e11f84a2406e
opencv-python==4.9.0.80
orjson==3.10.3
overrides==7.7.0
packaging @ file:///croot/packaging_1710807400464/work
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pillow==10.3.0
platformdirs==4.2.2
pluggy @ file:///tmp/build/80754af9/pluggy_1648024709248/work
prometheus_client==0.20.0
prompt_toolkit==3.0.45
protobuf==5.27.0
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==16.1.0
pyarrow-hotfix==0.6
pycosat @ file:///croot/pycosat_1666805502580/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==2.7.2
pydantic_core==2.18.3
Pygments==2.18.0
pynvml==11.5.0
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
pyparsing==3.1.2
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
pytz==2024.1
PyWavelets==1.6.0
PyYAML==6.0.1
pyzmq==26.0.3
ray==2.9.3
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.1
rpds-py==0.18.1
ruamel.yaml @ file:///croot/ruamel.yaml_1666304550667/work
ruamel.yaml.clib @ file:///croot/ruamel.yaml.clib_1666302247304/work
safetensors==0.4.3
Send2Trash==1.8.3
sentencepiece==0.2.0
shellingham==1.5.4
six @ file:///tmp/build/80754af9/six_1644875935023/work
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
stack-data==0.6.3
starlette==0.37.2
sympy==1.12.1
termcolor==2.4.0
terminado==0.18.1
tiktoken==0.7.0
tinycss2==1.3.0
tokenizers==0.15.0
tomli==2.0.1
toolz @ file:///croot/toolz_1667464077321/work
torch @ https://cancon.hpccube.com:65024/directlink/4/pytorch/DAS1.0/torch-2.1.0+das1.0+git00661e0.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=0b5f4be74ffdd6fe7540a844bf4f02e432b7d267b5e9fdd7f9448192d93bf3b6
torchaudio @ https://cancon.hpccube.com:65024/directlink/4/torchaudio/DAS1.0/torchaudio-2.1.2+das1.0+253903e.abi0.dtk2404.torch2.1.0-cp310-cp310-manylinux2014_x86_64.whl#sha256=2a7b3bbe8b558f48784f302900fd1dff3ff9d10a3c139e00f2b136a76d6d7f1c
torchvision @ https://cancon.hpccube.com:65024/directlink/4/vision/DAS1.0/torchvision-0.16.0+das1.0+gitc9e7141.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=4d5e5071e89892cccb24c3ee0216cd79b3c22bc5cf1eb0eb49c2792d9f49fb62
tornado==6.4
tqdm @ file:///opt/conda/conda-bld/tqdm_1664392687731/work
traitlets==5.14.3
transformers==4.38.0
triton @ https://cancon.hpccube.com:65024/directlink/4/triton/DAS1.0/triton-2.1.0+das1.0+git3841f975.abi0.dtk2404-cp310-cp310-manylinux2014_x86_64.whl#sha256=0dda810eb171af0b3f5cf90a1a4b2f41c9ef0ef08453762a798c86dd01fe976f
typer==0.12.3
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.0
tzdata==2024.1
ujson==5.10.0
uri-template==1.3.0
urllib3 @ file:///croot/urllib3_1670526988650/work
uvicorn==0.30.0
uvloop==0.19.0
vllm @ https://cancon.hpccube.com:65024/directlink/4/vllm/DAS1.0/vllm-0.3.3+das1.0+git3380931.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=23bcdb8a6eb0382770dc7460ea3f7c85cd0c885913b28759eb8a9894731cdb87
watchfiles==0.22.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.8.0
websockets==12.0
widgetsnbextension==4.0.11
xformers @ https://cancon.hpccube.com:65024/directlink/4/xformers/DAS1.0/xformers-0.0.25+das1.0+gitd11e899.abi0.dtk2404.torch2.1-cp310-cp310-manylinux2014_x86_64.whl#sha256=b086d1bd50bd19c82ca44c424fe193dfcdd48bdd6695d3e6a58f53764c64f428
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zipp==3.19.0

`envs.yaml`

name: llama.cpp
channels:
  - https://repo.anaconda.com/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - boltons=23.0.0=py310h06a4308_0
  - brotlipy=0.7.0=py310h7f8727e_1002
  - bzip2=1.0.8=h7b6447c_0
  - ca-certificates=2024.3.11=h06a4308_0
  - certifi=2024.2.2=py310h06a4308_0
  - cffi=1.15.1=py310h74dc2b5_0
  - charset-normalizer=2.0.4=pyhd3eb1b0_0
  - conda-content-trust=0.1.3=py310h06a4308_0
  - conda-package-handling=1.9.0=py310h5eee18b_1
  - cryptography=38.0.1=py310h9ce1e76_0
  - idna=3.4=py310h06a4308_0
  - jsonpatch=1.33=py310h06a4308_1
  - ld_impl_linux-64=2.38=h1181459_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=11.2.0=h1234567_1
  - libgomp=11.2.0=h1234567_1
  - libstdcxx-ng=11.2.0=h1234567_1
  - libuuid=1.41.5=h5eee18b_0
  - ncurses=6.3=h5eee18b_3
  - openssl=1.1.1w=h7f8727e_0
  - pluggy=1.0.0=py310h06a4308_1
  - pycosat=0.6.4=py310h5eee18b_0
  - pycparser=2.21=pyhd3eb1b0_0
  - pyopenssl=22.0.0=pyhd3eb1b0_0
  - pysocks=1.7.1=py310h06a4308_0
  - python=3.10.8=haa1d7c7_0
  - readline=8.2=h5eee18b_0
  - ruamel.yaml=0.17.21=py310h5eee18b_0
  - ruamel.yaml.clib=0.2.6=py310h5eee18b_1
  - setuptools=65.5.0=py310h06a4308_0
  - six=1.16.0=pyhd3eb1b0_1
  - sqlite=3.40.0=h5082296_0
  - tk=8.6.12=h1ccaba5_0
  - toolz=0.12.0=py310h06a4308_0
  - tqdm=4.64.1=py310h06a4308_0
  - urllib3=1.26.13=py310h06a4308_0
  - wheel=0.37.1=pyhd3eb1b0_0
  - xz=5.2.8=h5eee18b_0
  - zlib=1.2.13=h5eee18b_0
  - pip:
      - accelerate==0.32.1
      - addict==2.4.0
      - aiohttp==3.9.5
      - aiosignal==1.3.1
      - annotated-types==0.7.0
      - anyio==4.4.0
      - apex==1.1.0+0dd7f68.abi0.dtk2404.torch2.1
      - argon2-cffi==23.1.0
      - argon2-cffi-bindings==21.2.0
      - arrow==1.3.0
      - asttokens==2.4.1
      - async-lru==2.0.4
      - async-timeout==4.0.3
      - attrs==23.2.0
      - babel==2.15.0
      - beautifulsoup4==4.12.3
      - bitsandbytes==0.37.0+gitd3d888f.abi0.dtk2404.torch2.1
      - bleach==6.1.0
      - click==8.1.7
      - coloredlogs==15.0.1
      - comm==0.2.2
      - contourpy==1.2.1
      - cycler==0.12.1
      - datasets==2.19.2
      - debugpy==1.8.1
      - decorator==5.1.1
      - deepspeed==0.12.3+gita724046.abi0.dtk2404.torch2.1.0
      - defusedxml==0.7.1
      - diffusers==0.29.2
      - dill==0.3.8
      - dnspython==2.6.1
      - einops==0.8.0
      - email-validator==2.1.1
      - exceptiongroup==1.2.1
      - executing==2.0.1
      - fastapi==0.111.0
      - fastapi-cli==0.0.4
      - fastjsonschema==2.19.1
      - filelock==3.14.0
      - fire==0.6.0
      - flash-attn==2.0.4+82379d7.abi0.dtk2404.torch2.1
      - flatbuffers==24.3.25
      - fonttools==4.52.4
      - fqdn==1.5.1
      - frozenlist==1.4.1
      - fsspec==2024.3.1
      - h11==0.14.0
      - hf-transfer==0.1.8
      - hjson==3.1.0
      - httpcore==1.0.5
      - httptools==0.6.1
      - httpx==0.27.0
      - huggingface==0.0.1
      - huggingface-hub==0.24.5
      - humanfriendly==10.0
      - hypothesis==5.35.1
      - importlib-metadata==7.1.0
      - invisible-watermark==0.2.0
      - ipykernel==6.29.4
      - ipython==8.24.0
      - ipywidgets==8.1.3
      - isoduration==20.11.0
      - jedi==0.19.1
      - jinja2==3.1.4
      - json5==0.9.25
      - jsonpointer==2.4
      - jsonschema==4.22.0
      - jsonschema-specifications==2023.12.1
      - jupyter-client==8.6.2
      - jupyter-core==5.7.2
      - jupyter-events==0.10.0
      - jupyter-ext-dataset==0.1.0
      - jupyter-ext-logo==0.1.0
      - jupyter-lsp==2.2.5
      - jupyter-server==2.14.0
      - jupyter-server-terminals==0.5.3
      - jupyterlab==4.2.1
      - jupyterlab-language-pack-zh-cn==4.0.post6
      - jupyterlab-pygments==0.3.0
      - jupyterlab-server==2.27.2
      - jupyterlab-widgets==3.0.11
      - kiwisolver==1.4.5
      - lightop==0.3+837dbb7.abi0.dtk2404.torch2.1
      - lmdeploy==0.1.0-git782048c.abi0.dtk2404.torch2.1.
      - markdown-it-py==3.0.0
      - markupsafe==2.1.5
      - matplotlib==3.9.0
      - matplotlib-inline==0.1.7
      - mdurl==0.1.2
      - mistune==3.0.2
      - mmcv==2.0.1-gitc0ccf15.abi0.dtk2404.torch2.1.
      - mmengine==0.10.4
      - mmengine-lite==0.10.4
      - mpmath==1.3.0
      - msgpack==1.0.8
      - multidict==6.0.5
      - multiprocess==0.70.16
      - nbclient==0.10.0
      - nbconvert==7.16.4
      - nbformat==5.10.4
      - nest-asyncio==1.6.0
      - networkx==3.3
      - ninja==1.11.1.1
      - notebook-shim==0.2.4
      - numpy==1.24.3
      - onnxruntime==1.15.0+gita9ca438.abi0.dtk2404
      - opencv-python==4.9.0.80
      - orjson==3.10.3
      - overrides==7.7.0
      - packaging==24.0
      - pandas==2.2.2
      - pandocfilters==1.5.1
      - parso==0.8.4
      - pexpect==4.9.0
      - pillow==10.3.0
      - pip==24.0
      - platformdirs==4.2.2
      - prometheus-client==0.20.0
      - prompt-toolkit==3.0.45
      - protobuf==5.27.0
      - psutil==5.9.8
      - ptyprocess==0.7.0
      - pure-eval==0.2.2
      - py-cpuinfo==9.0.0
      - pyarrow==16.1.0
      - pyarrow-hotfix==0.6
      - pydantic==2.7.2
      - pydantic-core==2.18.3
      - pygments==2.18.0
      - pynvml==11.5.0
      - pyparsing==3.1.2
      - python-dateutil==2.9.0.post0
      - python-dotenv==1.0.1
      - python-json-logger==2.0.7
      - python-multipart==0.0.9
      - pytz==2024.1
      - pywavelets==1.6.0
      - pyyaml==6.0.1
      - pyzmq==26.0.3
      - ray==2.9.3
      - referencing==0.35.1
      - regex==2024.5.15
      - requests==2.32.3
      - rfc3339-validator==0.1.4
      - rfc3986-validator==0.1.1
      - rich==13.7.1
      - rpds-py==0.18.1
      - safetensors==0.4.3
      - send2trash==1.8.3
      - sentencepiece==0.2.0
      - shellingham==1.5.4
      - sniffio==1.3.1
      - sortedcontainers==2.4.0
      - soupsieve==2.5
      - stack-data==0.6.3
      - starlette==0.37.2
      - sympy==1.12.1
      - termcolor==2.4.0
      - terminado==0.18.1
      - tiktoken==0.7.0
      - tinycss2==1.3.0
      - tokenizers==0.15.0
      - tomli==2.0.1
      - torch==2.1.0+git00661e0.abi0.dtk2404
      - torchaudio==2.1.2+253903e.abi0.dtk2404.torch2.1.0
      - torchvision==0.16.0+gitc9e7141.abi0.dtk2404.torch2.1
      - tornado==6.4
      - traitlets==5.14.3
      - transformers==4.38.0
      - triton==2.1.0+git3841f975.abi0.dtk2404
      - typer==0.12.3
      - types-python-dateutil==2.9.0.20240316
      - typing-extensions==4.12.0
      - tzdata==2024.1
      - ujson==5.10.0
      - uri-template==1.3.0
      - uvicorn==0.30.0
      - uvloop==0.19.0
      - vllm==0.3.3+git3380931.abi0.dtk2404.torch2.1
      - watchfiles==0.22.0
      - wcwidth==0.2.13
      - webcolors==1.13
      - webencodings==0.5.1
      - websocket-client==1.8.0
      - websockets==12.0
      - widgetsnbextension==4.0.11
      - xformers==0.0.25+gitd11e899.abi0.dtk2404.torch2.1
      - xxhash==3.4.1
      - yapf==0.40.2
      - yarl==1.9.4
      - zipp==3.19.0
prefix: /opt/conda/envs/llama.cpp

2. 下载llama.cpp

# 下载llama.cpp
# 如果下载失败，可以手动下载，再上传到服务器
git clone https://github.com/ggerganov/llama.cpp.git 

# 检出b3045标签，并创建b3045分支
git checkout -b b3045 b3045

cd llama.cpp

编译前的文件目录：

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# tree -L 1
.
|-- AUTHORS
|-- CMakeLists.txt
|-- CMakePresets.json
|-- LICENSE
|-- Makefile
|-- Package.swift
|-- README-sycl.md
|-- README.md
|-- SECURITY.md
|-- ci
|-- cmake
|-- codecov.yml
|-- common
|-- convert-hf-to-gguf-update.py
|-- convert-hf-to-gguf.py
|-- convert-llama-ggml-to-gguf.py
|-- convert.py
|-- docs
|-- examples
|-- flake.lock
|-- flake.nix
|-- ggml-alloc.c
|-- ggml-alloc.h
|-- ggml-backend-impl.h
|-- ggml-backend.c
|-- ggml-backend.h
|-- ggml-common.h
|-- ggml-cuda
|-- ggml-cuda.cu
|-- ggml-cuda.h
|-- ggml-impl.h
|-- ggml-kompute.cpp
|-- ggml-kompute.h
|-- ggml-metal.h
|-- ggml-metal.m
|-- ggml-metal.metal
|-- ggml-opencl.cpp
|-- ggml-opencl.h
|-- ggml-quants.c
|-- ggml-quants.h
|-- ggml-rpc.cpp
|-- ggml-rpc.h
|-- ggml-sycl.cpp
|-- ggml-sycl.h
|-- ggml-vulkan-shaders.hpp
|-- ggml-vulkan.cpp
|-- ggml-vulkan.h
|-- ggml.c
|-- ggml.h
|-- ggml_vk_generate_shaders.py
|-- gguf-py
|-- grammars
|-- kompute
|-- kompute-shaders
|-- llama.cpp
|-- llama.h
|-- media
|-- models
|-- mypy.ini
|-- pocs
|-- prompts
|-- pyrightconfig.json
|-- requirements
|-- requirements.txt
|-- scripts
|-- sgemm.cpp
|-- sgemm.h
|-- spm-headers
|-- tests
|-- unicode-data.cpp
|-- unicode-data.h
|-- unicode.cpp
`-- unicode.h

3. 编译llama.cpp

Build llama.cpp locally

3.1 编译CPU版本

# 非首次编译
make clean

make -j32

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# make -j32
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE
I NVCCFLAGS: -std=c++11 -O3
I LDFLAGS:
I CC:        cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
I CXX:       c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml.c -o ggml.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c llama.cpp -o llama.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/common.cpp -o common.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/sampling.cpp -o sampling.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/grammar-parser.cpp -o grammar-parser.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/json-schema-to-grammar.cpp -o json-schema-to-grammar.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/console.cpp -o console.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c sgemm.cpp -o sgemm.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c unicode.cpp -o unicode.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c unicode-data.cpp -o unicode-data.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/train.cpp -o train.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/ngram-cache.cpp -o ngram-cache.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion  -c tests/test-c.c -o tests/test-c.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c common/build-info.cpp -o build-info.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c pocs/vdot/vdot.cpp -o pocs/vdot/vdot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c pocs/vdot/q8dot.cpp -o pocs/vdot/q8dot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/gguf/gguf.cpp -o examples/gguf/gguf.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/benchmark/benchmark-matmult.cpp -o examples/benchmark/benchmark-matmult.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/export-lora/export-lora.cpp -o examples/export-lora/export-lora.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf/gguf.o -o gguf
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/q8dot.o -o q8dot
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/vdot.o -o vdot
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  build-info.o ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/benchmark/benchmark-matmult.o -o benchmark-matmult
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/export-lora/export-lora.o -o export-lora
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/main/main.cpp -o examples/main/main.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/quantize-stats/quantize-stats.cpp -o examples/quantize-stats/quantize-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/perplexity/perplexity.cpp -o examples/perplexity/perplexity.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/imatrix/imatrix.cpp -o examples/imatrix/imatrix.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/embedding/embedding.cpp -o examples/embedding/embedding.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/train-text-from-scratch/train-text-from-scratch.cpp -o examples/train-text-from-scratch/train-text-from-scratch.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp -o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/simple/simple.cpp -o examples/simple/simple.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/batched/batched.cpp -o examples/batched/batched.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/batched-bench/batched-bench.cpp -o examples/batched-bench/batched-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/save-load-state/save-load-state.cpp -o examples/save-load-state/save-load-state.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/server/server.cpp -o examples/server/server.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/gguf-split/gguf-split.cpp -o examples/gguf-split/gguf-split.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/eval-callback/eval-callback.cpp -o examples/eval-callback/eval-callback.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/llama-bench/llama-bench.cpp -o examples/llama-bench/llama-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/llava/llava-cli.cpp -o examples/llava/llava-cli.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/baby-llama/baby-llama.cpp -o examples/baby-llama/baby-llama.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/beam-search/beam-search.cpp -o examples/beam-search/beam-search.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/retrieval/retrieval.cpp -o examples/retrieval/retrieval.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/speculative/speculative.cpp -o examples/speculative/speculative.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/infill/infill.cpp -o examples/infill/infill.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/tokenize/tokenize.cpp -o examples/tokenize/tokenize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/parallel/parallel.cpp -o examples/parallel/parallel.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/finetune/finetune.cpp -o examples/finetune/finetune.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookahead/lookahead.cpp -o examples/lookahead/lookahead.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookup/lookup.cpp -o examples/lookup/lookup.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/passkey/passkey.cpp -o examples/passkey/passkey.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/gritlm/gritlm.cpp -o examples/gritlm/gritlm.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/baby-llama/baby-llama.o -o baby-llama
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/tokenize/tokenize.o -o tokenize
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/eval-callback/eval-callback.o -o eval-callback 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/save-load-state/save-load-state.o -o save-load-state
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/beam-search/beam-search.o -o beam-search
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf-split/gguf-split.o -o gguf-split
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/simple/simple.o -o simple
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/embedding/embedding.o -o embedding
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  build-info.o ggml.o llama.o common.o sampling.o grammar-parser.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched-bench/batched-bench.o -o batched-bench 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched/batched.o -o batched
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookahead/lookahead.o -o lookahead
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/llava/clip.cpp  -o examples/llava/clip.o -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/parallel/parallel.o -o parallel
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create

====  Run ./main -h for help.  ====

c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  build-info.o ggml.o llama.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -c examples/llava/llava.cpp -o examples/llava/llava.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server

编译后的文件目录：

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# tree -L 1
.
|-- AUTHORS
|-- CMakeLists.txt
|-- CMakePresets.json
|-- LICENSE
|-- Makefile
|-- Package.swift
|-- README-sycl.md
|-- README.md
|-- SECURITY.md
|-- baby-llama
|-- batched
|-- batched-bench
|-- beam-search
|-- benchmark-matmult
|-- build-info.o
|-- ci
|-- cmake
|-- codecov.yml
|-- common
|-- common.o
|-- console.o
|-- convert-hf-to-gguf-update.py
|-- convert-hf-to-gguf.py
|-- convert-llama-ggml-to-gguf.py
|-- convert-llama2c-to-ggml
|-- convert.py
|-- docs
|-- embedding
|-- eval-callback
|-- examples
|-- export-lora
|-- finetune
|-- flake.lock
|-- flake.nix
|-- ggml-alloc.c
|-- ggml-alloc.h
|-- ggml-alloc.o
|-- ggml-backend-impl.h
|-- ggml-backend.c
|-- ggml-backend.h
|-- ggml-backend.o
|-- ggml-common.h
|-- ggml-cuda
|-- ggml-cuda.cu
|-- ggml-cuda.h
|-- ggml-impl.h
|-- ggml-kompute.cpp
|-- ggml-kompute.h
|-- ggml-metal.h
|-- ggml-metal.m
|-- ggml-metal.metal
|-- ggml-opencl.cpp
|-- ggml-opencl.h
|-- ggml-quants.c
|-- ggml-quants.h
|-- ggml-quants.o
|-- ggml-rpc.cpp
|-- ggml-rpc.h
|-- ggml-sycl.cpp
|-- ggml-sycl.h
|-- ggml-vulkan-shaders.hpp
|-- ggml-vulkan.cpp
|-- ggml-vulkan.h
|-- ggml.c
|-- ggml.h
|-- ggml.o
|-- ggml_vk_generate_shaders.py
|-- gguf
|-- gguf-py
|-- gguf-split
|-- grammar-parser.o
|-- grammars
|-- gritlm
|-- imatrix
|-- infill
|-- json-schema-to-grammar.o
|-- kompute
|-- kompute-shaders
|-- libllava.a
|-- llama-bench
|-- llama.cpp
|-- llama.h
|-- llama.o
|-- llava-cli
|-- lookahead
|-- lookup
|-- lookup-create
|-- lookup-merge
|-- lookup-stats
|-- main
|-- media
|-- models
|-- mypy.ini
|-- ngram-cache.o
|-- parallel
|-- passkey
|-- perplexity
|-- pocs
|-- prompts
|-- pyrightconfig.json
|-- q8dot
|-- quantize
|-- quantize-stats
|-- requirements
|-- requirements.txt
|-- retrieval
|-- sampling.o
|-- save-load-state
|-- scripts
|-- server
|-- sgemm.cpp
|-- sgemm.h
|-- sgemm.o
|-- simple
|-- speculative
|-- spm-headers
|-- tests
|-- tokenize
|-- train-text-from-scratch
|-- train.o
|-- unicode-data.cpp
|-- unicode-data.h
|-- unicode-data.o
|-- unicode.cpp
|-- unicode.h
|-- unicode.o
`-- vdot

解释说明

main，用于推理模型。 quantize，用于量化模型。 server，用于提供模型API服务。

3.2 编译GPU版本（hipBLAS）

speedup ROCm AMD Unified Memory Architecture #7399

Install and run llama.cpp with ROCm 5.7 on Ubuntu 22.04

HIP_VISIBLE_DEVICES

User Guide for AMDGPU Backend

用 llama.cpp 跑 llama 2，用 AMD Radeon RX 6900 做 GPU 加速

note：
国产异构加速卡是基于ROCm平台的GPGPU，编译步骤可参考 hipBLAS。

# 查看GPU架构
rocminfo | grep gfx
或者
rocminfo | grep gfx | head -1 | awk '{print $2}'

# 编译
make -j32 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx928

(ollama) root@notebook-1823641624653922306-scnlbe5oi5-42808:~# rocminfo | grep gfx
  Name:                    gfx928
      Name:                    amdgcn-amd-amdhsa--gfx928:sramecc+:xnack-

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# make -j32 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx928
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA
I NVCCFLAGS: -std=c++11 -O3
I LDFLAGS:   -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
I CC:        cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
I CXX:       c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
...
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/main/main.cpp -o examples/main/main.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/quantize-stats/quantize-stats.cpp -o examples/quantize-stats/quantize-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/perplexity/perplexity.cpp -o examples/perplexity/perplexity.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/imatrix/imatrix.cpp -o examples/imatrix/imatrix.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/embedding/embedding.cpp -o examples/embedding/embedding.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c pocs/vdot/vdot.cpp -o pocs/vdot/vdot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c pocs/vdot/q8dot.cpp -o pocs/vdot/q8dot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/train-text-from-scratch/train-text-from-scratch.cpp -o examples/train-text-from-scratch/train-text-from-scratch.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp -o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/simple/simple.cpp -o examples/simple/simple.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/batched/batched.cpp -o examples/batched/batched.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/batched-bench/batched-bench.cpp -o examples/batched-bench/batched-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/save-load-state/save-load-state.cpp -o examples/save-load-state/save-load-state.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/server/server.cpp -o examples/server/server.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/gguf/gguf.cpp -o examples/gguf/gguf.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/gguf-split/gguf-split.cpp -o examples/gguf-split/gguf-split.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/eval-callback/eval-callback.cpp -o examples/eval-callback/eval-callback.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/llama-bench/llama-bench.cpp -o examples/llama-bench/llama-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/llava/llava-cli.cpp -o examples/llava/llava-cli.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/baby-llama/baby-llama.cpp -o examples/baby-llama/baby-llama.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/beam-search/beam-search.cpp -o examples/beam-search/beam-search.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/retrieval/retrieval.cpp -o examples/retrieval/retrieval.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/speculative/speculative.cpp -o examples/speculative/speculative.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/infill/infill.cpp -o examples/infill/infill.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/tokenize/tokenize.cpp -o examples/tokenize/tokenize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/benchmark/benchmark-matmult.cpp -o examples/benchmark/benchmark-matmult.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/parallel/parallel.cpp -o examples/parallel/parallel.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/finetune/finetune.cpp -o examples/finetune/finetune.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/export-lora/export-lora.cpp -o examples/export-lora/export-lora.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/lookahead/lookahead.cpp -o examples/lookahead/lookahead.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf/gguf.o -o gguf -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/q8dot.o -o q8dot -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/vdot.o -o vdot -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/baby-llama/baby-llama.o -o baby-llama -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  build-info.o ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/benchmark/benchmark-matmult.o -o benchmark-matmult -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/tokenize/tokenize.o -o tokenize -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/eval-callback/eval-callback.o -o eval-callback -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/beam-search/beam-search.o -o beam-search -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/save-load-state/save-load-state.o -o save-load-state -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/lookup/lookup.cpp -o examples/lookup/lookup.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/passkey/passkey.cpp -o examples/passkey/passkey.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/gritlm/gritlm.cpp -o examples/gritlm/gritlm.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf-split/gguf-split.o -o gguf-split -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  build-info.o ggml.o llama.o common.o sampling.o grammar-parser.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched-bench/batched-bench.o -o batched-bench -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/simple/simple.o -o simple -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched/batched.o -o batched -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/embedding/embedding.o -o embedding -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/export-lora/export-lora.o -o export-lora -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookahead/lookahead.o -o lookahead -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/parallel/parallel.o -o parallel -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/llava/clip.cpp  -o examples/llava/clip.o -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas

====  Run ./main -h for help.  ====

c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  build-info.o ggml.o llama.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  -c examples/llava/llava.cpp -o examples/llava/llava.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE -DGGML_USE_HIPBLAS -DGGML_USE_CUDA -DGGML_HIP_UMA  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-cuda.o ggml-cuda/fattn-tile-f16.o ggml-cuda/diagmask.o ggml-cuda/getrows.o ggml-cuda/mmvq.o ggml-cuda/quantize.o ggml-cuda/fattn-tile-f32.o ggml-cuda/upscale.o ggml-cuda/acc.o ggml-cuda/fattn.o ggml-cuda/concat.o ggml-cuda/fattn-vec-f16.o ggml-cuda/fattn-vec-f32.o ggml-cuda/softmax.o ggml-cuda/argsort.o ggml-cuda/convert.o ggml-cuda/pad.o ggml-cuda/rope.o ggml-cuda/dmmv.o ggml-cuda/cpy.o ggml-cuda/unary.o ggml-cuda/scale.o ggml-cuda/mmq.o ggml-cuda/pool2d.o ggml-cuda/tsembd.o ggml-cuda/sumrows.o ggml-cuda/arange.o ggml-cuda/binbcast.o ggml-cuda/clamp.o ggml-cuda/im2col.o ggml-cuda/norm.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server -L/opt/dtk/lib -Wl,-rpath=/opt/dtk/lib -lhipblas -lamdhip64 -lrocblas

4. 准备模型

在 huggingface 上找到合适格式的模型，下载至 llama.cpp 的 models 目录下。或本地已下载的模型上传至models目录。

4.1 下载原版LLaMA模型

如果下载的是Meta原版LLaMA模型，则需要将原版LLaMA模型转换为HF格式。

使用transformers提供的脚本convert_llama_weights_to_hf.py，将原版LLaMA模型转换为HuggingFace格式。

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir path_to_original_llama_root_dir \
    --model_size 7B \
    --output_dir path_to_original_llama_hf_dir

值得注意的是，将原版LLaMA的tokenizer.model放在--input_dir指定的目录，其余文件放在${input_dir}/${model_size}下。执行以下命令后，--output_dir中将存放转换好的HF版权重。

4.2 下载gguf模型

可以直接下载gguf模型，跳过gguf格式转换过程。

QuantFactory/Meta-Llama-3-8B-Instruct-GGUF

./main -m $(./scripts/hf.sh --repo QuantFactory/Meta-Llama-3-8B-Instruct-GGUF --file Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models) 

./main -m $(./scripts/hf.sh --url https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models)

./main -m $(./scripts/hf.sh https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_0.gguf --outdir ./models)

4.3 下载HuggingFace模型

以LLM-Research/Meta-Llama-3-8B-Instruct 模型为例。由于从Hugging Face申请许可失败，从ModelScope魔塔社区中下载该模型。

模型下载方法，请参考：Hugging Face和ModelScope大模型/数据集的下载加速方法

5.（可选）合并LoRA权重

由于原版的LLaMA模型不具备中文理解能力，而Chinese-LLaMA-Alpaca具有良好的中文理解能力。因此，通过对原版LLaMA模型（HF格式）扩充中文词表，并与LoRA权重进行合并，生成全量模型权重。

合并LoRA权重的详细步骤，请参考：llama.cpp一种在本地CPU上部署的量化模型（超低配推理llama）

6. 转换gguf格式

Converting HuggingFace Models to GGUF/GGML

The convert-hf-to-gguf-update.py seems doesn’t work. #7088

Tutorial: How to convert HuggingFace model to GGUF format #2948

llama.cpp 支持转换的模型格式有PyTorch 的 .pth，huggingface的 .safetensors，还有之前 llamma.cpp 采用的 ggmlv3。

6.1 convert脚本

convert脚本包括：

root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ll | grep convert
-rwxr-xr-x  1 root root   13029 Aug  8 10:57 convert-hf-to-gguf-update.py*
-rwxr-xr-x  1 root root  127129 Aug  8 10:57 convert-hf-to-gguf.py*
-rwxr-xr-x  1 root root   18993 Aug  8 10:57 convert-llama-ggml-to-gguf.py*
-rwxr-xr-x  1 root root 2218136 Aug  8 11:02 convert-llama2c-to-ggml*
-rwxr-xr-x  1 root root   69417 Aug  8 10:57 convert.py*

解释说明

convert_hf_to_gguf_update.py: Downloads the tokenizer models of the specified models from Huggingface and generates the get_vocab_base_pre() function for convert_hf_to_gguf.py. convert-hf-to-gguf.py: Convert from HuggingFace format to gguf. convert-llama-ggml-to-gguf.py: Convert from ggml format to gguf. convert-llama2c-to-ggml: Convert from llama2.c model format to ggml. convert.py.

6.2 convert.py

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# python convert.py -h
usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only] [--no-vocab] [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR]
                  [--vocab-type VOCAB_TYPE] [--outfile OUTFILE] [--ctx CTX] [--concurrency CONCURRENCY] [--big-endian]
                  [--pad-vocab] [--skip-unknown] [--verbose] [--metadata METADATA] [--get-outfile]
                  model

Convert a LLaMA model to a GGML compatible file

positional arguments:
  model                 directory containing model file, or model file itself (*.pth, *.pt, *.bin)

options:
  -h, --help            show this help message and exit
  --dump                don't convert, just show what's in the model
  --dump-single         don't convert, just show what's in a single model file
  --vocab-only          extract only the vocab
  --no-vocab            store model without the vocab
  --outtype {f32,f16,q8_0}
                        output format - note: q8_0 may be very slow (default: f16 or f32 based on input)
  --vocab-dir VOCAB_DIR
                        directory containing tokenizer.model, if separate from model file
  --vocab-type VOCAB_TYPE
                        vocab types to try in order, choose from 'spm', 'bpe', 'hfft' (default: spm,hfft)
  --outfile OUTFILE     path to write to; default: based on input
  --ctx CTX             model training context (default: based on input)
  --concurrency CONCURRENCY
                        concurrency used for conversion (default: 8)
  --big-endian          model is executed on big endian machine
  --pad-vocab           add pad tokens when model vocab expects more than tokenizer metadata provides
  --skip-unknown        skip unknown tensor names instead of failing
  --verbose             increase output verbosity
  --metadata METADATA   Specify the path for a metadata file
  --get-outfile         get calculated default outfile name

解释说明

--outtype，包括：{f32,f16,q8_0}。 --vocab-type，包括：{'spm', 'bpe', 'hfft'}。

6.3 执行转换

将Hugging Face下载的模型转换为gguf格式，输出类型为FP16。

Llama-3相比其前两代显著扩充了词表大小，由32K扩充至128K，并且改为BPE词表。因此需要使用--vocab-type参数指定分词算法，默认值是spm，如果是bpe，需要显示指定。

注意：官方文档说 convert.py 不支持LLaMA 3，喊使用 convert-hf-to-gguf.py，但它不支持 --vocab-type，且出现异常：error: unrecognized arguments: --vocab-type bpe，因此使用 convert.py 不会出错。

python convert.py models/Meta-Llama-3-8B-Instruct/ --outfile models/ggml-vocab-llama3-8B-instruct-f16.gguf --outtype f16 --vocab-type bpe

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# python convert.py models/Meta-Llama-3-8B-Instruct/ --outfile models/ggml-vocab-llama3-8B-instruct-f16.gguf --outtype f16 --vocab-type bpe
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file models/Meta-Llama-3-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('models/Meta-Llama-3-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('models/Meta-Llama-3-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128009}, add special tokens unset>
INFO:convert:Writing models/ggml-vocab-llama3-8B-instruct-f16.gguf, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128009
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
INFO:convert:[  1/291] Writing tensor token_embd.weight                      | size 128256 x   4096  | type F16  | T+   4
INFO:convert:[  2/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   5
INFO:convert:[  3/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   5
INFO:convert:[  4/291] Writing tensor blk.0.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  5/291] Writing tensor blk.0.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  6/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   5
INFO:convert:[  7/291] Writing tensor blk.0.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   5
INFO:convert:[  8/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[  9/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   6
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   6
INFO:convert:[ 12/291] Writing tensor blk.1.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   6
INFO:convert:[ 13/291] Writing tensor blk.1.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   7
INFO:convert:[ 14/291] Writing tensor blk.1.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   7
INFO:convert:[ 15/291] Writing tensor blk.1.ffn_norm.weight                  | size   4096           | type F32  | T+   7
INFO:convert:[ 16/291] Writing tensor blk.1.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   7
INFO:convert:[ 17/291] Writing tensor blk.1.attn_output.weight               | size   4096 x   4096  | type F16  | T+   7
INFO:convert:[ 18/291] Writing tensor blk.1.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   7
INFO:convert:[ 19/291] Writing tensor blk.1.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   7
INFO:convert:[ 20/291] Writing tensor blk.2.attn_norm.weight                 | size   4096           | type F32  | T+   7
INFO:convert:[ 21/291] Writing tensor blk.2.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   8
INFO:convert:[ 22/291] Writing tensor blk.2.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   8
INFO:convert:[ 23/291] Writing tensor blk.2.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   8
INFO:convert:[ 24/291] Writing tensor blk.2.ffn_norm.weight                  | size   4096           | type F32  | T+   8
INFO:convert:[ 25/291] Writing tensor blk.2.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   8
INFO:convert:[ 26/291] Writing tensor blk.2.attn_output.weight               | size   4096 x   4096  | type F16  | T+   8
INFO:convert:[ 27/291] Writing tensor blk.2.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   8
INFO:convert:[ 28/291] Writing tensor blk.2.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   8
INFO:convert:[ 29/291] Writing tensor blk.3.attn_norm.weight                 | size   4096           | type F32  | T+   8
INFO:convert:[ 30/291] Writing tensor blk.3.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   9
INFO:convert:[ 31/291] Writing tensor blk.3.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   9
INFO:convert:[ 32/291] Writing tensor blk.3.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   9
INFO:convert:[ 33/291] Writing tensor blk.3.ffn_norm.weight                  | size   4096           | type F32  | T+   9
INFO:convert:[ 34/291] Writing tensor blk.3.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   9
INFO:convert:[ 35/291] Writing tensor blk.3.attn_output.weight               | size   4096 x   4096  | type F16  | T+   9
INFO:convert:[ 36/291] Writing tensor blk.3.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   9
INFO:convert:[ 37/291] Writing tensor blk.3.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   9
INFO:convert:[ 38/291] Writing tensor blk.4.attn_norm.weight                 | size   4096           | type F32  | T+   9
INFO:convert:[ 39/291] Writing tensor blk.4.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  10
INFO:convert:[ 40/291] Writing tensor blk.4.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  10
INFO:convert:[ 41/291] Writing tensor blk.4.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  11
INFO:convert:[ 42/291] Writing tensor blk.4.ffn_norm.weight                  | size   4096           | type F32  | T+  12
INFO:convert:[ 43/291] Writing tensor blk.4.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  12
INFO:convert:[ 44/291] Writing tensor blk.4.attn_output.weight               | size   4096 x   4096  | type F16  | T+  12
INFO:convert:[ 45/291] Writing tensor blk.4.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  12
INFO:convert:[ 46/291] Writing tensor blk.4.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  12
INFO:convert:[ 47/291] Writing tensor blk.5.attn_norm.weight                 | size   4096           | type F32  | T+  12
INFO:convert:[ 48/291] Writing tensor blk.5.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  12
INFO:convert:[ 49/291] Writing tensor blk.5.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  12
INFO:convert:[ 50/291] Writing tensor blk.5.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  12
INFO:convert:[ 51/291] Writing tensor blk.5.ffn_norm.weight                  | size   4096           | type F32  | T+  13
INFO:convert:[ 52/291] Writing tensor blk.5.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  13
INFO:convert:[ 53/291] Writing tensor blk.5.attn_output.weight               | size   4096 x   4096  | type F16  | T+  13
INFO:convert:[ 54/291] Writing tensor blk.5.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  13
INFO:convert:[ 55/291] Writing tensor blk.5.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  13
INFO:convert:[ 56/291] Writing tensor blk.6.attn_norm.weight                 | size   4096           | type F32  | T+  13
INFO:convert:[ 57/291] Writing tensor blk.6.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  13
INFO:convert:[ 58/291] Writing tensor blk.6.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  13
INFO:convert:[ 59/291] Writing tensor blk.6.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  14
INFO:convert:[ 60/291] Writing tensor blk.6.ffn_norm.weight                  | size   4096           | type F32  | T+  14
INFO:convert:[ 61/291] Writing tensor blk.6.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  14
INFO:convert:[ 62/291] Writing tensor blk.6.attn_output.weight               | size   4096 x   4096  | type F16  | T+  14
INFO:convert:[ 63/291] Writing tensor blk.6.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  14
INFO:convert:[ 64/291] Writing tensor blk.6.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  14
INFO:convert:[ 65/291] Writing tensor blk.7.attn_norm.weight                 | size   4096           | type F32  | T+  14
INFO:convert:[ 66/291] Writing tensor blk.7.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  14
INFO:convert:[ 67/291] Writing tensor blk.7.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  15
INFO:convert:[ 68/291] Writing tensor blk.7.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  15
INFO:convert:[ 69/291] Writing tensor blk.7.ffn_norm.weight                  | size   4096           | type F32  | T+  15
INFO:convert:[ 70/291] Writing tensor blk.7.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  15
INFO:convert:[ 71/291] Writing tensor blk.7.attn_output.weight               | size   4096 x   4096  | type F16  | T+  15
INFO:convert:[ 72/291] Writing tensor blk.7.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  15
INFO:convert:[ 73/291] Writing tensor blk.7.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  15
INFO:convert:[ 74/291] Writing tensor blk.8.attn_norm.weight                 | size   4096           | type F32  | T+  15
INFO:convert:[ 75/291] Writing tensor blk.8.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  16
INFO:convert:[ 76/291] Writing tensor blk.8.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  16
INFO:convert:[ 77/291] Writing tensor blk.8.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  16
INFO:convert:[ 78/291] Writing tensor blk.8.ffn_norm.weight                  | size   4096           | type F32  | T+  16
INFO:convert:[ 79/291] Writing tensor blk.8.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  16
INFO:convert:[ 80/291] Writing tensor blk.8.attn_output.weight               | size   4096 x   4096  | type F16  | T+  16
INFO:convert:[ 81/291] Writing tensor blk.8.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  16
INFO:convert:[ 82/291] Writing tensor blk.8.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  17
INFO:convert:[ 83/291] Writing tensor blk.10.attn_norm.weight                | size   4096           | type F32  | T+  17
INFO:convert:[ 84/291] Writing tensor blk.10.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  17
INFO:convert:[ 85/291] Writing tensor blk.10.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  17
INFO:convert:[ 86/291] Writing tensor blk.10.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  17
INFO:convert:[ 87/291] Writing tensor blk.10.ffn_norm.weight                 | size   4096           | type F32  | T+  17
INFO:convert:[ 88/291] Writing tensor blk.10.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  17
INFO:convert:[ 89/291] Writing tensor blk.10.attn_output.weight              | size   4096 x   4096  | type F16  | T+  17
INFO:convert:[ 90/291] Writing tensor blk.10.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  18
INFO:convert:[ 91/291] Writing tensor blk.10.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  18
INFO:convert:[ 92/291] Writing tensor blk.11.attn_norm.weight                | size   4096           | type F32  | T+  18
INFO:convert:[ 93/291] Writing tensor blk.11.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  18
INFO:convert:[ 94/291] Writing tensor blk.11.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  18
INFO:convert:[ 95/291] Writing tensor blk.11.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  18
INFO:convert:[ 96/291] Writing tensor blk.11.ffn_norm.weight                 | size   4096           | type F32  | T+  18
INFO:convert:[ 97/291] Writing tensor blk.11.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  18
INFO:convert:[ 98/291] Writing tensor blk.11.attn_output.weight              | size   4096 x   4096  | type F16  | T+  18
INFO:convert:[ 99/291] Writing tensor blk.11.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  19
INFO:convert:[100/291] Writing tensor blk.11.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  19
INFO:convert:[101/291] Writing tensor blk.12.attn_norm.weight                | size   4096           | type F32  | T+  19
INFO:convert:[102/291] Writing tensor blk.12.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  20
INFO:convert:[103/291] Writing tensor blk.12.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  20
INFO:convert:[104/291] Writing tensor blk.12.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  20
INFO:convert:[105/291] Writing tensor blk.12.ffn_norm.weight                 | size   4096           | type F32  | T+  20
INFO:convert:[106/291] Writing tensor blk.12.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  20
INFO:convert:[107/291] Writing tensor blk.12.attn_output.weight              | size   4096 x   4096  | type F16  | T+  20
INFO:convert:[108/291] Writing tensor blk.12.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  20
INFO:convert:[109/291] Writing tensor blk.12.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  20
INFO:convert:[110/291] Writing tensor blk.13.attn_norm.weight                | size   4096           | type F32  | T+  20
INFO:convert:[111/291] Writing tensor blk.13.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  21
INFO:convert:[112/291] Writing tensor blk.13.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  21
INFO:convert:[113/291] Writing tensor blk.13.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  21
INFO:convert:[114/291] Writing tensor blk.13.ffn_norm.weight                 | size   4096           | type F32  | T+  21
INFO:convert:[115/291] Writing tensor blk.13.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  22
INFO:convert:[116/291] Writing tensor blk.13.attn_output.weight              | size   4096 x   4096  | type F16  | T+  22
INFO:convert:[117/291] Writing tensor blk.13.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  22
INFO:convert:[118/291] Writing tensor blk.13.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  22
INFO:convert:[119/291] Writing tensor blk.14.attn_norm.weight                | size   4096           | type F32  | T+  22
INFO:convert:[120/291] Writing tensor blk.14.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  22
INFO:convert:[121/291] Writing tensor blk.14.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  22
INFO:convert:[122/291] Writing tensor blk.14.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  22
INFO:convert:[123/291] Writing tensor blk.14.ffn_norm.weight                 | size   4096           | type F32  | T+  22
INFO:convert:[124/291] Writing tensor blk.14.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  22
INFO:convert:[125/291] Writing tensor blk.14.attn_output.weight              | size   4096 x   4096  | type F16  | T+  22
INFO:convert:[126/291] Writing tensor blk.14.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  22
INFO:convert:[127/291] Writing tensor blk.14.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  23
INFO:convert:[128/291] Writing tensor blk.15.attn_norm.weight                | size   4096           | type F32  | T+  23
INFO:convert:[129/291] Writing tensor blk.15.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  23
INFO:convert:[130/291] Writing tensor blk.15.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  23
INFO:convert:[131/291] Writing tensor blk.15.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  23
INFO:convert:[132/291] Writing tensor blk.15.ffn_norm.weight                 | size   4096           | type F32  | T+  24
INFO:convert:[133/291] Writing tensor blk.15.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  24
INFO:convert:[134/291] Writing tensor blk.15.attn_output.weight              | size   4096 x   4096  | type F16  | T+  24
INFO:convert:[135/291] Writing tensor blk.15.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  24
INFO:convert:[136/291] Writing tensor blk.15.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  24
INFO:convert:[137/291] Writing tensor blk.16.attn_norm.weight                | size   4096           | type F32  | T+  24
INFO:convert:[138/291] Writing tensor blk.16.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  24
INFO:convert:[139/291] Writing tensor blk.16.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  24
INFO:convert:[140/291] Writing tensor blk.16.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  25
INFO:convert:[141/291] Writing tensor blk.16.ffn_norm.weight                 | size   4096           | type F32  | T+  25
INFO:convert:[142/291] Writing tensor blk.16.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  25
INFO:convert:[143/291] Writing tensor blk.16.attn_output.weight              | size   4096 x   4096  | type F16  | T+  25
INFO:convert:[144/291] Writing tensor blk.16.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  25
INFO:convert:[145/291] Writing tensor blk.16.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  25
INFO:convert:[146/291] Writing tensor blk.17.attn_norm.weight                | size   4096           | type F32  | T+  25
INFO:convert:[147/291] Writing tensor blk.17.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  25
INFO:convert:[148/291] Writing tensor blk.17.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  26
INFO:convert:[149/291] Writing tensor blk.17.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  26
INFO:convert:[150/291] Writing tensor blk.17.ffn_norm.weight                 | size   4096           | type F32  | T+  26
INFO:convert:[151/291] Writing tensor blk.17.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  26
INFO:convert:[152/291] Writing tensor blk.17.attn_output.weight              | size   4096 x   4096  | type F16  | T+  26
INFO:convert:[153/291] Writing tensor blk.17.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  26
INFO:convert:[154/291] Writing tensor blk.17.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  26
INFO:convert:[155/291] Writing tensor blk.18.attn_norm.weight                | size   4096           | type F32  | T+  26
INFO:convert:[156/291] Writing tensor blk.18.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  26
INFO:convert:[157/291] Writing tensor blk.18.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  27
INFO:convert:[158/291] Writing tensor blk.18.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  27
INFO:convert:[159/291] Writing tensor blk.18.ffn_norm.weight                 | size   4096           | type F32  | T+  27
INFO:convert:[160/291] Writing tensor blk.18.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  27
INFO:convert:[161/291] Writing tensor blk.18.attn_output.weight              | size   4096 x   4096  | type F16  | T+  27
INFO:convert:[162/291] Writing tensor blk.18.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  27
INFO:convert:[163/291] Writing tensor blk.18.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  27
INFO:convert:[164/291] Writing tensor blk.19.attn_norm.weight                | size   4096           | type F32  | T+  27
INFO:convert:[165/291] Writing tensor blk.19.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  28
INFO:convert:[166/291] Writing tensor blk.19.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  28
INFO:convert:[167/291] Writing tensor blk.19.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  28
INFO:convert:[168/291] Writing tensor blk.19.ffn_norm.weight                 | size   4096           | type F32  | T+  28
INFO:convert:[169/291] Writing tensor blk.19.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  28
INFO:convert:[170/291] Writing tensor blk.19.attn_output.weight              | size   4096 x   4096  | type F16  | T+  28
INFO:convert:[171/291] Writing tensor blk.19.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  29
INFO:convert:[172/291] Writing tensor blk.19.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  29
INFO:convert:[173/291] Writing tensor blk.20.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  29
INFO:convert:[174/291] Writing tensor blk.20.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  29
INFO:convert:[175/291] Writing tensor blk.20.attn_output.weight              | size   4096 x   4096  | type F16  | T+  29
INFO:convert:[176/291] Writing tensor blk.20.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  29
INFO:convert:[177/291] Writing tensor blk.20.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  29
INFO:convert:[178/291] Writing tensor blk.9.attn_norm.weight                 | size   4096           | type F32  | T+  29
INFO:convert:[179/291] Writing tensor blk.9.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+  30
INFO:convert:[180/291] Writing tensor blk.9.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+  30
INFO:convert:[181/291] Writing tensor blk.9.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+  30
INFO:convert:[182/291] Writing tensor blk.9.ffn_norm.weight                  | size   4096           | type F32  | T+  30
INFO:convert:[183/291] Writing tensor blk.9.attn_k.weight                    | size   1024 x   4096  | type F16  | T+  30
INFO:convert:[184/291] Writing tensor blk.9.attn_output.weight               | size   4096 x   4096  | type F16  | T+  30
INFO:convert:[185/291] Writing tensor blk.9.attn_q.weight                    | size   4096 x   4096  | type F16  | T+  30
INFO:convert:[186/291] Writing tensor blk.9.attn_v.weight                    | size   1024 x   4096  | type F16  | T+  30
INFO:convert:[187/291] Writing tensor blk.20.attn_norm.weight                | size   4096           | type F32  | T+  30
INFO:convert:[188/291] Writing tensor blk.20.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  31
INFO:convert:[189/291] Writing tensor blk.20.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  31
INFO:convert:[190/291] Writing tensor blk.20.ffn_norm.weight                 | size   4096           | type F32  | T+  31
INFO:convert:[191/291] Writing tensor blk.21.attn_norm.weight                | size   4096           | type F32  | T+  31
INFO:convert:[192/291] Writing tensor blk.21.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  31
INFO:convert:[193/291] Writing tensor blk.21.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  31
INFO:convert:[194/291] Writing tensor blk.21.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  32
INFO:convert:[195/291] Writing tensor blk.21.ffn_norm.weight                 | size   4096           | type F32  | T+  32
INFO:convert:[196/291] Writing tensor blk.21.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  32
INFO:convert:[197/291] Writing tensor blk.21.attn_output.weight              | size   4096 x   4096  | type F16  | T+  32
INFO:convert:[198/291] Writing tensor blk.21.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  32
INFO:convert:[199/291] Writing tensor blk.21.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  32
INFO:convert:[200/291] Writing tensor blk.22.attn_norm.weight                | size   4096           | type F32  | T+  32
INFO:convert:[201/291] Writing tensor blk.22.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  33
INFO:convert:[202/291] Writing tensor blk.22.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  33
INFO:convert:[203/291] Writing tensor blk.22.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  33
INFO:convert:[204/291] Writing tensor blk.22.ffn_norm.weight                 | size   4096           | type F32  | T+  33
INFO:convert:[205/291] Writing tensor blk.22.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  33
INFO:convert:[206/291] Writing tensor blk.22.attn_output.weight              | size   4096 x   4096  | type F16  | T+  33
INFO:convert:[207/291] Writing tensor blk.22.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  33
INFO:convert:[208/291] Writing tensor blk.22.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  33
INFO:convert:[209/291] Writing tensor blk.23.attn_norm.weight                | size   4096           | type F32  | T+  33
INFO:convert:[210/291] Writing tensor blk.23.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  33
INFO:convert:[211/291] Writing tensor blk.23.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  34
INFO:convert:[212/291] Writing tensor blk.23.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  34
INFO:convert:[213/291] Writing tensor blk.23.ffn_norm.weight                 | size   4096           | type F32  | T+  34
INFO:convert:[214/291] Writing tensor blk.23.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  34
INFO:convert:[215/291] Writing tensor blk.23.attn_output.weight              | size   4096 x   4096  | type F16  | T+  34
INFO:convert:[216/291] Writing tensor blk.23.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  34
INFO:convert:[217/291] Writing tensor blk.23.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  34
INFO:convert:[218/291] Writing tensor blk.24.attn_norm.weight                | size   4096           | type F32  | T+  34
INFO:convert:[219/291] Writing tensor blk.24.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  35
INFO:convert:[220/291] Writing tensor blk.24.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  35
INFO:convert:[221/291] Writing tensor blk.24.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  35
INFO:convert:[222/291] Writing tensor blk.24.ffn_norm.weight                 | size   4096           | type F32  | T+  35
INFO:convert:[223/291] Writing tensor blk.24.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  35
INFO:convert:[224/291] Writing tensor blk.24.attn_output.weight              | size   4096 x   4096  | type F16  | T+  35
INFO:convert:[225/291] Writing tensor blk.24.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  35
INFO:convert:[226/291] Writing tensor blk.24.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  35
INFO:convert:[227/291] Writing tensor blk.25.attn_norm.weight                | size   4096           | type F32  | T+  35
INFO:convert:[228/291] Writing tensor blk.25.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  36
INFO:convert:[229/291] Writing tensor blk.25.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  36
INFO:convert:[230/291] Writing tensor blk.25.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  36
INFO:convert:[231/291] Writing tensor blk.25.ffn_norm.weight                 | size   4096           | type F32  | T+  37
INFO:convert:[232/291] Writing tensor blk.25.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  37
INFO:convert:[233/291] Writing tensor blk.25.attn_output.weight              | size   4096 x   4096  | type F16  | T+  37
INFO:convert:[234/291] Writing tensor blk.25.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  37
INFO:convert:[235/291] Writing tensor blk.25.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  37
INFO:convert:[236/291] Writing tensor blk.26.attn_norm.weight                | size   4096           | type F32  | T+  37
INFO:convert:[237/291] Writing tensor blk.26.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  37
INFO:convert:[238/291] Writing tensor blk.26.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  37
INFO:convert:[239/291] Writing tensor blk.26.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  37
INFO:convert:[240/291] Writing tensor blk.26.ffn_norm.weight                 | size   4096           | type F32  | T+  38
INFO:convert:[241/291] Writing tensor blk.26.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  38
INFO:convert:[242/291] Writing tensor blk.26.attn_output.weight              | size   4096 x   4096  | type F16  | T+  38
INFO:convert:[243/291] Writing tensor blk.26.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  38
INFO:convert:[244/291] Writing tensor blk.26.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  38
INFO:convert:[245/291] Writing tensor blk.27.attn_norm.weight                | size   4096           | type F32  | T+  38
INFO:convert:[246/291] Writing tensor blk.27.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  39
INFO:convert:[247/291] Writing tensor blk.27.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  39
INFO:convert:[248/291] Writing tensor blk.27.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  39
INFO:convert:[249/291] Writing tensor blk.27.ffn_norm.weight                 | size   4096           | type F32  | T+  39
INFO:convert:[250/291] Writing tensor blk.27.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  39
INFO:convert:[251/291] Writing tensor blk.27.attn_output.weight              | size   4096 x   4096  | type F16  | T+  39
INFO:convert:[252/291] Writing tensor blk.27.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  39
INFO:convert:[253/291] Writing tensor blk.27.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  39
INFO:convert:[254/291] Writing tensor blk.28.attn_norm.weight                | size   4096           | type F32  | T+  39
INFO:convert:[255/291] Writing tensor blk.28.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  40
INFO:convert:[256/291] Writing tensor blk.28.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  40
INFO:convert:[257/291] Writing tensor blk.28.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  40
INFO:convert:[258/291] Writing tensor blk.28.ffn_norm.weight                 | size   4096           | type F32  | T+  40
INFO:convert:[259/291] Writing tensor blk.28.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  40
INFO:convert:[260/291] Writing tensor blk.28.attn_output.weight              | size   4096 x   4096  | type F16  | T+  40
INFO:convert:[261/291] Writing tensor blk.28.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  40
INFO:convert:[262/291] Writing tensor blk.28.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  40
INFO:convert:[263/291] Writing tensor blk.29.attn_norm.weight                | size   4096           | type F32  | T+  40
INFO:convert:[264/291] Writing tensor blk.29.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  41
INFO:convert:[265/291] Writing tensor blk.29.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  41
INFO:convert:[266/291] Writing tensor blk.29.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  41
INFO:convert:[267/291] Writing tensor blk.29.ffn_norm.weight                 | size   4096           | type F32  | T+  41
INFO:convert:[268/291] Writing tensor blk.29.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  41
INFO:convert:[269/291] Writing tensor blk.29.attn_output.weight              | size   4096 x   4096  | type F16  | T+  41
INFO:convert:[270/291] Writing tensor blk.29.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  41
INFO:convert:[271/291] Writing tensor blk.29.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  41
INFO:convert:[272/291] Writing tensor blk.30.attn_norm.weight                | size   4096           | type F32  | T+  41
INFO:convert:[273/291] Writing tensor blk.30.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  42
INFO:convert:[274/291] Writing tensor blk.30.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  42
INFO:convert:[275/291] Writing tensor blk.30.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  43
INFO:convert:[276/291] Writing tensor blk.30.ffn_norm.weight                 | size   4096           | type F32  | T+  43
INFO:convert:[277/291] Writing tensor blk.30.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  43
INFO:convert:[278/291] Writing tensor blk.30.attn_output.weight              | size   4096 x   4096  | type F16  | T+  43
INFO:convert:[279/291] Writing tensor blk.30.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  43
INFO:convert:[280/291] Writing tensor blk.30.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  43
INFO:convert:[281/291] Writing tensor blk.31.ffn_gate.weight                 | size  14336 x   4096  | type F16  | T+  43
INFO:convert:[282/291] Writing tensor blk.31.ffn_up.weight                   | size  14336 x   4096  | type F16  | T+  43
INFO:convert:[283/291] Writing tensor blk.31.attn_k.weight                   | size   1024 x   4096  | type F16  | T+  43
INFO:convert:[284/291] Writing tensor blk.31.attn_output.weight              | size   4096 x   4096  | type F16  | T+  43
INFO:convert:[285/291] Writing tensor blk.31.attn_q.weight                   | size   4096 x   4096  | type F16  | T+  43
INFO:convert:[286/291] Writing tensor blk.31.attn_v.weight                   | size   1024 x   4096  | type F16  | T+  44
INFO:convert:[287/291] Writing tensor output.weight                          | size 128256 x   4096  | type F16  | T+  48
INFO:convert:[288/291] Writing tensor blk.31.attn_norm.weight                | size   4096           | type F32  | T+  49
INFO:convert:[289/291] Writing tensor blk.31.ffn_down.weight                 | size   4096 x  14336  | type F16  | T+  49
INFO:convert:[290/291] Writing tensor blk.31.ffn_norm.weight                 | size   4096           | type F32  | T+  49
INFO:convert:[291/291] Writing tensor output_norm.weight                     | size   4096           | type F32  | T+  49
INFO:convert:Wrote models/ggml-vocab-llama3-8B-instruct-f16.gguf

7. 量化模型

quantize

Quantization of LLMs with llama.cpp

Llama.cpp量化简明手册

7.1 查看量化类型

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./quantize -h
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quatized model in the same shards as input  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 14.00G, -0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing

解释说明

使用quantize量化模型，它提供各种量化位数的模型：Q2、Q3、Q4、Q5、Q6、Q8、F16。量化模型的命名方法遵循: Q + 量化比特位 + 变种。量化位数越少，对硬件资源的要求越低，但是模型的精度也越低。

7.2 执行量化

对FP16模型进行4-bit量化。

./quantize models/ggml-vocab-llama3-8B-instruct-f16.gguf models/ggml-vocab-llama3-8B-instruct-q4_0.gguf Q4_0

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./quantize models/ggml-vocab-llama3-8B-instruct-f16.gguf models/ggml-vocab-llama3-8B-instruct-q4_0.gguf Q4_0
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: quantizing 'models/ggml-vocab-llama3-8B-instruct-f16.gguf' to 'models/ggml-vocab-llama3-8B-instruct-q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
[   1/ 291]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q4_0 .. size =  1002.00 MiB ->   281.81 MiB
[   2/ 291]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   3/ 291]                blk.0.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[   4/ 291]                blk.0.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[   5/ 291]                  blk.0.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[   6/ 291]                blk.0.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   7/ 291]                  blk.0.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[   8/ 291]             blk.0.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[   9/ 291]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  10/ 291]                  blk.0.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  11/ 291]               blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  12/ 291]                blk.1.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  13/ 291]                blk.1.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  14/ 291]                  blk.1.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  15/ 291]                blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  16/ 291]                  blk.1.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  17/ 291]             blk.1.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  18/ 291]                  blk.1.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  19/ 291]                  blk.1.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  20/ 291]               blk.2.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  21/ 291]                blk.2.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  22/ 291]                blk.2.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  23/ 291]                  blk.2.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  24/ 291]                blk.2.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  25/ 291]                  blk.2.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  26/ 291]             blk.2.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  27/ 291]                  blk.2.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  28/ 291]                  blk.2.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  29/ 291]               blk.3.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  30/ 291]                blk.3.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  31/ 291]                blk.3.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  32/ 291]                  blk.3.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  33/ 291]                blk.3.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  34/ 291]                  blk.3.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  35/ 291]             blk.3.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  36/ 291]                  blk.3.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  37/ 291]                  blk.3.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  38/ 291]               blk.4.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  39/ 291]                blk.4.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  40/ 291]                blk.4.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  41/ 291]                  blk.4.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  42/ 291]                blk.4.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  43/ 291]                  blk.4.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  44/ 291]             blk.4.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  45/ 291]                  blk.4.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  46/ 291]                  blk.4.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  47/ 291]               blk.5.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  48/ 291]                blk.5.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  49/ 291]                blk.5.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  50/ 291]                  blk.5.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  51/ 291]                blk.5.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  52/ 291]                  blk.5.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  53/ 291]             blk.5.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  54/ 291]                  blk.5.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  55/ 291]                  blk.5.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  56/ 291]               blk.6.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  57/ 291]                blk.6.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  58/ 291]                blk.6.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  59/ 291]                  blk.6.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  60/ 291]                blk.6.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  61/ 291]                  blk.6.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  62/ 291]             blk.6.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  63/ 291]                  blk.6.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  64/ 291]                  blk.6.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  65/ 291]               blk.7.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  66/ 291]                blk.7.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  67/ 291]                blk.7.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  68/ 291]                  blk.7.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  69/ 291]                blk.7.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  70/ 291]                  blk.7.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  71/ 291]             blk.7.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  72/ 291]                  blk.7.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  73/ 291]                  blk.7.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  74/ 291]               blk.8.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  75/ 291]                blk.8.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  76/ 291]                blk.8.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  77/ 291]                  blk.8.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  78/ 291]                blk.8.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  79/ 291]                  blk.8.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  80/ 291]             blk.8.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  81/ 291]                  blk.8.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  82/ 291]                  blk.8.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  83/ 291]              blk.10.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  84/ 291]               blk.10.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  85/ 291]               blk.10.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  86/ 291]                 blk.10.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  87/ 291]               blk.10.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  88/ 291]                 blk.10.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  89/ 291]            blk.10.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  90/ 291]                 blk.10.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  91/ 291]                 blk.10.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  92/ 291]              blk.11.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  93/ 291]               blk.11.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  94/ 291]               blk.11.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  95/ 291]                 blk.11.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[  96/ 291]               blk.11.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  97/ 291]                 blk.11.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[  98/ 291]            blk.11.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[  99/ 291]                 blk.11.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 100/ 291]                 blk.11.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 101/ 291]              blk.12.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 102/ 291]               blk.12.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 103/ 291]               blk.12.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 104/ 291]                 blk.12.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 105/ 291]               blk.12.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 106/ 291]                 blk.12.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 107/ 291]            blk.12.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 108/ 291]                 blk.12.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 109/ 291]                 blk.12.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 110/ 291]              blk.13.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 111/ 291]               blk.13.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 112/ 291]               blk.13.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 113/ 291]                 blk.13.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 114/ 291]               blk.13.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 115/ 291]                 blk.13.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 116/ 291]            blk.13.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 117/ 291]                 blk.13.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 118/ 291]                 blk.13.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 119/ 291]              blk.14.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 120/ 291]               blk.14.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 121/ 291]               blk.14.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 122/ 291]                 blk.14.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 123/ 291]               blk.14.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 124/ 291]                 blk.14.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 125/ 291]            blk.14.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 126/ 291]                 blk.14.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 127/ 291]                 blk.14.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 128/ 291]              blk.15.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 129/ 291]               blk.15.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 130/ 291]               blk.15.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 131/ 291]                 blk.15.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 132/ 291]               blk.15.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 133/ 291]                 blk.15.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 134/ 291]            blk.15.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 135/ 291]                 blk.15.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 136/ 291]                 blk.15.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 137/ 291]              blk.16.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 138/ 291]               blk.16.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 139/ 291]               blk.16.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 140/ 291]                 blk.16.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 141/ 291]               blk.16.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 142/ 291]                 blk.16.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 143/ 291]            blk.16.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 144/ 291]                 blk.16.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 145/ 291]                 blk.16.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 146/ 291]              blk.17.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 147/ 291]               blk.17.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 148/ 291]               blk.17.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 149/ 291]                 blk.17.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 150/ 291]               blk.17.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 151/ 291]                 blk.17.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 152/ 291]            blk.17.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 153/ 291]                 blk.17.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 154/ 291]                 blk.17.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 155/ 291]              blk.18.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 156/ 291]               blk.18.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 157/ 291]               blk.18.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 158/ 291]                 blk.18.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 159/ 291]               blk.18.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 160/ 291]                 blk.18.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 161/ 291]            blk.18.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 162/ 291]                 blk.18.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 163/ 291]                 blk.18.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 164/ 291]              blk.19.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 165/ 291]               blk.19.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 166/ 291]               blk.19.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 167/ 291]                 blk.19.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 168/ 291]               blk.19.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 169/ 291]                 blk.19.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 170/ 291]            blk.19.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 171/ 291]                 blk.19.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 172/ 291]                 blk.19.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 173/ 291]               blk.20.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 174/ 291]                 blk.20.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 175/ 291]            blk.20.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 176/ 291]                 blk.20.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 177/ 291]                 blk.20.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 178/ 291]               blk.9.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 179/ 291]                blk.9.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 180/ 291]                blk.9.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 181/ 291]                  blk.9.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 182/ 291]                blk.9.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 183/ 291]                  blk.9.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 184/ 291]             blk.9.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 185/ 291]                  blk.9.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 186/ 291]                  blk.9.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 187/ 291]              blk.20.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 188/ 291]               blk.20.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 189/ 291]                 blk.20.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 190/ 291]               blk.20.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 191/ 291]              blk.21.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 192/ 291]               blk.21.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 193/ 291]               blk.21.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 194/ 291]                 blk.21.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 195/ 291]               blk.21.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 196/ 291]                 blk.21.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 197/ 291]            blk.21.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 198/ 291]                 blk.21.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 199/ 291]                 blk.21.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 200/ 291]              blk.22.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 201/ 291]               blk.22.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 202/ 291]               blk.22.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 203/ 291]                 blk.22.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 204/ 291]               blk.22.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 205/ 291]                 blk.22.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 206/ 291]            blk.22.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 207/ 291]                 blk.22.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 208/ 291]                 blk.22.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 209/ 291]              blk.23.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 210/ 291]               blk.23.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 211/ 291]               blk.23.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 212/ 291]                 blk.23.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 213/ 291]               blk.23.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 214/ 291]                 blk.23.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 215/ 291]            blk.23.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 216/ 291]                 blk.23.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 217/ 291]                 blk.23.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 218/ 291]              blk.24.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 219/ 291]               blk.24.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 220/ 291]               blk.24.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 221/ 291]                 blk.24.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 222/ 291]               blk.24.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 223/ 291]                 blk.24.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 224/ 291]            blk.24.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 225/ 291]                 blk.24.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 226/ 291]                 blk.24.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 227/ 291]              blk.25.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 228/ 291]               blk.25.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 229/ 291]               blk.25.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 230/ 291]                 blk.25.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 231/ 291]               blk.25.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 232/ 291]                 blk.25.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 233/ 291]            blk.25.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 234/ 291]                 blk.25.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 235/ 291]                 blk.25.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 236/ 291]              blk.26.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 237/ 291]               blk.26.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 238/ 291]               blk.26.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 239/ 291]                 blk.26.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 240/ 291]               blk.26.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 241/ 291]                 blk.26.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 242/ 291]            blk.26.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 243/ 291]                 blk.26.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 244/ 291]                 blk.26.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 245/ 291]              blk.27.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 246/ 291]               blk.27.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 247/ 291]               blk.27.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 248/ 291]                 blk.27.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 249/ 291]               blk.27.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 250/ 291]                 blk.27.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 251/ 291]            blk.27.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 252/ 291]                 blk.27.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 253/ 291]                 blk.27.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 254/ 291]              blk.28.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 255/ 291]               blk.28.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 256/ 291]               blk.28.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 257/ 291]                 blk.28.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 258/ 291]               blk.28.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 259/ 291]                 blk.28.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 260/ 291]            blk.28.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 261/ 291]                 blk.28.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 262/ 291]                 blk.28.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 263/ 291]              blk.29.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 264/ 291]               blk.29.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 265/ 291]               blk.29.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 266/ 291]                 blk.29.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 267/ 291]               blk.29.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 268/ 291]                 blk.29.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 269/ 291]            blk.29.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 270/ 291]                 blk.29.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 271/ 291]                 blk.29.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 272/ 291]              blk.30.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 273/ 291]               blk.30.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 274/ 291]               blk.30.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 275/ 291]                 blk.30.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 276/ 291]               blk.30.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 277/ 291]                 blk.30.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 278/ 291]            blk.30.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 279/ 291]                 blk.30.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 280/ 291]                 blk.30.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 281/ 291]               blk.31.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 282/ 291]                 blk.31.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 283/ 291]                 blk.31.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 284/ 291]            blk.31.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 285/ 291]                 blk.31.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_0 .. size =    32.00 MiB ->     9.00 MiB
[ 286/ 291]                 blk.31.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_0 .. size =     8.00 MiB ->     2.25 MiB
[ 287/ 291]                        output.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q6_K .. size =  1002.00 MiB ->   410.98 MiB
[ 288/ 291]              blk.31.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 289/ 291]               blk.31.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_0 .. size =   112.00 MiB ->    31.50 MiB
[ 290/ 291]               blk.31.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 291/ 291]                   output_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
llama_model_quantize_internal: model size  = 15317.02 MB
llama_model_quantize_internal: quant size  =  4437.80 MB

main: quantize time = 61476.12 ms
main:    total time = 61476.12 ms

经过Q4_0量化后，模型的大小从15317.02 MB降低到4437.80 MB，但模型精度从16位浮点数降低到4位整数。

更详细的使用教程请访问：https://github.com/ggerganov/llama.cpp#quantization

8. 模型推理

8.1 main指令

llama.cpp/examples/main

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./main -h

usage: ./main [options]

options:
  -h, --help            show this help message and exit
  --version             show version and build info
  -i, --interactive     run in interactive mode
  --special             special tokens output enabled
  --interactive-specials allow special tokens in user text, in interactive mode
  --interactive-first   run in interactive mode and wait for input right away
  -cnv, --conversation  run in conversation mode (does not print special tokens and suffix/prefix)
  -ins, --instruct      run in instruction mode (use with Alpaca models)
  -cml, --chatml        run in chatml mode (use with ChatML-compatible models)
  --multiline-input     allows you to write or paste multiple lines without ending each in '\'
  -r PROMPT, --reverse-prompt PROMPT
                        halt generation at PROMPT, return control in interactive mode
                        (can be specified more than once for multiple prompts).
  --color               colorise output to distinguish prompt and user input from generations
  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for < 0)
  -t N, --threads N     number of threads to use during generation (default: 128)
  -tb N, --threads-batch N
                        number of threads to use during batch and prompt processing (default: same as --threads)
  -td N, --threads-draft N                        number of threads to use during generation (default: same as --threads)
  -tbd N, --threads-batch-draft N
                        number of threads to use during batch and prompt processing (default: same as --threads-draft)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: empty)
  -e, --escape          process prompt escapes sequences (\n, \r, \t, \', \", \\)
  --prompt-cache FNAME  file to cache prompt state for faster startup (default: none)
  --prompt-cache-all    if specified, saves user input and generations to cache as well.
                        not supported with --interactive or other interactive options
  --prompt-cache-ro     if specified, uses the prompt cache but does not update it.
  --random-prompt       start with a randomized prompt.
  --in-prefix-bos       prefix BOS to user inputs, preceding the `--in-prefix` string
  --in-prefix STRING    string to prefix user inputs with (default: empty)
  --in-suffix STRING    string to suffix after user inputs with (default: empty)
  -f FNAME, --file FNAME
                        prompt file to start generation.
  -bf FNAME, --binary-file FNAME
                        binary file containing multiple choice tasks.
  -n N, --n-predict N   number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
  -c N, --ctx-size N    size of the prompt context (default: 512, 0 = loaded from model)
  -b N, --batch-size N  logical maximum batch size (default: 2048)
  -ub N, --ubatch-size N
                        physical maximum batch size (default: 512)
  --samplers            samplers that will be used for generation in the order, separated by ';'
                        (default: top_k;tfs_z;typical_p;top_p;min_p;temperature)
  --sampling-seq        simplified sequence for samplers that will be used (default: kfypmt)
  --top-k N             top-k sampling (default: 40, 0 = disabled)
  --top-p N             top-p sampling (default: 0.9, 1.0 = disabled)
  --min-p N             min-p sampling (default: 0.1, 0.0 = disabled)
  --tfs N               tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
  --typical N           locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
  --repeat-last-n N     last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
  --repeat-penalty N    penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
  --presence-penalty N  repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
  --frequency-penalty N repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
  --dynatemp-range N    dynamic temperature range (default: 0.0, 0.0 = disabled)
  --dynatemp-exp N      dynamic temperature exponent (default: 1.0)
  --mirostat N          use Mirostat sampling.
                        Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.
                        (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
  --mirostat-lr N       Mirostat learning rate, parameter eta (default: 0.1)
  --mirostat-ent N      Mirostat target entropy, parameter tau (default: 5.0)
  -l TOKEN_ID(+/-)BIAS, --logit-bias TOKEN_ID(+/-)BIAS
                        modifies the likelihood of token appearing in the completion,
                        i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
                        or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
  --grammar GRAMMAR     BNF-like grammar to constrain generations (see samples in grammars/ dir)
  --grammar-file FNAME  file to read grammar from
  -j SCHEMA, --json-schema SCHEMA
                        JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object.
                        For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
  --cfg-negative-prompt PROMPT
                        negative prompt to use for guidance. (default: empty)
  --cfg-negative-prompt-file FNAME
                        negative prompt file to use for guidance. (default: empty)
  --cfg-scale N         strength of guidance (default: 1.000000, 1.0 = disable)
  --rope-scaling {none,linear,yarn}
                        RoPE frequency scaling method, defaults to linear unless specified by the model
  --rope-scale N        RoPE context scaling factor, expands context by a factor of N
  --rope-freq-base N    RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
  --rope-freq-scale N   RoPE frequency scaling factor, expands context by a factor of 1/N
  --yarn-orig-ctx N     YaRN: original context size of model (default: 0 = model training context size)
  --yarn-ext-factor N   YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
  --yarn-attn-factor N  YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
  --yarn-beta-slow N    YaRN: high correction dim or alpha (default: 1.0)
  --yarn-beta-fast N    YaRN: low correction dim or beta (default: 32.0)
  --pooling {none,mean,cls}
                        pooling type for embeddings, use model default if unspecified
  -dt N, --defrag-thold N
                        KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
  --ignore-eos          ignore end of stream token and continue generating (implies --logit-bias 2-inf)
  --penalize-nl         penalize newline tokens
  --temp N              temperature (default: 0.8)
  --all-logits          return logits for all tokens in the batch (default: disabled)
  --hellaswag           compute HellaSwag score over random tasks from datafile supplied with -f
  --hellaswag-tasks N   number of tasks to use when computing the HellaSwag score (default: 400)
  --winogrande          compute Winogrande score over random tasks from datafile supplied with -f
  --winogrande-tasks N  number of tasks to use when computing the Winogrande score (default: 0)
  --multiple-choice     compute multiple choice score over random tasks from datafile supplied with -f
  --multiple-choice-tasks N number of tasks to use when computing the multiple choice score (default: 0)
  --kl-divergence       computes KL-divergence to logits provided via --kl-divergence-base
  --keep N              number of tokens to keep from the initial prompt (default: 0, -1 = all)
  --draft N             number of tokens to draft for speculative decoding (default: 5)
  --chunks N            max number of chunks to process (default: -1, -1 = all)
  -np N, --parallel N   number of parallel sequences to decode (default: 1)
  -ns N, --sequences N  number of sequences to decode (default: 1)
  -ps N, --p-split N    speculative decoding split probability (default: 0.1)
  -cb, --cont-batching  enable continuous batching (a.k.a dynamic batching) (default: disabled)
  -fa, --flash-attn     enable Flash Attention (default: disabled)
  --mmproj MMPROJ_FILE  path to a multimodal projector file for LLaVA. see examples/llava/README.md
  --image IMAGE_FILE    path to an image file. use with multimodal models. Specify multiple times for batching
  --mlock               force system to keep model in RAM rather than swapping or compressing
  --no-mmap             do not memory-map model (slower load but may reduce pageouts if not using mlock)
  --numa TYPE           attempt optimizations that help on some NUMA systems
                          - distribute: spread execution evenly over all nodes
                          - isolate: only spawn threads on CPUs on the node that execution started on
                          - numactl: use the CPU map provided by numactl
                        if run without this previously, it is recommended to drop the system page cache before using this
                        see https://github.com/ggerganov/llama.cpp/issues/1437
  --rpc SERVERS         comma separated list of RPC servers
  --verbose-prompt      print a verbose prompt before generation (default: false)
  --no-display-prompt   don't print prompt at generation (default: false)
  -gan N, --grp-attn-n N
                        group-attention factor (default: 1)
  -gaw N, --grp-attn-w N
                        group-attention width (default: 512.0)
  -dkvc, --dump-kv-cache
                        verbose print of the KV cache
  -nkvo, --no-kv-offload
                        disable KV offload
  -ctk TYPE, --cache-type-k TYPE
                        KV cache data type for K (default: f16)
  -ctv TYPE, --cache-type-v TYPE
                        KV cache data type for V (default: f16)
  --simple-io           use basic IO for better compatibility in subprocesses and limited consoles
  --lora FNAME          apply LoRA adapter (implies --no-mmap)
  --lora-scaled FNAME S apply LoRA adapter with user defined scaling S (implies --no-mmap)
  --lora-base FNAME     optional model to use as a base for the layers modified by the LoRA adapter
  --control-vector FNAME
                        add a control vector
  --control-vector-scaled FNAME S
                        add a control vector with user defined scaling S
  --control-vector-layer-range START END
                        layer range to apply the control vector(s) to, start and end inclusive
  -m FNAME, --model FNAME
                        model path (default: models/$filename with filename from --hf-file or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
  -md FNAME, --model-draft FNAME
                        draft model for speculative decoding (default: unused)
  -mu MODEL_URL, --model-url MODEL_URL
                        model download url (default: unused)
  -hfr REPO, --hf-repo REPO
                        Hugging Face model repository (default: unused)
  -hff FILE, --hf-file FILE
                        Hugging Face model file (default: unused)
  -ld LOGDIR, --logdir LOGDIR
                        path under which to save YAML logs (no logging if unset)
  -lcs FNAME, --lookup-cache-static FNAME
                        path to static lookup cache to use for lookup decoding (not updated by generation)
  -lcd FNAME, --lookup-cache-dynamic FNAME
                        path to dynamic lookup cache to use for lookup decoding (updated by generation)
  --override-kv KEY=TYPE:VALUE
                        advanced option to override model metadata by key. may be specified multiple times.
                        types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
  -ptc N, --print-token-count N
                        print token count every N tokens (default: -1)
  --check-tensors       check model tensor data for invalid values

log options:
  --log-test            Run simple logging test
  --log-disable         Disable trace logs
  --log-enable          Enable trace logs
  --log-file            Specify a log filename (without extension)
  --log-new             Create a separate new log file on start. Each log file will have unique name: "<name>.<ID>.log"

参数解释

命令描述 -m 指定 LLaMA 模型文件的路径 -mu 指定远程 http url 来下载文件 -i 以交互模式运行程序 -ins 以指令模式运行程序，类似ChatGPT的对话交流模式 -f 指定prompt模板，alpaca模型请加载prompts/alpaca.txt指令模板 -n 控制回复生成的最大长度（默认：-1，表示无穷大） -c 设置提示上下文的大小，值越大越能参考更长的历史对话（默认：512） -b 控制batch size（默认：2048） -t 控制线程数量（默认：128） --repeat_penalty 控制生成回复中对重复文本的惩罚力度 --temp 温度系数，值越低回复的随机性越小 --top_p, top_k 控制解码采样的相关参数 --color 区分用户输入和生成的文本

更详细的官方说明请参考：https://github.com/ggerganov/llama.cpp/tree/master/examples/main

8.2 CPU推理

启动CPU推理，程序卡住，CPU利用率90%以上。

# 以指令模式执行推理
./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1
Log start
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1723114741
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 256.
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors:        CPU buffer size =  4437.80 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =   258.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 128 / 255 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.200
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 2048, n_batch = 2048, n_predict = 256, n_keep = 19


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

Below is an instruction that describes a task. Write a response that appropriately completes the request.
> hi
Hello! I'm happy to help. Please provide more context or clarify what you would like me to assist you with, and I'll do my best to respond accordingly.
>

在提示符 > 之后输入你的prompt，command+c中断输出，多行信息以\作为行尾。如需查看帮助和参数说明，请执行./main -h命令。

8.3 国产异构加速卡推理

使用-ngl N或者 --n-gpu-layers N参数，表示加载到GPU的网络层数。

# 指定GPU
export HIP_VISIBLE_DEVICES="0"

# 指定GFX version版本
export HSA_OVERRIDE_GFX_VERSION=9.2.8

# 以指令模式执行推理
./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap

# 或者
export HSA_OVERRIDE_GFX_VERSION=9.2.8 && export HIP_VISIBLE_DEVICES=0 && ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap

(llama.cpp) root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/Downloads/llama.cpp# export HSA_OVERRIDE_GFX_VERSION=9.2.8 && export HIP_VISIBLE_DEVICES=0 && ./main -m models/ggml-vocab-llama3-8B-instruct-q4_0.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1 --n_gpu_layers 40 --no-mmap
Log start
main: build = 3045 (59b0d077)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1723178798
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/ggml-vocab-llama3-8B-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 256.
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 ROCm devices:
  Device 0: DCU K100_AI, compute capability 9.2, VMM: no
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  4155.99 MiB
llm_load_tensors:  ROCm_Host buffer size =   281.81 MiB
.............................

总结

【国产异构加速卡】基于llama.cpp实现Llama3模型的guff格式转换、4bit量化以及推理加速

序言

一、参考资料

二、llama.cpp相关介绍

1. llama.cpp简介

2. llama.cpp 优势

3. llama.cpp目标

4. GGUT格式

4.1 引言

4.2 GGUT简介

4.3 GGUT优势

5. GGML

5.1 ggml简介

5.2 ggml目标

6. llama-cpp-python

三、快速体验llama.cpp

1. 准备环境

requirements.txt

envs.yaml

2. 下载llama.cpp

3. 编译llama.cpp

3.1 编译CPU版本

3.2 编译GPU版本（hipBLAS）

4. 准备模型

4.1 下载原版LLaMA模型

4.2 下载gguf模型

4.3 下载HuggingFace模型

5.（可选）合并LoRA权重

6. 转换gguf格式

6.1 convert脚本

6.2 convert.py

6.3 执行转换

7. 量化模型

7.1 查看量化类型

7.2 执行量化

8. 模型推理

8.1 main指令

8.2 CPU推理

8.3 国产异构加速卡推理

1. `llama.cpp`简介

2. `llama.cpp` 优势

`requirements.txt`

`envs.yaml`