Running Llama-3.1-Minitron-4B with the Transformers library

My laptop is a relic from eight years ago and can barely run an 8B model, so I decided to try something smaller...

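About Llama-3.1-Minitron 4B

Llama-3.1-Minitron 4B is a compact language model derived from Llama-3.1 8B through structured weight pruning and knowledge distillation.

It comes in two base variants, Width-Base and Depth-Base. The model files are available on Hugging Face and on its mirror, HF-Mirror.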

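Why not run it with Ollama

The reason is simple: my computer can't handle it... I originally tried to manage the model with Ollama, but it threw a runtime error that I can't resolve yet. It looks like a failure to read the result, caused either by insufficient CPU resources or by a faulty model conversion. I plan to retry with a lower quantization precision after military training.

So for now I'm back to the transformers library plus Python.

I don't need to interact with this small model directly anyway, so it's no big deal.

Scratch that, it is a big deal: no matter what I tried, I could not get transformers to use the 940MX, while Ollama could. I couldn't read Ollama's port properly, but its responses were dozens of times faster.

Fortunately, running Llama-3.1-Minitron-4B with the transformers library is simple : )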

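1. Download the model

You can download the model files directly from Hugging Face, but I recommend HF-Mirror:

Width-Base model, cloned from HF-Mirror

git lfs install  # set up Git LFS (must already be installed) so large files are downloaded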
git clone https://hf-mirror.com/nvidia/Llama-3.1-Minitron-4B-Width-Base

Or the Depth-Base model, cloned from HF-Mirror

git lfs install   # set up Git LFS (must already be installed) so large files are downloaded
git clone https://hf-mirror.com/nvidia/Llama-3.1-Minitron-4B-Depth-Base

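You only need one of the two; downloading both just takes longer.

If the clone succeeds, you can move on to running the model;

if the clone fails or files are missing, download the files manually and verify them.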
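A quick way to tell whether a clone is complete is to look for the expected files and make sure none of them is still a Git LFS pointer (the small text stub left behind when LFS did not fetch the real data). The following is a minimal sketch, not part of the original workflow: `check_clone` and `EXPECTED_FILES` are hypothetical names, and the file list is an assumption (the two weight shards mentioned later in this post, plus `config.json`).

```python
from pathlib import Path

# First line of a Git LFS pointer file (what remains when LFS did not fetch the data)
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

# Assumed file list: the two weight shards named in this post, plus config.json
EXPECTED_FILES = (
    "config.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
)


def check_clone(model_dir, expected=EXPECTED_FILES):
    """Return a list of problems found in a cloned model directory (empty if none)."""
    problems = []
    for name in expected:
        path = Path(model_dir) / name
        if not path.is_file():
            problems.append(f"missing: {name}")
            continue
        # Read only the first bytes; real weight files are far too large to load
        with path.open("rb") as f:
            head = f.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            problems.append(f"LFS pointer only (not downloaded): {name}")
    return problems
```

Calling `check_clone` on the cloned folder and getting an empty list back suggests the files are at least present; it does not replace the hash check described next.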

Manual download locations:

File listing pages (on the mirror) for the two models:

nvidia/Llama-3.1-Minitron-4B-Width-Base at main · HF Mirror

nvidia/Llama-3.1-Minitron-4B-Depth-Base at main · HF Mirror

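How to verify:

Open the file listing page and click the link for one of the files (marked with a red box in the screenshot),

scroll down, find the checksum (SHA256) information for the key files, and note it down.

In a Windows command prompt, change to the directory that holds these files and compute their hashes

cd <directory containing the files to verify>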
certutil -hashfile model-00001-of-00002.safetensors SHA256
certutil -hashfile model-00002-of-00002.safetensors SHA256

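The files are large, so be patient while the hashes are computed.

If the returned hash matches the value you noted earlier, the download is correct; otherwise download the file again and re-verify.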
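If you would rather do the comparison in Python than with `certutil`, the standard `hashlib` module can compute the same SHA256 digest. This is a minimal sketch; `sha256_of_file` and `matches_recorded_hash` are hypothetical helper names, and the 1 MiB chunk size is an arbitrary choice.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA256 hex digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_hex):
    """Compare against the noted value, ignoring letter case and surrounding whitespace."""
    return sha256_of_file(path) == recorded_hex.strip().lower()
```

For example, `matches_recorded_hash("model-00001-of-00002.safetensors", "<hash noted from the page>")` returns `True` only when the download is intact.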

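For files that have no hash listed but do have a size, you can check the file's size in bytes in its Properties dialog.

For example, for these two files the sizes do not match, so the hashes cannot match either, and they should be downloaded again.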
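The size check can also be scripted. A minimal sketch, assuming you have the expected size in bytes; `size_matches` is a hypothetical helper, and `tolerance_bytes` is an extra I added for cases where the page only lists a rounded size.

```python
import os


def size_matches(path, expected_bytes, tolerance_bytes=0):
    """Return (ok, actual_size); ok is True when the size is within the tolerance."""
    actual = os.path.getsize(path)
    return abs(actual - expected_bytes) <= tolerance_bytes, actual
```

A size mismatch is a cheap early signal: it takes no time to check, whereas hashing a multi-gigabyte file takes a while.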

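2. Run with the transformers library

First, install the transformers and torch libraries if you don't have them. The script below passes device_map to from_pretrained, which also requires the accelerate package.

pip install transformers torch accelerate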

Then you can run it.

Remember to change the path in the code to your own first.
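On Windows the path needs some care: in an ordinary Python string a single backslash starts an escape sequence (`"C:\Users"` is even a syntax error, because `\U` begins a Unicode escape). The sketch below shows three spellings of the same path; the path itself is the one from this post, so substitute your own. That forward slashes are accepted everywhere you pass the path is my assumption, not something the post tested.

```python
from pathlib import PureWindowsPath

# Doubled backslashes, as in the listing in this post
escaped = "C:\\Users\\LingL\\llama.cpp\\models\\Llama-3.1-Minitron-4B-Depth-Base"
# Raw string: backslashes are taken literally
raw = r"C:\Users\LingL\llama.cpp\models\Llama-3.1-Minitron-4B-Depth-Base"
# Forward slashes: Windows path handling treats them as separators too
forward = "C:/Users/LingL/llama.cpp/models/Llama-3.1-Minitron-4B-Depth-Base"

assert escaped == raw
assert PureWindowsPath(forward) == PureWindowsPath(escaped)
```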

# Not the official example; I wrote this one myself (chin up)
# The official one can be found on the model's release page
from transformers import AutoTokenizer, LlamaForCausalLM
import torch


# Model path
# This one is mine; change it to your own model path, and remember the doubled backslashes ↓
model_path = "C:\\Users\\LingL\\llama.cpp\\models\\Llama-3.1-Minitron-4B-Depth-Base"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_path)
dtype = torch.bfloat16  # 16-bit weights to save memory (half the size of float32)
# device_map="auto" lets Hugging Face place the model automatically (needs the accelerate package)
model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=dtype, device_map="auto")


# Report whether CUDA is available; the model itself was already placed by device_map above
print("-------------------------------------\n")
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name()}")
else:
    # Not available, so be it
    print("CUDA is not available, using CPU.")
print("-------------------------------------\n")
# Inputs must be on the same device as the model
device = model.device



# Feed one prompt to the model and return the generated text
def run_test(input_text):
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
    attention_mask = torch.ones_like(input_ids)  # all ones: a single prompt has no padding
    with torch.no_grad():
        output = model.generate(input_ids,
                                attention_mask=attention_mask,  # pass the attention mask
                                max_length=50,  # total length limit (prompt + generated tokens), so it must exceed the prompt length
                                repetition_penalty=1.2, # penalize repetition
                                pad_token_id=tokenizer.eos_token_id,
                                num_return_sequences=1,  # return exactly one sequence
                                do_sample=False)  # disable sampling to reduce randomness
    return tokenizer.decode(output[0], skip_special_tokens=True)


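# Prompts to test; they are fed to the model in order
input_texts = [
    # 1. Mind the commas: without one, two adjacent prompts are joined and fed in as a single prompt
    # 2. Each prompt must be shorter than the max_length limit set above
    "Explain the theory of relativity: It is", # text completion
    "1,1,2,3,5,8,13,21,34,55,",                # sequence reasoning
    "The Sky is",                              # free association

]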


# Run the tests and print the results
for text in input_texts:
    print(f"\n▶Input: {text}\n▶Output: {run_test(text)}\n")

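The comma note in the prompt list is worth a concrete look, because Python raises no error when a comma is missing: adjacent string literals are silently concatenated. A small standalone sketch (the list names are made up for illustration):

```python
# Two adjacent string literals with no comma between them become ONE string
prompts_missing_comma = [
    "1,1,2,3,5,8,13,21,34,55,"   # no comma after this literal
    "The Sky is",
]
prompts_with_comma = [
    "1,1,2,3,5,8,13,21,34,55,",
    "The Sky is",
]

print(len(prompts_missing_comma))  # 1
print(prompts_missing_comma[0])    # 1,1,2,3,5,8,13,21,34,55,The Sky is
print(len(prompts_with_comma))     # 2
```

If a test prints fewer results than you expect, a missing comma in the list is the first thing to check.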
Sample output for reference:

▶Input: Explain the theory of relativity: It is
▶Output: Explain the theory of relativity: It is a theory of physics that explains how space and time are related to each other. It was developed by Albert Einstein in 1905 and has since been confirmed by numerous experiments.


▶Input: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55,
▶Output: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976


▶Input: The sky is
▶Output: The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling! The sky is falling
# Someone should check on the model's mental state

Some additional basic reasoning tests

Algebraic addition (m=0, n=5, o=m+n, o=5): performed well

▶Input: a=1, b=2, c=a+b, c=3. d=1, e=5, f=d+e, f=6. g=1, h=3, i=g+h, i=4. j=2, k=3, l=j+k, l=5. m=0, n=5, o=m+n, o=
▶Output: a=1, b=2, c=a+b, c=3. d=1, e=5, f=d+e, f=6. g=1, h=3, i=g+h, i=4. j=2, k=3, l=j+k, l=5. m=0, n=5, o=m+n, o=5. p=1, q=2, r=p+q, r=3. s=1, t=2, u=s+t, u=3.

Mapping letters to numbers (letters numbered 1 to 26 in alphabetical order, so e=5): performed well

▶Input: a=1,z=26,e=?
▶Output: a=1,z=26,e=? (a=1,z=26,e=5) (a=1,z

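Predicting the next letters (in keyboard key order, with a comma after every 3 letters): performed fairly well; the first two continuations (jkl, zxc) are right, after which the output repeats the earlier cvb, nmq, ... groups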

▶Input: qwe, rty, uio, pas, dfg, hjk, lzx, cvb, nmq, wer, tyu, iop, asd, fgh, j
▶Output: qwe, rty, uio, pas, dfg, hjk, lzx, cvb, nmq, wer, tyu, iop, asd, fgh, jkl, zxc, cvb, nmq, wer, tyu, iop, asd, fgh, jkl, zxc, cvb, nmq, wer, tyu, iop, asd, fgh, jkl, zxc, cvb, nmq, wer, tyu, iop, asd, fgh, jkl, zxc, cvb, nmq, wer, ty
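To compare the output with the intended pattern, the expected groups can be generated directly. This is a minimal sketch under my reading of the prompt: the letter keys of a QWERTY keyboard, row by row, cut into groups of three and wrapping around after `m`. `KEYS` and `expected_groups` are names I made up.

```python
from itertools import cycle, islice

KEYS = "qwertyuiopasdfghjklzxcvbnm"  # letter keys of a QWERTY keyboard, row by row


def expected_groups(count, size=3):
    """Return the first `count` groups of `size` letters, wrapping around after 'm'."""
    letters = cycle(KEYS)
    return ["".join(islice(letters, size)) for _ in range(count)]


print(", ".join(expected_groups(17)))
```

Under this reading, the groups that follow `fgh` are `jkl, zxc, vbn`, so the output shown above matches for two groups and diverges at the third.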

Continuing the Fibonacci sequence: performed well

▶Input: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55,
▶Output: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976
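Rather than eyeballing a line that long, the output can be checked mechanically. A minimal sketch; `is_fibonacci_continuation` is a hypothetical helper that accepts the comma-separated text as printed.

```python
def is_fibonacci_continuation(text):
    """Return True if every number after the first two is the sum of the previous two."""
    numbers = [int(part) for part in text.split(",") if part.strip()]
    if len(numbers) < 3:
        return False
    return all(
        numbers[i] == numbers[i - 1] + numbers[i - 2]
        for i in range(2, len(numbers))
    )
```

Applied to the output line above, it returns `True`. Note that if generation stops in the middle of a number, the truncated last value makes the check fail, so drop an incomplete final number first.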

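Summary

### Article summary
This post describes how the author, after struggling to run an 8B Llama model on an old laptop, tried the more compact Llama-3.1-Minitron 4B model instead. The 4B model is derived from Llama-3.1 8B through weight pruning and knowledge distillation, and comes in two base variants, Width-Base and Depth-Base, for users with limited resources.
#### Challenges
1. **Ollama was not usable**: possibly because of insufficient CPU resources or a model conversion error, the author could not run the model through Ollama on the old laptop for now.
2. **The transformers library and Python** became the interim solution: although less convenient for interaction, they could load and run the Llama-3.1-Minitron 4B model successfully.
#### Steps for using the Llama-3.1-Minitron 4B model
##### 1. Download the model
- **Recommended**: clone from HF-Mirror, a Hugging Face mirror, using Git LFS.
- Width-Base: `git clone https://hf-mirror.com/nvidia/Llama-3.1-Minitron-4B-Width-Base`
- Depth-Base: `git clone https://hf-mirror.com/nvidia/Llama-3.1-Minitron-4B-Depth-Base`
- **Manual download and verification**: open the model's file listing page in a browser, download the files, and compare them against the published checksum information.
##### 2. Run with the transformers library
- **Install the required libraries**: first install the `transformers` and `torch` libraries (if not already installed).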
```bash
pip install transformers torch
```
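- **Code example**:
- Set the model path, then load the tokenizer and the model.
- Use torch to detect whether a GPU or the CPU is available, process the model input, and generate a response.
- The example code provides a `run_test` function that takes an input text and returns the output.
#### Test results
- The author tested the model with several prompts, covering text completion, sequence reasoning, and free association, and reported the model's actual outputs.
- The reasoning tests cover algebraic addition, letter-to-number mapping, keyboard sequence prediction, and Fibonacci continuation; the model performed well.
#### Conclusion
Despite limited hardware, the author got a language model running on an old laptop by choosing and configuring the Llama-3.1-Minitron 4B model. The post covers the whole process, from downloading and verifying the model to the actual code, and can serve as a reference for others with similar needs.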