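Pretraining a tiny-llama from Scratch #Datawhale Team Study Task2

For the complete tutorial, see: datawhalechina/tiny-universe: "A White-Box Guide to Building Large Models": a fully hand-built Tiny-Universe (github.com)

These are my study notes for Task2.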

Qwen-blog

Tokenizer

Embedding

RMS Norm (Root Mean Square Layer Normalization)

The enumerate function

Introduction to Flash Attention

GQA

RoPE (Rotary Position Embedding)

Tiny-llama

Qwen-blog

Because llama and Qwen have similar architectures, these notes follow the Qwen architecture.

Tokenizer

Q: What is a tokenizer?

A: A tokenizer's main job is to split a piece of text into smaller meaningful units called tokens.

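Embedding

Q: What is an embedding?

A: An embedding maps discrete symbols (such as words or tokens) into a continuous vector space. The vectors capture semantic relationships between words, so a word's meaning is represented by its position in that space.

The tokenizer turns raw text into a sequence of tokens and is part of text preprocessing. The embedding then turns that token sequence into numeric vectors that a machine learning model can process.

In practice, the text is tokenized first, the resulting tokens are converted to vectors by the embedding layer, and those vectors are what the model is trained on.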
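The tokenizer-then-embedding pipeline described above can be sketched in a few lines of plain Python. This is only a toy illustration: the corpus, the whitespace tokenizer and the randomly initialized embedding table are all assumptions made for the example, whereas real models use subword tokenizers and learned embeddings.

```python
import random

# Hypothetical toy corpus, for illustration only.
corpus = ["one day lily met a cat", "lily met a dog"]

# Step 1: build a vocabulary that maps each token to an integer id.
vocab = {"<unk>": 0}
for sentence in corpus:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))


def tokenize(text):
    """Split on whitespace and map every token to its id (0 if unknown)."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]


# Step 2: an embedding table holds one vector (row) per token id.
# Here the rows are random; in a real model they are learned during training.
random.seed(0)
embed_dim = 4
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in vocab
]


def embed(token_ids):
    """Look up the vector of every token id."""
    return [embedding_table[i] for i in token_ids]


ids = tokenize("lily met a shoggoth")
vectors = embed(ids)
print(ids)                            # "shoggoth" is unknown, so it maps to id 0
print(len(vectors), len(vectors[0]))  # one vector of size embed_dim per token
```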

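Q: Which libraries are recommended for tokenization?

A: 1) NLTK (Natural Language Toolkit). NLTK is a powerful natural language processing library that provides several tokenizers along with other NLP tools. 2) Transformers (Hugging Face). The Transformers library provides many pretrained Transformer models together with the tokenizers that match them.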

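RMS Norm (Root Mean Square Layer Normalization)

Q: What is RMSNorm?

A: RMSNorm is a simplification of Layer Normalization (LN). LN subtracts the mean of each sample's features and divides by their standard deviation. RMSNorm skips the mean subtraction: it divides the features by their root mean square (RMS) and then multiplies by a learnable gain. Like LN, and unlike Batch Normalization (BN), it normalizes each sample over its own feature dimension, so it does not depend on batch statistics. Dropping the mean computation makes it slightly cheaper than LN.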
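To make the difference concrete, here is a minimal plain-Python sketch of both normalizations applied to a single feature vector. The function names, the `eps` value and the sample vector are assumptions chosen for illustration; an actual model would use a tensor implementation with learned gains.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide the features by their root mean square, then apply the gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]


def layer_norm(x, gain, bias, eps=1e-6):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(variance + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]


hidden = [1.0, 2.0, 3.0, 4.0]          # hypothetical feature vector of one token
gain = [1.0] * len(hidden)
bias = [0.0] * len(hidden)

rms_out = rms_norm(hidden, gain)
ln_out = layer_norm(hidden, gain, bias)

# RMSNorm output has RMS close to 1 but keeps its non-zero mean;
# LayerNorm output is additionally centered around 0.
print(rms_out)
print(ln_out)
```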

The enumerate function

enumerate() is a Python built-in that yields each element of an iterable (a list, tuple, string and so on) together with its index. It is the convenient choice whenever a loop needs both the element and its position.
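A short example of the behavior described above. The list of layer names is made up for illustration; `start` is the optional second argument of `enumerate()` that sets the first index.

```python
layers = ["embedding", "attention", "mlp"]  # hypothetical names

# Default: indices start at 0.
for index, name in enumerate(layers):
    print(index, name)

# The optional start argument shifts the first index.
numbered = list(enumerate(layers, start=1))
print(numbered)
```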

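Introduction to Flash Attention

Q: What is Flash Attention?

A: Flash Attention is an optimization that speeds up the self-attention mechanism in Transformer models. Standard self-attention needs O(N^2) time and O(N^2) memory, where N is the sequence length, because it materializes the full N × N attention matrix. Flash Attention computes the same output without ever storing that matrix, which reduces the extra memory to O(N) and greatly cuts the traffic between GPU high-bandwidth memory and on-chip SRAM. The result is exact, not an approximation.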

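Q: How does Flash Attention work, and how does it achieve this optimization?

A: The core idea is IO-awareness, not sparsity or approximation. Q, K and V are split into blocks small enough to fit in fast on-chip SRAM, and attention is computed block by block (tiling). The softmax is evaluated in an online (streaming) fashion: a running maximum and a running normalizer are kept, so the partial result from earlier blocks can be rescaled as each new block arrives. In the backward pass the attention matrix is recomputed from the saved normalization statistics instead of being read back from memory. The output matches standard attention up to floating-point rounding.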

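Q: Compared with standard self-attention, where does Flash Attention improve things?

A: Standard self-attention has four steps: 1) compute the queries, keys and values; 2) compute the attention scores; 3) normalize with softmax; 4) take the weighted sum of the values. Flash Attention does not change the math of any step. It fuses steps 2 to 4 into a single kernel that handles one block at a time, so the N × N score and probability matrices are never written out to GPU memory.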
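The block-by-block softmax is the part that is easiest to misread, so here is a minimal plain-Python sketch for a single query vector. It is not the real GPU kernel: the function names, `block_size` and the sample data are assumptions for illustration. It only shows that processing keys and values in blocks with a running maximum and normalizer gives the same answer as the standard computation.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query: materialize every score, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, visiting keys/values one block at a time.

    Only a running maximum, a running normalizer and a running weighted sum
    are kept, so the full row of scores is never stored.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Hypothetical data: one query, five keys, five 2-dimensional values.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, -0.3, 1.5], [-1.0, 2.0, 0.0],
        [0.7, 0.7, 0.7], [3.0, -1.0, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, -0.7]]

reference = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
print(reference)
print(tiled)
```

The two printed vectors agree up to floating-point rounding, which is the sense in which this tiling scheme is exact.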

GQA

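Q: What is GQA?

A: Grouped-query attention (GQA) is a variant of multi-head attention designed to make Transformer models more efficient. The query heads are divided into groups, and each group shares a single key head and a single value head. Having fewer key/value heads shrinks the KV cache and the memory bandwidth needed during decoding, while keeping model quality close to that of standard multi-head attention.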

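Q: During the computation, Q, K and V are each transposed. Why?

A: The transposes put the data into the layout that multi-head attention needs. The projected query, key and value start with shape (B, L, D), where B is the batch size, L is the sequence length and D is the hidden dimension. Each one is split into heads and transposed to (B, H, L, Dh), so that every head works on its own (L, Dh) slice. When the attention scores are computed, key is transposed once more, from (B, H, L, Dh) to (B, H, Dh, L), so that the matrix product of query and key yields the dot products.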

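Scaled dot-product attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{D_h}}\right)V$$

where $D_h$ is the per-head dimension.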

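Q: In this code, why use expand followed by reshape instead of the tensor's built-in repeat?

A: In PyTorch, both expand() and repeat() can replicate a tensor's elements to change its shape. expand() does not copy data; it returns a view that points at the original storage, so by itself it adds no memory. repeat() physically copies the data, so it increases memory use. Note that the reshape that follows expand() generally has to materialize a copy as well, so the ordering matters just as much: inserting a new axis, expanding it and reshaping places the copies of each key/value head next to each other (the same result as torch.repeat_interleave), which matches how query heads are grouped, whereas repeat() along the head axis tiles the whole sequence of heads.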
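The difference in head ordering is easy to check on a tiny tensor. The sketch below assumes PyTorch is installed; the shapes, `n_rep` and the variable names are hypothetical and only loosely follow the usual key/value repetition pattern, so treat it as an illustration rather than the original code.

```python
import torch

batch, num_kv_heads, seq_len, head_dim = 1, 2, 3, 4
n_rep = 2  # assumption: each key/value head is shared by n_rep query heads

# Fill each KV head with its own index so the head order is easy to read.
kv = torch.arange(num_kv_heads, dtype=torch.float32).view(1, num_kv_heads, 1, 1)
kv = kv.expand(batch, num_kv_heads, seq_len, head_dim).contiguous()

# expand + reshape: add a new axis, broadcast it (no copy yet),
# then merge it into the head axis.
expanded = kv[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
grouped = expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

# repeat along the head axis tiles the whole block of heads instead.
tiled = kv.repeat(1, n_rep, 1, 1)

grouped_order = grouped[0, :, 0, 0].tolist()
tiled_order = tiled[0, :, 0, 0].tolist()
print(grouped_order)  # copies of the same head are adjacent
print(tiled_order)    # heads alternate

# expand() itself is a view on the original storage.
print(expanded.data_ptr() == kv.data_ptr())
```

With the grouped ordering, query heads 0 and 1 line up with key/value head 0 and query heads 2 and 3 with key/value head 1, which is the pairing GQA expects.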

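RoPE (Rotary Position Embedding)

Q: What is positional encoding?

A: In natural language processing tasks, the model needs to know the order of the words in a text. Self-attention by itself is order-agnostic, so positional encoding (PE) is the mechanism that injects position information into sequence models such as the Transformer.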

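Q: How is positional encoding designed in the Transformer?

A: In the original Transformer paper, the positional encoding is computed by a deterministic function of the position and the dimension index. Concretely, the encoding vector is built from sine and cosine functions of different frequencies.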
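As a concrete illustration of that sine/cosine construction, the sketch below computes the encoding vector of a single position in plain Python. The function name and the small `d_model` are assumptions for the example; the base of 10000 is the value used in the original paper.

```python
import math


def sinusoidal_position_encoding(position, d_model, base=10000.0):
    """Encoding of one position: sin on even dimensions, cos on odd dimensions."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses its own wavelength.
        angle = position / (base ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]


pe_0 = sinusoidal_position_encoding(0, d_model=8)
pe_5 = sinusoidal_position_encoding(5, d_model=8)
print(pe_0)  # position 0 gives sin(0) and cos(0) in every pair
print(pe_5)
```

In the original design this vector is added to the token embedding at the same position.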

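Q: What are the limitations of this encoding?

A: The main limitations of the original Transformer positional encoding are:

1) Length limit. In practice the encoding table is precomputed up to a predefined maximum sequence length, and the model generalizes poorly to positions beyond those seen in training. Sequences of different lengths are handled by truncating or padding to fit.

2) Extra cost. The position vectors have to be added to the token embeddings, which is an extra computation, and the precomputed table takes memory that grows with the sequence length.

3) Absolute position only. The original encoding provides absolute positions but does not expose relative positions directly. Relative position matters for capturing local patterns and for long-range dependencies. In long sequences, the encoding may also not carry enough information to tell distant tokens apart.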

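Q: What does Rotary Position Embedding (RoPE) improve?

A: The main improvements of RoPE are:

1) Flexible sequence length. The rotation angles are a function of the position index, so they can be generated on demand for any length, and the cos/sin cache can simply be extended when a longer sequence arrives. Quality at lengths far beyond those seen in training can still degrade, so this is flexibility in computation rather than a guarantee of extrapolation.

2) Low compute and memory overhead. RoPE has no learned position parameters, so it does not add to the model's parameter count. The rotation itself is a cheap element-wise operation on the query and key vectors.

3) Relative position information. Because queries and keys are both rotated, their dot product depends on the distance between the two positions. This helps the model capture local patterns and handle long-range dependencies.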

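Q: RoPE is said to combine absolute and relative position information through rotation. How should this be understood?

A: Absolute position: each position i is encoded into the vector through a rotation matrix Ri. Every position has its own rotation matrix, so the vector at each position gets a distinct representation.

Relative position: take two positions i and j with rotation matrices Ri and Rj. The attention score is the dot product of the rotated query and the rotated key, and since Ri^T Rj = R(j-i), that dot product depends only on the offset j - i, not on i and j individually.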
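The relative-position property can be verified numerically with 2-dimensional vectors, where the rotation matrix is just a plane rotation. This is a minimal sketch: the angle step `theta` and the sample vectors are arbitrary assumptions, and real RoPE applies such a rotation to every pair of dimensions with a different frequency per pair.

```python
import math


def rotate(pair, position, theta=0.5):
    """Rotate a 2-D vector by the angle position * theta."""
    angle = position * theta
    x, y = pair
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def score(q, k, i, j):
    """Dot product of the query rotated to position i and the key rotated to position j."""
    q_rot = rotate(q, i)
    k_rot = rotate(k, j)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]


q = (1.0, 2.0)    # hypothetical query vector
k = (0.5, -1.5)   # hypothetical key vector

near = score(q, k, 1, 4)        # offset j - i = 3
far = score(q, k, 101, 104)     # same offset, very different absolute positions
other = score(q, k, 1, 5)       # offset 4

print(near, far, other)
```

`near` and `far` agree up to floating-point rounding because only the offset enters the score, while `other` differs because the offset changed.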

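Q: How is it implemented in code?

A: The code defines a class that generates the rotary position embedding (RoPE). It precomputes cos and sin caches to speed up the rotation, and it can rebuild the caches on the fly to support longer sequences. The steps are:

1) Initialization: store the required parameters and compute the inverse frequencies.
2) Cache generation: build the cos and sin caches.
3) Forward pass: return the cos and sin values for the requested length, rebuilding the caches first if needed.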

import torch
from torch import nn


class Qwen2RotaryEmbedding(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()
        # Store the configuration
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        # Inverse frequencies: one rotation rate per pair of dimensions
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )
    # Build a unique set of rotation angles for every position in the sequence
    # (outer product of positions and inverse frequencies)
    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)

        freqs = torch.outer(t, self.inv_freq)
        # Duplicate the angles to cover the full head dimension, then register
        # the cos/sin caches as buffers (self.cos_cached and self.sin_cached)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)

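    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        # Fall back to the sequence length of x when seq_len is not given,
        # since comparing None with an int would raise a TypeError.
        if seq_len is None:
            seq_len = x.shape[-2]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)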

        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )
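The class above only returns the cos and sin tables; the rotation itself is applied to the queries and keys afterwards. The sketch below shows one common way to do that with the same `cat((freqs, freqs))` layout, which pairs dimension i with dimension i + head_dim/2. It assumes PyTorch is installed, and the helper names `rotate_half` and `apply_rotary` as well as the small shapes are assumptions for illustration, not code taken from the notes.

```python
import torch


def rotate_half(x):
    """Map the halves (x1, x2) of the last dimension to (-x2, x1)."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary(x, cos, sin):
    """Rotate x of shape [bs, heads, seq_len, head_dim] using per-position cos/sin."""
    return x * cos + rotate_half(x) * sin


head_dim, seq_len, base = 8, 5, 10000

# Same construction as the cache in the class above, computed inline.
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
positions = torch.arange(seq_len).float()
freqs = torch.outer(positions, inv_freq)      # [seq_len, head_dim / 2]
emb = torch.cat((freqs, freqs), dim=-1)       # [seq_len, head_dim]
cos, sin = emb.cos(), emb.sin()

torch.manual_seed(0)
q = torch.randn(1, 2, seq_len, head_dim)      # hypothetical query tensor
q_rot = apply_rotary(q, cos, sin)             # cos/sin broadcast over bs and heads

print(q_rot.shape)
```

Because each pair of dimensions is only rotated, the vector norms are unchanged and position 0 (angle 0) is left as it was.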

Tiny-llama

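A detailed tutorial for this part is available at: KMnO4-zx/tiny-llm (github.com).

Follow these steps to get it running:

1) Train the tokenizer: python train_vocab.py --download True --vocab_size 4096
2) Preprocess the data: python preprocess.py
3) Train the model: python train.py
4) Generate text with the model: python sample.py --prompt "One day, Lily met a Shoggoth"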

[Screenshot: the training run in progress]

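Start the run and wait for the result; it will probably take about a day.

In the final result the loss is still fairly high. One possible cause is that the initial learning rate was too high, which made the loss drift slowly upward.

Let's test the model with the prompt "One day, Lily met a Shoggoth" and look at what it generates: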

Sample 1:
One day, Lily met a Shoggoth leaves. She was so a friends with dark and playedro must and as help. She looked, starteded old way food on hungry. 
F brave scared in running together all moral outside outside.
The be yummy people. "Tom, looked soon want to the too. both can f them red tail fell. He took other played. She looked they laughed played.
Tom and if having others again gardenate toys and that hurt what over thatw.
The pond made ask all he it!" M put the sky happily we friend. They. They and scared and put ask off himself. They into a water againat. You the be then anymorear if too. He didn't this. They more people and proudd feelr some feel dog.
A listen him feel park, Max when you the that of the mommying friends not out and they not go to go look now. He lots to real pictures, with the little smile. The friends knows. They not funb on for clean. Theyed outing can doing lost.
P, be are know and you she food smile. They became be st. Once upon a time, there was broken hat is truck. Tim was surprised and became good friends. The small
--------------------

Sample 2:
One day, Lily met a Shoggoth was any strange said, "S bed important overr here looking again and delicious as dad that she only.
The soft bird so two?" Sheers hurt high. "OK, hurt feel scary and what was happy and as know proud that in softar over no. Jack ate know dress your truck after fix and hurt and by for like stay. Once upon a time always at know, them should p because need to play bed and askaes. One day, both nodded and playing upch clean and right.
The big headf got need so then asked others backl they said dog. Tim when my all used his man to explore friends you to it toy, inside. She continued going playing who smiled up hurt from this careful than came yummy.
The tree stopped and flew good. The so and he hard into fun. 
S wanted to think know just open as the dog p car. He found bright if though do out toys or set before.
Tim nodded around or singc we he was go for the happened. She and go to the over run and anything tree work could show room. Do going to understand sound and worry into feel do mom no. The fast magic if its w dog their house?"
A pond grabbed her
--------------------

Sample 3:
One day, Lily met a Shoggoth by being rockhil. The foresters was go. He was tired, warm feel put red away by how home. But one day green yummy warm up always?" this grandma thought look sh for their box where.
When something house, two always isly carefulot. Lily anything into it from a eyes. Her it have. She became they still thely and called the small nice. They if things don't sing again games family in the never.
Wheno, they didn't their cat in pictures. F about all around listen to the things and blue. It if hurt idea happily b unexpected as some yard always find what oring p she still.
S until feeling up. He lost his at the story, even toys when and a hat to set look. He is us around they and dad away. That he became ball." He always will comest something thing. Max with youes of best back. He with the park and and toys near going park?"
"Lilyar, moral. Maybe be by her dress. You more tree backit fell with should his sh. Why unexpected for funny he as making highit playinges. He room made toy. The dog idea on my.
Max here running rock people

That wraps up this set of notes.

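Summary

### Article summary
**Title**: Study notes on "A White-Box Guide to Building Large Models" (Task2)
**Overview**:
These are study notes for Task2 of "A White-Box Guide to Building Large Models". They cover several technical details of building large models: the tokenizer, embeddings, RMS Norm (Root Mean Square Layer Normalization), the enumerate function, Flash Attention, GQA (grouped-query attention) and RoPE (Rotary Position Embedding), followed by a short introduction to Tiny-llama and the steps to run it.
**Details**:
1. **Qwen-blog & Tokenizer**:
- A tokenizer splits text into meaningful units called tokens.
- Recommended tokenizer libraries include NLTK and Transformers.
2. **Embedding**:
- An embedding maps discrete symbols into a continuous vector space and captures semantic relationships between words.
- Text is tokenized first, and the tokens are then converted to vectors by the embedding for model training.
3. **RMS Norm (Root Mean Square Layer Normalization)**:
- RMSNorm is a simplification of Layer Normalization that drops the mean subtraction and normalizes each sample's features by their root mean square.
4. **The enumerate function**:
- A Python built-in that yields each element of an iterable together with its index, for loops that need both.
5. **Introduction to Flash Attention**:
- An optimization that speeds up self-attention in Transformer models and reduces its extra memory from O(N^2) to O(N) while producing the exact result.
- The core idea is to compute attention block by block in on-chip memory with an online softmax, so the full attention matrix is never stored.
6. **GQA (grouped-query attention)**:
- A variant of multi-head attention in which groups of query heads share key and value heads, reducing the KV cache and memory bandwidth while keeping model quality close to the original.
- Q, K and V are transposed so the data is laid out per head for multi-head attention.
7. **RoPE (Rotary Position Embedding)**:
- Replaces the additive positional encoding of the original Transformer with a rotation of queries and keys that can be generated for any sequence length and adds no learned parameters.
- RoPE combines absolute and relative position information, which helps the model handle long-range dependencies.
8. **Tiny-llama**:
- A small llama-style model trained from scratch; the notes study the similar Qwen architecture to understand it.
- Covers the concrete commands for training the tokenizer, preprocessing the data, training the model and generating text.
**Results**:
- After training Tiny-llama, text was generated from a given prompt. The loss remained fairly high, possibly because the initial learning rate was too high.
**Conclusion**:
The notes walk through several key techniques in building large models and show the process in practice with Tiny-llama. They can serve as a reference for learners who want to understand how large models are built.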