
LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻译与解读


导读:该论文提出了开源的大规模语言模型LLaMA,其中65B模型在2048块A100-80GB上训练约21天。该模型有以下几个核心技术点:
>> 模型架构=Transformer+集合多个算法的优秀技术(RMSNorm+SwiGLU+RoPE+AdamW+xformers库+预热与余弦学习率调度):LLaMA采用类似GPT的解码器式Transformer架构,并引入多项已被验证有效的改进:对每个子层的输入做预归一化(RMSNorm)以提升训练稳定性,用SwiGLU替换ReLU激活函数,用旋转位置嵌入RoPE替换绝对位置嵌入,从而学习更稳定、更强的语言表示。

(1)、集合多个算法的优秀技术:预归一化函数RMSNorm、激活函数SwiGLU、旋转位置嵌入RoPE、AdamW优化器,以及xformers库实现的高效因果多头注意力加速。

(2)、学习率调度(预热+余弦衰减):LLaMA先用2,000步预热将学习率升至峰值,再按余弦调度逐渐衰减到峰值的10%,并配合0.1的权重衰减和1.0的梯度裁剪,帮助模型稳定收敛。

>> 训练数据约4TB+BPE分词(1.4万亿个tokens)—更多tokens+较小模型=可较好性能:LLaMA的训练数据只使用公开数据集,原始文本约4TB,包括英语CommonCrawl、C4、GitHub、Wikipedia、Gutenberg+Books3、ArXiv、Stack Exchange。使用SentencePiece实现的字节对编码(BPE)算法对数据进行分词,共约1.4万亿个tokens。Chinchilla论文推荐用200B(0.2T)tokens训练10B规模的模型,而LLaMA用1T以上的tokens训练7B模型(33B/65B模型用了1.4T tokens),发现增大tokens规模后,模型性能仍在持续上升。

>> LLaMA包含7B/13B/33B/65B参数的基础语言模型集合—LLaMA-13B仅以约1/10的参数量优于GPT-3(175B):这是一个参数量从7B到65B的基础语言模型集合。作者使用数万亿个标记训练这些模型,并证明仅用公开可用的数据集、不依赖专有且不可获取的数据,也能训练出最先进的模型。在多项语言建模和下游任务的benchmark上,LLaMA-13B在多数基准上优于GPT-3(175B),且训练和推理开销明显更低;LLaMA-65B则与最好的模型Chinchilla-70B和PaLM-540B具有竞争力。

目录

相关论文

LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

LLMs之Alpaca:《Alpaca: A Strong, Replicable Instruction-Following Model》翻译与解读

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca-4月17日版》翻译与解读

LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

实战案例

Windows系统

LLMs:在单机CPU+Windows系统上对LLaMA模型(基于facebookresearch的GitHub)进行模型部署且实现模型推理全流程步骤【部署conda环境+安装依赖库+下载模型权重(国内外各种链接)→模型推理】的图文教程(非常详细)

LLMs:基于单机CPU+Windows系统实现中文LLaMA算法(基于Chinese-LLaMA-Alpaca)进行模型部署(llama.cpp)+模型推理全流程步骤【安装环境+创建环境并安装依赖+原版LLaMA转HF格式+合并llama_hf和chinese-alpaca-lora-7b→下载llama.cpp进行模型的量化(CMake编译+生成量化版本模型)→部署f16/q4_0+测试效果】的图文教程(非常详细)

LLMs:基于单个4GB GPU上(Windows系统)运行LLM上——pyllama模型(基于fjuncongmoo的GitHub)进行模型部署且实现模型推理全流程步骤的图文教程(非常详细)

Linux系统

LLMs:基于Chinese-LLaMA-Alpaca开源代码在Ng单机单卡利用LLaMA(Meta)和Alpaca(斯坦福)实现定义数据集(生成指令数据)→数据预处理(token分词/合并权重)→增量预训练(LoRA的参数/LLaMA的参数)→指令微调LoRA权重(继续训练/全新训练)→模型推理(CLI、GUI【webui/LLaMACha/LangChain】)

LLMs之LLaMA-7B-QLoRA:基于Alpaca-Lora代码在CentOS和多卡(A800+并行技术)实现全流程完整复现LLaMA-7B—安装依赖、转换为HF模型文件、模型微调(QLoRA+单卡/多卡)、模型推理(对比终端命令/llama.cpp/Docker封装)图文教程之详细攻略

《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

Abstract

1、Introduction

2、Approach

2.1、Pre-training Data

2.2、Architecture

2.3、Optimizer

2.4、Efficient implementation高效实现

3 Main results主要结果

3.1 Common Sense Reasoning常识推理

3.2 Closed-book Question Answering闭书式问答

3.3 Reading Comprehension阅读理解

3.4 Mathematical reasoning数学推理

3.5 Code generation代码生成

3.6 Massive Multitask Language Understanding大规模多任务语言理解

3.7 Evolution of performance during training训练期间性能的演变

4 Instruction Finetuning指令微调

5 Bias, Toxicity and Misinformation偏见、有害内容和虚假信息

5.1 Real Toxicity Prompts

5.2 CrowS-Pairs

5.3 WinoGender

5.4 TruthfulQA

6 Carbon footprint碳足迹

7 Related work相关工作

8 Conclusion结论

Acknowledgements致谢

相关论文

LLMs之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

AIGC之LLaMA:《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

LLMs之Alpaca:《Alpaca: A Strong, Replicable Instruction-Following Model》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/129775107

LLMs:《Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca-4月17日版》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/130998087

LLMs:《Efficient And Effective Text Encoding For Chinese Llama And Alpaca—6月15日版本》翻译与解读

https://yunyaniu.blog.csdn.net/article/details/131318974

实战案例

Windows系统

LLMs:在单机CPU+Windows系统上对LLaMA模型(基于facebookresearch的GitHub)进行模型部署且实现模型推理全流程步骤【部署conda环境+安装依赖库+下载模型权重(国内外各种链接)→模型推理】的图文教程(非常详细)

https://yunyaniu.blog.csdn.net/article/details/130979622

LLMs:基于单机CPU+Windows系统实现中文LLaMA算法(基于Chinese-LLaMA-Alpaca)进行模型部署(llama.cpp)+模型推理全流程步骤【安装环境+创建环境并安装依赖+原版LLaMA转HF格式+合并llama_hf和chinese-alpaca-lora-7b→下载llama.cpp进行模型的量化(CMake编译+生成量化版本模型)→部署f16/q4_0+测试效果】的图文教程(非常详细)

https://yunyaniu.blog.csdn.net/article/details/131016046

LLMs:基于单个4GB GPU上(Windows系统)运行LLM上——pyllama模型(基于fjuncongmoo的GitHub)进行模型部署且实现模型推理全流程步骤的图文教程(非常详细)

https://yunyaniu.blog.csdn.net/article/details/131016598

Linux系统

LLMs:基于Chinese-LLaMA-Alpaca开源代码在Ng单机单卡利用LLaMA(Meta)和Alpaca(斯坦福)实现定义数据集(生成指令数据)→数据预处理(token分词/合并权重)→增量预训练(LoRA的参数/LLaMA的参数)→指令微调LoRA权重(继续训练/全新训练)→模型推理(CLI、GUI【webui/LLaMACha/LangChain】)

https://yunyaniu.blog.csdn.net/article/details/131319010

LLMs之LLaMA-7B-QLoRA:基于Alpaca-Lora代码在CentOS和多卡(A800+并行技术)实现全流程完整复现LLaMA-7B—安装依赖、转换为HF模型文件、模型微调(QLoRA+单卡/多卡)、模型推理(对比终端命令/llama.cpp/Docker封装)图文教程之详细攻略

https://yunyaniu.blog.csdn.net/article/details/131526139

《LLaMA: Open and Efficient Foundation Language Models》翻译与解读

地址

论文:https://arxiv.org/abs/2302.13971

GitHub(基于Python部署):https://github.com/facebookresearch/llama

GitHub(基于Python和C/C++部署):GitHub - ggerganov/llama.cpp: Port of Facebook's LLaMA model in C/C++

参考文章:https://baijiahao.baidu.com/s?id=1760235370943525251&wfr=spider&for=pc

作者

Hugo Touvron∗, Thibaut Lavril∗, Gautier Izacard∗, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave∗, Guillaume Lample∗

Meta AI

时间

2023年2月25日

Abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

我们介绍了LLaMA,这是一组参数范围从7B到65B的基础语言模型。我们使用数万亿个标记来训练我们的模型,并展示了可以仅使用公开可用的数据集进行训练,而不需要专有和不可访问的数据集来训练最先进的模型。特别是,LLaMA-13B在大多数基准测试中表现优于GPT-3(175B),LLaMA-65B与最好的模型Chinchilla-70B和PaLM-540B具有竞争力。我们将所有模型发布给研究社区。

1、Introduction

Large Languages Models (LLMs) trained on massive corpora of texts have shown their ability to perform new tasks from textual instructions or from a few examples (Brown et al., 2020). These few-shot properties first appeared when scaling models to a sufficient size (Kaplan et al., 2020), resulting in a line of work that focuses on further scaling these models (Chowdhery et al., 2022; Rae et al., 2021). These efforts are based on the assumption that more parameters will lead to better performance. However, recent work from Hoffmann et al. (2022) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.

大型语言模型(LLMs)在大规模文本语料库上训练后展现了它们根据文本指令或少量示例执行新任务的能力(Brown等,2020年)。这种少样本特性首次出现在将模型扩展到足够大的规模时(Kaplan等,2020年),随后有了一系列进一步扩展这些模型的工作(Chowdhery等,2022年;Rae等,2021年)。这些努力是基于一个假设,即更多的参数将导致更好的性能。然而,Hoffmann等人(2022年)的最新研究表明,在给定的计算预算下,最佳性能不是由最大的模型实现的,而是由更小的模型在更多数据上进行训练的模型实现的。

The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

Hoffmann等人(2022年)的扩展定律的目标是确定如何在特定的训练计算预算下最佳地扩展数据集和模型大小。然而,这个目标忽视了推理预算,在大规模使用语言模型时变得至关重要。在这种情况下,给定目标性能水平,首选的模型不是训练最快的模型,而是推理最快的模型,尽管训练一个大型模型以达到一定的性能水平可能更便宜,但训练时间更长的较小模型在推理阶段最终更经济。例如,尽管Hoffmann等人(2022年)建议在200B个标记上训练一个10B模型,但我们发现7B模型的性能在训练1T个标记后仍在改善。

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used. The resulting models, called LLaMA, ranges from 7B to 65B parameters with competitive performance compared to the best existing LLMs. For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU. At the higher-end of the scale, our 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B.

Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “Books – 2TB” or “Social media conversations”). There exist some exceptions, notably OPT (Zhang et al., 2022), GPT-NeoX (Black et al., 2022), BLOOM (Scao et al., 2022) and GLM (Zeng et al., 2022), but none that are competitive with PaLM-62B or Chinchilla.

本文的重点是通过使用比通常更多的标记进行训练,训练出一系列在各种推理预算下都能达到最佳性能的语言模型。由此得到的模型称为LLaMA,参数范围从7B到65B,性能与现有最好的LLM相比具有竞争力。例如,LLaMA-13B在大多数基准测试中优于GPT-3,尽管规模只有其1/10。我们相信这个模型将有助于LLM的普及访问和研究,因为它可以在单个GPU上运行。在规模较大的一端,我们的65B参数模型也能与最好的大型语言模型(如Chinchilla或PaLM-540B)竞争。

与Chinchilla、PaLM或GPT-3不同,我们只使用公开可用的数据,使我们的工作与开源兼容,而大多数现有模型依赖于不公开可用或未经记录的数据(例如,“Books – 2TB”或“Social media conversations”)。也存在一些例外,例如OPT(Zhang等,2022年),GPT-NeoX(Black等,2022年),BLOOM(Scao等,2022年)和GLM(Zeng等,2022年),但没有一个能与PaLM-62B或Chinchilla相竞争。

In the rest of this paper, we present an overview of the modifications we made to the transformer architecture (Vaswani et al., 2017), as well as our training method. We then report the performance of our models and compare with other LLMs on a set of standard benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community.

在本文的其余部分,我们将概述我们对Transformer架构(Vaswani等,2017年)所做的修改以及我们的训练方法。然后,我们将报告我们模型的性能,并与其他LLM在一系列标准基准测试中进行比较。最后,我们使用最近的一些负责任的AI社区的基准测试揭示了我们的模型中编码的一些偏见和有害信息。

2、Approach

Our training approach is similar to the methods described in previous work (Brown et al., 2020; Chowdhery et al., 2022), and is inspired by the Chinchilla scaling laws (Hoffmann et al., 2022). We train large transformers on a large quantity of textual data using a standard optimizer.

我们的训练方法类似于先前的工作(Brown等,2020年;Chowdhery等,2022年),并受到了Chinchilla扩展定律的启发(Hoffmann等,2022年)。我们使用标准优化器在大量文本数据上训练大型Transformer模型。

2.1、Pre-training Data

Our training dataset is a mixture of several sources, reported in Table 1, that cover a diverse set of domains. For the most part, we reuse data sources that have been leveraged to train other LLMs, with the restriction of only using data that is publicly available, and compatible with open sourcing. This leads to the following mixture of data and the percentage they represent in the training set:

我们的训练数据集是多个来源的混合物,详见表格1,涵盖了各种领域。在很大程度上,我们重新使用了用于训练其他LLM的数据源,但限制是只使用公开可用的数据,并且与开源兼容。这导致了以下混合数据及其在训练集中所代表的百分比:

English CommonCrawl [67%]. We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al., 2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model. In addition, we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages, and discarded pages not classified as references.

C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage.

英语CommonCrawl [67%]。我们对五个CommonCrawl数据转储进行预处理,时间跨度从2017年到2020年,使用CCNet流程(Wenzek等,2020年)。该过程在行级别进行数据去重,使用fastText线性分类器进行语言识别以去除非英语页面,并使用n-gram语言模型过滤低质量内容。此外,我们训练了一个线性模型,用于对维基百科中用作参考的页面与随机抽样页面进行分类,并丢弃未被分类为参考文献的页面。

C4 [15%]。在探索性实验中,我们观察到使用多样的预处理CommonCrawl数据集可以提高性能。因此,我们在我们的数据中包括了公开可用的C4数据集(Raffel等,2020年)。C4的预处理也包括去重和语言识别步骤:与CCNet的主要区别在于质量过滤,主要依靠标点符号的存在或网页中的单词和句子数量。
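下面给出一个最小示意片段(非论文或CCNet官方代码),演示上文提到的用fastText线性分类器做语言识别、过滤非英语行的思路;其中模型文件lid.176.bin、输入/输出文件名和置信度阈值0.65均为假设的示例参数。

```python
import fasttext

# fastText官方发布的语言识别模型文件(此处假设已下载到当前目录)
lid_model = fasttext.load_model("lid.176.bin")

def keep_english(line: str, threshold: float = 0.65) -> bool:
    """判断一行文本是否被识别为英语且置信度不低于阈值(阈值为假设的示例值)。"""
    line = line.strip()
    if not line:
        return False
    labels, probs = lid_model.predict(line.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

# 逐行过滤一个假设的原始文本文件,只保留英语行
with open("commoncrawl_lines.txt", encoding="utf-8") as fin, \
     open("english_lines.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if keep_english(line):
            fout.write(line)
```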

Github [4.5%]. We use the public GitHub dataset available on Google BigQuery. We only kept projects that are distributed under the Apache, BSD and MIT licenses. Additionally, we filtered low quality files with heuristics based on the line length or proportion of alphanumeric characters, and removed boilerplate, such as headers, with regular expressions. Finally, we deduplicate the resulting dataset at the file level, with exact matches.

Wikipedia [4.5%]. We add Wikipedia dumps from the June-August 2022 period, covering 20 languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. We process the data to remove hyperlinks, comments and other formatting boilerplate.

Gutenberg and Books3 [4.5%]. We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models. We perform deduplication at the book level, removing books with more than 90% content overlap.

GitHub [4.5%]。我们使用Google BigQuery上公开可用的GitHub数据集。我们只保留按Apache、BSD和MIT许可证分发的项目。此外,我们使用基于行长度或包含字母数字字符比例的启发式方法过滤低质量文件,并使用正则表达式删除诸如标题之类的样板文件。最后,我们使用完全匹配在文件级别进行数据去重。

Wikipedia [4.5%]。我们添加了2022年6月至8月期间的维基百科转储,涵盖20种使用拉丁字母或西里尔字母的语言:bg、ca、cs、da、de、en、es、fr、hr、hu、it、nl、pl、pt、ro、ru、sl、sr、sv、uk。我们对数据进行处理,删除超链接、注释和其他格式样板。

Gutenberg和Books3 [4.5%]。我们的训练数据集中包括两个图书语料库:Guten- berg计划中包含的公共领域图书,以及ThePile(Gao等,2020年)的Books3部分,这是一个用于训练大型语言模型的公开可用数据集。我们对书籍进行了去重处理,删除了内容重叠超过90%的书籍。

ArXiv [2.5%]. We process arXiv Latex files to add scientific data to our dataset. Following Lewkowycz et al. (2022), we removed everything before the first section, as well as the bibliography. We also removed the comments from the .tex files, and inline-expanded definitions and macros written by users to increase consistency across papers.

Stack Exchange [2%]. We include a dump of Stack Exchange, a website of high quality questions and answers that covers a diverse set of domains, ranging from computer science to chemistry. We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest).

Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters.

Overall, our entire training dataset contains roughly 1.4T tokens after tokenization. For most of our training data, each token is used only once during training, with the exception of the Wikipedia and Books domains, over which we perform approximately two epochs.

ArXiv [2.5%]。我们处理arXiv的LaTeX文件,为数据集添加科学数据。按照Lewkowycz等(2022年)的做法,我们删除了第一节之前的所有内容以及参考文献部分,还删除了.tex文件中的注释,并将用户自定义的定义和宏进行内联展开,以增加论文之间的一致性。

Stack Exchange [2%]。我们包括Stack Exchange的转储,这是一个高质量问题和回答的网站,涵盖了从计算机科学到化学的各种领域。我们保留了最大的28个网站的数据,从文本中删除了HTML标记,并按得分(从高到低)对答案进行排序。

分词器。我们使用字节对编码(BPE)算法(Sennrich等,2015年)对数据进行分词,采用SentencePiece(Kudo和Richardson,2018年)的实现。值得注意的是,我们将所有数字拆分为单个数字,并对未知的UTF-8字符回退到字节级进行分解。

总体而言,我们整个训练数据集在分词后包含大约1.4万亿个标记。对于大多数训练数据,每个标记在训练过程中仅使用一次,但维基百科和图书领域除外,我们在这两个领域上大约训练了两个epoch。
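下面是一个示意性片段,演示用SentencePiece训练BPE分词器并开启论文提到的两个选项:数字逐位拆分(split_digits)与字节回退(byte_fallback);语料路径、词表大小等均为假设的示例值,并非论文使用的真实配置。

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # 假设的训练语料文件
    model_prefix="llama_bpe",    # 输出 llama_bpe.model / llama_bpe.vocab
    vocab_size=32000,            # LLaMA的词表约为32k
    model_type="bpe",
    split_digits=True,           # 将所有数字拆成单个数字
    byte_fallback=True,          # 未知UTF-8字符回退为字节
)

sp = spm.SentencePieceProcessor(model_file="llama_bpe.model")
print(sp.encode("LLaMA was trained on 1.4T tokens.", out_type=str))
```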

2.2、Architecture

Following recent work on large language models, our network is based on the transformer architecture (Vaswani et al., 2017). We leverage various improvements that were subsequently proposed, and used in different models such as PaLM. Here are the main differences with the original architecture, and where we found the inspiration for this change (in brackets):

沿用最近大型语言模型研究的做法,我们的网络基于Transformer架构(Vaswani等,2017年),并利用了后来提出、在PaLM等不同模型中使用过的多项改进。以下是与原始架构的主要不同之处,以及每项改动的灵感来源(标注在方括号内):

Pre-normalization [GPT3]. To improve the training stability, we normalize the input of each transformer sub-layer, instead of normalizing the output. We use the RMSNorm normalizing function, introduced by Zhang and Sennrich (2019).

SwiGLU activation function [PaLM]. We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of 2/3·4d instead of 4d as in PaLM.

Rotary Embeddings [GPTNeo]. We remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by Su et al. (2021), at each layer of the network.

预归一化 [GPT3]:RMSNorm归一化函数。为了改善训练稳定性,我们对每个Transformer子层的输入进行归一化,而不是对输出进行归一化。我们使用由Zhang和Sennrich(2019年)提出的RMSNorm归一化函数。

SwiGLU激活函数 [PaLM]。我们将ReLU非线性激活函数替换为SwiGLU激活函数,该函数由Shazeer(2020年)提出,用于提升性能。与PaLM不同的是,我们使用的维度是2/3·4d,而不是PaLM中的4d。

旋转位置嵌入RoPE [GPTNeo]。我们移除了绝对位置嵌入,转而在网络的每一层添加由Su等(2021年)提出的旋转位置嵌入(RoPE)。

The details of the hyper-parameters for our different models are given in Table 2.

有关我们不同模型的超参数详细信息,请参见表2。
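为帮助理解上述改动,下面给出一个最小PyTorch示意实现(非Meta官方代码):预归一化所用的RMSNorm,以及隐藏维度按论文取2/3·4d的SwiGLU前馈层;eps、张量形状等数值为常见示例值。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # 只用均方根做缩放:不减均值,也没有偏置项
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFeedForward(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * (4 * dim) / 3)       # 论文中的 2/3·4d
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(x·W_gate) ⊙ (x·W_up),再映射回模型维度
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)                   # (batch, 序列长度, 模型维度),示例形状
y = SwiGLUFeedForward(512)(RMSNorm(512)(x))   # 先对子层输入做预归一化,再过前馈层
print(y.shape)                                # torch.Size([2, 16, 512])
```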

2.3、Optimizer

Our models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyper-parameters: β1 = 0.9, β2 = 0.95. We use a cosine learning rate schedule, such that the final learning rate is equal to 10% of the maximal learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0. We use 2,000 warmup steps, and vary the learning rate and batch size with the size of the model (see Table 2 for details).

我们使用AdamW优化器(Loshchilov和Hutter,2017年)进行模型训练,使用以下超参数:β1 = 0.9,β2 = 0.95。我们使用余弦学习率调度,使最终学习率等于最大学习率的10%。我们使用权重衰减0.1和梯度裁剪1.0。我们使用2,000个预热步骤,并根据模型的大小调整学习率和批次大小(详见表2)。
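下面的示意片段(非官方训练代码)按本节给出的超参数配置AdamW,并实现"线性预热2,000步+余弦衰减到峰值10%"的学习率调度;其中峰值学习率和总步数只是假设的演示值,实际取值随模型规模而变(见表2)。

```python
import math
import torch

model = torch.nn.Linear(512, 512)                               # 占位模型,仅演示优化器配置
max_lr, total_steps, warmup_steps = 1.5e-4, 100_000, 2_000      # 峰值学习率与总步数为假设值

optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr,
    betas=(0.9, 0.95), weight_decay=0.1,                        # 论文给出的β1/β2与权重衰减
)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:                                     # 前2,000步线性预热
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))    # 余弦衰减到峰值的10%

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# 每个训练步:反向传播后做1.0的梯度裁剪,再更新参数与学习率
loss = model(torch.randn(4, 512)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```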

2.4、Efficient implementation高效实现

We make several optimizations to improve the training speed of our models. First, we use an efficient implementation of the causal multi-head attention to reduce memory usage and runtime. This implementation, available in the xformers library, is inspired by Rabe and Staats (2021) and uses the backward from Dao et al. (2022). This is achieved by not storing the attention weights and not computing the key/query scores that are masked due to the causal nature of the language modeling task.

我们进行了几项优化来提高模型的训练速度。首先,我们使用了一种高效的因果多头注意力实现,以减少内存使用和运行时间。这个实现在xformers库中可用,受到了Rabe和Staats(2021年)的启发,并使用了Dao等人(2022年)的反向传播。通过不存储注意力权重和不计算由于语言建模任务的因果性质而被屏蔽的键/查询得分,实现了这一点。

To further improve training efficiency, we reduced the amount of activations that are recomputed during the backward pass with checkpointing. More precisely, we save the activations that are expensive to compute, such as the outputs of linear layers. This is achieved by manually implementing the backward function for the transformer layers, instead of relying on the PyTorch autograd. To fully benefit from this optimization, we need to reduce the memory usage of the model by using model and sequence parallelism, as described by Korthikanti et al. (2022). Moreover, we also overlap the computation of activations and the communication between GPUs over the network (due to all_reduce operations) as much as possible.

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days.

为了进一步提高训练效率,我们通过检查点技术减少了在反向传播过程中重新计算的激活数量。具体而言,我们保存了昂贵计算的激活,如线性层的输出。这是通过手动实现Transformer层的反向传播函数来实现的,而不是依赖于PyTorch的autograd。为了充分利用这种优化,我们需要使用模型和序列并行来减少模型的内存使用,正如Korthikanti等人(2022年)所描述的。此外,我们还尽可能地重叠计算激活和在网络上进行的GPU之间的通信(由于all_reduce操作)。

当训练一个拥有650亿参数的模型时,我们的代码在拥有80GB内存的2048个A100 GPU上每秒处理约380个标记。这意味着在包含1.4万亿标记的数据集上进行训练大约需要21天。
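下面用一个小示例演示"不物化注意力权重"的因果注意力计算。论文使用的是xformers库的高效实现,这里为便于运行,改用PyTorch 2.x内置的scaled_dot_product_attention作近似演示;张量形状均为假设值。

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 8, 1024, 64            # batch、头数、序列长度、每头维度(示例值)
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

# is_causal=True:内部直接跳过被因果掩码遮住的key/query得分,
# 不会物化 T×T 的注意力权重矩阵,从而降低显存占用与运行时间
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                       # torch.Size([2, 8, 1024, 64])
```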

3 Main results主要结果

Following previous work (Brown et al., 2020), we consider zero-shot and few-shot tasks, and report results on a total of 20 benchmarks:

Zero-shot. We provide a textual description of the task and a test example. The model either provides an answer using open-ended generation, or ranks the proposed answers.

Few-shot. We provide a few examples of the task (between 1 and 64) and a test example. The model takes this text as input and generates the answer or ranks different options.

我们遵循以前的工作(Brown等,2020年),考虑了零样本和少样本任务,并在总共20个基准测试中报告结果:

零样本。我们提供任务的文本描述和一个测试示例。模型通过开放式生成给出答案,或对提供的候选答案进行排序。

少样本。我们提供任务的几个示例(1到64个)和一个测试示例。模型以这些文本作为输入,生成答案或对不同选项进行排序。

We compare LLaMA with other foundation models, namely the non-publicly available language models GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022) and PaLM (Chowdhery et al., 2022), as well as the open-sourced OPT models (Zhang et al., 2022), GPT-J (Wang and Komatsuzaki, 2021), and GPT-Neo (Black et al., 2022). In Section 4, we also briefly compare LLaMA with instruction-tuned models such as OPT-IML (Iyer et al., 2022) and Flan-PaLM (Chung et al., 2022).

We evaluate LLaMA on free-form generation tasks and multiple choice tasks. In the multiple choice tasks, the objective is to select the most appropriate completion among a set of given options, based on a provided context. We select the completion with the highest likelihood given the provided context. We follow Gao et al. (2021) and use the likelihood normalized by the number of characters in the completion, except for certain datasets (OpenBookQA, BoolQ), for which we follow Brown et al. (2020), and select a completion based on the likelihood normalized by the likelihood of the completion given “Answer:” as context: P(completion|context)/P(completion|“Answer:”).

我们将LLaMA与其他基础模型进行比较,包括非公开可用的语言模型GPT-3、Gopher、Chinchilla和PaLM,以及开源的OPT模型、GPT-J和GPT-Neo。在第4节中,我们还简要比较了LLaMA与OPT-IML和Flan-PaLM等针对指令进行调整的模型。

我们在自由形式生成任务和多项选择任务上评估LLaMA。在多项选择任务中,目标是根据提供的上下文,从一组给定选项中选出最合适的补全。我们选择在给定上下文下可能性最高的补全。我们遵循Gao等人(2021年)的方法,使用按补全字符数归一化的可能性作为评分;但对于某些数据集(OpenBookQA、BoolQ),我们按照Brown等人(2020年)的方法,用补全在给定上下文下的可能性除以其在以"Answer:"为上下文时的可能性来选择:P(completion|context)/P(completion|"Answer:")。
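下面是一个示意性打分片段(非论文官方评测代码),演示多项选择任务的两种归一化方式:按补全字符数归一化,以及OpenBookQA/BoolQ所用的P(completion|context)/P(completion|"Answer:");示例中的模型"gpt2"与题目仅作演示。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def completion_logprob(context: str, completion: str) -> float:
    """返回 log P(completion | context)。"""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + completion, return_tensors="pt").input_ids
    logits = lm(full_ids).logits.log_softmax(dim=-1)
    # 第i个token的概率由第i-1个位置的logits给出
    tgt = full_ids[0, ctx_ids.shape[1]:]
    preds = logits[0, ctx_ids.shape[1] - 1 : full_ids.shape[1] - 1]
    return preds.gather(1, tgt.unsqueeze(1)).sum().item()

def score(context: str, completion: str, normalize: str = "chars") -> float:
    lp = completion_logprob(context, completion)
    if normalize == "chars":                       # 默认:按字符数归一化
        return lp / len(completion)
    # OpenBookQA/BoolQ:除以以"Answer:"为上下文时的可能性(对数域即相减)
    return lp - completion_logprob("Answer:", completion)

choices = [" Paris", " Berlin"]
ctx = "Question: What is the capital of France?\nAnswer:"
print(max(choices, key=lambda c: score(ctx, c)))
```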

3.1 Common Sense Reasoning常识推理

We consider eight standard common sense reasoning benchmarks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018) and OpenBookQA (Mihaylov et al., 2018). These datasets include Cloze and Winograd style tasks, as well as multiple choice question answering. We evaluate in the zero-shot setting as done in the language modeling community.

In Table 3, we compare with existing models of various sizes and report numbers from the corresponding papers. First, LLaMA-65B outperforms Chinchilla-70B on all reported benchmarks but BoolQ. Similarly, this model surpasses PaLM-540B everywhere but on BoolQ and WinoGrande. LLaMA-13B model also outperforms GPT-3 on most benchmarks despite being 10× smaller.

我们考虑了八个常识推理基准测试:BoolQ、PIQA、SIQA、HellaSwag、WinoGrande、ARC easy和challenge以及OpenBookQA。这些数据集包括Cloze和Winograd风格的任务,以及多项选择题。我们按照语言建模社区的做法,在零样本设置下进行评估。

在表3中,我们与各种规模的现有模型进行比较,并报告了相应论文中的数据。首先,LLaMA-65B在除BoolQ以外的所有报告基准上都优于Chinchilla-70B。同样,除了BoolQ和WinoGrande,该模型也全面超过了PaLM-540B。LLaMA-13B尽管规模只有GPT-3的1/10,也在大多数基准测试中优于GPT-3。

3.2 Closed-book Question Answering闭书式问答

We compare LLaMA to existing large language models on two closed-book question answering benchmarks: Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). For both benchmarks, we report exact match performance in a closed book setting, i.e., where the models do not have access to documents that contain evidence to answer the question. In Table 4, we report performance on NaturalQuestions, and in Table 5, we report on TriviaQA. On both benchmarks, LLaMA-65B achieves state-of-the-art performance in the zero-shot and few-shot settings. More importantly, the LLaMA-13B is also competitive on these benchmarks with GPT-3 and Chinchilla, despite being 5-10× smaller. This model runs on a single V100 GPU during inference.

我们在两个闭书式问答基准测试上将LLaMA与现有的大型语言模型进行比较:Natural Questions和TriviaQA。对于这两个基准,我们报告闭书设置下的精确匹配性能,即模型无法访问包含回答问题所需证据的文档。表4报告了Natural Questions上的性能,表5报告了TriviaQA上的性能。在这两个基准上,LLaMA-65B在零样本和少样本设置下都达到了最先进的性能。更重要的是,LLaMA-13B尽管规模小5~10倍,在这些基准上也能与GPT-3和Chinchilla竞争,并且在推理时可以在单个V100 GPU上运行。

3.3 Reading Comprehension阅读理解

We evaluate our models on the RACE reading comprehension benchmark (Lai et al., 2017). This dataset was collected from English reading comprehension exams designed for middle and high school Chinese students. We follow the evaluation setup from Brown et al. (2020) and report results in Table 6. On these benchmarks, LLaMA-65B is competitive with PaLM-540B, and LLaMA-13B outperforms GPT-3 by a few percent.

我们在RACE阅读理解基准测试上评估了我们的模型。该数据集收集自为中国初高中学生设计的英语阅读理解考试。我们按照Brown等人(2020年)的评估设置进行评估,并在表6中报告结果。在这些基准上,LLaMA-65B与PaLM-540B具有竞争力,而LLaMA-13B比GPT-3高出几个百分点。

3.4 Mathematical reasoning数学推理

We evaluate our models on two mathematical reasoning benchmarks: MATH (Hendrycks et al., 2021) and GSM8k (Cobbe et al., 2021). MATH is a dataset of 12K middle school and high school mathematics problems written in LaTeX. GSM8k is a set of middle school mathematical problems. In Table 7, we compare with PaLM and Minerva (Lewkowycz et al., 2022). Minerva is a series of PaLM models finetuned on 38.5B tokens extracted from ArXiv and Math Web Pages, while neither PaLM or LLaMA are finetuned on mathematical data. The numbers for PaLM and Minerva are taken from Lewkowycz et al. (2022), and we compare with and without maj1@k. maj1@k denotes evaluations where we generate k samples for each problem and perform a majority voting (Wang et al., 2022). On GSM8k, we observe that LLaMA-65B outperforms Minerva-62B, although it has not been fine-tuned on mathematical data.

我们在两个数学推理基准测试上评估我们的模型:MATH(Hendrycks等人,2021年)和GSM8k(Cobbe等人,2021年)。MATH是一个包含1.2万个初高中数学问题的数据集,以LaTeX书写;GSM8k是一组初中数学问题。在表7中,我们与PaLM和Minerva(Lewkowycz等人,2022年)进行了比较。Minerva是一系列在从ArXiv和数学网页中提取的385亿(38.5B)个标记上微调的PaLM模型,而PaLM和LLaMA都没有在数学数据上微调。PaLM和Minerva的数据来自Lewkowycz等人(2022年),我们比较了有无maj1@k的结果。maj1@k表示为每个问题生成k个样本并进行多数投票(Wang等人,2022年)的评估方式。在GSM8k上,我们观察到LLaMA-65B优于Minerva-62B,尽管它没有在数学数据上微调。

3.5 Code generation代码生成

We evaluate the ability of our models to write code from a natural language description on two benchmarks: HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). For both tasks, the model receives a description of the program in a few sentences, as well as a few input-output examples. In HumanEval, it also receives a function signature, and the prompt is formatted as natural code with the textual description and tests in a docstring. The model needs to generate a Python program that fits the description and satisfies the test cases. In Table 8, we compare the pass@1 scores of our models with existing language models that have not been finetuned on code, namely PaLM and LaMDA (Thoppilan et al., 2022). PaLM and LLaMA were trained on datasets that contain a similar number of code tokens.

我们在两个代码生成基准测试上评估我们的模型对于从自然语言描述中生成代码的能力:HumanEval(Chen等人,2021年)和MBPP(Austin等人,2021年)。对于这两个任务,模型接收到一段程序的描述,包括几个输入-输出示例。在HumanEval中,它还会接收到一个函数签名,而提示文本的格式是自然代码,其中包含了文本描述和测试用例。模型需要生成一个符合描述并满足测试用例的Python程序。在表8中,我们将我们的模型的pass@1得分与未在代码上进行微调的现有语言模型进行了比较,包括PaLM和LaMDA(Thoppilan等人,2022年)。PaLM和LLaMA都是在包含相似数量的代码标记的数据集上进行训练的。

As shown in Table 8, for a similar number of parameters, LLaMA outperforms other general models such as LaMDA and PaLM, which are not trained or finetuned specifically for code. LLaMA with 13B parameters and more outperforms LaMDA 137B on both HumanEval and MBPP. LLaMA 65B also outperforms PaLM 62B, even when it is trained longer. The pass@1 results reported in this table were obtained by sampling with temperature 0.1. The pass@100 and pass@80 metrics were obtained with temperature 0.8. We use the same method as Chen et al. (2021) to obtain unbiased estimates of the pass@k.

如表8所示,对于相似数量的参数,LLaMA优于其他通用模型,如LaMDA和PaLM,这些模型没有专门针对代码进行训练或微调。LLaMA拥有13B参数及以上,在HumanEval和MBPP上的表现优于LaMDA 137B。即使在训练时间更长的情况下,LLaMA 65B也优于PaLM 62B。表中报告的pass@1结果是在温度为0.1的情况下采样得到的。pass@100和pass@80指标是在温度为0.8的情况下获得的。我们使用与Chen等人(2021年)相同的方法来获得pass@k的无偏估计。
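文中提到的pass@k无偏估计来自Chen等(2021):对每道题采样n个程序、其中c个通过单元测试,则 pass@k = 1 − C(n−c, k)/C(n, k)。下面按其数值稳定写法给出一个小实现,示例数字仅作演示。

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: 每题采样的程序总数, c: 通过单元测试的样本数, k: 允许尝试的次数"""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k)/C(n, k) 的数值稳定写法:逐项累乘 (1 - k/i)
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# 例如:某题采样200个程序,其中13个通过
print(pass_at_k(200, 13, 1))     # 0.065,即 c/n
print(pass_at_k(200, 13, 100))   # 接近1
```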

It is possible to improve the performance on code by finetuning on code-specific tokens. For instance, PaLM-Coder (Chowdhery et al., 2022) increases the pass@1 score of PaLM on HumanEval from 26.2% for PaLM to 36%. Other models trained specifically for code also perform better than general models on these tasks (Chen et al., 2021; Nijkamp et al., 2022; Fried et al., 2022). Finetuning on code tokens is beyond the scope of this paper.

通过在代码专用标记上进行微调,可以进一步提升代码生成性能。例如,PaLM-Coder(Chowdhery等人,2022年)将PaLM在HumanEval上的pass@1得分从26.2%提高到36%。其他专门针对代码训练的模型在这些任务上也优于通用模型(Chen等人,2021年;Nijkamp等人,2022年;Fried等人,2022年)。不过,在代码标记上进行微调超出了本文的范围。

3.6 Massive Multitask Language Understanding大规模多任务语言理解

The massive multitask language understanding benchmark, or MMLU, introduced by Hendrycks et al. (2020) consists of multiple choice questions covering various domains of knowledge, including humanities, STEM and social sciences. We evaluate our models in the 5-shot setting, using the examples provided by the benchmark, and report results in Table 9. On this benchmark, we observe that the LLaMA-65B is behind both Chinchilla-70B and PaLM-540B by a few percent in average, and across most domains. A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, that sums up to only 177GB, while these models were trained on up to 2TB of books. This large quantity of books used by Gopher, Chinchilla and PaLM may also explain why Gopher outperforms GPT-3 on this benchmark, while it is comparable on other benchmarks.

大规模多任务语言理解基准测试(MMLU),由Hendrycks等人(2020年)引入,包含涵盖人文、STEM和社会科学等各个领域的多项选择题。我们在5-shot设置下使用该基准测试提供的示例评估我们的模型,并在表9中报告结果。在这个基准测试中,LLaMA-65B在平均值上略逊于Chinchilla-70B和PaLM-540B,并且在大多数领域中也是如此。一个可能的解释是,我们在预训练数据中使用了有限数量的图书和学术论文,即ArXiv、Gutenberg和Books3,总共只有177GB,而这些模型是在高达2TB的图书上进行训练的。这些由Gopher、Chinchilla和PaLM使用的大量图书可能也解释了为什么Gopher在这个基准测试中优于GPT-3,而在其他基准测试中相当。

3.7 Evolution of performance during training训练期间性能的演变

During training, we tracked the performance of our models on a few question answering and common sense benchmarks, and report them in Figure 2. On most benchmarks, the performance improves steadily, and correlates with the training perplexity of the model (see Figure 1). The exceptions are SIQA and WinoGrande. Most notably, on SIQA, we observe a lot of variance in performance, that may indicate that this benchmark is not reliable. On WinoGrande, the performance does not correlate as well with training perplexity: the LLaMA-33B and LLaMA-65B have similar performance during the training.

在训练过程中,我们跟踪了我们的模型在一些问答和常识基准测试上的性能,并在图2中进行了报告。在大多数基准测试中,性能稳步提升,并与模型的训练困惑度相关(参见图1)。SIQA和WinoGrande是例外。特别是在SIQA上,我们观察到性能有很大的变化,这可能表明该基准测试不太可靠。在WinoGrande上,性能与训练困惑度的相关性不太明显:LLaMA-33B和LLaMA-65B在训练期间的性能相似。

4 Instruction Finetuning指令微调

In this section, we show that briefly finetuning on instructions data rapidly leads to improvements on MMLU. Although the non-finetuned version of LLaMA-65B is already able to follow basic instructions, we observe that a very small amount of finetuning improves the performance on MMLU, and further improves the ability of the model to follow instructions. Since this is not the focus of this paper, we only conducted a single experiment following the same protocol as Chung et al. (2022) to train an instruct model, LLaMA-I.

在本节中,我们展示了在指令数据上进行简短微调就能迅速提升MMLU上的性能。尽管未经微调的LLaMA-65B已经能够遵循基本指令,但我们观察到,极少量的微调即可提升其在MMLU上的表现,并进一步增强模型遵循指令的能力。由于这不是本文的重点,我们只按照Chung等人(2022年)的协议进行了一次实验,训练了一个指令模型LLaMA-I。

In Table 10, we report the results of our instruct model LLaMA-I on MMLU and compare with existing instruction finetuned models of moderate sizes, namely, OPT-IML (Iyer et al., 2022) and the Flan-PaLM series (Chung et al., 2022). All the reported numbers are from the corresponding papers. Despite the simplicity of the instruction finetuning approach used here, we reach 68.9% on MMLU. LLaMA-I (65B) outperforms on MMLU existing instruction finetuned models of moderate sizes, but are still far from the state-of-the-art, that is 77.4 for GPT code-davinci-002 on MMLU (numbers taken from Iyer et al. (2022)). The details of the performance on MMLU on the 57 tasks can be found in Table 16 of the appendix.

在表10中,我们报告了我们的指令模型LLaMA-I在MMLU上的结果,并与具有中等规模的现有指令微调模型OPT-IML(Iyer等人,2022年)和Flan-PaLM系列(Chung等人,2022年)进行了比较。所有报告的数据都来自相应的论文。尽管这里使用的指令微调方法相对简单,我们在MMLU上达到了68.9%的准确率。LLaMA-I(65B)在MMLU上的表现优于具有中等规模的现有指令微调模型,但仍远远落后于当前的最先进水平,即GPT code-davinci-002在MMLU上的准确率为77.4%(数据取自Iyer等人(2022年))。有关在57个任务上的MMLU性能细节,请参见附录的表16。

Model       Params  Humanities  STEM  Social Sciences  Other  Average
GPT-NeoX    20B     29.8        34.9  33.7             37.7   33.6
GPT-3       175B    40.8        36.7  50.4             48.8   43.9
Gopher      280B    56.2        47.4  71.9             66.1   60.0
Chinchilla  70B     63.6        54.9  79.3             73.9   67.5
PaLM        8B      25.6        23.8  24.1             27.8   25.4
PaLM        62B     59.5        41.9  62.7             55.8   53.7
PaLM        540B    77.0        55.6  81.0             69.6   69.3
LLaMA       7B      34.0        30.5  38.3             38.1   35.1
LLaMA       13B     45.0        35.8  53.8             53.3   46.9
LLaMA       33B     55.8        46.0  66.7             63.4   57.8
LLaMA       65B     61.8        51.7  72.9             67.4   63.4

Table 9: Massive Multitask Language Understanding (MMLU). Five-shot accuracy.

5 Bias, Toxicity and Misinformation偏见、有害内容和虚假信息

Large language models have been shown to reproduce and amplify biases that are existing in the training data (Sheng et al., 2019; Kurita et al., 2019), and to generate toxic or offensive content (Gehman et al., 2020). As our training dataset contains a large proportion of data from the Web, we believe that it is crucial to determine the potential for our models to generate such content. To understand the potential harm of LLaMA-65B, we evaluate on different benchmarks that measure toxic content production and stereotypes detection. While we have selected some of the standard benchmarks that are used by the language model community to indicate some of the issues with these models, these evaluations are not sufficient to fully understand the risks associated with these models.

已经有研究表明,大型语言模型能够复制和放大训练数据中存在的偏见(Sheng等人,2019年;Kurita等人,2019年),并生成有害或冒犯性的内容(Gehman等人,2020年)。由于我们的训练数据集包含大量来自互联网的数据,我们认为确定我们的模型生成此类内容的潜力是至关重要的。为了了解LLaMA-65B的潜在危害,我们在衡量有害内容生成和刻板印象检测的不同基准测试上进行评估。虽然我们选择了一些标准的基准测试,这些测试被语言模型社区用来指示这些模型存在的一些问题,但这些评估并不足以完全了解与这些模型相关的风险。

5.1 Real Toxicity Prompts

Language models can generate toxic language, e.g., insults, hate speech or threats. There is a very large range of toxic content that a model can generate, making a thorough evaluation challenging. Several recent works (Zhang et al., 2022; Hoffmann et al., 2022) have considered the RealToxicityPrompts benchmark (Gehman et al., 2020) as an indicator of how toxic their model is. RealToxicityPrompts consists of about 100k prompts that the model must complete; then a toxicity score is automatically evaluated by making a request to PerspectiveAPI. We do not have control over the pipeline used by the third-party PerspectiveAPI, making comparison with previous models difficult.

语言模型可以生成有害语言,例如侮辱、仇恨言论或威胁。模型可以生成的有害内容范围非常广泛,这使得全面评估变得具有挑战性。最近的一些研究(Zhang等人,2022年;Hoffmann等人,2022年)已将RealToxicityPrompts基准测试(Gehman等人,2020年)视为评估其模型有害性的指标。RealToxicityPrompts包含约10万个模型必须完成的提示,然后通过向PerspectiveAPI发出请求自动评估其有害性分数。我们无法控制第三方PerspectiveAPI使用的流程,这使得与先前模型的比较变得困难。

For each of the 100k prompts, we greedily generate with our models, and measure their toxicity score. The score per prompt ranges from 0 (non-toxic) to 1 (toxic). In Table 11, we report our averaged score on basic and respectful prompt categories of RealToxicityPrompts. These scores are “comparable” with what we observe in the literature (e.g., 0.087 for Chinchilla) but the methodologies differ between these work and ours (in terms of sampling strategy, number of prompts and time of API). We observe that toxicity increases with the size of the model, especially for Respectful prompts. This was also observed in previous work (Zhang et al., 2022), with the notable exception of Hoffmann et al. (2022) where they do not see a difference between Chinchilla and Gopher, despite different sizes. This could be explained by the fact that the larger model, Gopher, has worse performance than Chinchilla, suggesting that the relation between toxicity and model size may only apply within a model family.

对于这10万个提示中的每个提示,我们使用我们的模型进行贪婪生成,并测量其有害性分数。每个提示的分数范围从0(非有害)到1(有害)。在表11中,我们报告了我们在RealToxicityPrompts的基本和尊重提示类别上的平均分数。这些分数与我们在文献中观察到的结果“可比”(例如,Chinchilla的分数为0.087),但这些工作与我们的工作方法不同(在采样策略、提示数量和API时间方面)。我们观察到有害性随着模型的大小增加而增加,特别是对于尊重提示。这也是以前的研究观察到的现象(Zhang等人,2022年),但Hoffmann等人(2022年)的研究是一个值得注意的例外,他们没有观察到Chinchilla和Gopher之间的差异,尽管它们的大小不同。这可能可以解释为较大的模型Gopher的性能比Chinchilla差,表明有害性和模型大小之间的关系可能仅适用于模型系列内部。
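下面是一个示意流程(非论文官方评测代码):对每条提示做贪心解码,再把补全交给毒性打分函数;真实评测调用的是第三方PerspectiveAPI,这里用占位函数代替,模型名"gpt2"仅作演示。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def toxicity_score(text: str) -> float:
    """占位函数:真实评测需请求PerspectiveAPI,返回0~1之间的毒性分数。"""
    raise NotImplementedError

@torch.no_grad()
def complete(prompt: str, max_new_tokens: int = 20) -> str:
    """贪心解码:do_sample=False,只返回新生成的部分。"""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, do_sample=False, max_new_tokens=max_new_tokens,
                      pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# 对每条提示打分后,再分别对basic/respectful两类提示取平均:
# scores = [toxicity_score(complete(p)) for p in prompts]
```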

5.2 CrowS-Pairs

We evaluate the biases in our model on the CrowS-Pairs (Nangia et al., 2020). This dataset allows to measure biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance and socioeconomic status. Each example is composed of a stereotype and an anti-stereotype, we measure the model preference for the stereotypical sentence using the perplexity of both sentences in a zero-shot setting. Higher scores thus indicate higher bias. We compare with GPT-3 and OPT-175B in Table 12.

LLaMA compares slightly favorably to both models on average. Our model is particularly biased in the religion category (+10% compared to OPT-175B), followed by age and gender. We expect these biases to come from CommonCrawl despite multiple filtering steps.

我们在CrowS-Pairs(Nangia等人,2020年)上评估了我们模型的偏见。该数据集可用于衡量9个类别的偏见:性别、宗教、种族/肤色、性取向、年龄、国籍、残疾、外貌和社会经济地位。每个示例由一个刻板印象和一个反刻板印象组成,我们通过在零样本设置中比较两个句子的困惑度来衡量模型对刻板印象句子的偏好。较高的分数表示较高的偏见。我们在表12中与GPT-3和OPT-175B进行了比较。

总体而言,LLaMA在平均水平上略微优于这两个模型。我们的模型在宗教类别上表现出较大的偏见(相比于OPT-175B,增加了10%),其次是年龄和性别。我们认为这些偏见可能来自于CommonCrawl,尽管经过了多次过滤步骤。
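下面是一个示意片段,演示CrowS-Pairs式的偏好测量:分别计算"刻板印象句"与"反刻板印象句"的困惑度,困惑度更低者即模型更偏好的说法;模型名"gpt2"与示例句子均为假设,仅作演示。

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(sentence: str) -> float:
    """零样本设置下整句的困惑度:labels=ids时HF自动移位并返回平均交叉熵。"""
    ids = tok(sentence, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss
    return math.exp(loss.item())

stereo, anti = "example stereotype sentence", "example anti-stereotype sentence"
prefers_stereotype = perplexity(stereo) < perplexity(anti)   # True表示模型偏好刻板印象句
print(prefers_stereotype)
```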

5.3 WinoGender

To further investigate the biases of our model on the gender category, we look at the WinoGender benchmark (Rudinger et al., 2018), a co-reference resolution dataset. WinoGender is made of Winograd schema, and biases are evaluated by determining if a model co-reference resolution performance is impacted by the gender of the pronoun.

More precisely, each sentence has three mentions: an “occupation”, a “participant”, and a “pronoun” where the pronoun is co-referencing either the occupation or participant. We prompt the model to determine the co-reference relation and measure if it does so correctly according to the context of the sentence. The goal is to reveal if societal biases associated with occupations have been captured by the model. For example, a sentence in the WinoGender dataset is “The nurse notified the patient that his shift would be ending in an hour.”, which is followed by ‘His’ refers to. We then compare the perplexity of the continuations the nurse and the patient to perform co-reference resolution with the model. We evaluate the performance when using 3 pronouns: “her/her/she”, “his/him/he” and “their/them/someone” (the different choices corresponding to the grammatical function of the pronoun).

为了进一步研究我们的模型在性别类别上的偏见,我们考察了WinoGender基准测试(Rudinger等人,2018年),这是一个共指消解数据集。WinoGender由Winograd schema构成,通过判断代词的性别是否会影响模型共指消解的表现来评估偏见。

更具体地说,每个句子包含三个提及:一个"职业"、一个"参与者"和一个"代词",代词与职业或参与者之一共指。我们提示模型判断共指关系,并衡量它是否能根据句子的上下文正确完成判断,目的是揭示模型是否捕捉到了与职业相关的社会偏见。例如,WinoGender数据集中的一个句子是"The nurse notified the patient that his shift would be ending in an hour.",其后接"‘His' refers to"。随后我们比较模型对续写 the nurse 和 the patient 的困惑度,以此完成共指消解。我们评估使用三组代词时的性能:"her/her/she"、"his/him/he"和"their/them/someone"(不同形式对应代词的语法功能)。

In Table 13, we report the co-reference scores for the three different pronouns contained in the dataset. We observe that our model is significantly better at performing co-reference resolution for the “their/them/someone” pronouns than for the “her/her/she” and “his/him/he” pronouns. A similar observation was made in previous work (Rae et al., 2021; Hoffmann et al., 2022), and is likely indicative of gender bias. Indeed, in the case of the “her/her/she” and “his/him/he” pronouns, the model is probably using the majority gender of the occupation to perform co-reference resolution, instead of using the evidence of the sentence.

To further investigate this hypothesis, we look at the set of “gotcha” cases for the “her/her/she” and “his/him/he” pronouns in the WinoGender dataset. These cases correspond to sentences in which the pronoun does not match the majority gender of the occupation, and the occupation is the correct answer. In Table 13, we observe that our model, LLaMA-65B, makes more errors on the gotcha examples, clearly showing that it captures societal biases related to gender and occupation. The drop of performance exists for both “her/her/she” and “his/him/he” pronouns, which is indicative of biases regardless of gender.

在表13中,我们报告了数据集中三组不同代词的共指消解得分。我们观察到,模型在"their/them/someone"代词上的共指消解表现明显好于"her/her/she"和"his/him/he"代词。以前的研究(Rae等人,2021年;Hoffmann等人,2022年)也有类似的观察,这很可能表明存在性别偏见。实际上,对于"her/her/she"和"his/him/he"代词,模型很可能是依据职业的多数性别来进行共指消解,而不是利用句子本身的证据。

为了进一步验证这一假设,我们考察了WinoGender数据集中"her/her/she"和"his/him/he"代词的"gotcha"案例。这些案例对应的句子中,代词与该职业的多数性别不匹配,而职业才是正确答案。在表13中,我们观察到模型LLaMA-65B在"gotcha"案例上犯了更多错误,清楚地表明它捕捉到了与性别和职业相关的社会偏见。无论代词是"her/her/she"还是"his/him/he",性能都出现下降,这表明这种偏见与具体性别无关地存在。

5.4 TruthfulQA

TruthfulQA (Lin et al., 2021) aims to measure the truthfulness of a model, i.e., its ability to identify when a claim is true. Lin et al. (2021) consider the definition of “true” in the sense of “literal truth about the real world”, and not claims that are only true in the context of a belief system or tradition. This benchmark can evaluate the risks of a model to generate misinformation or false claims. The questions are written in diverse style, cover 38 categories and are designed to be adversarial.

In Table 14, we report the performance of our models on both questions to measure truthful models and the intersection of truthful and informative. Compared to GPT-3, our model scores higher in both categories, but the rate of correct answers is still low, showing that our model is likely to hallucinate incorrect answers.

TruthfulQA(Lin等人,2021年)旨在衡量模型的真实性,即其识别陈述是否真实的能力。Lin等人(2021年)将“真实”定义为“关于现实世界的字面真实”,而不仅仅是在信仰体系或传统背景下成立的陈述。该基准测试可以评估模型生成错误信息或虚假陈述的风险。问题以多样的风格编写,涵盖了38个类别,并被设计为对抗性的。

在表14中,我们报告了模型在"真实"以及"真实且有信息量"两项指标上的表现。与GPT-3相比,我们的模型在这两项指标上得分都更高,但正确回答率仍然很低,表明我们的模型仍很可能臆造出错误答案。

6 Carbon footprint碳足迹

The training of our models have consumed a massive quantity of energy, responsible for the emission of carbon dioxide. We follow the recent literature on the subject and breakdown both the total energy consumption and the resulting carbon footprint in Table 15. We follow the formula from Wu et al. (2022) to estimate the Watt-hour, Wh, needed to train a model, as well as the tons of carbon emissions, tCO2eq. For the Wh, we use the formula:

Wh = GPU-h × (GPU power consumption) × PUE, where we set the Power Usage Effectiveness (PUE) at 1.1. The resulting carbon emission depends on the location of the data center used to train the network. For instance, BLOOM uses a grid that emits 0.057 kg CO2eq/KWh leading to 27 tCO2eq and OPT a grid that emits 0.231 kg CO2eq/KWh, leading to 82 tCO2eq. In this study, we are interested in comparing the cost in carbon emission of training of these models if they were trained in the same data center. Hence, we do not take the location of data center in consideration, and use, instead, the US national average carbon intensity factor of 0.385 kg CO2eq/KWh. This leads to the following formula for the tons of carbon emissions:

tCO2eq = MWh × 0.385.

我们模型的训练消耗了大量能源,导致二氧化碳的排放。我们参考最近的文献,将总能耗和相应的碳足迹分解如表15所示。我们遵循Wu等人(2022年)的公式来估计训练模型所需的瓦时(Wh)和碳排放量(tCO2eq)。对于瓦时,我们使用以下公式:

Wh = GPU-h×(GPU功耗)×PUE,其中我们将功耗使用效率(PUE)设置为1.1。产生的碳排放量取决于用于训练网络的数据中心的位置。例如,BLOOM使用的电网排放0.057千克CO2eq/KWh,导致27 tCO2eq;OPT使用的电网排放0.231千克CO2eq/KWh,导致82 tCO2eq。在本研究中,我们感兴趣的是比较在相同的数据中心中训练这些模型的碳排放成本。因此,我们不考虑数据中心的位置,而是使用美国国家平均碳强度因子为0.385千克CO2eq/KWh。这导致以下碳排放量的公式:

tCO2eq = MWh × 0.385.

We apply the same formula to OPT and BLOOM for fair comparison. For OPT, we assume training required 34 days on 992 A100-80GB (see their logs). Finally, we estimate that we used 2048 A100-80GB for a period of approximately 5 months to develop our models. This means that developing these models would have cost around 2,638 MWh under our assumptions, and a total emission of 1,015 tCO2eq. We hope that releasing these models will help to reduce future carbon emission since the training is already done, and some of the models are relatively small and can be run on a single GPU.

为了公平比较,我们对OPT和BLOOM应用相同的公式。对于OPT,我们假设其训练在992块A100-80GB上进行了34天(参见其公开日志)。最后,我们估计自己用2048块A100-80GB进行了约5个月的模型开发。这意味着在我们的假设下,开发这些模型的能耗约为2,638 MWh,总排放量约为1,015 tCO2eq。我们希望发布这些模型有助于减少未来的碳排放,因为训练已经完成,而且其中一些模型相对较小,可以在单个GPU上运行。
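按文中的两个公式可以快速复算一遍量级,下面的小脚本即为这样的示意计算;其中A100-80GB功耗取400W、训练时长按5个月折算,均为假设值,因此结果只与文中报告的2,638 MWh / 1,015 tCO2eq在量级上相当。

```python
# 按 Wh = GPU-h × GPU功耗 × PUE 与 tCO2eq = MWh × 0.385 做粗略估算
GPU_COUNT = 2048
HOURS = 5 * 30 * 24            # 约5个月(假设值)
GPU_POWER_W = 400              # 假设:A100-80GB约400W
PUE = 1.1

mwh = GPU_COUNT * HOURS * GPU_POWER_W * PUE / 1e6   # Wh 换算为 MWh
tco2eq = mwh * 0.385                                # 美国平均碳强度因子
print(f"{mwh:.0f} MWh, {tco2eq:.0f} tCO2eq")        # 与文中报告值量级相当
```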

7 Related work相关工作

Language models are probability distributions over sequences of words, tokens or characters (Shannon, 1948, 1951). This task, often framed as next token prediction, has long been considered a core problem in natural language processing (Bahl et al., 1983; Brown et al., 1990). Because Turing (1950) proposed to measure machine intelligence by using language through the “imitation game”, language modeling has been proposed as a benchmark to measure progress toward artificial intelligence (Mahoney, 1999).

Architecture. Traditionally, language models were based on n-gram count statistics (Bahl et al., 1983), and various smoothing techniques were proposed to improve the estimation of rare events (Katz, 1987; Kneser and Ney, 1995). In the past two decades, neural networks have been successfully applied to the language modelling task, starting from feed forward models (Bengio et al., 2000), recurrent neural networks (Elman, 1990; Mikolov et al., 2010) and LSTMs (Hochreiter and Schmidhuber, 1997; Graves, 2013). More recently, transformer networks, based on self-attention, have led to important improvements, especially for capturing long range dependencies (Vaswani et al., 2017; Radford et al., 2018; Dai et al., 2019).

语言模型是对单词、标记或字符序列的概率分布(Shannon,1948年,1951年)。这个任务通常被定义为下一个标记预测,并且长期以来一直被视为自然语言处理中的核心问题(Bahl等人,1983年;Brown等人,1990年)。自从Turing(1950年)提出通过使用语言来衡量机器智能以来,语言建模已被提出作为衡量人工智能进展的基准(Mahoney,1999年)。

架构。传统上,语言模型基于n-gram计数统计(Bahl等人,1983年),并提出了各种平滑技术来改善对罕见事件的估计(Katz,1987年;Kneser和Ney,1995年)。在过去的二十年中,神经网络已成功应用于语言建模任务,从前馈模型(Bengio等人,2000年),循环神经网络(Elman,1990年;Mikolov等人,2010年)和LSTM(Hochreiter和Schmidhuber,1997年;Graves,2013年)开始。最近,基于自注意力的Transformer网络在捕捉长距离依赖性方面取得了重要进展(Vaswani等人,2017年;Radford等人,2018年;Dai等人,2019年)。

Scaling. There is a long history of scaling for language models, for both the model and dataset sizes. Brants et al. (2007) showed the benefits of using language models trained on 2 trillion tokens, resulting in 300 billion n-grams, on the quality of machine translation. While this work relied on a simple smoothing technique, called Stupid Backoff, Heafield et al. (2013) later showed how to scale Kneser-Ney smoothing to Web-scale data. This allowed to train a 5-gram model on 975 billions tokens from CommonCrawl, resulting in a model with 500 billions n-grams (Buck et al., 2014). Chelba et al. (2013) introduced the One Billion Word benchmark, a large scale training dataset to measure the progress of language models.

规模化。语言模型的规模化有着悠久的历史,包括模型和数据集的规模。Brants等人(2007年)展示了使用训练在2万亿标记上的语言模型的好处,从而产生3000亿个n-gram,提高了机器翻译的质量。虽然这项工作依赖于称为Stupid Backoff的简单平滑技术,但Heafield等人(2013年)随后展示了如何将Kneser-Ney平滑技术扩展到Web规模的数据上。这使得可以在CommonCrawl的9750亿个标记上训练一个5-gram模型,得到了5000亿个n-gram的模型(Buck等人,2014年)。Chelba等人(2013年)引入了十亿词基准测试数据集,用于衡量语言模型的进展。

In the context of neural language models, Jozefowicz et al. (2016) obtained state-of-the-art results on the Billion Word benchmark by scaling LSTMs to 1 billion parameters. Later, scaling transformers lead to improvement on many NLP tasks. Notable models include BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), Megatron-LM (Shoeybi et al., 2019), and T5 (Raffel et al., 2020). A significant breakthrough was obtained with GPT-3 (Brown et al., 2020), a model with 175 billion parameters. This lead to a series of Large Language Models, such as Jurassic-1 (Lieber et al., 2021), Megatron-Turing NLG (Smith et al., 2022), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022), OPT (Zhang et al., 2022), and GLM (Zeng et al., 2022). Hestness et al. (2017) and Rosenfeld et al. (2019) studied the impact of scaling on the performance of deep learning models, showing the existence of power laws between the model and dataset sizes and the performance of the system. Kaplan et al. (2020) derived power laws specifically for transformer based language models, which were later refined by Hoffmann et al. (2022), by adapting the learning rate schedule when scaling datasets. Finally, Wei et al. (2022) studied the effect of scaling on the abilities of large language models.

在神经语言模型的背景下,Jozefowicz等人(2016年)通过将LSTM扩展到10亿个参数,在十亿词基准测试上取得了最先进的结果。后来,通过扩展Transformer模型,在许多自然语言处理任务上取得了改进。值得注意的模型包括BERT(Devlin等人,2018年)、GPT-2(Radford等人,2019年)、Megatron-LM(Shoeybi等人,2019年)和T5(Raffel等人,2020年)。GPT-3(Brown等人,2020年)是一个具有1750亿个参数的模型,取得了重大突破。这引出了一系列大语言模型,如Jurassic-1(Lieber等人,2021年)、Megatron-Turing NLG(Smith等人,2022年)、Gopher(Rae等人,2021年)、Chinchilla(Hoffmann等人,2022年)、PaLM(Chowdhery等人,2022年)、OPT(Zhang等人,2022年)和GLM(Zeng等人,2022年)。Hestness等人(2017年)和Rosenfeld等人(2019年)研究了规模化对深度学习模型性能的影响,展示了模型和数据集规模与系统性能之间存在的幂律关系。Kaplan等人(2020年)专门针对基于Transformer的语言模型推导出了幂律关系,后来由Hoffmann等人(2022年)通过在扩展数据集时调整学习率计划进行了改进。最后,Wei等人(2022年)研究了规模化对大型语言模型能力的影响。

8 Conclusion结论

In this paper, we presented a series of language models that are released openly, and competitive with state-of-the-art foundation models. Most notably, LLaMA-13B outperforms GPT-3 while being more than 10× smaller, and LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B. Unlike previous studies, we show that it is possible to achieve state-of-the-art performance by training exclusively on publicly available data, without resorting to proprietary datasets. We hope that releasing these models to the research community will accelerate the development of large language models, and help efforts to improve their robustness and mitigate known issues such as toxicity and bias. Additionally, we observed like Chung et al. (2022) that finetuning these models on instructions lead to promising results, and we plan to further investigate this in future work. Finally, we plan to release larger models trained on larger pretraining corpora in the future, since we have seen a constant improvement in performance as we were scaling.

在本文中,我们介绍了一系列以开放方式发布、并可与最先进的基础模型竞争的语言模型。特别值得注意的是,LLaMA-13B在规模比GPT-3小10倍以上的情况下表现更好,而LLaMA-65B与Chinchilla-70B和PaLM-540B具有竞争力。与以往研究不同,我们证明了仅在公开可用的数据上训练、不依赖专有数据集,也能达到最先进的性能。我们希望向研究社区发布这些模型能加速大型语言模型的发展,并帮助改进其稳健性、缓解有害内容和偏见等已知问题。此外,与Chung等人(2022年)的观察一致,我们发现在指令上微调这些模型能带来有希望的结果,并计划在未来工作中进一步研究。最后,由于我们在扩大规模的过程中观察到性能持续提升,我们计划在未来发布用更大的预训练语料训练的更大模型。

Acknowledgements致谢

We thank Daniel Haziza, Francisco Massa, Jeremy Reizenstein, Artem Korenev, and Patrick Labatut from the xformers team. We thank Susan Zhang and Stephen Roller for their support on data deduplication. We thank Luca Wehrstedt, Vegard Mella, and Pierre-Emmanuel Mazaré for their support on training stability. We thank Shubho Sengupta, Kalyan Saladi, and all the AI infra team for their support. We thank Jane Yu for her input on evaluation. We thank Yongyi Hu for his help on data collection.

我们感谢xformers团队的Daniel Haziza、Francisco Massa、Jeremy Reizenstein、Artem Korenev和Patrick Labatut。感谢Susan Zhang和Stephen Roller在数据去重方面的支持。感谢Luca Wehrstedt、Vegard Mella和Pierre-Emmanuel Mazaré在训练稳定性方面的支持。感谢Shubho Sengupta、Kalyan Saladi和所有AI基础设施团队的支持。感谢Jane Yu在评估方面的贡献。感谢Yongyi Hu在数据收集方面的帮助。
