Llama Source Code Walkthrough: tokenizer.py

llama/llama/tokenizer.py at main · meta-llama/llama · GitHub

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

import os
from logging import getLogger
from typing import List

from sentencepiece import SentencePieceProcessor


logger = getLogger()


class Tokenizer:
    """tokenizing and encoding/decoding text using SentencePiece."""
    def __init__(self, model_path: str):
        """
        Initializes the Tokenizer with a SentencePiece model.

        Args:
            model_path (str): The path to the SentencePiece model file.
        """
        # reload tokenizer
        assert os.path.isfile(model_path), model_path
        self.sp_model = SentencePieceProcessor(model_file=model_path)
        logger.info(f"Reloaded SentencePiece model from {model_path}")

        # BOS / EOS token IDs
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.pad_id: int = self.sp_model.pad_id()
        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )
        assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        """
        Encodes a string into a list of token IDs.

        Args:
            s (str): The input string to be encoded.
            bos (bool): Whether to prepend the beginning-of-sequence token.
            eos (bool): Whether to append the end-of-sequence token.

        Returns:
            List[int]: A list of token IDs.
        """
        assert type(s) is str
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

    def decode(self, t: List[int]) -> str:
        """
        Decodes a list of token IDs into a string.

        Args:
            t (List[int]): The list of token IDs to be decoded.

        Returns:
            str: The decoded string.
        """
        return self.sp_model.decode(t)

Claude

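I will explain this code line by line, for teaching programming and AI. It is a Python class named Tokenizer that uses a SentencePiece model to tokenize text and to encode and decode it.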

class Tokenizer:
    """tokenizing and encoding/decoding text using SentencePiece."""

These two lines define a class named Tokenizer and give it a docstring stating that the class uses SentencePiece to tokenize, encode, and decode text.

    def __init__(self, model_path: str):
        """
        Initializes the Tokenizer with a SentencePiece model.

        Args:
            model_path (str): The path to the SentencePiece model file.
        """

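This block defines the constructor, __init__, which takes a single argument, model_path: the path to the SentencePiece model file. Its docstring describes what the constructor does and what the argument means.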

        # reload tokenizer
        assert os.path.isfile(model_path), model_path
        self.sp_model = SentencePieceProcessor(model_file=model_path)
        logger.info(f"Reloaded SentencePiece model from {model_path}")

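The comment says that the tokenizer is being reloaded. The assert statement checks that model_path points to an existing file; if it does not, an AssertionError is raised with model_path as its message. The next line creates a SentencePieceProcessor, stored as self.sp_model, initialized from the model file at model_path. Finally, logger.info records that the SentencePiece model was loaded from that path.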
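To see what a failed check looks like in practice, here is a minimal sketch that repeats the same assert outside the class. The helper name check_model_path and the path are hypothetical and exist only for illustration. Note that Python removes assert statements when run with the -O flag, so this check works as a development-time guard rather than as robust validation.

```python
import os


def check_model_path(model_path: str) -> None:
    # Same check as Tokenizer.__init__: the expression after the comma
    # becomes the AssertionError message when the condition is false.
    assert os.path.isfile(model_path), model_path


# Hypothetical path that is assumed not to exist.
missing = os.path.join("no", "such", "dir", "tokenizer.model")
try:
    check_model_path(missing)
    message = None
except AssertionError as err:
    message = str(err)

print(message)
```

The printed message is the path itself, which is why the original code passes model_path as the second operand of assert.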

        # BOS / EOS token IDs
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.pad_id: int = self.sp_model.pad_id()
        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )
        assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()

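The comment marks the section that deals with the special token IDs, BOS (beginning of sequence) and EOS (end of sequence). self.n_words is set to the vocabulary size of the SentencePiece model, with the annotation int declaring its type. self.bos_id, self.eos_id, and self.pad_id are set to the model's BOS, EOS, and padding token IDs. The logger.info call records the vocabulary size, the BOS ID, and the EOS ID; pad_id is stored but not logged. The final assert checks that the vocabulary size reported by vocab_size() equals the number of pieces reported by get_piece_size().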
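One detail worth checking when using these IDs: SentencePiece reports -1 from pad_id() when the model defines no padding token, so pad_id is not necessarily a valid vocabulary index. The sketch below pads a batch of token ID lists to a common length. The function pad_batch and the token IDs are made up for illustration and are not part of tokenizer.py.

```python
from typing import List


def pad_batch(batch: List[List[int]], pad_id: int) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


# Illustrative token IDs only; pad_id=-1 mimics a model without a padding token.
batch = [[1, 10, 11], [1, 10, 11, 12, 2]]
padded = pad_batch(batch, pad_id=-1)
print(padded)
```

Because -1 is not a valid vocabulary index, padded positions would have to be masked out or replaced before the IDs are used to index an embedding table.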

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        """
        Encodes a string into a list of token IDs.

        Args:
            s (str): The input string to be encoded.
            bos (bool): Whether to prepend the beginning-of-sequence token.
            eos (bool): Whether to append the end-of-sequence token.

        Returns:
            List[int]: A list of token IDs.
        """
        assert type(s) is str
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

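This block defines the encode method, which converts a string into a list of token IDs. It takes three arguments: s, the input string to encode; bos, whether to prepend the BOS token; and eos, whether to append the EOS token. It returns a list of integers, the encoded token IDs, and its docstring describes the purpose, arguments, and return value. In the body, the assert checks that s is of type str. self.sp_model.encode then encodes s into the list of token IDs t. If bos is True, the BOS token ID is prepended to t; if eos is True, the EOS token ID is appended. Finally, t is returned.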
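The effect of the bos and eos flags can be sketched without a real model file. ToyProcessor below is a hypothetical stand-in for SentencePieceProcessor that simply splits on whitespace; its vocabulary and ID layout are assumptions of this example, not the behavior of a trained SentencePiece model. The free function encode mirrors the body of Tokenizer.encode.

```python
from typing import List


class ToyProcessor:
    """Hypothetical stand-in for SentencePieceProcessor (whitespace split)."""

    def __init__(self, words: List[str]):
        # IDs 0, 1 and 2 are reserved here for <unk>, BOS and EOS.
        self._piece_to_id = {word: i + 3 for i, word in enumerate(words)}

    def bos_id(self) -> int:
        return 1

    def eos_id(self) -> int:
        return 2

    def encode(self, s: str) -> List[int]:
        # Unknown words map to ID 0.
        return [self._piece_to_id.get(word, 0) for word in s.split()]


def encode(sp: ToyProcessor, s: str, bos: bool, eos: bool) -> List[int]:
    # Mirrors the body of Tokenizer.encode.
    assert type(s) is str
    t = sp.encode(s)
    if bos:
        t = [sp.bos_id()] + t
    if eos:
        t = t + [sp.eos_id()]
    return t


sp = ToyProcessor(["hello", "world"])
plain = encode(sp, "hello world", bos=False, eos=False)
wrapped = encode(sp, "hello world", bos=True, eos=True)
print(plain)    # [3, 4]
print(wrapped)  # [1, 3, 4, 2]
```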

    def decode(self, t: List[int]) -> str:
        """
        Decodes a list of token IDs into a string.

        Args:
            t (List[int]): The list of token IDs to be decoded.

        Returns:
            str: The decoded string.
        """
        return self.sp_model.decode(t)

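This block defines the decode method, which converts a list of token IDs back into a string. It takes one argument, t, the list of token IDs to decode, and returns the decoded string; the docstring describes the purpose, argument, and return value. The body is a single line that calls self.sp_model.decode on t and returns the result.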
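As a rough illustration of the reverse direction, the sketch below decodes IDs with a toy lookup table. The table, the IDs, and the choice to skip BOS and EOS are all assumptions of this example; how the real SentencePieceProcessor.decode treats special tokens should be checked against the actual model.

```python
from typing import Dict, List

BOS_ID, EOS_ID = 1, 2  # illustrative IDs for the toy vocabulary
ID_TO_WORD: Dict[int, str] = {3: "hello", 4: "world"}


def toy_decode(t: List[int]) -> str:
    # Assumption of this sketch: BOS and EOS contribute no text.
    return " ".join(ID_TO_WORD[i] for i in t if i not in (BOS_ID, EOS_ID))


decoded = toy_decode([BOS_ID, 3, 4, EOS_ID])
print(decoded)  # hello world
```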

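In summary, the Tokenizer class tokenizes, encodes, and decodes text with a SentencePiece model. Its constructor takes the path to a SentencePiece model file and loads the model during initialization. The encode method converts a string into a list of token IDs and can optionally add the special BOS and EOS tokens at the start and end of the sequence. The decode method converts a list of token IDs back into a string.

The class can serve as the text preprocessing and postprocessing step in a variety of natural language processing tasks.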

gpt4-v

The code defines a Tokenizer class that tokenizes text with a SentencePiece model.

class Tokenizer:

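This line begins the definition of a class named Tokenizer. In Python, a class is a blueprint for creating objects; it bundles data together with the functions that operate on that data.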

    """tokenizing and encoding/decoding text using SentencePiece."""

This is a docstring that describes the class. It states that Tokenizer is meant to tokenize, encode, and decode text using SentencePiece.

    def __init__(self, model_path: str):

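This line defines the class's initializer, __init__, which is called automatically whenever a new object of the class is created. It receives self (a reference to the instance being initialized) and model_path, a string giving the path to the SentencePiece model file.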

        # reload tokenizer
        assert os.path.isfile(model_path), model_path

The assert ensures that model_path refers to an existing file. If it does not, the program raises an AssertionError whose message is the path.

        self.sp_model = SentencePieceProcessor(model_file=model_path)

This line creates a SentencePieceProcessor object, which loads the SentencePiece model at the given path and provides the operations on it.

        logger.info(f"Reloaded SentencePiece model from {model_path}")

The logger records that the SentencePiece model has been loaded from the given path.
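Whether this message is actually emitted depends on the logging configuration. getLogger() without a name returns the root logger, whose default level is WARNING, so an info message is dropped unless the level is lowered. The sketch below demonstrates this with a small capturing handler; ListHandler is a hypothetical helper written for this example.

```python
import logging

records = []


class ListHandler(logging.Handler):
    """Hypothetical handler that stores formatted messages in a list."""

    def emit(self, record: logging.LogRecord) -> None:
        records.append(record.getMessage())


logger = logging.getLogger()        # the root logger, as in tokenizer.py
logger.addHandler(ListHandler())

logger.setLevel(logging.WARNING)    # the root logger's default level
logger.info("dropped")              # below WARNING, so it is ignored

logger.setLevel(logging.INFO)
logger.info("kept")                 # now reaches the handler

print(records)  # ['kept']
```

In a script, calling logging.basicConfig(level=logging.INFO) before constructing the Tokenizer is one way to make these messages visible.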

        # BOS / EOS token IDs
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.pad_id: int = self.sp_model.pad_id()

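These lines initialize several attributes: the vocabulary size (n_words), the beginning-of-sequence token ID (bos_id), the end-of-sequence token ID (eos_id), and the padding token ID (pad_id).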

        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )

The logger is used again, this time to record the vocabulary size and the BOS and EOS token IDs.

        assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()

This line asserts that the vocabulary size equals the number of pieces in the model. If the two differ, an AssertionError is raised.

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:

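This defines a method named encode, which turns the string s into a list of integers. The parameters bos and eos are booleans that control whether the beginning-of-sequence token is prepended and whether the end-of-sequence token is appended.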

        assert type(s) is str

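This asserts that the argument s is a string. Because the check is type(s) is str rather than isinstance(s, str), it requires the exact type str and rejects instances of str subclasses.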
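The difference between an exact type check and isinstance can be seen with a small example. Prompt is a hypothetical str subclass defined only for this illustration.

```python
class Prompt(str):
    """Hypothetical str subclass, used only to illustrate the check."""


s = Prompt("hello")
exact_match = type(s) is str       # the check used in encode
subclass_ok = isinstance(s, str)   # the more permissive alternative
print(exact_match, subclass_ok)    # False True
```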

        t = self.sp_model.encode(s)

The SentencePiece model encodes the string s into a list of token IDs, t.

        if bos:
            t = [self.bos_id] + t

If bos is true, the beginning-of-sequence token ID is added to the front of the list t.

        if eos:
            t = t + [self.eos_id]

If eos is true, the end-of-sequence token ID is added to the end of the list t.
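Both branches use list concatenation with +, which builds a new list rather than modifying the existing one in place. A minimal sketch, with made-up IDs standing in for the values that sp_model.encode, bos_id, and eos_id would supply:

```python
encoded = [3, 4]   # stands in for the list returned by sp_model.encode
t = encoded
t = [1] + t        # prepend an illustrative BOS ID; creates a new list
t = t + [2]        # append an illustrative EOS ID; creates a new list
print(encoded)     # [3, 4]
print(t)           # [1, 3, 4, 2]
```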

        return t

The encoded list of token IDs is returned.

    def decode(self, t: List[int]) -> str:

This defines a method named decode, which converts a list of integers t back into a string.

        return self.sp_model.decode(t)

The SentencePiece model decodes the list of token IDs t into a string, and the method returns that string.