用TensorRT-LLM进行LLama的推理和部署

Deploy an AI Coding Assistant with NVIDIA TensorRT-LLM and NVIDIA Triton | NVIDIA Technical Blog
Quick Start Guide — tensorrt_llm documentation (nvidia.github.io)

使用TensorRT-LLM的源码，来下载docker并在docker里编译TensorRT-LLM；

模型格式先Huggingface转为FasterTransformer；再用TensorRT-LLM将其compile为TensorRT engine；然后可用TensorRT-LLM的C++ runtime来跑推理（或者模型放到Triton Repo上，并指定TensorRT-LLM为backend）

Input的Tokenizing和Output的De-Tokenizing，视作前处理、后处理，创建"Python Model"；整个流程用一个"Ensemble Model"来表示，包含以上两个"Model"以及真正的GPT-Model;

Best Practices for Tuning the Performance of TensorRT-LLM — tensorrt_llm documentation (nvidia.github.io)

LLama:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md

TensorRT-LLM支持很多常用模型；例如：baichuan、internlm、chatglm、qwen、bloom、gpt、gptneox、llama;

convert_checkpoint.py，是每种模型用自己的；run.py，是所有模型共享；

每种模型，支持的技术完善程度不同。

支持LLama的以下功能：

FP16 FP8 INT8 & INT4 Weight-Only SmoothQuant Groupwise quantization (AWQ/GPTQ) FP8 KV CACHE INT8 KV CACHE (+ AWQ/per-channel weight-only) Tensor Parallel STRONGLY TYPED

python convert_checkpoint.py

--tp_size 4 // Tensor-parallel

--pp_size 4 // Pipeline-parallel

Pipeline并行，在某一个GPU忙碌时，其他GPU是否在忙着处理别的batch?

量化相关：

Numerical Precision — tensorrt_llm documentation (nvidia.github.io)

9种量化，对每种模型只支持一部分：

Model

FP32

FP16

BF16

FP8

W8A8 SQ

W8A16

W4A16

W4A16 AWQ

W4A16 GPTQ

Baichuan

BERT

ChatGLM

ChatGLM-v2

ChatGLM-v3

GPT

GPT-NeMo

GPT-NeoX

InternLM

LLaMA

LLaMA-v2

LLaMA-v3

Qwen

W8A16、W4A16:

Activation都是FP16(或BF16); Weight是INT8、INT4，在计算前反量化为FP16(或BF16)，FP16*FP16-->FP16;

只是使显卡里塞入了size更大的模型；

并没有加快计算（反而因为dequantize weight从INT到FP16，变慢些）

SmoothQuant: (W8A8)

--smoothquant 0.5

惯例做法，是对Activation的行(Token)和Weight的列(Output channel)，进行量化；

观察到的现象：weights矩阵，没有尖刺；activation矩阵，某几列(channel)是尖刺，而且明显能区分尖刺列和非尖刺列，尖刺列所有行(token)的值都大，非尖刺列所有行的值都小；

如果按照Activation的列进行量化，Gemm矩阵乘法不支持；

解决方案：对Activation的“尖刺”列，缩小N倍，对Weight的相应行，增大N倍；二者仍分别用老的Per-Token、Per-Channel来量化；

--gemm_plugin int8 : 使用指定的dtype去计算矩阵乘法，用的是加速库；

--gpt_attention_plugin int8 (默认开启)：优化key-value cache；"use of efficient CUDA kernels for computing attention scores and values, reducing computation and memory overhead compared to the standard implementation." 看不懂："It allows in-place update of the key-value (KV) cache used for attending to previous tokens, eliminating the need for explicit concatenation operations and further reducing memory consumption"

--remove_input_padding (默认开启)

input batch里，较短句子们，末尾的padding，在正常推理阶段被浪费了。

优化：使用别的句子（下一个batch的），填充这些padding；

--paged_kv_cache (默认开启)

把一部分放不下的keys、values，换出到CPU memory，用的时候再换入；

有选项可以配置kv-cache最大占用的GPU memory比例，建议设为0.95

--context_fmha (默认开启)(用了FlashAttention)

"Enabling the fused multi-head attention, during the context phase, will trigger a kernel that performs the MHA/MQA/GQA block using a single kernel"

"if context_fmha is set to a enabled or enabled_with_fp32_acc (accumulation in the first batched matmul is forced to FP32), that function will trigger a kernel that performs the MHA/MQA block using a single kernel. For short sequences, that kernel uses a vanilla implementation of MHA/MQA. For larger sequences, this kernel uses the Flash Attention algorithm as described in FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness and FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning."

LLM推理，分为context阶段和generate阶段；context阶段，用一个融合的kernel去执行MHA/MQA/GQA；

--use_fused_mlp

适用于Gated MLP层(将mlp-mixer跟门控机制结合起来。即将输入沿着特征维度分为两半，然后将其中一半传入mlp-mixer，作为另一半的gate)；原本是计算gate是一个矩阵乘法，MLP是一个矩阵乘法；这个优化把2个矩阵乘法融合为1个矩阵乘法；

--multi_block_mode

batch_size * heads_count，小于GPU Stream Multiprocessor数目的一半时，且context input tokens较长(>1000)，则使用这个，可以增加GPU SM的利用率。（似乎是每个SM负责解决1个sample和1个head的乘法，同时无法利用所有SM时，就把其他token的计算也并行？）

类似资料：Flash-Decoding for Long-Context Inference | Princeton NLP Group (princeton-nlp.github.io)

--use_paged_context_fmha

在"--context_fmha"的基础上，允许context kv-cache在GPU和CPU memory之间offload；适合长input context的推理；

Memory footprint计算：

Note that if engine is built with contiguous KV cache (i.e., without the flag --paged_kv_cache), you may need to reduce the max batch size (--max_batch_size) to fit the whole model and the KV cache in the GPU memory. The ballpark estimate for runtime memory consumption is given by

Total memory = (Model size + KV cache size + Activation memory) / Parallelism

where

The model size is the number of parameters * the size of data type. The KV cache size is the total number of tokens * the size of KV cache data type * the number of layers * the KV hidden dimension The activation memory is determined by TRT engine, which can be a few GBs regardless of the degree of parallelism used

For LLaMA v2 70B FP16 weights + FP8 KV cache, the model size is 70B parameters * 2 bytes = 140GB. The KV cache size is 32K tokens * 1 bytes * 80 layers * 2048 KV hidden dimension = 5GB per 32K tokens. We have 145GB spread across 8 GPUs. The end result is ~18GB per GPU plus some GBs of flat scratch/activation memory allocated by TRT engine and the TRT-LLM runtime.

Note that the KV hidden dimension is derived by the number of KV heads times hidden dimension of each head. LLaMA v2 70B has hidden dimension of 8192, and uses grouped-query attention where 8 key heads and 8 value heads are associated with 64 query heads. Each head has hidden dimension of 8192/64 = 128. So the hidden dimension for KV in total is 128 * 8 * 2 = 2048.

The total number of tokens is determined by beam width, batch size, and maximum sequence length.

LLama2-70B使用了Grouped Query Attention:

减少了显存占用量；从activation乘以变换矩阵，计算得到Key和Value，只计算N组，减少了计算量；

--int8_kv_cache

KV cache使用INT8量化，来存放；节约显存；

会使用一部分输入数据，来试跑(calibrate the model)；从而得到Key、Value的取值范围，拿到Scaling factor；

For example, to build LLaMA 70B for 2 nodes with 8 GPUs per node, we can use 8-way tensor parallelism and 2-way pipeline parallelism:

python convert_checkpoint.py --model_dir ./tmp/llama/70B/hf/ \
                            --output_dir ./tllm_checkpoint_16gpu_tp8_pp2 \
                            --dtype float16 \
                            --tp_size 8 \
                            --pp_size 2

trtllm-build --checkpoint_dir ./tllm_checkpoint_16gpu_tp8_pp2 \
            --output_dir ./tmp/llama/70B/trt_engines/fp16/16-gpu/ \
            --workers 8 \   #启动8个后台线程同时build
            --gemm_plugin auto

跑多个LoRA ckpt: (有编号，-1表示原始model，0表示luotuo那个，1表示Japanese那个）

git-lfs clone https://huggingface.co/qychen/luotuo-lora-7b-0.1
git-lfs clone https://huggingface.co/kunishou/Japanese-Alpaca-LoRA-7b-v0
BASE_LLAMA_MODEL=llama-7b-hf/

python convert_checkpoint.py --model_dir ${BASE_LLAMA_MODEL} \
                            --output_dir ./tllm_checkpoint_1gpu \
                            --dtype float16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu \
            --output_dir /tmp/llama_7b_with_lora_qkv/trt_engines/fp16/1-gpu/ \
            --gemm_plugin auto \
            --lora_plugin auto \
            --max_batch_size 8 \
            --max_input_len 512 \
            --max_output_len 50 \
            --lora_dir  "luotuo-lora-7b-0.1/" "Japanese-Alpaca-LoRA-7b-v0/" \
            --max_lora_rank 8 \
            --lora_target_modules attn_q attn_k attn_v

python ../run.py --engine_dir "/tmp/llama_7b_with_lora_qkv/trt_engines/fp16/1-gpu/" \
              --max_output_len 10 \
              --tokenizer_dir ${BASE_LLAMA_MODEL} \
              --input_text "美国的首都在哪里? \n答案:" "美国的首都在哪里? \n答案:" "美国的首都在哪里? \n答案:" "アメリカ合衆国の首都はどこですか? \n答え:" "アメリカ合衆国の首都はどこですか? \n答え:" "アメリカ合衆国の首都はどこですか? \n答え:" \
              --lora_task_uids -1 0 1 -1 0 1 \
              --use_py_session --top_p 0.5 --top_k 0

Streaming LLM: (可以允许无限长度）

--streamingllm enable

--max_attention_window_size=2048

做多向前看多少个token的attention

总结

### 文章总结：部署NVIDIA TensorRT-LLM与Triton的AI编码助手
#### 1. 部署流程
- **下载与编译**：使用TensorRT-LLM的源码，下载Docker并在Docker中编译TensorRT-LLM。
- **模型转换**：将模型从Huggingface格式转换为FasterTransformer，再用TensorRT-LLM将其编译为TensorRT engine。
- **推理运行**：使用TensorRT-LLM的C++ runtime进行推理，或将模型部署到Triton Repo上，并指定TensorRT-LLM为后端。
- **前后处理**：将输入Tokenizing和输出De-Tokenizing作为前处理和后处理，创建Python Model，并与GPT-Model一起封装成Ensemble Model。
#### 2. 支持的模型与功能
- **支持的模型**：TensorRT-LLM支持多种常用模型，如baichuan、internlm、chatglm、qwen、bloom、gpt、gptneox、llama等。
- **模型转换工具**：每种模型使用特定的`convert_checkpoint.py`脚本进行转换，而`run.py`是所有模型共享的。
- **LLama特定功能**：支持FP16、FP8、INT8 & INT4 Weight-Only、SmoothQuant、Groupwise quantization (AWQ/GPTQ)、FP8 KV CACHE、INT8 KV CACHE (+ AWQ/per-channel weight-only)、Tensor Parallel、STRONGLY TYPED等。
#### 3. 性能调优最佳实践
- **量化技术**：TensorRT-LLM支持多种量化技术，包括FP32、FP16、BF16、FP8、W8A8 SQ、W8A16、W4A16、W4A16 AWQ、W4A16 GPTQ等，但每种模型支持的量化类型不同。
- **并行与缓存优化**：
- **Tensor Parallel** 和 **Pipeline Parallel**：通过`--tp_size`和`--pp_size`参数设置。
- **Pipeline并行**：在GPU空闲时，其他GPU可以处理其他batch。
- **KV Cache**：支持FP8和INT8 KV CACHE，以及分页缓存（`--paged_kv_cache`）以节省GPU内存。
- **内存占用计算**：通过模型大小、KV缓存大小和激活内存来计算总内存占用。
#### 4. 特定优化选项
- **SmoothQuant**：通过调整量化参数（如`--smoothquant 0.5`）来优化权重和激活的量化。
- **矩阵乘法插件**：使用`--gemm_plugin int8`和`--gpt_attention_plugin int8`来优化矩阵乘法和注意力计算。
- **输入填充优化**：`--remove_input_padding`优化输入batch中的padding。
- **Flash Attention**：通过`--context_fmha`启用Flash Attention算法，优化长序列的推理。
- **其他优化**：如`--use_fused_mlp`、`--multi_block_mode`、`--use_paged_context_fmha`等，分别针对MLP层、GPU SM利用率和长输入context的推理进行优化。
#### 5. 示例与高级应用
- **多LoRA ckpt运行**：展示如何同时运行多个LoRA checkpoint，包括原始模型和多个LoRA模型。
- **Streaming LLM**：支持无限长度的输入，通过`--streamingllm enable`和`--max_attention_window_size`参数启用。
#### 6. 注意事项
- **内存与显存管理**：需要仔细计算和管理模型、KV缓存和激活的内存占用，以确保在GPU上顺利运行。
- **量化与并行配置**：根据具体模型和硬件环境，合理配置量化类型和并行参数，以达到最佳性能。
通过以上步骤和最佳实践，可以高效地部署和优化NVIDIA TensorRT-LLM与Triton的AI编码助手，以支持各种大型语言模型的推理任务。