Deploying Faster Whisper on Jetson

Contents

Whisper; Faster Whisper; Installation and usage; Attempting a WSL deployment; Attempting a Jetson deployment; Timestamps; Real-time transcription

Whisper

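Whisper is a general-purpose speech recognition model. It is trained on a large and diverse audio dataset, and it is also a multitask model that can perform multilingual speech recognition, speech translation, and language identification.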

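Test: use ChatTTS to generate a speech clip of the following sentence: 四川美食确实以辣闻名，但也有不辣的选择。比如甜水面、赖汤圆、蛋烘糕、叶儿粑等，这些小吃口味温和，甜而不腻，也很受欢迎。 (Roughly: "Sichuan food is indeed famous for being spicy, but there are non-spicy options too, such as sweet water noodles, Lai tangyuan, egg pancakes, and ye'er ba; these snacks are mild, sweet without being cloying, and very popular.")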

$ pip install -U openai-whisper
$ sudo apt update && sudo apt install ffmpeg
$ pip install setuptools-rust

$ whisper ../audio.wav --model tiny
100%|█████████████████████████████████████| 72.1M/72.1M [00:36<00:00, 2.08MiB/s]
/home/jetson/.local/lib/python3.8/site-packages/whisper/__init__.py:146: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(fp, map_location=device)
/home/jetson/.local/lib/python3.8/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Chinese
[00:00.000 --> 00:03.680] 四川美時確實以辣文明 但以有不辣的選擇
[00:03.680 --> 00:07.200] 比如潛水面 賴湯圓 再轟高夜熱八等
[00:07.200 --> 00:11.560] 這些小市口維溫和 然後甜而不膩也很受歡迎

This ran on the CPU (see the "FP16 is not supported on CPU" warning above); the GPU never even broke a sweat.
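A quick way to confirm which device PyTorch can actually use is sketched below. This is a minimal check, assuming PyTorch is the backend behind the warning above (it is raised from torch-based code); the helper name describe_torch_device is made up for this illustration.

```python
def describe_torch_device():
    """Report whether the installed PyTorch build can see a CUDA device."""
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Index 0 is the first visible GPU
        return "cuda: " + torch.cuda.get_device_name(0)
    return "cpu only"


print(describe_torch_device())
```

If this reports "cpu only", the installed torch wheel was most likely built without CUDA support, which would explain the FP32 fallback.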

Faster Whisper

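faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, a fast inference engine for Transformer models.

FunASR has one big problem: its real-time transcription runs on the CPU and is slow, while its GPU path only supports offline speech-to-text and cannot run in real time. I then found faster-whisper, which supports real-time transcription on the GPU and also handles Chinese.

Faster-Whisper real-time speech-to-text for computer audio; model: faster-whisper-large-v3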

Installation and usage

pip install faster-whisper

from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
# model = WhisperModel(model_size, device="cuda", compute_type="float16")

# or run on GPU with INT8
model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Attempting a WSL deployment

CUDA: 12.6 cuDNN: 9.2

Running it directly fails with: Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory. This means the cuBLAS and cuDNN Python packages are needed:

pip install nvidia-cublas-cu12 nvidia-cudnn-cu12

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
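The one-liner above can also be written as a small script that does not crash when one of the packages is missing. This is only a sketch: the helper name nvidia_lib_dirs is hypothetical, and it assumes the pip packages expose their shared libraries under the nvidia.cublas.lib and nvidia.cudnn.lib modules, as the one-liner does.

```python
import importlib.util
import os


def nvidia_lib_dirs(modules=("nvidia.cublas.lib", "nvidia.cudnn.lib")):
    """Return the directories of the given pip-installed NVIDIA library modules."""
    dirs = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except ImportError:
            # Raised when a parent package (e.g. "nvidia") is not installed
            spec = None
        if spec is None or not spec.submodule_search_locations:
            continue
        dirs.extend(spec.submodule_search_locations)
    return dirs


# Print a value suitable for LD_LIBRARY_PATH (empty if nothing was found)
print(os.pathsep.join(nvidia_lib_dirs()))
```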

But it still would not run, because:

Version 9+ of nvidia-cudnn-cu12 appears to cause issues due its reliance on cuDNN 9 (Faster-Whisper does not currently support cuDNN 9). Ensure your version of the Python package is for cuDNN 8.

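So I just install cuDNN 8, right? I promptly downloaded cuDNN 8 for CUDA 12.x, but every install ended up with cuDNN 9.4, and short of downgrading CUDA there was no way to get back to cuDNN 8.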
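To see which cuDNN major version the dynamic loader can actually find, a small probe like the following may help. It is a sketch under assumptions: libcudnn_ops_infer.so.8 is the name taken from the error message above, while libcudnn.so.9 is my assumed name for the cuDNN 9 main library; the function name loadable_cudnn is made up.

```python
import ctypes

# Shared-object names to probe. The cuDNN 8 name comes from the error message;
# the cuDNN 9 name is an assumption and may differ on your system.
CANDIDATES = {
    "cuDNN 8": "libcudnn_ops_infer.so.8",
    "cuDNN 9": "libcudnn.so.9",
}


def loadable_cudnn(candidates=CANDIDATES):
    """Return {label: True/False} depending on whether each library can be loaded."""
    result = {}
    for label, soname in candidates.items():
        try:
            ctypes.CDLL(soname)
            result[label] = True
        except OSError:
            # Library is missing or not on the loader search path
            result[label] = False
    return result


for label, found in loadable_cudnn().items():
    print(label, "loadable" if found else "not found")
```

Run it in the same shell where LD_LIBRARY_PATH was exported, since the loader search path determines the result.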

Attempting a Jetson deployment

CUDA: 11.4 cuDNN: 8.6.0

Practically tailor-made! First, install faster-whisper and try installing the cuDNN Python package:

$ pip3 install faster-whisper -i https://mirrors.aliyun.com/pypi/simple/

# The README helpfully reminds us: For all these methods below, keep in mind the above note
# regarding CUDA versions. Depending on your setup, you may need to install the
# CUDA 11 versions of libraries that correspond to the CUDA 12 libraries listed
# in the instructions below.

$ pip install --extra-index-url https://pypi.nvidia.com nvidia-cudnn-cu11
...
The installation of nvidia-cudnn-cu11 for version 9.0.0.312 failed.

      This is a special placeholder package which downloads a real wheel package
      from https://pypi.nvidia.com. If https://pypi.nvidia.com is not reachable, we
      cannot download the real wheel file to install.

      You might try installing this package via
      $ pip install --extra-index-url https://pypi.nvidia.com nvidia-cudnn-cu11

      Here is some debug information about your platform to include in any bug
      report:

      Python Version: CPython 3.8.10
      Operating System: Linux 5.10.104-tegra
      CPU Architecture: aarch64
      nvidia-smi command not found. Ensure NVIDIA drivers are installed.

So it turns out nvidia-cudnn-cu11 has no aarch64 (Arm) build! nvidia-cudnn-cu12 does, though.
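Before hunting for wheels, it can save time to print the same platform facts that the failed install reported. The sketch below uses only the standard library; the function name platform_report and the dictionary keys are my own choices.

```python
import platform
import shutil


def platform_report():
    """Collect the platform details that decide which binary wheels can install."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),  # e.g. aarch64 on Jetson, x86_64 on most PCs
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


for key, value in platform_report().items():
    print(f"{key}: {value}")
```

A machine value of aarch64 means only wheels tagged for aarch64 (or pure-Python wheels) will install without building from source.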

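What now, install CUDA 12.2? Jetson systems are flashed offline, and JetPack 6 does support CUDA 12.2 and cuDNN 8.

I was all set to buy a new SSD and reflash, but that is a lot of hassle: set up a virtual machine with the flashing SDK, open the case and move the jumper, reconfigure the SSH network connection, and, above all, spend money!

But how could I not at least try? Stubbornly, I installed the cuDNN Python package for CUDA 12 anyway: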

pip install --extra-index-url https://pypi.nvidia.com nvidia-cudnn-cu12
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting nvidia-cudnn-cu12
  Downloading nvidia_cudnn_cu12-9.4.0.58-py3-none-manylinux2014_aarch64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12 (from nvidia-cudnn-cu12)
  Downloading https://pypi.nvidia.com/nvidia-cublas-cu12/nvidia_cublas_cu12-12.6.1.4-py3-none-manylinux2014_aarch64.whl (376.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 376.7/376.7 MB 12.9 MB/s eta 0:00:00
Downloading nvidia_cudnn_cu12-9.4.0.58-py3-none-manylinux2014_aarch64.whl (572.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 572.7/572.7 MB 1.1 MB/s eta 0:00:00
Installing collected packages: nvidia-cublas-cu12, nvidia-cudnn-cu12
Successfully installed nvidia-cublas-cu12-12.6.1.4 nvidia-cudnn-cu12-9.4.0.58

Run the demo:

test.py
preprocessor_config.json: 100%|████████████████████████████████| 340/340 [00:00<00:00, 118kB/s]
config.json: 100%|█████████████████████████████████████████████| 2.39k/2.39k [00:00<00:00, 1.03MB/s]
vocabulary.json: 100%|█████████████████████████████████████████| 1.07M/1.07M [00:00<00:00, 1.13MB/s]
tokenizer.json: 100%|██████████████████████████████████████████| 2.48M/2.48M [00:01<00:00, 2.14MB/s]
model.bin: 100%|███████████████████████████████████████████████| 3.09G/3.09G [03:18<00:00, 9.89MB/s]
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
  File "/home/jetson/.local/lib/python3.8/site-packages/faster_whisper/transcribe.py", line 145, in __init__
    self.model = ctranslate2.models.Whisper(
ValueError: This CTranslate2 package was not compiled with CUDA support

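What on earth is going on now? A search turns up the issue "This CTranslate2 package was not compiled with CUDA support #1306". Skipping the discussion there and going by the note in the faster-whisper repository: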

Note: Latest versions of ctranslate2 support CUDA 12 only. For CUDA 11, the current workaround is downgrading to the 3.24.0 version of ctranslate2 (This can be done with pip install --force-reinstall ctranslate2==3.24.0 or specifying the version in a requirements.txt).

CUDA 11 causing trouble yet again. The note says to downgrade:

$ pip install --force-reinstall ctranslate2==3.24.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mediapipe 0.8.4 requires opencv-contrib-python, which is not installed.
onnx-graphsurgeon 0.3.12 requires onnx, which is not installed.
d2l 0.17.6 requires numpy==1.21.5, but you have numpy 1.24.4 which is incompatible.
d2l 0.17.6 requires requests==2.25.1, but you have requests 2.32.3 which is incompatible.
faster-whisper 1.0.3 requires ctranslate2<5,>=4.0, but you have ctranslate2 3.24.0 which is incompatible.

Ugh! faster-whisper 1.0.3 requires ctranslate2 >= 4.0, so the downgrade to 3.24.0 conflicts with it.
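The conflict in the last line of the pip output can be checked programmatically. The sketch below is a simplified stand-in for pip's resolver: it assumes plain numeric version strings such as 3.24.0, hard-codes the range from the error message (>=4.0,<5), and uses hypothetical helper names.

```python
from importlib import metadata


def parse_version(text):
    """Turn '3.24.0' into (3, 24, 0). Assumes plain numeric release segments."""
    return tuple(int(part) for part in text.split(".")[:3])


def satisfies_faster_whisper(ct2_version, lower=(4, 0), upper=(5,)):
    """Check a ctranslate2 version against the range ctranslate2<5,>=4.0."""
    return lower <= parse_version(ct2_version) < upper


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("3.24.0 acceptable:", satisfies_faster_whisper("3.24.0"))
print("installed ctranslate2:", installed_version("ctranslate2"))
```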

Let me try building a CUDA-enabled version myself: https://opennmt.net/CTranslate2/installation.html#compile-the-c-library

$ pip3 uninstall ctranslate2 whisper-ctranslate2
$ git clone --recursive https://github.com/OpenNMT/CTranslate2.git
$ cd CTranslate2
$ mkdir build && cd build
$ cmake ..
...
CMake Error at CMakeLists.txt:294 (message):
  Intel OpenMP runtime libiomp5 not found

-- Configuring incomplete, errors occurred!

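Where did Intel come from? It turns out that: By default, the library is compiled with the Intel MKL backend which should be installed separately. See the Build options to select or add another backend. So change the options to drop the Intel dependency:

# Watch closely, this is a single take and I am only doing it once: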
$ cmake .. -DOPENMP_RUNTIME=COMP -DWITH_MKL=OFF -DWITH_CUDA=ON -DWITH_CUDNN=ON
$ make -j32
$ sudo make install
$ sudo ldconfig
$ cd ../python
$ pip install -r install_requirements.txt
$ python setup.py bdist_wheel
$ pip install dist/*.whl

Success at last!
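To confirm that the freshly built wheel really has CUDA support before loading a 3 GB model, a check along these lines can be run. It is a sketch: it relies on ctranslate2.get_cuda_device_count and ctranslate2.get_supported_compute_types from the CTranslate2 Python API, and the wrapper name cuda_device_count is my own.

```python
def cuda_device_count():
    """Return the number of CUDA devices CTranslate2 sees, or None if it is not installed."""
    try:
        import ctranslate2  # imported lazily so the script runs without the package
    except ImportError:
        return None
    return ctranslate2.get_cuda_device_count()


count = cuda_device_count()
if count is None:
    print("ctranslate2 is not installed")
elif count == 0:
    print("ctranslate2 is installed but sees no CUDA device")
else:
    import ctranslate2

    # Lists the compute types usable on the GPU, e.g. to check for int8_float16
    print("CUDA devices:", count)
    print("compute types:", ctranslate2.get_supported_compute_types("cuda"))
```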

Timestamps

from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
# model = WhisperModel(model_size, device="cuda", compute_type="float16")

# or run on GPU with INT8
model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")



segments, _ = model.transcribe("audio.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

[0.00s -> 0.24s] 四
[0.24s -> 0.44s] 川
[0.44s -> 0.58s] 美
[0.58s -> 0.78s] 食
[0.78s -> 1.10s] 确
..
[9.72s -> 9.96s] 腻
[9.96s -> 10.42s] 也
[10.42s -> 10.68s] 很
[10.68s -> 10.82s] 受
[10.82s -> 11.04s] 欢
[11.04s -> 11.22s] 迎
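Word-level timestamps like these are handy for building subtitles. The sketch below groups words into SRT cues; it works on plain (start, end, text) tuples rather than faster-whisper objects, and the function names and the max_chars grouping rule are illustrative assumptions.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_chars=10):
    """Group (start, end, text) tuples into cues of at most max_chars characters."""
    cues, current = [], []
    for start, end, text in words:
        used = sum(len(w[2]) for w in current)
        if current and used + len(text) > max_chars:
            cues.append(current)
            current = []
        current.append((start, end, text))
    if current:
        cues.append(current)

    lines = []
    for index, cue in enumerate(cues, 1):
        lines.append(str(index))
        lines.append(f"{srt_timestamp(cue[0][0])} --> {srt_timestamp(cue[-1][1])}")
        lines.append("".join(w[2] for w in cue))
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)


# First four words from the output above
sample = [(0.00, 0.24, "四"), (0.24, 0.44, "川"), (0.44, 0.58, "美"), (0.58, 0.78, "食")]
print(words_to_srt(sample, max_chars=2))
```

With the transcription loop above, the tuples could be collected as (word.start, word.end, word.word) for each word.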

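Real-time transcription

Whisper-Streaming provides real-time streaming for long-form speech-to-text transcription and translation. From the paper's abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses a local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on an unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in a live transcription service at a multilingual conference.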
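The local agreement idea can be illustrated with a toy version: a token is committed only once two consecutive hypotheses agree on it as part of their common prefix. This is a simplified sketch of that idea on plain token lists, not the code in the whisper_streaming repository, which also handles timestamps and audio buffer trimming; the class and function names are made up.

```python
def common_prefix(a, b):
    """Return the longest common prefix of two token lists."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement2:
    """Commit tokens confirmed by two consecutive hypotheses."""

    def __init__(self):
        self.committed = []  # tokens already emitted, never revised
        self.previous = []   # hypothesis from the previous update

    def update(self, hypothesis):
        """Take the full hypothesis for the audio so far; return newly committed tokens."""
        agreed = common_prefix(self.previous, hypothesis)
        newly = agreed[len(self.committed):]
        self.committed.extend(newly)
        self.previous = list(hypothesis)
        return newly


policy = LocalAgreement2()
print(policy.update(list("四川美食群")))          # nothing confirmed yet
print(policy.update(list("四川美食确实以辣为名")))  # commits the agreed prefix
```

The trade-off is latency for stability: text is shown one update later, but committed text is never retracted.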

$ git clone git@github.com:ufal/whisper_streaming.git
$ cd whisper_streaming
$ python3 whisper_online.py ../audio.wav --language zh --min-chunk-size 1
INFO    Audio duration is: 11.68 seconds
INFO    Loading Whisper large-v2 model for zh...
INFO    done. It took 14.19 seconds.
DEBUG   PROMPT:
DEBUG   CONTEXT:
DEBUG   transcribing 1.00 seconds from 0.00
DEBUG   >>>>COMPLETE NOW: (None, None, '')
DEBUG   INCOMPLETE: (0.0, 0.98, '四川美食群')
DEBUG   len of buffer now: 1.00
DEBUG   ## last processed 1.00 s, now is 5.30, the latency is 4.29
DEBUG   PROMPT:
DEBUG   CONTEXT:
DEBUG   transcribing 5.30 seconds from 0.00
DEBUG   >>>>COMPLETE NOW: (0.0, 0.88, '四川美食')
DEBUG   INCOMPLETE: (0.88, 5.26, '确实以辣为名,但也有不辣的选择,比如甜水面赖淘宝。')
DEBUG   len of buffer now: 5.30
11643.5227 0 880 四川美食
11643.5227 0 880 四川美食
DEBUG   ## last processed 5.30 s, now is 11.64, the latency is 6.35
DEBUG   PROMPT:
DEBUG   CONTEXT: 四川美食
DEBUG   transcribing 11.64 seconds from 0.00
DEBUG   >>>>COMPLETE NOW: (None, None, '')
DEBUG   INCOMPLETE: (0.88, 11.24, '確實以辣聞名,但也有不辣的選擇,比如甜水麵、瀨湯圓、炸烘糕 、葉子粑等,這些小吃口味溫和,然後甜而不膩,也很受歡迎。')
DEBUG   len of buffer now: 11.64
DEBUG   ## last processed 11.64 s, now is 21.61, the latency is 9.96
DEBUG   PROMPT:
DEBUG   CONTEXT: 四川美食
DEBUG   transcribing 11.68 seconds from 0.00
DEBUG   >>>>COMPLETE NOW: (None, None, '')
DEBUG   INCOMPLETE: (0.88, 11.32, '确实以辣闻名,但也有不辣的选择,比如甜水面、赖汤圆、炸烘糕 叶、热巴等,这些小吃口味温和,然后甜而不腻,也很受欢迎。')
DEBUG   len of buffer now: 11.68
DEBUG   ## last processed 21.61 s, now is 31.53, the latency is 9.92
DEBUG   last, noncommited: (0.88, 11.32, '确实以辣闻名,但也有不辣的选择,比如甜水面、赖汤圆、炸烘糕叶、热巴等,这些小吃口味温和,然后甜而不腻,也很受欢迎。')
31528.1091 880 11320 确实以辣闻名,但也有不辣的选择,比如甜水面、赖汤圆、炸烘糕叶、热巴等,这些小吃口味温和,然后甜而不腻,也很受欢迎。
31528.1091 880 11320 确实以辣闻名,但也有不辣的选择,比如甜水面、赖汤圆、炸烘糕叶、热巴等,这些小吃口味温和,然后甜而不腻,也很受欢迎。
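The non-DEBUG lines in this log are the actual transcript output. As I read the format, the columns are the emission time in milliseconds since processing started, the segment start and end in milliseconds, and the text; treat that column meaning as an assumption. A small parser sketch, with hypothetical names, that also drops the repeated lines:

```python
from typing import NamedTuple


class Emission(NamedTuple):
    emitted_ms: float  # when the text was emitted
    start_ms: int      # segment start
    end_ms: int        # segment end
    text: str


def parse_emissions(lines):
    """Extract unique transcript emissions from mixed log output, in order."""
    seen, result = set(), []
    for line in lines:
        line = line.rstrip("\n")
        if not line or not line[0].isdigit():
            continue  # skip INFO/DEBUG lines
        parts = line.split(" ", 3)
        if len(parts) != 4:
            continue
        try:
            item = Emission(float(parts[0]), int(float(parts[1])),
                            int(float(parts[2])), parts[3])
        except ValueError:
            continue  # not a transcript line after all
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


log = [
    "DEBUG   len of buffer now: 5.30",
    "11643.5227 0 880 四川美食",
    "11643.5227 0 880 四川美食",
]
for item in parse_emissions(log):
    print(item.start_ms, item.end_ms, item.text)
```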

Note: to change the model quantization (compute type), edit the WhisperModel construction in the script:

# this worked fast and reliably on NVIDIA L40
# model = WhisperModel(model_size_or_path, device="cuda", compute_type="float16", download_root=cache_dir)

# or run on GPU with INT8
# tested: the transcripts were different, probably worse than with FP16, and it was slightly (appx 20%) slower
model = WhisperModel(model_size_or_path, device="cuda", compute_type="int8_float16")

Summary