whisper.cpp Study Notes

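Introduction to whisper

Whisper is an automatic speech recognition (ASR) model developed by OpenAI. It can recognize many languages, including English and Mandarin.

Whisper comes in two implementations: whisper (the Python version) and whisper.cpp (the C/C++ version).

The Python version can be installed directly with pip install -U openai-whisper; whisper.cpp is built and installed from source.

The rest of these notes covers whisper.cpp, because its recognition is faster than the Python version.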

Downloading the source

URL: https://github.com/ggerganov/whisper.cpp

git clone https://github.com/ggerganov/whisper.cpp.git

Download a model

bash ./models/download-ggml-model.sh base.en

Building from source

Method 1

Run make directly in the whisper.cpp directory to build:

# build the main example
make

# transcribe an audio file
./main -f samples/jfk.wav

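By default this produces three commands: main, bench and quantize.

main: the whisper command
quantize: quantizes whisper models
bench: performance benchmark

Notes:

The executables built this way have the whisper code compiled into them; no whisper library file is generated.

whisper only accepts audio sampled at 16000 Hz, with samples in float format. If the input is a WAV file, it is converted to float internally.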
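To make the float requirement concrete, here is a minimal sketch of turning signed 16-bit PCM samples into floats in the range [-1.0, 1.0). The helper name pcm16_to_float is hypothetical and is not part of the whisper.cpp API; it assumes the audio is already mono and already sampled at 16000 Hz, so no resampling is shown.

```cpp
#include <cstdint>
#include <vector>

// Convert signed 16-bit PCM samples to float in [-1.0, 1.0).
// Hypothetical helper for illustration only; assumes mono 16000 Hz input.
std::vector<float> pcm16_to_float(const std::vector<std::int16_t>& pcm16) {
    std::vector<float> pcmf32;
    pcmf32.reserve(pcm16.size());
    for (std::int16_t s : pcm16) {
        // 32768 is the magnitude of the most negative int16 value.
        pcmf32.push_back(static_cast<float>(s) / 32768.0f);
    }
    return pcmf32;
}
```

The resulting vector has the same shape as the pcmf32 buffer used in the demo later in these notes.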

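Method 2

mkdir build
cd build
cmake ..	# optional cmake arguments (for example an install prefix) may be appended; cmake with no arguments at all cannot generate a Makefile
make

This approach produces the library files libwhisper.so, libwhisper.so.1 and libwhisper.so.1.5.5 in the build directory. The matching header is whisper.h in the whisper.cpp directory. Copy these files into /usr/lib and /usr/include to use them.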

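Supported models

whisper currently supports the tiny, base, small, medium and large models. Models with the .en suffix support English only.

A model can be downloaded directly with make xxx, for example: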

make small

Disk and memory usage of each model:

Model    Disk      Mem
tiny     75 MiB    ~273 MB
base     142 MiB   ~388 MB
small    466 MiB   ~852 MB
medium   1.5 GiB   ~2.1 GB
large    2.9 GiB   ~3.9 GB
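One way to use these figures is to pick the largest model that fits a given memory budget. The sketch below hard-codes the approximate Mem column (treating 2.1 GB and 3.9 GB as 2100 MB and 3900 MB); pick_model and ModelInfo are hypothetical names for illustration, not part of whisper.cpp.

```cpp
#include <string>
#include <vector>

struct ModelInfo {
    std::string name;
    int mem_mb;  // approximate runtime memory, taken from the table
};

// Return the largest model whose approximate memory use fits the budget,
// or an empty string when even tiny does not fit.
std::string pick_model(int budget_mb) {
    static const std::vector<ModelInfo> models = {
        {"tiny", 273}, {"base", 388}, {"small", 852},
        {"medium", 2100}, {"large", 3900},
    };
    std::string best;
    for (const ModelInfo& m : models) {  // list is sorted by size
        if (m.mem_mb <= budget_mb) best = m.name;
    }
    return best;
}
```

With a budget of 1000 MB this selects small. The numbers are approximate, so a real application should leave some headroom.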

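Optimization / acceleration

Using hardware acceleration

whisper.cpp supports several kinds of acceleration (see the whisper.cpp README for details):

OpenVINO, NVIDIA GPU, CLBlast, OpenBLAS, MKL, …

Only MKL is covered here.

First, download Intel's oneAPI math library from the Intel® oneAPI Math Kernel Library page.

The library is installed from a binary package. After installation, run the script below to set up the environment for its commands and link library locations: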

source /opt/intel/oneapi/setvars.sh 
mkdir build
cd build
cmake -DWHISPER_MKL=ON ..
WHISPER_MKL=1 make -j

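The steps above produce a whisper shared library that depends on Intel oneAPI. To have Intel oneAPI available after every boot, add the command source /opt/intel/oneapi/setvars.sh to ~/.bashrc. To make it available to all users, create an Intel oneAPI script in the /etc/profile.d directory so that the environment variables are configured automatically when a user logs in.

Note:

If Intel oneAPI is installed with root privileges, it is installed under /opt; otherwise it is installed under the user's home directory.

Quantizing models

The build produces a quantize command. It quantizes a model, which reduces the model's size and speeds up execution.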

usage: ./quantize model-f32.bin model-quant.bin type
  type = "q2_k" or 10
  type = "q3_k" or 11
  type = "q4_0" or 2
  type = "q4_1" or 3
  type = "q4_k" or 12
  type = "q5_0" or 8
  type = "q5_1" or 9
  type = "q5_k" or 13
  type = "q6_k" or 14
  type = "q8_0" or 7

For example, ggml-medium.bin is 1.5 G before quantization and 424 M after quantization with q4_k.
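To illustrate why quantization shrinks a model, here is a deliberately simplified sketch of 4-bit block quantization: each block stores one float scale plus one 4-bit code per weight, instead of a full-width float per weight. This is an assumption-laden teaching example. Block4, quantize_block and dequantize_at are hypothetical names, and the layout does not match ggml's actual q4_0 or q4_k formats.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One quantized block: a float scale plus 4-bit codes packed two per byte.
// Illustrative layout only; it does not match any ggml format.
struct Block4 {
    float scale;
    std::vector<std::uint8_t> packed;
};

// Quantize floats to signed levels in [-7, 7], stored with an offset of 8.
Block4 quantize_block(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    const float scale = amax > 0.0f ? amax / 7.0f : 1.0f;

    Block4 b{scale, std::vector<std::uint8_t>((x.size() + 1) / 2, 0)};
    for (std::size_t i = 0; i < x.size(); ++i) {
        long q = std::lround(x[i] / scale);
        q = std::min(7L, std::max(-7L, q));
        const std::uint8_t code = static_cast<std::uint8_t>(q + 8);  // 1..15
        // Even indices use the low nibble, odd indices the high nibble.
        b.packed[i / 2] |= (i % 2 == 0) ? code
                                        : static_cast<std::uint8_t>(code << 4);
    }
    return b;
}

// Recover an approximate float value for element i of the block.
float dequantize_at(const Block4& b, std::size_t i) {
    const std::uint8_t byte = b.packed[i / 2];
    const int code = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    return static_cast<float>(code - 8) * b.scale;
}
```

Each weight now costs 4 bits plus a share of the per-block scale, at the price of a rounding error of at most half the scale per value.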

Execution speed:

Before quantization

time whisper --language chinese --model models/ggml-medium.bin output.wav

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing 'output.wav' (42624 samples, 2.7 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = chinese, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:02.000]  你好 你好 你好

whisper_print_timings:     load time =   606.10 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.12 ms
whisper_print_timings:   sample time =    12.43 ms /    24 runs (    0.52 ms per run)
whisper_print_timings:   encode time =  9836.69 ms /     1 runs ( 9836.69 ms per run)
whisper_print_timings:   decode time =    69.17 ms /     2 runs (   34.59 ms per run)
whisper_print_timings:   batchd time =   364.41 ms /    20 runs (   18.22 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 10899.75 ms

real    0m11.000s
user    0m44.862s
sys     0m2.151s

After quantization

time whisper --language chinese --model models/ggml-medium_q4_k.bin output.wav

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing 'output.wav' (42624 samples, 2.7 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = chinese, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:02.000]  你好 你好 你好

whisper_print_timings:     load time =   303.66 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.37 ms
whisper_print_timings:   sample time =    13.74 ms /    27 runs (    0.51 ms per run)
whisper_print_timings:   encode time =  8979.33 ms /     1 runs ( 8979.33 ms per run)
whisper_print_timings:   decode time =    34.54 ms /     2 runs (   17.27 ms per run)
whisper_print_timings:   batchd time =   289.84 ms /    23 runs (   12.60 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  9632.14 ms

real    0m9.702s
user    0m42.165s
sys     0m1.342s

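Comparing the two runs, quantization shortened execution by about 1 second: total time fell from 10899.75 ms to 9632.14 ms, and real time from 11.000 s to 9.702 s.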
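The timings can also be expressed relative to the audio length. The sketch below computes a real-time factor (processing time divided by audio duration) and the speedup between the two runs, using the sample count and total times from the logs above. The function names are hypothetical, and the 16000 Hz rate is the one stated earlier in these notes.

```cpp
#include <cassert>

// Real-time factor: processing time divided by audio duration.
// Values above 1 mean slower than real time.
double real_time_factor(double total_ms, int n_samples, int sample_rate = 16000) {
    assert(n_samples > 0 && sample_rate > 0);
    const double audio_ms = 1000.0 * n_samples / sample_rate;
    return total_ms / audio_ms;
}

// Ratio of the earlier run time to the later run time.
double speedup(double before_ms, double after_ms) {
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}
```

For the 42624-sample clip, real_time_factor(10899.75, 42624) is roughly 4.1 and real_time_factor(9632.14, 42624) is roughly 3.6, so on this machine the quantized medium model is about 1.13 times faster but still well short of real time.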

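Generating library files

As described earlier, the cmake and make commands can generate the shared library.

Using the whisper.cpp demos

The examples directory contains a variety of demos. Running make xxx in the whisper.cpp directory (where xxx is the name of a demo under examples) builds that demo's executable.

Reading through the demos shows that they mainly use the following whisper functions:

whisper_lang_id(): checks whether a language is supported; the name must be lowercase, for example chinese rather than Chinese
whisper_context_default_params(): returns the default context parameters
whisper_init_from_file_with_params(): initializes the context
whisper_is_multilingual(): checks whether the context supports multiple languages
whisper_full(): runs the speech recognition
whisper_full_n_segments(): returns how many text segments were produced
whisper_full_get_segment_text(): returns the text of one recognized segment
whisper_full_n_tokens(): returns how many tokens a recognized segment contains
whisper_full_get_token_id(): returns the id of a given token in a segment

Below is a simple cpp file adapted from examples/main/main.cpp.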

#include "common.h"

#include "whisper.h"
#include "grammar-parser.h"

#include <cmath>
#include <fstream>
#include <cstdio>
#include <regex>
#include <string>
#include <thread>
#include <vector>
#include <cstring>

// Read a WAV file into a mono float buffer using read_wav() from common.h.
bool wav_read(const std::string& fname, std::vector<float>& pcmf32)
{
    std::vector<std::vector<float>> pcmf32s;

    if (!::read_wav(fname, pcmf32, pcmf32s, false)) {
        fprintf(stderr, "error: failed to read WAV file '%s'\n", fname.c_str());
        return false;
    }

    return true;
}

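// Create the whisper context and fill in the decoding parameters.
// Returns 0 on success, non-zero on failure.
int whisper_init(struct whisper_context * *ctx, whisper_full_params& wparams)
{
    if (whisper_lang_id("chinese") == -1) {
        fprintf(stderr, "error: unknown language '%s'\n", "chinese");
        return 2;
    }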

    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.use_gpu = false;

   *ctx = whisper_init_from_file_with_params("models/ggml-small.bin", cparams);
    if (*ctx == nullptr) {
        fprintf(stderr, "error: failed to initialize whisper context\n");
        return 3;
    }


    wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    wparams.language         = "chinese";

    return 0;
}

int whisper_exit(struct whisper_context ** ctx)
{
    whisper_free(*ctx);
    return 0;
}

// Run recognition on the samples and append every segment's text to result.
int whisper_identify(struct whisper_context **ctx, whisper_full_params& wparams, const std::vector<float>& pcmf32, std::string& result)
{
    if(whisper_full(*ctx, wparams, pcmf32.data(), pcmf32.size()) != 0){
        return -1;
    }

    const int n_segments = whisper_full_n_segments(*ctx);
    for (int i = 0; i < n_segments; ++i) {
        const char * text = whisper_full_get_segment_text(*ctx, i);

        result += text;
    }

    return 0;
}

int main(int argc, char ** argv) {
    std::vector<float> pcmf32;
    struct whisper_context *ctx = nullptr;
    whisper_full_params wparams;
    std::string text;

    if(!wav_read("output.wav", pcmf32)){
        fprintf(stderr, "wave read failed !\n");
        return -1;
    }

    if(whisper_init(&ctx, wparams)){
        fprintf(stderr, "whisper init error !\n");
        return -1;
    }

    if(whisper_identify(&ctx, wparams, pcmf32, text)){
        fprintf(stderr, "identify error !\n");
        whisper_exit(&ctx);   // release the context before bailing out
        return -1;
    }

    whisper_exit(&ctx);

   fprintf(stdout, "text is : %s\n", text.c_str());

    return 0;
}

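The file reduces the program to a few functions: WAV file reading, whisper initialization, whisper recognition and whisper shutdown. The structure is simple and easier to follow. Replace examples/main/main.cpp with it and rebuild to run it.

Note:
Language names passed to the whisper API functions must be all lowercase; uppercase names are reported as unsupported.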
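Since the language name must be lowercase, user input can be normalized before it reaches the API. A minimal sketch, assuming ASCII language names; to_lower_ascii is a hypothetical helper, not a whisper.cpp function.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Return a lowercase copy of an ASCII string, e.g. "Chinese" -> "chinese".
std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}
```

The result could then be passed on, for example as whisper_lang_id(to_lower_ascii(lang).c_str()), where lang is whatever string the user supplied.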

References

whisper.cpp