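Environment setup
CUDA is required.
In a cmd console, run nvidia-smi.exe
to check the graphics driver version and the CUDA version it supports.
Go to the NVIDIA CUDA website and download the CUDA version that matches your system.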
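Taking CUDA 11.7 as an example, choose the installer according to your system and needs (local Windows users would typically select, in order: Windows, x86_64, your system version, exe (local)).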
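After the installation finishes, run nvcc -V in a cmd console. If it prints the CUDA compiler version information, the installation succeeded.
Then install the PyTorch build that matches CUDA 11.7:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117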
Check that PyTorch can use CUDA; the last command should print True.
python
# press Enter to start the interpreter
import torch
# press Enter
print(torch.cuda.is_available())
# press Enter
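The same check can also drive which settings are passed to faster-whisper later on. The following is a minimal sketch, not part of the original setup: the helper names `pick_whisper_config` and `cuda_is_available` are hypothetical, and the CPU fallback of `"cpu"` with `"int8"` is an assumption about what is reasonable on a machine without a GPU.

```python
def pick_whisper_config(cuda_available: bool) -> tuple[str, str]:
    """Return (device, compute_type) to pass to faster-whisper's WhisperModel."""
    if cuda_available:
        return "cuda", "float16"
    # Assumed fallback for machines without a usable GPU.
    return "cpu", "int8"


def cuda_is_available() -> bool:
    """Report whether PyTorch can see a CUDA device; False if torch is missing."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    device, compute_type = pick_whisper_config(cuda_is_available())
    print(f"device={device}, compute_type={compute_type}")
```

The returned pair could then be used as `WhisperModel(model_path, device=device, compute_type=compute_type)`.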
Install faster-whisper
pip install faster-whisper
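Download the model
silero-vad
Download the model
Implementation
The idea is simple: PyAudio records in a loop, silero-vad checks each chunk for speech, and once someone has spoken the collected audio is saved and transcribed.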
import threading
import wave

import numpy as np
import pyaudio
import torch
from faster_whisper import WhisperModel


def int2float(sound):
    # Convert int16 PCM samples to float32 in the range [-1, 1).
    abs_max = np.abs(sound).max()
    sound = sound.astype('float32')
    if abs_max > 0:
        sound *= 1 / 32768
    sound = sound.squeeze()
    return sound


def save_audio(audio):
    # Write raw 16-bit mono PCM bytes to output.wav.
    with wave.open('output.wav', 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(audio)


def audio2Text(audio):
    # Transcribe a float32 NumPy array and print the joined segment texts.
    result = None
    segments, info = whisperModel.transcribe(audio, beam_size=5, language="zh")
    for segment in segments:
        if result is None:
            result = segment.text
        else:
            result += ", " + segment.text
    print(result)


if __name__ == '__main__':
    model, utils = torch.hub.load(
        repo_or_dir='../../silero-vad',
        model='silero_vad',
        trust_repo=None,
        source='local',
    )
    whisperModel = WhisperModel("../../large-v2", device="cuda", compute_type="float16")
    # utils is deliberately not unpacked: its save_audio helper would
    # shadow the save_audio function defined above.

    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    SAMPLE_RATE = 16000
    # Chunk size used in this post. Newer silero-vad releases may require
    # 512-sample chunks at 16 kHz; adjust if the model rejects this size.
    num_samples = 8192
    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT,
                     channels=CHANNELS,
                     rate=SAMPLE_RATE,
                     input=True,
                     frames_per_buffer=num_samples)
    print("Started Recording")
    audio = None     # raw bytes of the utterance being collected
    countSize = 0    # consecutive chunks without speech
    while True:
        audio_chunk = stream.read(num_samples)
        audio_int16 = np.frombuffer(audio_chunk, np.int16)
        audio_float32 = int2float(audio_int16)
        new_confidence = model(torch.from_numpy(audio_float32), SAMPLE_RATE).item()
        if new_confidence > 0.5:
            # Speech detected: start or extend the current utterance.
            if audio is None:
                audio = audio_chunk
            else:
                audio = audio + audio_chunk
            countSize = 0
        else:
            countSize = countSize + 1
            if audio is not None and countSize < 3:
                # Keep short pauses inside the utterance.
                audio = audio + audio_chunk
            elif audio is not None and countSize >= 3:
                # Enough silence: save the utterance and transcribe it in a thread.
                save_audio(audio)
                samples = int2float(np.frombuffer(audio, np.int16))
                # Pass the function and its argument separately; calling it
                # inside Thread(...) would run it on the main thread.
                t = threading.Thread(target=audio2Text, args=(samples,), name='LoopThread')
                t.start()
                audio = None
                countSize = 0
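The end-of-utterance rule used in the loop above can be pulled out of the audio code and checked on its own. This is a minimal sketch under stated assumptions: `segment_chunks` is a hypothetical helper, it works on a plain list of per-chunk VAD confidences instead of a live stream, and the cut-off of three silent chunks is taken from the loop as an example value.

```python
def segment_chunks(confidences, threshold=0.5, max_silence=3):
    """Group chunk indices into utterances.

    An utterance starts at the first chunk whose confidence exceeds
    `threshold` and ends once `max_silence` consecutive chunks fall at or
    below it. Returns a list of inclusive (first_chunk, last_chunk) pairs.
    """
    segments = []
    start = None   # index of the first chunk of the open utterance
    silence = 0    # consecutive non-speech chunks seen so far
    for i, conf in enumerate(confidences):
        if conf > threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= max_silence:
                segments.append((start, i))
                start = None
                silence = 0
    if start is not None:  # input ended while an utterance was still open
        segments.append((start, len(confidences) - 1))
    return segments


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.1, 0.1, 0.7, 0.2]
    print(segment_chunks(scores))  # [(1, 5), (7, 8)]
```

Keeping the rule in a pure function makes it easy to try other thresholds or silence limits against recorded confidence values without touching the microphone.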
Notes
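faster-whisper accepts a file path, a BinaryIO object, or a NumPy array as input. However, wrapping the PyAudio recording directly in a BinaryIO object could not be transcribed, so only the NumPy route works here, which may cost some precision.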
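One possible explanation, offered as an assumption and not verified here, is that the raw PyAudio bytes carry no container header, so a decoder cannot tell the sample rate or sample format. The sketch below wraps the PCM bytes in an in-memory WAV container using only the standard library; `pcm_to_wav_buffer` is a hypothetical helper, and whether faster-whisper then transcribes the buffer correctly would still need to be tested.

```python
import io
import wave


def pcm_to_wav_buffer(pcm_bytes: bytes, sample_rate: int = 16000) -> io.BytesIO:
    """Wrap raw 16-bit mono PCM bytes in a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(1)            # mono, as in the recording loop
        wf.setsampwidth(2)            # 2 bytes per sample (paInt16)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    buffer.seek(0)                    # rewind so a reader starts at the header
    return buffer


if __name__ == "__main__":
    silence = b"\x00\x00" * 160       # 10 ms of silence at 16 kHz
    wav = pcm_to_wav_buffer(silence)
    print(wav.read(4))                # the RIFF magic bytes of a WAV file
```

If this works, the resulting buffer could be passed where the file path or BinaryIO input is expected, which would avoid writing output.wav to disk.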
Timestamps are not implemented yet; it seems possible to maintain an internal clock and compute them from it.
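One way to keep such an internal clock is to count the samples read from the stream, since every chunk has a known length. This is only a sketch of the idea: `ChunkClock` is a hypothetical class, and it assumes no chunks are dropped between reads.

```python
class ChunkClock:
    """Track stream time by counting the samples read so far."""

    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        self.samples_read = 0

    def advance(self, num_samples: int) -> tuple[float, float]:
        """Register one chunk and return its (start, end) time in seconds."""
        start = self.samples_read / self.sample_rate
        self.samples_read += num_samples
        return start, self.samples_read / self.sample_rate


if __name__ == "__main__":
    clock = ChunkClock(sample_rate=16000)
    print(clock.advance(8192))  # (0.0, 0.512)
    print(clock.advance(8192))  # (0.512, 1.024)
```

Recording the start time of the first chunk of each utterance and adding it to the `start` and `end` values of the transcribed segments would give times relative to the beginning of the recording.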