OpenAI Whisper Speech-to-Text Experiment

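To talk to a large language model by voice, you need speech recognition (speech-to-text) and speech output (text-to-speech). The technology feels fairly mature by now, and many organizations in China are working on it, yet finding something that is easy to try out turned out to be surprisingly hard. Google is blocked, Microsoft requires registration, and documentation for the domestic options is scarce, so I ended up choosing OpenAI's Whisper.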

Introduction to Whisper

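Whisper is a speech processing system that OpenAI released in September 2022. Its training data is mostly English, but it supports 99 languages, including Chinese.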

It comes in five model sizes, from smallest to largest: tiny, base, small, medium, and large, to suit different scenarios.

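The large model is about 2.88 GB, while the base model is far smaller (roughly 140 MB). In my tests the large model was fairly slow and the base model was fast.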
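As a rough guide to the size/speed trade-off, the sketch below lists the five sizes with the approximate parameter counts given in the Whisper README and picks the largest one that fits a budget. The helper name `pick_model` and the idea of a parameter budget are illustrative assumptions, not part of Whisper itself.

```python
# Approximate parameter counts in millions, as listed in the Whisper README.
MODEL_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def pick_model(max_params_m: int) -> str:
    """Return the largest model that fits the budget (hypothetical helper)."""
    fitting = [name for name, params in MODEL_PARAMS_M.items()
               if params <= max_params_m]
    if not fitting:
        raise ValueError(f"no Whisper model has <= {max_params_m}M parameters")
    # Among the models that fit, prefer the one with the most parameters.
    return max(fitting, key=MODEL_PARAMS_M.get)
```

The returned name can be passed straight to `whisper.load_model(...)`.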

Installing Whisper

pip install openai-whisper

Installing ffmpeg

Whisper relies on the ffmpeg program. On Windows it can be installed from PowerShell with Chocolatey:

choco install ffmpeg
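Before loading a model it can be worth checking that ffmpeg is actually on the PATH, since Whisper calls it to decode audio. A minimal sketch using only the standard library; the function names are my own:

```python
import shutil


def ffmpeg_path():
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


def ffmpeg_available() -> bool:
    """True when an ffmpeg executable can be found on PATH."""
    return ffmpeg_path() is not None
```

If `ffmpeg_available()` returns False after installing, opening a new PowerShell window usually picks up the updated PATH.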

Installing other modules

Test audio file

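Chinese audio files turned out to be hard to find online: they were either paid or did not match what I needed, so I used an English sample file found on GitHub.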

audio-samples.github.io

Speech-to-text with Whisper

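import whisper

print("Start....")
# Load the large model; the weights are downloaded on first use
whisper_model = whisper.load_model("large")
print("Begin...")
# Transcribe the sample file, telling Whisper that the audio is English
result = whisper_model.transcribe("E:/yao2024/sample-0.wav", language="en")
# Join the text of every recognized segment
print(", ".join(seg["text"] for seg in result["segments"]))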

On the first run the program has to download the model weights, which takes a while.
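The dictionary returned by `transcribe` carries a `segments` list whose entries include `start`, `end` (in seconds), and `text`. The sketch below formats those segments with timestamps. The helper names are my own, and `sample` is a hand-written stand-in for a real result rather than actual Whisper output.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(result: dict) -> list:
    """Return one '[start - end] text' line per segment of a transcribe() result."""
    lines = []
    for seg in result["segments"]:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} - {end}] {seg['text'].strip()}")
    return lines


# Hand-written stand-in with the same shape as a transcribe() result.
sample = {
    "text": " Hello there. How are you?",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " Hello there."},
        {"start": 2.4, "end": 65.0, "text": " How are you?"},
    ],
}
```

With a real result, `print("\n".join(format_segments(result)))` prints one timestamped line per segment.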

LangChain voice assistant

LangChain has a voice assistant example chain. It uses the pyttsx3 and speech_recognition libraries to convert text to speech and speech to text, respectively.

`speech_recognition` is a speech recognition library that can call several recognition engines and APIs, including:

CMU Sphinx (works offline)

Google Speech Recognition

Google Cloud Speech API

Wit.ai

Microsoft Azure Speech

Microsoft Bing Voice Recognition (Deprecated)

Houndify API

IBM Speech to Text

Snowboy Hotword Detection (works offline)

Tensorflow

Vosk API (works offline)

OpenAI whisper (works offline)

Whisper API

We chose the offline OpenAI Whisper option.

Experiment programs

pyttsx3 experiment

import pyttsx3
# Speak the text aloud
pyttsx3.speak("How are you?")
pyttsx3.speak("I am fine, thank you")
pyttsx3.speak("太行,王屋二山，方七百里，高万仞，本在冀州之南，河阳之北。")
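Whether the Chinese sentence is read properly depends on which voices are installed on the system. The sketch below, which is an assumption on my part and not something from the experiment above, selects a voice by name through pyttsx3's `voices`, `voice`, and `rate` properties. `find_voice` and `speak_with_voice` are hypothetical helpers, and `sample_voices` is a hand-written stand-in for `engine.getProperty("voices")`.

```python
from types import SimpleNamespace


def find_voice(voices, keyword):
    """Return the id of the first voice whose name contains keyword, else None."""
    keyword = keyword.lower()
    for voice in voices:
        if keyword in (voice.name or "").lower():
            return voice.id
    return None


def speak_with_voice(text, keyword, rate=150):
    """Speak text with the first installed voice matching keyword, if any."""
    import pyttsx3  # imported lazily so find_voice can be used without it

    engine = pyttsx3.init()
    voice_id = find_voice(engine.getProperty("voices"), keyword)
    if voice_id is not None:
        engine.setProperty("voice", voice_id)
    engine.setProperty("rate", rate)  # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()


# Hand-written stand-ins for installed voices; real names differ per machine.
sample_voices = [
    SimpleNamespace(id="voice-en", name="Example English Voice"),
    SimpleNamespace(id="voice-zh", name="Example Chinese Voice"),
]
```

For example, `speak_with_voice("How are you?", "english")` falls back to the default voice when no installed voice name contains the keyword.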

Conversation program

import os

import speech_recognition as sr
import pyttsx3
from langchain.chat_models import ErnieBotChat
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferWindowMemory

# Read the ERNIE credentials from environment variables instead of hard-coding them
llm = ErnieBotChat(
    model_name='ERNIE-Bot',
    ernie_client_id=os.environ['ERNIE_CLIENT_ID'],
    ernie_client_secret=os.environ['ERNIE_CLIENT_SECRET'],
    temperature=0.75,
)
template = """Assistant is a large language model trained by OpenAI.
Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.
Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.
Overall, Assistant is a powerful tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.
Assistant is aware that human input is being transcribed from audio and as such there may be some errors in the transcription. It will attempt to account for some words being swapped with similar-sounding words or phrases. Assistant will also keep responses concise, because human attention spans are more limited over the audio channel since it takes time to listen to a response.
{history}
Human: {human_input}
Assistant:"""
prompt = PromptTemplate(
    input_variables=["history", "human_input"],
    template=template
)
chatgpt_chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=2),
)

engine = pyttsx3.init()


# Listen on the microphone, transcribe the speech, and speak the model's reply
def listen():
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print('Calibrating...')
        r.adjust_for_ambient_noise(source, duration=10)
        # Optional: adjust the microphone sensitivity
        # r.energy_threshold = 200
        r.pause_threshold = 0.5
        print('Okay, go!')
        while True:
            text = ''
            print('Listening...')
            try:
                audio = r.listen(source, timeout=10)
                print('Recognizing...')
                # Run speech recognition with the local Whisper model
                text = r.recognize_whisper(audio)
            except Exception as e:
                # On failure, the error message is passed to the model as the input
                text = f"Sorry, I didn't catch that. Error: {e}"
            print(text)
            # Generate a reply with the language model
            response_text = chatgpt_chain.predict(human_input=text)
            print(response_text)
            # Convert the reply to speech and play it
            engine.say(response_text)
            engine.runAndWait()


listen()
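In the loop above, a recognition failure (including a listen timeout) is forwarded to the language model as if the user had said it. One possible alternative, sketched here as an assumption and not as the behavior of the LangChain example, is to handle a failed turn separately and skip the model call. `run_turn` is a hypothetical helper that takes the three steps as plain callables so it can be tried without a microphone.

```python
def run_turn(recognize, reply, speak):
    """Run one conversation turn.

    recognize() returns the transcribed text or raises on failure,
    reply(text) returns the model's answer, speak(text) plays it.
    Returns the answer, or None when nothing usable was recognized.
    """
    try:
        text = recognize()
    except Exception:
        # Do not send the error to the model; just tell the user.
        speak("Sorry, I didn't catch that.")
        return None
    if not text.strip():
        # Silence or an empty transcription: nothing to answer.
        return None
    answer = reply(text)
    speak(answer)
    return answer
```

Inside `listen()` the callables could be `lambda: r.recognize_whisper(r.listen(source, timeout=10))`, `lambda t: chatgpt_chain.predict(human_input=t)`, and a small function that calls `engine.say` followed by `engine.runAndWait`.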

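When I speak English it answers in English, and when I speak Chinese it answers in Chinese, but homophones are recognized poorly. I do not yet know how to improve homophone recognition.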
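Some things that might be worth trying, though I have not verified that they help with homophones: use a larger model than the default `base`, fix the language instead of relying on auto-detection, and give Whisper some context through `initial_prompt`, which is an option of Whisper's `transcribe` that `recognize_whisper` forwards as a keyword argument. The sketch below only builds those arguments. `whisper_options` and `recognize_with_hints` are hypothetical helpers, and the choice of the `small` model is an assumption.

```python
def whisper_options(language=None, hint_terms=()):
    """Build keyword arguments for Recognizer.recognize_whisper (hypothetical helper)."""
    options = {}
    if language:
        # Fix the language instead of relying on auto-detection.
        options["language"] = language
    if hint_terms:
        # initial_prompt gives Whisper context text, e.g. expected names or terms.
        options["initial_prompt"] = ", ".join(hint_terms)
    return options


def recognize_with_hints(recognizer, audio, model="small", language=None, hint_terms=()):
    """Call recognize_whisper with a chosen model, language, and hint terms."""
    return recognizer.recognize_whisper(
        audio, model=model, **whisper_options(language, hint_terms)
    )
```

In the conversation program this would replace `r.recognize_whisper(audio)` with something like `recognize_with_hints(r, audio, language="zh", hint_terms=["太行", "王屋"])`.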