This document describes how I built real-time speech-to-text (STT) on a Raspberry Pi with Python.
First of all
This is my development environment.
Hardware
- Raspberry Pi CM4 (2GB RAM)
- Waveshare Raspberry Pi CM4 IO board (https://www.waveshare.com/cm4-io-base-b.htm)
- ReSpeaker 2-Mics Pi HAT (https://wiki.seeedstudio.com/ReSpeaker/)
Software
- Raspberry Pi OS Bullseye
I didn't use Bookworm because the ReSpeaker 2-Mics Pi HAT didn't work with it.
Background
I previously made a small application that does local text-to-speech.
Next, I wanted to write scripts that do real-time speech-to-text so that I could have a conversation with the Raspberry Pi.
What I did
Firstly, I set up the microphone on the Raspberry Pi. I used the ReSpeaker HAT, and Seeed provides a setup script that installs the driver, so this part was straightforward.
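The setup steps were roughly the following, based on Seeed's wiki (the repository URL and exact commands may have changed since then):
git clone https://github.com/respeaker/seeed-voicecard.git
cd seeed-voicecard
sudo ./install.sh
sudo reboot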
Software-wise, I used pyaudio, but I couldn't install it via pip at first. The official documentation says that portaudio19-dev needs to be installed beforehand:
sudo apt-get install portaudio19-dev
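With that dependency in place, the pip install went through:
pip install pyaudio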
Secondly, I looked for a way to do STT locally, without sending requests to any external service, and found the whisper package.
My Raspberry Pi has 2GB of RAM, so the tiny and base models should work on it. This time I chose the tiny one because I prioritized speed.
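For reference, local transcription with the whisper package looks roughly like this (a minimal sketch; the audio file name is just a placeholder):
import whisper

# "tiny" is the smallest and fastest model; "base" should also fit in 2GB of RAM.
model = whisper.load_model("tiny")
result = model.transcribe("sample.wav")  # path to a recorded audio file
print(result["text"])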
But the response speed was still slow. I didn't measure it precisely, but transcription took more than 5 seconds, which is critical for real-time STT, and I didn't find any better option for running locally. So I decided to also consider STT services that require sending requests, which brought Google Speech-to-Text and OpenAI Whisper into the picture.
Google: https://cloud.google.com/speech-to-text
OpenAI: https://platform.openai.com/docs/guides/speech-to-text
The request-based services were much faster than local inference, taking around 1 second: less for short sentences and a bit more for long ones. I chose Whisper because its accuracy was higher than Google's, especially with my Japanese English accent.
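Calling the Whisper API directly looks roughly like this (a hedged sketch, assuming the pre-1.0 openai package that was current at the time):
import openai

openai.api_key = 'your_api_key'

# Send a recorded audio file to the hosted whisper-1 model.
with open("sample.wav", "rb") as f:
    transcript = openai.Audio.transcribe("whisper-1", f)
print(transcript["text"])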
Lastly, I wanted to make it even faster, but this seemed to be the best I could get with the currently available models. So I decided to focus on sending requests with short sentences, which meant I needed to detect the end of a sentence while listening to the conversation. Then I found the following library.
speech_recognition bundles utility functions for speech recognition, and it supports the OpenAI API and the pyaudio interface. It also provides a function called listen, which can detect the end of speech.
I decided to go with that function to keep each sentence short. One thing worth noting: I needed to call the adjust_for_ambient_noise function before calling the listen function. Nothing errors out if you skip adjust_for_ambient_noise, but listen didn't detect the end of speech properly without it. Once I called the function, listen worked as I expected.
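In isolation, the calibration and listening steps look like this (a minimal sketch):
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    # Measure the room's noise floor for 5 seconds and set the energy
    # threshold accordingly; listen() uses it to decide when speech ends.
    r.adjust_for_ambient_noise(source, duration=5)
    audio = r.listen(source)  # returns once the phrase is complete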
My final script is below. I used a thread pool from concurrent.futures to keep listening to the conversation while sending requests to OpenAI.
import os
import speech_recognition as sr
from concurrent import futures

API_KEY = 'your_api_key'


class SpeechRecognizer:
    def __init__(self):
        os.makedirs("./out", exist_ok=True)
        self.path = "./out/asr.txt"
        self.rec = sr.Recognizer()
        self.mic = sr.Microphone()
        # A worker pool so transcription requests don't block the listening loop.
        self.pool = futures.ThreadPoolExecutor(thread_name_prefix="Rec Thread")
        self.speech = []

    def recognize_audio_thread_pool(self, audio, event=None):
        # Submit the request to the pool and keep the future for later collection.
        future = self.pool.submit(self.recognize_audio, audio)
        self.speech.append(future)

    def grab_audio(self) -> sr.AudioData:
        print("Say something!")
        with self.mic as source:
            # listen() returns once it detects the end of a phrase.
            audio = self.rec.listen(source)
        return audio

    def recognize_audio(self, audio: sr.AudioData) -> str:
        print("Understanding!")
        try:
            speech = self.rec.recognize_whisper_api(audio, model='whisper-1', api_key=API_KEY)
        except sr.UnknownValueError:
            speech = "# Failed to recognize speech"
            print(speech)
        except sr.RequestError as e:
            speech = f"# Invalid request: {e}"
            print(speech)
        return speech

    def run(self):
        print("Listening to surroundings!")
        with self.mic as source:
            # Calibrate the energy threshold first; listen() does not detect
            # the end of speech reliably without this.
            self.rec.adjust_for_ambient_noise(source, duration=5)
        try:
            while True:
                audio = self.grab_audio()
                self.recognize_audio_thread_pool(audio)
        except KeyboardInterrupt:
            print("Finished")
        finally:
            with open(self.path, mode='w', encoding="utf-8") as out:
                futures.wait(self.speech)
                for future in self.speech:
                    print(future.result())
                    out.write(f"{future.result()}\n")


if __name__ == "__main__":
    sp = SpeechRecognizer()
    sp.run()
That’s it!