This document describes how I built real-time speech-to-text (STT) on a Raspberry Pi with Python.
First of all
This is my development environment.
Hardware
- Raspberry Pi CM4 (2GB RAM)
- Waveshare Raspberry Pi CM4 IO board (https://www.waveshare.com/cm4-io-base-b.htm)
- ReSpeaker 2-Mics Pi HAT (https://wiki.seeedstudio.com/ReSpeaker/)
Software
- Raspberry Pi OS Bullseye
I didn't use Bookworm because the ReSpeaker 2-Mics Pi HAT didn't work with it.
Background
I previously made a small application that does local text-to-speech.
Next, I wanted to write scripts that do real-time speech-to-text so that I could have a conversation with the Raspberry Pi.
What I did
Firstly, I set up the microphone on the Raspberry Pi. I used the ReSpeaker HAT, and Seeed provides a setup script that installs the driver, so this part was straightforward.
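The setup steps were roughly the following, based on Seeed's wiki (the repository URL and exact commands may have changed since then):
git clone https://github.com/respeaker/seeed-voicecard.git
cd seeed-voicecard
sudo ./install.sh
sudo reboot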
Software-wise, I used pyaudio, but I couldn't install it via pip at first. The official documentation says that portaudio19-dev needs to be installed beforehand:
sudo apt-get install portaudio19-dev
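With that dependency in place, the pip install went through:
pip install pyaudio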
Secondly, I looked for a way to do STT locally, without sending requests to any external service, and found the whisper package.
My Raspberry Pi has 2GB of RAM, so the tiny and base models should work on it. This time I chose the tiny one because I prioritized speed.
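For reference, local transcription with the whisper package looks roughly like this (a minimal sketch; the audio file name is just a placeholder):
import whisper

# "tiny" is the smallest and fastest model; "base" should also fit in 2GB of RAM.
model = whisper.load_model("tiny")
result = model.transcribe("sample.wav")  # path to a recorded audio file
print(result["text"])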
But the response speed was still slow. I didn't measure it precisely, but transcription took more than 5 seconds, which is critical for real-time STT, and I didn't find any better option for running locally. So I decided to also consider STT services that require sending requests, which brought Google Speech-to-Text and OpenAI Whisper into the picture.
Google: https://cloud.google.com/speech-to-text
OpenAI: https://platform.openai.com/docs/guides/speech-to-text
The request-based services were much faster than local inference, taking around 1 second: less for short sentences and a bit more for long ones. I chose Whisper because its accuracy was higher than Google's, especially with my Japanese English accent.
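Calling the Whisper API directly looks roughly like this (a hedged sketch, assuming the pre-1.0 openai package that was current at the time):
import openai

openai.api_key = 'your_api_key'

# Send a recorded audio file to the hosted whisper-1 model.
with open("sample.wav", "rb") as f:
    transcript = openai.Audio.transcribe("whisper-1", f)
print(transcript["text"])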
Lastly, I wanted to make it even faster, but this seemed to be the best I could get with the currently available models. So I decided to focus on sending requests with short sentences, which meant I needed to detect the end of a sentence while listening to the conversation. Then I found the following library.
speech_recognition bundles utility functions for speech recognition, and it supports the OpenAI API and the pyaudio interface. It also provides a function called listen, which can detect the end of speech.
I decided to go with that function to keep each sentence short. One thing worth noting: I needed to call the adjust_for_ambient_noise function before calling the listen function. Nothing errors out if you skip adjust_for_ambient_noise, but listen didn't detect the end of speech properly without it. Once I called the function, listen worked as I expected.
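In isolation, the calibration and listening steps look like this (a minimal sketch):
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    # Measure the room's noise floor for 5 seconds and set the energy
    # threshold accordingly; listen() uses it to decide when speech ends.
    r.adjust_for_ambient_noise(source, duration=5)
    audio = r.listen(source)  # returns once the phrase is complete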
My final script is below. I used a thread pool from concurrent.futures to keep listening to the conversation while sending requests to OpenAI.
import os
import speech_recognition as sr
from concurrent import futures

API_KEY = 'your_api_key'


class SpeechRecognizer:
    def __init__(self):
        os.makedirs("./out", exist_ok=True)
        self.path = "./out/asr.txt"
        self.rec = sr.Recognizer()
        self.mic = sr.Microphone()
        # A worker pool so transcription requests don't block the listening loop.
        self.pool = futures.ThreadPoolExecutor(thread_name_prefix="Rec Thread")
        self.speech = []

    def recognize_audio_thread_pool(self, audio, event=None):
        # Submit the request to the pool and keep the future for later collection.
        future = self.pool.submit(self.recognize_audio, audio)
        self.speech.append(future)

    def grab_audio(self) -> sr.AudioData:
        print("Say something!")
        with self.mic as source:
            # listen() returns once it detects the end of a phrase.
            audio = self.rec.listen(source)
        return audio

    def recognize_audio(self, audio: sr.AudioData) -> str:
        print("Understanding!")
        try:
            speech = self.rec.recognize_whisper_api(audio, model='whisper-1', api_key=API_KEY)
        except sr.UnknownValueError:
            speech = "# Failed to recognize speech"
            print(speech)
        except sr.RequestError as e:
            speech = f"# Invalid request: {e}"
            print(speech)
        return speech

    def run(self):
        print("Listening to surroundings!")
        with self.mic as source:
            # Calibrate the energy threshold first; listen() does not detect
            # the end of speech reliably without this.
            self.rec.adjust_for_ambient_noise(source, duration=5)
        try:
            while True:
                audio = self.grab_audio()
                self.recognize_audio_thread_pool(audio)
        except KeyboardInterrupt:
            print("Finished")
        finally:
            with open(self.path, mode='w', encoding="utf-8") as out:
                futures.wait(self.speech)
                for future in self.speech:
                    print(future.result())
                    out.write(f"{future.result()}\n")


if __name__ == "__main__":
    sp = SpeechRecognizer()
    sp.run()
That’s it!