Silence removal with Raspberry Pi and Python

Ats · Mar 2, 2024

This is documentation of what I have been doing for the last few weeks: some explorations around audio manipulation.


First of all

These are my development environments

Hardware

Software

Background

I have recently been working with audio for fun and wanted to build an application out of what I learned. The idea came from a vlog: one day I was listening to a vlog on my laptop, and there was a lot of dead time when nobody spoke. I wanted to remove that blank time from the original and listen only to the active parts.

What I did

I found an article that covers almost exactly what I wanted to do and used it as my reference. The author did a great job, so if you only want to manipulate audio, just read that.

pydub

I started with pydub, like the reference. It depends on ffmpeg, which is installed by default on the Raspberry Pi. When I tested on my local machine, I used Docker to get ffmpeg and Jupyter.

FROM jupyter/base-notebook

USER root
WORKDIR /home/jovyan/work

# Running as root, so no sudo needed
RUN apt-get update && apt-get -y install ffmpeg

COPY ./requirements.txt /home/jovyan/work
RUN python -m pip install --no-cache-dir -r requirements.txt
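
The requirements.txt referenced by the Dockerfile would contain roughly the following libraries, based on what appears later in this post (a sketch; pin versions as needed):

# requirements.txt (rough sketch)
pydub
librosa
matplotlib
numpy
scipy
pyAudioAnalysis
noisereduce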

First, I recorded sample voices with arecord -d 15 sample.wav on the Raspberry Pi to test the reference code snippets. However, it didn’t work well. The chart looked like the one below.

I did a quick search on Google and found that there are a few important configuration parameters for audio. They are the following:

  1. Number of channels (Mono or Stereo)
  2. Sampling rate
  3. Bit depth
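
You can check all three for a recorded file with the standard-library wave module, for example:

import wave

with wave.open('assets/output_one_channel.wav', 'rb') as wav:
    print('Channels:     ', wav.getnchannels())             # 1 = mono, 2 = stereo
    print('Sampling rate:', wav.getframerate(), 'Hz')
    print('Bit depth:    ', wav.getsampwidth() * 8, 'bit')  # sample width in bytes * 8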

In the code snippets in the reference, the sampling rate is computed from the audio file, but the number of channels and bit depth are hard-coded to mono and 16-bit. So I decided to adjust my recording script to match, using Python. My script to record sound with a mono channel at 16-bit is attached below.
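
A minimal sketch along those lines, assuming the sounddevice and soundfile libraries and a 16 kHz sampling rate:

import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000  # Hz
DURATION = 15        # seconds
CHANNELS = 1         # mono

# Record 16-bit signed samples from the default input device
recording = sd.rec(int(DURATION * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE,
                   channels=CHANNELS,
                   dtype='int16')
sd.wait()  # block until the recording is finished

# Write a 16-bit PCM WAV file
sf.write('assets/output_one_channel.wav', recording, SAMPLE_RATE, subtype='PCM_16')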

Then I got the following chart, which looked good.

Afterward, I tried to find the silent parts using silence.detect_silence from pydub, like below.

import os
from pydub import AudioSegment, silence

file = 'assets/output_one_channel.wav'
assert os.path.isfile(file)

myaudio = AudioSegment.from_wav(file)
# Returns a list of [start_ms, end_ms] ranges quieter than silence_thresh (in dBFS)
silent_ranges = silence.detect_silence(myaudio, min_silence_len=200, silence_thresh=-20)
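
detect_silence only reports where the silence is. To actually drop it, one sketch is to take the complementary ranges with detect_nonsilent (same thresholds) and stitch the audio back together with pydub's millisecond slicing (the output path is just a placeholder):

nonsilent_ranges = silence.detect_nonsilent(myaudio, min_silence_len=200, silence_thresh=-20)

active_audio = AudioSegment.empty()
for start_ms, end_ms in nonsilent_ranges:
    active_audio += myaudio[start_ms:end_ms]  # AudioSegment slices by milliseconds

active_audio.export('assets/output_active_only.wav', format='wav')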

It worked well, but it was quite hard to configure silence_thresh because its unit is dBFS, which is different from what I had plotted on the y-axis of the previous chart. So I tried to redraw the chart with dBFS on the y-axis. I was not 100% sure the following calculation is correct, but I gave it a go.

import os
import numpy as np
import matplotlib.pyplot as plt
import librosa

file = '../assets/output_one_channel.wav'
assert os.path.isfile(file)

# Load audio file
y, sr = librosa.load(file, sr=16000)

# Get the absolute maximum value for normalization
max_abs = np.max(np.abs(y))

# Normalize wave values to [-1, 1]
y_normalized = y / max_abs

# Convert wave values to dBFS (a tiny epsilon avoids log(0) for silent samples)
y_dBFS = 20 * np.log10(np.abs(y_normalized) + 1e-10)

time = np.arange(0, len(y)) / sr
plt.plot(time, y_dBFS)
plt.xlabel('Time (s)')
plt.ylabel('dBFS')
plt.show()

Then I got the chart, and it was good enough to figure out a threshold. However, when I tested a few more times with different amounts of background noise, the chart was easily affected by the noise, so I had to change the threshold every time while looking at the chart. That didn’t make sense, so I decided to move on to pyAudioAnalysis.

The following Jupyter notebook is the whole script for pydub.

pyAudioAnalysis

According to the reference, I don’t have to configure a threshold to detect the silence; the library determines the threshold automatically instead of me.

I just copied and pasted the reference code, and the result sounded more accurate than pydub.

# Import required libraries
import os
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Path to audio file
file = '../assets/202402181903_one_channel_120sec.wav'
assert os.path.isfile(file)

# silence_removal returns the active (non-silent) segments of the audio file
[Fs, x] = aIO.read_audio_file(file)
segments = aS.silence_removal(x,
                              Fs,
                              0.020,   # short-term window size (seconds)
                              0.020,   # short-term window step (seconds)
                              smooth_window=1.0,
                              weight=0.3,
                              plot=True)

However, it was too sensitive to silence and cut the audio into very small windows. I think I could tune this with the smooth_window and weight parameters, but since the meanings of those parameters were not easy to understand, I just did some manipulation of the output instead.

MIN_LENGTH, MIN_INTERVAL = 3, 3
START, END = 0, 1

# Ignore sounds that are too short
modified_segments = [segment for segment in segments if segment[END] - segment[START] > MIN_LENGTH]

# Combine two sounds that have too short an interval between them
combined_segments = [modified_segments[0][:]]
index = 1
while index < len(modified_segments):
    if modified_segments[index][START] - combined_segments[-1][END] > MIN_INTERVAL:
        combined_segments.append(modified_segments[index][:])
    else:
        combined_segments[-1][END] = modified_segments[index][END]
    index += 1
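
To listen to the result, here is a sketch that cuts the original file at these boundaries (silence_removal reports them in seconds, while pydub slices in milliseconds; the output path is a placeholder):

from pydub import AudioSegment

audio = AudioSegment.from_wav(file)

# Keep only the combined active segments
active_audio = AudioSegment.empty()
for start_s, end_s in combined_segments:
    active_audio += audio[int(start_s * 1000):int(end_s * 1000)]

active_audio.export('../assets/active_only.wav', format='wav')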

Then I got quite good results. But I found some issues when there was a very noisy background while recording; in that case, pyAudioAnalysis sometimes missed speaking parts. So I decided to do noise cancellation before the audio is passed to pyAudioAnalysis.

The following Jupyter notebook is the whole script for pyAudioAnalysis.

noisereduce

I did a quick Google search on noise cancellation and found the noisereduce repository.

It was very easy to use. I just copied and pasted the sample code from the README.

from scipy.io import wavfile
import noisereduce as nr
# load data
rate, data = wavfile.read("mywav.wav")
# perform noise reduction
reduced_noise = nr.reduce_noise(y=data, sr=rate)
wavfile.write("mywav_reduced_noise.wav", rate, reduced_noise)

Then I passed the output to pyAudioAnalysis. The result was good: it no longer missed any speaking parts.
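
Putting the two steps together, the whole pipeline looks roughly like this (the intermediate file name is a placeholder):

from scipy.io import wavfile
import noisereduce as nr
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# 1. Noise reduction
rate, data = wavfile.read('../assets/202402181903_one_channel_120sec.wav')
reduced = nr.reduce_noise(y=data, sr=rate)
wavfile.write('../assets/reduced_noise.wav', rate, reduced)

# 2. Silence removal on the cleaned audio
[Fs, x] = aIO.read_audio_file('../assets/reduced_noise.wav')
segments = aS.silence_removal(x, Fs, 0.020, 0.020,
                              smooth_window=1.0, weight=0.3, plot=False)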

Todos (I’ll update this after I finish)

There are still two issues.

Firstly, noisereduce has a dependency on PyTorch, which is very big; it doesn’t make sense just to run one model. I want to find a suitable audio-to-audio ML model and make it work on the Raspberry Pi.

Secondly, noisereduce and pyAudioAnalysis can deal with situations where people speak over a very noisy background, like cooking or traffic sounds, but not with situations where the target person speaks while other people are also speaking. They can’t separate the target speaker from background speakers well.

The Hugging Face article should be helpful for both issues. Basically, I found my idea feasible enough, and I think all I need to do is find the right ML model.

That’s it!
