Silence removal with Raspberry Pi and Python

Ats · Mar 2, 2024

This is documentation of what I have been doing for the last few weeks: some explorations around audio manipulation.


First of all

These are my development environments

Hardware

Software

Background

I have recently been working with audio for fun and wanted to build an application out of what I learned. The idea came from a vlog: one day I was listening to a vlog on my laptop, and there was a lot of dead time when nobody spoke. I wanted to remove that blank time from the original and listen only to the active parts.

What I did

I found an article that covers almost exactly what I wanted to do and used it as my reference. The author did a great job, so if you only want to manipulate audio, just read that.

pydub

I started with pydub, like the reference. It depends on ffmpeg, which is installed by default on the Raspberry Pi. When I tested on my local machine, I used Docker to get ffmpeg and Jupyter.

FROM jupyter/base-notebook

USER root
WORKDIR /home/jovyan/work

# Running as root, so no sudo needed
RUN apt-get update && apt-get -y install ffmpeg

COPY ./requirements.txt /home/jovyan/work
RUN python -m pip install --no-cache-dir -r requirements.txt
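
The requirements.txt referenced by the Dockerfile would contain roughly the following libraries, based on what appears later in this post (a sketch; pin versions as needed):

# requirements.txt (rough sketch)
pydub
librosa
matplotlib
numpy
scipy
pyAudioAnalysis
noisereduce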

First, I recorded sample voices with arecord -d 15 sample.wav on the Raspberry Pi to test the reference code snippets. However, it didn’t work well. The chart looked like the one below.

I did a quick search on Google and found that there are a few important configuration parameters for audio. They are the following:

  1. Number of channels (Mono or Stereo)
  2. Sampling rate
  3. Bit depth
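
You can check all three for a recorded file with the standard-library wave module, for example:

import wave

with wave.open('assets/output_one_channel.wav', 'rb') as wav:
    print('Channels:     ', wav.getnchannels())             # 1 = mono, 2 = stereo
    print('Sampling rate:', wav.getframerate(), 'Hz')
    print('Bit depth:    ', wav.getsampwidth() * 8, 'bit')  # sample width in bytes * 8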

In the code snippets in the reference, the sampling rate is computed from the audio file, but the number of channels and bit depth are hard-coded to mono and 16-bit. So I decided to adjust my recording script to match, using Python. My script to record sound with a mono channel at 16-bit is attached below.
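
A minimal sketch along those lines, assuming the sounddevice and soundfile libraries and a 16 kHz sampling rate:

import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000  # Hz
DURATION = 15        # seconds
CHANNELS = 1         # mono

# Record 16-bit signed samples from the default input device
recording = sd.rec(int(DURATION * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE,
                   channels=CHANNELS,
                   dtype='int16')
sd.wait()  # block until the recording is finished

# Write a 16-bit PCM WAV file
sf.write('assets/output_one_channel.wav', recording, SAMPLE_RATE, subtype='PCM_16')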

Then I got the following chart, which looked good.

Afterward, I tried to find the silent parts using silence.detect_silence from pydub, like below.

import os
from pydub import AudioSegment, silence

file = 'assets/output_one_channel.wav'
assert os.path.isfile(file)

myaudio = AudioSegment.from_wav(file)
# Returns a list of [start_ms, end_ms] ranges quieter than silence_thresh (in dBFS)
silent_ranges = silence.detect_silence(myaudio, min_silence_len=200, silence_thresh=-20)
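
detect_silence only reports where the silence is. To actually drop it, one sketch is to take the complementary ranges with detect_nonsilent (same thresholds) and stitch the audio back together with pydub's millisecond slicing (the output path is just a placeholder):

nonsilent_ranges = silence.detect_nonsilent(myaudio, min_silence_len=200, silence_thresh=-20)

active_audio = AudioSegment.empty()
for start_ms, end_ms in nonsilent_ranges:
    active_audio += myaudio[start_ms:end_ms]  # AudioSegment slices by milliseconds

active_audio.export('assets/output_active_only.wav', format='wav')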

It worked well, but it was quite hard to configure silence_thresh because its unit is dBFS, which is different from what I had plotted on the y-axis of the previous chart. So I tried to redraw the chart with dBFS on the y-axis. I was not 100% sure the following calculation is correct, but I gave it a go.

import os
import numpy as np
import matplotlib.pyplot as plt
import librosa

file = '../assets/output_one_channel.wav'
assert os.path.isfile(file)

# Load audio file
y, sr = librosa.load(file, sr=16000)

# Get the absolute maximum value for normalization
max_abs = np.max(np.abs(y))

# Normalize wave values to [-1, 1]
y_normalized = y / max_abs

# Convert wave values to dBFS (a tiny epsilon avoids log(0) for silent samples)
y_dBFS = 20 * np.log10(np.abs(y_normalized) + 1e-10)

time = np.arange(0, len(y)) / sr
plt.plot(time, y_dBFS)
plt.xlabel('Time (s)')
plt.ylabel('dBFS')
plt.show()

Then I got the chart, and it was good enough to figure out a threshold. However, when I tested a few more times with different amounts of background noise, the chart was easily affected by the noise, so I had to change the threshold every time while looking at the chart. That didn’t make sense, so I decided to move on to pyAudioAnalysis.

The following Jupyter notebook is the whole script for pydub.

pyAudioAnalysis

According to the reference, I don’t have to configure a threshold to detect the silence; the library determines the threshold automatically instead of me.

I just copied and pasted the reference code, and the result sounded more accurate than pydub.

# Import required libraries
import os
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Path to audio file
file = '../assets/202402181903_one_channel_120sec.wav'
assert os.path.isfile(file)

# silence_removal returns the active (non-silent) segments of the audio file
[Fs, x] = aIO.read_audio_file(file)
segments = aS.silence_removal(x,
                              Fs,
                              0.020,   # short-term window size (seconds)
                              0.020,   # short-term window step (seconds)
                              smooth_window=1.0,
                              weight=0.3,
                              plot=True)

However, it was too sensitive to silence and cut the audio into very small windows. I think I could tune this with the smooth_window and weight parameters, but since the meanings of those parameters were not easy to understand, I just did some manipulation of the output instead.

MIN_LENGTH, MIN_INTERVAL = 3, 3
START, END = 0, 1

# Ignore sounds that are too short
modified_segments = [segment for segment in segments if segment[END] - segment[START] > MIN_LENGTH]

# Combine two sounds that have too short an interval between them
combined_segments = [modified_segments[0][:]]
index = 1
while index < len(modified_segments):
    if modified_segments[index][START] - combined_segments[-1][END] > MIN_INTERVAL:
        combined_segments.append(modified_segments[index][:])
    else:
        combined_segments[-1][END] = modified_segments[index][END]
    index += 1
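
To listen to the result, here is a sketch that cuts the original file at these boundaries (silence_removal reports them in seconds, while pydub slices in milliseconds; the output path is a placeholder):

from pydub import AudioSegment

audio = AudioSegment.from_wav(file)

# Keep only the combined active segments
active_audio = AudioSegment.empty()
for start_s, end_s in combined_segments:
    active_audio += audio[int(start_s * 1000):int(end_s * 1000)]

active_audio.export('../assets/active_only.wav', format='wav')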

Then I got quite good results. But I found some issues when there was a very noisy background while recording; in that case, pyAudioAnalysis sometimes missed speaking parts. So I decided to do noise cancellation before the audio is passed to pyAudioAnalysis.

The following Jupyter notebook is the whole script for pyAudioAnalysis.

noisereduce

I did a quick Google search on noise cancellation and found the noisereduce repository.

It was very easy to use. I just copied and pasted the sample code from the README.

from scipy.io import wavfile
import noisereduce as nr
# load data
rate, data = wavfile.read("mywav.wav")
# perform noise reduction
reduced_noise = nr.reduce_noise(y=data, sr=rate)
wavfile.write("mywav_reduced_noise.wav", rate, reduced_noise)

Then I passed the output to pyAudioAnalysis. The result was good: it no longer missed any speaking parts.
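
Putting the two steps together, the whole pipeline looks roughly like this (the intermediate file name is a placeholder):

from scipy.io import wavfile
import noisereduce as nr
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# 1. Noise reduction
rate, data = wavfile.read('../assets/202402181903_one_channel_120sec.wav')
reduced = nr.reduce_noise(y=data, sr=rate)
wavfile.write('../assets/reduced_noise.wav', rate, reduced)

# 2. Silence removal on the cleaned audio
[Fs, x] = aIO.read_audio_file('../assets/reduced_noise.wav')
segments = aS.silence_removal(x, Fs, 0.020, 0.020,
                              smooth_window=1.0, weight=0.3, plot=False)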

Todos (I’ll update this after I finish)

There are still two issues.

Firstly, noisereduce has a dependency on PyTorch, which is very big; it doesn’t make sense just to run one model. I want to find a suitable audio-to-audio ML model and make it work on the Raspberry Pi.

Secondly, noisereduce and pyAudioAnalysis can deal with situations where people speak over a very noisy background, like cooking or traffic sounds, but not with situations where the target person speaks while other people are also speaking. They can’t separate the target speaker from background speakers well.

The Hugging Face article should be helpful for both issues. Basically, I found my idea feasible enough, and I think all I need to do is find the right ML model.

That’s it!
