A comparison of two feature extraction pipelines for Keyword Spotting, referred to as MFCC_slow and MFCC_fast, tested on a Raspberry Pi 4.
The whole preprocessing pipeline for this Keyword Spotting task consists of a Short Time Fourier Transform (STFT) followed by a Mel-frequency cepstral coefficients (MFCCs) representation.
The objective is to define a preprocessing routine, referred to as MFCC_fast, that returns a tensor of a given shape while minimizing execution time and maximizing the signal-to-noise ratio (SNR).
The developed routine reduces the execution time by 32%, from 25 ms to 17 ms, with an SNR of 22.37 dB. A more detailed analysis is available at https://github.com/ScorcaF/FastMFCC/blob/21f000e0aa346a0eded66ec63c7705f6f4002c18/Group14_Homework1.pdf
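The two figures above can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: this description does not state how the SNR is computed, so the code assumes a common definition in which the MFCC_slow output is the reference signal and the difference between the two outputs is the noise. The function names `snr_db` and `relative_reduction` and the small `eps` term are hypothetical and are not taken from the repository.

```python
import numpy as np

def snr_db(reference, approximation, eps=1e-6):
    """SNR in dB of `approximation` against `reference` (assumed definition)."""
    reference = np.asarray(reference, dtype=np.float64)
    approximation = np.asarray(approximation, dtype=np.float64)
    signal = np.linalg.norm(reference)
    # eps avoids a division by zero when both tensors are identical
    noise = np.linalg.norm(reference - approximation) + eps
    return 20.0 * np.log10(signal / noise)

def relative_reduction(t_slow_ms, t_fast_ms):
    """Fraction of execution time saved by the fast routine."""
    return (t_slow_ms - t_fast_ms) / t_slow_ms

# Execution times reported above: 25 ms (slow) and 17 ms (fast)
print(relative_reduction(25.0, 17.0))  # 0.32

# Toy tensors, only to show the call; real inputs would be the two MFCC outputs
print(snr_db([3.0, 4.0], [3.0, 3.5]))
```

With this definition, a higher SNR means that the fast output is closer to the slow one, so the two objectives (speed and fidelity) pull in opposite directions.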
Always-on speech recognition is not energy efficient, as it requires transmitting a continuous audio stream to the cloud, where the data is processed. To mitigate this concern, devices first detect short keywords such as “Hey Siri” or “Ok Google” that wake up the device and trigger the full-scale speech recognition. This task, called Keyword Spotting, is much simpler and can therefore be performed on board the sensing nodes with lightweight Convolutional Neural Networks.
Before feeding the data to a Convolutional Neural Network, a set of preprocessing steps is required. The most common strategy is to move from the time domain to the frequency domain using the Short Time Fourier Transform (STFT). This transformation converts a one-dimensional time series into a two-dimensional image, which allows keyword spotting to be solved as an image classification problem.
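To make the time-to-frequency step concrete, here is a minimal NumPy sketch of an STFT magnitude spectrogram. It is not the routine used in this repository: the function name `stft_magnitude`, the Hann window, the 16 kHz sampling rate, and the 40 ms frame with a 20 ms step are all assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(signal, frame_length, frame_step):
    """Magnitude spectrogram of shape (num_frames, frame_length // 2 + 1)."""
    signal = np.asarray(signal, dtype=np.float64)
    num_frames = 1 + (len(signal) - frame_length) // frame_step
    window = np.hanning(frame_length)
    # Slice the signal into overlapping frames and apply the window to each
    frames = np.stack([
        signal[i * frame_step : i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    # Real FFT of every frame, keeping only the magnitude
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 1 kHz tone sampled at 16 kHz (assumed test signal)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spectrogram = stft_magnitude(tone, frame_length=640, frame_step=320)
print(spectrogram.shape)  # (49, 321)
```

Each row is one time frame and each column one frequency bin, which is what lets the spectrogram be treated as an image. Shorter frames or larger steps yield a smaller output and less computation, at the cost of frequency or time resolution.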
Another common feature extraction step relies on the hypothesis that representing sounds as the human ear perceives them improves classification accuracy. This is achieved by extracting the Mel-frequency cepstral coefficients (MFCCs) from the input signal. The Mel-frequency cepstrum is a representation of the STFT of a sound that tries to mimic how the membrane in the human ear senses sound vibrations. The MFCCs are the coefficients that compose the Mel-frequency cepstrum.
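The usual way to obtain MFCCs from a spectrogram is to apply a bank of triangular Mel filters, take the logarithm, and then apply a discrete cosine transform (DCT). The NumPy sketch below follows that recipe under stated assumptions: the Hz-to-Mel formula is the common 2595·log10(1 + f/700) variant, the DCT is left unnormalized, and the parameter values (40 Mel bins, 10 coefficients, 20 Hz to 4000 Hz) are illustrative defaults, not the settings of MFCC_slow or MFCC_fast. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_mel_bins, num_fft_bins, sample_rate, f_min, f_max):
    """Triangular Mel filters, shape (num_fft_bins, num_mel_bins)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, num_fft_bins)
    # Filter edges are equally spaced on the Mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max),
                                  num_mel_bins + 2))
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (fft_freqs[:, None] - lower) / (center - lower)
    falling = (upper - fft_freqs[:, None]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))

def dct_ii(x, num_coefficients):
    """Unnormalized DCT-II along the last axis, first coefficients only."""
    n = x.shape[-1]
    k = np.arange(num_coefficients)[:, None]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5) * k)
    return x @ basis.T

def mfcc_from_spectrogram(spectrogram, sample_rate, num_mel_bins=40,
                          num_coefficients=10, f_min=20.0, f_max=4000.0):
    weights = mel_filterbank(num_mel_bins, spectrogram.shape[-1],
                             sample_rate, f_min, f_max)
    mel_spectrogram = spectrogram @ weights
    log_mel = np.log(mel_spectrogram + 1e-6)  # offset avoids log(0)
    return dct_ii(log_mel, num_coefficients)

# Placeholder spectrogram with 49 frames and 321 frequency bins
rng = np.random.default_rng(0)
fake_spectrogram = rng.random((49, 321))
mfccs = mfcc_from_spectrogram(fake_spectrogram, sample_rate=16000)
print(mfccs.shape)  # (49, 10)
```

The filterbank depends only on the configuration, not on the audio, so it can be computed once and reused for every input. Reducing the number of Mel bins or narrowing the frequency range shrinks the matrix product, which is one of the knobs that trades execution time against fidelity.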