MATLAB: Speech Command detection in audio file

audioAudio Toolboxdeep learningDeep Learning Toolboxsignal processingSignal Processing Toolbox

Hi,

I need to remake "Speech Command Recognition Using Deep Learning" example so I can read audio from the wav file and get time intervals in which the command appears, but I don't know how to change real-time analysis from microphone into static file analysis in this example. Thank you for your help.

 %% Detect Commands Using Streaming Audio from Microphone
% Test your newly trained command detection network on streaming audio from
% your microphone. If you have not trained a network, then type
% |load('commandNet.mat')| at the command line to load a pretrained network
% and the parameters required to classify live, streaming audio. Try
% saying one of the commands, for example, _yes_, _no_, or _stop_.
% Then, try saying one of the unknown words such as _Marvin_, _Sheila_, _bed_,
% _house_, _cat_, _bird_, or any number from zero to nine.
%%


% Specify the audio sampling rate and classification rate in Hz and create
% an audio device reader that can read audio from your microphone.
fs = 16e3;
classificationRate = 20;
audioIn = audioDeviceReader('SampleRate',fs, ...
    'SamplesPerFrame',floor(fs/classificationRate));
%%
% Specify parameters for the streaming spectrogram computations and
% initialize a buffer for the audio. Extract the classification labels of
% the network. Initialize buffers of half a second for the labels and
% classification probabilities of the streaming audio. Use these buffers to
% compare the classification results over a longer period of time and by
% that build 'agreement' over when a command is detected.
frameLength = frameDuration*fs;
hopLength = hopDuration*fs;
waveBuffer = zeros([fs,1]);
labels = trainedNet.Layers(end).Classes;
YBuffer(1:classificationRate/2) = categorical("background");
probBuffer = zeros([numel(labels),classificationRate/2]);
framesNumber = audio/frameLength;
%%
% Create a figure and detect commands as long as the created figure exists.
% To stop the live detection, simply close the figure. 
h = figure('Units','normalized','Position',[0.2 0.1 0.6 0.8]);
while ishandle(h)
    
    % Extract audio samples from the audio device and add the samples to
    % the buffer.
    x = audioIn();
    waveBuffer(1:end-numel(x)) = waveBuffer(numel(x)+1:end);
    waveBuffer(end-numel(x)+1:end) = x;
    
    % Compute the spectrogram of the latest audio samples.
    spec = auditorySpectrogram(waveBuffer,fs, ...
        'WindowLength',frameLength, ...
        'OverlapLength',frameLength-hopLength, ...
        'NumBands',numBands, ...
        'Range',[50,7000], ...
        'WindowType','Hann', ...
        'WarpType','Bark', ...
        'SumExponent',2);
    spec = log10(spec + epsil);
    
    % Classify the current spectrogram, save the label to the label buffer,
    % and save the predicted probabilities to the probability buffer.
    [YPredicted,probs] = classify(trainedNet,spec,'ExecutionEnvironment','cpu');
    YBuffer(1:end-1)= YBuffer(2:end);
    YBuffer(end) = YPredicted;
    probBuffer(:,1:end-1) = probBuffer(:,2:end);
    probBuffer(:,end) = probs';
    
    % Plot the current waveform and spectrogram.
    subplot(2,1,1);
    plot(waveBuffer)
    axis tight
    ylim([-0.2,0.2])
    
    subplot(2,1,2)
    pcolor(spec)
    caxis([specMin+2 specMax])
    shading flat
    
    % Now do the actual command detection by performing a very simple
    % thresholding operation. Declare a detection and display it in the
    % figure title if all of the following hold:
    % 1) The most common label is not |background|.
    % 2) At least |countThreshold| of the latest frame labels agree.
    % 3) The maximum predicted probability of the predicted label is at
    % least |probThreshold|. Otherwise, do not declare a detection.
    [YMode,count] = mode(YBuffer);
    countThreshold = ceil(classificationRate*0.2);
    maxProb = max(probBuffer(labels == YMode,:));
    probThreshold = 0.5;
    subplot(2,1,1);
    if YMode == "background" || count<countThreshold || maxProb < probThreshold
        title(" ")
    else
        title(string(YMode),'FontSize',20)
    end
    
    drawnow
end

Best Answer

Hi Bartek,

Here is a sketch of how you can achieve this:

1) Read the contents of the audio file

[audioIn,fs] = audioread('filename.wav');

2) Split the signal into 1-second chunks, with overlap between consecutive chunks. The higher the overlap, the higher your resolution (i.e. how close you will be able to detect where the keyword occured)

y = buffer(audioIn, fs, round(3*fs/4));

3) Convert each chunk (each column of y) to a mel spectrogram, and classify it using the network. Based on the network results, you will get an estimate of where the word occured in your wav file (each chunk advances by 1 sec - overlap, so in the this example, it's ~250 ms.

HTH

Jihad

Best Answer

Related Solutions

MATLAB: Because when I cut a stereo audio, this happens to be mono

MATLAB: Audio file (.wav) being overwritten

Related Question