MATLAB: Not sure if I set up this neural network correctly

Tags: audio classification, Audio Toolbox, Statistics and Machine Learning Toolbox

Below is my code as well as the information about the variables for a basic audio classification problem, which is reading an audio file and distinguishing whether the signal is a car horn or a dog barking. I followed the same format as this tutorial I found: https://www.mathworks.com/help/audio/gs/classify-sound-using-deep-learning.html.
I'm not sure where I went wrong, but during training the program did not plot the loss value, and when I tried to classify a sample file, the result was "<undefined>". I would appreciate any help with this.
% --------------------------------------------------------------
% Loading Training and Evaluation Sets for Car Horn and Dog Bark
% --------------------------------------------------------------
carDataStore = UrbanSound8K(UrbanSound8K.class == "car_horn",:);
carDataStore = carDataStore(carDataStore.salience == 1,:);
dogDataStore = UrbanSound8K(UrbanSound8K.class == "dog_bark",:);
dogDataStore = dogDataStore(dogDataStore.salience == 1,:);
carData = [];
dogData = [];
% Add first 2 seconds of each audiofile to their respective matrices and
% produce labels
for i = 1:height(carDataStore)
    thisfile = "UrbanSound8K\audio\fold" + string(carDataStore(i,:).fold) + "\" + string(carDataStore(i,:).slice_file_name);
    if audioinfo(thisfile).Duration >= 2 && audioinfo(thisfile).SampleRate == 44100
        [y,fs] = audioread(thisfile);
        samples = [1,2*fs];
        clear y fs;
        [y,fs] = audioread(thisfile, samples);
        carData = [carData,y(:,1)];
    end
end
carLabels = repelem(categorical("car horn"),width(carData),1);
for i = 1:height(dogDataStore)
    thisfile = "UrbanSound8K\audio\fold" + string(dogDataStore(i,:).fold) + "\" + string(dogDataStore(i,:).slice_file_name);
    if audioinfo(thisfile).Duration >= 2 && audioinfo(thisfile).SampleRate == 44100
        [y,fs] = audioread(thisfile);
        samples = [1,2*fs];
        clear y fs;
        [y,fs] = audioread(thisfile, samples);
        dogData = [dogData,y(:,1)];
    end
end
dogLabels = repelem(categorical("dog barking"),width(dogData),1);
dogVals = round(0.8*width(dogData));
carVals = round(0.8*width(carData));
audioTrain = [dogData(:,1:dogVals),carData(:,1:carVals)];
labelsTrain = [dogLabels(1:dogVals);carLabels(1:carVals)];
audioValidation = [dogData(:,(dogVals + 1):end),carData(:,(carVals + 1):end)];
labelsValidation = [dogLabels((dogVals + 1):end);carLabels((carVals + 1):end)];
% ---------------------------------------------------------
% Audio Feature Extractor to reduce dimensionality of audio,
% Extracting slope and centroid of mel spectrum over time
% ---------------------------------------------------------
aFE = audioFeatureExtractor("SampleRate",fs, ...
"SpectralDescriptorInput","melSpectrum", ...
"spectralCentroid",true, ...
"spectralSlope",true);
featuresTrain = extract(aFE,audioTrain);
[numHopsPerSequence,numFeatures,numSignals] = size(featuresTrain);
featuresTrain = permute(featuresTrain,[2,1,3]);
featuresTrain = squeeze(num2cell(featuresTrain,[1,2]));
numSignals = numel(featuresTrain);
[numFeatures,numHopsPerSequence] = size(featuresTrain{1});
featuresValidation = extract(aFE,audioValidation);
featuresValidation = permute(featuresValidation,[2,1,3]);
featuresValidation = squeeze(num2cell(featuresValidation,[1,2]));
% ----------------------------------------
% Defining the Neural Network Architecture
% ----------------------------------------
layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(50,"OutputMode","last")
    fullyConnectedLayer(numel(unique(labelsTrain)))
    softmaxLayer
    classificationLayer];
options = trainingOptions("adam", ...
    "Shuffle","every-epoch", ...
    "ValidationData",{featuresValidation,labelsValidation}, ...
    "Plots","training-progress", ...
    "Verbose",false);
net = trainNetwork(featuresTrain,labelsTrain,layers,options);

Best Answer

Hi Saketh,
I believe the example you're following is more of a 'hello-world' type example; your current code is trying to accomplish something more difficult. You'll probably need to extract features that carry more information and, depending on your end goal, also standardize them.
As for your particular questions and why the network is not working, it's difficult to say without being able to walk through your code, which would require access to that dataset, and I don't have it.
Below, I've written something similar to your code but using the ESC-10 dataset, which can be downloaded from the MathWorks support files. Hopefully reading through it will help with your current problem.
I changed the extracted features to the MFCC, delta MFCC, and delta-delta MFCC. The dataset does not have car sounds, so we're classifying "dog" and "helicopter" instead. Instead of trimming the signals, we pass in cell arrays of features and tell the network how to trim the sequences if they're not the same length. The amount of training and validation data is tiny, so we lower the ValidationFrequency value (the number of iterations between validation passes) to make sure validation data is plotted (a similar issue might be why you're not seeing the loss).
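As a rough illustration of why the validation setting matters with so little data, the arithmetic can be sketched as follows. This is only an estimate, not part of the original answer: the training-set size of 64 is a hypothetical figure, the mini-batch size (128), epoch count (30), and validation frequency (50) are the documented trainingOptions defaults, and the "at least one iteration per epoch" rule is an assumption about how a data set smaller than one mini-batch is handled.

```matlab
% Rough estimate of how many periodic validation passes occur in training.
numTrainingSignals = 64;   % hypothetical; use numel(featuresTrain) in practice
miniBatchSize      = 128;  % trainingOptions default
maxEpochs          = 30;   % trainingOptions default

% Assumption: at least one iteration per epoch, even when the data set is
% smaller than a single mini-batch.
iterationsPerEpoch = max(1,floor(numTrainingSignals/miniBatchSize));
totalIterations    = iterationsPerEpoch*maxEpochs;

defaultFrequency = 50;     % default ValidationFrequency
chosenFrequency  = 20;     % value used in the answer's training options

% Count how many multiples of each frequency fit inside the training run.
validationsDefault = floor(totalIterations/defaultFrequency);
validationsChosen  = floor(totalIterations/chosenFrequency);

fprintf("Total iterations: %d\n",totalIterations);
fprintf("Periodic validations at frequency %d: %d\n",defaultFrequency,validationsDefault);
fprintf("Periodic validations at frequency %d: %d\n",chosenFrequency,validationsChosen);
```

Under these assumptions the run is shorter than the default validation interval, so a smaller ValidationFrequency value is needed for a periodic validation point to fall inside training. The answer's full ESC-10 example follows.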
% Download dataset
url = 'https://ssd.mathworks.com/supportfiles/audio/ESC-10.zip';
outputLocation = tempdir;
unzip(url,outputLocation)
% Create audioDatastore to point to dataset. Use the folder names as the
% labels.
esc10Datastore = audioDatastore(fullfile(outputLocation,'ESC-10'), ...
'IncludeSubfolders',true,'LabelSource','foldernames');
% Subset to only include 'dog' and 'helicopter' labels.
ads = subset(esc10Datastore,esc10Datastore.Labels==categorical("dog") | ...
esc10Datastore.Labels==categorical("helicopter"));
% Split the datastore into train and validation sets.
[adsTrain,adsValidation] = splitEachLabel(ads,0.8);
% Read a single signal from the train datastore and listen to it.
[audioIn,audioInfo] = read(adsTrain);
fs = audioInfo.SampleRate;
sound(audioIn,fs)
% Create an audioFeatureExtractor
aFE = audioFeatureExtractor("SampleRate",fs, ...
"mfcc",true, ...
"mfccDelta",true, ...
"mfccDeltaDelta",true);
% Get the number of features output per signal
features = extract(aFE,audioIn);
[numHops,numFeatures] = size(features);
% Read all audio data into memory
dataTrain = readall(adsTrain);
labelsTrain = removecats(adsTrain.Labels); %remove empty categories
dataValidation = readall(adsValidation);
labelsValidation = removecats(adsValidation.Labels);
% Extract features from all the data (assumes the entire dataset uses the same sample rate, 44.1 kHz).
featuresTrain = cellfun(@(x)(extract(aFE,x))',dataTrain,'UniformOutput',false);
featuresValidation = cellfun(@(x)(extract(aFE,x))',dataValidation,'UniformOutput',false);
% Define the architecture
layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(100,"OutputMode","last") %< increased number of hidden units
    fullyConnectedLayer(numel(unique(labelsTrain)))
    softmaxLayer
    classificationLayer];
% Define the training options
options = trainingOptions("adam", ...
    "Shuffle","every-epoch", ...
    "ValidationData",{featuresValidation,labelsValidation}, ...
    "Plots","training-progress", ...
    "Verbose",false, ...
    "SequenceLength","shortest", ... %<--Specify the sequence length (try experimenting with different options)
    "ValidationFrequency",20);
% Train the network
net = trainNetwork(featuresTrain,labelsTrain,layers,options);
% Evaluate performance on the validation set
y = classify(net,featuresValidation);
accuracy = mean(y==labelsValidation);
cm = confusionchart(labelsValidation,y);
cm.Title = sprintf('Confusion Matrix for Validation Data (Accuracy = %0.2f)',accuracy);
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';
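
The answer also mentions standardization without showing it. One possible way to do it, sketched here under assumptions rather than taken from the answer, is to compute a per-feature mean and standard deviation over all training frames and apply those same statistics to the validation set. The random cell arrays below are hypothetical stand-ins for featuresTrain and featuresValidation, using the same numFeatures-by-numHops layout as the code above; in practice this step would run before trainNetwork.

```matlab
% Synthetic stand-ins for the real feature cell arrays (hypothetical data).
% Each cell holds one signal as a numFeatures-by-numHops matrix.
rng(0)
numFeatures = 3;
featuresTrain      = {5*randn(numFeatures,10)+2; 5*randn(numFeatures,12)+2};
featuresValidation = {5*randn(numFeatures,8)+2};

% Pool every training frame and compute per-feature statistics.
allFrames = [featuresTrain{:}];      % numFeatures-by-(total number of hops)
mu    = mean(allFrames,2);
sigma = std(allFrames,0,2);
sigma(sigma == 0) = 1;               % guard against constant features

% Apply the training statistics to both sets so that the validation data
% is scaled exactly the same way as the training data.
standardize = @(x) (x - mu)./sigma;
featuresTrain      = cellfun(standardize,featuresTrain,'UniformOutput',false);
featuresValidation = cellfun(standardize,featuresValidation,'UniformOutput',false);

% Pooled training features after standardization, for a quick sanity check.
pooledTrain = [featuresTrain{:}];
```

Using the training statistics for the validation set, rather than recomputing them, keeps the two sets on the same scale and avoids leaking validation information into the preprocessing.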