MATLAB: Can "fitlm" predict untrained categorical values?


Why can "fitlm" predict untrained categorical values? When I train a linear model on the categorical variable "Var3" using only the values '1' and '2', I should not be able to predict a sample whose categorical value is '3' (a value the model was never trained on).
REPRODUCTION STEPS:
 
% Create a table containing categorical variable
>> T = table([-82;-67;-82;-77;-113;-123;-116;-71;-106;-108], ...
[8.45;8.8;9.1;8.8;8.8;9.05;8.61;8.3;8.28;7.6], ...
categorical({'1';'1';'1';'2';'2';'2';'2';'3';'3';'3'}))
% Split data to training & test sets
>> training_set = T(1:7,:) % train on cat var 1 & 2 only
>> test_set = T(8:end,:) % test on cat var = 3
>> model = fitlm(training_set,'Var1 ~ Var2 + Var3');
>> prediction = predict(model,test_set); % this should not work

Best Answer

This happens because, even though your training set only contains rows where "Var3" is '1' or '2', MATLAB still stores '3' as a valid category of "Var3" in the training set. You can confirm this by listing the categories:
 
>> categories(training_set.Var3)
This behavior can be generalized with the example below:
 
>> catArray = categorical({'a', 'b', 'c', 'd', 'd', 'd'})
>> newCatArray = catArray(1:3)
>> categories(newCatArray) % notice that 'd' is still a valid category value for the sub-array "newCatArray"
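
If you want to list exactly which categories are retained but unused, a minimal sketch along these lines should work. It assumes "countcats" is used to count the elements in each category; the variable names "cats" and "unused" are only illustrative and are not part of the original example.

```matlab
% Build the same arrays as in the example above
catArray = categorical({'a', 'b', 'c', 'd', 'd', 'd'});
newCatArray = catArray(1:3);

% categories() lists every defined category, used or not
cats = categories(newCatArray);

% countcats() returns how many elements fall in each category;
% a count of zero marks a category that is defined but unused
unused = cats(countcats(newCatArray) == 0);
disp(unused)
```

Here only 'd' has a zero count, so it is the one category that the sub-array carries along without using.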
Therefore, when you train "model", it still accepts, and contains parameters for, "Var3" values from 1 to 3. You can see this by displaying the model:
>> model
To avoid this, use "removecats" to remove any unused categories from your training set before fitting:
 
>> training_set.Var3 = removecats(training_set.Var3);
>> model = fitlm(training_set, 'Var1 ~ Var2 + Var3');
>> data = predict(model, test_set) % now you will get "NaN" values when trying to predict an untrained category value
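
If you would rather catch this before calling "predict" than inspect the output for "NaN" afterwards, one possible approach is to compare the categories actually present in the training data with those in the test data. The sketch below is only an illustration: the names "trainVar3", "testVar3", "seenCats" and "isUnseen" are hypothetical, and it works on the "Var3" column alone so it does not need the fitted model.

```matlab
% Same Var3 values as in the question, split the same way (rows 1-7 vs 8-10)
Var3 = categorical({'1';'1';'1';'2';'2';'2';'2';'3';'3';'3'});
trainVar3 = Var3(1:7);
testVar3  = Var3(8:end);

% Categories that really occur in the training rows (unused ones dropped)
seenCats = categories(removecats(trainVar3));

% Flag test rows whose category never occurred in training
isUnseen = ~ismember(cellstr(testVar3), seenCats);

if any(isUnseen)
    warning('%d test row(s) use a category absent from the training data.', nnz(isUnseen));
end
```

With this split, all three test rows are flagged, because '3' never appears in the first seven rows.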