MATLAB: Can "fitlm" predict untrained categorical values?

categorical, fitlm, MATLAB, predict, untrained

Why can "fitlm" predict untrained categorical values? When I train a linear model on the categorical variable "Var3" with values '1' and '2' only, I would expect prediction to fail for a sample whose categorical value is '3', which the model was never trained on.

REPRODUCTION STEPS:

% Create a table containing categorical variable
>> T = table([-82;-67;-82;-77;-113;-123;-116;-71;-106;-108], ...
[8.45;8.8;9.1;8.8;8.8;9.05;8.61;8.3;8.28;7.6], ...
categorical({'1';'1';'1';'2';'2';'2';'2';'3';'3';'3'}))
% Split data to training & test sets
>> training_set = T(1:7,:)   % train on cat var 1 & 2 only
>> test_set = T(8:end,:)     % test on cat var = 3
>> model = fitlm(training_set,'Var1 ~ Var2 + Var3');
>> prediction = predict(model,test_set);  % this should not work

Best Answer

This happens because a categorical array keeps its full list of categories when you index into it. Even though the training set only contains rows where "Var3" is 1 or 2, MATLAB still stores 3 as a valid category of "Var3" in the training set. You can confirm this by listing the categories:

>> categories(training_set.Var3)

The same behavior can be seen in a more general example:

>> catArray = categorical({'a', 'b', 'c', 'd', 'd', 'd'})
>> newCatArray = catArray(1:3)
>> categories(newCatArray)  % notice that 'd' is still a valid category value for the sub-array "newCatArray"
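
One way to see which categories are carried along without being used is to count the elements in each category. The following is a minimal sketch using "countcats"; the variable names "allCats", "counts" and "unusedCats" are illustrative and not part of the original answer.

```matlab
% Build the same array as above and take the first three elements.
catArray = categorical({'a', 'b', 'c', 'd', 'd', 'd'});
newCatArray = catArray(1:3);

% categories() lists every defined category, whether it is used or not.
allCats = categories(newCatArray);

% countcats() returns the number of elements in each category,
% in the same order as categories().
counts = countcats(newCatArray);

% Categories with a count of zero are defined but unused.
unusedCats = allCats(counts(:) == 0);
disp(unusedCats)
```

Any category listed in "unusedCats" is one that a model trained on the sub-array would still treat as valid.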

Therefore, when you train "model", it still accepts every "Var3" category from 1 to 3 and contains a parameter for each of them:

>> model

To avoid this, you can use "removecats" to first remove any unused categorical values from your training set:

>> training_set.Var3 = removecats(training_set.Var3);
>> model = fitlm(training_set, 'Var1 ~ Var2 + Var3');
>> data = predict(model, test_set)        % predicting an untrained category value now returns "NaN"
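
If "NaN" predictions are easy to overlook in your workflow, you could also compare the test set against the trained categories before calling "predict". This is only a sketch, assuming that flagging such rows up front is acceptable; "trainedCats" and "isKnown" are hypothetical names, and the check itself does not call "fitlm".

```matlab
% Same data as in the question, with explicit variable names.
T = table([-82;-67;-82;-77;-113;-123;-116;-71;-106;-108], ...
    [8.45;8.8;9.1;8.8;8.8;9.05;8.61;8.3;8.28;7.6], ...
    categorical({'1';'1';'1';'2';'2';'2';'2';'3';'3';'3'}), ...
    'VariableNames', {'Var1', 'Var2', 'Var3'});
training_set = T(1:7,:);
test_set = T(8:end,:);

% Drop categories that do not occur in the training rows.
training_set.Var3 = removecats(training_set.Var3);
trainedCats = categories(training_set.Var3);

% Flag test rows whose category is present in the training set.
isKnown = ismember(cellstr(test_set.Var3), trainedCats);

if ~all(isKnown)
    warning('%d test row(s) use a category missing from the training set.', ...
        sum(~isKnown));
end
```

Rows where "isKnown" is false are the ones for which the retrained model has no matching category.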

Related Solutions

MATLAB: How do you find and replace rows in two tables with some variables in common

Use the "outerjoin" function to find the indices of the matching rows.

https://www.mathworks.com/help/matlab/ref/outerjoin.html#btx2ndz-5

>> [C,ia,ib] = outerjoin(A, B, 'Keys', [2:3], 'MergeKeys', true, 'Type', 'right');

"ia" and "ib" show where each row of table "C" comes from in tables "A" and "B". A zero in "ia" indicates that the row in table "C" does not appear in table "A". Find the positions where "ia" is nonzero, then look up the corresponding entries of "ib":

>> inda = find(ia~=0);
>> indb = ib(inda);

"ia(inda)" and "indb" are then the matching row indices in tables "A" and "B", respectively.

You can use those indices to index into table "B" and replace its values with the values you obtain by indexing into table "A".

>> B(indb,1)=A(ia(inda),1);
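
To make the index bookkeeping concrete, here is a small worked sketch of the same steps. The tables, their contents and the variable names "Val", "K1", "K2", "rowsA" and "rowsB" are made up for illustration; only the "outerjoin" call itself follows the answer above.

```matlab
% Hypothetical tables: "Val" is the value to copy, "K1" and "K2" are the keys.
A = table([10;20;30], [1;2;3], [7;8;9], 'VariableNames', {'Val', 'K1', 'K2'});
B = table([0;0;0],    [2;3;4], [8;9;5], 'VariableNames', {'Val', 'K1', 'K2'});

% Right outer join on variables 2 and 3: every row of B appears in C.
[C, ia, ib] = outerjoin(A, B, 'Keys', 2:3, 'MergeKeys', true, 'Type', 'right');

% Rows of C that have a match in A, and the matching rows of A and B.
inda = find(ia ~= 0);
rowsA = ia(inda);
rowsB = ib(inda);

% Overwrite the first variable of B with the matching values from A.
B(rowsB, 1) = A(rowsA, 1);
disp(B)
```

In this sketch the third row of "B" has no matching key in "A", so its "Val" entry is left unchanged.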

MATLAB: How to use custom date labels for the x-axis in MATLAB plots

It is more convenient to format the tick labels using the "XTick", "XTickLabelMode" and "XTickLabel" properties of the axes object than to use the "datetick" function:

>> figure
>> box on
>>
>> % Start, end and number of ticks...
>> startDate = 7.3457e+05;
>> endDate = 7.3458e+05;
>> numbertick = ceil(endDate - startDate);
>>
>> % Plot them
>> xData = linspace(startDate, endDate, numbertick);
>> plot(xData,ones(size(xData)));
>>
>> % Modify the axes properties
>> set(gca, 'XTick',xData)
>> set(gca, 'TickDir', 'out');
>> set(gca, 'XTickLabelMode', 'auto');
>>
>> % Modifying the labels
>> labels = get(gca, 'XTickLabel');
>>
>> for i = 1:length(labels)
>>     if i == 1 || i == ceil(length(labels)/2) || i == length(labels)
>>         labels{i} = datestr(xData(i), 'mm/dd');
>>     else
>>         labels{i} = '';
>>     end
>> end
>>
>> set(gca, 'XTickLabel', labels)
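
The label selection can also be built directly from the tick positions, without first reading the automatic labels back from the axes. This is a sketch of that variation rather than the method above; "keep" is a hypothetical name, and the number of ticks is fixed at 10 to match the range used in the example.

```matlab
% Same date range as above, expressed as serial date numbers.
startDate = 7.3457e+05;
endDate = 7.3458e+05;
xData = linspace(startDate, endDate, 10);

% Start with an empty label for every tick.
labels = repmat({''}, size(xData));

% Label only the first, middle and last ticks.
keep = unique([1, ceil(numel(xData)/2), numel(xData)]);

% Pass a column vector so datestr treats each entry as one serial date.
labels(keep) = cellstr(datestr(xData(keep).', 'mm/dd'));
disp(labels)
```

The resulting "labels" cell array can be passed to the axes with set(gca, 'XTick', xData, 'XTickLabel', labels).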
