MATLAB: Can I remove outlier lines from a large plot?


Thank you in advance for any help you're able to give me.
I'm currently working with a large battery ageing data set from NASA, and I've been able to pull the data out of the structure and plot it. However, looking at the plots, I can see some lines that are clearly wrong.
I was wondering if it would be possible to remove these lines from the plot, or alternatively filter them out from the cell array I create to hold them all (myData). I’ve been looking at outlier removal methods or de-noising but I can’t seem to find a way that would remove entire lines from the plot. I think my code may be written in a somewhat roundabout way so please let me know if there’s anything I can do to change it.
load('C:\Users\Harry\Documents\4th Year\Diss Work\Data\BatteryAgingARC-FY08Q4\B0005.mat');
L = length(B0005.cycle);
for i = 1:L
type = {B0005.cycle.type}.';
s1 = 'charge';
s2 = 'discharge';
C = strcmp(type, s1);
D = strcmp(type, s2);
hold on;
if C(i,:) == 1
figure(1);
title('Charge');
myData(:, i) = {B0005.cycle(i).data.Time}.';
myData(:, i+1) = {B0005.cycle(i).data.Voltage_measured}.';
cellfun(@plot,myData(:,i),myData(:, i+1));
i = i+1;
end
end
The file for the dataset is quite large so I can't attach it, but it's contained here, under Dataset 5, 'Download Battery Data Set 1'.
Thank you again.

Best Answer

It's important to define what constitutes an outlier prior to finding a solution for a few reasons.
  1. Without a definition you're just working in the dark and hoping to find some approach that "looks right" which could result in a lot of wasted time.
  2. Trying approaches until something "looks right" is subjective; it is not a disciplined, objective approach. Data that appear to be outliers caused by some external factor often turn out to be real data that should be included in the analysis.
  3. The context and background of the data and the instrumentation used to collect it should be part of the decision of what constitutes an outlier.
In any case, I explored the data a bit and found that ~35 of the timeseries contain <1000 samples while the remaining ~135 timeseries contain ~3400-3900 samples. My first thought was to see if those small-sampled timeseries are the outliers, so I plotted them separately. But as I mentioned previously, someone unfamiliar with what the data represents or how it was collected isn't the best person to define an outlier. So the results are plotted below and you can decide whether additional (or completely different) steps are needed.
Also, I restructured your code to greatly speed it up and make it more efficient. See the inline comments for more details.
% Identify file and load data
directory = 'C:\Users\name\Documents\MATLAB\BatteryAgingARC-FY08Q4';
file = 'B0005'; %without extension so we can use this to identify the variable name, too.
S = load(fullfile(directory,[file,'.mat']),file);
B = S.(file); %now the code is extendable to any file
% Identify charge type
type = {B.cycle.type}.';
s1 = 'charge';
s2 = 'discharge';
C = strcmp(type, s1);
D = strcmp(type, s2);
% Extract data with type='charge'
chargeCycle = B.cycle(C);
% Extract x and y values for plotting
ccData = [chargeCycle.data];
time = {ccData.Time};
voltMeas = {ccData.Voltage_measured};
% Plot all time series
figure(1);
title('Charge');
hold on;
cellfun(@plot, time, voltMeas)
% Let's look at the number of samples in each time series
nSamp = cellfun(@numel, time);
minSamp = min(nSamp); %Some timeseries are only 5 samples
figure()
plot(sort(nSamp),'o') %Clearly two groups: nSamp < 2000 and nSamp > 2000
xlabel('index'); ylabel('Number of samples (sorted)')
% Are the timeseries with fewer samples the outliers?
hasFewSamp = nSamp < 2000;
figure()
hold on
ph1 = cellfun(@(x,y)plot(x,y,'k-'), time(~hasFewSamp), voltMeas(~hasFewSamp));
ph2 = cellfun(@(x,y)plot(x,y,'r-'), time(hasFewSamp), voltMeas(hasFewSamp));
legend([ph1(1),ph2(1)],{'> 2000 samples','< 2000 samples'})
% zoom into the critical part of the plot
xlim([0, 3400])
ylim([3.4, 4.3])
% Check out the portion of data between x=3200:4000 and you'll see
% that the red lines do behave differently.
% Do you consider those outliers or are additional steps required?
% Look at another section
xlim([3000, 4000])
ylim([4.14, 4.22])
Next steps
If this classification works, the next steps are
  1. Remove the outliers (use the hasFewSamp variable). Try it, and if you get stuck you can share the lines of code showing your attempt.
  2. Instead of using a hard-coded nSamp threshold of 2000, compute this value, because the other data sets have different characteristics. Again, give that a try and circle back when needed.
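As a starting point for both steps, here is a minimal sketch. It is not the only way to do this and it rests on an assumption: the sample counts fall into two groups separated by one large gap, so the midpoint of the largest jump in the sorted counts can serve as the threshold. The demo arrays (nSampDemo and the synthetic time and voltMeas) are hypothetical stand-ins so that the snippet runs without the data file; with the real data, skip that part and use the time and voltMeas variables from the code above. The names threshold, timeClean and voltClean are also introduced here for illustration.

```matlab
% Hypothetical stand-in data so the sketch runs on its own.
% With the real file, use the 'time' and 'voltMeas' cell arrays from above.
nSampDemo = [5, 800, 950, 3400, 3600, 3900];
time      = arrayfun(@(n) (0:n-1).', nSampDemo, 'UniformOutput', false);
voltMeas  = arrayfun(@(n) 4.2*ones(n,1), nSampDemo, 'UniformOutput', false);

% Step 2: compute the threshold instead of hard-coding 2000.
% Assumption: two groups of sample counts separated by one large gap.
nSamp        = cellfun(@numel, time);
sortedSamp   = sort(nSamp);
[~, gapIdx]  = max(diff(sortedSamp));            % index of the largest jump
threshold    = mean(sortedSamp(gapIdx + [0, 1])); % midpoint of that gap

% Step 1: remove the time series that fall below the threshold.
hasFewSamp = nSamp < threshold;
timeClean  = time(~hasFewSamp);
voltClean  = voltMeas(~hasFewSamp);
```

The filtered arrays can then be plotted with the same cellfun(@plot, timeClean, voltClean) call used earlier. If a data set does not split into two clear groups, the largest-gap rule will still return a threshold, so it is worth checking the sorted nSamp plot before trusting the result.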