MATLAB: How to detect repetition in data


Hello,

So I have various data files containing different information, such as engine speed, engine torque, etc. Each file has about 10,000 points, one for each second (so the data was gathered for over two hours). I'm trying to analyze the data so that if it stays the same for 60 seconds, that gets flagged as an error. For example, if the engine speed was 79.356 for 60 consecutive data points, there is an error.

How do I go about doing this?

Best Answer

Do you only want to identify adjacent points of repetition in your data, or any points that are not unique? If you want the former, you could try loading the data into a numerical matrix and using the "diff" command across your vector. Any point where the difference is zero is a repeated value. In this way you can determine the beginning, extent, etc. of data repetition for whatever conditions you need to meet to throw an error.

Example:

data = 100*rand(1,10000); %random dataset
data(1,50:120) = 79.356; %set some data to constant value
datarep = ~diff(data);

Now, you can count the run-length of each set of repeated data. There might be other ways of doing it, but for run lengths I often convert to a string and use "regexp."

s = regexprep(num2str(datarep),' ',''); %convert to string, remove spaces
[ids runs] = regexp(s,'1+','start','match');
l = cellfun('length',runs);

In this way, ids will tell you where each set of repeated values starts, and l will tell you the length of each. This will give you enough information for seeing if your error conditions are met.
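As one of the "other ways" mentioned above, here is a minimal sketch that gets the same run starts and lengths without going through a string. The names d, runStart, runLen and nEqual are only illustrative, and padding the logical vector with zeros is my assumption about how to catch runs that touch either end of the data.

```matlab
data = 100*rand(1,10000);           % random dataset, as in the example above
data(1,50:120) = 79.356;            % set some data to a constant value
datarep = ~diff(data);              % true where a point equals its predecessor

d = diff([0 datarep 0]);            % +1 where a run of ones starts, -1 just past its end
runStart = find(d == 1);            % index in data where each repeated stretch begins
runLen   = find(d == -1) - runStart;% number of ones in each run (same as l above)
nEqual   = runLen + 1;              % number of equal consecutive data points
```

For the constant stretch at 50:120 this should report a run starting at index 50 with 70 ones, i.e. 71 equal points.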

Related Solutions

MATLAB: How to vectorize random permutation of data

You wish to randomly permute each of the rows of ‘data’. Then do this:

   (Simplified)
   [m,n] = size(data);
   [~,p] = sort(rand(m,n),2);
   perm_data = reshape(data(repmat((1-m:0).',n,1)+p(:)*m),m,n);
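A quick sanity check of the indexing trick, as a sketch on a small hypothetical matrix (magic(4) is just a stand-in for 'data', and rowsMatch is an illustrative name). The linear index works out to (p-1)*m + row, so each row should come back as a shuffled copy of itself.

```matlab
data = magic(4);                          % hypothetical 4x4 test matrix
[m,n] = size(data);
[~,p] = sort(rand(m,n),2);                % each row of p is a random permutation of 1:n
idx = repmat((1-m:0).',n,1) + p(:)*m;     % linear index of data(i, p(i,j)), column-major
perm_data = reshape(data(idx),m,n);

% Every row of perm_data should contain the same values as the matching row of data
rowsMatch = isequal(sort(perm_data,2), sort(data,2));
```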

MATLAB: Find a row of repeated values

Hi. Sorry for not getting back to your comment on my answer yesterday. Here is how I would do it:

First, some random data for my example:

data = 2100*rand(1,10000); %random dataset

Next, I'll make a few sections of data repetition:

data(1,50:120) = 79.356; %set some data to constant value
data(1,200:210) = 81.220; %set some data to constant value
data(1,400:520) = 1445.201; %set some data to constant value
data(1,900:948) = 0.113; %set some data to constant value

Now do the differencing. Runs of zeros in the difference will be potential problem areas. The ~ (logical NOT) operator is used to return binary data. That is, where the difference function returned zero (no change) we return "true." Everywhere else returns "false." So now we have a 9,999-element binary vector (diff returns one fewer element than its input) with sections of ones and zeros, and the ones are repetitions.

datarep = ~diff(data);
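To see what this step does on something small, here is a sketch with a hypothetical six-element vector. The tolerance variant at the end is my own assumption for signals that carry tiny numerical noise. It is not needed when a stuck sensor logs exactly the same value.

```matlab
x = [3 3 3 5 5 7];        % hypothetical sample values
d = diff(x);              % [0 0 2 0 2]
rep = ~d;                 % [1 1 0 1 0], true where x(i+1) == x(i)

% Assumption: treat values closer than tol as "the same"
tol = 1e-9;               % hypothetical tolerance
repTol = abs(d) <= tol;   % same result as rep for this x
```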

Now here is where I search for the runs of ones (the zeros of the difference). Like I said, there are definitely other ways of doing this, including using a for loop, but I find this to be the most compact and simple way I've come across. I'll split it up into steps instead of jamming it all together like I did yesterday.

First, turn your differenced vector into a string:

datarepstr = num2str(datarep); %convert to string

Turning a vector into a string puts spaces between each number, so we'll use a "regular expression replace" function to get rid of them and leave us just the ones and zeros. The function finds all points of ' ' in our string and replaces them with ''.

s = regexprep(datarepstr,' ',''); %remove spaces
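If you'd rather skip the space-removal step, one possible shortcut (a sketch, not what I used above) is to build the string of ones and zeros directly by adding the character '0' to the logical vector. The short datarep here is a hypothetical stand-in for the real one.

```matlab
datarep = logical([1 1 0 1 0]);            % hypothetical short differenced vector
s1 = regexprep(num2str(datarep),' ','');   % the two-step route described above
s2 = char(datarep + '0');                  % direct route, gives '11010'
same = isequal(s1, s2);                    % both routes should give the same string
```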

Now we want to find where all the ones are in the string, as well as how long each section of ones is. regexp searches our string for all cases where there are one or more ones, or '1+'. Our expression should find four different sections of ones (because that's how many runs of repetition I added). "ids" is the start of each section and "runs" is the section pulled out from the string.

[ids runs] = regexp(s,'1+','start','match'); %find all runs and the point where they start

The matches in runs are returned in a cell array (ids is an ordinary numeric vector of start positions). cellfun is a function that applies another function (in this case, length) to each cell of an array. It's like looping over each element but more compact. l should have four elements telling how long each run is.

l = cellfun('length',runs); %find the length of each run
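To make the "like looping" comparison concrete, here is a small sketch with a hypothetical cell array of matches. l_handle and l_loop are illustrative names, and all three versions should give the same lengths.

```matlab
runs = {'111','1','11111'};          % hypothetical matches, as regexp would return them

l = cellfun('length',runs);          % the form used above
l_handle = cellfun(@numel,runs);     % same thing with a function handle

l_loop = zeros(size(runs));          % the explicit loop that cellfun replaces
for k = 1:numel(runs)
    l_loop(k) = length(runs{k});
end
```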

Now we have everything we need to check whether any of our potential problem runs cross the line. The threshold depends on your sampling rate. If there's one data point every second, we'll see if any of our lengths are greater than sixty. If it's every half second, we'll look for >120. And so on. Keep in mind that a run of k ones in the differenced vector spans k+1 equal data points, so shift the threshold by one if the exact count matters.

if any(l > 60) %if any run is longer than 60, display message
  disp('Error')
end
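If your files don't all share one sampling rate, the threshold can be computed instead of hard-coded. This is only a sketch: samplePeriod, maxFlatSeconds and the run lengths in l are made-up values for illustration.

```matlab
samplePeriod   = 0.5;                       % seconds between data points (hypothetical)
maxFlatSeconds = 60;                        % how long the signal may stay constant
threshold = maxFlatSeconds / samplePeriod;  % 120 samples at half-second sampling

l = [10 130 45];                            % hypothetical run lengths
tooLong = l > threshold;                    % flags only the second run
if any(tooLong)
    disp('Error')
end
```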

Of course, you may want more info than that in your message. You may also want to stop execution of your program, in which case calling error instead of disp would be needed. You may want to tell which elements are the problematic repetitions, and you can do that, because you have the lengths of the runs in l and the indices of where each run starts in ids.
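As a rough sketch of a more informative message, the loop below prints where each offending run sits and what value it is stuck at. The variable bad and the message format are my own choices. Swap fprintf for error if you want execution to stop at the first problem.

```matlab
data = 2100*rand(1,10000);                  % random dataset
data(1,400:520) = 1445.201;                 % one long stretch of repetition

[ids, runs] = regexp(regexprep(num2str(~diff(data)),' ',''),'1+','start','match');
l = cellfun('length',runs);

bad = find(l > 60);                         % which runs cross the line
for k = bad
    % a run of l(k) ones starting at ids(k) covers data(ids(k)) through data(ids(k)+l(k))
    fprintf('Value %g repeats from index %d to %d (%d points)\n', ...
        data(ids(k)), ids(k), ids(k)+l(k), l(k)+1);
end
```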

Finally, here's the function in its entirety, now in a very compact form:

[ids runs] = regexp(regexprep(num2str(~diff(data)),' ',''),'1+','start','match');
l = cellfun('length',runs);
if any(l > 60)
  disp('Error')
end
