MATLAB: Find common files in two directories

I've got a script that I'm using to make comparison plots using data from two different sets of simulation runs. The data is stored in mat files in different folders. I use uigetdir to select the folders and then I search through the mat files in each folder to find matching test cases so I can make comparison plots for each matching test case.

I have a working solution. However, I don't understand why I have to go through two stages of conversion to finally get a cell array of strings containing the unique test case numbers so that I can sort them into numerical order instead of "ASCII dictionary order".

Here is the relevant code excerpt:

%% Find matching test cases in the two directories
o = dir(fullfile(dirold, '*.mat')); % 'old'
n = dir(fullfile(dirnew, '*.mat')); % 'new'
[C, iold, inew] = intersect({o.name}, {n.name}); % find common test case files in 'old' and 'new' directories
% Convert C to a sortable array of indices so the comparison plots will be in order by case number
cases = regexp(C, 'case(\d*)', 'tokens'); % extract case numbers (cell array of cells?)
for (iC = 1:length(cases)) % TODO: why do I need to do this step?
    x(iC) = cases{iC}; %#ok<SAGROW>
end
for (iC = 1:length(cases)) % convert to cell array of strings
    y(iC) = x{iC}; %#ok<SAGROW>
end
% Re-sort in numerical order instead of 'ASCII dictionary order'
[~,iy] = sort(str2num(char(y))); %#ok<ST2NM>
%% For each test case that is in both directories, make some comparison plots
for (iCase = iy')
    fprintf('\nFound matching test case ''%s''.\n', C{iCase});
    od = load(fullfile(dirold, C{iCase}));   % load 'old' data into 'od' struct
    nd = load(fullfile(dirnew, C{iCase}));   % load 'new' data into 'nd' struct
    ...
    <make the plots>
end

My concern is this: why do I have to go through the two-step process of creating the intermediate 'x' and 'y' just to get a sortable cell array of strings? Is there a more straightforward, less confusing way to do this? I don't understand why it is necessary, and future users of this code (including myself) won't understand it either.

Any help to simplify this (or at least clarify what is going on and why this mess is necessary) would be much appreciated.

Note: the reason I want to do the 'numeric' sort is so that I get plots for the test cases in the order 1, 2, …, 9, 10, 11, … instead of 10, 11, … 19, 1, 21, 22, …, 2, 3, 4, …, etc. The mat files are named caseX_… where X is the test case number. By default, the dir command, and hence the intersect command, are sorting by "ASCII dictionary order" which is not what I want.

Best Answer

There is a reason your regexp returns a cell array of cell arrays of cell arrays. The outer cell array exists simply because your C input is a cell array. So the outer cell array is always the same size as C, and each cell holds the matches for the corresponding string in C.

The second level of cell array exists because a single string may contain several matches, so the matches themselves have to be returned in a cell array. (For example, if you request to match 'a..', there are two matches in the string 'abcdaef': {'abc', 'aef'}.)

But it's not matches that you've requested, it's tokens (usually called captures in other languages). That adds another level and is the reason for the inner cell array. There may be several tokens per match, so the tokens for a match also have to be wrapped up in a cell array. (For example, if you request to match 'a(.)(.)', there are two tokens per match, so the tokens are {{'b', 'c'}, {'e', 'f'}} and the matches are as above.)
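
The nesting levels described above can be checked directly at the command line. This is only a small sketch; the variable names (str, m, t, c, firstToken) are made up for illustration, and the string is the one from the examples.

```matlab
% Illustrative string from the examples above (variable names are hypothetical)
str = 'abcdaef';

% 'match' returns one cell array holding every match found in the string
m = regexp(str, 'a..', 'match');          % {'abc', 'aef'}

% 'tokens' returns one cell per match, each holding that match's tokens
t = regexp(str, 'a(.)(.)', 'tokens');     % {{'b', 'c'}, {'e', 'f'}}

% With a cell array input, the result gains one more outer level
c = regexp({str}, 'a(.)(.)', 'tokens');   % 1x1 cell wrapping the same content as t
firstToken = c{1}{1}{1};                  % three levels of indexing to reach 'b'
```

The three sets of braces in the last line correspond to the three levels: which input string, which match, which token.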

In your case, you've only got one match and one token. You could actually get rid of these two levels of cell array.

To get rid of the cell array of tokens, simply ask for a match instead of tokens. There are many ways to build a regex. If you only want to capture a number preceded by a specific string, this would work:

cases = regexp(C, '(?<=case)\d+', 'match');

This matches one or more digits preceded by 'case' (using look-behind). That's one cell array level gone (the inner one).

To get rid of the cell array of matches, simply tell the regular expression engine you only want one match. This is done with the 'once' keyword:

cases = regexp(C, '(?<=case)\d+', 'match', 'once');

cases is then simply a string if C is a string, or a cell array of single strings if C is a cell array.
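
Putting this together with the numeric sort from the question, a minimal sketch might look like the following. The file names are invented stand-ins for the output of intersect, and str2double is my substitution for the str2num(char(...)) call in the question (it accepts a cell array of strings directly), not something the original code used.

```matlab
% Hypothetical file names standing in for the C returned by intersect
C = {'case10_run.mat', 'case1_run.mat', 'case2_run.mat'};

% One string per file: the digits that follow 'case'
cases = regexp(C, '(?<=case)\d+', 'match', 'once');   % {'10', '1', '2'}

% Convert the strings to numbers and sort numerically
[~, iy] = sort(str2double(cases));                    % iy = [2 3 1]

% File names in numeric case order
sortedNames = C(iy);
```

Because cases is a row here, iy is a row vector as well, so the plotting loop would be written as for iCase = iy, without the transpose. A file name with no digits after 'case' would yield an empty string, which str2double turns into NaN; sort places NaN values last in ascending order.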
