MATLAB: How to find strings in a very large array of data

csvread, regex, regexprep, textscan

Hi
I have a CSV file containing mostly numbers and a few random strings like 'zgdf'. I need to find the strings and set them to zero. I cannot use 'csvread' (because of the strings), so I use 'textscan' to read the file.
I then convert the data to numbers using str2double. MATLAB turns the string values into NaN, which is fine for me, but it takes a long time, especially because this has to be done for many similar files.
Is there a faster way to do this?
This is how I read the data (the original file has two columns and a large number of rows):
fileID = fopen(filename);
C = textscan(fileID,'%s %s','Delimiter',',');
fclose(fileID);
for i = 1:length(C{1})
    D(i) = str2double(C{1}{i});
end
Thanks

Best Answer

[This answer has been reorganized following the discussion in the comment section under the question]
Method 1
fid = fopen('myCSVfile.csv');
C = textscan(fid,'%s %s','Delimiter',',');
fclose(fid);
A = str2double(C{1}); % Faster than doing the same thing in a loop.
[Update] The loop method below is actually faster:
A = zeros(size(C{1})); % <--- always pre-allocate!
for i = 1:numel(C{1})
    A(i) = str2double(C{1}{i}); % store each converted value in its own element
end
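The question asks for the non-numeric entries to end up as zero rather than NaN, so one extra line after the conversion can take care of that. Here is a minimal sketch, assuming a small hypothetical cell array C1 that stands in for C{1} from textscan:

```matlab
% C1 is a hypothetical stand-in for C{1} (a cell array of character vectors).
C1 = {'1.5'; 'zgdf'; '3'; ''};

A = str2double(C1);   % non-numeric and empty entries become NaN
A(isnan(A)) = 0;      % replace every NaN with zero using logical indexing

disp(A.')             % displays 1.5 0 3 0
```

Note that this also zeroes any genuine NaN values in the file, so it only fits if NaN never appears as real data.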
Method 2
Try this modification of the script produced by the Import Data tool. Rather than importing your data as text and then converting it with str2double(), this imports the data as numeric and replaces non-numeric elements with NaN. I think it should be faster than your approach, but I doubt it is much faster (and it may not be faster at all).
The only two things you'll need to change to adapt this to your data are
  • file (the filename, or, preferably, the full path to your file)
  • The NumVariables value (the number of columns of data)
%% Setup the Import Options and import the data
file = "C:\Users\name\Documents\MATLAB\myCSVfile.csv"; % Full path to your file (or just file name)
opts = delimitedTextImportOptions("NumVariables", 2); % Number of columns of data
opts.VariableTypes(:) = {'double'}; % read in all data as double (nan for strings)
opts.Delimiter = ",";
opts.ExtraColumnsRule = "ignore";
opts.EmptyLineRule = "read";
Data = readtable(file, opts); % Read in as table
Data = Data{:,:}; % Convert to matrix
Method 3
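% C is the cell array returned by textscan in Method 1
D = zeros(size(C{1})); % <--- pre-allocate!
for j = 1:length(C{1})
    s = sscanf(C{1}{j},'%f'); % empty if the text does not start with a number
    if ~isempty(s)
        D(j) = s; % non-numeric entries keep their pre-allocated value of 0
    end
end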
This is 4.5x faster than method 1.
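The speed-up will depend on the data, the machine, and the MATLAB release, so it may be worth timing both approaches yourself. Below is a rough sketch that compares str2double() against the sscanf() loop on synthetic data. The variable names, the array size, and the placement of the 'zgdf' entries are assumptions made for illustration only:

```matlab
% Build a synthetic cell array of numeric text with a few non-numeric entries.
n = 20000;                                   % hypothetical size; increase to stress-test
c = arrayfun(@(x) sprintf('%.6f', x), rand(n,1), 'UniformOutput', false);
c(1:1000:end) = {'zgdf'};                    % sprinkle in some random strings

% Method 1: vectorized str2double (strings become NaN)
tic;
A = str2double(c);
t1 = toc;

% Method 3: sscanf in a loop (strings stay at the pre-allocated 0)
tic;
D = zeros(size(c));
for j = 1:numel(c)
    s = sscanf(c{j}, '%f');
    if ~isempty(s)
        D(j) = s;
    end
end
t3 = toc;

fprintf('str2double: %.3f s, sscanf loop: %.3f s\n', t1, t3);
```

A single tic/toc measurement is noisy, so repeating the run a few times (or wrapping each method in a function and using timeit) gives a more reliable comparison.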
Method 4
This FEX function is designed to overcome the slow speed of str2double()
Method 5
A very fast solution is to read the data in using readmatrix(), which automatically converts non-numeric elements to NaN, but it requires R2019a or later.
file = 'myCSVfile.csv';
D = readmatrix(file); %that's it, just 2 lines