Hello, when converting strings to numbers using str2num, the command takes "forever": processing ~300 MB of log data (stored in a txt file) took ~15 hours! How can one use "strrep" and "str2num" on cell contents?
The data is stored in a cell array (every cell contains one line from the source file as a string of tab-separated values).
log_data_text{:} % contains a cell for each line of the source file
For pre-processing, I need to perform some substitutions (e.g. replace spaces " " with "nan"). This needs to be done individually for each cell: one cell may have a space at position 10 (and/or elsewhere), while another cell may have spaces at different positions or no spaces at all.
Finally, I need to convert the string of each cell to an array of numbers.
My first approach was to create one long string out of all cells, convert it, and reshape the resulting 1-dimensional vector back to the original size:
complete_string = convertCharsToStrings([log_data_text_end_delimiter{:}]); % create one string containing all data
complete_string = strrep(complete_string, '\t\t', '\t \t'); % substitution of empty value to space
complete_string = strrep(complete_string, ' ', 'nan'); % substitute spaces (=empty values) with nan
number_array_b = str2num(complete_string);
number_of_lines = length(log_data_text_end_delimiter);
number_array_b = transpose(reshape(number_array_b, 6, number_of_lines)); % reshape to original size (6 columns)
This approach takes less than a minute, which would be fine. Unfortunately, it returns an empty array most of the time when used on different log data txt files, even though their structure is identical.
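A side note on the two strrep calls above, offered as something to check rather than a confirmed diagnosis: in MATLAB, a single-quoted '\t' is the two literal characters backslash and t, and strrep does not interpret escape sequences. The pattern '\t\t' therefore only matches if the text literally contains those four characters. The sketch below uses a made-up sample string (not taken from the log file) to show the difference when the tab character is built with sprintf.

```matlab
% Hypothetical sample: two values separated by two real tabs (one empty field).
line_with_gap = sprintf('1.5\t\t2.5');

% '\t\t' in single quotes is four literal characters (backslash, t, backslash, t),
% so strrep finds nothing to replace and returns the input unchanged.
unchanged = strrep(line_with_gap, '\t\t', '\tnan\t');

% Building the pattern with sprintf yields real tab characters (char(9)).
tab = sprintf('\t');
filled = strrep(line_with_gap, [tab tab], [tab 'nan' tab]);

disp(isequal(unchanged, line_with_gap));  % shows whether the first call changed anything
disp(filled);
```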
Therefore, I have to use a cell-by-cell routine, looping over all cells.
for yy = 1 : number_of_lines
    modified_string = log_data_text_end_delimiter{yy};
    modified_string = strrep(modified_string, '\t\t', '\t \t');
    modified_string = strrep(modified_string, ' ', 'nan');
    temp_number_array = str2num(modified_string);
    number_array(yy, 2:7) = temp_number_array; % the first column already contains data
end
Since there are many lines (=cells), str2num is called millions of times and takes literally hours. How can I optimize this conversion?
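For context, the MATLAB documentation describes str2num as evaluating its input with eval, which is a plausible source of the per-call cost. A minimal sketch of one alternative, assuming each prepared string contains only numbers and the text nan separated by whitespace: sscanf with '%f' parses such a string without going through eval. The sample string is hypothetical.

```matlab
% Hypothetical prepared line: six tab-separated fields, one already set to 'nan'.
modified_string = sprintf('1\tnan\t3\t4\t5\t6');

% sscanf returns a column vector; transpose it to get a 1-by-6 row.
temp_number_array = sscanf(modified_string, '%f').';

disp(size(temp_number_array));
```

Whether this is fast enough for millions of lines would need to be measured on the real data.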
Thank you very much for advice,
Dan
#### Update ####
– I uploaded a sample log file
– I previously also tried the textscan command, but the time conversion did not work:
formatSpec = '%{yyyy-MM-dd HH:mm:ss.S}D\t%f\t%f\t%f\t%f\t%f\t%f';
result_array = textscan(fileID, formatSpec)
– Therefore, I analyzed the time string separately from the rest of the string.
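Once the time stamp has been split off, the numeric remainder might be parsed in a single textscan call rather than line by line. The following is a sketch under assumptions: six tab-separated numeric fields per line, empty fields marked by two consecutive tabs, and a made-up two-line sample standing in for log_data_text_end_delimiter. Fields that hold only a space are not covered by this sample and would need to be checked against the real file.

```matlab
% Hypothetical stand-in for log_data_text_end_delimiter (two lines, six fields each);
% the second line has an empty second field (two consecutive tabs).
log_lines = {sprintf('1\t2\t3\t4\t5\t6'), sprintf('7\t\t9\t10\t11\t12')};

% Join all lines into one char vector, one source line per text line.
joined = sprintf('%s\n', log_lines{:});

% Parse everything in one call; empty fields are filled with NaN.
parsed = textscan(joined, '%f%f%f%f%f%f', ...
    'Delimiter', '\t', 'EmptyValue', NaN, 'CollectOutput', true);
number_array_b = parsed{1};  % numeric matrix with one row per line

disp(size(number_array_b));
```

The result could then be assigned in one step, for example number_array(:, 2:7) = number_array_b, instead of filling the rows inside a loop.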
Best Answer