MATLAB: Error with new version of readtable (R2020a)

Tags: detectimportoptions, failed to convert character code, MATLAB, r2020a, readtable

I am trying to import a .csv file into MATLAB R2020a using the function readtable. The input file is available here:
https://depmap.org/portal/download/ > ALL DOWNLOADS > DepMap Public 20Q2 > CCLE_expression_v2.csv (06/20, 368.66 MB)
  • When I use readtable on this file, I get an error message in the console that is not very explicit. I would love your insights about what could be wrong:
T = readtable('CCLE_expression.csv');
Failed to convert character code.
Note that the following call:
T = readtable(fn,'FileType','text','Delimiter',',','TextType','string','ReadVariableNames',1);
returns the same error.
  • It is very difficult for me to track down the error because:
1) the error does not specify any function name, line of code, or error code I can refer to
2) although an error is returned in the console, MATLAB does not pause anywhere when tracking errors (Run > Pause on Errors)
3) if the readtable call is inside another function F, the error is returned when I run F, but not when I am in debug mode within F: I put a breakpoint before the readtable line and then manually run the readtable line while in debug mode (this is not at all the behavior I am used to seeing). In that case I do not get any error, but it is as if readtable did not run: the variable is not present in my workspace.
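One way to get more detail than the bare console message is to wrap the call in try/catch and print the fields of the caught MException. This is only a minimal sketch: the path in fn is a hypothetical placeholder, and it is an assumption that this particular error fills in an identifier or a call stack at all.

```matlab
% Hypothetical path; replace with the location of the downloaded file.
fn = 'CCLE_expression.csv';

try
    T = readtable(fn);
catch ME
    % An MException carries an identifier, a message, and the call stack.
    fprintf('identifier: %s\n', ME.identifier);
    fprintf('message:    %s\n', ME.message);
    for k = 1:numel(ME.stack)
        fprintf('  at %s (line %d)\n', ME.stack(k).name, ME.stack(k).line);
    end
end
```

If the stack turns out to be empty, that alone suggests the error is raised inside built-in code rather than in a MATLAB file that can be stepped through.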
  • I have tested readtable on previous MATLAB versions, as well as on R2020a with the 'auto' flag that restores the old behavior of readtable, and have had absolutely no problems running it:
T = readtable(fn,'FileType','text','Delimiter',',','TextType','string','ReadVariableNames',1);
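For reference, the 'auto' flag mentioned above is the 'Format','auto' name-value pair. The sketch below shows one way such a call could look. It writes a tiny stand-in CSV so that it runs on its own; the stand-in file and its column names are made up for illustration, and the real input would be the DepMap file.

```matlab
% Small stand-in file so the sketch is self-contained (hypothetical contents).
fn = [tempname '.csv'];
fid = fopen(fn, 'w');
fprintf(fid, 'id,geneA,geneB\n');
fprintf(fid, 'cell1,1.5,2.5\n');
fprintf(fid, 'cell2,3.5,4.5\n');
fclose(fid);

% 'Format','auto' asks readtable for the pre-R2020a import behavior.
T = readtable(fn, 'Format', 'auto', 'Delimiter', ',', ...
    'TextType', 'string', 'ReadVariableNames', true);
disp(size(T));

delete(fn);
```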
  • I suspect a character encoding problem that causes the automatic import in R2020a to fail to recognize the variable types. From the MATLAB R2020a documentation: 'Starting in R2020a, the readtable function read an input file as though it automatically called the detectImportOptions function on the file. It can detect data types, discard extra header lines, and fill in missing values.' However, I do not know how to test this, because I cannot see which part of readtable is failing (it calls an internally coded function). Line 195 in readtable is the one leading to the error:
t = func.validateAndExecute(filename,varargin{:});
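One way to test the detection step separately is to call detectImportOptions directly, pass an explicit encoding, and look at what it detected before handing the options to readtable. The sketch below is only an illustration: it uses a small made-up stand-in file, and 'UTF-8' is an assumed encoding, not a confirmed property of the DepMap file.

```matlab
% Small stand-in file so the sketch is self-contained (hypothetical contents).
fn = [tempname '.csv'];
fid = fopen(fn, 'w');
fprintf(fid, 'id,geneA,geneB\n');
fprintf(fid, 'cell1,1.5,2.5\n');
fprintf(fid, 'cell2,3.5,4.5\n');
fclose(fid);

% Run detection on its own, with an explicit (assumed) encoding.
opts = detectImportOptions(fn, 'FileType', 'text', 'Encoding', 'UTF-8');

% Inspect what was detected before importing anything.
disp(opts.Delimiter);
disp(opts.VariableNames);
disp(opts.VariableTypes);

% Import using exactly the options that were inspected above.
T = readtable(fn, opts);

delete(fn);
```

If detectImportOptions itself raises the error on the real file, the problem lies in the detection stage rather than in the import proper.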
Have any of you encountered this issue before? What was the cause? I am happy to change the input parameters of readtable for R2020a and use the 'auto' flag, but I would love to understand what the problem is here and whether I should be worried about the new behavior of readtable in R2020a. Also, out of curiosity, why is 'Format','auto' the argument pair needed to restore the previous behavior, rather than something like 'legacy'?
Thank you so much for your help!
Best
Sandrine

Best Answer

This is an internal size limit that applies even when you explicitly specify the encoding.
The size limit is exactly 12976128 bytes, which is 0xC60000. If you have even 1 byte more, you will get the "Failed to convert character code" message.
It is not at all obvious to me why that particular limit would be true.
But!! The limit also depends heavily on the number of input columns!
31 rows of 58029 columns -> ok, 32 rows fail
63 rows of 58028 columns -> ok, 64 rows fail
64 rows of 58026 columns -> ok, 64 rows of 58027 columns fail
95 rows of 58026 columns -> ok (99223917 bytes), 96 rows fail (100268245 bytes)
The input file in question has 58677 columns, and by that time the limit is down to about 10 lines.
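
Row/column experiments like the ones above can be reproduced with a small probe script along the following lines. This is only a sketch, not the script used for the numbers in this answer: the synthetic data, the variable names, and the small default sizes are assumptions. Set nCols to around 58000 and vary nRows to look for the failure point.

```matlab
% Sanity check of the hex value quoted above.
assert(hex2dec('C60000') == 12976128);

% Small defaults so the sketch runs quickly; increase to probe the limit.
nRows = 4;
nCols = 50;

% Write a synthetic CSV: one header line plus nRows identical data lines.
fn = [tempname '.csv'];
fid = fopen(fn, 'w');
header = strjoin("v" + (1:nCols), ',');
rowText = strjoin(string(rand(1, nCols)), ',');
fprintf(fid, '%s\n', header);
for r = 1:nRows
    fprintf(fid, '%s\n', rowText);
end
fclose(fid);

info = dir(fn);

% Try the default import and record whether it succeeded.
status = 'ok';
try
    T = readtable(fn);
catch ME
    status = ['failed: ' ME.message];
end
fprintf('%d rows x %d columns, %d bytes -> %s\n', ...
    nRows, nCols, info.bytes, status);

delete(fn);
```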