MATLAB: Why do I receive errors when reading data from a text file using the GETGEODATA and GEOSERIESREAD functions in Bioinformatics Toolbox 4.0 (R2011b)?


I'm trying to read a 2.6 GB text file downloaded from the NCBI GEO website into MATLAB.
First, I receive the following error when using the GETGEODATA function:
geoData = getgeodata( 'GSE22851');
Error using getgeodata (line 187)
The GSE record contains multiple files. These were copied to
C:\Users\mranade\AppData\Local\Temp\tp9c182834_f711_4f1e_9c34_c4a50d27c680. Please use GEOSERIESREAD to read the data from this directory.
After this, I use GEOSERIESREAD as suggested in the error message above, and either receive the following error or my system stops responding.
geoData = geoseriesread('GSE22851-GPL10680_series_matrix.txt');
Error using textscan
Maximum variable size allowed by the function is exceeded.
Error in geoseriesread (line 180)
allData = textscan(fullData,formatString,'delimiter','\t');

Best Answer

There are two important limitations to be aware of when using the GETGEODATA and GEOSERIESREAD functions, and the GEO series GSE22851 runs into both. First, GETGEODATA is expected to throw an error if it finds more than one GSE file associated with a single accession number, as is the case for GSE22851. The downloaded files are available in a temp directory as indicated in the error message. Second, GEOSERIESREAD attempts to parse the full contents of the file in one TEXTSCAN call (line 180 in the error message above), so a 2.6 GB series matrix file can exceed the maximum variable size or exhaust system memory.
As a workaround, we propose using an approach that does not require loading all of the data simultaneously. One way to do this is to use a BioIndexedFile object (the BioIndexedFile class from the Bioinformatics Toolbox). This lets you work with an object that is mapped to data stored on disk, so that only the data you query is loaded into memory.
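Before choosing which file to read, it can help to see what GETGEODATA copied to the temporary directory and how large each file is. The following is a minimal sketch using only DIR and FPRINTF; the path is the one from the error message above (yours will differ), and the '*series_matrix*' pattern is an assumption based on the file name used in the question.

```matlab
% Path copied from the error message in the question; replace with your own.
tmpDir = 'C:\Users\mranade\AppData\Local\Temp\tp9c182834_f711_4f1e_9c34_c4a50d27c680';

% Assumption: the copied files follow the '*series_matrix*' naming pattern.
files = dir(fullfile(tmpDir, '*series_matrix*'));

% Print each file name with its size in GB.
for k = 1:numel(files)
    fprintf('%-50s %8.2f GB\n', files(k).name, files(k).bytes / 2^30);
end
```

Files of a few hundred MB may still load with GEOSERIESREAD; the multi-GB ones are the candidates for the indexed approach.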
To create a BioIndexedFile object:
b = BioIndexedFile('table','GSE22851-GPL10680_series_matrix.txt')
Note that this creates a new file (ending in .idx) that contains indexing information for the data file but does not create a copy of that data.
To display data for one or more rows, we can use the getEntryByIndex method:
b.getEntryByIndex(1:10)
This will display the raw text from each row. Note that in this file, the data begins below the header, on row 113.
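Rather than reading the row number off the screen, the first data row can be located programmatically. This is only a sketch: it assumes the series matrix file marks the start of the table with a line beginning with '!series_matrix_table_begin', followed by one column-header row, and that BioIndexedFile indexes those lines as entries. The variable names are illustrative.

```matlab
b = BioIndexedFile('table', 'GSE22851-GPL10680_series_matrix.txt');

% Assumption: GEO series matrix files mark the table start with this line.
marker = '!series_matrix_table_begin';

markerRow = 0;
for k = 1:b.NumEntries
    entry = b.getEntryByIndex(k);              % raw text of entry k
    if strncmp(entry, marker, numel(marker))   % compare only the leading characters
        markerRow = k;
        break
    end
end

if markerRow == 0
    error('Table start marker not found; inspect the header manually.');
end

% Assumption: marker row, then one column-header row, then the data.
firstDataRow = markerRow + 2;
```

If the result disagrees with what getEntryByIndex shows on screen, trust the displayed rows and adjust the offset.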
In this data file, each row of data contains a string ID followed by 68 floating point numbers. To exclude the ID and read the numeric data, the Interpreter on the BioIndexedFile object can be set to parse the string appropriately:
b.Interpreter = @(x)cell2mat(textscan(x,['%*s',repmat('%f',1,68)]))
For more information on parsing strings with TEXTSCAN, please see the documentation page here: http://www.mathworks.com/help/techdoc/ref/textscan.html
Look for these lines on the documentation page:
C = textscan(str, ...) reads data from string str.
%*...Skip the field. textscan does not create an output cell for any field that it skips.
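To see what this format string does without touching the large file, the same pattern can be tried on a short made-up row. The sample row below is invented for illustration (one ID followed by three numbers instead of 68), and the tab delimiter is stated explicitly, which is an addition to the Interpreter shown above.

```matlab
% Hypothetical sample row: an ID string followed by three tab-separated numbers.
sampleRow = sprintf('"ILMN_0001"\t1.5\t2.25\t-0.75');

nCols = 3;                                  % 68 in the real file
fmt = ['%*s', repmat('%f', 1, nCols)];      % skip the ID, read nCols doubles

% textscan returns a 1-by-nCols cell array; cell2mat flattens it to a numeric row.
values = cell2mat(textscan(sampleRow, fmt, 'Delimiter', '\t'));
```

Here values is a 1-by-3 double row vector containing 1.5, 2.25 and -0.75; the ID field produces no output because of the %*s specifier.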
Once the interpreter is set, the read method will return the Interpreter's output for the row or rows specified. In this case that is just the numeric data:
data = b.read(113:200);
This provides a way to read a subset of the data without loading the full data file into memory. For more information on the methods and properties of the BioIndexedFile class, see the BioIndexedFile page in the Bioinformatics Toolbox documentation.
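The same idea extends to processing the file block by block, so that only one block is held in memory at a time. The sketch below computes the mean of each row over a range of rows; the end row and block size are arbitrary values chosen for illustration, and the first data row is the one noted above. It assumes every row in the range is a data row with 68 numeric fields.

```matlab
b = BioIndexedFile('table', 'GSE22851-GPL10680_series_matrix.txt');
b.Interpreter = @(x) cell2mat(textscan(x, ['%*s', repmat('%f', 1, 68)]));

firstRow  = 113;    % first data row in this file
lastRow   = 1112;   % hypothetical: arbitrary end row for illustration
blockSize = 100;    % hypothetical: number of rows held in memory at once

rowMeans = zeros(lastRow - firstRow + 1, 1);
for startRow = firstRow:blockSize:lastRow
    stopRow = min(startRow + blockSize - 1, lastRow);
    block = b.read(startRow:stopRow);                  % rows-by-68 numeric matrix
    rowMeans((startRow:stopRow) - firstRow + 1) = mean(block, 2);
end
```

Keep the range clear of the header rows and of any trailing non-data line, since the Interpreter above expects numeric fields after the ID.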
To summarize, here are the three lines of MATLAB code which can be used to read data from the text files.
b = BioIndexedFile('table','GSE22851-GPL10680_series_matrix.txt');
b.Interpreter = @(x)cell2mat(textscan(x,['%*s',repmat('%f',1,68)]));
data = b.read(113:200);