MATLAB: Is BioMap slow for indexing BAM files in R2016b?

I am using "BioMap" to index very large "BAM" files. However, for files with several thousand references, it is taking longer than a day to finish. Why is this happening?

Best Answer

"BAM" files (and "SAM" files, their ASCII-encoded cousins) can be very large, so they must be indexed and analyzed on disk rather than loaded into memory. The "BioMap" class relies on two additional files, with the extensions "BAI" and "LINEARINDEX", to access the data efficiently from disk. If these files are not both present, "BioMap" must construct them. This is very computationally intensive for files containing many reference sequences, as there is a large amount of I/O that must occur in addition to the indexing analysis itself.
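Before constructing a "BioMap" object, it can help to check whether the index files already exist, since that determines whether the expensive indexing step will run. The following is a minimal sketch, not a definitive check. It assumes that the index files sit in the same folder as the "BAM" file and are named by appending ".bai" and ".linearindex" to the full file name; that naming is an assumption, so verify it against your own files. The file name "reads.bam" is hypothetical.

```matlab
% Hypothetical path; substitute your own BAM file.
bamFile = 'reads.bam';

% Assumed naming convention: each index file appends its extension
% to the full BAM file name and is stored alongside it.
baiFile = [bamFile '.bai'];
linFile = [bamFile '.linearindex'];

% exist(name, 'file') returns 2 when name refers to an existing file.
hasBai = exist(baiFile, 'file') == 2;
hasLin = exist(linFile, 'file') == 2;

if hasBai && hasLin
    disp('Both index files found; BioMap should not need to rebuild them.');
else
    fprintf('BAI present = %d, LINEARINDEX present = %d\n', hasBai, hasLin);
    disp('BioMap will likely need to build the missing index file(s).');
end
```

If either file is missing, expect the first "BioMap" call on that file to take the full indexing time.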
If the references in the "BAM" file are ordered, and the "BAI" file is present, it may be possible to achieve faster results using the "bamread" function.
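As a rough illustration of the "bamread" approach, the sketch below reads only the alignments that overlap one region of one reference instead of indexing the whole file. The file name, reference name, and range are hypothetical placeholders, and the example assumes the "BAM" file is ordered and its "BAI" file is available, as described above.

```matlab
% Hypothetical inputs; replace with values that match your data.
bamFile = 'reads.bam';     % ordered BAM file with a BAI index
refName = 'chr1';          % a reference name listed in the BAM header
range   = [1 100000];      % [start end] positions on that reference

% Read only the alignment records overlapping the requested range.
reads = bamread(bamFile, refName, range);

% bamread returns a structure array with one element per alignment record.
fprintf('%d reads overlap %s:%d-%d\n', ...
    numel(reads), refName, range(1), range(2));
```

Looping over references or ranges this way keeps each call small, which may be preferable when only part of a very large file is of interest.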