MATLAB: Nucleotides import- not from database

bioinformatics, Bioinformatics Toolbox, nucleotides

I am trying to input a nucleotide sequence that is not from the NCBI database. What format does it need to be in, and what code do I need? I currently have it in an xlsx file where each character is in its own row of the same column, or in a Word doc where the characters are viewed as a single "word".

Best Answer

Based on the information you've provided, I would import the data directly from Excel as you currently have it formatted. The MATLAB documentation on importing spreadsheet data covers the available options. You can then adjust the format of the data in MATLAB if your subsequent steps need it.
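As a minimal sketch of that route, assuming a release that has readcell and writecell (R2019a or later) and one character per row in the first column. The file name is hypothetical, and the demo file is written first only so the example runs on its own.

```matlab
% Hypothetical demo file: one nucleotide per row in the first column.
xlsFile = fullfile(tempdir, 'sequence_demo.xlsx');
writecell({'a'; 'C'; 'g'; 'T'; 'A'}, xlsFile);   % stand-in for your own spreadsheet

cells = readcell(xlsFile);        % N-by-1 cell array, one character per cell
seq   = upper([cells{:, 1}]);     % concatenate into a single row char vector

disp(seq)
```

For the Word version, where the sequence is already a single "word", pasting it between single quotes in an assignment should give the same kind of char vector.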

Related Solutions

MATLAB: Does Matlab support bed, wig, and other usual genomics file formats

I took a quick look through the documentation and was not able to find anything about those file types. Going by the pages that describe the import/export capabilities, the supported file formats appear to be:

FASTA
GenBank
GenPept
EMBL
BLAST
PDB
PFAM
ClustalW
GCG
PHYLIP
Newick
FASTQ
MZCDF
MZXML
JCAMP
TGSPC
SFF
SCF
SAM
BAM
SOAP
Bowtie
Affymetrix® GeneChip®
Illumina®
Agilent®
Gene Expression Omnibus (GEO)
ImaGene®
SPOT
GenePix® GPR
GAL

Notice the section with SAM, BAM, FASTA, FASTQ, SOAP, and Bowtie files (the most relevant for genomic work).

I do not see the formats you asked about anywhere in that list.

One good place to look for new features in any given release is the release notes (if you were not already aware). The Bioinformatics Toolbox has its own release notes.

Again, there is not any mention there of the file formats you specified.
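If a dedicated reader is not available, a plain BED file is tab-delimited text, so a generic import may be enough for simple cases. The sketch below is not a Bioinformatics Toolbox feature. It assumes a hypothetical three-column file (chrom, chromStart, chromEnd) with no track or browser header lines, and it writes that file first only so the example is self-contained.

```matlab
% Hypothetical three-column BED file, written here only so the sketch runs.
bedFile = fullfile(tempdir, 'demo.bed');
fid = fopen(bedFile, 'w');
fprintf(fid, 'chr1\t100\t200\nchr2\t150\t300\n');
fclose(fid);

% Read as tab-delimited text with no header row (an assumption about the file).
T = readtable(bedFile, 'FileType', 'text', 'Delimiter', '\t', ...
              'ReadVariableNames', false);
T.Properties.VariableNames = {'chrom', 'chromStart', 'chromEnd'};
disp(T)
```

Files with header lines or the optional BED columns would need extra handling that this sketch does not attempt.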

MATLAB: Problem calculating nucleotide percentages

[rnaFile, message] = fopen(outputFile, 'w');

While not your problem, you should check that the output file opened successfully, as well as the input...
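A minimal sketch of such a check, using the second output of fopen. The path here is hypothetical; fopen returns -1 when the file cannot be opened.

```matlab
outputFile = fullfile(tempdir, 'rna_demo.txt');   % hypothetical output path
[rnaFile, message] = fopen(outputFile, 'w');
if rnaFile < 0
    % message holds the system's explanation of the failure
    error('Could not open %s for writing: %s', outputFile, message);
end
fclose(rnaFile);
```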

[base, num] = fread(dnaFile, 1, 'char');

The above reads one character, but returns it as a double, not a character.

Use

[base, num]=fread(dnaFile,'*char');

to read the whole file into a character array (a column vector; transpose with .' if you want a row).
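Put together, reading a whole file this way could look like the sketch below. The input file is hypothetical and is created first only so the example runs on its own.

```matlab
% Hypothetical input file, created here so the sketch is self-contained.
inputFile = fullfile(tempdir, 'dna_demo.txt');
fid = fopen(inputFile, 'w');
fprintf(fid, 'ACGTTGCA');
fclose(fid);

[dnaFile, message] = fopen(inputFile, 'r');
if dnaFile < 0
    error('Could not open %s: %s', inputFile, message);
end
[base, num] = fread(dnaFile, '*char');   % whole file, returned as a char column vector
fclose(dnaFile);

base = base.';                           % transpose to a row for easier display
fprintf('%d characters read: %s\n', num, base);
```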

while num > 0 
  if base == 'C'
    ...

The above starts an infinite loop, as num = 1 and there's nothing inside the loop that ever changes num, so it stays there forever...oh, although it will eventually error out on EOF on the read; I scanned too quickly at first.

You can loop, but Matlab is vectorized; may as well make use of it.

% rewrite the rules for convenience: C -> G; G -> C; T -> A; A -> U
out=repmat('_',size(base));   % this is the else case...matched letters get overwritten below
out(base=='C')='G';           % use logical addressing to locate each and write output
out(base=='G')='C';
out(base=='T')='A';
out(base=='A')='U';
nG=sum(out=='G');             % raw count of each output letter
nC=sum(out=='C');
nA=sum(out=='A');
nU=sum(out=='U');
totalNum=sum([nG nC nA nU]);  % recognized bases only; '_' is excluded
freqG=nG/totalNum;            % fraction of recognized bases
freqC=nC/totalNum;
freqA=nA/totalNum;
freqU=nU/totalNum;

ADDENDUM:

Couldn't see an issue off the top of my head, so I did a trial with a made-up sequence...

>> dna=['A', 'C', 'G', 'T'];           % the four letters start with
>> dna=repmat(dna,1,10);               % make longer sequence from them
>> dna=dna(randperm(length(dna)));     % and then scramble 'em up
>> dna(randperm(length(dna),3))='_';   % a few other characters for spice
>> dna                                 % what we start with is then ...
dna =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC

The algorithm above starts here:

>> rna=repmat('_',size(dna));         % replace the non-wanted letters
>> rna(dna=='C')='G';
>> rna(dna=='G')='C';
>> rna(dna=='T')='A';
>> rna(dna=='A')='U';
>> [dna;rna]            % see what we got...
ans =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG

Looks like what the problem statement asked for...now compute the frequency.

Did this in a "more Matlab-y" way; the order is alphabetic to satisfy histc.

>> RNA='ACGU';   % the letters used as bin edges must be in increasing order
>> freq=histc(rna,RNA)
freq =
   9     9    10     9
>> freq=freq/sum(freq)
freq =
  0.2432    0.2432    0.2703    0.2432
>> 
>> [RNA.' num2str(freq.','%.4f')]  % display results tabulated 
ans =
A  0.2432
C  0.2432
G  0.2703
U  0.2432
>>
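Recent MATLAB documentation lists histc as not recommended, so here is a hedged alternative that counts each letter directly and does not depend on the letter order. It reuses the rna string from the trial above; the variable name counts is only for illustration.

```matlab
rna = 'GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG';   % sequence from the trial above
RNA = 'ACGU';                                       % any order works here

counts = arrayfun(@(c) sum(rna == c), RNA);   % occurrences of each letter
freq   = counts / sum(counts);                % fractions of recognized bases

for k = 1:numel(RNA)
    fprintf('%c  %.4f\n', RNA(k), freq(k));
end
```

Because '_' is not in RNA, it is left out of both the counts and the total, which matches the histc result above.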

If order is important, revert to the previous approach; a "cute" alternative would be to use the categorical datatype:

>> rnac=categorical(cellstr(rna.'),{'G';'C';'A';'U';'_'});  % rna.' gives one cell per character
>> summary(rnac)
   G      10 
   C       9 
   A       9 
   U       9 
   _       3 
>>

ADDENDUM 2:

And the yet more Matlab-y way does the translation via a lookup table...

>> DNA=['C','G','T','A','_'];  % base characters in sequence 1
>> RNA=['G','C','A','U','_'];  % corresponding characters in sequence 2
>> RNA(arrayfun(@(c) find(c==DNA),dna))  % translate one to other...
ans =
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG
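A variation on the same lookup idea, sketched with ismember instead of arrayfun. This is a swapped-in technique, not the original poster's code: it translates every recognized letter in one indexed assignment and leaves anything not listed in DNA as '_' rather than raising an error.

```matlab
dna = 'CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC';   % sequence from the trial above
DNA = 'CGTA';                      % input letters
RNA = 'GCAU';                      % corresponding output letters

[known, idx] = ismember(dna, DNA); % idx(k) is the position of dna(k) in DNA, 0 if absent
rna = repmat('_', size(dna));      % anything unrecognized stays '_'
rna(known) = RNA(idx(known));      % one indexed lookup translates all known letters

disp(rna)
```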
