I am trying to input a nucleotide sequence that is not from the NCBI database. What format does it need to be in, and what code do I need? I currently have it in an xlsx file, where each character is in its own row of a single column, and in a Word document, where the characters are treated as a single "word".
MATLAB: Nucleotides import- not from database
bioinformatics, Bioinformatics Toolbox, nucleotides
Related Solutions
I took a quick look through the documentation, and was not able to find anything about those file types. The pages that describe the import/export capabilities are:
- Sequence Analysis Data Import and Export
- Mass Spectrometry and Bioanalytics Data Import and Export
- High Throughput Sequencing Data Import and Management
- Microarray Analysis Data Import and Management
Looking at those pages, the file formats that are supported appear to be:
- FASTA
- GenBank
- GenPept
- EMBL
- BLAST
- PDB
- PFAM
- ClustalW
- GCG
- PHYLIP
- Newick
- FASTQ
- MZCDF
- MZXML
- JCAMP
- TGSPC
- SFF
- SCF
- SAM
- BAM
- SOAP
- Bowtie
- Affymetrix® GeneChip®
- Illumina®
- Agilent®
- Gene Expression Omnibus (GEO)
- ImaGene®
- SPOT
- GenePix® GPR
- GAL
Notice the section with SAM, BAM, FASTA, FASTQ, SOAP, and Bowtie files (the most relevant for genomic work).
I do not see any mention of the formats you mentioned.
One good place to look for new features in any given release is the release notes (if you were not already aware). For the Bioinformatics toolbox, the release notes can be found here.
Again, there is not any mention there of the file formats you specified.
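Since neither layout is one of the listed formats, one workaround is to join the characters into a single character row vector and write a plain-text FASTA file, which is just a header line starting with > followed by the sequence. A minimal sketch, assuming the spreadsheet column has already been loaded into a cell array (for example with readcell in a recent release); the cell contents, header text and file names below are made up for illustration:

```matlab
% Hypothetical stand-in for one spreadsheet column, one character per row.
% In practice this might come from readcell('mySequence.xlsx') (assumed file name).
col = {'A'; 'T'; 'G'; 'C'; 'C'; 'G'; 'T'; 'A'};

% Join the cells into one row of characters and normalise to upper case.
seq = upper([col{:}]);

% Write a minimal FASTA file: a '>' header line, then the sequence.
fastaFile = [tempname '.fasta'];          % temporary file, illustration only
[fid, msg] = fopen(fastaFile, 'w');
if fid < 0
    error('Could not open %s: %s', fastaFile, msg);
end
fprintf(fid, '>my_sequence (hypothetical header)\n%s\n', seq);
fclose(fid);

% Read the file back as text to confirm what was written.
written = fileread(fastaFile);
```

With the Bioinformatics Toolbox installed, a file like this should be readable with fastaread. For the Word document, pasting the sequence into a plain text file first gets you to the same starting point.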
[rnaFile, message] = fopen(outputFile, 'w');
While it is not your problem here, you should check that the output file opened successfully, as well as the input file...
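A minimal sketch of such a check, assuming a temporary output file name chosen only for this illustration; fopen returns -1 on failure and reports the reason in its second output:

```matlab
% Hypothetical output file name, used only for this illustration.
outputFile = [tempname '.txt'];

[rnaFile, message] = fopen(outputFile, 'w');
if rnaFile < 0
    % fopen returns -1 on failure; message holds the system's explanation.
    error('Could not open %s for writing: %s', outputFile, message);
end

fprintf(rnaFile, '%s\n', 'GAUC');   % placeholder content
status = fclose(rnaFile);           % 0 when the file closed successfully
```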
[base, num] = fread(dnaFile, 1, 'char');
The above reads one character, but returns it as a double, not a character.
Use
[base, num]=fread(dnaFile,'*char');
to read whole file into a character array.
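The difference between the two precisions can be seen on a small file. A minimal sketch, assuming a made-up four-letter sequence written to a temporary file so that the example is self-contained:

```matlab
% Write a tiny made-up sequence to a temporary file.
dnaPath = [tempname '.txt'];
fid = fopen(dnaPath, 'w');
fwrite(fid, 'ACGT', 'char');
fclose(fid);

fid = fopen(dnaPath, 'r');
[oneAsDouble, n1] = fread(fid, 1, 'char');   % one value, returned as a double (65 for 'A')
frewind(fid);                                 % go back to the start of the file
[base, num] = fread(fid, '*char');            % whole file, returned as char
fclose(fid);

base = base.';   % fread returns a column; transpose to get a row of characters
```

The transpose at the end is optional; the logical-indexing approach works on either orientation.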
while num > 0 if base == 'C' ...
The above starts an infinite loop: num is 1 and nothing inside the loop ever changes it, so it stays there forever... Oh, although it will eventually error out at EOF on the read; I scanned too quickly at first.
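For reference, a character-by-character loop ends cleanly when the read is repeated at the bottom of the loop body, because fread reports a count of 0 at end of file. A minimal sketch, assuming a made-up four-letter input file; the switch branches follow the substitution rules used in this answer, and this is not a reconstruction of the elided original loop:

```matlab
% Self-contained setup: a temporary file holding a made-up sequence.
dnaPath = [tempname '.txt'];
fid = fopen(dnaPath, 'w');
fwrite(fid, 'CGTA', 'char');
fclose(fid);

dnaFile = fopen(dnaPath, 'r');
rna = '';
[base, num] = fread(dnaFile, 1, '*char');       % first read, returned as char
while num > 0
    switch base
        case 'C', rna(end+1) = 'G';
        case 'G', rna(end+1) = 'C';
        case 'T', rna(end+1) = 'A';
        case 'A', rna(end+1) = 'U';
        otherwise, rna(end+1) = '_';            % anything else is marked
    end
    [base, num] = fread(dnaFile, 1, '*char');   % read again so num drops to 0 at end of file
end
fclose(dnaFile);
```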
You can loop, but Matlab is vectorized; may as well make use of it.
% rewrite the rules for convenience C -> G; G -> C; T -> A; A -> U
out=repmat('_',size(base)); % this is else case...we'll overwrite everything besides
out(base=='C')='G'; % use logical addressing to locate each and write output
out(base=='G')='C';
out(base=='T')='A';
out(base=='A')='U';
freqG=sum(out=='G');
freqC=sum(out=='C');
freqA=sum(out=='A');
freqU=sum(out=='U');
totalNum=sum([freqG freqC freqA freqU]);
freqG=sum(out=='G')/totalNum;
freqC=sum(out=='C')/totalNum;
freqA=sum(out=='A')/totalNum;
freqU=sum(out=='U')/totalNum;
ADDENDUM:
Couldn't see an issue off the top of my head, so I did a trial with a made-up sequence...
>> dna=['A', 'C', 'G', 'T']; % the four letters start with
>> dna=repmat(dna,1,10); % make longer sequence from them
>> dna=dna(randperm(length(dna))); % and then scramble 'em up
>> dna(randperm(length(dna),3))='_'; % a few other characters for spice
>> dna % what we start with is then ...
dna =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC
The algorithm above starts here:
>> rna=repmat('_',size(dna)); % replace the non-wanted letters
>> rna(dna=='C')='G';
>> rna(dna=='G')='C';
>> rna(dna=='T')='A';
>> rna(dna=='A')='U';
>> [dna;rna] % see what we got...
ans =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG
Looks like what problem statement asked for...compute frequency
Did this in "more Matlab-y" way; order is alphabetic to satisfy histc
>> RNA='ACGU'; % the letters to use as bin edges must be in increasing order
>> freq=histc(rna,RNA)
freq =
     9     9    10     9
>> freq=freq/sum(freq)
freq =
    0.2432    0.2432    0.2703    0.2432
>> [RNA.' num2str(freq.','%.4f')] % display results tabulated
ans =
A 0.2432
C 0.2432
G 0.2703
U 0.2432
If order is important, revert to the previous approach; a "cute" alternative is to use the categorical datatype:
>> rnac=categorical(cellstr(rna.'),{'G';'C';'A';'U';'_'}); % transpose so each character becomes its own element
>> summary(rnac)
     G      10
     C       9
     A       9
     U       9
     _       3
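Another way to get counts in an arbitrary letter order, without histc or categorical, is to compare every position against every letter and sum down the columns. A minimal sketch, assuming the RNA string from the trial above and a reporting order chosen for illustration:

```matlab
rna   = 'GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG';  % sequence from the trial above
order = 'GCAU';                                       % report in this order

% Compare every character against every letter: rows = positions, columns = letters.
hits   = bsxfun(@eq, rna(:), order);   % equivalent to rna(:) == order in R2016b or later
counts = sum(hits, 1);                 % one count per letter, in the order given
freq   = counts / sum(counts);         % fractions over the counted letters only
```

Characters that are not in order, such as the '_' placeholders, simply match no column and are left out of the totals.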
ADDENDUM 2:
And, the yet more Matlab-y way vectorizes the translation via lookup...
>> DNA=['C','G','T','A','_']; % base characters in sequence 1
>> RNA=['G','C','A','U','_']; % corresponding characters in sequence 2
>> RNA(arrayfun(@(c) find(c==DNA),dna)) % translate one to other...
ans =
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG
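The same lookup idea can be written without arrayfun by letting ismember return the index of each character. A minimal sketch, assuming the trial sequence above; unlike the arrayfun version, which would be expected to fail on a character missing from DNA because find returns empty, this one maps anything unrecognised to '_':

```matlab
dna = 'CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC';  % trial sequence from above
DNA = 'CGTA';    % characters to translate from
RNA = 'GCAU';    % corresponding characters to translate to

[known, idx] = ismember(dna, DNA);   % idx is 0 wherever the character is not in DNA
rna = repmat('_', size(dna));        % anything unrecognised becomes '_'
rna(known) = RNA(idx(known));        % one indexing operation translates the rest
```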
Best Answer