Parsing out ORFs?

Don Gilbert gilbertd at bio.indiana.edu
Thu Mar 30 16:30:12 EST 2000


Readseq version 2 may or may not help:
  http://iubio.bio.indiana.edu/soft/molbio/readseq/java/
It will pull the sequence for any feature annotation from GenBank or EMBL
sequence files.  If you feed readseq a file of many GenBank/EMBL records, it
will produce output with the feature sequence separated by record.  If you
feed it one large sequence record (e.g., a chromosome) with many feature
annotations, it will join all the feature sequences into one output record
(for that chromosome).

If you convert to EMBL/GenBank format while selecting a feature, the feature
table in the output will list the join() location it used to build the output
sequence from the extracted features.

E.g.
fetch ftp://ncbi.nlm.nih.gov/genbank/genomes/S_cerevisiae/Chr01/yst_1.gbk.Z
uncompress yst_1.gbk.Z
jre -cp readseq.jar run format=fasta features=gene yst_1.gbk

kalo% jre -cp readseq.jar run format=fasta features=gene -pipe yst_1.gbk 
Readseq version 2.0.8 (18 Jan 2000)
>NC_001133 Saccharomyces cerevisiae chromosome I, complete chromosome sequence. 145659 bp 
atgatcgtaaataacacacacgtgcttaccctaccactttataccaccaccacatgccat
actcaccctcacttgtatactgattttacgtacgcacacggatgctacagtatataccat
...

kalo% jre -cp readseq.jar run format=embl features=gene -pipe yst_1.gbk 
Readseq version 2.0.8 (18 Jan 2000)
ID   NC_001133  standard; DNA; PLN; 145659 BP.
XX
..
FH   Key             Location/Qualifiers
FT   gene            335..649
FT                   /gene="YAL069W"
FT   gene            1807..2169
FT                   /gene="YAL068C"
..
FT   extracted_range join(335..649,1807..2169,7236..9017,10092..10400,
FT                   11566..11952,12047..12427,21526..21852,24001..27969,
FT                   31568..32941,33449..34702,35156..36304,36510..37148,
..
SQ   Sequence 145659 BP; 69830 A; 44641 C; 45763 G; 69969 T; 0 other;
     atgatcgtaa ataacacaca cgtgcttacc ctaccacttt ataccaccac cacatgccat        60
     actcaccctc acttgtatac tgattttacg tacgcacacg gatgctacag tatataccat       120
..
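
Since the original question asked for one sequence per ORF, the joined
output could in principle be cut back apart using the extracted_range
location. What follows is only a rough Python sketch, not part of readseq.
It assumes the output sequence is exactly the listed ranges concatenated in
order, that the sequence has already been reduced to a plain string of bases
(no spaces or position counts), and that the location contains only simple
start..end ranges. It ignores strand and does not handle complement() or
other location operators. The names parse_join, split_joined and to_fasta
are made up for illustration.

```python
import re


def parse_join(location):
    """Return (start, end) pairs from text like 'join(335..649,1807..2169)'.

    Continuation lines (with their 'FT' prefix) may be left in the text;
    only the digit..digit pairs are picked out.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def split_joined(sequence, ranges):
    """Cut a concatenated feature sequence into one piece per range."""
    pieces = []
    offset = 0
    for start, end in ranges:
        length = end - start + 1  # coordinates are 1-based and inclusive
        pieces.append(sequence[offset:offset + length])
        offset += length
    if offset != len(sequence):
        # The concatenation assumption does not hold for this input.
        raise ValueError("range lengths do not add up to the sequence length")
    return pieces


def to_fasta(record_id, ranges, pieces, width=60):
    """Format the pieces as a multi-sequence FASTA string."""
    lines = []
    for (start, end), piece in zip(ranges, pieces):
        lines.append(f">{record_id}_{start}..{end}")
        lines.extend(piece[i:i + width] for i in range(0, len(piece), width))
    return "\n".join(lines) + "\n"
```

With the ranges shown above, the first piece would cover 335..649, i.e.
649 - 335 + 1 = 315 bases. The length check is there so that the sketch
fails loudly if the concatenation assumption turns out to be wrong.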

Readseq will read NCBI's genome section .gbk files, but currently chokes on
the large ones (I think I know the solution).

In article <38E3826A.6A8CE0AA at nospam.net>,  <nospam at nospam.net> wrote:
>Is there an "out of the box" method for producing separate sequence
>lines for each ORF in a genomic sequence?  Something suitable for making
>a multisequence fasta file would be nice.
>
>Thanks,
>Mike Holloway
>holloway-1 at medctr.osu.edu
>


--
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at bio.indiana.edu





More information about the Bio-soft mailing list