extract & translate exons: solution

Dominique Mouchiroud mouchi at evoserv.univ-lyon1.fr.univ-lyon1.fr
Tue May 2 12:07:37 EST 1995


In article d4s at dartvax.dartmouth.edu, Bob.Gross at Dartmouth.EDU (Robert H. Gross) writes:
> I am familiar with the techniques for extracting the exon sequences
> from the database. What I need to do is to translate exons in the
> appropriate reading frame automatically. How can I determine which
> reading frame to use and, for the first exon, where to start. Remember,
> I'd like to do this for ALL exons so it needs to be an automatic
> process.
> 

Hello,

M. Gouy has developed a database system named ACNUC that can handle such requests
(see the references at the end of this message).

                                ACNUC
             A RETRIEVAL SYSTEM FOR GENBANK, EMBL, AND NBRF/PIR

ACNUC is a retrieval system for the nucleotide sequence databases GenBank
and EMBL and for the protein sequence database NBRF/PIR.

ACNUC lets you select sequences from these three databases by many criteria
(keyword, taxonomy, bibliography, sequence length, molecule type, organelle,
etc.), translate protein-coding genes into protein, and extract selected
sequences to user files. ACNUC is unique in providing direct access to coding
regions (e.g. protein-coding regions, tRNA or rRNA coding regions) and other
sub-sequences of DNA fragments present in GenBank and EMBL (introns, exons,
CDS, 3'UTR, mRNA, 5'UTR, tRNA, etc., as described in the FEATURES).

Of course, these sub-sequences can be selected only if they are described in the
FEATURES.
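
To come back to the original question about reading frames: when a CDS is
described in the FEATURES as a list of exon ranges, the frame of each exon
follows from its position in the spliced CDS. The Python sketch below is not
part of ACNUC; it only illustrates the idea, assuming a forward-strand
location written as join(a..b,c..d) with 1-based inclusive coordinates, the
standard genetic code, and an optional codon_start value. The function names
and the toy sequence are made up for this example.

```python
import re

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def parse_join(location):
    """Return 1-based inclusive (start, end) pairs from 'join(a..b,c..d)' or 'a..b'.
    Assumption: forward strand only, no complement() and no partial markers."""
    return [(int(s), int(e)) for s, e in re.findall(r"(\d+)\.\.(\d+)", location)]

def splice(genomic, exons):
    """Concatenate the exon ranges of a genomic sequence."""
    return "".join(genomic[start - 1:end] for start, end in exons)

def exon_phases(exons, codon_start=1):
    """For each exon, the number of leading bases to skip before its first
    complete codon (0, 1 or 2)."""
    phases, consumed = [], 0
    for start, end in exons:
        phases.append(((codon_start - 1) - consumed) % 3)
        consumed += end - start + 1
    return phases

def translate(cds, codon_start=1):
    """Translate a spliced CDS; unknown codons become 'X', stops become '*'."""
    seq = cds.upper()[codon_start - 1:]
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

# Toy example (hypothetical data): two exons separated by a 6-nt "intron".
genomic = "ATGGCGTAAGTCTAA"
exons = parse_join("join(1..5,12..15)")
cds = splice(genomic, exons)        # "ATGGCCTAA"
protein = translate(cds)            # "MA*"
phases = exon_phases(exons)         # [0, 1]
```

The second exon has phase 1 because its first base completes the codon begun
at the end of the first exon. Translating an exon on its own therefore means
skipping that many bases first; translating the spliced CDS avoids the problem.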

For example, to select, extract, and translate human coding sequences (CDS),
proceed as follows:

[strings not preceded by "%" are those typed by the user]
########################################################################
query
%             ****     ACNUC Data Base Content      ****                        
%                GenBank Release 88 (15 April 1995)                             
%286,094,556 bases; 352,414 sequences; 116,28 subsequences; 86,360 references.  
%Software by M. Gouy & M. Jacobzone, Laboratoire de biometrie, Universite Lyon I
%[ 8 free lists available]
% Command? (H for command list)
select
% Enter your selection criteria, or H(elp) (EX: sp=homo sapiens et k=globin@)
sp=homo sapiens et t=cds
% Sequence list named LIST1      contains 16900 seqs (among which 12460 subseqs)
% Command? (H for command list)
extract
% List name, sequence, or accession #, or H(elp)? [default=LIST1]

% Current output format is: text                          
% Do you want to change it? (y/[n])
y
% Choose one among:
% 1  text                          
% 2  gcg                           
% 3  fasta                         
% 4  flat (GenBank, EMBL or PIR)   
% 5  analseq                       
3
% Name of file to write extracted sequences?
human.cds
% Do you want:
%   (1) Simple extraction
%   (2) Translate into protein and extract
%   (3) Fragments or adjacent sequences
%   (4) Regions defined by sequence FEATURES
%   (5) Regions adjacent to sequence FEATURES
2
% Translating and extracting A00119.PE1          
% Translating and extracting A00127.PE1          
% Translating and extracting A00142.LAG-2        
% ... 
%Number of extracted sequences:   16900
%Command? (H for command list)
stop
%STOP: End of ACNUC retrieval program

The file "human.cds" contains all human CDS, translated into protein.
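
Since the fasta output format was chosen above, the resulting file can be read
back with a few lines of code. The following Python sketch assumes the usual
FASTA layout (a header line starting with ">" followed by sequence lines); the
exact header contents written by ACNUC are not shown here, so the sample
records are invented.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)     # last record in the file

# Hypothetical sample standing in for the contents of "human.cds".
sample = [">SEQ1 demo protein", "MKT", "LL*", "", ">SEQ2", "MA"]
records = list(read_fasta(sample))
```

With the real file, the same function can be fed an open file handle, e.g.
`with open("human.cds") as fh: records = list(read_fasta(fh))`.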

ACNUC offers many other possibilities. 

Notably, ACNUC can extract fragments adjacent to either end of a
subsequence (CDS, intron, tRNA, exon, etc.). It is therefore possible to
systematically extract, for example, the 50 nt downstream of the intron ends
of human protein-coding genes (or of sequences selected by any other criteria).
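
For readers working outside ACNUC, the coordinate arithmetic behind such an
extraction is simple. This Python sketch is only an illustration, assuming a
forward-strand feature whose end is given as a 1-based inclusive position; the
function name and the toy sequence are hypothetical.

```python
def downstream_of(sequence, feature_end, length=50):
    """Return up to `length` bases immediately after a feature.

    `feature_end` is the 1-based position of the feature's last base, so the
    flank starts at 0-based index `feature_end`. The result is shorter than
    `length` if the sequence ends first.
    """
    return sequence[feature_end:feature_end + length]

# Toy example: an "intron" ending at position 4, followed by the flank.
toy = "AAAACCCCGGGG"
flank = downstream_of(toy, 4, length=3)      # "CCC"
short = downstream_of(toy, 10, length=50)    # "GG" (truncated at sequence end)
```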

ACNUC is known to run on Sun (SunOS or Solaris), IBM RISC workstations,
SGI computers, DEC Alpha systems, and VAX/VMS systems.
It should install easily on most Unix platforms. Contact M. Gouy for help
with other Unix systems.

ACNUC is distributed by anonymous ftp from the internet address:
biom3.univ-lyon1.fr    or, numerically,    134.214.92.37
The directory there is /pub/acnuc


ref: M. Gouy et al. (1985) CABIOS 3:167-172
     M. Gouy et al. (1984) Nucl. Acids Res. 12:121-127

 Contact = M. Gouy: mgouy at evomol.univ-lyon1.fr


Laurent Duret
Laboratoire de Biometrie, Genetique et Biologie des Populations
URA CNRS 243 Universite Claude Bernard - Lyon I
43, Bd du 11 Novembre 1918 F-69622 Villeurbanne cedex

Tel: 	+33 72.44.81.42
E-mail:	duret at biomserv.univ-lyon1.fr


