how can I construct a subset of all the sequences?

frist at cc.umanitoba.ca
Mon Mar 14 14:32:42 EST 1994


In article <1994Mar13.211036.9827 at pslu1.psl.wisc.edu> Rapu Lak <rapulak at macc.wisc.edu> writes:
>This is the problem:
>
>We are interested in various questions about RNA.  Many of these
>interests are about eukaryotic mRNA and specifically, the 3' ends,
>3' UTRs, 3'-most exons and the 3'-most introns.  I would like to
>construct a database that contains these data.  But I know that I
>don't know how to do this.  
>
This is exactly the type of task that the XYLEM package was designed
to do.  

Fristensky B. (1993) Feature Expressions: creating and manipulating
sequence datasets. NAR 21:5997-6003.


>I imagine it is possible to search the entire DNA sequence database
>for all eukaryotic DNA sequences, and then search for ones that
>identify a translation STOP codon (from some annotation??), and
>then limit the search to those that also identify the site of the
>3' most intron-exon boundary.  Finally, from among these files
>identify those that also have a comment (annotation?) about the
>site of  cleavage and polyadenylation.
I'll start with the assumption that you are only interested in those
GenBank entries in which the features you describe are actually
annotated. (That will be a lot fewer than you might think.)
Using FINDKEY in XYLEM, you could identify entries that contain
the information you need by searching for feature keys like
"polyA_signal", "prim_transcript" or "3'UTR". For example, the GenBank
Primate division has 466 entries containing the "3'UTR" feature key.
FINDKEY will give you a list of LOCUS names, whose corresponding entries
could then be retrieved by FETCH. This dataset could be further trimmed
by inspection: simply delete unsuitable names from the namefile and run
FETCH on your original dataset again to generate a refined dataset.
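
To make the idea of a feature-key search concrete, here is a minimal
Python sketch. It is not FINDKEY and makes no claim about how FINDKEY
works internally; it only scans GenBank flat-file text and lists the
LOCUS names of entries whose feature table contains a given key. The
function name locus_names_with_key and the second entry, TOYENTRY, are
made up for illustration.

```python
def locus_names_with_key(flatfile_text, feature_key):
    """Return LOCUS names of entries whose feature table has feature_key."""
    names = []
    locus = None
    in_features = False
    found = False
    for line in flatfile_text.splitlines():
        if line.startswith("LOCUS"):
            locus = line.split()[1]
            in_features = False
            found = False
        elif line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN") or line.startswith("BASE COUNT"):
            in_features = False
        elif line.startswith("//"):
            # End of entry: report it if the key was seen.
            if locus is not None and found:
                names.append(locus)
            locus = None
            in_features = False
        elif in_features and line.startswith("     ") and line[5:6].strip():
            # Feature keys start in column 6; qualifier lines are
            # indented further, so column 6 is blank on those lines.
            if line.split()[0] == feature_key:
                found = True
    return names


# Abbreviated entries; TOYENTRY is a hypothetical entry without a 3'UTR.
SAMPLE = """\
LOCUS       ALOEGLOBIM   1691 bp ds-DNA             PRI       10-JAN-1994
FEATURES             Location/Qualifiers
     exon            1392..1691
                     /number=3
     3'UTR           1521..1691
                     /note="putative"
//
LOCUS       TOYENTRY      100 bp ds-DNA             PRI       01-JAN-1994
FEATURES             Location/Qualifiers
     exon            1..100
//
"""

print(locus_names_with_key(SAMPLE, "3'UTR"))
```

The resulting list plays the role of the namefile: names can be removed
by hand before the corresponding entries are retrieved.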

Another way to do this part would be to search for the same keywords
by sending a request to the email server at NCBI.

>Then we would like to
>extract only the DNA sequence that corresponds to the last exon as
>well as the exon containing the translation stop codon (if they are
>different) and the sequence of the 3'UTR up to and including the
>site of cleavage and polyadenylation.  So we want to identify the
>DNA sequence files that have this information and then create a new
>file that contains only the DNA sequence corresponding to the
>specific part of the mRNA we are interested in.  I then imagine
>that we would have a large group of new files, each containing
>annotations to the site of the translation stop codon, maybe the
>translation reading frame, the site of cleavage and
>polyadenylation,  .... (what else?).  These files would define the
>database that we are interested in analyzing. 
>
I'm not entirely sure what you want to do here, but here's an example of
what the FEATURES program can do. Here is an abbreviated version of
one of the GenBank entries retrieved from the primate division using
the keyword 3'UTR:

LOCUS       ALOEGLOBIM   1691 bp ds-DNA             PRI       10-JAN-1994
DEFINITION  Alouatta seniculus epsilon-globin gene, complete cds.
ACCESSION   L25367
FEATURES             Location/Qualifiers
     5'UTR           1..144
                     /note="putative"
     exon            1..236
                     /number=1
                     /note="putative"
     intron          237..359
                     /number=1
                     /note="putative"
     exon            360..582
                     /number=2
                     /note="putative"
     intron          583..1391
                     /number=2
                     /note="putative"
     exon            1392..1691
                     /number=3
                     /note="putative"
     3'UTR           1521..1691
                     /note="putative"
     source          1..1691
                     /organism="Alouatta seniculus"
                     /dev_stage="adult"
                     /sequenced_mol="DNA"
                     /sex="female"
                     /tissue_type="lymphocyte"
     CDS             join(145..236,360..582,1392..1520)
                     /note="putative; NCBI gi: 440091."
                     /product="epsilon-globin"
                     /codon_start=1

If you ask FEATURES to extract all 3'UTR sequences that are annotated
in the dataset, you get three files: a message file, a sequence file,
and an expression file. The results from the first three of the 466
entries in the dataset are shown below:

message file
-----------
GETOB          Version 1.2        2 Jan 1994
 
AGMA13GT:3'UTR1
     1         371
 
 
/product="alpha-1,3-galactosyltransferase"
/gene="alpha-1.3GT"
//----------------------------------------------
ALOA13GT:3'UTR1
     1         371
 
 
/product="alpha-1,3-galactosyltransferase"
/gene="alpha-1.3GT"
//----------------------------------------------
ALOEGLOBIM:3'UTR1
     1521         1691
 
 
/note="putative"
//----------------------------------------------


sequence file
-------------
>AGMA13GT:3'UTR1
tttgaggtcaagccagagaagaggtggcaacacatcagcatgatgcatgt
gaagatcatcagggagcacatcttggcccacatccaacacgaggtcgact
tcctcttctgcatggatgtagaccaggtcttccaagacaattttggggtg
gacaccctaggccagtcagtggatcagctacagccctggtggtacaaggc
agatcctgaggactttacctaggaaaggcagaaagagtcagcagcatgca
ttccatttggccagggggatttttattaccacacagccatgtttggagga
acacccattcaggttctcaacatcccccaggagtgctttaaaggaatcct
cctggaaaagaaaaatgacat
>ALOA13GT:3'UTR1
tttgaggtcaagccagagaagaggtggcaacacatctgcatgatgggtat
gaagaccatcggggagcacatcttggcccacatccaacacgaggtcgact
tcctcttctgcatggatgtggaccaggtcttccaagaccattttggggtg
gacaccctgggccagtcagtggctcagctacaggcctggtggtacaaggc
agctcctgataactttacctatgagaggcggaaacagtcggcagcatata
ttccatttggccagggggatttttattaccacgcagccatttttggagga
acacccattcaggttctcaacatcacccaggagtgctttaagggaatcct
cctggacaagaaaaatgacat
>ALOEGLOBIM:3'UTR1
gttatctcccagtttgccagtgttcctgtgaccctgacacccttcttctg
cacatgaatactgggcttggccttgagaggaaggtttctgtttaataaag
tacattttcttcagtaatcaaaaattgcaacttcatcttctccatcttgt
actcttgtgctaaaggaaaag

expression file
---------------
>AGMA13GT:3'UTR1
@M73307:1..371
>ALOA13GT:3'UTR1
@M73311:1..371
>ALOEGLOBIM:3'UTR1

The message file is a log of how each feature was extracted, with
accompanying qualifier lines to help in evaluating the results. The
sequence file contains the region of each sequence that was
extracted. The expression file is identical to the sequence file,
except that each sequence is replaced by the expression that was
evaluated to create it. (The expression comes from the Feature
Table.)
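
The relationship between an expression and its sequence can be sketched
in a few lines of Python. This is only an illustration under stated
assumptions, not the FEATURES implementation: it handles just the simple
"@accession:start..end" form seen in the expression file above, treats
locations as 1-based and inclusive, and takes the source sequences from
a dictionary keyed by accession number. The function names and the toy
sequence TOY1 are hypothetical.

```python
import re

# Only the simple '@accession:start..end' form is recognised here.
EXPR = re.compile(r"^@(?P<acc>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)$")


def parse_expression_file(text):
    """Yield (name, accession, start, end) for each '>name' / '@expr' pair."""
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
        elif line.startswith("@") and name is not None:
            m = EXPR.match(line)
            if m:
                yield name, m["acc"], int(m["start"]), int(m["end"])
            name = None


def evaluate(records, sequences):
    """Return sequence-file text for the given expression records."""
    out = []
    for name, acc, start, end in records:
        out.append(">" + name)
        # Convert the 1-based inclusive location to a Python slice.
        out.append(sequences[acc][start - 1:end])
    return "\n".join(out)


expressions = ">TOY1:3'UTR1\n@TOY1:3..6\n"
sequences = {"TOY1": "acgtacgt"}
print(evaluate(parse_expression_file(expressions), sequences))
```

Because the sequence is always regenerated from the expression, editing
the expression file is enough to change what ends up in the sequence
file.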

At this point, you can go through the message file and, if need
be, delete unwanted sequences from the expression file. The
expression file can then be run back through FEATURES to
create a corrected sequence file.

One of the great things about FEATURES is that even things that
aren't annotated in the GenBank entries can be extracted.
For example, if an entry had a CDS ending at position 1028 and a
polyA_site annotated at position 1101, but no 3'UTR, you could create
the desired sequence fragment (starting one base after the end of the
CDS) with the expression

accession#:1029..1101
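
A small Python sketch shows how such an expression could be derived
from existing annotation. It assumes a simple forward-strand CDS
location with no remote accessions, assumes the untranslated region
begins one base after the last CDS coordinate, and borrows the leading
"@" from the expression file shown earlier. The function names are
hypothetical, and the polyA position of 1691 used in the example is an
assumption for illustration, since the globin entry above does not
annotate a polyA_site.

```python
import re


def cds_end(location):
    """Largest coordinate in a simple forward-strand location string."""
    return max(int(n) for n in re.findall(r"\d+", location))


def utr3_expression(accession, cds_location, polya_site):
    """Build an expression covering CDS end + 1 through the polyA site."""
    start = cds_end(cds_location) + 1
    if start > polya_site:
        raise ValueError("polyA site lies inside or before the CDS")
    return f"@{accession}:{start}..{polya_site}"


# CDS location taken from the ALOEGLOBIM entry; 1691 is an assumed
# polyA position, chosen to match the end of the annotated 3'UTR.
print(utr3_expression("L25367", "join(145..236,360..582,1392..1520)", 1691))
```

For the globin entry this reproduces the annotated 3'UTR location,
1521..1691.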

Obviously, you could do the same thing to generate the exon or intron
datasets you describe above. For example, re-run FEATURES on the same
dataset, extracting 'exon' features. To make a 3' exon file, edit out
all but the last exon for each gene in the expression file, and
run the edited expression file back through FEATURES.
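
The editing step could also be scripted. The Python sketch below keeps
only the last record for each LOCUS name in an expression file. It
assumes one gene per entry on the forward strand, records listed in
feature-table order, and record names of the form "LOCUS:exonN" (by
analogy with "ALOEGLOBIM:3'UTR1" above; the exon names here are
hypothetical). Entries with several genes would still need to be
edited by hand.

```python
def last_exon_only(expression_text):
    """Keep, for each LOCUS name, only the last '>name' / '@expr' pair."""
    last = {}      # LOCUS name -> (header line, expression line)
    header = None
    for line in expression_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line
        elif line.startswith("@") and header is not None:
            locus = header[1:].split(":", 1)[0]
            last[locus] = (header, line)   # later records overwrite earlier
            header = None
    return "\n".join(f"{h}\n{e}" for h, e in last.values())


# Exon locations from the ALOEGLOBIM entry; record names are assumed.
exons = """\
>ALOEGLOBIM:exon1
@L25367:1..236
>ALOEGLOBIM:exon2
@L25367:360..582
>ALOEGLOBIM:exon3
@L25367:1392..1691
"""
print(last_exon_only(exons))
```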
 
>Please reply to me directly by email and I will post a SUMMARY of
>what comes my way.
>
I'm posting to the newsgroup, because I think there will be many
who would be interested in doing comparable things.

>
>Rock Pulak
>UW-Biochemistry
>rapulak at macc.wisc.edu

I should mention that at present, FEATURES only works on GenBank flat
file entries, and not on EMBL entries.

XYLEM can be obtained by anonymous FTP to the directory 'psgendb'
at ftp.cc.umanitoba.ca [130.179.16.24].

===============================================================================
Brian Fristensky                | 
Department of Plant Science     |  A question is like a knife that slices
University of Manitoba          |  through the stage backdrop and gives us
Winnipeg, MB R3T 2N2  CANADA    |  a look at what lies hidden behind.
frist at cc.umanitoba.ca           |  
Office phone:   204-474-6085    |  Milan Kundera, THE UNBEARABLE LIGHTNESS 
FAX:            204-261-5732    |  OF BEING
===============================================================================


