From owner-embldatabank@net.bio.net Tue Mar 01 22:00:00 1994
Newsgroups: bionet.molbio.embldatabank
Path: biosci!daresbury!bioftp.unibas.ch!comp.bioz.unibas.ch!doelz
From: doelz@comp.bioz.unibas.ch (Reinhard Doelz)
Subject: Re: Help searching databases
Message-ID: <1994Mar2.095205.285@comp.bioz.unibas.ch>
Sender: usenet@comp.bioz.unibas.ch (NEWS transaction account)
Nntp-Posting-Host: biox.embnet.unibas.ch
Reply-To: doelz@urz.unibas.ch
Organization: EMBnet Switzerland [BASEL]   
References:  <0hQpzNO00iV7A3=HJY@andrew.cmu.edu>
Date: Wed, 2 Mar 1994 09:52:05 GMT
Lines: 175

In article <0hQpzNO00iV7A3=HJY@andrew.cmu.edu>, "Howard M. Bomze" <hb10+@andrew.cmu.edu> writes:
|> I need some help in searching through the sequence data bases for splice
|> junctions.  Does anybody know of any programs out there that I can Use
|> to do this?  I am trying to look at sequences next to splice junctions
|> but in the EXON.  I don't think this will be too hard but I don't know
|> much about programing.  Thanks in advance for any help.

I am not sure what you mean by 'look at' sequences, but I'll show 
you how to prepare a set of sequences with the SRS system from Thure 
Etzold. 

Start SRS

[U]
[S]				(select a sequence query)
/[D] SPLICE & JUNCTION          
/[T] SPLICE & JUNCTION
/[C] SPLICE & JUNCTION
/[F] SPLICE & JUNCTION          (enter search terms SPLICE _and_ JUNCTION
                                 in Definition, Title, Comment and Features)
/[S] [SPACE] [S] [E] [SPACE]    (deselect Swissprot, select EMBL database, 
                                 you might want to select more DNA databases)
/[X] 2                          (combine the fields with OR)

The mask now looks as follows: (if your terminal isn't good enough, the 
_ and | will be displayed as x and q's or similar, I'm afraid) 


_____________________________________________________________________________
 [G] General  [O] BuffOptions  [U] Query  [H] Help

 lqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqSequenceqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqk
 x            ID [I]:                                                      x
 x     Accession [N]:                                                      x
 x    Definition [D]: SPLICE & JUNCTION                                    x
 x      Keywords [K]:                                                      x
 x      Organism [O]:                                                      x
 x       Authors [A]:                                                      x
 x         Title [T]:  SPLICE & JUNCTION                                   x
 x     Reference [R]:                                                      x
 x       Comment [C]: SPLICE & JUNCTION                                    x
 x      Features [F]: SPLICE & JUNCTION                                    x
 x                     separate keys by & (AND), | (OR), or ! (AND NOT)    x
 x                                                                         x
 x query (set) name [Q]: SQ1                     select library(s) [S]: @  x
 x connect fields by AND (1) or OR (2) [X]: 1                              x
 x                                 do =>   ([Do])    abort =>   ([F10])    x
 mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqj

______________________________________________________________________________

[RETURN] to do the query. There will be about 100 entries. Next,
[O] [W]    to view the query results; my screen looks like this:

  Info of Query "SQ1"

  Query Command(s):
      SQ1 = (
      [SQ-DEF: SPLICE* & JUNCTION*] |
      [SQ-TIT: SPLICE* & JUNCTION*] |
      [SQ-CC: SPLICE* & JUNCTION*])  |
      [SQ-FTS: SPLICE* & JUNCTION*] > PARENT

      86 entries from library "EMBL"
      13 entries from library "EMBL_NEW"
      10 entries from library "GENBANK"

  Created set of type "Seq-ID" has 109 members

[Q] to leave this mode. 

Next, we need to find all exons. Again, a sequence query, 

[U] [S] /[F] EXON [RETURN] [O] [W] 


  Query Command(s):
      SQ2 = (
      [S2-FTS: EXON*])

   39185 entries from library "EMBL"
    2499 entries from library "EMBL_NEW"
    1713 entries from library "GENBANK"

  Created set of type "Seq-Ft-ID" has 43397 members

[Q] to leave this query. 

Next, we combine the two by mapping the first on the second. 

[U] [X] SQ1 > SQ2 [RETURN] 

In SQ3, you'll find about 200 entries which are DNA exon sequences, 
described in other sections of the annotation with SPLICE and JUNCTION. 
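
For readers who want to see the idea behind this mapping step outside
SRS, here is a minimal Python sketch. It assumes, purely for illustration,
that a feature ID can be modelled as an (entry, feature number) pair; the
function name and all IDs are invented, and this is not a description of
how SRS stores or links its sets.

```python
# Hypothetical illustration of "SQ1 > SQ2": keep the features in SQ2
# whose parent entry is a member of SQ1. All IDs below are invented.

def link_entries_to_features(entry_ids, feature_ids):
    """Return the feature IDs whose parent entry is in entry_ids."""
    parents = set(entry_ids)
    return [f for f in feature_ids if f[0] in parents]

# Entries that matched SPLICE & JUNCTION (stand-in for SQ1).
sq1 = ["ENTRY_A", "ENTRY_C"]

# Exon features as (entry, feature number) pairs (stand-in for SQ2).
sq2 = [("ENTRY_A", 1), ("ENTRY_A", 2), ("ENTRY_B", 1), ("ENTRY_C", 1)]

sq3 = link_entries_to_features(sq1, sq2)
print(sq3)
```

The result has one member per exon rather than per entry, which is why
the linked set can be larger than the set of entries it started from.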

You may want to copy this into your directory with [O] [C] (copy 
set output). Since exons may be rather short, we set a minimum length of 
10 base pairs, and the screen will look like this:

    lqqqqqqqqqqqqqqqqqqqqqq Copy Sequence Feature qqqqqqqqqqqqqqqqqqqqqqqkN...
    x                                                                    x
    x  output directory: /bioy/scratch/doelz/                            x
    x  file name:                                                        x
    x  sequence format [F]:  @                                           x
    x                                                                    x
    x  begin [B]: 0       relative to begin (1) or end (2): 1            x
    x    end [E]: 0       relative to begin (1) or end (2): 2            x
    x       minimum length [M]: 10     maximum length: 0                 x
    x                                                                    x
    x         reject if feature incomplete: Y                            x
    x  reject if selected range incomplete: Y                            x
    x                                                                    x
    x                              do =>   ([Do])    abort =>   ([F10])  x
    x                                                                    x
    mqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqj


As the copying process starts, you will see success messages and messages 
which indicate that the sequence does not meet the constraints (10 bp) 
or that the feature is incomplete (we ticked Y for Yes in this option). 
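
The filtering that this mask performs can be sketched in a few lines of
Python. This is only an approximation under stated assumptions: the begin
offset is taken relative to the feature begin and the end offset relative
to the feature end, a maximum length of 0 is read as "no upper limit"
(which matches the sample output further down), and the function name and
arguments are invented for the example rather than taken from SRS.

```python
def extract_feature(seq, begin, end, begin_offset=0, end_offset=0,
                    min_length=10, max_length=0, complete=True):
    """Return the selected range of a feature, or None if it is rejected.

    begin and end are 1-based, inclusive feature coordinates. The offsets
    are added to them, mirroring the "begin [B]" and "end [E]" fields.
    """
    if not complete:
        return None                  # "reject if feature incomplete: Y"
    start = begin + begin_offset
    stop = end + end_offset
    if start < 1 or stop > len(seq) or start > stop:
        return None                  # "reject if selected range incomplete: Y"
    length = stop - start + 1
    if length < min_length:
        return None                  # shorter than the minimum length
    if max_length and length > max_length:
        return None                  # longer than the (optional) maximum
    return seq[start - 1:stop]

toy = "ACGT" * 10                    # invented 40-base sequence
print(extract_feature(toy, 5, 20))   # accepted: 16 bases
print(extract_feature(toy, 5, 10))   # rejected: only 6 bases
```

Negative or positive offsets would let the same routine pick up flanking
bases on either side of the exon boundary.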

Finally, you will end up with about 80 sequences, conveniently written 
in GCG format (you might want PIR format as another option). The following
shows a sample entry on VMS: 

D$DAY:[DOELZ]CFCRE1_S2.GCG;1

FT   repeat_region   1. .415
FT                   /note="mini-exon gene repeat

  Sequence extract of feature "CFCRE1_S2"
  Begin position: 0 added to BEGIN
  End position: 0 added to END
  Length constraints: min=10, max=0

D$DAY:[DOELZ]CFCRE1_S2.GCG  Length: 415  Check: 4283  ..

       1  AAGCTTCCGG AAACAACCGG CACAAATTTT GAGGCGGAAG CGCTGCTTTT TTTTGTGTCC
      61  GGGGGGGTGC TCCTTGGGGT CCCCCTGTCC AGCCCCAGCC GGTCGCCCAC CACATAGGAA
     121  TTTGCGAAGG ACCCCCAAAA ATCCCGGTCC CCGGGGCGAG TTGTCCCAAC TTTTTCAAAC
     181  CTCATGAAGA GCTAGTTGCG TCATTGAAAA GTTCGTGTGC AGAAACCCCC TCCCCCACGT
     241  TTGTACAATG GAAGAGTTTA CGATACAGGT TTTCTCACGG TTTTGAGGTG TTTTTTCGAA
     301  AAACAAAAAA TATAGAGGTG TATAGCGCTT ATTTTTGACA CCCCCCTCAA AACATGCTGG
     361  GGGTATAGGT CCTTCCAACT AACGCTATAT AAGTATCAGT TTCTGTACTT TATTG



SRS is available via a telnet service and, in full source, via anonymous FTP. 
It is known to run well on VMS and various flavours of UNIX systems. 
GCG Version 8, expected later this year, is rumoured to contain an 
SRS system.



Regards
Reinhard 





DISCLAIMER
Note that the software mentioned resembles Computer Program(s) which
require a license in order to be run unless stated otherwise in a
statement codistributed with the software. The use of the program(s) was
mentioned within a specific problem or example and must not be used to
conclude that other software products cannot possibly do a similar job. 

-- 
  +---------------------------+-------------------------------------------+
  |    Dr. Reinhard Doelz     | Tel. x41 61 2672247    Fax x41 61 2672078 |
  |      Biocomputing         | electronic Mail       doelz@urz.unibas.ch |
  |Biozentrum der Universitaet+-------------------------------------------+
  |   Klingelbergstrasse 70   | EMBnet         embnet@comp.bioz.unibas.ch |
  |CH 4056 Basel  SWITZERLAND | Switzerland       gopher.embnet.unibas.ch |
  +---------------------------+------------- http://beta.embnet.unibas.ch/

From owner-embldatabank@net.bio.net Tue Mar 01 22:00:00 1994
Path: biosci!bloom-beacon.mit.edu!gatech!howland.reston.ans.net!pipex!zaphod.crihan.fr!jussieu.fr!univ-lyon1.fr!news
From: duret@evoserv.univ-lyon1.fr (Laurent Duret)
Newsgroups: bionet.molbio.embldatabank,bionet.molbio.genbank,bionet.software
Subject: Re: Help searching databases = ACNUC
Date: 2 Mar 1994 07:43:18 GMT
Organization: Universite Claude Bernard - Lyon 1
Lines: 58
Distribution: world
Message-ID: <2l1g2m$nfs@cismsun.univ-lyon1.fr>
References: <2l1fhv$nfs@cismsun.univ-lyon1.fr>
Reply-To: duret@evoserv.univ-lyon1.fr
NNTP-Posting-Host: evoserv.univ-lyon1.fr
Xref: biosci bionet.molbio.embldatabank:295 bionet.molbio.genbank:1552 bionet.software:7438

>I need some help in searching through the sequence data bases for splice
>junctions.  
>I am trying to look at sequences next to splice junctions
>but in the EXON.

M. Gouy has developed a database system named ACNUC that allows such requests (ref).

                                ACNUC
             A RETRIEVAL SYSTEM FOR GENBANK, EMBL, AND NBRF/PIR

ACNUC is a retrieval system for the nucleotide sequence databases GenBank
or EMBL and for the protein sequence data base NBRF/PIR.

ACNUC allows one to select sequences by many criteria (keyword, taxonomy, bibliography,
sequence length, molecule type, organelle, etc.) from these 3 data
bases, to translate protein-coding genes into protein, and to extract
selected sequences into user files. ACNUC is unique in providing direct access
to coding regions (e.g. protein coding regions, tRNA or rRNA coding regions)
of DNA fragments present in GenBank and in EMBL (intron, exon, CDS, 3'UTR, mRNA,
5'UTR, tRNA, etc., as described in the FEATURES table).


Notably, ACNUC can extract fragments adjacent to the ends of a 
subsequence (CDS, intron, tRNA, exon, etc.). It is therefore possible to
systematically extract, for example, the 50 nt downstream of the intron ends of human 
protein-coding genes (or of sequences selected by any other criteria).
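
To make the "50 nt downstream" idea concrete, here is a small Python
sketch. It is not ACNUC code: the function name, the toy sequence and the
intron coordinates are all invented, coordinates are assumed to be
1-based and inclusive on the forward strand, and features on the
complementary strand are not handled.

```python
def downstream_of(seq, feature_end, n=50):
    """Return the n bases that follow a feature ending at feature_end
    (1-based, inclusive), or None if fewer than n bases remain."""
    flank = seq[feature_end:feature_end + n]
    return flank if len(flank) == n else None

# Invented 90-base sequence and two invented intron locations (begin, end).
toy_seq = "a" * 20 + "c" * 60 + "g" * 10
toy_introns = [(5, 20), (40, 85)]

# One flank per intron; the second intron ends too close to the end of
# the sequence, so its flank is reported as None.
flanks = [downstream_of(toy_seq, end) for _, end in toy_introns]
print(flanks)
```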

ACNUC is known to run on Sun (SunOS or Solaris), IBM RISC workstations, 
SGI computers, DEC Alpha systems, and VAX/VMS systems. 
It should be easy to install on most Unix platforms. Contact M. Gouy for help
with other Unix systems.

ACNUC is distributed by anonymous ftp from the internet address:
biom3.univ-lyon1.fr    or, numerically,    134.214.92.37
The directory there is /pub/acnuc


ref: M. Gouy et al. (1985) CABIOS 1(3):167-172
     M. Gouy et al. (1984) Nucl. Acids Res. 12:121-127

 Contact = M. Gouy: mgouy@evomol.univ-lyon1.fr


Laurent Duret
Laboratoire de Biometrie, Genetique et Biologie des Populations
URA CNRS 243 Universite Claude Bernard - Lyon I
43, Bd du 11 Novembre 1918 F-69622 Villeurbanne cedex

Tel: 	+33 72.44.81.42
E-mail:	duret@biomserv.univ-lyon1.fr









From owner-embldatabank@net.bio.net Sun Mar 06 22:00:00 1994
Newsgroups: bionet.molbio.proteins,bionet.general,bionet.molbio.embldatabank,bionet.molbio.gene-linkage,bionet.molbio.swiss-prot,bionet.software
Path: biosci!bcm!cs.utexas.edu!howland.reston.ans.net!torn!utnut!utcsri!newsflash.concordia.ca!sifon!VM1.MCGILL.CA
From: BAEV <MDNB000@MUSICA.MCGILL.CA>
Subject: Peptide Motif Search by e-mail
Message-ID: <07MAR94.17195731.0096@VM1.MCGILL.CA>
Lines: 8
Sender: usenet@MUSICA.MCGILL.CA
Organization: McGill University
Date: Mon, 7 Mar 1994 20:55:19 GMT
Xref: biosci bionet.molbio.proteins:1509 bionet.general:8144 bionet.molbio.embldatabank:297 bionet.molbio.gene-linkage:295 bionet.molbio.swiss-prot:1 bionet.software:7502

Dear colleagues...I would appreciate any suggestions on peptide
motif search programs available on electronic mail (e-mail)
servers around the world. Thank you in advance.

Nikolai Baeff, M.D.
McGill University, Canada
mdnb@musica.mcgill.ca


From owner-embldatabank@net.bio.net Mon Mar 07 22:00:00 1994
Path: biosci!bcm!cs.utexas.edu!uunet!netnews.jhuapl.edu!netnews.jhuapl.edu!not-for-mail
From: vellani@netnews.jhuapl.edu (Thomas J. Vellani F2C x5714)
Newsgroups: bionet.molbio.embldatabank
Subject: subscription
Date: 7 Mar 1994 23:17:11 -0500
Organization: Johns Hopkins University Applied Physics Lab
Lines: 1
Message-ID: <2lgu87$g5h@aplcomm.jhuapl.edu>
NNTP-Posting-Host: aplcomm.jhuapl.edu

Pl

From owner-embldatabank@net.bio.net Mon Mar 07 22:00:00 1994
Path: biosci!bcm!cs.utexas.edu!howland.reston.ans.net!gatech!news-feed-1.peachnet.edu!umn.edu!msus1.msus.edu!vax1.mankato.msus.edu!vengeance
Newsgroups: bionet.molbio.embldatabank
Subject: Omaha Project
Message-ID: <1994Mar8.084440.2359@vax1.mankato.msus.edu>
From: vengeance@vax1.mankato.msus.edu
Date: 8 Mar 94 08:44:40 -0500
Organization: Mankato State University
Lines: 9

   I am trying to build a list of names and E-Mail addresses of
people in the Omaha Nebraska area for a school related project.
   If you live in Omaha or go to school there or know someone
that does and will be around for three months or more, please
reply via E-Mail to Vengeance@vax1.mankato.msus.edu.

Thank you very much!

Ryan Krueger

From owner-embldatabank@net.bio.net Sat Mar 12 22:00:00 1994
Newsgroups: bionet.molbio.embldatabank
Path: biosci!agate!howland.reston.ans.net!vixen.cso.uiuc.edu!uwm.edu!news.doit.wisc.edu!psl.wisc.edu!news
From: Rapu Lak <rapulak@macc.wisc.edu>
Subject: how can I construct a subset of all the sequences?
Message-ID: <1994Mar13.211036.9827@pslu1.psl.wisc.edu>
X-Xxdate: Sun, 13 Mar 94 23:15:36 GMT
Sender: news@pslu1.psl.wisc.edu (USENET News System)
Organization:  UW-Madison
X-Useragent: Nuntius v1.1.1d17
Date: Sun, 13 Mar 94 21:10:36 GMT
Lines: 47

This is the problem:

We are interested in various questions about RNA.  Many of these
interests are about eukaryotic mRNA and specifically, the 3' ends,
3' UTRs, 3'-most exons and the 3'-most introns.  I would like to
construct a database that contains these data.  But I know that I
don't know how to do this.  

I imagine it is possible to search the entire DNA sequence database
for all eukaryotic DNA sequences, and then search for ones that
identify a translation STOP codon (from some annotation??), and
then limit the search to those that also identify the site of the
3' most intron-exon boundary.  Finally, from among these files
identify those that also have a comment (annotation?) about the
site of  cleavage and polyadenylation.  Then we would like to
extract only the DNA sequence that corresponds to the last exon as
well as the exon containing the translation stop codon (if they are
different) and the sequence of the 3'UTR up to and including the
site of cleavage and polyadenylation.  So we want to identify the
DNA sequence files that have this information and then create a new
file that contains only the DNA sequence corresponding to the
specific part of the mRNA we are interested in.  I then imagine
that we would have a large group of new files, each containing
annotations to the site of the translation stop codon, maybe the
translation reading frame, the site of cleavage and
polyadenylation,  .... (what else?).  These files would define the
database that we are interested in analyzing. 

Am I dreaming the impossible dream?  I suppose I can ask at least
one relevant question.  Are there tools or some established
programs for doing this kind of database search and database
creation?  I'd like to hear from people who have used such
programs, from people who have written such programs, and from
people who might be interested in trying such a program to actually
create such a database for us.  I would like to make contact with
those of the "net" who are comfortable doing these kinds of
manipulations and could possibly guide me to success.  Also, what
and where are the relevant FAQs to this subject/problem?  Have I
posted this question to the appropriate newsgroup?

Please reply to me directly by email and I will post a SUMMARY of
what comes my way.


Rock Pulak
UW-Biochemistry
rapulak@macc.wisc.edu

From owner-embldatabank@net.bio.net Sun Mar 13 22:00:00 1994
Path: biosci!bcm!news.msfc.nasa.gov!europa.eng.gtefsd.com!howland.reston.ans.net!torn!utnut!utgpu!utcsri!newsflash.concordia.ca!canopus.cc.umanitoba.ca!frist
From: frist@cc.umanitoba.ca ()
Newsgroups: bionet.molbio.embldatabank
Subject: Re: how can I construct a subset of all the sequences?
Date: 14 Mar 1994 19:32:42 GMT
Organization: University of Manitoba, Winnipeg, Manitoba, Canada
Lines: 209
Message-ID: <2m2e4q$gds@canopus.cc.umanitoba.ca>
References: <1994Mar13.211036.9827@pslu1.psl.wisc.edu>
NNTP-Posting-Host: antares.cc.umanitoba.ca

In article <1994Mar13.211036.9827@pslu1.psl.wisc.edu> Rapu Lak <rapulak@macc.wisc.edu> writes:
>This is the problem:
>
>We are interested in various questions about RNA.  Many of these
>interests are about eukaryotic mRNA and specifically, the 3' ends,
>3' UTRs, 3'-most exons and the 3'-most introns.  I would like to
>construct a database that contains these data.  But I know that I
>don't know how to do this.  
>
This is exactly the type of task that the XYLEM package was designed
to do.  

Fristensky B. (1993) Feature Expressions: creating and manipulating
sequence datasets. NAR 21:5997-6003.


>I imagine it is possible to search the entire DNA sequence database
>for all eukaryotic DNA sequences, and then search for ones that
>identify a translation STOP codon (from some annotation??), and
>then limit the search to those that also identify the site of the
>3' most intron-exon boundary.  Finally, from among these files
>identify those that also have a comment (annotation?) about the
>site of  cleavage and polyadenylation.
I'll start with the assumption that you are only interested in those
GenBank entries that have the various features you have described
annotated (That will be a lot fewer than you might think.) 
Using FINDKEY in XYLEM, you could identify entries that contain
the information you need by searching for feature keys like 
"polyA_signal", "prim_transcript"  or "3'UTR". For example, the GenBank
Primate division has 466 entries containing the "3'UTR" feature key.
FINDKEY will give you a list of LOCUS names, whose corresponding entries 
could then be retrieved by FETCH. This dataset could be further trimmed by 
inspection (simply delete unsuitable names from the namefile and do a
fetch on your original dataset to generate a refined dataset). 
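
For anyone curious what such a feature-key search amounts to, here is a
rough Python sketch. It is not FINDKEY itself: the function name is
invented, the second entry in the test data (TOYENTRY) is made up, and
the parsing assumes the usual flat-file layout in which feature keys
start in column 6 and qualifier lines are indented further.

```python
def loci_with_feature(lines, wanted_key):
    """Return LOCUS names of flat-file entries that contain wanted_key."""
    hits = []
    locus = None
    in_features = False
    for line in lines:
        if line.startswith("LOCUS"):
            locus = line.split()[1]
            in_features = False
        elif line.startswith("FEATURES"):
            in_features = True
        elif line[:1].strip():
            # Any other keyword in column 1 (ORIGIN, //, ...) ends the table.
            in_features = False
        elif in_features and line[5:6].strip():
            # Feature-key line: the key is the first word on the line.
            if line.split()[0] == wanted_key and locus not in hits:
                hits.append(locus)
    return hits

SAMPLE = """\
LOCUS       ALOEGLOBIM   1691 bp ds-DNA             PRI       10-JAN-1994
FEATURES             Location/Qualifiers
     exon            1392..1691
                     /number=3
     3'UTR           1521..1691
                     /note="putative"
//
LOCUS       TOYENTRY      100 bp ds-DNA             PRI       01-JAN-1994
FEATURES             Location/Qualifiers
     exon            1..100
//
""".splitlines()

print(loci_with_feature(SAMPLE, "3'UTR"))
```

The returned list plays the role of the namefile: it can be trimmed by
hand before the corresponding entries are retrieved.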

Another way to do this part would be to search for the same keywords
by sending a request to the email server at NCBI.

>Then we would like to
>extract only the DNA sequence that corresponds to the last exon as
>well as the exon containing the translation stop codon (if they are
>different) and the sequence of the 3'UTR up to and including the
>site of cleavage and polyadenylation.  So we want to identify the
>DNA sequence files that have this information and then create a new
>file that contains only the DNA sequence corresponding to the
>specific part of the mRNA we are interested in.  I then imagine
>that we would have a large group of new files, each containing
>annotations to the site of the translation stop codon, maybe the
>translation reading frame, the site of cleavage and
>polyadenylation,  .... (what else?).  These files would define the
>database that we are interested in analyzing. 
>
I'm not entirely sure what you want to do here, but here's an example of
what the FEATURES program can do. Here is an abbreviated version of
one of the GenBank entries retrieved from the primate division using
the keyword 3'UTR:

LOCUS       ALOEGLOBIM   1691 bp ds-DNA             PRI       10-JAN-1994
DEFINITION  Alouatta seniculus epsilon-globin gene, complete cds.
ACCESSION   L25367
FEATURES             Location/Qualifiers
     5'UTR           1..144
                     /note="putative"
     exon            1..236
                     /number=1
                     /note="putative"
     intron          237..359
                     /number=1
                     /note="putative"
     exon            360..582
                     /number=2
                     /note="putative"
     intron          583..1391
                     /number=2
                     /note="putative"
     exon            1392..1691
                     /number=3
                     /note="putative"
     3'UTR           1521..1691
                     /note="putative"
     source          1..1691
                     /organism="Alouatta seniculus"
                     /dev_stage="adult"
                     /sequenced_mol="DNA"
                     /sex="female"
                     /tissue_type="lymphocyte"
     CDS             join(145..236,360..582,1392..1520)
                     /note="putative; NCBI gi: 440091."
                     /product="epsilon-globin"
                     /codon_start=1

If you ask FEATURES to extract all 3'UTR sequences that are annotated
in the dataset, you get three files: a message file, a sequence file,
and an expression file. The results from the first three of the 466
entries in the dataset are shown below:

message file
-----------
GETOB          Version 1.2        2 Jan 1994
 
AGMA13GT:3'UTR1
     1         371
 
 
/product="alpha-1,3-galactosyltransferase"
/gene="alpha-1.3GT"
//----------------------------------------------
ALOA13GT:3'UTR1
     1         371
 
 
/product="alpha-1,3-galactosyltransferase"
/gene="alpha-1.3GT"
//----------------------------------------------
ALOEGLOBIM:3'UTR1
     1521         1691
 
 
/note="putative"
//----------------------------------------------


sequence file
-------------
>AGMA13GT:3'UTR1
tttgaggtcaagccagagaagaggtggcaacacatcagcatgatgcatgt
gaagatcatcagggagcacatcttggcccacatccaacacgaggtcgact
tcctcttctgcatggatgtagaccaggtcttccaagacaattttggggtg
gacaccctaggccagtcagtggatcagctacagccctggtggtacaaggc
agatcctgaggactttacctaggaaaggcagaaagagtcagcagcatgca
ttccatttggccagggggatttttattaccacacagccatgtttggagga
acacccattcaggttctcaacatcccccaggagtgctttaaaggaatcct
cctggaaaagaaaaatgacat
>ALOA13GT:3'UTR1
tttgaggtcaagccagagaagaggtggcaacacatctgcatgatgggtat
gaagaccatcggggagcacatcttggcccacatccaacacgaggtcgact
tcctcttctgcatggatgtggaccaggtcttccaagaccattttggggtg
gacaccctgggccagtcagtggctcagctacaggcctggtggtacaaggc
agctcctgataactttacctatgagaggcggaaacagtcggcagcatata
ttccatttggccagggggatttttattaccacgcagccatttttggagga
acacccattcaggttctcaacatcacccaggagtgctttaagggaatcct
cctggacaagaaaaatgacat
>ALOEGLOBIM:3'UTR1
gttatctcccagtttgccagtgttcctgtgaccctgacacccttcttctg
cacatgaatactgggcttggccttgagaggaaggtttctgtttaataaag
tacattttcttcagtaatcaaaaattgcaacttcatcttctccatcttgt
actcttgtgctaaaggaaaag

expression file
---------------
>AGMA13GT:3'UTR1
@M73307:1..371
>ALOA13GT:3'UTR1
@M73311:1..371
>ALOEGLOBIM:3'UTR1
@L25367:1521..1691

The message file is a log of how each feature was extracted, with
accompanying qualifier lines to help in evaluation of results. The
sequence file contains the region of each sequence that was
extracted. The expression file is identical to the sequence file,
with the exception that sequence is replaced by the expression
that was evaluated to create the sequence. (The expression
comes from the Feature Table). 
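
As a rough illustration of what "evaluating an expression" means, the
Python sketch below turns a simple expression of the form shown above
into a subsequence. It is an assumption-laden toy, not the FEATURES
implementation: it handles only a single begin..end range (no join() or
complement()), the function name is invented, and the sequences are
looked up in an ordinary dictionary keyed by an invented accession.

```python
import re

# "@ACCESSION:begin..end"; the leading "@" is treated as optional here.
EXPR = re.compile(r"^@?([^:]+):(\d+)\.\.(\d+)$")

def evaluate(expression, sequences):
    """Return the subsequence named by a simple range expression.

    Coordinates are 1-based and inclusive, as in the Feature Table.
    """
    match = EXPR.match(expression.strip())
    if match is None:
        raise ValueError("unsupported expression: " + expression)
    accession = match.group(1)
    begin, end = int(match.group(2)), int(match.group(3))
    return sequences[accession][begin - 1:end]

# Invented accession and sequence, for demonstration only.
toy_sequences = {"TOY1": "acgtacgtac"}
fragment = evaluate("@TOY1:3..6", toy_sequences)
print(fragment)
```

Because the expression, not the sequence, is the editable artifact,
changing the coordinates and re-evaluating is all it takes to produce a
corrected fragment.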

At this point, you can go through the message file and, if need
be, delete unwanted sequences from the expression file. The 
expression file can then be run back through FEATURES to
create a corrected sequence file.

One of the great things about FEATURES is that even things that
aren't annotated in the GenBank entries can be extracted.
For example, if an entry had CDS ending at position 1028  and polyA_site
annotated at position 1101,    but no 3'UTR, you could create 
the desired sequence fragment with the expression

accession#:1028..1101

Obviously, you could do the same thing to generate the exon or intron
datasets you describe above. For example,  re-run FEATURES on the same dataset, 
extracting 'exon' features. To make a 3'exon file, edit out
all but the last exon for each gene in the expression file, and
run the edited expression file back through FEATURES.
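
The hand-editing step could also be scripted. The Python sketch below
keeps only the last record for each entry in an expression file. Several
things here are assumptions: that each record is a ">" header line
followed by one expression line, that the entry name is the part of the
header before the colon, that exons are listed in 5' to 3' order, and
that there is one gene per entry. The labels such as "exon1" and the
accession X00000 are invented for the example.

```python
def last_exon_per_entry(lines):
    """Keep only the final (header, expression) record for each entry."""
    last = {}
    header = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line
        elif line and header is not None:
            entry = header[1:].split(":", 1)[0]
            last[entry] = (header, line)   # later records overwrite earlier ones
            header = None
    kept = []
    for header, expression in last.values():
        kept.extend([header, expression])
    return kept

toy_expressions = [
    ">ALOEGLOBIM:exon1", "@L25367:1..236",
    ">ALOEGLOBIM:exon2", "@L25367:360..582",
    ">ALOEGLOBIM:exon3", "@L25367:1392..1691",
    ">TOY:exon1", "@X00000:5..50",
]
three_prime = last_exon_per_entry(toy_expressions)
print("\n".join(three_prime))
```

Entries with several genes, or genes on the complementary strand, would
still need to be checked by eye.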
 
>Please reply to me directly by email and I will post a SUMMARY of
>what comes my way.
>
I'm posting to the newsgroup, because I think there will be many
who would be interested in doing comparable things.

>
>Rock Pulak
>UW-Biochemistry
>rapulak@macc.wisc.edu

I should mention that at present, FEATURES only works on GenBank flat
file entries, and not on EMBL entries.

XYLEM can be obtained by anonymous FTP to the directory 'psgendb'
at ftp.cc.umanitoba.ca [130.179.16.24].

===============================================================================
Brian Fristensky                | 
Department of Plant Science     |  A question is like a knife that slices
University of Manitoba          |  through the stage backdrop and gives us
Winnipeg, MB R3T 2N2  CANADA    |  a look at what lies hidden behind.
frist@cc.umanitoba.ca           |  
Office phone:   204-474-6085    |  Milan Kundera, THE UNBEARABLE LIGHTNESS 
FAX:            204-261-5732    |  OF BEING
===============================================================================

From owner-embldatabank@net.bio.net Tue Mar 15 22:00:00 1994
Path: biosci!daresbury!bioftp.unibas.ch!citi2.fr!jussieu.fr!univ-lyon1.fr!swidir.switch.ch!scsing.switch.ch!sun.rediris.es!news.uam.es!anu.iib.uam.es!user
From: JRValverde@Enlil.iib.uam.es (J. R. Valverde)
Newsgroups: bionet.molbio.embldatabank
Subject: Program to update the databases
Followup-To: bionet.molbio.embldatabank
Date: Wed, 16 Mar 1994 17:53:31 +0100
Organization: IIB - CSIC
Lines: 12
Distribution: world
Message-ID: <JRValverde-160394175331@anu.iib.uam.es>
NNTP-Posting-Host: anu.iib.uam.es

Hi there,

    I've learned that there is a program that automatically handles the
updates of the EMBL databases on VAXen with the GCG package. It seems that
this program has been written to support UCX and MultiNet.

    I'd like to know if someone has ported the program to CMU/IP. And if
not, where can I contact the authors or find the source code so that I can
make the port myself.

    Thanks
                        jr

From owner-embldatabank@net.bio.net Fri Mar 18 22:00:00 1994
Newsgroups: bionet.software,bionet.molbio.embldatabank
Path: biosci!daresbury!bioftp.unibas.ch!comp.bioz.unibas.ch!doelz
From: doelz@comp.bioz.unibas.ch (Reinhard Doelz)
Subject: Re: EMBL db updates
Message-ID: <1994Mar19.192645.5785@comp.bioz.unibas.ch>
Sender: usenet@comp.bioz.unibas.ch (NEWS transaction account)
Nntp-Posting-Host: biox.embnet.unibas.ch
Reply-To: doelz@urz.unibas.ch
Organization: EMBnet Switzerland [BASEL]   
References:  <JRValverde-160394174741@anu.iib.uam.es>
Date: Sat, 19 Mar 1994 19:26:45 GMT
Lines: 92
Xref: biosci bionet.software:7652 bionet.molbio.embldatabank:304

In article <JRValverde-160394174741@anu.iib.uam.es>, JRValverde@Enlil.iib.uam.es (J. R. Valverde) writes:
|> Hi to you all,
|> 
|>    I've learned that there is the possibility of automatically updating
|> the EMBL databases in EMBnet using a special program that handles and
|> automates all file transfers.

The program is called NDT and was designed and implemented by Peter Gad of 
the Swedish EMBnet node in Uppsala. 

As you are (guessing from the mail address) in Madrid, Spain, the good news for 
you is that the Spanish EMBnet node at the Centro Nacional de Biotecnologia
already runs a daily updated EMBL database. The contact is: Jose-Maria Carazo, 
Carazo@Samba.cnb.uam.es, Tel.: +341-585-4505 . 

|> 
|>    It seems that this program is only supported for VMS systems running
|> UCX or MultiNet as the TCP/IP carriers.
|> 

Pardon me, but I would like to extend your statement into a more general 
defense of the developers' current situation. Both Peter and I develop in 
an academic environment, which means that we are not directly funded to 
develop software for every possible platform but rather are employed to 
make it run here at our own sites. If we give it away, this is a matter of 
goodwill. The current zoo of operating systems you could possibly want to 
support numbers about 50. We have only about 5 configurations in-house. So, 
making a release means doing a lot (a factor of 10) more validation work 
for a given program. This is, bluntly speaking, impossible. 

|>    Now, my question: anyone knows if there is any version supporting the
|> CMU/IP package? And if not, does anyone know where can I get (if it is
|> freely available) the sources so that I can try to port the program to
|> CMU/IP?
|> 

Peter Gad is 'master of the code', and the only versions should be those 
maintained by him. 

|>    If it is not available, does anyone know who the authors are and how
|> can I contact them?
|> 

The 'Nodes of EMBnet'  database tells you more: 

 -->  4.  EMBNet BioInformation Resource Switzerland/
  -->  1.  About EMBnet/
   -->  7.  Index of 'The Nodes of EMBnet' database  <?>
  
    Words to search for:   gad


 -->  1.  sweden   ..t BioInformation Resource Switzerland/About EMBnet/nodes/.
      2.  EMBnet News   ..et BioInformation Resource Switzerland/About EMBnet/.


                                                 Contact person(s):  Peter Gad
                                                                              
                           Electronic Mail contact(s):   gad@perrier.embnet.se
                                                                              
                                     Tel.: +46-18-174016 , Fax.: +46-18-551759




Let me add a final word on updating: the transfer of the data is only half of 
the problem, as you also need to integrate the data into your environment. We 
are currently running an EMBnet project to document the routines of the data 
transfer procedures and the integration work. The report will be available 
once the project is finished. 

Regards
Reinhard 




DISCLAIMER
Note that  the software  mentioned  resembles  Computer  Program(s)  which 
require a license in order to be run unless stated otherwise in  a  state-
ment  codistributed with the software. The use of the program(s) was  men-
tioned  within  a specific problem or example and must not be used to con-
clude that other  software products cannot possibly do a similar job. 

-- 
  +---------------------------+-------------------------------------------+
  |    Dr. Reinhard Doelz     | Tel. x41 61 2672247    Fax x41 61 2672078 |
  |      Biocomputing         | electronic Mail       doelz@urz.unibas.ch |
  |Biozentrum der Universitaet+-------------------------------------------+
  |   Klingelbergstrasse 70   | EMBnet         embnet@comp.bioz.unibas.ch |
  |CH 4056 Basel  SWITZERLAND | Switzerland       gopher.embnet.unibas.ch |
  +---------------------------+------------- http://beta.embnet.unibas.ch/

