From owner-embldatabank@net.bio.net Mon May 02 23:00:00 1994
Path: biosci!PC3Q.CORP.HARRIS.COM!pmc
From: pmc@PC3Q.CORP.HARRIS.COM (Paul MacGyver Carman)
Newsgroups: bionet.molbio.embldatabank
Subject: Mailing List
Date: 3 May 1994 07:30:48 -0700
Organization: BIOSCI International Newsgroups for Molecular Biology
Lines: 18
Sender: daemon@net.bio.net
Distribution: bionet
Message-ID: <9405031014.aa20894@pc3q.pc3q.corp.harris.com>
NNTP-Posting-Host: net.bio.net

Would you please send more info on the Biology job/mailing list?

Thanks!

-Paul


********************************************************************************
*                          *						       *
* Paul MacGyver Carman     * He who devotes himself to learning		       *
* Harris Corporate HQ      *  seeks from day to day to increase his knowledge. *
* 1025 W. Nasa Blvd. MS 75 * He who devotes himself to knowing his true nature *
* Melbourne, FL 32919      *  seeks from day to day to diminish his doing.     *
* (407) 724-3205           *						       *
* (407) 724-3888 (fax)     *     Lao Tzu			               *
* pcarman@harris.com       *      "Tao Te Ching - The Way of Power"	       *
*                          *						       *
********************************************************************************

From owner-embldatabank@net.bio.net Mon May 02 23:00:00 1994
Path: biosci!daresbury!trane.uninett.no!sunic!EU.net!chsun!elna.ethz.ch!usenet
From: svuilleu@micro.biol.ethz.ch
Newsgroups: bionet.molbio.embldatabank
Subject: total number of bases?
Date: 3 May 1994 19:41:02 GMT
Organization: none
Lines: 15
Distribution: world
Message-ID: <2q69ce$ra0@elna.ethz.ch>
NNTP-Posting-Host: b22-pro486-2.ethz.ch

Hi all,

I can't seem to find a reliable current estimate of the total 
number of bases in all the different sequences stored in 
readily accessible databases (genbank, embl...). Also, 
from the last genembl release, I gather there are now about 
180'000 gene sequences available...Is that it?
Guesses, insights and pointers?
Thank you for your time

Stéphane Vuilleumier
Mikrobiologisches Institut
ETH Zürich
Switzerland                            svuilleu@micro.biol.ethz.ch
 

From owner-embldatabank@net.bio.net Tue May 03 23:00:00 1994
Path: biosci!agate!howland.reston.ans.net!news.cac.psu.edu!news.tc.cornell.edu!travelers.mail.cornell.edu!cornell!batcomputer!ghost.dsi.unimi.it!univ-lyon1.fr!news
From: duret@evoserv.univ-lyon1.fr (Laurent Duret)
Newsgroups: bionet.molbio.embldatabank
Subject: Re: total number of bases?
Date: 4 May 1994 06:07:23 GMT
Organization: Universite Claude Bernard - Lyon 1
Lines: 49
Distribution: world
Message-ID: <2q7e2r$7f6@cismsun.univ-lyon1.fr>
References: <2q69ce$ra0@elna.ethz.ch>
Reply-To: duret@evoserv.univ-lyon1.fr
NNTP-Posting-Host: evoserv.univ-lyon1.fr

In article <2q69ce$ra0@elna.ethz.ch>, svuilleu@micro.biol.ethz.ch writes:
> Hi all,
> 
> I can't seem to find a reliable current estimate of the total 
> number of bases in all the different sequences stored in 
> readily accessible databases (genbank, embl...). Also, 
> from the last genembl release, I gather there are now about 
> 180'000 gene sequences available...Is that it?
> Guesses, insights and pointers?
> Thank you for your time
> 
> Stéphane Vuilleumier
> Mikrobiologisches Institut
> ETH Zürich
> Switzerland                            svuilleu@micro.biol.ethz.ch
>  


The sizes of the current releases of GenBank and EMBL are:

GenBank Release 82 (15 April 1994): 180,589,455 bases; 169,896 sequences;

EMBL Library Release 38  (March 1994): 179,346,566 bases; 171,787 sequences;

The NCBI also maintains a non-redundant database that is updated daily:
  nr    Non-redundant PDB+GBUpdate+GenBank+EmblUpdate+EMBL:

5:07 AM EDT May 3 1994: 184,980,203 bases; 173,749 sequences;

For more information, you can subscribe to the NCBI newsletter:
============================================================================
 For a free subscription to "NCBI News", the NCBI newsletter, send a request
 along with your name and postal mailing address to:  info@ncbi.nlm.nih.gov
============================================================================

Laurent Duret

================================================================
Laboratoire de Biometrie, Genetique et Biologie des Populations
Bat 741 - URA CNRS 243 Universite Claude Bernard - Lyon I
43, Bd du 11 Novembre 1918 
69622 Villeurbanne cedex FRANCE

Tel: 	+33 72.44.81.42
Fax:	+33 78.89.27.19
E-mail:	duret@biomserv.univ-lyon1.fr
================================================================



From owner-embldatabank@net.bio.net Tue May 03 23:00:00 1994
Path: biosci!agate!howland.reston.ans.net!EU.net!uknet!daresbury!not-for-mail
From: Massimo Delledonne <DELLE%IPCUCSC.earn@earn-relay.ac.uk>
Newsgroups: bionet.molbio.embldatabank
Subject: same sequence is different in EMBL and GENBANK
Date: 4 May 1994 17:06:48 +0100
Lines: 75
Sender: daemon@mserv1.dl.ac.uk
Distribution: bionet
Message-ID: <2q8h6o$av8@mserv1.dl.ac.uk>
Original-To: embl-db <embl-db@dl.AC.UK>

Hello netters,
About 6 months ago we decided to amplify a protease inhibitor from the soybean
Bowman-Birk family. With the GCG program I found, on the Italian EMBnet node,
the sequence K01967 (GMBBI) with this title:
"Soybean bowman-birk protease inhibitor, complete coding region"
13-JUN-1985 (Rel 06, Created)
22-APR-1990 (Rel. 23, Last updated, Version 1)
Looking at the electronic sequence and at the published sequence cited in the
record (Molecular cloning and analysis of a gene coding for the Bowman-Birk
protease inhibitor in soybean. J. Biol. Chem. 259, 9883-9890 (1984)), I
concluded it was the sequence I was looking for.
Then I designed the oligos to amplify the gene, but I was never able to amplify
it .... I decided to try another region; maybe the problem was a sequence
error near the 3' end of my 3' oligonucleotide (a little confusing as an
explanation, but don't try to improve my English .. no way), so I changed
region ... result: nothing .. still not able to find any band for this roughly
300 bp fragment. PCR on the same DNA with primers for other protease inhibitors
was OK, but Bowman-Birk.....
 
A few months ago I discovered the amazing world of the Internet, and the power
of gopher .. Having discovered merlot.welch.jhu.edu, I found a way to go
everywhere in the world without knowing any Internet address ... A couple of
days ago, I tried to check for new Bowman-Birk sequences in GenBank via gopher,
and I discovered this terrible thing:   8-I
 
Sequence number K01967 (the same as in EMBL), name SOYCIIPI (different from GMBBI)
definition: Soybean CII protease inhibitor gene, complete coding sequence
PLN 1986
Interesting, same number but different definition and different name ...
I was sure I had found a different sequence ..... but .... I had never found
this sequence in EMBL ... why ?  ...
Authors ... the same
Title of the paper ... the same ... funny ... I never found this sequence in
that article .... mmmmhhhh .. interesting ... (same paper in the same volume at
the same pages) .... mumble mumble ...
And then I discovered the thing that made me very nervous (and that was missing
from the EMBL entry): another paper:
ERRATUM. Molecular cloning and analysis of a gene coding for the Bowman-Birk
protease inhibitor in soybean. J. Biol. Chem. 260, 7806-7806 (1985)
 
mumble mumble ... what happened? Are these the same sequence or not ?
 
I ran to the library, got the erratum ... the original sequence was wrong  :-(
The gene is not a Bowman-Birk gene anymore... and the sequence is very different
at the 3' terminus of the fragment I was trying to amplify! That was the
reason!
 
Now my question is:
 
 
 
The GenBank sequence was updated in 1986
The EMBL sequence was updated in 1990
The GenBank sequence is the correct one
 
Who made the mistake? The authors, who submitted the correct version to GenBank
and not to EMBL (but in this case why was the EMBL sequence updated in 1990)?
or: GenBank sent EMBL an incorrect update ...
or: EMBL made a mistake
 
In any case I think it is really bad to have to check both databases to be
sure the sequence you are looking for is correct .... And .. how many sequences
in both databases have the same problem?
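
One way to approach the "how many sequences" question is to compare, accession by accession, the sequences held in the two databases. The following is only a minimal sketch under stated assumptions: the function name, the dictionary inputs (accession number mapped to sequence string), and the toy data are all hypothetical, and loading the real flat files is left out entirely.

```python
def discordant_accessions(db_a, db_b):
    """Map each shared accession whose sequences differ to the 1-based
    position of the first difference (case is ignored)."""
    report = {}
    for acc in sorted(set(db_a) & set(db_b)):
        a = db_a[acc].upper()
        b = db_b[acc].upper()
        if a == b:
            continue
        # First mismatching position; if one sequence is a prefix of the
        # other, report the position just past the shorter one.
        pos = next(
            (i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y),
            min(len(a), len(b)) + 1,
        )
        report[acc] = pos
    return report


# Invented toy records, NOT the real K01967 entries.
embl = {"X00001": "ATGGCCTTAA", "X00002": "ATGAAATAG"}
genbank = {"X00001": "ATGGCCTGAA", "X00002": "atgaaatag", "X00003": "ATG"}
print(discordant_accessions(embl, genbank))
```

Accessions present in only one database (X00003 here) are skipped on purpose; listing those would be a separate check.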
 
Massimo Delledonne
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                                  %%                     %%
Massimo Delledonne                              _%%%_____________________%%%_
Istituto di Genetica vegetale                    %%%  ()     %%      ()  %%%
Universita' Cattolica S.C. Piacenza -ITALY-       %%   ()   ("")    ()   %%
Bitnet    delle@ipcucsc                                ( )___%%___( )
Internet  delle@imicilea.cilea.it                        (        )
                                                          (      )
"A wrong model is better than no model at all"  (Goethe)   (%%%%)

From owner-embldatabank@net.bio.net Tue May 03 23:00:00 1994
Newsgroups: bionet.molbio.embldatabank
Path: biosci!bcm!cs.utexas.edu!howland.reston.ans.net!darwin.sura.net!fconvx.ncifcrf.gov!fcsparc6!toms
From: toms@fcsparc6.ncifcrf.gov (Tom Schneider)
Subject: Re: same sequence is different in EMBL and GENBANK
Message-ID: <CpAt6w.12J@ncifcrf.gov>
Sender: usenet@ncifcrf.gov (C News)
Nntp-Posting-Host: fcsparc6.ncifcrf.gov
Organization: Frederick Cancer Research and Development Center
References: <2q8h6o$av8@mserv1.dl.ac.uk>
Distribution: bionet
Date: Wed, 4 May 1994 22:01:43 GMT
Lines: 30

In article <2q8h6o$av8@mserv1.dl.ac.uk> Massimo Delledonne
<DELLE%IPCUCSC.earn@earn-relay.ac.uk> writes:

| The GENEBANK sequence was uptade in 1986
| The EMBL sequence was uptade in 1990
| The GENBANK sequence is the correct one
|  
| Who made the mistake ? The autors who submitted the correct version to Genbank
| and not to EMBL (but in this case why the EMBL sequence was updated in 1990)
| or: GENBANK submitted to EMBL an incorrect update ...
| or: EMBL made a mistake
|  
| In any case I think it is really bad to have to check both the databases to be
| sure the sequence you are looking for is correct .... And .. how many sequences
| in both the databases have the same problem?
|  
| Massimo Delledonne

Although horrifying, this is a perfect example of one of the reasons that
"federations" of databases are likely to fail horribly.  EMBL and GENBANK are
supposed to work closely together!  What will happen when we have 50 sequence
databases and they DON'T even try to work together?  People will be checking
one database against the other and finding errors.  If a database gets a
reputation for handling data poorly, won't people simply stop using it?

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

From owner-embldatabank@net.bio.net Wed May 04 23:00:00 1994
Path: biosci!agate!howland.reston.ans.net!newsserver.jvnc.net!raffles.technet.sg!nuscc.nus.sg!mcbtansh
From: mcbtansh@leonis.nus.sg (Tan Shyh Han)
Newsgroups: bionet.molbio.embldatabank
Subject: Sequence of Sp1 Transcription Factor
Date: 5 May 1994 04:42:39 GMT
Organization: National University of Singapore
Lines: 4
Message-ID: <2q9tfv$95u@nuscc.nus.sg>
NNTP-Posting-Host: leonis.nus.sg
X-Newsreader: TIN [version 1.2 PL0]


Does anyone know the full-length sequence of the Sp1 transcription factor? I
have searched the databases, but the sequence posted by Kadonaga et al. only
contains about 3/4 of the sequence (the 3' portion).

From owner-embldatabank@net.bio.net Wed May 04 23:00:00 1994
Path: biosci!agate!howland.reston.ans.net!pipex!lyra.csx.cam.ac.uk!bjd12
From: bjd12@cus.cam.ac.uk (Ben Davis)
Newsgroups: bionet.general,bionet.molbio.embldatabank,bionet.molbio.proteins,bionet.software
Subject: databases search - protein size
Date: 5 May 1994 11:36:17 GMT
Organization: U of Cambridge, England
Lines: 18
Message-ID: <2qalnh$36h@lyra.csx.cam.ac.uk>
NNTP-Posting-Host: grus.cus.cam.ac.uk
X-Newsreader: TIN [version 1.2 PL2]
Xref: biosci bionet.general:8904 bionet.molbio.embldatabank:318 bionet.molbio.proteins:1879 bionet.software:8099

Hi

I'm trying to find a way of searching databases for proteins (ideally from
E. coli) with a mass in a given range (say between 10 and 18 kDa).

Anyone got any suggestions ?

Ben

--
______________________________________________________________________________

Ben Davis,
MRC Protein Function and Design,
Cambridge, UK
______________________________________________________________________________

"They can make me do it, but they can't make me do it with dignity."

From owner-embldatabank@net.bio.net Wed May 04 23:00:00 1994
Path: biosci!daresbury!trane.uninett.no!sunic!EU.net!uknet!demon!news2.sprintlink.net!news.sprintlink.net!sundog.tiac.net!usenet.elf.com!rpi!batcomputer!ghost.dsi.unimi.it!univ-lyon1.fr!news
From: duret@evoserv.univ-lyon1.fr (Laurent Duret)
Newsgroups: bionet.molbio.embldatabank
Subject: Re: databases search - protein size
Date: 5 May 1994 15:53:44 GMT
Organization: Universite Claude Bernard - Lyon 1
Lines: 182
Distribution: world
Message-ID: <2qb4q8$j2v@cismsun.univ-lyon1.fr>
References: <2qalnh$36h@lyra.csx.cam.ac.uk>
Reply-To: duret@evoserv.univ-lyon1.fr
NNTP-Posting-Host: evoserv.univ-lyon1.fr

In article <2qalnh$36h@lyra.csx.cam.ac.uk>, bjd12@cus.cam.ac.uk (Ben Davis) writes:
> Hi
> 
> I'm trying to find a way of searching databases for proteins (ideally from
> E.Coli) with a mass in a given range (say between 10 and 18 kDa).
> 
> Anyone got any suggestions ?
> 


Manolo Gouy and co-workers have written a very good program named ACNUC
that allows one to make such a query:

- select complete coding sequences from E. coli whose length is between
  250 and 500 nt (which roughly corresponds to 10 - 18 kDa when translated into
  amino acids), and then translate them into protein.

example:

(lines preceded by : are those entered by the user)
//////////////////////////////////////////////////////////////////////////////////
:query
             ****     ACNUC Data Base Content      ****                        
                GenBank Release 82 (15 April 1994)                             
180,589,455 bases; 169,896 sequences;  90,795 subsequences; 69,376 references. 
Software by M. Gouy & M. Jacobzone, Laboratoire de biometrie, Universite Lyon I

Command? (H for command list)
:select
Enter your selection criteria, or H(elp) (EX: sp=homo sapiens et k=globin@)
:sp=escherichia coli et t=cds et no k=partial
Sequence list named LIST1      contains  3988 seqs (among which  3889 subseqs)
Command? (H for command list)
:modify
List name, or H(elp)? [default=LIST1]

You can modify this sequence list according to:
1. Confirmation/Suppression of sequences from list
2. Sequence length
3. Sequence insertion date
4. Replace subsequences by seq containing them
5. Add subsequences of seq in list
Enter your choice (1-5):
:2
Give your length threshold: (ex:  L>200  or   L<1000)
:l>250
There are now  3698 sequences in list: LIST1     
Command? (H for command list)
:modify
List name, or H(elp)? [default=LIST1]

You can modify this sequence list according to:
1. Confirmation/Suppression of sequences from list
2. Sequence length
3. Sequence insertion date
4. Replace subsequences by seq containing them
5. Add subsequences of seq in list
Enter your choice (1-5):
:2
Give your length threshold: (ex:  L>200  or   L<1000)
:l<500
There are now   772 sequences in list: LIST1     
Command? (H for command list)
:extract
List name, sequence, or accession #, or H(elp)? [default=LIST1]

Name of file to write extracted sequences? (or GCG)
:my_file
Do you want:
  (1) Simple extraction
  (2) Translate into protein and extract
  (3) Fragments or adjacent sequences
  (4) Regions defined by sequence FEATURES
  (5) Regions adjacent to sequence FEATURES
:2
Translating and extracting CB2PIL              
Translating and extracting EC2MIN.ILVH         
Translating and extracting EC2MIN.PE8          
...

Translating and extracting U01159.TRBJ         
Number of extracted sequences:   772
Command? (H for command list)
:stop
STOP: End of ACNUC retrieval program
////////////////////////// end of the example /////////////////////////////


Thus GenBank (release 82) contains 772 complete E. coli coding sequences
encoding proteins whose size is roughly in the 10 - 18 kDa range.
These sequences are saved in the file named "my_file".
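
The rough correspondence between 250-500 nt and 10 - 18 kDa can be checked with a short calculation. This is a sketch only, assuming the common rule of thumb of about 110 Da per amino acid residue (a figure not given in this post); the function name and the option to count the stop codon are illustrative and are not part of ACNUC.

```python
import math

# Assumption: rule-of-thumb average mass of one amino acid residue, in daltons.
AVG_RESIDUE_MASS_DA = 110.0


def cds_length_range(min_kda, max_kda, include_stop=True):
    """Convert a protein mass range (kDa) into a coding-sequence length
    range (nt), using the average residue mass as an approximation."""
    extra = 3 if include_stop else 0  # one stop codon
    min_residues = math.ceil(min_kda * 1000 / AVG_RESIDUE_MASS_DA)
    max_residues = math.floor(max_kda * 1000 / AVG_RESIDUE_MASS_DA)
    return min_residues * 3 + extra, max_residues * 3 + extra


print(cds_length_range(10, 18))  # (276, 492)
```

With this approximation the 10 - 18 kDa window maps to about 276-492 nt, so the 250-500 nt thresholds used in the session are slightly generous. Real proteins deviate from the average residue mass, so masses near the limits should be recomputed from the translated sequences.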

You can also access and save all the GenBank information attached to these
sequences.

Here I only have GenBank, but ACNUC is also available for EMBL and PIR-NBRF.

ACNUC allows many different requests with a relatively simple query language.

ACNUC is distributed by anonymous ftp from the internet address:
biom3.univ-lyon1.fr    or, numerically,    134.214.92.37
The directory there is /pub/acnuc

I include below the README file provided at this FTP site.

I hope this helps,


Laurent Duret

================================================================
Laboratoire de Biometrie, Genetique et Biologie des Populations
Bat 741 - URA CNRS 243 Universite Claude Bernard - Lyon I
43, Bd du 11 Novembre 1918 
69622 Villeurbanne cedex FRANCE

Tel: 	+33 72.44.81.42
Fax:	+33 78.89.27.19
E-mail:	duret@biomserv.univ-lyon1.fr
================================================================



============================= README   =================================

                                ACNUC
             A RETRIEVAL SYSTEM FOR GENBANK, EMBL, AND NBRF/PIR

ACNUC is a retrieval system for the nucleotide sequence databases GenBank
or EMBL and for the protein sequence data base NBRF/PIR.

ACNUC is known to run on Sun (SunOs or Solaris), IBM Risc workstations, 
SGI computers, Dec-alpha systems, and VAX/VMS systems. 
It should be easily installed on most unix platforms. Contact me for help
for other unix systems.


ACNUC allows one to select sequences by many criteria from these 3 data
bases, to translate protein-coding genes into protein, and to extract
selected sequences into user files. ACNUC is unique in providing direct access
to coding regions (e.g. protein coding regions, tRNA or rRNA coding regions)
of DNA fragments present in GenBank and in EMBL.

ACNUC is distributed by anonymous ftp from the internet address:
biom3.univ-lyon1.fr    or, numerically,    134.214.92.37
The directory there is /pub/acnuc

ACNUC is available in two different formats:
	1) Interfaced with the flat files as distributed by GenBank, EMBL, and
	NBRF/PIR. These flat files can be obtained from the data base
	distribution centers by ftp, by tape, or by cd-rom.

	2) (NOT FOR NBRF/PIR) Interfaced with the GCG package.

If the GCG package is installed on your site, then choose format 2) above
because you will not duplicate the data base on your computer. You install
a new database release for the GCG package yourself. Then you proceed to ACNUC 
installation that will access GCG files in read-only mode.
If the GCG package is not installed on your site, choose format 1) above.
If the database flatfiles are not already on your site, the acnuc installation
procedure provides a procedure to get these files by ftp. Flat files are 
accessed by ACNUC in read-only mode.

ACNUC is made of:
	1) a data base, that can be in one of 2 formats as said above;
	2) a retrieval program, named querydiv.
	3) a set of index files that are distributed by ftp by us.

The retrieval program is written in FORTRAN (with a few routines written in c).

ACNUC is updated at each new GenBank, EMBL, and NBRF/PIR release.

ACNUC installation is described in file install_acnuc.doc.

M. Gouy
Laboratoire de Biometrie
Universite Claude Bernard
69622 VILLEURBANNE, France
+33 72.44.81.42
E-mail:  mgouy@evomol.univ-lyon1.fr
==================================== end of file ================================


From owner-embldatabank@net.bio.net Thu May 05 23:00:00 1994
Path: biosci!daresbury!bioftp.unibas.ch!embl-heidelberg.de!stoehr
From: stoehr@embl-heidelberg.de (Peter Stoehr)
Newsgroups: bionet.molbio.embldatabank
Subject: Re: same sequence is different in EMBL and GENBANK
Message-ID: <1994May6.172800.171123@eros.embl-heidelberg.de>
Date: 6 May 94 17:27:59 +0100
References: <2q8h6o$av8@mserv1.dl.ac.uk>
Distribution: bionet
Organization: European Molecular Biology Laboratory
Lines: 25

In article <2q8h6o$av8@mserv1.dl.ac.uk>, Massimo Delledonne
<DELLE%IPCUCSC.earn@earn-relay.ac.uk> writes:

> Who made the mistake ? The autors who submitted the correct version to Genbank
> and not to EMBL (but in this case why the EMBL sequence was updated in 1990)
> or: GENBANK submitted to EMBL an incorrect update ...
> or: EMBL made a mistake
>  
> In any case I think it is really bad to have to check both the databases to be
> sure the sequence you are looking for is correct .... And .. how many sequences
> in both the databases have the same problem?
>  
> Massimo Delledonne

The entry concerned, K01967, has been updated today in the EMBL database using
the current version from GenBank (the originator). Clearly we missed
the critical update which occurred back in 1986, and we apologise sincerely that
this has caused you such problems.

Regards,
Peter Stoehr
EMBL Data Library

ps I am not yet sure what the update was to the EMBL entry in 1990, but
   it wasn't this important one.

From owner-embldatabank@net.bio.net Thu May 05 23:00:00 1994
Path: biosci!JHUVM.HCF.JHU.EDU!RCARPER%JHUHYG.BITNET
From: RCARPER%JHUHYG.BITNET@JHUVM.HCF.JHU.EDU (robin carper)
Newsgroups: bionet.molbio.embldatabank
Subject: EMBL3 sequences
Date: 6 May 1994 05:07:49 -0700
Organization: BIOSCI International Newsgroups for Molecular Biology
Lines: 6
Sender: daemon@net.bio.net
Distribution: bionet
Message-ID: <199405061207.FAA28174@net.bio.net>
NNTP-Posting-Host: net.bio.net

Has anyone designed primers for up- and downstream of the polylinker sites in
EMBL3? Or does anybody have the exact sequences available? I know that the
lambda 1059 from which the EMBLs were derived should be much the same as the
sequence for lambda posted in GenBank, but... Does anyone have reliable primers
for the flanking sides? I am trying to amplify some EMBL3 inserts and would
appreciate any info. Stratagene does not have the sequence. Thanks.

From owner-embldatabank@net.bio.net Thu May 05 23:00:00 1994
Newsgroups: bionet.molbio.embldatabank,bionet.molbio.genbank
Path: biosci!daresbury!bioftp.unibas.ch!embnet
From: embnet@comp.bioz.unibas.ch (EMBnet Switzerland)
Subject: Re: total number of bases?
Message-ID: <1994May4.211607.5649@comp.bioz.unibas.ch>
Organization: EMBnet Switzerland [Basel]
X-Newsreader: TIN [version 1.2 PL2]
References: <2q69ce$ra0@elna.ethz.ch> <2q7e2r$7f6@cismsun.univ-lyon1.fr>
Date: Wed, 4 May 1994 21:16:07 GMT
Lines: 96
Xref: biosci bionet.molbio.embldatabank:322 bionet.molbio.genbank:1621

...
: > I can't seem to find a reliable current estimate of the total 
: > number of bases in all the different sequences stored in 
: > readily accessible databases (genbank, embl...). Also, 

Stephane, 
the number of sequences which you have on AEOLUS (your computer running GCG) 
should be the following: 
EMBL 
       894 bb.seq
     16665 em_ba.seq
     33315 em_est.seq
      5676 em_fun.seq
     10873 em_in.seq
      5139 em_om.seq
      5462 em_or.seq
      5599 em_ov.seq
       967 em_ph.seq
      8005 em_pl.seq
     30542 em_pr.seq
     19695 em_ro.seq
      5318 em_sy.seq
      4872 em_un.seq
     15649 em_vi.seq
      3116 patent.seq
    171787 total

GENBANK exclusion set (GENBANK 82 - EMBL 38 with GCG) 
       212 gb_ba.seq
       356 gb_est.seq
       220 gb_in.seq
        97 gb_om.seq
       111 gb_ov.seq
         0 gb_pat.seq
         1 gb_ph.seq
       223 gb_pl.seq
       657 gb_pr.seq
       250 gb_ro.seq
        66 gb_st.seq
        28 gb_sy.seq
        17 gb_un.seq
       274 gb_vi.seq
         1 gbphg.seq
      2513 total

and the weekly updates from EMBnet Switzerland 

       901 gb_new.seq   - all new GENBANK not in the EMBL updates 
      7717 xembl.seq    - all really new EMBL entries 
      7250 xxembl.seq 	- all entries updated by EMBL wrt last release 

I wouldn't use the basepair numbers, though, as mentioned below, 
for statistics as the data are based on ACCESSION numbers and therefore 
get you a lot of redundancies. 


: The size of current releases of GenBank and EMBL is:
: GenBank Release 82 (15 April 1994): 180,589,455 bases; 169,896 sequences;
: EMBL Library Release 38  (March 1994): 179,346,566 bases; 171,787 sequences;

: The NCBI maintains a non-redundant database daily updated
:   nr    Non-redundant PDB+GBUpdate+GenBank+EmblUpdate+EMBL:

: 5:07 AM EDT May 3 1994: 184,980,203 bases; 173,749 sequences;


The HASSLE server of EMBnet Switzerland recalculates a 'nr' for both 
proteins and DNA on a weekly basis. Last saturday we had, based on 
EMBL with Genbank added, and EMBL updates with Genbank updates added, 
slightly less than the data reported above (but  this was from April 30). 
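
To illustrate why counting by accession number overstates the totals, here is a minimal sketch of collapsing exact duplicates before counting bases. It is an assumption-laden toy: the function names and data are hypothetical, only exact (case-insensitive) duplicates are merged, and it does not claim to reproduce how the NCBI 'nr' set or the HASSLE server are actually built.

```python
from collections import defaultdict


def collapse_identical(entries):
    """Group (accession, sequence) pairs by sequence content.

    Returns {uppercased sequence: [accessions sharing it]}."""
    groups = defaultdict(list)
    for accession, sequence in entries:
        groups[sequence.upper()].append(accession)
    return dict(groups)


def base_counts(entries):
    """Return (raw_bases, non_redundant_bases) for the given entries."""
    entries = list(entries)
    raw = sum(len(sequence) for _, sequence in entries)
    non_redundant = sum(len(sequence) for sequence in collapse_identical(entries))
    return raw, non_redundant


# Invented toy entries: the first two are the same sequence under two accessions.
toy = [("A1", "ATGC"), ("B7", "atgc"), ("C3", "GGGTTT")]
print(base_counts(toy))  # (14, 10)
```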

(specifically to Stephane) 
Unfortunately, the host you use runs a TCP/IP product which doesn't 
support HASSLE at the moment (Wollongong), but times may come where 
you support UCX, Multinet, or TCPware. If you need an account on the 
EMBnet Switzerland UNIX cluster, let me know. 
(for all) 
HASSLE is available for most flavours of UNIX and the VMS emulations 
of IP mentioned above. Contact us for details - both customer and server 
mode are supported in full source. Services within EMBnet running via 
HASSLE are BLAST, (T)FASTA, PROFILE and S&W search (via the Biocellerator
at EMBnet Israel at Weizmann/Rehovot) and MOWSE (from EMBnet UK at Daresbury). 
MEDFETCH is a first SRS type gateway to ENTREZ-based Swissprot, and 
FETCH gets database entries in GCG format. 



Regards
Reinhard Doelz
EMBnet Switzerland

-- 
 
+----------------------------------+-------------------------------------+
|     EMBnet SWITZERLAND           | RFC     embnet@comp.bioz.unibas.ch  |
|      Biocomputing                | (small) FTP and GOPHER server       |

From owner-embldatabank@net.bio.net Thu May 05 23:00:00 1994
Newsgroups: bionet.molbio.embldatabank
Path: biosci!daresbury!bioftp.unibas.ch!doelz
From: doelz@comp.bioz.unibas.ch (Reinhard Doelz)
Subject: Re: same sequence is different in EMBL and GENBANK
Message-ID: <1994May5.062921.10972@comp.bioz.unibas.ch>
Organization: EMBnet Switzerland [Basel]
X-Newsreader: TIN [version 1.2 PL2]
References: <2q8h6o$av8@mserv1.dl.ac.uk> <CpAt6w.12J@ncifcrf.gov>
Distribution: bionet
Date: Thu, 5 May 1994 06:29:21 GMT
Lines: 41

Tom Schneider (toms@fcsparc6.ncifcrf.gov) wrote:
...
: In article <2q8h6o$av8@mserv1.dl.ac.uk> Massimo Delledonne
: <DELLE%IPCUCSC.earn@earn-relay.ac.uk> writes:
...
: | Who made the mistake ? The autors who submitted the correct version to Genbank
...
: Although horrifying, this is a perfect example of one of the reasons that
: "federations" of databases are likely to fail horribly.  EMBL and GENBANK are
: supposed to work closely together!  What will happen when we have 50 sequence
: databases and they DON'T even try to work together?  People will be checking
: one database against the other and finding errors.  If a database gets a
: reputation for handling data poorly, won't people simply stop using it?

I strongly argue against this. We now have PIR versus Swissprot, 
Los Alamos/NCBI/EMBL, not to mention sources originating from the Pacific Rim ...
We MUST make the databases work together. We'll never cope with the 
flood otherwise. I am scared to read that you seem to suggest building 
yet another new database, or even to "stop" using a database because it is 
poor. Can we afford, fund, pay for this? 

Regards
Reinhard 


PS: We have most of the databases available worldwide and crosscheck 
them, e.g. GenBank vs EMBL, via standard programs like 'GCG' or 'nrdb'
from GCG Inc. and NCBI, respectively. Counting everything, we need about
20 GByte for this effort. The justification is that we are an EMBnet node and
try to deliver what people need. Would your suggestion be to shift all that 
to the users? END users?  It might be useful to bring up oddities like the one 
above in these newsgroups, but database problems should be reported 
directly to the database vendors. I found both GENBANK and EMBL 
staff very helpful (special thanks to Peter Stoehr). They really appreciate
detailed reports. 

-- 
  +---------------------------+-------------------------------------------+
  |    Dr. Reinhard Doelz     | Tel. x41 61 2672247    Fax x41 61 2672078 |
  |      Biocomputing         | electronic Mail       doelz@urz.unibas.ch |
  |Biozentrum der Universitaet+-------------------------------------------+

From owner-embldatabank@net.bio.net Thu May 05 23:00:00 1994
Path: biosci!agate!msuinfo!harbinger.cc.monash.edu.au!newshost.anu.edu.au!helios!ingrid
From: ingrid@helios.anu.edu.au (Ingrid Jakobsen)
Newsgroups: bionet.molbio.embldatabank
Subject: Re: same sequence is different in EMBL and GENBANK
Date: 6 May 1994 08:07:36 GMT
Organization: Australian National University
Lines: 49
Sender: ingrid@helios (Ingrid Jakobsen)
Distribution: bionet
Message-ID: <2qcts8$a2i@manuel.anu.edu.au>
References: <2q8h6o$av8@mserv1.dl.ac.uk>
NNTP-Posting-Host: 150.203.7.83

In article <2q8h6o$av8@mserv1.dl.ac.uk>, Massimo Delledonne <DELLE%IPCUCSC.earn@earn-relay.ac.uk> writes:

(Sad story deleted about EMBL and GenBank entries being different
for the same Accession Number)

|>  
|> Now my question is:
|>  
|>  
|>  
|> The GENEBANK sequence was uptade in 1986
|> The EMBL sequence was uptade in 1990
|> The GENBANK sequence is the correct one
|>  
|> Who made the mistake ? The autors who submitted the correct version to Genbank
|> and not to EMBL (but in this case why the EMBL sequence was updated in 1990)
|> or: GENBANK submitted to EMBL an incorrect update ...
|> or: EMBL made a mistake
|>  
|> In any case I think it is really bad to have to check both the databases to be
|> sure the sequence you are looking for is correct .... And .. how many sequences
|> in both the databases have the same problem?
|>  
|> Massimo Delledonne

I can sympathise totally. It has also been my experience that there are
discrepancies between EMBL and GenBank, and you are quite right, you should
be able to get the right answer from just one database.

From what I have seen, the problem is usually with the EMBL entry.
If the authors send corrections to GenBank, it is updated but EMBL isn't
always, while I haven't seen corrections in EMBL not found in GenBank yet.

I have also seen duplicate entries eliminated from GenBank, but kept
on EMBL, and sequences withdrawn because corrections showed them to be
identical to previous sequences, but retained on EMBL.

So my solution in general is to stick to GenBank, and not use EMBL. I know
this is a sad thing to say, but as Massimo has also found out, it just 
doesn't seem as up-to-date. I don't know which side of the Atlantic the
problem is on: GenBank not sending information on, or EMBL not using it.

Unfortunately, I still run into the problem regularly, because the 
"non-redundant" database for blast includes EMBL entries not on GenBank -
which usually means old, erroneous sequence or duplicates of other
sequences, which GenBank have obviously gone to some trouble to eliminate
from their own database.

Ingrid

From owner-embldatabank@net.bio.net Sat May 07 23:00:00 1994
Newsgroups: bionet.molbio.embldatabank,bionet.molbio.genbank,bionet.molbio.proteins
Path: biosci!daresbury!bioftp.unibas.ch!doelz
From: doelz@comp.bioz.unibas.ch (Reinhard Doelz)
Subject: Different entry is same sequence (Re: same sequence is different in EMBL and GENBANK)
Message-ID: <1994May8.084902.4583@comp.bioz.unibas.ch>
Organization: EMBnet Switzerland [Basel]
X-Newsreader: TIN [version 1.2 PL2]
References: <2q8h6o$av8@mserv1.dl.ac.uk> <2qcts8$a2i@manuel.anu.edu.au>
Distribution: bionet
Date: Sun, 8 May 1994 08:49:02 GMT
Lines: 194
Xref: biosci bionet.molbio.embldatabank:325 bionet.molbio.genbank:1622 bionet.molbio.proteins:1900

My apologies... this is a bit long but try to read it carefully. 


Ingrid Jakobsen (ingrid@helios.anu.edu.au) wrote:
: In article <2q8h6o$av8@mserv1.dl.ac.uk>, Massimo Delledonne <DELLE%IPCUCSC.earn@earn-relay.ac.uk> writes:

: (Sad story deleted about EMBL and GenBank entries being different
: for the same Accession Number)

... and some other remarks deleted ... 

: I have also seen duplicate entries eliminated from GenBank, but kept
: on EMBL, and sequences withdrawn because corrections showed them to be
: identical to previous sequences, but retained on EMBL.

: So my solution in general is to stick to GenBank, and not use EMBL. I know
: this is a sad thing to say, but as Massimo has also found out, it just 
: doesn't seem as up-to-date. I don't know which side of the Atlantic the
: problem is on: GenBank not sending information on, or EMBL not using it.

Just as a matter of fairness, blaming anyone on examples won't help, and 
deducing that EMBL is bad is presumably a valued view in the states but 
if I claim GENBANK is bad the US wouldn't tolerate it either :-) 

The point is that the two talk to each other on a computer basis. Computer
parsing programs are a mess, in particular as the two databases don't agree
entirely on their formats; i.e. parsing extends to mapping and voila - there
are problems. The following example is an anecdote I just came across where
both databases have a duplicate. 

The entry GGCOL8 in EMBL (15 April 1994, updated 20-April 1994) is LOCUS 
CHKC1A206 (1-Jun-1984) in GENBANK. However, entry GGCOL8 (9-Jun 1982, updated
6-Jul 1989), with LOCUS GGCOL8 (6-Jul 89), differs only in carrying one more
accession number: J00827 is the first and V00400 the additional one. 

Both entries cite the same paper, Cell 22, 887-892 (1980) - i.e., EMBL seems
to take 14 years before they make a mistake, whereas GENBANK takes only 5 years.

The sequences are entirely identical. I use the GENBANK format here to show 
differences: 
<   AUTHORS   Yamada,Y., Avvedimento,E.V., Mudryj,M., Ohkubo,H., Vogeli,G.,
>   AUTHORS   Yamada,Y., Avvedimento,E., Mudryj,M., Ohkubo,H., Vogeli,G.,

<             amplification of a dna segment containing an exon of 54 bp
>             amplification of a DNA segment containing an exon of 54 bp


I guess (or at least hope) that these are not the reason for the duplication.
The problem comes in the feature table! For the sake of completeness I use EMBL
format here, in truncated form: 

FT   intron          <1. .8            |  FT   source          1. .70
FT                   /note="collagen   |  FT                   /organism="Gallu
FT   prim_transcript <1. .>70          |  FT   CDS             9. .62
FT                   /note="collagen   |  FT                   /note="exon 6"
FT   exon            9. .62            |
FT                   /number=42        |
FT                   /note="collagen   |
FT   exon            9. .62            |
FT                   /note="collagen   |
FT                   putative"         |
FT   intron          63. .>70          |
FT                   /note="collagen   |
FT   source          1. .70            |
FT                   /organism="Gallu  |

One entry says
/note="collagen helipeptide, exon 42 (AA 37 to 54);
and the other says 
/note="exon 6"
--- from the SAME reference. 

One entry lists a CDS; the other has no CDS at all. Why? Simply because 
one entry annotates 9 to 62 as CDS, and the other as mat_peptide: 

(GENBANK format again)

<      mat_peptide     9..62
<                      /partial
<                      /codon_start=1
<                      /note="collagen helipeptide, 2 (AA 37 to 54)"
>      CDS             9..62
>                      /note="exon 6;  NCBI gi: 63306."
>                      /codon_start=1
>                      /translation="GPQGPRGPPGPPGKAGED"

So what is the difference between a mat_peptide and a CDS? 
The gbrel.txt from release 82 tells us 

CDS             Sequence coding for amino acids in protein (includes
                stop codon)
mat_peptide     Mature peptide coding region (does not include stop codon)

and ftable.doc from EMBL 
CDS              Sequence coding for amino acids in protein
exon             Region that codes for part of spliced mRNA


OK, so far the documentation; but GENBANK's precise definition is
contradictory here, as once it is _with_ and once _without_ stop codon. Well,
it isn't quite so, as the last three nucleotides code for D, as stated in 
the translation:    /translation="GPQGPRGPPGPPGKAGED" 
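
A quick arithmetic check supports this. The following is only a minimal sketch in Python, not part of either database's tooling, and the variable names are illustrative: the feature spans 9..62, i.e. 54 bases or 18 codons, and the quoted translation has exactly 18 residues, so no codon is left over for a stop.

```python
# Coordinates and translation as quoted from the GENBANK CDS feature above.
start, end = 9, 62
translation = "GPQGPRGPPGPPGKAGED"

span_bp = end - start + 1      # feature coordinates are inclusive
codons = span_bp // 3          # number of complete codons in the span

print(span_bp, codons, len(translation))

# If the span contained a stop codon, it would hold more codons than
# there are residues in the translation.
has_room_for_stop = codons > len(translation)
print(has_room_for_stop)
```

The 54 bases also agree with the "exon of 54 bp" mentioned in the reference title quoted earlier.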

I haven't analyzed this systematically but I am afraid that inconsistencies 
like this make the database providers' lives difficult. As human intervention
is extremely expensive (manpower) and we (customers) don't want to pay, the 
prediction that it will become worse in the future is a safe guess. 

You rely on BLAST searching? 
Fine. I used the peptide as described above and searched the 'nr' dataset,
which we build in-house from all protein databases available. 

The entry scoring 
 Score = 108 (49.3 bits), Expect = 1.1e-08, P = 1.1e-08
 Identities = 18/18 (100%), Positives = 18/18 (100%)

if looked up in the result, is located at position 8 (as the only 
entirely matching entry - other, irrelevant matches lead the scores) and
does NOT occur in either the SWISSPROT or the PIR database, but only in PATCHX
(Pfeiffer, MIPS Martinsried). Entry: patchx:M25963 ; There, we read: 
LOCUS       CHKCOLA07
DEFINITION  Chicken alpha-2 collagen gene type I gene, exons 13-15
ACCESSION   M25963
SOURCE
  ORGANISM  Gallus gallus
REFERENCE   1
  AUTHORS   Boedtker,H., Finer,M. and Aho,S.
  TITLE     The structure of the chicken alpha-2 collagen gene
  JOURNAL   Ann. N. Y. Acad. Sci. 460, 85-116 (1985)
FEATURES
     CDS             join(M25956:1548. .1617,M25956:3513. .3523,
                     M25956:4131. .4148,M25956:4783. .4818,M25959:182. .265,
                     M25961:205. .261,M25962:609. .653,M25962:755. .808,
                     M25962:1118. .1171,M25962:1539. .1592,M25962:2078. .2131,
                     M25962:2345. .2398,6. .50,287. .340,439. .483) /partial
                     /note="alpha-collagen type I;; NCBI gi: 211605."
                     /codon_start=1


Note that we are now talking about entry M25963, which exists in both EMBL and
GENBANK versions, and that this is exons 13-15, whereas the original entries
talked about exon 42 and exon 6, respectively. 

A DNA comparison reveals: 

 Ggcol8 x M25963           May 8, 1994  10:23  ..

                  .         .         .
      15 CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA 48
         || ||| |||| |||| |||||||     |||||
     284 CAGGGTGCTCGCGGTCTCCCTGGTGAGAGAGGAA 317
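
To put a number on the comparison above, here is a small sketch (plain Python; the helper name `identity` is made up for this illustration) that counts identical positions in the two aligned, gap-free segments exactly as printed.

```python
# The two aligned segments, copied from the comparison above.
ggcol8 = "CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA"   # Ggcol8, positions 15-48
m25963 = "CAGGGTGCTCGCGGTCTCCCTGGTGAGAGAGGAA"   # M25963, positions 284-317


def identity(a: str, b: str) -> tuple[int, int]:
    """Return (matches, length) for two equal-length, gap-free segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return sum(x == y for x, y in zip(a, b)), len(a)


matches, length = identity(ggcol8, m25963)
print(f"{matches}/{length} identical ({100 * matches / length:.1f}%)")
# prints: 25/34 identical (73.5%)
```

So the peptide match is 18/18, while the underlying DNA segments agree at only 25 of the 34 positions shown.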


Oh well, interesting... Why don't you try a BLAST at home and see ? 
... on DNA? 

CONCLUSION
==========

I think we all agree that databases are non-optimal. On the other hand, 
if you see those guys working, they are not lazy, nor do they enjoy 
being reminded that they do produce low-quality data. (I won't talk 
about proteins here, but the situation there is even worse.) The data need
better MAINTENANCE! 
We could spend another XX M$ on both sides of the Atlantic to have a 
staff of workers clean up the past, and cope with the flood of the future. 
But still, this wouldn't help. I think that there's something severely 
wrong with responsibilities. The researchers don't do what they should, namely 
take care of their own entries or areas, and correct the entries as appropriate.
And, for the future, the genome projects should adopt slightly more 
responsibility for what they produce. Just dumping thousands of low-quality,
robot-generated data entries into the databases and complaining afterwards
doesn't help. The funding agencies must understand that a genome project 
is USELESS (read: wasted money) if the data are not integrated well into the 
data sets. The coordinators of the projects must refrain from cooking up their 
own little databases, as they complain the loudest about the inability of the 
general database providers. We certainly don't need hundreds of small databases
but rather one set which is complete and of high quality. 
?We ? 

Who are 'We' that we tolerate these duplications without doing something
ourselves? A change in culture is needed. 

Regards
Reinhard Doelz

EMBnet Switzerland 


-- 
  +---------------------------+-------------------------------------------+
  |    Dr. Reinhard Doelz     | Tel. x41 61 2672247    Fax x41 61 2672078 |
  |      Biocomputing         | electronic Mail       doelz@urz.unibas.ch |
  |Biozentrum der Universitaet+-------------------------------------------+

From owner-embldatabank@net.bio.net Sun May 08 23:00:00 1994
Path: biosci!agate!doc.ic.ac.uk!daresbury!bioftp.unibas.ch!embl-heidelberg.de!stoehr
From: stoehr@embl-heidelberg.de (Peter Stoehr)
Newsgroups: bionet.molbio.embldatabank
Subject: Re: same sequence is different in EMBL and GENBANK
Message-ID: <1994May9.094735.171140@eros.embl-heidelberg.de>
Date: 9 May 94 09:47:35 +0100
References: <2q8h6o$av8@mserv1.dl.ac.uk> <2qcts8$a2i@manuel.anu.edu.au>
Distribution: bionet
Organization: European Molecular Biology Laboratory
Lines: 19

In article <2qcts8$a2i@manuel.anu.edu.au>, ingrid@helios.anu.edu.au
(Ingrid Jakobsen) writes:

> I can sympathise totally. It has also been my experience that there are
> discrepancies between EMBL and GenBank, and you are quite right, you should
> be able to get the right answer from just one database.

.. or at least the *same* answer.
I did a quick check and found about 0.9% to be different: approx. 166673 out
of 168107 entries with the same primary accession number are identical in
NCBI-GenBank and current EMBL. I used GenBank rel. 82, so the numbers may have
come down a little since.

Anyway, the total of 1434 entries which are out of synch is of course too
many, and I hope we can fix these in the next few days.

Regards,
Peter Stoehr
EMBL Data Library

From owner-embldatabank@net.bio.net Sun May 08 23:00:00 1994
Newsgroups: bionet.molbio.embldatabank
Path: biosci!agate!howland.reston.ans.net!europa.eng.gtefsd.com!darwin.sura.net!fconvx.ncifcrf.gov!fcsparc6!toms
From: toms@fcsparc6.ncifcrf.gov (Tom Schneider)
Subject: Re: same sequence is different in EMBL and GENBANK
Message-ID: <CpJo4M.F95@ncifcrf.gov>
Sender: usenet@ncifcrf.gov (C News)
Nntp-Posting-Host: fcsparc6.ncifcrf.gov
Organization: Frederick Cancer Research and Development Center
References: <2q8h6o$av8@mserv1.dl.ac.uk> <CpAt6w.12J@ncifcrf.gov> <1994May5.062921.10972@comp.bioz.unibas.ch>
Distribution: bionet
Date: Mon, 9 May 1994 16:50:45 GMT
Lines: 67

In article <1994May5.062921.10972@comp.bioz.unibas.ch>
doelz@comp.bioz.unibas.ch (Reinhard Doelz) writes:

| Tom Schneider (toms@fcsparc6.ncifcrf.gov) wrote:
| ...
| : In article <2q8h6o$av8@mserv1.dl.ac.uk> Massimo Delledonne
| : <DELLE%IPCUCSC.earn@earn-relay.ac.uk> writes:
| ...
| : | Who made the mistake ? The authors who submitted the correct version to Genbank
| ...
| : Although horrifying, this is a perfect example of one of the reasons that
| : "federations" of databases are likely to fail horribly.  EMBL and GENBANK are
| : supposed to work closely together!  What will happen when we have 50 sequence
| : databases and they DON'T even try to work together?  People will be checking
| : one database against the other and finding errors.  If a database gets a
| : reputation for handling data poorly, won't people simply stop using it?
| 
| I strongly argue against this. We have now PIR versus Swissprot, 
| Los Alamos/NCBI/EMBL, not to mention pacific rim originating sources ...
| We MUST make the databases work together. We'll never cope with the 
| flood otherwise. I am scared to read that you seem to suggest creating 
| yet another new database or even ?stop? using a database because it is 
| poor. Can we afford, fund, pay for this? 

The GenBank advisors (of which I was one) were trying to get a unified database
10 years ago.  The hope was that GenBank and EMBL could have identical
formats.  This was not politically possible because both sides wanted control.
So the two worked together, hopefully having the same data.  As we see now this
hasn't worked very well.  It means that people writing programs have to handle
two different formats, and it means that the two databases drift apart.  I was
not proposing that people choose a database, but rather that many databases
have chosen to work against one another rather than with each other.  Instead
of a single unified database we are aiming to have a database for every
chromosome of every species...  Under that ridiculous circumstance, the several
databases which attempt to gather all the data under one format will be in
competition for researchers' use.  Darwin has something to say about
that.

Do I like this circumstance?  No.  Can we afford to have a bunch of databases
pulling in different directions?  No.

| PS: We have most of the databases available worldwide and crosscheck 
| them, e.g. genbank vs EMBL, via standard programs like 'GCG' or 'nrdb'
| from GCG Inc, and NCBI, resp.

Are you doing this and not telling the databases how to become closer to one
another?  If your effort at crosschecking were to be fed back into the
databases you wouldn't have to do it at every release.  You would also save a
lot of effort by others who probably have to do the same thing.
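
As a rough illustration of what such a crosscheck could report back, here is a minimal sketch in Python. It is not the 'GCG' or 'nrdb' procedure mentioned above; the function name `cross_check` and the toy records are invented for this example, and it assumes each database has already been parsed into a mapping from primary accession number to sequence.

```python
def cross_check(embl: dict[str, str], genbank: dict[str, str]) -> dict[str, list[str]]:
    """Compare two accession -> sequence mappings and list the discrepancies."""
    shared = embl.keys() & genbank.keys()
    return {
        # Same accession number, different sequence (case is ignored).
        "differ": sorted(a for a in shared if embl[a].upper() != genbank[a].upper()),
        # Accession numbers present in only one of the two databases.
        "embl_only": sorted(embl.keys() - genbank.keys()),
        "genbank_only": sorted(genbank.keys() - embl.keys()),
    }


# Toy records, invented for illustration only.
embl = {"X00001": "acgtacgt", "X00002": "ttgacc", "X00003": "ggccaa"}
genbank = {"X00001": "ACGTACGT", "X00002": "TTGTCC", "X00004": "CCGGTT"}

report = cross_check(embl, genbank)
for kind, accessions in report.items():
    print(kind, accessions)
```

A list like this, keyed by accession number, is the kind of detailed report that could be sent to both databases rather than regenerated at every release.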

| Counting all, we get about 20 GByte needed
| for this effort. Justification being, we are an EMBnet node and try 
| to deliver what people need. Your suggestion would be to shift all that 
| to the users? END users?  It might be useful to bring up oddities as 
| above to these newsgroups but the direct partner to report database 
| problems should be the database vendors. I found both GENBANK and EMBL 
| staff very helpful (special Thanks to Peter Stoehr). They really appreciate
| detailed reports. 

As you say in another posting, the end users have to become more involved.  But
end users cannot make the overall database consistent.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

From owner-embldatabank@net.bio.net Sun May 08 23:00:00 1994
Path: biosci!agate!dog.ee.lbl.gov!ihnp4.ucsd.edu!swrinde!cs.utexas.edu!howland.reston.ans.net!pipex!doc.ic.ac.uk!daresbury!bioftp.unibas.ch!embl-heidelberg.de!stoehr
From: stoehr@embl-heidelberg.de (Peter Stoehr)
Newsgroups: bionet.molbio.embldatabank
Subject: Re: same sequence is different in EMBL and GENBANK
Message-ID: <1994May9.140233.171147@eros.embl-heidelberg.de>
Date: 9 May 94 14:02:32 +0100
References: <2q8h6o$av8@mserv1.dl.ac.uk> <2qcts8$a2i@manuel.anu.edu.au> <1994May9.094735.171140@eros.embl-heidelberg.de>
Distribution: bionet
Organization: European Molecular Biology Laboratory
Lines: 11

In article <1994May9.094735.171140@eros.embl-heidelberg.de>,
stoehr@embl-heidelberg.de (Peter Stoehr) writes:

> Anyway, the total of 1434 entries which are out of synch is of course too
> many, and I hope we can fix these in the next few days.

Sorry, I miscounted. There are 717 in error, not 1434.
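
To put the two counts in proportion, here is a small sketch in plain Python. It assumes that the total of 168107 compared entries from the earlier posting still applies to the corrected count; the variable names are illustrative.

```python
total = 168107           # entries sharing a primary accession number
identical = 166673       # reported as identical in the earlier posting
first_count = 1434       # originally reported as out of synch
corrected_count = 717    # corrected figure

# The first count is simply the remainder of the total.
print(total - identical == first_count)

for label, n in (("first count", first_count), ("corrected", corrected_count)):
    print(f"{label}: {n} of {total} = {100 * n / total:.2f}%")
```

Under that assumption the first count is about 0.85% of the compared entries and the corrected one about 0.43%.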

Regards,
Peter Stoehr
EMBL Data Library

From owner-embldatabank@net.bio.net Sun May 08 23:00:00 1994
Newsgroups: bionet.molbio.embldatabank
Path: biosci!agate!howland.reston.ans.net!pipex!doc.ic.ac.uk!daresbury!bioftp.unibas.ch!doelz
From: doelz@comp.bioz.unibas.ch (Reinhard Doelz)
Subject: Different entry is same sequence (Re: same sequence is different in EMBL and GENBANK)
Message-ID: <1994May9.131929.1805@comp.bioz.unibas.ch>
Organization: EMBnet Switzerland [Basel]
X-Newsreader: TIN [version 1.2 PL2]
Distribution: bionet
Date: Mon, 9 May 1994 13:19:29 GMT
Lines: 194



From owner-embldatabank@net.bio.net Sun May 08 23:00:00 1994
Newsgroups: bionet.molbio.embldatabank,bionet.molbio.genbank,bionet.molbio.proteins
Path: biosci!agate!library.ucla.edu!europa.eng.gtefsd.com!darwin.sura.net!fconvx.ncifcrf.gov!fcsparc6!toms
From: toms@fcsparc6.ncifcrf.gov (Tom Schneider)
Subject: Re: Different entry is same sequence (Re: same sequence is different in EMBL and GENBANK)
Message-ID: <CpJoLE.FMn@ncifcrf.gov>
Sender: usenet@ncifcrf.gov (C News)
Nntp-Posting-Host: fcsparc6.ncifcrf.gov
Organization: Frederick Cancer Research and Development Center
References: <2q8h6o$av8@mserv1.dl.ac.uk> <2qcts8$a2i@manuel.anu.edu.au> <1994May8.084902.4583@comp.bioz.unibas.ch>
Distribution: bionet
Date: Mon, 9 May 1994 17:00:50 GMT
Lines: 57
Xref: biosci bionet.molbio.embldatabank:330 bionet.molbio.genbank:1623 bionet.molbio.proteins:1906

In article <1994May8.084902.4583@comp.bioz.unibas.ch> doelz@comp.bioz.unibas.ch
(Reinhard Doelz) writes:

| I haven't analyzed this systematically but I am afraid that inconsistencies 
| like this make database provider's life difficult.

It makes the database user's life extremely difficult.

| As human intervention
| is extremely expensive (manpower) and we (customers) don't want to pay, the 
| prediction that it will become worse in the future is a safe guess. 

Yes, unless action is taken soon, there will eventually be a crisis.

| I think we all agree that databases are non-optimal. On the other hand, 
| if you see those guys working, they don't feel lazy, nor do they enjoy 
| being reminded that they do produce low-quality data. (I won't talk 
| on proteins here but the situation there is even worse). The data need
| better MAINTENANCE! 

Yes

| We could spend another XX M$ on both sides of the atlantic to have a 
| staff of workers clean up the past, and cope with the flood of the future. 
| But still, this wouldn't help. I think that there's something severely 
| wrong with responsibilities. The researchers don't do what they should, namely 
| take care of their own entries or areas, and correct the entries as appropriate.

BINGO!

| And, for the future, the genome projects should adopt slightly more 
| responsibility for what they produce. Just dumping thousands of low-quality,
| robot-generated data entries into the databases and complaining afterwards
| doesn't help. The funding agencies must understand that a genome project 
| is USELESS (read: wasted money) if the data are not integrated well into the 
| data sets. The coordinators of the projects must refrain from cooking up their 
| own little databases, as they complain the loudest about the inability of the 
| general database providers. We certainly don't need hundreds of small databases
| but rather one set which is complete and of high quality. 
| ?We ? 

BINGO!

| Who are 'We' that we tolerate these duplications without doing something
| ourselves? A change in culture is needed. 

Duplication should not be tolerated; that's why it is the first principle in my
database philosophy paper.  (anonymous ftp from
ftp.ncifcrf.gov/pub/delila/philgen* but in revision at the moment.  If you
would like me to tell you when the next revision is out, please send me a
note.)

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

From owner-embldatabank@net.bio.net Mon May 09 23:00:00 1994
Path: biosci!agate!spool.mu.edu!torn!nott!uotcsi2!mgcheo.med.uottawa.ca!sbaird
From: sbaird@mgcheo.med.uottawa.ca (Stephen Baird)
Newsgroups: bionet.molbio.embldatabank,bionet.molbio.genbank,bionet.molbio.proteins
Subject: Re: Different entry is same sequence (Re: same sequence is different in EMBL and GENBANK)
Followup-To: bionet.molbio.embldatabank,bionet.molbio.genbank,bionet.molbio.proteins
Date: 10 May 1994 04:11:51 GMT
Organization: Department of Computer Science, University of Ottawa
Lines: 34
Distribution: bionet
Message-ID: <2qn1i7$l92@csi0.csi.uottawa.ca>
References: <1994May8.084902.4583@comp.bioz.unibas.ch>
NNTP-Posting-Host: mgcheo.med.uottawa.ca
X-Newsreader: TIN [version 1.1 PL8]
Xref: biosci bionet.molbio.embldatabank:331 bionet.molbio.genbank:1624 bionet.molbio.proteins:1912

Reinhard Doelz (doelz@comp.bioz.unibas.ch) wrote (with a lot deleted):

: We could spend another XX M$ on both sides of the atlantic to have a 
: staff of workers clean up the past, and cope with the flood of the future. 
: But still, this wouldn't help. I think that there's something severely 
: wrong with responsibilities. The researchers don't do what they should, namely 
: take care of their own entries or areas, and correct the entries as appropriate.
: And, for the future, the genome projects should adopt slightly more 
: responsibility for what they produce. Just dumping thousands of low-quality,
: robot-generated data entries into the databases and complaining afterwards
: doesn't help. 

: Regards
: Reinhard Doelz


I like the idea that researchers should be responsible for their entries in
the databases (unfortunately not all of us have organized/clean/up-to-date
lab benches or offices and database entries might reflect that).  I was 
wondering what one should do when a competitor duplicates your entry or
a complete cDNA is sequenced for which there is an EST database entry.
How can the best sequences prevail?



|--------------------------------------------------------------------|
| Stephen Baird                        sbaird@mgcheo.med.uottawa.ca  | 
| Molecular Genetics                       tel: 613-738-3925         |
| Children's Hospital of Eastern Ontario   fax: 613-738-4833         |
| 415 Smyth Rd.                                                      |
| Ottawa, Ontario                                                    |
| Canada                                                             |
| K1H 8M8                                                            |
|--------------------------------------------------------------------|

From owner-embldatabank@net.bio.net Wed May 11 23:00:00 1994
Path: biosci!agate!howland.reston.ans.net!spool.mu.edu!umn.edu!msus1.msus.edu!vax1.mankato.msus.edu!vax1.mankato.msus.edu!nntp
Newsgroups: bionet.molbio.methds-reagnts,bionet.molbio.genbank,bionet.molbio.embldatabank
Subject: Sequence alignment help needed
Message-ID: <1994May11.105414.5810@vax1.mankato.msus.edu>
From: MOL13@VAX1.Mankato.MSUS.edu
Date: Wed, 11 May 94 10:43:20 PDT
Nntp-Posting-Host: 134.29.9.240
X-Newsreader: NEWTNews & Chameleon -- TCP/IP for MS Windows from NetManage
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Lines: 18
Xref: biosci bionet.molbio.methds-reagnts:14177 bionet.molbio.genbank:1627 bionet.molbio.embldatabank:332


Our group needs help in performing multiple sequence alignments of sequences
for alpha and beta adrenergic receptors across a number of species.

Although we have some software packages for doing this from the EMBL server, as 
novices using these programs we are having trouble getting any meaningful 
results.  Our goal is to secure such data to aid us in the design of a number 
of PCR primer pairs.

Any help would be greatly appreciated.  Please e-mail me at the address below.

Thanks.

Mark Lyte
Department of Biological Sciences
Mankato State University
E-mail:  MOL13@VAX1.Mankato.MSUS.edu


From owner-embldatabank@net.bio.net Mon May 16 23:00:00 1994
Path: biosci!bcm!cs.utexas.edu!howland.reston.ans.net!agate!msuinfo!harbinger.cc.monash.edu.au!newshost.anu.edu.au!helios!ingrid
From: ingrid@helios.anu.edu.au (Ingrid Jakobsen)
Newsgroups: bionet.molbio.embldatabank
Subject: Re: Different entry is same sequence (Re: same sequence is different in EMBL and GENBANK)
Date: 17 May 1994 06:44:02 GMT
Organization: Australian National University
Lines: 202
Sender: ingrid@helios (Ingrid Jakobsen)
Distribution: bionet
Message-ID: <2r9p3j$sa7@manuel.anu.edu.au>
References: <1994May9.131929.1805@comp.bioz.unibas.ch>
NNTP-Posting-Host: 150.203.7.83

I hope I am just being paranoid this week, but what you wrote feels 
somewhat like a flame, and I feel I have to defend myself in a few
places. My apologies if this is all old hat now, but this post only 
arrived at our site yesterday:

In article <1994May9.131929.1805@comp.bioz.unibas.ch>, doelz@comp.bioz.unibas.ch (Reinhard Doelz) writes:
|> 
|> My apologies... this is a bit long but try to read it carefully.

I have done that, and I get the feeling you didn't do me the 
courtesy of reading what I wrote carefully. My post was much shorter.
 
|> 
|> Ingrid Jakobsen (ingrid@helios.anu.edu.au) wrote:
|> 
|> : I have also seen duplicate entries eliminated from GenBank, but kept
|> : on EMBL, and sequences withdrawn because corrections showed them to be
|> : identical to previous sequences, but retained on EMBL.
|> 
|> : So my solution in general is to stick to GenBank, and not use EMBL. I know
|> : this is a sad thing to say, but as Massimo has also found out, it just 
|> : doesn't seem as up-to-date. I don't know which side of the Atlantic the
|> : problem is on: GenBank not sending information on, or EMBL not using it.
|> 
|> Just as a matter of fairness, blaming anyone on examples won't help, and

I didn't blame anyone based on examples. I concluded that EMBL was less
reliable than GenBank based on examples, sure, but I was careful to say
that I didn't know where the blame actually lay.

It may be that EMBL overall is more reliable than GenBank. In that case,
I have seen a very unrepresentative sample, because in every case I have 
seen, the GenBank entry is "better" - more recently corrected or whatever.

I didn't make this decision based on some idea that GenBank ought to be
better; I went and chased down the original references. 


|> deducing that EMBL is bad is presumably a valued view in the states but 
|> if I claim GENBANK is bad the US wouldn't tolerate it either :-)

I should point out that I am not located in the states, I am based in
Australia. Most Australian researchers have no idea which database their
"loyalties" should be with, EMBL or GenBank or DDBJ. I went into this
with absolutely no opinion either way; as I mentioned, I reached my 
personal decision after considerable running to and from the library.

I would also like to point out that I find the current situation with
two US databases ridiculous, and I admire the fact that numerous countries
can co-operate on EMBL. This is much more sensible in my opinion. 

|> It is that the both talk to each other on computer basis. Computer parsing 
|> programs are a mess, in particular as both databases don't agree entirely 
|> on their formats; i.e. parsing extends to mapping and voila - there are 
|> problems. The following example is an anecdote I just came across where 
|> both databases have a duplicate. 
|> 
|> The entry GGCOL8 in EMBL (15 April 1994, updated 20-April 1994) is LOCUS 
|> CHKC1A206 (1-Jun-1984) in GENBANK. However, Entry GGCOL8 (9-Jun 1982, updated
|> 6-Jul 1989) with LOCUS GGCOL8 (6-Jul 89) is only with one Accession number 
|> more; J00827 being the first and V00400 being the additional one. 
|> 
|> Both entries refer to a journal Cell 22, 887-892 (1980) - i.e., EMBL seems
|> to take 14 years before they do a mistake, whereas GENBANK takes only 5 years.

In other words, your opinion of the databases is also based on examples, or
in fact only _one_ example. So why can't I have my opinion on the databases
based on the number of examples I have seen? And I merely decided that GenBank
was more reliable than EMBL, you seem to claim from one example that GenBank
generally takes five years to make mistakes.

I am prepared to believe that computer parsing may be the problem, I have
no experience with it. It just suggests to me that we are looking at a
big waste of resources with each side trying to duplicate the work of the
other. I don't think the problem can be dealt with unfortunately, as I 
don't think either side is going to let the other become the major database
provider. 

|> I haven't analyzed this systematically but I am afraid that inconsistencies 
|> like this make database provider's life difficult. As human intervention
|> is extremely expensive (manpower) and we (customers) don't want to pay, the 
|> prediction that it will become worse in the future is a safe guess. 

I can only agree with this. But consider the expense of the "manpower" wasted
as thousands of researchers waste their time on faulty database entries, as
exemplified by Massimo Delledone, whose bad experience started the whole
discussion.
|> 
|> You rely on BLAST searching? 
|> Fine. I used the peptide as described above and searched the 'nr' dataset
|> which we do in-house on all protein databases available. 
|> 
|> The entry scoring 
|>  Score = 108 (49.3 bits), Expect = 1.1e-08, P = 1.1e-08
|>  Identities = 18/18 (100%), Positives = 18/18 (100%)
|> 
|> if looked up in the result, is located at position 8 (as the only 
|> entirely matching entry - other irrelevant matches lead the score)

The reason is that the entries were sorted by Poisson probabilities, 
rather than high scores, which is always a problem when searching with
short sequences. As far as I can remember, that option can be reset.

The leading matches incidentally are also collagen genes, which hardly
makes them "irrelevant". 

|> and
|> does NOT occur in either SWISSPROT nor PIR database, but only in PATCHX
|> (Pfeiffer, MIPS Martinsried). Entry: patchx:M25963 ; There, we read: 
|> LOCUS       CHKCOLA07
|> DEFINITION  Chicken alpha-2 collagen gene type I gene, exons 13-15
|> ACCESSION   M25963
|> SOURCE
|>   ORGANISM  Gallus gallus
|> REFERENCE   1
|>   AUTHORS   Boedtker,H., Finer,M. and Aho,S.
|>   TITLE     The structure of the chicken alpha-2 collagen gene
|>   JOURNAL   Ann. N. Y. Acad. Sci. 460, 85-116 (1985)
|> FEATURES
|>      CDS             join(M25956:1548. .1617,M25956:3513. .3523,
|>                      M25956:4131. .4148,M25956:4783. .4818,M25959:182. .265,
|>                      M25961:205. .261,M25962:609. .653,M25962:755. .808,
|>                      M25962:1118. .1171,M25962:1539. .1592,M25962:2078. .2131,
|>                      M25962:2345. .2398,6. .50,287. .340,439. .483) /partial
|>                      /note="alpha-collagen type I;; NCBI gi: 211605."
|>                      /codon_start=1
|> 
|> 
|> Note that there's now talking on entry M25963, with both EMBL and GENBANK
|> versions, and this is exon 13-15, whereas the original source talked about
|> exon 42, and exon 6, respectively. 
|> 
|> A DNA comparison reveals. 
|> 
|>  Ggcol8 x M25963           May 8, 1994  10:23  ..
|> 
|>                   .         .         .
|>       15 CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA 48
|>          || ||| |||| |||| |||||||     |||||
|>      284 CAGGGTGCTCGCGGTCTCCCTGGTGAGAGAGGAA 317
|> 
|> 
|> Oh well, interesting... Why don't you try a BLAST at home and see ? 
|> ... on DNA?

What are you trying to say here? I had a look at the DNA sequences and it
is in fact M25962 CHKCOLA06 (which is exons 7 to 12, marginally closer to
exon 6) which contains the match to GGCOL8. The reason you get M25963 in
the protein search is because that is the entry that contains the amino
acid translation of all the entries listed under CDS above. The PATCHX
(and incidentally Genpept) use this accession number because that is 
the entry where the translation was put. 
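
To make the point about the CDS line concrete, here is a minimal sketch in
Python (my choice of language; the names CDS_LOCATION, SEGMENT and
parse_segments are hypothetical, not part of any database tool). It assumes
the location string exactly as quoted above, with ". ." standing in for the
range operator, and simply counts how many coding segments each accession
contributes to the translation stored under M25963.

```python
import re
from collections import Counter

# CDS location copied from the quoted M25963 entry. The listing shows the
# range operator as ". .", so the pattern below tolerates that spacing.
CDS_LOCATION = (
    "join(M25956:1548. .1617,M25956:3513. .3523,"
    "M25956:4131. .4148,M25956:4783. .4818,M25959:182. .265,"
    "M25961:205. .261,M25962:609. .653,M25962:755. .808,"
    "M25962:1118. .1171,M25962:1539. .1592,M25962:2078. .2131,"
    "M25962:2345. .2398,6. .50,287. .340,439. .483)"
)

# Optional "ACCESSION:" prefix, then start and end coordinates.
SEGMENT = re.compile(r"(?:([A-Z]\d+):)?(\d+)\.\s*\.(\d+)")


def parse_segments(location, local_accession):
    """Return (accession, start, end) tuples for every segment in a join().

    Segments without an accession prefix belong to the entry that carries
    the feature, passed in as local_accession (an assumption of this sketch).
    """
    segments = []
    for match in SEGMENT.finditer(location):
        accession = match.group(1) or local_accession
        segments.append((accession, int(match.group(2)), int(match.group(3))))
    return segments


segments = parse_segments(CDS_LOCATION, "M25963")
per_entry = Counter(accession for accession, _, _ in segments)
for accession in sorted(per_entry):
    print(accession, per_entry[accession])
```

Only the three unprefixed segments are local to M25963; the remaining twelve
point into other entries, six of them into M25962.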

You can find M25962 and M25963 on EMBL too, with the same notation, it
just doesn't provide the translation, which might be a good thing, or 
a bad thing, you decide...
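
For anyone who wants to check the quoted Ggcol8 x M25963 comparison by hand,
a small Python sketch follows. It assumes only the two aligned 34-base
stretches printed in the comparison above; the function name
count_identities is hypothetical, and this is a plain position-by-position
count, not a reimplementation of the program that produced the alignment.

```python
# Aligned stretches copied from the quoted comparison (no gaps in either).
GGCOL8 = "CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA"  # Ggcol8 positions 15-48
M25963 = "CAGGGTGCTCGCGGTCTCCCTGGTGAGAGAGGAA"  # M25963 positions 284-317


def count_identities(a, b):
    """Count identical positions between two equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches, len(a)


matches, length = count_identities(GGCOL8, M25963)
print(f"{matches}/{length} identical ({100 * matches / length:.1f}%)")
```

The count comes to 25 of 34 positions, which agrees with the 25 bars in the
quoted match line, so this stretch is similar but not identical.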
 
|> 
|> CONCLUSION
|> ==========
|> 
|> I think we all agree that databases are non-optimal. On the other hand, 
|> if you see those guys working, they don't feel lazy, nor do they enjoy 
|> being reminded that they do produce low-quality data. (I won't talk 
|> on proteins here but the situation there is even worse). The data need
|> better MAINTENANCE! 
|> We could spend another XX M$ on both sides of the atlantic to have a 
|> staff of workers clean up the past, and cope with the flood of the future. 
|> But still, this wouldn't help. I think that there's something severely 
|> wrong with responsibilities. The researchers don't do what they should, namely 
|> take care of their own entries or areas, and correct the entries as appropriate.
|> And, for the future, the genome projects should adopt slightly more 
|> responsibility for what they produce. Just dumping thousands of low-quality
|> data entries to the databases, generated by robots, and complain afterwards
|> doesn't help. The funding agencies must understand that a genome project 
|> is USELESS (read: wasted money) if the data are not integrated well into the 
|> data sets. The coordinators of the projects must refrain from cooking their 
|> own little databases as they complain the loudest on the inability of the 
|> general database providers. We certainly don't need hundreds of small databases
|> but rather one set which is complete, and high quality. 
|> ?We ? 
|> 
|> Who are 'We' that we tolerate these duplications without doing something
|> ourselves? A change in culture is needed. 

I agree with this whole-heartedly, I think researchers should take far 
more responsibility for their entries than they do at the moment. It is a 
huge problem which I don't really want to go into here.

But I don't think your conclusion has anything to do with the issue being
discussed here. The problem was not with authors failing to take 
responsibility for their data at all, but rather problems between the two
databases. Many entries have problems with the notation (authors not
seeming to know which exon they have just sequenced is only one of them),
but despite that, in most cases the actual sequences held by both
databases are the same. What concerned Massimo and myself was that this is
not always the case. 

I am heartened to see that Peter Stoer from EMBL
<1994May9.094735.171140@eros.embl-heidelberg.de> thinks it is a problem
worth fixing. Thank you.

Ingrid

From owner-embldatabank@net.bio.net Tue May 17 23:00:00 1994
Path: biosci!agate!howland.reston.ans.net!pipex!doc.ic.ac.uk!daresbury!bioftp.unibas.ch!rc1.vub.ac.be!rc4!mphilipe
From: mphilipe@rc1.vub.ac.be (Philippe M.)
Newsgroups: bionet.molbio.embldatabank
Subject: Different sequence is same entry
Date: 18 May 1994 13:30:23 GMT
Organization: Brussels Free Universities (VUB/ULB), Belgium
Lines: 2
Message-ID: <2rd59f$fc1@rc1.vub.ac.be>
NNTP-Posting-Host: rc4.vub.ac.be
Summary: Different sequence with same number
Keywords: Embl meca
X-Newsreader: TIN [version 1.2 PL2]

I have encountered a problem when comparing the coding sequence of the meca
gene of B. subtilis (em_ba:bsmeca l06059) with the meca gene of S. aureus
(em_ba:samecapb x52593). I used the fasta program, which returned two
alignments: one with the first part of the sameca sequence and one with the
end of the sameca sequence. Could the problem be caused by the old and the
revised sequences of the sameca gene both being present under the same ID
number?

J.Thonnard
 

From owner-embldatabank@net.bio.net Fri May 27 23:00:00 1994
Newsgroups: bionet.molbio.embldatabank,embnet.net-dev
Path: biosci!agate!howland.reston.ans.net!xlink.net!scsing.switch.ch!swidir.switch.ch!univ-lyon1.fr!jussieu.fr!citi2.fr!bioftp.unibas.ch!doelz
From: doelz@comp.bioz.unibas.ch (Reinhard Doelz)
Subject: BASEL temporary downtime
Message-ID: <1994May28.143220.27837@comp.bioz.unibas.ch>
Organization: EMBnet Switzerland [Basel]
X-Newsreader: TIN [version 1.2 PL2]
Date: Sat, 28 May 1994 14:32:20 GMT
Lines: 14

Due to testing of the emergency power plant the services of Basel 
University are interrupted at 

           WEDNESDAY, June 1st, 1994  2:00 - 6:00 pm  MDT

EMBnet Switzerland services (Interactive login, HASSLE, GOPHER, WWW, FTP) 
are unavailable at this time. We apologize for the inconvenience. 


-- 
  +---------------------------+-------------------------------------------+
  |    Dr. Reinhard Doelz     | Tel. x41 61 2672247    Fax x41 61 2672078 |
  |      Biocomputing         | electronic Mail       doelz@urz.unibas.ch |
  |Biozentrum der Universitaet+-------------------------------------------+

