Different entry is same sequence (Re: same sequence is different in EMBL and GENBANK)

Reinhard Doelz doelz at comp.bioz.unibas.ch
Mon May 9 08:19:29 EST 1994


My apologies... this is a bit long but try to read it carefully. 


Ingrid Jakobsen (ingrid at helios.anu.edu.au) wrote:
: In article <2q8h6o$av8 at mserv1.dl.ac.uk>, Massimo Delledonne <DELLE%IPCUCSC.earn at earn-relay.ac.uk> writes:

: (Sad story deleted about EMBL and GenBank entries being different
: for the same Accession Number)

... and some other remarks deleted ... 

: I have also seen duplicate entries eliminated from GenBank, but kept
: on EMBL, and sequences withdrawn because corrections showed them to be
: identical to previous sequences, but retained on EMBL.

: So my solution in general is to stick to GenBank, and not use EMBL. I know
: this is a sad thing to say, but as Massimo has also found out, it just 
: doesn't seem as up-to-date. I don't know which side of the Atlantic the
: problem is on: GenBank not sending information on, or EMBL not using it.

Just as a matter of fairness: blaming anyone on the basis of individual
examples won't help. Concluding that EMBL is bad is presumably a popular view
in the States, but if I claimed GENBANK is bad the US wouldn't tolerate it
either :-) 

The point is that the two databases talk to each other at the computer level.
Parsing programs are a mess, in particular as the two databases don't agree
entirely on their formats; i.e. parsing extends to mapping and voila - there
are problems. The following example is an anecdote I just came across, where
both databases have a duplicate. 

The entry GGCOL8 in EMBL (15 April 1994, updated 20-April 1994) is LOCUS 
CHKC1A206 (1-Jun-1984) in GENBANK. However, there is also an entry GGCOL8
(9-Jun 1982, updated 6-Jul 1989) with LOCUS GGCOL8 (6-Jul 89), which has just
one accession number more: J00827 being the first and V00400 being the
additional one. 

Both entries refer to the same article, Cell 22, 887-892 (1980) - i.e., EMBL
seems to take 14 years before they make a mistake, whereas GENBANK takes only
5 years.

The sequences are entirely identical. I use the GENBANK format here to show 
differences: 
<   AUTHORS   Yamada,Y., Avvedimento,E.V., Mudryj,M., Ohkubo,H., Vogeli,G.,
>   AUTHORS   Yamada,Y., Avvedimento,E., Mudryj,M., Ohkubo,H., Vogeli,G.,

<             amplification of a dna segment containing an exon of 54 bp
>             amplification of a DNA segment containing an exon of 54 bp

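A comparison like this is easy to script. The following is only a minimal
sketch, assuming the two entries have already been written out in the same
format and that corresponding lines sit at the same position; the function name
classify_lines is my own invention for illustration. It separates purely
cosmetic differences (upper/lower case) from real ones:

```python
# Lines copied from the two GENBANK-format versions shown above.
ENTRY_A = [
    "  AUTHORS   Yamada,Y., Avvedimento,E.V., Mudryj,M., Ohkubo,H., Vogeli,G.,",
    "            amplification of a dna segment containing an exon of 54 bp",
]
ENTRY_B = [
    "  AUTHORS   Yamada,Y., Avvedimento,E., Mudryj,M., Ohkubo,H., Vogeli,G.,",
    "            amplification of a DNA segment containing an exon of 54 bp",
]


def classify_lines(a, b):
    """Label each pair of lines as 'same', 'case only' or 'different'.

    Assumes both lists have the same length and are aligned line by line
    (a simplification; real entries would need a proper diff first).
    """
    labels = []
    for line_a, line_b in zip(a, b):
        if line_a == line_b:
            labels.append("same")
        elif line_a.lower() == line_b.lower():
            labels.append("case only")
        else:
            labels.append("different")
    return labels


labels = classify_lines(ENTRY_A, ENTRY_B)
for label, line_a, line_b in zip(labels, ENTRY_A, ENTRY_B):
    print(f"[{label}]")
    print("<", line_a.strip())
    print(">", line_b.strip())
```

With the two lines above, the AUTHORS line comes out as a real difference
(the initials "E.V." versus "E.") and the title line as a case-only one.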

I guess (or at least hope) that these are not the reason for the duplication.
The problem comes in the feature table! For the sake of completeness I use
EMBL format here, in truncated form: 

FT   intron          <1..8             |  FT   source          1..70
FT                   /note="collagen   |  FT                   /organism="Gallu
FT   prim_transcript <1..>70           |  FT   CDS             9..62
FT                   /note="collagen   |  FT                   /note="exon 6"
FT   exon            9..62             |
FT                   /number=42        |
FT                   /note="collagen   |
FT   exon            9..62             |
FT                   /note="collagen   |
FT                   putative"         |
FT   intron          63..>70           |
FT                   /note="collagen   |
FT   source          1..70             |
FT                   /organism="Gallu  |

One entry says
/note="collagen helipeptide, exon 42 (AA 37 to 54);
and the other says 
/note="exon 6"
--- from the SAME reference. 
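
Which feature keys each entry uses can also be listed mechanically. This is a
rough sketch only, assuming the usual EMBL flat-file layout with the feature
key in columns 6 to 20 of an FT line; the lines are the truncated ones from the
table above, and the helper name feature_keys is made up for this example:

```python
from collections import Counter

# Truncated feature-table lines as quoted in this post (left and right column).
ENTRY_A = """\
FT   intron          <1..8
FT                   /note="collagen
FT   prim_transcript <1..>70
FT                   /note="collagen
FT   exon            9..62
FT                   /number=42
FT                   /note="collagen
FT   exon            9..62
FT                   /note="collagen
FT                   putative"
FT   intron          63..>70
FT                   /note="collagen
FT   source          1..70
FT                   /organism="Gallu
"""

ENTRY_B = """\
FT   source          1..70
FT                   /organism="Gallu
FT   CDS             9..62
FT                   /note="exon 6"
"""


def feature_keys(text):
    """Count feature keys, assuming the key occupies columns 6-20 of FT lines."""
    keys = Counter()
    for line in text.splitlines():
        key = line[5:20].strip()
        if line.startswith("FT") and key:
            keys[key] += 1
    return keys


keys_a = feature_keys(ENTRY_A)
keys_b = feature_keys(ENTRY_B)
only_a = set(keys_a) - set(keys_b)
only_b = set(keys_b) - set(keys_a)
print("only in first entry: ", sorted(only_a))
print("only in second entry:", sorted(only_b))
```

On the truncated table this reports intron, prim_transcript and exon for the
first entry only, and CDS for the second entry only.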

One entry has a CDS, the other has no CDS. Why? Simply because one entry
annotates positions 9 to 62 as CDS, and the other annotates them as
mat_peptide: 

(GENBANK format again)

<      mat_peptide     9..62
<                      /partial
<                      /codon_start=1
<                      /note="collagen helipeptide, 2 (AA 37 to 54)"
>      CDS             9..62
>                      /note="exon 6;  NCBI gi: 63306."
>                      /codon_start=1
>                      /translation="GPQGPRGPPGPPGKAGED"

So what is the difference between a mat_peptide and a CDS? 
The gbrel.txt from release 82 tells us 

CDS             Sequence coding for amino acids in protein (includes
                stop codon)
mat_peptide     Mature peptide coding region (does not include stop codon)

and ftable.doc from EMBL 
CDS              Sequence coding for amino acids in protein
exon             Region that codes for part of spliced mRNA


OK, so far the documentation; but by GENBANK's precise definition the two
annotations are contradictory here, as one is _with_ and the other _without_
stop codon. Well, it isn't quite so, as the last three nucleotides code for D,
as stated in the translation:    /translation="GPQGPRGPPGPPGKAGED" 
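
The arithmetic behind this can be checked in a few lines. A minimal sketch,
using only numbers quoted in this post: the range 9..62, the translation, and
the GGCOL8 fragment (positions 15 to 48) from the DNA comparison further down.
The codon table is deliberately partial and covers only the codons occurring in
that fragment (standard genetic code); all variable names are mine:

```python
# Values copied from the feature table quoted above.
CDS_START, CDS_END = 9, 62
TRANSLATION = "GPQGPRGPPGPPGKAGED"

# GGCOL8 positions 15..48, as printed in the DNA comparison in this post.
FRAGMENT_START = 15
FRAGMENT = "CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA"

# Partial codon table: only the codons that occur in FRAGMENT.
CODONS = {
    "CAA": "Q", "CGT": "R",
    "GGT": "G", "GGA": "G",
    "CCT": "P", "CCC": "P", "CCA": "P",
}


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


n_nt = CDS_END - CDS_START + 1            # nucleotides in 9..62
n_codons = n_nt // 3                      # complete codons in that range
offset = FRAGMENT_START - CDS_START       # fragment start relative to CDS start
first_residue = offset // 3               # index of the first translated residue
peptide = translate(FRAGMENT)

print(n_nt, "nt =", n_codons, "codons;", len(TRANSLATION), "residues")
print("fragment in frame:", offset % 3 == 0)
print("fragment translates to:", peptide)
print("matches quoted translation:",
      TRANSLATION[first_residue:first_residue + len(peptide)] == peptide)
```

54 nucleotides give 18 codons, and the translation has 18 residues, so every
codon in 9..62 codes for an amino acid and there is no room for a stop codon
in that range.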

I haven't analyzed this systematically, but I am afraid that inconsistencies 
like this make the database providers' life difficult. As human intervention
is extremely expensive (manpower) and we (customers) don't want to pay, the 
prediction that it will become worse in the future is a safe guess. 

You rely on BLAST searching? 
Fine. I used the peptide as described above and searched the 'nr' dataset
which we do in-house on all protein databases available. 

The entry scoring 
 Score = 108 (49.3 bits), Expect = 1.1e-08, P = 1.1e-08
 Identities = 18/18 (100%), Positives = 18/18 (100%)

if looked up in the result, is located at position 8 (it is the only 
entirely matching entry - other, irrelevant matches lead the ranking) and
does NOT occur in either the SWISSPROT or the PIR database, but only in PATCHX
(Pfeiffer, MIPS Martinsried). Entry: patchx:M25963 ; There, we read: 
LOCUS       CHKCOLA07
DEFINITION  Chicken alpha-2 collagen gene type I gene, exons 13-15
ACCESSION   M25963
SOURCE
  ORGANISM  Gallus gallus
REFERENCE   1
  AUTHORS   Boedtker,H., Finer,M. and Aho,S.
  TITLE     The structure of the chicken alpha-2 collagen gene
  JOURNAL   Ann. N. Y. Acad. Sci. 460, 85-116 (1985)
FEATURES
     CDS             join(M25956:1548..1617,M25956:3513..3523,
                     M25956:4131..4148,M25956:4783..4818,M25959:182..265,
                     M25961:205..261,M25962:609..653,M25962:755..808,
                     M25962:1118..1171,M25962:1539..1592,M25962:2078..2131,
                     M25962:2345..2398,6..50,287..340,439..483) /partial
                     /note="alpha-collagen type I;; NCBI gi: 211605."
                     /codon_start=1


Note that we are now talking about entry M25963, with both EMBL and GENBANK
versions, and this is exon 13-15, whereas the original sources talked about
exon 42 and exon 6, respectively. 

A DNA comparison reveals: 

 Ggcol8 x M25963           May 8, 1994  10:23  ..

                  .         .         .
      15 CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA 48
         || ||| |||| |||| |||||||     |||||
     284 CAGGGTGCTCGCGGTCTCCCTGGTGAGAGAGGAA 317
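
To put a number on this comparison, one can simply count the matching
positions. A small sketch, assuming the two fragments are taken exactly as
printed above (ungapped, same length); the function names are my own:

```python
# The two aligned fragments, copied from the comparison above.
GGCOL8_15_48 = "CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA"
M25963_284_317 = "CAGGGTGCTCGCGGTCTCCCTGGTGAGAGAGGAA"


def match_line(a, b):
    """Return the '|' / ' ' line for an ungapped alignment of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return "".join("|" if x == y else " " for x, y in zip(a, b))


def identity(a, b):
    """Return (matching positions, alignment length)."""
    line = match_line(a, b)
    return line.count("|"), len(line)


matches, length = identity(GGCOL8_15_48, M25963_284_317)
print(GGCOL8_15_48)
print(match_line(GGCOL8_15_48, M25963_284_317))
print(M25963_284_317)
print(f"{matches}/{length} identical ({100 * matches / length:.1f}%)")
```

That is 25 identical positions out of 34 on the DNA level for this stretch,
i.e. similar but far from the 100% seen with the peptide.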


Oh well, interesting... Why don't you try a BLAST at home and see ? 
... on DNA? 

CONCLUSION
==========

I think we all agree that databases are non-optimal. On the other hand, 
if you see those guys working, they are not lazy, nor do they enjoy 
being reminded that they produce low-quality data. (I won't talk 
about proteins here, but the situation there is even worse.) The data need
better MAINTENANCE! 
We could spend another XX M$ on both sides of the Atlantic to have a 
staff of workers clean up the past and cope with the flood of the future. 
But still, this wouldn't help. I think that there's something severely 
wrong with responsibilities. The researchers don't do what they should, namely 
take care of their own entries or areas, and correct the entries as appropriate.
And, for the future, the genome projects should adopt slightly more 
responsibility for what they produce. Just dumping thousands of low-quality
data entries, generated by robots, into the databases and complaining
afterwards doesn't help. The funding agencies must understand that a genome
project is USELESS (read: wasted money) if the data are not integrated well
into the data sets. The coordinators of the projects must refrain from cooking
their own little databases, as they complain the loudest about the inability
of the general database providers. We certainly don't need hundreds of small
databases but rather one set which is complete and of high quality. 
?We ? 

Who are 'We' that we tolerate these duplications without doing something
ourselves? A change in culture is needed. 

Regards
Reinhard Doelz

EMBnet Switzerland 

-- 
  +---------------------------+-------------------------------------------+
  |    Dr. Reinhard Doelz     | Tel. x41 61 2672247    Fax x41 61 2672078 |
  |      Biocomputing         | electronic Mail       doelz at urz.unibas.ch |
  |Biozentrum der Universitaet+-------------------------------------------+


