Different entry is same sequence (Re: same sequence is different in EMBL and GENBANK)
doelz at comp.bioz.unibas.ch
Sun May 8 03:49:02 EST 1994
My apologies... this is a bit long but try to read it carefully.
Ingrid Jakobsen (ingrid at helios.anu.edu.au) wrote:
: In article <2q8h6o$av8 at mserv1.dl.ac.uk>, Massimo Delledonne <DELLE%IPCUCSC.earn at earn-relay.ac.uk> writes:
: (Sad story deleted about EMBL and GenBank entries being different
: for the same Accession Number)
... and some other remarks deleted ...
: I have also seen duplicate entries eliminated from GenBank, but kept
: on EMBL, and sequences withdrawn because corrections showed them to be
: identical to previous sequences, but retained on EMBL.
: So my solution in general is to stick to GenBank, and not use EMBL. I know
: this is a sad thing to say, but as Massimo has also found out, it just
: doesn't seem as up-to-date. I don't know which side of the Atlantic the
: problem is on: GenBank not sending information on, or EMBL not using it.
Just as a matter of fairness: blaming anyone on the basis of examples won't
help. Deducing that EMBL is bad is presumably a valued view in the States, but
if I claimed that GENBANK is bad, the US wouldn't tolerate it either :-)
The point is that the two talk to each other by computer. Parsing
programs are a mess, in particular because the two databases don't agree
entirely on their formats; i.e. parsing extends to mapping and voila - there
are problems. The following example is an anecdote I just came across, where
both databases have a duplicate.
The entry GGCOL8 in EMBL (15 April 1994, updated 20-April 1994) is LOCUS
CHKC1A206 (1-Jun-1984) in GENBANK. However, entry GGCOL8 (9-Jun 1982, updated
6-Jul 1989) with LOCUS GGCOL8 (6-Jul 89) differs only in having one more
Accession number; J00827 is the first and V00400 the additional one.
Both entries refer to the journal article Cell 22, 887-892 (1980) - i.e., EMBL
seems to take 14 years before they make a mistake, whereas GENBANK takes only
5 years.
The sequences are entirely identical. I use the GENBANK format here to show
the differences:
< AUTHORS Yamada,Y., Avvedimento,E.V., Mudryj,M., Ohkubo,H., Vogeli,G.,
> AUTHORS Yamada,Y., Avvedimento,E., Mudryj,M., Ohkubo,H., Vogeli,G.,
< amplification of a dna segment containing an exon of 54 bp
> amplification of a DNA segment containing an exon of 54 bp
I guess (or at least hope) that these are not the reason for the duplication.
The problem comes in the feature table! For the sake of completeness I use
EMBL format here, in truncated form:
FT   intron          <1..8            | FT   source          1..70
FT                   /note="collagen  | FT                   /organism="Gallu
FT   prim_transcript <1..>70          | FT   CDS             9..62
FT                   /note="collagen  | FT                   /note="exon 6"
FT   exon            9..62            |
FT                   /number=42       |
FT                   /note="collagen  |
FT   exon            9..62            |
FT                   /note="collagen  |
FT                   putative"        |
FT   intron          63..>70          |
FT                   /note="collagen  |
FT   source          1..70            |
FT                   /organism="Gallu |
One entry says
/note="collagen helipeptide, exon 42 (AA 37 to 54);
and the other says
--- from the SAME reference.
One entry has a CDS, the other has none. Why? Simply because one entry
annotates 9 to 62 as CDS, and the other annotates it as mat_peptide
(GENBANK format again):
< mat_peptide 9..62
< /note="collagen helipeptide, 2 (AA 37 to 54)"
> CDS 9..62
> /note="exon 6; NCBI gi: 63306."
So what is the difference between a mat_peptide and a CDS?
The gbrel.txt from release 82 tells us
CDS Sequence coding for amino acids in protein (includes stop codon)
mat_peptide Mature peptide coding region (does not include stop codon)
and ftable.doc from EMBL
CDS Sequence coding for amino acids in protein
exon Region that codes for part of spliced mRNA
OK, so much for the documentation; but GENBANK's precise definitions are
contradictory here, as one is _with_ and the other _without_ stop codon. Well,
it isn't quite so, as the last three nucleotides code for D, as stated in
the translation: /translation="GPQGPRGPPGPPGKAGED"
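
The same point can be checked by plain arithmetic. A minimal Python sketch,
assuming only the span 9..62 and the /translation string quoted above (the
variable names are illustrative, not taken from either database):

```python
# Span and translation as quoted in the two entries above.
start, end = 9, 62
translation = "GPQGPRGPPGPPGKAGED"

# Feature locations are 1-based and inclusive on both ends.
span_length = end - start + 1
codons, remainder = divmod(span_length, 3)

# A span that carried a stop codon would hold exactly one codon more
# than there are residues in the translation.
includes_stop = codons == len(translation) + 1

print(span_length, codons, remainder, len(translation), includes_stop)
```

The span holds as many codons as the translation has residues, so there is
no room for a stop codon in 9..62, whichever feature key is used.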
I haven't analyzed this systematically, but I am afraid that inconsistencies
like this make the database providers' lives difficult. As human intervention
is extremely expensive (manpower) and we (customers) don't want to pay, the
prediction that it will get worse in the future is a safe guess.
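
Conflicts of this kind can at least be flagged mechanically. What follows is
a minimal Python sketch, not anything either database actually runs. It
assumes the feature tables have already been reduced to (key, location)
pairs; the pairs are transcribed by hand from the excerpts above, and the
function names are hypothetical.

```python
from collections import defaultdict

# (feature key, location) pairs copied by hand from the excerpts above.
# entry_a combines the EMBL-format table with the mat_peptide line from
# the GENBANK-format diff; entry_b is the second entry.
entry_a = [("intron", "<1..8"), ("prim_transcript", "<1..>70"),
           ("mat_peptide", "9..62"), ("exon", "9..62"),
           ("intron", "63..>70"), ("source", "1..70")]
entry_b = [("source", "1..70"), ("CDS", "9..62")]


def keys_by_location(features):
    """Map each location string to the set of feature keys annotated there."""
    table = defaultdict(set)
    for key, location in features:
        table[location].add(key)
    return table


def key_conflicts(a, b):
    """For locations present in both entries, list the keys unique to each."""
    table_a, table_b = keys_by_location(a), keys_by_location(b)
    conflicts = {}
    for location in sorted(set(table_a) & set(table_b)):
        only_a = table_a[location] - table_b[location]
        only_b = table_b[location] - table_a[location]
        if only_a or only_b:
            conflicts[location] = (sorted(only_a), sorted(only_b))
    return conflicts


conflicts = key_conflicts(entry_a, entry_b)
for location, (only_a, only_b) in conflicts.items():
    print(location, only_a, only_b)
```

Comparing location strings literally is a deliberate simplification; it only
catches spans written identically in both entries.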
You rely on BLAST searching?
Fine. I used the peptide described above and searched the 'nr' dataset,
which we do in-house on all protein databases available.
The entry scoring
Score = 108 (49.3 bits), Expect = 1.1e-08, P = 1.1e-08
Identities = 18/18 (100%), Positives = 18/18 (100%)
if looked up in the result, is located at position 8 (as the only
entirely matching entry - other, irrelevant matches lead the score) and
does NOT occur in either the SWISSPROT or the PIR database, but only in PATCHX
(Pfeiffer, MIPS Martinsried). Entry: patchx:M25963. There, we read:
DEFINITION Chicken alpha-2 collagen gene type I gene, exons 13-15
ORGANISM Gallus gallus
AUTHORS Boedtker,H., Finer,M. and Aho,S.
TITLE The structure of the chicken alpha-2 collagen gene
JOURNAL Ann. N. Y. Acad. Sci. 460, 85-116 (1985)
CDS join(M25956:1548..1617,M25956:3513..3523,
M25956:4131..4148,M25956:4783..4818,M25959:182..265,
M25961:205..261,M25962:609..653,M25962:755..808,
M25962:1118..1171,M25962:1539..1592,M25962:2078..2131,
M25962:2345..2398,6..50,287..340,439..483) /partial
/note="alpha-collagen type I;; NCBI gi: 211605."
Note that we are now talking about entry M25963, which exists in both EMBL and
GENBANK versions, and that this is exons 13-15, whereas the original source
talked about exon 42, and exon 6, respectively.
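
To see which parts of that CDS lie in M25963 itself, the join() location can
be taken apart. This is a minimal Python sketch under stated assumptions: the
location is copied from the record above with the range separator written as
"..", the regular expression covers only the plain start..end segments used
here, and parse_join is a hypothetical helper, not a library function.
Segments without an accession prefix belong to the entry that carries the
feature.

```python
import re

# The CDS location quoted above, with ".." as the range separator.
location = ("join(M25956:1548..1617,M25956:3513..3523,"
            "M25956:4131..4148,M25956:4783..4818,M25959:182..265,"
            "M25961:205..261,M25962:609..653,M25962:755..808,"
            "M25962:1118..1171,M25962:1539..1592,M25962:2078..2131,"
            "M25962:2345..2398,6..50,287..340,439..483)")

# Optional "ACCESSION:" prefix, then start..end.
SEGMENT = re.compile(r"(?:(\w+):)?(\d+)\.\.(\d+)")


def parse_join(text, local="M25963"):
    """Return (accession, start, end); unprefixed segments get `local`."""
    return [(acc or local, int(start), int(end))
            for acc, start, end in SEGMENT.findall(text)]


segments = parse_join(location)
total_length = sum(end - start + 1 for _, start, end in segments)
local_segments = [(start, end) for acc, start, end in segments
                  if acc == "M25963"]

print(len(segments), total_length, local_segments)
```

Only the last three segments are local to M25963; the other twelve point
into M25956, M25959, M25961 and M25962.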
A DNA comparison reveals:
Ggcol8 x M25963 May 8, 1994 10:23 ..
. . .
     15 CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA 48
        || ||| |||| |||| |||||||     |||||
    284 CAGGGTGCTCGCGGTCTCCCTGGTGAGAGAGGAA 317
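
For anyone who wants to put a number on that comparison, here is a minimal
Python sketch. It assumes only the two aligned fragments exactly as printed
above, with no gaps, and counts identical columns; the variable names are
illustrative.

```python
# The two aligned fragments as printed above (ungapped).
top = "CAAGGTCCTCGTGGTCCCCCTGGTCCTCCAGGAA"      # Ggcol8, positions 15..48
bottom = "CAGGGTGCTCGCGGTCTCCCTGGTGAGAGAGGAA"   # M25963, positions 284..317

# An ungapped comparison only makes sense for fragments of equal length.
assert len(top) == len(bottom)

# Rebuild the match line: "|" where the bases agree, a blank otherwise.
match_line = "".join("|" if a == b else " " for a, b in zip(top, bottom))
identities = match_line.count("|")

print(top)
print(match_line)
print(bottom)
print(f"{identities}/{len(top)} identical")
```

That is 25 identical positions out of 34 over this stretch.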
Oh well, interesting... Why don't you try a BLAST at home and see ?
... on DNA?
I think we all agree that databases are non-optimal. On the other hand,
if you see those guys working, they don't seem lazy, nor do they enjoy
being reminded that they produce low-quality data. (I won't talk
about proteins here, but the situation there is even worse.) The data need
We could spend another XX M$ on both sides of the Atlantic to have a
staff of workers clean up the past, and cope with the flood of the future.
But still, this wouldn't help. I think that there's something seriously
wrong with the responsibilities. The researchers don't do what they should,
namely take care of their own entries or areas, and correct the entries as
appropriate.
And, for the future, the genome projects should adopt slightly more
responsibility for what they produce. Just dumping thousands of low-quality
data entries, generated by robots, into the databases and complaining
afterwards doesn't help. The funding agencies must understand that a genome
project is USELESS (read: wasted money) if the data are not integrated well
into the data sets. The coordinators of the projects must refrain from cooking
their own little databases, as they complain the loudest about the inability
of the general database providers. We certainly don't need hundreds of small
databases, but rather one set which is complete and of high quality.
Who are 'We' that we tolerate these duplications without doing something
ourselves? A change in culture is needed.
| Dr. Reinhard Doelz | Tel. x41 61 2672247 Fax x41 61 2672078 |
| Biocomputing | electronic Mail doelz at urz.unibas.ch |
|Biozentrum der Universitaet+-------------------------------------------+