Different entry is same sequence (Re: same sequence is different in EMBL and GENBANK)

Ingrid Jakobsen ingrid at helios.anu.edu.au
Tue May 17 01:44:02 EST 1994

I hope I am just being paranoid this week, but what you wrote feels 
somewhat like a flame, and I feel I have to defend myself in a few
places. My apologies if this is all old hat now, but this post only 
arrived at our site yesterday:

In article <1994May9.131929.1805 at comp.bioz.unibas.ch>, doelz at comp.bioz.unibas.ch (Reinhard Doelz) writes:
|> My apologies... this is a bit long but try to read it carefully.

I have done that, and I get the feeling you didn't do me the 
courtesy of reading what I wrote carefully. My post was much shorter.

|> Ingrid Jakobsen (ingrid at helios.anu.edu.au) wrote:
|> : I have also seen duplicate entries eliminated from GenBank, but kept
|> : on EMBL, and sequences withdrawn because corrections showed them to be
|> : identical to previous sequences, but retained on EMBL.
|> : So my solution in general is to stick to GenBank, and not use EMBL. I know
|> : this is a sad thing to say, but as Massimo has also found out, it just 
|> : doesn't seem as up-to-date. I don't know which side of the Atlantic the
|> : problem is on: GenBank not sending information on, or EMBL not using it.
|> Just as a matter of fairness, blaming anyone on examples won't help, and

I didn't blame anyone based on examples. I concluded that EMBL was less
reliable than GenBank based on examples, sure, but I was careful to say
that I didn't know where the blame actually lay.

It may be that EMBL overall is more reliable than GenBank. In that case,
I have seen a very unrepresentative sample, because in every case I have
seen, the GenBank entry is "better" - more recently corrected or whatever.

I didn't make this decision based on some idea that GenBank ought to be
better; I went and chased down the original references.

|> deducing that EMBL is bad is presumably a valued view in the states but 
|> if I claim GENBANK is bad the US wouldn't tolerate it either :-)

I should point out that I am not located in the States; I am based in
Australia. Most Australian researchers have no idea which database their
"loyalties" should be with: EMBL, GenBank or DDBJ. I went into this
with absolutely no opinion either way. As I mentioned, I reached my
personal decision after considerable running to and from the library.

I would also like to point out that I find the current situation with
two US databases ridiculous, and I admire the fact that numerous countries
can co-operate on EMBL. This is much more sensible in my opinion. 

|> It is that the both talk to each other on computer basis. Computer parsing 
|> programs are a mess, in particular as both databases don't agree entirely 
|> on their formats; i.e. parsing extends to mapping and voila - there are 
|> problems. The following example is an annecdote I just came accross where 
|> both databases have a duplicate. 
|> The entry GGCOL8 in EMBL (15 April 1994, updated 20-April 1994) is LOCUS 
|> CHKC1A206 (1-Jun-1984) in GENBANK. However, Entry GGCOL8 (9-Jun 1982, updated
|> 6-Jul 1989) with LOCUS GGCOL8 (6-Jul 89) is only with one Accession number 
|> more; J00827 being the first and V00400 being the additional one. 
|> Both entries refer to a journal Cell 22, 887-892 (1980) - i.e., EMBL seems
|> to take 14 years before they do a mistake, whereas GENBANK takes only 5 years.

In other words, your opinion of the databases is also based on examples, or
in fact on only _one_ example. So why can't I base my opinion of the
databases on the number of examples I have seen? And I merely decided that
GenBank was more reliable than EMBL, whereas you seem to claim from one
example that GenBank generally takes five years to make mistakes.

I am prepared to believe that computer parsing may be the problem; I have
no experience with it. It just suggests to me that we are looking at a
big waste of resources, with each side trying to duplicate the work of the
other. Unfortunately, I don't think the problem can be dealt with, as I
don't think either side is going to let the other become the major database.

|> I haven't analyzed this systematically but I am afraid that inconsistencies 
|> like this make database provider's life difficult. As human intervention
|> is extremely expensive (manpower) and we (customers) don't want to pay the 
|> prediction that it will become worse in the future is a safe guess. 

I can only agree with this. But consider the expense of the "manpower" lost
as thousands of researchers waste their time on faulty database entries, as
exemplified by Massimo Delledone, whose bad experience started the whole

|> You rely on BLAST searching? 
|> Fine. I used the peptide as described above and seqrched the 'nr' dataset
|> which we do in-house on all protein databases available. 
|> The entry scoring 
|>  Score = 108 (49.3 bits), Expect = 1.1e-08, P = 1.1e-08
|>  Identities = 18/18 (100%), Positives = 18/18 (100%)
|> if looked up in the result, is located at position 8 (as the only 
|> entirely matching entry - other irrelevant matches lead the score)

The reason is that the entries were sorted by Poisson probabilities, 
rather than high scores, which is always a problem when searching with
short sequences. As far as I can remember, that option can be reset.

The leading matches incidentally are also collagen genes, which hardly
makes them "irrelevant". 

|> does NOT occur in either SWISSPROT nore PIR database, but only in PATCHX
|> (Pfeiffer, MIPS Martinsried). Entry: patchx:M25963 ; There, we read: 
|> LOCUS       CHKCOLA07
|> DEFINITION  Chicken alpha-2 collagen gene type I gene, exons 13-15
|> ACCESSION   M25963
|>   ORGANISM  Gallus gallus
|>   AUTHORS   Boedtker,H., Finer,M. and Aho,S.
|>   TITLE     The structure of the chicken alpha-2 collagen gene
|>   JOURNAL   Ann. N. Y. Acad. Sci. 460, 85-116 (1985)
|>      CDS             join(M25956:1548. .1617,M25956:3513. .3523,
|>                      M25956:4131. .4148,M25956:4783. .4818,M25959:182. .265,
|>                      M25961:205. .261,M25962:609. .653,M25962:755. .808,
|>                      M25962:1118. .1171,M25962:1539. .1592,M25962:2078. .2131,
|>                      M25962:2345. .2398,6. .50,287. .340,439. .483) /partial
|>                      /note="alpha-collagen type I;; NCBI gi: 211605."
|>                      /codon_start=1
|> Note that there's now talking on entry M25963, with both EMBL and GENBANK
|> versions, and this is exon 13-15, whereas the original source talked about
|> exon 42, and exon 6, respectively. 
|> A DNA comparison reveals. 
|>  Ggcol8 x M25963           May 8, 1994  10:23  ..
|>                   .         .         .
|>          || ||| |||| |||| |||||||     |||||
|> Oh well, interesting... Why don't you try a BLAST at home and see ? 
|> ... on DNA?

What are you trying to say here? I had a look at the DNA sequences, and it
is in fact M25962 CHKCOLA06 (which is exons 7 to 12, marginally closer to
exon 6) that contains the match to GGCOL8. The reason you get M25963 in
the protein search is that it is the entry that contains the amino
acid translation of all the entries listed under CDS above. PATCHX
(and incidentally GenPept) uses this accession number because that is
the entry where the translation was put.
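
To make the CDS line easier to follow, here is a minimal sketch in Python
that splits a join(...) location like the one quoted above into its segments
and groups them by the entry that holds them. The helper name
segments_by_entry and the regular expression are illustrative assumptions,
not part of any database tool. Segments with no accession prefix are taken to
belong to the entry that carries the feature, here M25963.

```python
import re

# Location string as quoted above from entry M25963 (".." written normally).
CDS_LOCATION = (
    "join(M25956:1548..1617,M25956:3513..3523,"
    "M25956:4131..4148,M25956:4783..4818,M25959:182..265,"
    "M25961:205..261,M25962:609..653,M25962:755..808,"
    "M25962:1118..1171,M25962:1539..1592,M25962:2078..2131,"
    "M25962:2345..2398,6..50,287..340,439..483)"
)

# One segment: an optional "ACCESSION:" prefix followed by start..end.
# The pattern also tolerates the ". ." spacing seen in the quoted text.
SEGMENT = re.compile(r"(?:(\w+):)?(\d+)\.\s*\.(\d+)")


def segments_by_entry(location, local_accession):
    """Group (start, end) pairs by the entry that holds each segment."""
    grouped = {}
    for accession, start, end in SEGMENT.findall(location):
        # An empty prefix means the segment lies in the local entry.
        key = accession or local_accession
        grouped.setdefault(key, []).append((int(start), int(end)))
    return grouped


if __name__ == "__main__":
    for entry, spans in segments_by_entry(CDS_LOCATION, "M25963").items():
        print(entry, len(spans), "segment(s)")
```

Only the last three segments lie in M25963 itself; most of the coding
sequence sits in the other entries, M25962 among them.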

You can find M25962 and M25963 on EMBL too, with the same notation; EMBL
just doesn't provide the translation, which might be a good thing or
a bad thing. You decide...

|> ==========
|> I think we all agree that databases are non-optimal. On the other hand, 
|> if you see those guys working, they don't feel lazy, nor do they enjoy 
|> being reminded that they do produce low-quality data. (I won't talk 
|> on proteins here but the situation there is even worse). The data need
|> better MAINTENANCE! 
|> We could spend another XX M$ on both sides of the atlantic to have a 
|> staff of workers clean up the past, and cope with the flood of the future. 
|> But still, this wouldn't help. I think that there's something severely 
|> wrong with responsibilities. The researchers don't do what they should, namely 
|> take care of their own entries or areas, and correct the entries as appropriate.
|> And, for the future, the genome projects should adopt slightly more 
|> responsibility for what they produce. Just dumping thousands of low-quality
|> data entries to the databases, generated by robots, and complain afterwards
|> doesn't help. The funding agencies must understand that a genome project 
|> is USELESS (read: wasted money) if the data are not integrated well into the 
|> data sets. The coordinators of the projects must refer from cooking their 
|> own little databases as they comlain the loudest on the unability of the 
|> general database providers. We certainly don't need hundreds of small databases
|> but rather one set which is complete, and high quality. 
|> ?We ? 
|> Who are 'We' that we tolerate these duplications without doing something
|> ourselves? A change in culture is needed. 

I agree with this whole-heartedly. I think researchers should take far
more responsibility for their entries than they do at the moment. It is a
huge problem which I don't really want to go into here.

But I don't think your conclusion has anything to do with the issue being
discussed here. The problem was not authors failing to take
responsibility for their data, but rather problems between the two
databases. Many entries have problems with the notation (authors not
seeming to know which exon they have just sequenced is only one of them),
but despite that, in most cases the actual sequences held by both
databases are the same. What concerned Massimo and myself was that this is
not always the case.

I am heartened to see that Peter Stoer from EMBL
<1994May9.094735.171140@eros.embl-heidelberg.de> thinks it is a problem
worth fixing. Thank you.

