Vector contamination and the sequence databases

reisner at ee.su.oz.au
Sun Jan 26 22:51:16 EST 1992


	Recently [Nature 355(1992)211; Jan.16] Lopez, Kristensen and 
Prydz 'examined sequence data from ... the EMBL sequence database ... 
[and] found 78 sequences with a total of 81 occurrences of vector 
sequences not documented in the features table.'  They go on to point 
out that 41 of these entries were submitted during 1990-91 (through 
release 27; May 1991).

	This 'Scientific Correspondence' is not the first time the issue 
has been raised.  The fact that it is still a problem, and one that 
may grow in seriousness as more of the large sequencing projects come 
on stream, suggests that it is worthy of comment and discussion on the 
network, by all interested parties and most particularly by the 
managers of the large publicly accessible databases.  The following 
two paragraphs are the suggestions the authors put forward in their 
'Scientific Correspondence'.

Alex Reisner
Australian Genomic Information Centre

	'How can we avoid errors like these finding their way into the 
databases?  Most sequence software packages contain some sort of 
vector screening.  Thus a simple screening of raw sequence data before 
assembly, as well as screening of assembled sequences against vector 
databases before submission should be a simple task.  Another 
possibility is that database administrators should screen all 
submitted sequences against the vector database.  But is it part of 
their job to function in this way?  One thing they could do is to add 
a question to the sequence submission forms: "Has this sequence been 
checked for the presence of vector sequences?".  Then the submitters 
would at least have been made aware of the possibility of 
contamination of the data.  Also, it probably would be useful if 
submitters were encouraged to include the vector used in the feature 
table.

	'What should be done to contaminated sequences [already in the 
databases]?  First, such sequences constitute a very small percentage 
of the total database.  In many cases, it would be enough to include 
the presence of vector data in the feature table of the sequence.  In 
several instances, we presume that complete removal of the sequence 
from the database would be the best line of action.  Again it is 
understandable if the database administrators are reluctant to perform 
this cleaning-up: preferably, the submitters themselves should be made 
aware of the doubtful nature of their submissions and given the 
opportunity to rectify the data or withdraw their submission. (A list 
of the contaminated sequences we found is available on request from 
us, and has been submitted to the EMBL Data Library.)

Rodrigo Lopez*
Tom Kristensen
Hans Prydz

The Biotechnology Centre of Oslo
University of Oslo
P.O. Box 1125
Blindern N-0316, Oslo Norway

*The Norwegian EMBLnet node'
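
	As a rough illustration of the kind of pre-submission screening the 
authors suggest, the sketch below flags regions of a sequence that share 
exact k-mers with a known vector, on either strand.  It is a minimal 
sketch only, not the method Lopez et al. used: the function names, the 
k-mer length of 12 and the toy vector sequence are all assumptions made 
for illustration, and a real screen would use an alignment program that 
tolerates mismatches and a proper vector database.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, int, int]  # (vector name, strand, start, end), 0-based, end exclusive


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_vector_hits(query: str, vectors: Dict[str, str], k: int = 12) -> List[Hit]:
    """Report stretches of `query` that share exact k-mers with any vector.

    Hypothetical helper: exact matching only, so it will miss vector
    fragments carrying sequencing errors shorter than k bases apart.
    """
    query = query.upper()

    # Index every k-mer of every vector, on both strands.
    index: Dict[str, set] = {}
    for name, vseq in vectors.items():
        vseq = vseq.upper()
        for strand, s in (("+", vseq), ("-", reverse_complement(vseq))):
            for i in range(len(s) - k + 1):
                index.setdefault(s[i:i + k], set()).add((name, strand))

    # Scan the query and merge overlapping or adjacent k-mer matches.
    open_spans: Dict[Tuple[str, str], List[int]] = {}
    hits: List[Hit] = []
    for i in range(len(query) - k + 1):
        for key in index.get(query[i:i + k], ()):
            span = open_spans.get(key)
            if span is not None and i <= span[1]:
                span[1] = i + k          # extend the current region
            else:
                if span is not None:     # close the previous region
                    hits.append((key[0], key[1], span[0], span[1]))
                open_spans[key] = [i, i + k]
    for key, span in open_spans.items():
        hits.append((key[0], key[1], span[0], span[1]))

    return sorted(hits, key=lambda h: (h[2], h[0], h[1]))


if __name__ == "__main__":
    # Toy data, invented for this example; not a real cloning vector.
    toy_vectors = {"toy": "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"}
    contaminated = "A" * 10 + toy_vectors["toy"][5:25] + "C" * 10
    for name, strand, start, end in find_vector_hits(contaminated, toy_vectors):
        print(f"possible vector '{name}' ({strand} strand) at {start}..{end}")
```

The coordinates returned by such a screen are the sort of information 
that could either be trimmed before submission or recorded in the 
feature table, as the letter proposes.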
