VECTOR CONTAMINATION IN DATABASE SEQUENCES
Recently [Nature 355 (1992) 211; 16 Jan.], Lopez, Kristensen and
Prydz 'examined sequence data from ... the EMBL sequence database ...
[and] found 78 sequences with a total of 81 occurrences of vector
sequences not documented in the features table.' They go on to point
out that 41 of these entries were submitted during 1990-91 (through
release 27; May 1991).
This 'Scientific Correspondence' is not the first time the issue
has been raised. That it remains a problem, and one that may become
more serious as more of the large sequencing projects come on stream,
suggests that it deserves comment and discussion on the network by
all interested parties, most particularly the managers of the large
publicly accessible databases. The following two paragraphs are
suggestions put forward by the authors in their 'Scientific
Correspondence'.
Alex Reisner
Australian Genomic Information Centre
_____________________________________________________________________
'How can we avoid errors like these finding their way into the
databases? Most sequence software packages contain some sort of
vector screening. Thus a simple screening of raw sequence data before
assembly, as well as screening of assembled sequences against vector
databases before submission should be a simple task. Another
possibility is that database administrators should screen all
submitted sequences against the vector database. But is it part of
their job to function in this way? One thing they could do is to add
a question to the sequence submission forms: "Has this sequence been
checked for the presence of vector sequences?". Then the submitters
would at least have been made aware of the possibility of
contamination of the data. Also, it probably would be useful if
submitters were encouraged to include the vector used in the feature
table.
'What should be done to contaminated sequences [already in the
databases]? First, such sequences constitute a very small percentage
of the total database. In many cases, it would be enough to include
the presence of vector data in the feature table of the sequence. In
several instances, we presume that complete removal of the sequence
from the database would be the best line of action. Again it is
understandable if the database administrators are reluctant to perform
this cleaning-up: preferably, the submitters themselves should be made
aware of the doubtful nature of their submissions and given the
opportunity to rectify the data or withdraw their submission. (A list
of the contaminated sequences we found is available on request from
us, and has been submitted to the EMBL Data Library.)
Rodrigo Lopez*
Tom Kristensen
Hans Prydz
The Biotechnology Centre of Oslo
University of Oslo
P.O. Box 1125
Blindern N-0316, Oslo Norway
*The Norwegian EMBLnet node'
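_____________________________________________________________________
As a rough illustration of the pre-submission screening suggested in
the correspondence quoted above, here is a minimal sketch in Python.
It is not the method used by Lopez et al. or by any particular
sequence package; it simply flags stretches of a query that share
exact k-mers with a known vector on either strand. The function
names, the toy vector, the toy query and the choice of k=12 are all
assumptions made for the example. Real screening would use a proper
vector database and an alignment tool that tolerates mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def kmer_index(vectors, k):
    """Map every k-mer found on either strand of each vector to the vector names."""
    index = {}
    for name, seq in vectors.items():
        seq = seq.upper()
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - k + 1):
                index.setdefault(strand[i:i + k], set()).add(name)
    return index


def screen(query, vectors, k=12):
    """Return (start, end, names) spans of the query built from vector k-mers.

    Coordinates are 0-based and half-open. Only exact k-mer matches are
    detected, so this is a coarse first pass, not an alignment.
    """
    index = kmer_index(vectors, k)
    query = query.upper()
    hits = []
    for i in range(len(query) - k + 1):
        names = index.get(query[i:i + k])
        if not names:
            continue
        if hits and i <= hits[-1][1]:
            # This k-mer overlaps or abuts the previous span: extend it.
            start, _, prev = hits[-1]
            hits[-1] = (start, i + k, prev | names)
        else:
            hits.append((i, i + k, set(names)))
    return hits


if __name__ == "__main__":
    # Hypothetical toy data, not a real cloning vector or database entry.
    vectors = {"toy_vector": "GATTACAGGCTTAACCGGTTAGCATCGA"}
    insert = "ATATATATATGCGCGCGCGCATATATATAT"
    query = vectors["toy_vector"][5:25] + insert  # vector left at the 5' end
    for start, end, names in screen(query, vectors):
        print(start, end, sorted(names))
```

Any span reported this way could be trimmed before submission or, as
the authors suggest, at least recorded in the feature table.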