"vector" sequence in databases

JAB5 at VAXA.YORK.AC.UK
Tue Oct 22 04:24:00 EST 1991


Dear colleagues,
  I have been compiling database entries which contain "vector"
sequences for a while now, and would be grateful to be sent any
others which I may have missed (for instance, I have not yet been
able to face doing lambda sequences). I hope to submit a summary for
publication; I presented a talk on this to the Genetical Society
last April, which went down well.
  Most of them can be explained quite simply, whereas other classes are
more difficult to rationalise, and I believe that some (involved
with rearrangements and amplifications in certain cancerous cell lines)
could have more sinister implications.
All of these sequences can be analysed at various levels to provide
useful information. It is not (and I believe should not be) the brief of
staff at EMBL/GenBank to "correct" entries; they have their hands full as
it is. EMBL now highlight (most of) these entries with the qualifier
"putative vector sequence" in the features table. But as I have said, it
is difficult to explain some of these as simple cloning/sequencing
artefacts. I have been sequencing for 10 years now and am struggling
to analyse some of them. For example, short imperfect matches can be
statistically significant but difficult to spot. The following
2 examples are quite long matches for this class (although others are
over 2 kb!); note in particular the transposition of the same 4 bases
in the second example.

GTTTTTCCATACGTGCCGCCCCGCCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTT
*********** *  ******* *************************************** *
GTTTTTCCATAGGCTCCGCCCC-CCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGG-T

GGCGAAAGGG  from within coding region of a bacillus gene
*******
GGCGAAACCC  pBR322


CTTATAAATCAAAAGAATAG-CCGAGATAGGGTTGAGTGTTGTTAGCCTTTGGAAC-AGA HUMAN GENE
******************** ***********************    ******** ***
CTTATAAATCAAAAGAATAGCCCGAGATAGGGTTGAGTGTTGTTCCAGTTTGGAACAAGA M13
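
For anyone wanting to check alignments like the two above by machine rather than by eye, the following is a minimal sketch in Python. The function names (match_line, percent_identity) are my own illustrative choices, not part of any package, and it assumes the two sequences have already been aligned and padded with "-" to equal length. It only marks identical columns; it does not compute the alignment.

```python
def match_line(top, bottom, gap="-"):
    """Return a line with '*' under identical, non-gap columns of two
    pre-aligned sequences, and ' ' elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must be the same length")
    return "".join(
        "*" if a == b and a != gap else " "
        for a, b in zip(top.upper(), bottom.upper())
    )


def percent_identity(top, bottom, gap="-"):
    """Percentage of alignment columns (gaps included) that are identical."""
    line = match_line(top, bottom, gap)
    return 100.0 * line.count("*") / len(line)


# The second example from this message (human gene versus M13).
human = "CTTATAAATCAAAAGAATAG-CCGAGATAGGGTTGAGTGTTGTTAGCCTTTGGAAC-AGA"
m13 = "CTTATAAATCAAAAGAATAGCCCGAGATAGGGTTGAGTGTTGTTCCAGTTTGGAACAAGA"

print(human)
print(match_line(human, m13))
print(m13)
print("identity: %.1f%%" % percent_identity(human, m13))
```

On this pair, 54 of the 60 aligned columns are identical; the unmatched columns are the two gaps and the four transposed bases (AGCC against CCAG).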


It is the author's responsibility to analyse their sequences. Most
universities/institutes have JANET/INTERNET facilities, and even the most
remote or poor probably have PCs around. "Vector contamination" is
probably the simplest error to screen for. It is important for workers
also to search the REVERSE COMPLEMENT STRAND of their sequence against a
database. I believe that it is this simple omission which has caused
the publication of many sequences before they were ready.
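
To illustrate the point about the reverse complement strand, here is a rough sketch in Python. It is not a substitute for a proper database search program: it only looks for exact shared words between a query and one vector sequence, on both strands, and so would miss the short imperfect matches mentioned earlier. The names reverse_complement and shared_words, the word length of 12, and the made-up "insert" sequence are all assumptions for the sake of the example.

```python
# Complement table for unambiguous DNA bases (ambiguity codes not handled).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def shared_words(query, vector, word=12):
    """List (strand, query_pos, vector_pos) for every exact word of length
    `word` shared by the vector and either strand of the query.
    For strand '-', query_pos is a position in the reverse complement."""
    vector = vector.upper()
    index = {}
    for i in range(len(vector) - word + 1):
        index.setdefault(vector[i:i + word], []).append(i)

    hits = []
    query = query.upper()
    for strand, seq in (("+", query), ("-", reverse_complement(query))):
        for j in range(len(seq) - word + 1):
            for i in index.get(seq[j:j + word], ()):
                hits.append((strand, j, i))
    return hits


# M13 fragment quoted earlier in this message, used here as the "vector".
m13 = "CTTATAAATCAAAAGAATAGCCCGAGATAGGGTTGAGTGTTGTTCCAGTTTGGAACAAGA"

# A made-up query carrying 20 bases of the fragment in reverse orientation.
insert = "ACGTACGTAC" + reverse_complement(m13[:20]) + "TTGACCA"

for strand, qpos, vpos in shared_words(insert, m13):
    print(strand, qpos, vpos)
```

In this constructed case every hit is reported on the "-" strand, so a search with the sequence as typed in would find nothing.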
I hope that this has been of value, and I look forward to reading replies.

Jim Brannigan
jab at uk.ac.york.yorvic (Unix)
JAB5 at UK.AC.YORK.VAXA  (Vax)
0904-432566
0904-410519 (FAX)

Chemistry Dept.
University of York,
Heslington, York YO1 5DD UK.


