Hello,
in the course of a statistical study of repetitive sequences,
I frequently encountered sequence stretches that originate from the multiple
cloning site and adjacent regions of pUC/pBR-type vectors.
To estimate the extent of this 'vector contamination' in EMBL and
GenBank, I ran a FASTA search against the EMBL primate section using the whole
of pUC19 as a probe and got more than 20 suspiciously high scores.
A closer look at the top scorers revealed that most of them show their
homology in the region adjacent to the multiple cloning site. In at least one
of the sequence entries found, the pBR-part is annotated.
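
For anyone who wants to check their own entries, here is a minimal sketch of
such a screen. It is not the FASTA algorithm used above, only a simple exact
k-mer match, and it assumes both sequences are already available as plain
strings. The function names, the word size k and the toy sequences are made up
for illustration; they are not real pUC19 or database sequences.

```python
def kmer_index(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index


def shared_stretches(vector, entry, k=12):
    """Return sorted (vector_start, entry_start, length) tuples for
    ungapped exact matches of at least k bases (0-based positions)."""
    vector, entry = vector.upper(), entry.upper()
    index = kmer_index(vector, k)
    finished = []
    open_runs = {}  # diagonal (vector_pos - entry_pos) -> [v_start, e_start, length]
    for j in range(len(entry) - k + 1):
        for i in index.get(entry[j:j + k], []):
            diag = i - j
            run = open_runs.get(diag)
            # Extend the run if this k-mer directly follows the previous one
            # on the same diagonal; otherwise close it and start a new run.
            if run is not None and run[1] + run[2] - k + 1 == j:
                run[2] += 1
            else:
                if run is not None:
                    finished.append(tuple(run))
                open_runs[diag] = [i, j, k]
    finished.extend(tuple(run) for run in open_runs.values())
    return sorted(finished)


# Toy data (hypothetical): an "entry" carrying 18 bases of the "vector".
toy_vector = "GATTACAGGCTCATGCCAATCGTTAGCGGATAC"
toy_entry = "CCCTTTAAA" + toy_vector[10:28] + "TTTAAACCC"
hits = shared_stretches(toy_vector, toy_entry, k=8)
print(hits)
```

The vector_start values tell you which part of the vector an entry matches, so
hits clustering next to the multiple cloning site show up directly. Exact
matching will miss stretches containing sequencing errors, so a real screen
would still need an alignment tool such as FASTA.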
I am not sure that the occurrence of vector sequences in primate genes is
desirable, and I think these stretches should perhaps be removed in future
releases. If anyone sees arguments for retaining them, feel free to
reply.
Best regards,
Kay Hofmann
P.S.
Here are my favourite pUC-containers (EMBL entry names):
hst1418
hshpv16c
hsgp3a11
hsgp3a19
hspk1
tstgl5
hscol140
hstrnsdu
hsalpl13
hstcpbsj
hsifnb3
hsmb1
hssatmyd
hsth22ma
hsprola
hsrnpc