The EBV problem promised -Hope it's not too hellish!
HUENSD at vax1.computer-centre.birmingham.ac.uk
Mon Feb 18 18:12:00 EST 1991
As requested, here is the info on a hopefully interesting problem in the
application of computing techniques to finding a possible coding sequence
in a region. Sorry that the description is a bit long but I hoped to give
a description of the problem without all the jargon used in EBV research
(Every group has a different name for similar things! Impenetrable or what?)
EBV is a herpesvirus that infects both B-lymphocytes and some epithelia. It is
strongly associated with endemic Burkitt's lymphoma, nasopharyngeal carcinoma
and is the causative agent of infectious mononucleosis (glandular fever). Up
to 95% of humanity is infected, mostly asymptomatically.
The virus has a number of modes of existence. It can infect B-lymphocytes and
when it does so it can be in passive latency, where it appears to express only
the EBNA-1 protein (HS4 coordinates 107950-109872) which is required for
episomal maintenance. In this state, it is invisible to the immune system. It
can also be in active latency when it expresses 8 proteins which, at least in
vitro act to immortalise and transform the B-lymphocyte. Six of these are
transcribed off two promoters, one at 11305 and the other in the BamHI-W repeats
at 14352, 17424, etc. Very complex splicing to distant ORFs results in formation
of mRNAs resulting in these distinct proteins. The other two proteins (LMP and
TP) are transcribed from a separate set of promoters.
When infecting epithelia, a different set of genes are expressed. EBNA-1 is
present but the two promoters (11305 and 14352 etc) are inactive. As EBNA-1
is transcribed from these promoters in active latency, it must be produced
from another promoter here. Evidence suggests this lies left of 62461. LMP
and TP are often present using the same promoters as in lymphocytic infection.
An additional 4.8 kb message exists that runs over the region spanning
The virus can enter lytic cycle in both lymphocytes and epithelia and about
a hundred genes are eventually transcribed in this phase to produce the viral
particle. Most of these genes are intronless.
In epithelial infection resulting in nasopharyngeal carcinoma, an abundant novel
4.8 kb message exists. Incomplete cDNAs have been cloned for this message. Their
content is described below:-
(N.B. - The reference sequence for the virus in HS4 is based on the B95-8
prototype isolate. This isolate has a number of variations chief of which is a
deletion at 152012/152013 in the region of interest. Fortunately, the deleted
region has since been sequenced from the Raji isolate and deposited by the
original authors in HS4RAJI. There is some uncertainty as to the exact sequence
at the border of the deletion.)
The cDNAs were isolated from a library of the NPC C15 cell line which has small
but significant sequence differences. Not all sequence info was divulged in the
original publication . The following was reconstructed:-
Transcript runs rightward (toward higher sequence no.)

5' most extent of available clones: at least as far as 9231 in HS4RAJI.
    (based on mapping in . Longest clone not actually completely sequenced)

1st observed intron - start: 10630 in HS4RAJI.
                      end:   155724 in HS4.
2nd observed intron - start: 157184 in HS4.
                      end:   Not given in the publication. Sequence variation
                             made it difficult to place in the HS4 sequence.
3rd observed intron - start: 157386 in HS4.
                      end:   159083 in HS4.
poly-A:                      161013 in HS4.

An intron observed by one group but not the other: 160068-160238 in HS4. See .
    (Other group inferred sequence of this region.)

Reported sequence variation between C15 isolate and prototype strain
in this region:
    Deletions relative to B95-8: 155730 in HS4.
                                 156074 in HS4.
Proposed ORFs in message by authors :
1 ATGGCCGGAG CTCGTCGACG GGCAAGGTGC CAGCGTCAGC AGGATGCGCC
51 TATAGCGCCC GGCCTCCTCC CCTGTCGACC AGAGGACGCA GGATATCTGC
101 AGGATCAGGT CAGCCTCGTT GGTGGCCGTG GGGAAGCCCT CCTCCCCCAG
151 ACACTCGATA TCGAAGGCCA GGGCCTGGTA GGAGGGCCAG GAGCTGTCTT
201 CACGCCGGAC CGAGAGGTCG CCCACCTCAC AGTCGTACTC GAGCTCGGCG
251 TACGAGTCCC GGTGCTGGAG GCGGGGGATG GCGCGGCGGC AGCTGTACCA
301 GCCAAAGGTG ACAAAGTCAT TGTCCAGGAC AAAGCGGCGC GTGGCATCCA
351 CGTTGGCCTC AAAGATCCGA CACCGTGCTT GTCTTGCAGC CACGTGGCCA
1. This frame is only possible because of the C15 deletion relative to
B95-8 at 155730 which generates a frameshift to extend the ORF. However,
the deletion at 156074 in HS4 truncates this ORF by approximately 100 residues.
1 TGAGCCCCCG GGTACGCTGT AGAAGCTGTT GAAGGAGGTC TCTATCCAGT
51 CGCTCGGCTC GATGCCTGGC CATATCAGGG AAGTCAGGAA TGCCTTCTGG
101 TGGGGCAGCG TACCTGCGGC GTCACAGCAG CGAGCCAGGG CCACGTTGCT
151 GGGTGGGGGA AAGAGCCCGC TCTCCTCCGC CAGGGGCCCC GTGATGAAGG
201 TGTACAGGCT GTGCGTCAGC GCGTGCAGGT GCTCCGAGCT CAGGGTCTGG
251 GTAAACAGGT GTGTTTTGAT GTACTTGGAA TTCTCAAAGG CGGCACCCTC
301 GCCGGCGCGC CTGTCCTCCC AGGGACCCGA GACGAAGGCC CGTCTGTAGA
351 GGAAGTGGTT GCGCATGCGG GCCAGCTCCC AGTAGACCAC GTCCCCCCAG
401 ACGCGCAGGC ACAGGGTCTC GGTCAGGGTC TCGCTCTGTT GCGCCAGGCA
451 GGACTGCAGC TTGGCCAGAC CCTCGGTGGC CACCTGGCGC AGGTACTGCT
501 CCTTGCGCTT GAGCGCGTCC GAGAGGGCGC CGGACGGGCC GGGCTCTCGT
551 GCCCCAGCCG GCCGGGGCAC CTCCGGGCTC TCCCGGGACG CCTCCTCCTC
601 GCCTCGGCCC AACCGCTGCA TGGCTCGGTT GAGCCGCGTG TACAGCTCGT
651 TCCTCTTTTG CAGGATGGCC CGGTACTGGG GGTGCGCCGT GAAGGCGGCG
701 GCGCAGTCCG CCTTCAGCGC CTCCACCGCG TCGCCCGAGG AGCTGTAGAC
751 CCCGCCGCAG AAGAGCCGCT CCGTGGCCCC GGGAGCCACG GCGTCAAACA
801 GGTGAGTCAG CCTTGCCCCC GCCAGCGCCT CCTCGCAGGC CCCCCGCACC
851 AGGGCCAGGC GACGCTCCCG GGCAAACAGG GCAGAGAGGC GGGAATGGCC
901 GCCACCCTCC CCCTGCCCCG TTGCACCGAT AGCATGGCCG CCAGAGTTCC
951 AATAGAGGAG CTCCGAGAGC TCCGCCACCT CCGGGGGCAC TGTCGAGAAG
1001 ACGTTGTAGG TGTCCAGCGC TCTGGTCGCC CCCTCTGCCT CCGGCCGCCC
1051 CGGGCCCGGG ACCGCGCCCT CCTCTGGGCC GCCCGGCCTC GCCTTCTCCT
1101 CAGCCTCCAA CAGGTGCCCG AGCCCCGCCT GGCGGACTTC ATTCTCAAAC
1151 AGTCCCGAGA CCGGCTCCGG ATTCACCGGC ACCGCCAGGT GGTTACAGGA
1201 GACGTGGGTC CCCTCTGCCG TGGAAGGGTT GCCGTGGTTG GGCAGAACCA
1251 TCAGCTCGCC CACACAGCGC CAGCAGGGCA CAGAGGTGAT GTAGAGGCGC
1301 GGGTCTGGGA TGGGACTTAC GCCCCGAAAG CGGCCCAGCA GATCCAGGGC
1351 CCGTTCCAGG CTCTCCAGCC CCATGGTGTG AGACATGCAA TAAAACACGC
1401 TATTGATTCT CTTCATTAA
1. This region doesn't have an ATG until some two-thirds of the way thru'!
One suggestion is that a non-AUG start is used.
2. Optional splice described above occurs with intron spanning 493-663.
This splice is in-frame and removes a number of residues from the ORF.
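
The two notes above can be checked mechanically. The following is only a minimal Python sketch: the function names are hypothetical, and it assumes the intron coordinates quoted in this post are inclusive (first and last intron base), which the publications may or may not intend.

```python
def intron_is_in_frame(start, end):
    """Return (length, in_frame) for an intron given inclusive coordinates.

    Assumption: both start and end are intron bases (inclusive numbering).
    """
    length = end - start + 1
    return length, length % 3 == 0


def first_in_frame_atg(seq, frame=0):
    """Return the 1-based position of the first ATG in the given frame, or None."""
    # Drop the blanks used to group the sequence into blocks of ten.
    seq = "".join(seq.upper().split())
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] == "ATG":
            return i + 1
    return None


# Optional intron in ORF-relative coordinates, then in HS4 coordinates.
print(intron_is_in_frame(493, 663))
print(intron_is_in_frame(160068, 160238))

# Toy sequence (not viral data): codons are TGA GCC ATG CCC.
print(first_in_frame_atg("TGAGCC ATGCCC"))
```

Under the inclusive assumption the optional intron is 171 nt in both coordinate systems, i.e. a whole number of codons, which agrees with the statement that the splice is in-frame. Running `first_in_frame_atg` on the full ORF text (pasted without the position numbers) would show where the first in-frame ATG falls.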
As mentioned above, the prototype sequence has a deletion that has since been
sequenced in another isolate. This deletion actually removes a lytic origin
of replication that is almost identical to the one at the other end of the
genome. Region 3565-4609 in the HS4RAJI sequence is virtually identical to
52654-53697 in the HS4 sequence. I mention this because the origin of
replication on the left end is known to be a promoter/enhancer region and has
bidirectional
transcripts extending away from it. However, while the leftward promoter of
the origin is maintained in the HS4RAJI sequence, the homology at the righthand
end ends some tens of nucleotides before where the rightward promoter is
expected to be placed.
The sequence variation observed between the various isolates may or may not
be real. The trouble is that the sequence is based on the B95-8 isolate which
has been passaged as an actively latent cell line for around 30 years. The
above region is not transcribed in this lymphoid line. Raji has also been
maintained similarly for almost as long. The cDNA was cloned from the C15
NPC tumour line which at least expresses this region though it too has been
on the go for a long time - this time as a nude mouse xenograft.
(When the genome project gets going, I hope they won't sequence DNA
obtained from cell lines. I suspect any gene not transcribed in a particular
tissue may well undergo changes on prolonged passaging as a cell line with
attendant problems described here - the EBV episomal maintenance system has
perhaps the highest fidelity of all viral episomal maintenance systems known).
Also, if the mRNA is truly 4.8 kb as claimed, then the mapping data would
appear to suggest that the inferred sequence is virtually full length
(maybe even too long!). The actual sequenced stuff extends 1.3 kb less
in 5' direction.
One other point is that much of the 3' end of this region is littered with
virtually back-to-back leftward ORFs that are known to be expressed as protein.
This includes virtually the entire extent of the sequenced part of the mRNA!
If codon preference methods are to be used, there are clear differences in the
codon usage of genes expressed in latency and those expressed in lytic cycle.
Latent messages tend to have a high AT content at the third codon position
(see Karlin et al., 1990, in the references).
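
For anyone wanting a quick look at third-position composition before applying a full codon preference method, here is a rough Python sketch. The function name is hypothetical, and it measures only the fraction of codons ending in A or T. It is not the method used in the codon usage paper.

```python
def third_position_at(orf):
    """Fraction of complete codons whose third base is A or T."""
    # Remove blanks and newlines used for display grouping.
    orf = "".join(orf.upper().split())
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        raise ValueError("sequence shorter than one codon")
    at_count = sum(1 for codon in codons if codon[2] in "AT")
    return at_count / len(codons)


# First 30 nt of the first proposed ORF as listed in this post.
print(third_position_at("ATGGCCGGAG CTCGTCGACG GGCAAGGTGC"))
```

Thirty nucleotides are far too few to say anything about the ORF. The call is there only to show the input format, and the whole ORF should be used in practice.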
When I refer to HS4 coordinates here, I assume the presence of the entire
genome in one file. GCG users will have this split into two files, HS4 and
HS4-2 with the first 100000 bp in first file.
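
For GCG users, the conversion can be sketched as below. This is a minimal Python example with a hypothetical function name. It assumes HS4-2 is numbered from 1 beginning at genome position 100001, which should be checked against the local copy of the files.

```python
SPLIT = 100000  # the first file holds bp 1..100000, as stated above


def to_gcg(coord, split=SPLIT):
    """Map a 1-based whole-genome HS4 coordinate to (file name, local coordinate).

    Assumption: HS4-2 restarts its numbering at 1 for genome position split + 1.
    """
    if coord < 1:
        raise ValueError("coordinates are 1-based")
    if coord <= split:
        return ("HS4", coord)
    return ("HS4-2", coord - split)


print(to_gcg(62461))   # a coordinate in the first file
print(to_gcg(161013))  # the poly-A position, in the second file
```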
1. Where is the 5' end of message (or at least of 5'-most exon)?
2. Are any of the proposed ORFs reasonable given their location so far from
5'end of mRNA? cLF1 is around 1.3 kb away, cBALF1 much further.
3. Given possible frameshift errors, is there a plausible reconstruction at
5' end that will give a more reasonable message?
4. What is the likelihood that this message doesn't encode protein at all
but just serves either as an anti-sense transcript or just as a doggone
no-good message designed to confuse molecular biologists!?
I will be meeting the people who actually did the work next month so I may be
able to provide the answers to the above questions as approached by bench methods.
They have had over a year to work since they published this stuff.
Hitt,M.M., Allday,M.J., Hara,T., Karran,L., Jones,M.D., Busson,T., Tursz,T.,
Ernberg,I., Griffin,B.E. (1989) EBV gene expression in an NPC-related
Tumor. EMBO J. 8:2639-2651.
Tursz,T., Raab-Traub,N. (1990) Novel Transcription from the Epstein-Barr
Virus Terminal EcoRI Fragment, DIJhet, in a Nasopharyngeal Carcinoma.
J. Virol. 64:4948-4956.
Karlin,S., Blaisdell,B.E., Schachtel,G.A. (1990) Contrasts in Codon Usage
between Latent versus Productive Genes of Epstein-Barr Virus. J. Virol.