anyone looked at long-range patterns in DNA sequence?

James W. Fickett jwf at temin.lanl.gov
Mon Mar 22 13:08:43 EST 1993


The following is a summary of a recent article,

Base Compositional Structure of Genomes
J.W. Fickett, D.C. Torney, and D.R. Wolf
Genomics 13, 1056-1064 (1992)

========================================

The large-scale structure of many genomes is currently under intense
scrutiny.  The prevailing model for the large-scale base compositional
structure of vertebrate genomes is the Isochore model (see, e.g., G.
Bernardi, Annual Review of Genetics 23, 637-661, 1989).  In this model the
genome is partitioned into large (on the order of megabases) regions called
isochores, the C+G content within each isochore is roughly constant, and the
transition from one isochore to the next is posited to be rather sharp.

We have recently shown (Genomics 13, 1056-1064, 1992) that the Isochore
model should probably be significantly refined.  We divided the human
genomic sequences in GenBank into successive, nonoverlapping 1000 base
windows, and recorded the C+G content of each.  We found that the C+G
content does tend to persist over tens of kilobases (that is, the
correlation coefficient for C+G content in pairs of windows tens of
kilobases apart is positive with high confidence).  On the other hand, we
also found that the variation in C+G content within the span of a few tens
of kilobases was such that human DNA cannot reasonably be modeled by common
homogeneous stochastic processes.  This rules out, again with very high
confidence, the most natural mathematical meaning of the statement that C+G
content is roughly constant within an isochore.
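
For readers who want to try the window analysis on their own sequence, the
following is a minimal Python sketch of the procedure described above:
successive, nonoverlapping windows, C+G content per window, and the
correlation between windows a fixed distance apart.  The function names
(gc_windows, pearson, lag_correlation) are illustrative only, and the sketch
handles a single contiguous sequence; it does not reproduce the pooling over
GenBank entries or the confidence estimates used in the paper.

```python
import math

def gc_windows(seq, size=1000):
    """C+G fraction of successive, nonoverlapping windows.

    A trailing partial window is dropped (an assumption of this sketch).
    """
    seq = seq.upper()
    return [
        sum(base in "CG" for base in seq[i:i + size]) / size
        for i in range(0, len(seq) - size + 1, size)
    ]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either list is constant
    return sxy / math.sqrt(sxx * syy)

def lag_correlation(values, lag):
    """Correlation between window values that are `lag` windows apart."""
    if lag < 1 or len(values) - lag < 2:
        raise ValueError("need at least two window pairs at this lag")
    return pearson(values[:-lag], values[lag:])
```

With 1000 base windows, lag_correlation(gc_windows(seq), 20) compares windows
20 kilobases apart; a positive value on long sequences is what persistence
of C+G content would look like in this simplified setting.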

Thus we propose a somewhat different style of model, in which C+G content,
instead of varying abruptly in large steps, varies slowly and essentially
continuously.  In mathematical terms, we propose that human DNA may
reasonably be modeled by a Walking Markov (WM) process, in which the local
C+G content is determined by a random walk in C+G content space, and the
sequence itself is generated by a process whose parameters depend on the
current C+G value.  The random walk takes very small (0.15 percent of base
content) steps, but takes a step at every base.  The WM model is consistent
with sequences currently in GenBank.  The C+G content distribution obtained
for the whole genome could be accounted for by further constraints on the WM
model.
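
To make the idea concrete, here is a rough Python sketch of a process in the
spirit of the WM model.  Only the step size (0.15 percent, one step per
base) comes from the description above.  Everything else is an assumption
made for illustration: the step goes up or down with equal probability, the
C+G value is clamped to the range [gc_min, gc_max], and each base is drawn
independently given the current C+G value (C and G equally likely, likewise
A and T) rather than from the richer sequence process used in the paper.

```python
import random

def walking_markov(length, gc_start=0.5, step=0.0015,
                   gc_min=0.25, gc_max=0.75, seed=None):
    """Generate a sequence whose local C+G content performs a random walk.

    gc_start, gc_min, gc_max and the symmetric up/down step are
    illustrative assumptions, not parameters taken from the paper.
    """
    if not gc_min <= gc_start <= gc_max:
        raise ValueError("gc_start must lie within [gc_min, gc_max]")
    rng = random.Random(seed)
    gc = gc_start
    bases = []
    for _ in range(length):
        # Emit one base using the current C+G probability.
        if rng.random() < gc:
            bases.append(rng.choice("CG"))
        else:
            bases.append(rng.choice("AT"))
        # Take one small step in C+G content at every base.
        gc += step if rng.random() < 0.5 else -step
        # Keep the walk inside the assumed bounds.
        gc = min(gc_max, max(gc_min, gc))
    return "".join(bases)
```

Calling walking_markov(200000, seed=1) gives a sequence whose C+G content
drifts slowly from window to window instead of jumping between fixed levels.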

The Isochore model was, biologically speaking, rather mysterious:  what
biological constraints would cause the C+G content to be held constant over
megabase regions?  On the other hand the WM model is very natural.  The
imperfect persistence of C+G content may be, in a naively simple analogy,
similar to the clumps of related organisms one sees in any naturally
occurring ecosystem.  Some credibility is lent to the analogy by the
well-known duplication of repetitive elements, together with the
concentration of different repetitive elements in light and dark bands of
Giemsa-stained chromosomes (cf. J.R. Korenberg and M.C. Rykowski, Cell
53, 391-400, 1988; L. Manuelidis and D.C. Ward, Chromosoma 91, 28-38,
1984), and by gene duplication or exon shuffling and the concentration of
genes in G+C rich regions (G. Bernardi, ibid.).

Removing the mystery from isochores may also remove the mystery from the
long range correlations in DNA reported by C.-K. Peng et al. (Nature 356,
168-170, 1992).  As pointed out by S. Nee (Nature 357, 450, 1992), simply
having large regions of markedly different base composition (which occur
naturally under WM) is probably enough to give the long range correlations
of Peng et al.

Perhaps surprisingly, the Walking Markov model, with parameters that give
both lower persistence and lower variation, fits the E. coli genome as well.
Thus it seems quite possible that this model describes universal properties
of the genomes of living organisms.

Many models for DNA, including the Isochore model, treat C and G as
indistinguishable, and likewise A and T.  While detailed models that
describe the full base composition are desirable (and the WM model does so)
we have also shown that this simplification is well founded.  There is a
strong tendency for DNA to be strand symmetric, i.e., in large windows the
likelihood is high that there will be nearly the same number of occurrences
of each base on each strand (equivalently, on a single strand the counts of
C and G nearly match, as do those of A and T).  For example, in 1000 base
human windows, the correlation of C content with G content is 0.51 +/- 0.04
with 95% confidence.
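
A check of this kind can be sketched in a few lines of Python: count C and
G separately in each nonoverlapping window of one strand and correlate the
two series.  The names window_counts, pearson and c_g_correlation are
hypothetical, and the sketch gives only a point estimate for one sequence;
it does not compute the confidence interval quoted above.

```python
import math

def window_counts(seq, base, size=1000):
    """Occurrences of `base` in successive, nonoverlapping windows."""
    seq = seq.upper()
    return [
        seq[i:i + size].count(base)
        for i in range(0, len(seq) - size + 1, size)
    ]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either list is constant
    return sxy / math.sqrt(sxx * syy)

def c_g_correlation(seq, size=1000):
    """Correlation of per-window C counts with per-window G counts."""
    return pearson(window_counts(seq, "C", size),
                   window_counts(seq, "G", size))
```

A value well above zero on a long sequence is consistent with strand
symmetry; the same function with "A" and "T" substituted would test the
other base pair.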

Many genetic properties have been posited to be related to base composition,
e.g., gene expression, replication, and recombination.  Thus the analysis of
sequence for function will likely need to be pursued against this backdrop
of variable base composition.  We have shown, for example, that over half
of the variation in analytical measures commonly used to detect genes (cf.
Fickett and Tung, Nucleic Acids Research, 20, 6441) is due to C+G content
variation.  Calibrating such measures separately for different ranges of
C+G content can eliminate 20% of the errors made in coding region
prediction.


