genome sizes and number of genes summary

Tom Schneider toms at fcs260c2.ncifcrf.gov
Wed Aug 26 16:21:49 EST 1992

In article <9208252323.AA00711 at evolution.genetics.washington.edu>
joe at GENETICS.WASHINGTON.EDU (Joe Felsenstein) writes:
|Tom Schneider, in the midst of his posting on genome sizes writes
|> that the genome size is 4673600, while the number of genes is 3237.
|> This gives:
|>     Rfrequency = log2(4673600/3237) = 10.5 bits per site.
|What does Rfrequency do for you?  Is it intended to measure the potential
|information content per site?  I would have thought that would be
|2*(4673600/3237) = 2888 bits per locus, or 2 bits per site (counting the
|four symbols A, C, G, T as equally frequent).
|I guess I should have been reading the information theory group.
|Joe Felsenstein, Dept. of Genetics, Univ. of Washington, Seattle, WA 98195
| Internet:         joe at genetics.washington.edu     (IP No.
| Bitnet/EARN:      felsenst at uwavm

(I'm cross posting, so I didn't edit your posting.)

Technically, Rfrequency is a measure of the reduction of entropy required for
the recognizer (the ribosome in this case) to go from the state of being
anywhere on the genome (i.e., log2(G), where G = 4673600) to the state of
being at any one of the functional sites (i.e., log2(gamma), where gamma = 3237).
This decrease is:

   Rfrequency = log2(G) - log2(gamma)
              = log2(G/gamma)
              = -log2(gamma/G)
              = -log2("frequency of sites in the nucleic acid")

hence the name.  I originally called it Rpredicted, because it is a kind of
prediction of the information needed to locate the sites.  It is NOT the
potential information content of the sites because one does not use any
sequence data to calculate it.
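
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python.  The function name rfrequency and its argument names are made up for
illustration; the only inputs are the two counts quoted above, and no sequence
data is involved.

```python
import math

def rfrequency(genome_positions, site_count):
    """Entropy decrease, in bits, in going from 'anywhere on the genome'
    to 'at one of the functional sites': log2(G) - log2(gamma)."""
    return math.log2(genome_positions) - math.log2(site_count)

G = 4673600    # genome size quoted in the posting
gamma = 3237   # number of genes (one ribosome binding site each)

r = rfrequency(G, gamma)
print(f"Rfrequency = {r:.1f} bits per site")

# The same number written as -log2(frequency of sites)
r_alt = -math.log2(gamma / G)
```

Both forms give about 10.5 bits per site, matching the figure in the quoted
posting.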

The measured information content of a site (i.e., the sequence conservation at
the site) is called Rsequence.  It is measured for the same state function, but
uses the sequences.  I won't go more into that one here; there are a couple of
papers on it if you are interested.

Your calculation is interesting, as you doubled the genomic size.  In many
cases this is the proper thing to do, namely when one is considering binding
sites on DNA.  However, ribosomes work on single-stranded RNA, and in E. coli,
just about the entire genome is transcribed, but only from one strand.  Hence
the number of potential ribosome binding sites is just 4.7e6.

Further, you have to take the logarithm of the number of choices made
to get your answer in bits.  So you should have said:

log2(2*(4673600/3237)) = log2(2888) = 11.5 bits per site

Naturally, if you increase the search region by a factor of 2, you increase the
information needed to find the objects by 1 bit, since a bit defines the choice
between two equally likely things.  Once again, this calculation does not take
into account that there are 4 bases.  (That is important for Rsequence.)
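
The one-bit difference can be checked with a short Python sketch.  The
variable names are illustrative only; the two counts are the ones quoted in
this thread, and the factor of 2 stands for searching both strands.

```python
import math

G = 4673600    # genome size quoted in the posting
gamma = 3237   # number of sites

# Search restricted to one strand (the ribosome case)
one_strand = math.log2(G / gamma)

# Search region doubled (e.g. a protein binding either strand of DNA)
two_strands = math.log2(2 * G / gamma)

print(f"one strand:  {one_strand:.1f} bits per site")
print(f"two strands: {two_strands:.1f} bits per site")
print(f"difference:  {two_strands - one_strand:.1f} bit")
```

Since log2(2x) = log2(x) + 1, the difference is 1 bit whatever values G and
gamma take.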

The surprise comes when one compares Rfrequency (Rf) to Rsequence (Rs); they are
not always the same.  Therein lie lots of interesting stories...  My first data
showed a number of cases where Rs ~ 1*Rf, then one where Rs ~ 2*Rf, and we just
published a case of Rs ~ 3*Rf!  We now have a paper in press where Rs ~ Rf - 1
bit (donor splice sites have more than 1 bit less Rs than acceptors, but
acceptors equal Rf).  I'll let you see if you can figure out what the biology
of these might be...

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov
