Is the genome like a computer program?
rrobbins at GDB.ORG
Fri Apr 14 08:41:43 EST 1995
This is an area to which I have given some thought (and in fact have even
written about it a bit and have added a discussion of the subject to a
course I teach at Johns Hopkins).
The genome is like a mass storage device (with properties not shared by
current electronic mass-storage devices), but the programs that it encodes
differ significantly from current computer technology.
Each parent contributes one set of mass-storage devices (23 chromosomes)
so that the resulting single cell has two full sets of the programs (that
may, and do, differ from each other).
Although the genetic code is without doubt a digital code, the level of
parallelism that exists in the expression of these codes is such that many
aspects of the system have more in common with an analog system than with
a digital one. If we assume that the transcription of a gene, followed
by the synthesis of an enzyme, is the spawning of an executing process,
then each single cell has millions of programs executing in a truly
parallel (i.e., independent execution, no time sharing) mode.
By the time the single cell produced by the union of sperm and egg has
grown into an adult, there are probably well in excess of 10**13 cells,
each with two copies of the mass-storage devices, and each running
somewhere in excess of 10**5 to 10**6 processes in parallel. Many of the
spawned processes affect the launching of other processes, so the level of
feed-back control is high. Hormones produced in one cell may affect the
expression of genes (i.e., the launching of processes) in other cells, so
in principle the processes running in any of the 10**13 cells could affect
the launching of processes in any of the other 10**13 cells. Proteins may
interact with each other, so processes may affect processes, and there is
certainly a lot of inter-process interaction.
Taken all together then, the expression of the human genome involves the
simultaneous expression and (potential) interaction of something probably
in excess of 10**18 parallel processes.
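The arithmetic behind that figure can be checked in a few lines. This is
only a back-of-the-envelope sketch that multiplies the estimates quoted
above; the variable names are illustrative, not established terminology.

```python
# Rough estimates taken from the text above; nothing here is measured data.
cells = 10**13                        # cells in an adult
processes_per_cell = (10**5, 10**6)   # low and high estimates per cell

# Total parallel processes = cells * processes per cell.
low, high = (cells * p for p in processes_per_cell)
print(f"between {low:.0e} and {high:.0e} parallel processes")
```

The low end of the range is the 10**18 quoted above; the high end is ten
times larger.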
Given this, it is not likely that intuitions derived from an understanding
of the operation of programs on present computer architectures will
generalize well to the expression of the genome.
Differences exist in the mass-storage devices as well. The genome
can be thought of as a mass-storage device based on a linked-list
architecture, rather than a physical platter. All addressing is
associative, with multiple read heads scanning the device in parallel,
looking for specific START LOADING HERE signals. When such a signal is
encountered, the read head starts transcribing DNA and continues doing so
until a STOP LOADING HERE signal is encountered. [The resulting transcript
is like a *.EXE file, rather than a *.COM file. On Intel systems, *.COM
files are perfect memory images that may be loaded verbatim into memory
and then execute when control is passed to the first byte. *.EXE files,
on the other hand, are a mixture of instructions to the loader and
instructions to be executed. After the loader instructions are executed,
the program (now different from that stored on the mass-storage device) is
placed in memory and control passed to it.]
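The associative read described above can be caricatured in code. This is
a toy sketch only: the marker strings "START" and "STOP" and the function
name transcribe are hypothetical stand-ins, real recognition signals are
not literal strings, and a single serial scan stands in for many read
heads working in parallel.

```python
def transcribe(genome, start="START", stop="STOP"):
    """Collect every stretch lying between a start signal and the next stop signal."""
    transcripts = []
    pos = 0
    while True:
        begin = genome.find(start, pos)   # addressing by content, not by offset
        if begin == -1:
            break
        body = begin + len(start)
        end = genome.find(stop, body)
        if end == -1:                     # no stop signal: nothing more to load
            break
        transcripts.append(genome[body:end])
        pos = end + len(stop)
    return transcripts

# Hypothetical toy "genome": lowercase filler around two marked regions.
toy_genome = "xxSTARTabcSTOPyyyySTARTdefgSTOPzz"
transcripts = transcribe(toy_genome)
print(transcripts)
```

Note that nothing in the scan depends on where a marked region sits; moving
a region elsewhere in the string changes nothing about what is loaded.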
Genome programs execute on a virtual machine that is defined by some of
the genomic programs that are executing. Thus, in trying to understand
the genome, we are trying to reverse engineer binaries for an unknown CPU,
in fact for a virtual CPU whose properties are encoded in the binaries we
are trying to reverse engineer.
We do know that "genomic op codes" are probabilistic, rather than
deterministic. That is, when control hits a particular op code, there is
a certain probability that a certain action will occur. This applies to
the associative addresses on the mass storage device as well. Intuitions
from current hardware suggest that this would make for intermittent,
jerky behavior of the system. However, in such a massively parallel
system, probabilistic op codes actually smooth out the behavior of the
system by providing some buffering capacity (in the chemical, not computer
I/O, sense of buffering).
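The smoothing effect is easy to see in a toy simulation. The sketch below
assumes each of n independent units "fires" with the same probability p
(an arbitrary illustrative value, as is the function name
relative_spread) and measures how much the total output varies from one
run to the next, relative to its mean.

```python
import random
import statistics

def relative_spread(n_units, p=0.3, trials=200, seed=0):
    """Std. deviation / mean of the summed output of n_units probabilistic units."""
    rng = random.Random(seed)
    # Each trial counts how many of the n_units independent units fired.
    totals = [sum(rng.random() < p for _ in range(n_units))
              for _ in range(trials)]
    return statistics.pstdev(totals) / statistics.mean(totals)

for n in (10, 100, 1000, 10000):
    print(n, round(relative_spread(n), 4))
```

With only a handful of units the total is jerky; as n grows, the relative
spread shrinks roughly as 1/sqrt(n), so the aggregate behaves almost like
a smooth, analog quantity.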
I could go on, but I hope my general point is made. I offer some
additional comments below in response to your specific inquiries.
On 14 Apr 1995, Gary Welz wrote:
> I'm writing a speculative article about the large scale structure of
> the genome. Does anyone besides me think that an organism's genome can
> be regarded as a computer program? I mean that its structure can be
> presented as a flowchart with genes as objects connected by logical
> terms like "and" and "or?" Of course, conditional activity in the
> genome - the analog of the "while" loop - has been studied for
Flow charts describe the behavior of a non-parallel machine. Although
some aspects of a massively parallel system can be expressed
(metaphorically) as a flow chart of linearly executing steps, care must be
taken in interpreting that flow chart.
> One development that might support this point of view is the recent
> demonstration (reported in last week's Science) that the eyeless gene
> can be inserted into various parts of the chromosome of a fly and cause
> it to have eyes grow on different parts of its body. Is eyeless a free
> standing genetic object that can be plugged into any syntactically
> correct sequence and function as though it belonged there naturally?
> If so, what is the nature of the programming language that makes this
> possible?
Single processes (like MAKE HUMAN GROWTH HORMONE or MAKE HUMAN INSULIN)
are fairly free-standing and can be inserted into almost
any syntactic interpreter, with the result that the desired protein is in
fact synthesized. This is the heart of much of the bio-tech industry.
However, the parallel nature of genomic expression is such that major
developmental steps, like making eyes, involve so many processes acting in
concert that individual processes cannot be seen as free standing. This
may seem to contradict the findings you mention above, but explaining the
subtleties involved would require more time and space than is available
> Dr. Gene Stanley and others at Boston U. and Harvard Medical School
> have done statistical analyses of non-coding DNA sequences (published
> in Physical Review Letters a few months ago) that suggest that there
> may be linguistic structures, i.e. words within them. Are some of these
> non-coding sequences the terms of the genetic programming language?
There is an entire body of literature that treats the linguistic
properties of DNA. If you are really interested, you should read a lot of
it before jumping to quick interpretations. However, you should also bear
in mind that DNA involves the coding of a "language" on a mass-storage
device; it is not the direct expression of a language. As an analogy,
consider the differences that programs designed to detect linguistic
features would exhibit if they were run first against, say, War and Peace
as straight ASCII text and then second, against the byte stream obtained
from a hard disk on which War and Peace was stored as, say, a WordPerfect
file with all of the embedded WordPerfect formatting codes and with lots
of file fragmentation thrown in. I expect that the program would detect
linguistic features when run against the byte stream from the hard disk,
but they would be neither a clean set of features of English nor a clean
set of features from WordPerfect formatting codes, but rather a mixture of
the two, with some confusion thrown in due to the disk fragmentation.
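A toy version of this thought experiment can be run directly. Everything
in the sketch is an assumption made for illustration: the sample sentence
stands in for the novel, "[FMT]" is a made-up formatting code (not a real
WordPerfect code), and shuffling fixed-size blocks is a crude stand-in for
disk fragmentation.

```python
import random
import re
from collections import Counter

def words(stream):
    """A naive 'linguistic feature' detector: count runs of letters."""
    return Counter(re.findall(r"[a-z]+", stream.lower()))

def to_disk_stream(text, code="[FMT]", every=25, block=16, seed=0):
    """Embed a formatting code every `every` characters, then shuffle blocks."""
    coded = code.join(text[i:i + every] for i in range(0, len(text), every))
    blocks = [coded[i:i + block] for i in range(0, len(coded), block)]
    random.Random(seed).shuffle(blocks)   # simulated fragmentation
    return "".join(blocks)

text = "well prince so genoa and lucca are now just family estates " * 20

clean = words(text)                   # features of the plain text
disk = words(to_disk_stream(text))    # features of the "disk" byte stream
novel = set(disk) - set(clean)        # tokens seen only in the disk stream
print(sorted(novel))
```

Some real words survive intact in the disk stream, but the detector also
reports tokens that belong neither to the text nor cleanly to the
formatting codes: word fragments cut by block boundaries, and the code
itself misread as a word.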
> If this is interesting to you, or if you think its bogus, let me know.
I think it's really interesting, but I also think that we need to be
careful not to oversimplify the analogies. When I was the program officer
for Database Activities in Biology at NSF, I had many inquiries from
computer scientists who had acquired a Scientific-American-article level
of appreciation of genetics and who assumed that the well-known table
showing the determination of the sequence of amino-acids in proteins by
the sequence of nucleotides in DNA was more or less equivalent to the
table of op codes for some CPU. This level of understanding led to some
very simplistic proposals.
A not-that-bad reversal of the analogy would have someone thinking that an
understanding of the ASCII code (41h = A, 42h = B, etc.) is all that would
be required to understand the workings of, say, an Intel Paragon or a
At the same time, I think that bringing computer-science insights to bear
on the challenge of understanding genome operation has some potentially
huge payoffs. For example, it would be really interesting to think about
the file-allocation-table system for a mass storage device that behaves
like a redundant linked list, with only associative addressing (i.e., no
physical addressing by sector-offset, but instead only addressing by
offsets from recognizable landmarks). It would also be interesting to
think about the computational properties that might emerge in a system with
probabilistic op codes and with as much parallelism as biological
Robert J. Robbins
Bioinformation Infrastructure Program
Office of Health and Environmental Research
United States Department of Energy
19901 Germantown Road
Germantown, MD 20874-1290
robbins at er.doe.gov
(301) 903-6488 (secretary / Joanne Corcoran)
(301) 903-8521 (fax)
Robert J. Robbins
Laboratory for Applied Bioinformatics
Johns Hopkins University
2024 E. Monument Street
Baltimore, MD 21205
rrobbins at gdb.org
(410) 955-9705 (office secretary)
(410) 614-0434 (fax)