For many months (several years, I believe) I've advocated mapping
genomes completely (already done for a dozen species, in the works for
a dozen more, with still more being planned) and then matching every
segment of DNA without regard to supposed boundaries of genes, codons,
etc., to get a true picture of the evolutionary history of DNA, free
of the bias of preconceived ideas about genes.
Then a few days ago, while catching up with back issues of SCIENCE, in
the issue of 2003.Jul.04, on page 53, I found this wonderful quote:
"Given the intricacies of RNA editing, complex regulatory networks,
genetic redundancy, and molecular pathways, it is meaningless to
identify any one concrete natural object as the gene." Although that
sounds extreme, I believe it's the right way of thinking. Comments?
As for guessing the number of "genes" (appx. 24000 or maybe 30000) in
the human genome, I suppose it's a good thing the guesses in the
correct range were so sparse that a winner could be picked without
getting into vicious arguments about what exactly counted as one gene
vs. two genes.
Ignoring the definitions of a gene in terms of coding for phenotype,
and considering only the definitions in terms of evolution by mutation
(point, crossover, duplication, deletion, etc.) and natural selection,
a year or two ago I devised a fuzzy definition of a gene: basically any
segment of genome (usually DNA, sometimes RNA such as in viruses) which
is long enough that it doesn't arise by chance but short enough that it
can last many generations before accidentally being split by meiotic
crossover or altered by chance mutation. Specifically, the exact-copy
fecundity of that segment of genome must be greater than 1. This
definition of course has genes within genes within genes simply because
a wide range of lengths of genome segment satisfy the definition.
Indeed an entire chromosome can count as a single gene if crossing over
is infrequent in the particular sequence or the chromosome is
sufficiently short that it misses crossing over most of the time and
simultaneously the mutation rate is low enough to miss that particular
chromosome most of the time. (Whether there in fact is any chromosome
of any living species satisfying those conditions, I don't know,
probably not, but maybe?)
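The fuzzy definition above can be turned into a rough calculation. What
follows is only a sketch under stated assumptions, not a worked-out
model: it reads "exact-copy fecundity" as the expected number of
unaltered copies a segment leaves per generation, and it treats
mutation and crossover as independent per-base events. The function
names, the parameters (copies_per_generation, mutation_rate,
crossover_rate, genome_size) and the chance threshold are all
hypothetical and do not come from the text.

```python
import math

def exact_copy_fecundity(length, copies_per_generation,
                         mutation_rate, crossover_rate):
    """Expected number of unaltered copies left per generation.

    Assumption: each of the `length` bases mutates independently with
    probability `mutation_rate`, and each of the `length - 1` junctions
    between bases is split by crossover with probability `crossover_rate`.
    """
    p_no_mutation = (1.0 - mutation_rate) ** length
    p_no_split = (1.0 - crossover_rate) ** (length - 1)
    return copies_per_generation * p_no_mutation * p_no_split

def log10_chance_occurrences(length, genome_size):
    """log10 of the expected count of one specific sequence of this
    length in a random genome with equal base frequencies (assumption)."""
    return math.log10(genome_size) + length * math.log10(0.25)

def is_fuzzy_gene(length, copies_per_generation, mutation_rate,
                  crossover_rate, genome_size, max_chance_log10=-6.0):
    """True if the segment is too long to arise by chance yet short
    enough that its exact-copy fecundity exceeds 1."""
    too_short = log10_chance_occurrences(length, genome_size) > max_chance_log10
    fecundity = exact_copy_fecundity(length, copies_per_generation,
                                     mutation_rate, crossover_rate)
    return (not too_short) and fecundity > 1.0

# Illustrative numbers only: which power-of-two lengths qualify?
if __name__ == "__main__":
    for exponent in range(2, 26, 2):
        ok = is_fuzzy_gene(2 ** exponent, 1.05, 1e-8, 1e-8, 3e9)
        print(2 ** exponent, ok)
```

With these made-up rates a whole band of lengths passes the test at
once, which is the "genes within genes within genes" effect: very short
segments fail because they arise by chance, very long ones fail because
they rarely survive a generation intact.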
How to perform matching calculations on such overlapping and nested
"genes" of varying length? My idea for the past many months (few
years) has been overlapping segments of power-of-two lengths feeding
into ProxHash (a hashing function from data space into high-dimensional
real-number space, satisfying the mathematical property of being
"continuous", i.e. the epsilon-delta condition you all remember from
calculus and from metric spaces in analysis). In that way only
four-fold coverage for any given power of two (length of genome
segment) is needed to assure that mis-phase won't prevent matching.
Larger powers of two (gs-length) can be used to efficiently trace large
unchanged genome-segments through a few generations before they are
mutated, thereby tracking a whole set of codons etc. simultaneously as
they co-replicate, while smaller powers of two (gs-length) can be used
at greater cost to trace shorter genome-segments, even smaller than a
single phenotype-gene, through more generations. After building a set
of nearest-neighbor (in hash space) links between gs in our database,
and also links between whole and part (adjacent powers of two
gs-length), software can then look at the gs at each end of a link to
find the actual match of identical base sequences, i.e. establish
pointwise alignment wherever an exact match occurs, and then fill in
the gaps whenever a SNP occurs. At this point we have vectors of
exact-alignment links, which can be tracked as groups forward and
backward in time, making it easy to identify where insertions (from
copies, or from retro-coding) or deletions have occurred, and to set up
fuzzy links to show paths through such changes. By combining these
various kinds of links, we obtain a directed graph of alignment
stretching from the current time all the way back to the last common
ancestor pool of all life on Earth. Perhaps we'll discover that all
present-day DNA, coding and non-coding, aligns directly from a very
small pool of maybe five or ten codons that were in life 3900 million
years ago, except for one medium-size segment of DNA that suddenly came
to Earth appx. 3400 million years ago via some meteorite from Mars.
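The first steps of the matching scheme described above (overlapping
power-of-two segments, a continuous hash into real-number space, and
nearest-neighbor links) can be sketched as follows. This is a toy
illustration under assumptions, not ProxHash itself: the stand-in hash
is a plain dinucleotide frequency vector, chosen only because a single
base change moves it a small bounded distance. "Four-fold coverage" is
read here as a stride of one quarter of the segment length, which is
one possible interpretation. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math

# The 16 dinucleotides, in a fixed order, define the hash-space axes.
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]

def windows(seq, length, coverage=4):
    """Yield (start, segment) for overlapping windows of one length.

    Assumed reading of "four-fold coverage": stride = length / coverage,
    so interior bases fall inside about four windows of this length.
    """
    step = max(1, length // coverage)
    for start in range(0, len(seq) - length + 1, step):
        yield start, seq[start:start + length]

def toy_hash(segment):
    """Stand-in for ProxHash: normalized dinucleotide frequency vector."""
    counts = Counter(segment[i:i + 2] for i in range(len(segment) - 1))
    total = max(1, len(segment) - 1)
    return [counts[k] / total for k in KMERS]

def distance(u, v):
    """Euclidean distance between two points in hash space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_links(seq_a, seq_b, length):
    """Link each window of seq_a to its nearest window of seq_b.

    Returns (start_a, start_b, distance) triples. Brute force for
    clarity; a real database would need a spatial index.
    """
    hashed_b = [(s, toy_hash(seg)) for s, seg in windows(seq_b, length)]
    if not hashed_b:
        return []
    links = []
    for start_a, seg_a in windows(seq_a, length):
        h = toy_hash(seg_a)
        start_b, d = min(((s, distance(h, hb)) for s, hb in hashed_b),
                         key=lambda pair: pair[1])
        links.append((start_a, start_b, d))
    return links

if __name__ == "__main__":
    a = "ACGTTGCAAGGCTTAC"
    b = "GGGG" + a  # same sequence shifted by four bases
    for link in nearest_links(a, b, 8):
        print(link)
```

A distance of zero in this toy hash space only says the two segments
have the same dinucleotide composition, so each link still has to be
checked base by base, as described above, before it counts as an exact
alignment.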
---