Please be informed that GeneDoc version 2.0.0 has been released
to GeneDoc's web site:
It's been 6 months since the last release, and plenty of work
has been done. Much of it is internal work on the program, so
that there are no limitations or troubles with the amount
of data that can be read or displayed. Screen routines are faster
and printer routines work better. There are new printer options,
Metafile support and much improved fasta and .msf file read
routines. The toolbar and shading methods are much more flexible
and intuitive.
Two major biological features have been added: Secondary
Structure shading and Super Family Group support.
The Secondary Structure shading routines have been built on top
of a generalized User Definable set of shading routines, so
you can now use GeneDoc to shade sequences any way you can
File Read routines for many of the Secondary Structure prediction
output files created at EMBL are built in. GeneDoc reads DSSP
files, also available at EMBL. Support for files from a new
database of Secondary Structure information from PSC is
provided. Use the User Defined routines to shade sequences with
any prediction program.
GeneDoc now provides support for organizing sequences in groups
and can do some nifty shading based on the groups. You can apply
structure information to each group and do group conserved,
contrast and PCR contrast shading.
I've included a more complete discussion of some of the new
features in GeneDoc below.
ketchup at cris.com
----- Written by Dr. Hugh Nicholas <nicholas at psc.edu> -------
Group and Structure Features of GeneDoc.
Increasingly, families of genes and proteins are organized into
superfamilies. Database organizations often use the percent
of residue identity as the criterion for distinguishing whether
a pair of sequences should be assigned to the same family or
different families within a superfamily. Perhaps a more useful
criterion is whether or not a gene duplication has taken place in
the common evolutionary history of the two sequences.
Evolutionary biologists use this classification and refer to
homologous sequences that have only speciation events and no
gene duplication events in their common evolutionary history
as being orthologous. Homologous sequences that have a gene
duplication event in their common evolutionary history are
referred to as paralogous.
Orthologous genes or proteins generally carry out the same
biochemical and physiological functions, while paralogous proteins
generally carry out related but distinct functions. For instance,
mammalian myoglobins, which carry oxygen within cells, are an
orthologous family. They are part of a superfamily that includes
the alpha hemoglobins and the beta hemoglobins, both of which are
also orthologous families. These three homologous families are
part of the same paralogous superfamily, as are other globin genes
Families and superfamilies can be organized around functional
criteria as well as evolutionary criteria and sequence divergence.
Although all sixty plus transfer RNA sequences in E. coli are
paralogous with each other they can be grouped together by their
twenty amino acid acceptor activities.
The group functions in GeneDoc are designed to allow users to work
with and analyze groups based on any of the above criteria or any
user determined criteria for dividing a set of sequences into
groups. The first step in working with groups is to access the
groups configuration dialog. The group configuration dialog can
be accessed either by selecting the "edit sequence groups" item
on the groups menu or by clicking the groups button, the button
marked with an upper case G on the upper toolbar. The group
configuration dialog allows the user to allocate the sequences to
groups and to select a color to be associated with each group.
For the purposes of most of the GeneDoc group analyses, sequences
that are not explicitly assigned to a group will be treated as if
each unassigned sequence constitutes the only member of its own
group. These implicit groups will not be analyzed but the
sequences will be used in the analyses of the defined groups as
"other" groups and thus they will contribute to the analysis.
The group analyses result in different shading for individual
groups. These shadings highlight different degrees of different
kinds of conservation of residues or properties within and
between groups. One analysis, referred to as the Dstat analysis,
measures how different the groups are from one another and
whether this difference is statistically significant. The Dstat
analysis presents its results as a graph and numerical values.
The simplest analysis is performed by the "shade group conserved"
entry on the Groups menu. This analysis highlights positions
within each group that are completely conserved, that is, there
is only one residue at that position within the group. This
highlighting is done in the color assigned to the group in the
group definition dialog. This measurement of conservation within
the groups does not take into account any equivalency groups, even
if they are active. The second thing this analysis does is to
highlight the positions that are completely conserved across all
of the groups, that is, there is only one residue at that position
for all of the sequences in the alignment. This part of the
analysis does take the equivalency groups into account if they
are in effect. The final action is to compute a consensus
sequence based on the entire alignment.
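The three actions described above can be sketched in a few lines
of Python. This is only an illustration, not GeneDoc's code: the
function names, the toy alignment, and the choice to treat a gap
("-") as never conserved are all assumptions, and equivalency
groups are ignored.

```python
from collections import Counter

def conserved_columns(rows, gap="-"):
    """Return the set of columns holding exactly one residue in all rows."""
    hits = set()
    for col in range(len(rows[0])):
        residues = {row[col] for row in rows}
        # Assumption: a column containing a gap is never called conserved.
        if len(residues) == 1 and gap not in residues:
            hits.add(col)
    return hits

def shade_group_conserved(alignment, groups):
    """Conserved columns per group, plus columns conserved over all rows."""
    per_group = {
        name: conserved_columns([alignment[i] for i in members])
        for name, members in groups.items()
    }
    return per_group, conserved_columns(alignment)

def consensus(alignment):
    """Most common character in each column of the whole alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))

# Hypothetical four-sequence alignment split into two groups.
alignment = ["MKVLA", "MKVIA", "MRVLG", "MRVLG"]
groups = {"A": [0, 1], "B": [2, 3]}
per_group, overall = shade_group_conserved(alignment, groups)
print(per_group, overall, consensus(alignment))
```

In the toy data, column 1 is conserved inside each group (K in
group A, R in group B) but not across the whole alignment, which
is the kind of position this shading is meant to draw attention to.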
The most useful result of this analysis is that it
identifies for the user the regions of the alignment where
structural or functional requirements may have been relaxed or
eliminated (or alternatively added as the group evolved a new
function) for some groups relative to others. For this kind of
information to be reliable the conserved groups need to be both
large and from a diverse range of organisms. Otherwise the
observed conservation may simply be the result of a small data
set with highly dependent observations.
A more stringent analysis is performed by the "shade group PCR
contrast" entry on the Groups menu. Sites highlighted by this
analysis meet two criteria. First, a single residue is
completely conserved within the group. Second, this conserved
residue does not appear, at that position, in any sequence
outside of the group in which it is conserved. This analysis
marks unique sequence features of the group that can be useful
in defining a group motif and possibly in defining a primer
sequence to be used in a polymerase chain reaction (PCR)
amplification of the gene.
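The two criteria can be written down directly. The following is a
minimal sketch under the same assumptions as any toy example: the
function name and data are hypothetical, the alignment is a list
of equal-length strings, and a gap ("-") is not accepted as a
conserved residue.

```python
def pcr_contrast_columns(alignment, group, gap="-"):
    """Columns where one residue is fixed inside the group and that
    residue never occurs at the same column outside the group."""
    members = set(group)
    inside = [alignment[i] for i in members]
    outside = [row for i, row in enumerate(alignment) if i not in members]
    hits = []
    for col in range(len(alignment[0])):
        in_residues = {row[col] for row in inside}
        # Criterion 1: a single residue is completely conserved in the group.
        if len(in_residues) != 1 or gap in in_residues:
            continue
        residue = next(iter(in_residues))
        # Criterion 2: that residue is absent from every outside sequence.
        if all(row[col] != residue for row in outside):
            hits.append(col)
    return hits

# Hypothetical alignment: rows 0-1 form one group, rows 2-3 another.
alignment = ["MKVLA", "MKVIA", "MRVLG", "MRVLG"]
print(pcr_contrast_columns(alignment, [0, 1]))
print(pcr_contrast_columns(alignment, [2, 3]))
```

Column 3 is conserved in the second group (L) but is not reported
for it, because L also occurs in a sequence outside that group.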
The "shade group contrast" entry on the Groups menu performs an
analysis similar to that of the "shade group PCR contrast" entry.
This analysis makes use of the scoring table designated for
alignment scoring to divide scores for pairs of amino acids into
three classes: positive, negative, and neutral. The positive
scores are those that are positive numbers in the similarity
scores form of the table. Similarly, the negative scores are
those that are negative numbers in the similarity scores form of
the table, while the neutral scores have a zero score. The
scores are stored in GeneDoc as distance or dissimilarity scores
and hence must be converted to the similarity form. This is done
by subtracting the score for a pair of sequence residues in the
table from a constant called the zero cost distance, stored with
the table. Thus the largest distances become negative
similarities and small distances become positive similarities.
The interpretation of the scoring tables is that positive
similarities are conservative substitutions and are favored over
random substitutions in the evolutionary process relating the
The analysis performed by the "shade group contrast" entry on the
Groups menu is less restrictive about the degree of conservation
within the group than is the "shade group PCR contrast" analysis.
All of the sequence residues found at
a position within the group are required to have a positive
similarity score with each other, and thus to be conservative
substitutions. This analysis is, however, more restrictive than
is the analysis performed by the "shade group PCR contrast" entry
on the Groups menu when dealing with residues outside the group.
The residues outside of the group must have a negative similarity
score with every residue from within the group, thus they are not
allowed to be either conservative or neutral substitutions. An
example of using this kind of analysis to study the recognition
of transfer RNAs by aminoacyl tRNA synthetase enzymes can be found
in McClain and Nicholas, 1987. Nicholas et al., 1987 describes
using the contrasts to plan site directed mutagenesis experiments
to confirm the analysis of the tRNAs.
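A small sketch may help make the group contrast rule concrete. It
converts stored distances to similarities by subtracting them from
the zero cost distance, then applies the two tests to one column.
The distance values, the zero cost distance of 5, and the function
names are invented for illustration and are not taken from any
GeneDoc scoring table.

```python
from itertools import combinations

def similarity(distance, zero_cost, a, b):
    """Similarity form of a score: zero cost distance minus the distance."""
    return zero_cost - distance[frozenset((a, b))]

def is_group_contrast(inside, outside, distance, zero_cost):
    """Test one column: residues inside the group must all be mutually
    positive, and every inside/outside pair must be negative."""
    in_set, out_set = set(inside), set(outside)
    within_ok = all(
        similarity(distance, zero_cost, a, b) > 0
        for a, b in combinations(sorted(in_set), 2)
    )
    between_ok = all(
        similarity(distance, zero_cost, a, b) < 0
        for a in in_set for b in out_set
    )
    return within_ok and between_ok

# Hypothetical distance table over four residues (identity pairs included).
distance = {frozenset(p): 9 for p in ("ID", "IE", "LD", "LE")}
distance.update({frozenset(p): 2 for p in ("IL", "DE")})
distance.update({frozenset(r): 0 for r in "ILDE"})
ZERO_COST = 5  # assumed constant stored with the table

print(is_group_contrast("IL", "DE", distance, ZERO_COST))  # True
print(is_group_contrast("IL", "LD", distance, ZERO_COST))  # False
```

The second call fails because L occurs outside the group, and a
residue paired with itself has a positive similarity, so it cannot
satisfy the negative score requirement.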
The analysis called the Dstat analysis is the Kolmogorov-Smirnov
test for the equality of two distributions. The Dstat analysis
is accomplished by first selecting a region (or all) of the
alignment for use in the test calculations. Then you can either
select the analysis under the Dstat menu or click the Dstat tool
bar button. The Dstat toolbar button is near the right end of
the upper toolbar and is marked by a pair of "S" shaped curves
representing the cumulative distributions used in the test. As
noted above, the Dstat analysis is a statistical test of whether
the groups defined by the user are significantly different from
one another.
The first step in the test is to compute an alignment score for
each pair of sequences over the region selected by the user.
These scores are then partitioned into two distributions. The
first distribution is composed entirely of scores where the two
sequences used to compute the score are members of
different user defined groups. This is called the between groups
distribution. The second distribution is composed entirely from
scores where both of the sequences used to compute the score are
members of the same user defined group. Note that this includes
scores from every group with two or more sequences. This
distribution is called the within groups distribution.
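The partitioning step can be sketched as follows. The pairwise
score used here, a simple count of differing columns, is only a
stand-in chosen for the example and is not GeneDoc's alignment
score; the function names and the group labels are likewise
hypothetical.

```python
from itertools import combinations

def mismatch_score(a, b):
    """Stand-in pairwise score: number of differing columns.
    This is an assumption, not the score GeneDoc computes."""
    return sum(x != y for x, y in zip(a, b))

def partition_scores(seqs, labels, start, end, score=mismatch_score):
    """Score every pair over columns [start, end) and split the scores
    into within groups and between groups lists."""
    within, between = [], []
    for i, j in combinations(range(len(seqs)), 2):
        s = score(seqs[i][start:end], seqs[j][start:end])
        if labels[i] == labels[j]:
            within.append(s)
        else:
            between.append(s)
    return within, between

# Hypothetical alignment with one group label per sequence.
seqs = ["MKVLA", "MKVIA", "MRVLG", "MRVLG"]
labels = ["A", "A", "B", "B"]
within, between = partition_scores(seqs, labels, 0, 5)
print(within, between)
```

Every group with two or more sequences contributes to the single
within groups list, as described above.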
These two distributions are plotted as cumulative distributions.
That is, each score is plotted versus the fraction of the scores
in the distribution that are less than or equal to the score
being plotted. The Kolmogorov-Smirnov D statistic (Dstat) is
the maximum difference between the two distributions (along the
fractional axis). Recent advances in the understanding of the
distribution of values taken on by Dstat allow us to compute its
one-sided significance probability. The one-sided significance
probability is used rather than the two-sided significance
probability because we are only interested in the case where the
between groups distribution is composed of larger scores than the
within groups distribution. The other situation, where the
within groups distribution is composed of larger scores than the
between groups distribution corresponds to either convergent
evolution or some sort of selection in favor of divergence,
situations that are not usually part of the hypothesis.
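The D statistic itself is easy to compute from the two lists of
scores. The sketch below returns both the two-sided value and a
one-sided value; it assumes the scores behave like distances, so
that a between groups distribution shifted toward larger scores
puts the within groups cumulative curve above the between groups
curve. It does not compute the significance probability, and the
function names are hypothetical.

```python
from bisect import bisect_right

def ecdf(sorted_sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def dstat(within, between):
    """Return (two-sided D, one-sided D) for two lists of scores.
    The one-sided value measures how far the within groups curve
    rises above the between groups curve."""
    w, b = sorted(within), sorted(between)
    # The curves only change at observed scores, so check those points.
    diffs = [ecdf(w, x) - ecdf(b, x) for x in set(w + b)]
    return max(abs(d) for d in diffs), max(diffs)

# Hypothetical score lists, e.g. from a pairwise partitioning step.
print(dstat([1, 0], [2, 2, 3, 3]))
print(dstat([2, 3], [0, 1]))
```

In the second call the within groups scores are the larger ones,
so the two-sided value is large while the one-sided value is zero,
which is the situation the one-sided test deliberately ignores.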
The Kolmogorov-Smirnov test was selected instead of the more
common Student's t test or the F test because it is sensitive
to both the location of the distributions along the scores axis
and to the shape of the distribution. Student's t test is
sensitive to only the location of the distributions and the F
test is sensitive only to differences in the variance of the
distribution, only one of several aspects affecting the shape of
the distributions. Thus the Kolmogorov-Smirnov test can find the
distributions to be different when either Student's t test or the
F test might have failed. Because of this it is necessary for
the user to examine the plot carefully to determine the exact
nature of the differences in the two distributions being tested.
The user should exercise care that the biological hypothesis
being examined should lead to the type of difference actually
Examples of testing biological hypotheses with sequence data and
the Kolmogorov-Smirnov test can be found in Nicholas and Graves,
1983 and in Nicholas and McClain, 1995. The Nicholas and Graves
paper contains an extended discussion of formulating
Kolmogorov-Smirnov tests that correspond to different kinds of
The structure groups and shading facility provide an extremely
powerful and flexible set of tools for integrating sequence
information with structural information. The facility is
flexible enough to allow the user to display almost any kind of
information as color codes along the sequence. Such states can
include the obvious secondary structure state of the residue in
the three dimensional structure. Less obvious properties like
the solvent accessible surface area of the residue or its side
chain can also be displayed. The fraction of the side chain in a
polar environment is another characteristic that is sometimes
One of the most powerful kinds of integration of structure and
sequence information allows the user to visually examine the
variation in structure or some structural property that occurs as
the sequence varies in a series of homologous proteins. This
same display allows the user to adjust the alignment based on
information derived from the varying structures of a series of
Alternatively, you can have several copies of the same sequence
in the alignment by adding additional copies with different names
through the sequence import facility. Remember that each copy
must have its own distinct name. These copies of the same
sequence can each be highlighted using a different property.
This allows you to easily visualize possible correlation of
properties or of properties with sequence or structure.
Another use for multiple copies of a single sequence in the
alignment is to contrast predictions of structure and properties
with those observed in an experimentally determined three
A widespread use of the structure shading is to project the
structure or properties from a sequence of known structure onto
sequences whose structures have not been experimentally determined.
This is accomplished by combining the structure shading facility
with the group facility. At the same time you use the structure
shading dialog to associate a structure file with a specific
sequence in the alignment you can designate that sequence as the
master sequence of a structural group. This allows you to
associate several sequences with a single structure file.
These associated sequences will be shaded with the same colors in
the same column of the alignment as the master sequence regardless
of the sequence residue present in that position of the sequence.
Thus this is essentially a low resolution homology modeling
All of these features can be combined into a very informative
display for studying structure-function relationships in the
following procedure. First, associate a structure file with each
sequence in your alignment whose structure has been determined by
X-ray crystallography or NMR. Make these sequences the master
sequence for a group of the most closely related sequences in the
alignment. Ideally the sequences within each group should have
common biochemical properties such as substrate specificity, while
different groups can have different substrate specificities. Use
the sequence editing facility to put sequences in the same group
on adjacent rows of the alignment. Put the group master sequence
at the top of each group. Then set the display mode to
differences mode and make sure this is applied to all of the
This yields a display with all of the group master sequences,
that is the sequences with known structures, displayed with all
of their residues. The other sequences in each group have a dot
displayed where they have the same sequence residue as the group
master. Sequence residues that are different from the group
master sequence are shown. This highlights substitutions within
each group that are presumably successful site directed
mutagenesis experiments performed by nature. Differences between
groups may be associated with the change in substrate specificity
or other property.
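The differences mode display described above is simple to mimic.
This sketch is only an illustration with a hypothetical function
name and toy sequences: the group master is shown in full, and
each other sequence shows a dot wherever it matches the master.

```python
def differences_view(master, others, same="."):
    """Return display rows for one group: the master sequence in full,
    then each other sequence with matching residues replaced by dots."""
    rows = [master]
    for seq in others:
        rows.append("".join(same if r == m else r for r, m in zip(seq, master)))
    return rows

# Hypothetical group: the first sequence is the group master.
for row in differences_view("MKVLA", ["MKVIA", "MRVLG"]):
    print(row)
```

For the toy group this prints MKVLA, then ...I. and .R..G, so
only the substitutions relative to the master remain visible.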
It can be very helpful to have exactly the same alignment in a
second window highlighted in group contrast mode. This
combination of displays can be a very powerful tool for examining
structure-function relationships by integrating a large amount
of information in an easy to comprehend format and presentation.
Nicholas, H.B. Jr. and Graves, S.B. 1983.
Clustering of transfer RNA by cell type and amino acid specificity.
Journal of Molecular Biology, vol. 171, pp. 111-118.
Nicholas, H.B. Jr., Chen, Y-M., and McClain, W.H. 1987.
Comparisons of transfer RNA sequences.
Computer Applications in the Biosciences, vol. 3, p. 53.
McClain, W.H. and Nicholas, H.B. Jr. 1987.
Discrimination between transfer RNA molecules.
Journal of Molecular Biology, vol. 194, pp. 635-642.
Nicholas, H.B. Jr. and McClain, W.H. 1987.
An algorithm for discriminating transfer RNA sequences.
Computer Applications in the Biosciences, vol. 3, pp. 177-181.
Nicholas, H.B. Jr. and McClain, W.H. 1995.
Searching tRNA Sequences for Relatedness to Aminoacyl tRNA
Journal of Molecular Evolution, vol. 40, pp. 482-486.