GeneDoc 2.0.0 Released: Alignment Editor for Windows

Ketchup Ketchup at
Sun Mar 2 23:02:02 EST 1997

Greetings All, 

Please be informed that GeneDoc version 2.0.0 has been released
to GeneDoc's web site:

It's been 6 months since the last release, and plenty of work has 
been done. Much of it is internal work on the program, so that 
there are no limitations or troubles with the amount of data that 
can be read or displayed. Screen routines are faster and printer 
routines work better. There are new printer options, Metafile 
support, and much improved FASTA and .msf file read routines. The 
toolbar and shading methods are much more flexible and intuitive 
to use.

Two major biological features have been added: Secondary 
Structure shading and Super Family Group support. 

The Secondary Structure shading routines have been built on top 
of a generalized User Definable set of shading routines, so 
you can now use GeneDoc to shade sequences any way you can 

File Read routines for many of the Secondary Structure prediction 
output files created at EMBL are built in. GeneDoc reads DSSP 
files, also available at EMBL. Support for files from a new 
database of Secondary Structure information from PSC is 
provided. Use the User Defined routines to shade sequences with 
any prediction program.

GeneDoc now provides support for organizing sequences in groups 
and doing some nifty shading based on the groups. You can apply 
structure information to each group and do group conserved, 
contrast, and PCR contrast shading.

I've included a more complete discussion of some of the new 
features in GeneDoc below. 


Karl Nicholas
ketchup at

----- Written by Dr. Hugh Nicholas <nicholas at> -------

Group and Structure Features of GeneDoc.

Increasingly, families of genes and proteins are organized into 
superfamilies.  Database organizations often use the percent 
of residue identity as the criterion for deciding whether 
a pair of sequences should be assigned to the same family or 
to different families within a superfamily.  Perhaps a more useful 
criterion is whether or not a gene duplication has taken place in 
the common evolutionary history of the two sequences.  
Evolutionary biologists use this classification and refer to 
homologous sequences that have only speciation events, and no 
gene duplication events, in their common evolutionary history 
as being orthologous.  Homologous sequences that have a gene 
duplication event in their common evolutionary history are 
referred to as paralogous.

Orthologous genes or proteins generally carry out the same 
biochemical and physiological functions, while paralogous proteins 
generally carry out different but related functions.  For instance, 
mammalian myoglobins, which carry oxygen within cells, are an 
orthologous family.  They are part of a superfamily that includes 
the alpha hemoglobins and the beta hemoglobins, both of which are 
also orthologous families.  These three homologous families are 
part of the same paralogous superfamily, as are other globin genes 
and proteins.

Families and superfamilies can be organized around functional 
criteria as well as evolutionary criteria and sequence divergence.  
Although all sixty plus transfer RNA sequences in E. coli are 
paralogous with each other, they can be grouped together by their 
twenty amino acid acceptor activities.

The group functions in GeneDoc are designed to allow users to work 
with and analyze groups based on any of the above criteria or any 
user determined criteria for dividing a set of sequences into 
groups.  The first step in working with groups is to access the 
groups configuration dialog. The group configuration dialog can 
be accessed either by selecting the "edit sequence groups" item 
on the groups menu or by clicking the groups button, the button 
marked with an upper case G on the upper toolbar.  The group 
configuration dialog allows the user to allocate the sequences to 
groups and to select a color to be associated with each group.  
For the purposes of most of the GeneDoc group analyses, sequences 
that are not explicitly assigned to a group are treated as if 
each unassigned sequence constitutes the only member of its own 
group.  These implicit groups are not themselves analyzed, but 
their sequences take part in the analyses of the defined groups as 
"other" groups and thus contribute to the analysis.

The group analyses result in different shading for individual 
groups.  These shadings highlight different degrees of different 
kinds of conservation of residues or properties within and 
between groups.  One analysis, referred to as the Dstat analysis, 
measures how different the groups are from one another and 
whether this difference is statistically significant.  The Dstat 
analysis presents its results as a graph and numerical values.

The simplest analysis is performed by the "shade group conserved" 
entry on the Groups menu.  This analysis first highlights positions 
within each group that are completely conserved, that is, there 
is only one residue at that position within the group.  This 
highlighting is done in the color assigned to the group in the 
group definition dialog.  This measurement of conservation within 
the groups does not take into account any equivalency groups, even 
if they are active.  Second, the analysis highlights the positions 
that are completely conserved across all of the groups, that is, 
there is only one residue at that position for all of the 
sequences in the alignment.  This part of the analysis does take 
the equivalency groups into account if they are in effect.  The 
final action is to compute a consensus sequence based on the 
entire alignment.
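
The two highlighting steps can be sketched in a few lines of 
Python.  This is only an illustration and not GeneDoc's own code: 
the function names, the dictionary layout, and the choice to skip 
columns conserved as a gap ("-") are all assumptions, and the 
equivalency groups and the consensus step are left out.

```python
def conserved_columns(rows):
    """Return column indices where every row holds the same residue.

    Columns conserved as a gap ("-") are skipped; that is an assumption.
    """
    width = len(rows[0])
    columns = []
    for i in range(width):
        residues = {row[i] for row in rows}
        if len(residues) == 1 and "-" not in residues:
            columns.append(i)
    return columns


def group_conserved(alignment, groups):
    """alignment: name -> aligned sequence; groups: label -> list of names."""
    # Step 1: conservation inside each user defined group.
    within = {
        label: conserved_columns([alignment[name] for name in names])
        for label, names in groups.items()
    }
    # Step 2: conservation across every sequence in the alignment.
    overall = conserved_columns(list(alignment.values()))
    return within, overall


# Hypothetical toy alignment: four sequences in two groups.
alignment = {"s1": "ACDE", "s2": "ACDF", "s3": "ATDG", "s4": "ATDG"}
groups = {"G1": ["s1", "s2"], "G2": ["s3", "s4"]}
within, overall = group_conserved(alignment, groups)
```

In this toy case the second column is conserved inside each group 
but not across the whole alignment, which is the kind of position 
the group shading is meant to bring out.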

The most useful information derived from this analysis is to 
identify for the user the regions of the alignment where 
structural or functional requirements may have been relaxed or 
eliminated (or alternatively added as the group evolved a new 
function) for some groups relative to others.  For this kind of 
information to be reliable the conserved groups need to be both 
large and from a diverse range of organisms.  Otherwise the 
observed conservation may simply be the result of a small data 
set with highly dependent observations.

A more stringent analysis is performed by the "shade group PCR 
contrast" entry on the Groups menu.  Sites highlighted by this 
analysis meet two criteria.  First, a single residue is 
completely conserved within the group.  Second, this conserved 
residue does not appear, at that position, in any sequence 
outside of the group in which it is conserved.  This analysis 
marks unique sequence features of the group that can be useful 
in defining a group motif and possibly in defining a primer 
sequence to be used in a polymerase chain reaction (PCR) 
amplification of the gene.
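
The two criteria translate directly into a short Python sketch.  
The names and data layout here are hypothetical, and treating a 
column conserved as a gap ("-") as not qualifying is an assumption 
rather than documented GeneDoc behavior.

```python
def pcr_contrast_columns(alignment, groups, label):
    """Columns where one residue is conserved in the group and absent outside.

    alignment: name -> aligned sequence; groups: label -> list of names.
    """
    members = set(groups[label])
    inside = [seq for name, seq in alignment.items() if name in members]
    outside = [seq for name, seq in alignment.items() if name not in members]
    hits = []
    for i in range(len(inside[0])):
        residues = {seq[i] for seq in inside}
        # Criterion 1: a single residue is completely conserved in the group.
        if len(residues) != 1 or "-" in residues:
            continue
        residue = next(iter(residues))
        # Criterion 2: that residue appears in no sequence outside the group.
        if all(seq[i] != residue for seq in outside):
            hits.append(i)
    return hits


# Hypothetical toy alignment: four sequences in two groups.
alignment = {"s1": "ACDE", "s2": "ACDF", "s3": "ATDG", "s4": "ATDG"}
groups = {"G1": ["s1", "s2"], "G2": ["s3", "s4"]}
g1_sites = pcr_contrast_columns(alignment, groups, "G1")
g2_sites = pcr_contrast_columns(alignment, groups, "G2")
```

The first and third columns are conserved in both groups with the 
same residue, so they fail the second criterion and are not marked.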

The "shade group contrast" entry on the Groups menu performs an 
analysis similar to that of the "shade group PCR contrast" entry.  
This analysis makes use of the scoring table designated for 
alignment scoring to divide scores for pairs of amino acids into 
three classes, positive, negative, and neutral.  The positive 
scores are those that are positive numbers in the similarity 
scores form of the table.  Similarly, the negative scores are 
those that are negative numbers in the similarity scores form of 
the table, while the neutral scores have a zero score.  The 
scores are stored in GeneDoc as distance or dissimilarity scores 
and hence must be converted to the similarity form.  This is done 
by subtracting the score for a pair of sequence residues in the 
table from a constant called the zero cost distance, stored with 
the table.  Thus the largest distances become negative 
similarities and small distances become positive similarities.  
The interpretation of the scoring tables is that positive 
similarities are conservative substitutions and are favored over 
random substitutions in the evolutionary process relating the 

The analysis performed by the "shade group contrast" entry on the 
Groups menu is less restrictive about the degree of conservation 
within the group than is the "shade group PCR contrast" 
analysis.  All of the sequence residues found at 
a position within the group are required to have a positive 
similarity score with each other, and thus to be conservative 
substitutions.  This analysis is, however, more restrictive than 
is the analysis performed by the "shade group PCR contrast" entry 
on the Groups menu when dealing with residues outside the group.  
The residues outside of the group must have a negative similarity 
score with every residue from within the group, thus they are not 
allowed to be either conservative or neutral substitutions.  An 
example of using this kind of analysis to study the recognition 
of transfer RNAs by aminoacyl tRNA synthetase enzyme can be found 
in McClain and Nicholas, 1987.  Nicholas et al., 1987 describes 
using the contrasts to plan site directed mutagenesis experiments 
to confirm the analysis of the tRNAs.
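
As a rough sketch of the group contrast rule, the Python below 
converts distances to similarities by subtracting them from the 
zero cost distance and then applies the two tests described above.  
The four-residue distance table, the zero cost value of 5, and all 
of the names are invented for illustration; they are not a real 
scoring table, and gap handling is not attempted.

```python
from itertools import combinations

# Invented toy distance table (keys are alphabetically sorted pairs).
ZERO_COST = 5
TOY_DISTANCES = {
    ("D", "D"): 0, ("E", "E"): 0, ("I", "I"): 0, ("L", "L"): 0,
    ("D", "E"): 2, ("I", "L"): 2,
    ("D", "I"): 8, ("D", "L"): 8, ("E", "I"): 8, ("E", "L"): 8,
}


def similarity(a, b, distances=TOY_DISTANCES, zero_cost=ZERO_COST):
    """Similarity = zero cost distance minus the stored distance."""
    return zero_cost - distances[tuple(sorted((a, b)))]


def group_contrast_columns(alignment, groups, label):
    """Columns that are conservative inside the group and contrast outside."""
    members = set(groups[label])
    width = len(next(iter(alignment.values())))
    hits = []
    for i in range(width):
        inside = {alignment[name][i] for name in members}
        outside = {seq[i] for name, seq in alignment.items()
                   if name not in members}
        # Every pair of distinct residues inside must score positive.
        conservative = all(similarity(a, b) > 0
                           for a, b in combinations(sorted(inside), 2))
        # Every inside/outside pair must score negative (neutral fails).
        contrasting = all(similarity(a, b) < 0
                          for a in inside for b in outside)
        if conservative and contrasting:
            hits.append(i)
    return hits


# Hypothetical toy alignment: four sequences in two groups.
alignment = {"s1": "IDI", "s2": "LDL", "s3": "DDE", "s4": "EDE"}
groups = {"G1": ["s1", "s2"], "G2": ["s3", "s4"]}
g1_sites = group_contrast_columns(alignment, groups, "G1")
```

Note that the first column is marked for G1 even though it holds 
both I and L, a site the PCR contrast analysis would reject because 
it is not completely conserved.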

The analysis called the Dstat analysis is the Kolmogorov-Smirnov 
test for the equality of two distributions.  The Dstat analysis 
is accomplished by first selecting a region (or all) of the 
alignment for use in the test calculations.  Then you can either 
select the analysis under the Dstat menu or click the Dstat 
toolbar button.  The Dstat toolbar button is near the right end of 
the upper toolbar and is marked by a pair of "S" shaped curves 
representing the cumulative distributions used in the test.  As 
noted above, the Dstat analysis is a statistical test of whether 
the groups defined by the user are significantly different from 
each other.

The first step in the test is to compute an alignment score for 
each pair of sequences over the region selected by the user.  
These scores are then partitioned into two distributions.  The 
first distribution is composed entirely from scores where the 
two sequences used to compute the score are members of 
different user defined groups.  This is called the between groups 
distribution.  The second distribution is composed entirely from 
scores where both of the sequences used to compute the score are 
members of the same user defined group.  Note that this includes 
scores from every group with two or more sequences.  This 
distribution is called the within groups distribution.
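
The partitioning step might look like the following Python sketch.  
Several things here are assumptions: the pairwise score is a plain 
mismatch count standing in for whatever alignment score is 
configured, the region is a half-open (start, end) column range, 
and pairs that involve a sequence with no user defined group are 
simply left out.

```python
from itertools import combinations


def mismatch_score(a, b):
    """Stand-in distance: number of aligned positions that differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def partition_scores(alignment, group_of, region=None, score=mismatch_score):
    """Split pairwise scores into (between, within) group distributions.

    alignment: name -> aligned sequence
    group_of:  name -> group label (unassigned names are absent)
    region:    optional (start, end) column range, end exclusive
    """
    start, end = region if region is not None else (0, None)
    between, within = [], []
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        g1, g2 = group_of.get(n1), group_of.get(n2)
        if g1 is None or g2 is None:
            continue  # assumption: skip pairs with an unassigned sequence
        value = score(s1[start:end], s2[start:end])
        (within if g1 == g2 else between).append(value)
    return between, within


# Hypothetical toy alignment: four sequences in two groups.
alignment = {"a": "AAAA", "b": "AAAT", "c": "TTTT", "d": "TTTA"}
group_of = {"a": "G1", "b": "G1", "c": "G2", "d": "G2"}
between, within = partition_scores(alignment, group_of)
```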

These two distributions are plotted as cumulative distributions.  
That is, each score is plotted against the fraction of the scores 
in the distribution that are less than or equal to that score.  
The Kolmogorov-Smirnov D statistic (Dstat) is 
the maximum difference between the two distributions (along the 
fractional axis).  Recent advances in the understanding of the 
distribution of values taken on by Dstat allow us to compute its 
one-sided significance probability.  The one-sided significance 
probability is used rather than the two-sided significance 
probability because we are only interested in the case where the 
between groups distribution is composed of larger scores than the 
within groups distribution.  The other situation, where the 
within groups distribution is composed of larger scores than the 
between groups distribution corresponds to either convergent 
evolution or some sort of selection in favor of divergence, 
situations that are not usually part of the hypothesis.
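
A minimal Python sketch of the one-sided statistic follows.  It 
assumes the scores are distances, so that "between groups larger 
than within groups" means the within groups cumulative curve lies 
above the between groups curve.  The function names are 
hypothetical, and the significance probability is not computed 
here.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F(x) = fraction of the sample less than or equal to x."""
    ordered = sorted(sample)
    n = len(ordered)

    def fraction_at_or_below(x):
        return bisect_right(ordered, x) / n

    return fraction_at_or_below


def one_sided_dstat(between, within):
    """Largest amount by which the within curve exceeds the between curve."""
    f_between = ecdf(between)
    f_within = ecdf(within)
    # Both curves are step functions, so the maximum occurs at a data point.
    points = sorted(set(between) | set(within))
    return max(f_within(x) - f_between(x) for x in points)


# Hypothetical score lists for illustration.
between = [4, 3, 3, 4]
within = [1, 1]
d = one_sided_dstat(between, within)
```

When every within groups score is smaller than every between 
groups score, as in this toy input, the statistic reaches its 
maximum value of 1.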

The Kolmogorov-Smirnov test was selected instead of the more 
common Student’s t test or the F test because it is sensitive 
to both the location of the distributions along the scores axis 
and to the shape of the distribution.  Student’s t test is 
sensitive to only the location of the distributions and the F 
test is sensitive only to differences in the variance of the 
distribution, only one of several aspects affecting the shape of 
the distributions.  Thus the Kolmogorov-Smirnov test can find the 
distributions to be different when either Student’s t test or the 
F test might have failed.  Because of this it is necessary for 
the user to examine the plot carefully to determine the exact 
nature of the differences in the two distributions being tested.  
The user should exercise care that the biological hypothesis 
being examined should lead to the type of difference actually 

Examples of testing biological hypotheses with sequence data and 
the Kolmogorov-Smirnov test can be found in Nicholas and Graves, 
1983 and in Nicholas and McClain, 1995.  The Nicholas and Graves 
paper contains an extended discussion of formulating 
Kolmogorov-Smirnov tests that correspond to different kinds of 
biological hypotheses.

The structure groups and shading facility provide an extremely 
powerful and flexible set of tools for integrating sequence 
information with structural information.  The facility is 
flexible enough to allow the user to display almost any kind of 
information as color codes along the sequence.  Such states can 
include the obvious secondary structure state of the residue in 
the three dimensional structure.  Less obvious properties like 
the solvent accessible surface area of the residue or its side 
chain can also be displayed.  The fraction of the side chain in a 
polar environment is another characteristic that is sometime 

One of the most powerful kinds of integration of structure and 
sequence information allows the user to visually examine the 
variation in structure or some structural property that occurs as 
the sequence varies in a series of homologous proteins.  This 
same display allows the user to adjust the alignment based on 
information derived from the varying structures of a series of 
homologous proteins.

Alternatively, you can have several copies of the same sequence 
in the alignment by adding additional copies with different names 
through the sequence import facility.  Remember that each copy 
must have its own distinct name.  Each copy of the 
sequence can then be highlighted using a different property.  
This allows you to easily visualize possible correlations among 
properties, or of properties with sequence or structure.

Another use for multiple copies of a single sequence in the 
alignment is to contrast predictions of structure and properties 
with those observed in an experimentally determined three 
dimensional structure.

A widespread use of the structure shading is to project the 
structure or properties from a sequence of known structure onto 
sequences whose structures have not been experimentally determined.  
This is accomplished by combining the structure shading facility 
with the group facility.  At the same time you use the structure 
shading dialogue to associate a structure file with a specific 
sequence in the alignment you can designate that sequence as the 
master sequence of a structural group.  This allows you to 
associate several sequences with a single structure file.  
These associated sequences will be shaded with the same colors in 
the same column of the alignment as the master sequence regardless 
of the sequence residue present in that position of the sequence.  
Thus this is essentially a low resolution homology modeling 

All of these features can be combined into a very informative 
display for studying structure-function relationships in the 
following procedure.  First, associate a structure file with each 
sequence in your alignment whose structure has been determined by 
X-ray crystallography or NMR.  Make these sequences the master 
sequence for a group of the most closely related sequences in the 
alignment.  Ideally the sequences within each group should have 
common biochemical properties such as substrate specificity, while 
different groups can have different substrate specificities.  Use 
the sequence editing facility to put sequences in the same group 
on adjacent rows of the alignment.  Put the group master sequence 
at the top of each group.  Then set the display mode to 
differences mode and make sure this is applied to all of the 

This yields a display with all of the group master sequences, 
that is, the sequences with known structures, displayed with all 
of their residues.  The other sequences in each group have a dot 
displayed where they have the same sequence residue as the group 
master.  Sequence residues that are different from the group 
master sequence are shown.  This highlights substitutions within 
each group that are presumably successful site directed 
mutagenesis experiments performed by nature.  Differences between 
groups may be associated with the change in substrate specificity 
or other property.
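
The differences display for one group can be imitated with a few 
lines of Python.  This is only a sketch of the idea under assumed 
names; it is not how GeneDoc renders the display, and it treats a 
gap that matches a gap in the master like any other match.

```python
def differences_display(master, others, same_char="."):
    """Show the master in full and dots where other rows match it."""
    rows = [master]
    for seq in others:
        rows.append("".join(
            same_char if residue == reference else residue
            for residue, reference in zip(seq, master)
        ))
    return rows


# Hypothetical group: the master sequence followed by two members.
rows = differences_display("ACDEF", ["ACDEY", "TCDEF"])
for row in rows:
    print(row)
```

Only the substituted residues remain visible in the member rows, 
which is what makes the within-group substitutions stand out.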

It can be very helpful to have exactly the same alignment in a 
second window highlighted in group contrast mode.  This 
combination of displays can be a very powerful tool for examining 
structure-function relationships by integrating a large amount 
of information in an easy to comprehend format and presentation.

Nicholas, H.B., Jr. and Graves, S.B.  1983.  
Clustering of transfer RNA by cell type and amino acid specificity.  
Journal of Molecular Biology, vol. 171, pp. 111-118.

Nicholas, H.B., Jr., Chen, Y-M., and McClain, W.H.  1987.  
Comparisons of transfer RNA sequences.  
Computer Applications in the Biosciences, vol. 3, p. 53.

McClain, W.H. and Nicholas, H.B., Jr.  1987.  
Discrimination between transfer RNA molecules.  
Journal of Molecular Biology, vol. 194, pp. 635-642.

Nicholas, H.B., Jr. and McClain, W.H.  1987.  
An algorithm for discriminating transfer RNA sequences.  
Computer Applications in the Biosciences, vol. 3, pp. 177-181.

Nicholas, H.B., Jr. and McClain, W.H.  1995.  
Searching tRNA Sequences for Relatedness to Aminoacyl tRNA 
Synthetase Families.  
Journal of Molecular Evolution, vol. 40, pp. 482-486.
