Protein variability software
btf at t10.lanl.gov
Thu Sep 14 13:33:20 EST 2000
> William COHEN wrote:
> Does anyone know of software that can be used to calculate
> a consensus sequence AND the variability in amino acids at
> each position of the sequence from 100 or
> 200 protein sequences?
Many multiple sequence alignment programs can
calculate a consensus. The variability is more difficult
to analyze correctly. The best software I am aware of
(but I am quite sure there are many more programs I am
not aware of) is DNA-RATES by Gary Olsen.
I like this method because it takes into account
the phylogenetic tree as well as the sequences. For example,
if you have this alignment:
Fish1 QAAMQMLKDS ILEEAAEWDRI
Fish2 QAARQMLKDS LLEEAAEWDRI
Fish3 QAALQMLKDS INEEAAEWDRI
Mouse QAADQMLKDT LNEEAAEWDRI
Rat QAADQMLKDT INEEAAEWDRI
Human QAAGQMLKDT LLEEAAEWDRI
Cat QAADQMLKDT INEEAAEWDRI
Dog QAADQMLKDT INEEAAEWDRI
you can see that column 10 is about equally "variable"
as column 11: 10 is 3 S and 5 T; 11 is 3 L and 5 I.
But column 10 had just one mutation event: the fish
all have S and the mammals all have T. Column
11 seems to mutate back and forth between I and
L in both the fish and mammalian lineages.
Only a program that considers both the tree
and the sequences can tell you what the mutation
rates are in each column.
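
To make the column 10 versus column 11 comparison concrete, here is a minimal Python sketch. It counts the residues in each column and then counts the minimum number of changes each column needs on a tree, using Fitch parsimony. Two things are assumptions and are not from the post or from DNArates: the tree topology (TREE below is hypothetical, chosen only to separate fish from mammals) and the use of parsimony, which is a simpler stand-in for the maximum likelihood rate estimate that DNArates computes.

```python
from collections import Counter

# Alignment copied from the post (the space between the two blocks is dropped).
ALIGNMENT = {
    "Fish1": "QAAMQMLKDSILEEAAEWDRI",
    "Fish2": "QAARQMLKDSLLEEAAEWDRI",
    "Fish3": "QAALQMLKDSINEEAAEWDRI",
    "Mouse": "QAADQMLKDTLNEEAAEWDRI",
    "Rat":   "QAADQMLKDTINEEAAEWDRI",
    "Human": "QAAGQMLKDTLLEEAAEWDRI",
    "Cat":   "QAADQMLKDTINEEAAEWDRI",
    "Dog":   "QAADQMLKDTINEEAAEWDRI",
}

# ASSUMPTION: hypothetical topology, not given in the post.
# Nested 2-tuples are internal nodes; strings are leaves.
TREE = ((("Fish1", "Fish2"), "Fish3"),
        ((("Mouse", "Rat"), "Human"), ("Cat", "Dog")))


def column_counts(alignment, col):
    """Count the residues in a 1-based alignment column."""
    return Counter(seq[col - 1] for seq in alignment.values())


def fitch(node, alignment, col):
    """Return (candidate states, minimum changes) for the subtree at node."""
    if isinstance(node, str):
        return {alignment[node][col - 1]}, 0
    left, right = node
    lset, lcost = fitch(left, alignment, col)
    rset, rcost = fitch(right, alignment, col)
    common = lset & rset
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, lcost + rcost
    # The children disagree: one change is required on this part of the tree.
    return lset | rset, lcost + rcost + 1


def min_changes(alignment, tree, col):
    """Minimum number of changes in a column under Fitch parsimony."""
    return fitch(tree, alignment, col)[1]


if __name__ == "__main__":
    for col in (10, 11):
        print(col, dict(column_counts(ALIGNMENT, col)),
              min_changes(ALIGNMENT, TREE, col))
```

Both columns have the same 3-to-5 composition, but under this assumed tree column 10 needs one change and column 11 needs three. A different topology would give different counts, which is exactly why the tree has to be part of the input.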
Brian Foley, PhD
HIV Genetic Sequences and Immunology Databases
Gary J. Olsen
August 23, 1993
The DNArates program takes a set of sequences and a corresponding phylogenetic
tree and produces a maximum likelihood estimate of the rate of nucleotide
substitution at each sequence position.
Input is read from standard input. The format is very much like that of the
fastDNAml program. The first line of the input file gives the number of
sequences and the number of bases per sequence. Also on this line are the
requested program option letters. Any auxiliary data required by the options
follow on subsequent lines. Either the user must specify the empirical base
frequencies (F) option, or immediately preceding the data matrix there must be
a line of data with the frequencies of A, C, G and T. Next, the program
expects a data matrix. The first 10 characters of the first line of data for a
given sequence are interpreted as the name (blanks are counted). Elsewhere in
the data matrix, blanks and numbers are ignored. The default data matrix
format is interleaved. If all the data for a sequence are on one input line,
then interleaved and noninterleaved are equivalent. Following the data matrix
there must be a line with the number of user-specified trees for which rates
are to be estimated (as with the U option in fastDNAml). The rest of the input
file is one or more user-specified trees with branch lengths (as with the U and
L options in fastDNAml).
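
The data matrix rules above can be sketched in Python as follows. This is only an illustration of the description, not code from DNArates: the function names are hypothetical, and several details are assumptions, namely that the option letters are separated from the counts by whitespace, that blank lines between interleaved blocks can be skipped, and that later blocks list the sequences in the same order as the first block.

```python
def parse_header(line):
    """Split the first input line into (nseq, nsites, option_letters).

    ASSUMPTION: the option letters follow the two counts, separated by
    whitespace.
    """
    fields = line.split()
    nseq, nsites = int(fields[0]), int(fields[1])
    options = "".join(fields[2:]).upper()
    return nseq, nsites, options


def clean(chunk):
    """Drop blanks and digits, which are ignored in the data matrix."""
    return "".join(ch for ch in chunk if not (ch.isspace() or ch.isdigit()))


def parse_interleaved(lines, nseq):
    """Parse an interleaved data matrix into {name: sequence}.

    The first line for each sequence carries the name in its first 10
    characters (blanks included); later lines carry sequence data only.
    ASSUMPTION: blank lines are skipped and block order is preserved.
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    names, seqs = [], []
    for i, row in enumerate(rows):
        if i < nseq:
            # Trailing padding is stripped only to make a convenient key.
            names.append(row[:10].rstrip())
            seqs.append(clean(row[10:]))
        else:
            seqs[i % nseq] += clean(row)
    return dict(zip(names, seqs))


if __name__ == "__main__":
    # Made-up example input, for illustration only.
    nseq, nsites, options = parse_header("3 12 F")
    matrix = [
        f"{'Seq_A':<10}ACGTAC",
        f"{'Seq_B':<10}ACGTTC",
        f"{'Seq_C':<10}ACGAAC",
        "",
        "GTACGT",
        "GTACGT",
        "GTAC GT",
    ]
    for name, seq in parse_interleaved(matrix, nseq).items():
        print(name, seq, len(seq) == nsites)
```

Because a sequence that fits on one line has no continuation rows, the same function handles that case too, which matches the remark that interleaved and noninterleaved are then equivalent.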