Dirichlet mixture paper available via ftp and on the web

Kimmen Voronov Sjolander kimmen
Wed Nov 8 20:42:03 EST 1995

The Computational Biology group at the University of California, Santa Cruz,
is pleased to announce that the following paper is available via ftp and on
the web.


"Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant
  Protein Sequence Homology",

        Sjolander, K.,  Karplus, K., Brown, M., Hughey, R.,  Krogh, A., 
        Mian, I.S., and Haussler, D.


URL:    http://www.cse.ucsc.edu/research/compbio/dirichlet.html

ftp:    ftp.cse.ucsc.edu

        (A nine-component Dirichlet mixture estimated on the Blocks database is
        also available at this site.)



This paper presents the mathematical foundations of Dirichlet mixtures,
which have been used to improve database search results for homologous
sequences when a variable number of sequences from a protein family
or domain are known.  We present a method for condensing the information
in a protein database into a mixture of Dirichlet densities.
These mixtures are designed to be combined with observed amino acid
frequencies to form estimates of expected amino acid probabilities
at each position in a profile, hidden Markov model, or other statistical model.
These estimates give a statistical model greater generalization capacity,
so that remotely related family members can be more reliably recognized
by the model.  Dirichlet mixtures have been shown to outperform substitution
matrices and other methods for computing these expected amino acid
probabilities in database search, resulting in fewer false positives and
false negatives for the families tested.  This paper corrects a previously
published formula for estimating these expected probabilities, and contains
complete derivations of the Dirichlet mixture formulas, methods for
optimizing the mixtures to match particular databases, and suggestions for
efficient implementation.
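
For readers who want a concrete picture of how a Dirichlet mixture is
combined with observed counts, here is a minimal Python sketch of the
textbook posterior-mean estimate under a Dirichlet mixture prior.  It is an
illustration only, not code from the paper: the function names, the
three-letter toy alphabet, and the mixture parameters are all made up for
the example, and the paper's own notation and derivation should be taken as
authoritative.

```python
from math import lgamma, exp, log

def log_beta(v):
    """Log of the multivariate Beta function: sum(lgamma(v_i)) - lgamma(sum(v))."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))

def posterior_mean(counts, weights, alphas):
    """Estimate letter probabilities from counts under a Dirichlet mixture prior.

    counts  -- observed count for each letter at one position
    weights -- mixture coefficients q_j (assumed positive, summing to 1)
    alphas  -- one Dirichlet parameter vector per mixture component
    """
    # Log of q_j * B(counts + alpha_j) / B(alpha_j) for each component.
    # The multinomial coefficient is the same for every component, so it
    # cancels when the posterior weights are normalized.
    log_post = []
    for q, alpha in zip(weights, alphas):
        combined = [n + a for n, a in zip(counts, alpha)]
        log_post.append(log(q) + log_beta(combined) - log_beta(alpha))

    # Normalize in log space to avoid underflow with large counts.
    top = max(log_post)
    unnorm = [exp(lp - top) for lp in log_post]
    z = sum(unnorm)
    post = [u / z for u in unnorm]

    # Mix the per-component posterior means (n_i + alpha_i) / (|n| + |alpha|).
    total = sum(counts)
    probs = [0.0] * len(counts)
    for w, alpha in zip(post, alphas):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += w * (n + a) / denom
    return probs

if __name__ == "__main__":
    # Hypothetical two-component mixture over a three-letter alphabet.
    toy_weights = [0.5, 0.5]
    toy_alphas = [[1.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
    print(posterior_mean([3, 0, 1], toy_weights, toy_alphas))
```

With no observations the estimate reduces to the mean of the mixture prior,
and as the counts grow it approaches the observed frequencies, which is the
behavior that lets a model built from few sequences still generalize.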
