Statistical significance of short consensus sequences.

Fri Sep 3 19:56:21 EST 1993

Greetings, bio-soft netters.  I have cloned a yeast gene encoding a protein
involved in transcriptional control that lacks any significant homology to
previously identified proteins when the entire coding sequence is searched
against GenBank.  However,
the predicted protein has two short basic stretches at its N-terminus that,
when searched individually against GenBank, are distantly related to
helix-loop-helix (fos, jun, and some plant HLH proteins) DNA-binding domains
and to helix 1 of paired family
homeobox transcription factors.  To gain some perspective on how homologous
these sequences are, I have aligned both regions with a wide variety of HLH
and homeobox proteins.  Although both segments seem to conform to a
pseudo-consensus for HLH
and homeodomains, respectively, I am bothered by the fact that I cannot
statistically validate the significance of the homologies because the
sequences are so short.  What I am looking for is a program for the Mac or
IBM PC (preferably Mac) that allows
multiple short sequences to be input, and a consensus and some statistical
measure of the significance of the homology of the query sequence with the
other aligned sequences to be output.  In simple terms, I have a query
sequence and 10-12 known examples
of a known protein structural motif, and I need to know how significant the
homology is.  Is there such a program for the Mac (preferably) or the IBM PC
that is available at minimal or no cost?  Any information is greatly
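
To make the request concrete, here is a minimal sketch of the kind of
calculation described above.  It is not an existing program, and every
choice in it is an assumption made for illustration: the function names
are made up, the background is a uniform distribution over the 20 amino
acids, a pseudocount of 1 is added per residue, and significance is
estimated by shuffling the query and rescoring it.  The sequences are toy
data, not real HLH or homeodomain segments.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background (assumed)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[col] in counts:  # gaps and unknown residues are skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return profile


def consensus(profile):
    """Highest-scoring residue in each column."""
    return "".join(max(col, key=col.get) for col in profile)


def score(profile, query):
    """Sum of column scores; residues outside the alphabet score 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, query))


def shuffle_p_value(profile, query, trials=10000, seed=0):
    """Fraction of shuffled queries scoring at least as well as the query."""
    rng = random.Random(seed)
    observed = score(profile, query)
    residues = list(query)
    hits = 0
    for _ in range(trials):
        rng.shuffle(residues)
        if score(profile, residues) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Toy, made-up aligned motif examples and query (hypothetical data).
known_examples = ["RRKRNR", "KRKRNR", "RRARNK", "RKKRNR"]
query = "RRKRNR"

profile = build_profile(known_examples)
print("consensus:", consensus(profile))
print("query score (bits):", round(score(profile, query), 2))
print("shuffle p-value:", shuffle_p_value(profile, query, trials=2000))
```

Shuffling keeps the query's amino acid composition fixed, so the p-value
asks whether the order of residues matters, which seems the relevant
question for short basic stretches.  With only 10-12 known examples and
very short segments, any such estimate should be read cautiously.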

