New FASTA release: fasta31t12.shar.Z

William R. Pearson wrp at alpha0.bioch.virginia.edu
Tue Sep 15 12:54:55 EST 1998


A new version of the FASTA3 package has been released:

	ftp://ftp.virginia.edu/pub/fasta/fasta31t12.shar.Z

This version corrects some problems with tfasta/x/y3 with the "-3" and
"-i" options, and improves prss3 statistical estimates.

More importantly, fasta31t12 provides a new capability for accurate
statistical estimates from searches of databases of related proteins.
Previous versions of the fasta2 and fasta3 programs based statistical
estimates on the distribution of similarity scores from the presumed
unrelated sequences that were compared during the database search.
This empirical approach works well for searches of SwissProt, OWL, and
NR, where most of the sequences in the database are unrelated to the
query sequence, but it cannot be used to estimate the statistical
significance of a match when a database of related proteins is
searched.

Fasta31t12 provides a new statistics option: -z 11, which bases the
statistical estimates on shuffled versions of the sequences in the
database.  Thus, if you compare a glutathione transferase to a library
with 100 glutathione transferases, fasta3 -z 11 will calculate the
similarity of the query to each of the 100 library entries, but it
will estimate the significance of the similarity scores based on 100
similarity scores that are produced by comparing the query sequence to
a randomly shuffled version of each of the 100 library entries.  As a
result, the search takes twice as long (twice as many comparisons are
performed) and the histogram displays the distribution of shuffled
sequence scores.
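
The idea behind shuffle-based estimates can be sketched in a few lines
of Python.  This is only an illustration under stated assumptions, not
the code in fasta3: the names toy_score, shuffled and
shuffle_based_zscores are hypothetical, the scoring function is a
trivial ungapped identity count rather than a real alignment score,
and the z-score here uses a plain mean and standard deviation without
any correction for library sequence length.

```python
import random
import statistics


def toy_score(a, b):
    """Toy similarity score: identical residues at the same position.

    A stand-in for a real alignment score; NOT what fasta3 computes.
    """
    return sum(1 for x, y in zip(a, b) if x == y)


def shuffled(seq, rng):
    """Return a uniformly shuffled copy of seq (same length and composition)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def shuffle_based_zscores(query, library, score=toy_score, seed=0):
    """Score the query against each library entry and its shuffled copy.

    The shuffled scores form the null distribution, so 2 * len(library)
    comparisons are made in total.  Returns (z_scores, null_scores).
    """
    rng = random.Random(seed)
    real = [score(query, entry) for entry in library]
    null = [score(query, shuffled(entry, rng)) for entry in library]
    mean = statistics.mean(null)
    # Guard against a zero standard deviation in tiny toy libraries.
    sd = statistics.pstdev(null) or 1.0
    return [(r - mean) / sd for r in real], null
```

Because the null distribution comes only from the shuffled entries, it
stays meaningful even when every real entry in the library is related
to the query.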

This approach is particularly useful when searching nrl3d -
the database of sequences whose structures are known - because nrl3d
is highly redundant.

The current version does not provide for "windowed" shuffling of
sequences, nor does it shuffle translated DNA sequences as codons;
these capabilities will be incorporated soon.  In addition, it relies
on the user to recognize that few unrelated sequences are available
for statistical estimates.  Nonetheless, you may find it very useful
when performing secondary searches on databases of selected protein
families.
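
For readers unfamiliar with the two shuffling variants mentioned
above, here is a rough Python sketch of one common reading of each.
It is an assumption about what the terms mean, not a description of
any planned fasta3 implementation: "windowed" shuffling is taken to
mean shuffling residues only within successive fixed-size windows (so
local composition is preserved), and codon shuffling is taken to mean
permuting whole triplets of a DNA sequence.  The function names are
hypothetical.

```python
import random


def windowed_shuffle(seq, window, rng):
    """Shuffle residues within consecutive windows of the given size."""
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)


def codon_shuffle(dna, rng):
    """Permute whole codons; trailing 1-2 bases are left in place."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    rng.shuffle(codons)
    return "".join(codons) + dna[usable:]
```

With a window as long as the sequence, windowed_shuffle reduces to an
ordinary uniform shuffle.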

Bill Pearson



