Looking for B.L.A.S.T. benchmark data sets

Kevin Karplus karplus at bray.cse.ucsc.edu
Thu Sep 20 03:59:36 EST 2001


On Wed, 19 Sep 2001 09:44:22 GMT, Joseph Dale <jdale at uclink.berkeley.edu> wrote:

>.. it's probably because as a computer scientist, benchmarking
>raises many warning flags in my mind. It is an utterly black art. And it
>gets even blacker and messier in the real world (i.e., applications as
>opposed to straight-up "how many ops per second?" type questions).

This is a real concern---but benchmarking is very useful when you have a
specific application you want to run, and wish to compare different
vendors' configurations on that application.  I assume that is what
John Hinsdale had in mind.


>> Specifically I need database(s) that represent typical usage and take
>> at least several minutes (rather than seconds) to run.  The data can
>> be old and public-domain; it just has to exercise the BLAST tools in a
>> way they typically are used.
>
>Well, what is "typical"? Does "typical" usage vary among the individual
>BLAST tools, and how? Does "typical" usage vary with the particular
>investigation for which the BLAST results are being used? Which specific
>BLAST types are you interested in? Does "typical" usage necessarily
>imply datasets which require "several minutes (rather than seconds) to
>run"? How do you plan to analyze the run time data? Might not a given
>statistical approach to the analysis be appropriate regardless of
>specific numbers (minutes vs. seconds)?

Two problems here: one John Hinsdale's and one Joseph Dale's.

Assuming that my guess about the purpose of the benchmarking is
correct, and John Hinsdale wants to evaluate different servers, then
Joseph is correct in jumping on the word "typical".  What I do
(searching all of the NR protein database from NCBI with protein
sequences using blastp) is typical for me, but not typical for people
with EST data or people searching a single microbial genome.  The time
for a search like this depends on the length of the query sequence,
but a single query typically takes about 2-3 minutes. A big cause of
variation is the number of hits found, since the alignment of the
found hits is substantially slower than the initial search.  This is
particularly noticeable if you pick an immunoglobulin or an HIV coat
protein as the query, since there will be 1000s of hits.
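
A minimal sketch of how per-query wall-clock times might be collected
for such a comparison.  The names time_command and summarize are
hypothetical helpers, and the command shown is only a stand-in;
substitute the actual blastp invocation used on each server.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a failed run is never timed as valid.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Report the spread of the timings, not just a single number."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    # Stand-in command (assumption): replace with the real search command.
    cmd = [sys.executable, "-c", "pass"]
    print(summarize(time_command(cmd)))
```

Timing several different queries (short, long, and one with very many
hits) and reporting the spread makes the variation described above
visible instead of hiding it in one average.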

Joseph's mistake was in forgetting that timing measurements on
application programs suffer from coarse quantization.  A long run
has much less relative error than a short one.  Systematic errors
that mask true differences in hardware performance are also more
likely when timing short runs.  (Of course, if the true workload
involves 1000s of short runs, then the correct mix to time is 1000s of
short runs, so Joseph wasn't really wrong.)
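
A rough illustration of the relative-error point, as a sketch only.
The 1-second timer resolution and the two run lengths are assumed
values chosen for the arithmetic, not measurements, and
relative_error is a hypothetical helper.

```python
def relative_error(run_seconds, resolution_seconds):
    """Worst-case relative error caused by timer granularity alone."""
    if run_seconds <= 0:
        raise ValueError("run_seconds must be positive")
    return resolution_seconds / run_seconds


# Assumed values: a timer that resolves to 1 second, timing a
# 3-second run and a 150-second (2.5-minute) run.
for run in (3.0, 150.0):
    err = relative_error(run, 1.0)
    print(f"{run:6.1f} s run: up to {err:.1%} relative error")
```

With these assumed numbers the short run can be off by about a third,
while the long run is off by well under one percent, which is why
runs of minutes rather than seconds are easier to compare.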

>> Any help would be greatly appreciated; anyone out there who could
>> spare a few  hours over the course of a week or two, working w/ me by
>> Email or phone, I would also be willing to compensate say $100 - $150
>> an hour or whatever for your help.

The tiny amount of advice I gave here is free---but I am available for
small amounts of paid consultation.

-- 
Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
Professor of Computer Engineering, University of California, Santa Cruz
Affiliations for identification only.




