[Computational-biology] Re: processor system

Scott Harper Scott.Harper at AdaptiveGenomics.com
Tue May 9 15:09:09 EST 2006


On Mon, 24 Apr 2006 13:00:54 -0700, Kevin Karplus wrote:
[...] 
> 
> The real problems in computer architecture for bioinformatics have to do
> with handling the data, not the computation.  We need high-performance
> data architectures, which the computer engineering community has not paid
> nearly enough attention to.

I agree that the main issue in bioinformatic processing is data movement.
The paragraph above may be a bit misleading, however.  There is an active
community of researchers investigating the issues of parallel processing
for bioinformatics.  All of these efforts consider both data movement and
processing issues.

Academically, efforts like MPIBlast and BeoBlast tend to target clusters
of standard servers.  Clusters running this type of software are already
in active use by research centers around the world.  Unfortunately, data
movement can be difficult in clusters, hindering realization of potential
performance. Of course, large clusters also tend to be expensive to build,
house, and maintain.

Commercially, companies like Adaptive Genomics, Cray Inc, Paracel,
TimeLogic, and (to some extent) Starbridge Systems have all made inroads
into parallel processing for bioinformatic data.  Paracel and Starbridge
have recently left the commercial bioinformatic arena, but the efforts
of all of these companies have resulted in significant boosts to
biosequence data processing rates. For example, Cray has benchmarked a
Smith-Waterman algorithm (SSEARCH34) on a single 64-bit AMD Opteron at 100
million cell updates / sec (MCUPS). Aligning the human X and Y chromosomes
at this rate would take (154824267*57701691 cells) / (100 MCUPS) seconds,
or about 2.8 years.  Cray claims that an implementation on the new XD1
speeds up this analysis by 28 times, dropping the alignment time to 36.5
days.  Tests run on the latest base model HyperSeq system from Adaptive
Genomics show a reduction of this wait time to less than 30 hours.  Both
of these improvements were the result of designs that synergistically
considered basic processing needs and data flow through the systems. 
Clearly, architectures are already moving forward with a combined approach
to processing and data flow, and that approach is delivering significant
improvements to bioinformatic processing.
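
For anyone who wants to check the arithmetic above, here is a minimal
Python sketch.  The sequence lengths, the 100 MCUPS baseline, the 28x
speedup, and the 30 hour figure are the ones quoted in this message; the
function and variable names are my own, and the calculation assumes a
full, unpruned Smith-Waterman matrix.  The 30 hour figure is an upper
bound on run time, so the rate derived from it is only a lower bound.

```python
# Back-of-the-envelope timing for a full Smith-Waterman matrix.
# Input figures are those quoted in the message; names are illustrative.

X_LEN = 154_824_267      # length of the first sequence (bases)
Y_LEN = 57_701_691       # length of the second sequence (bases)
BASELINE_CUPS = 100e6    # 100 MCUPS, the quoted single-Opteron rate

SECONDS_PER_DAY = 86_400


def alignment_days(len_a: int, len_b: int, cups: float) -> float:
    """Days needed to update every cell of a len_a x len_b matrix."""
    cells = len_a * len_b
    return cells / cups / SECONDS_PER_DAY


baseline_days = alignment_days(X_LEN, Y_LEN, BASELINE_CUPS)

# Quoted 28x speedup applied to the unrounded baseline.
xd1_days = baseline_days / 28

# Sustained rate implied by finishing in 30 hours (a lower bound).
implied_gcups = X_LEN * Y_LEN / (30 * 3600) / 1e9

print(f"baseline: {baseline_days:.0f} days ({baseline_days / 365:.2f} years)")
print(f"28x speedup: {xd1_days:.1f} days")
print(f"30 hours implies at least {implied_gcups:.1f} GCUPS")
```

The unrounded baseline comes out slightly above 2.8 years, so the 28x
figure lands a little above the 36.5 days quoted here, which was derived
from the rounded 2.8 years.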

Addressing the original poster's question, the bioinformatic community
does have an interest in multiprocessing.  Whether that interest
originates with bioinformatics companies or with end users is an open
question.  If GMU can develop a system to increase processing rates beyond
those provided by the current crop of sequence alignment systems, I'm
sure someone would be interested. System results should at least be
publication-worthy, even if they fail to attract corporate attention. 
Sequence databases are always growing in size, and someone is always
looking for a faster alignment system.

Some references for interested parties:
MPIBlast : http://mpiblast.lanl.gov/
BeoBlast :
  http://bioinformatics.fccc.edu/software/OpenSource/beoblast/beoblast.shtml

Adaptive Genomics : http://www.adaptivegenomics.com
Cray Inc : http://www.cray.com/products/xd1/smithwaterman.html
TimeLogic : http://www.timelogic.com/

-- 
 . Dr. Scott Harper
 . Adaptive Genomics Corp.
 . 620 N. Main St, Suite 103
 . Blacksburg, VA 24060
 . Scott.Harper at AdaptiveGenomics.com, 540-552-2700


