1193 IBM's New Gene-Searching Algorithm Available Over Internet Jul 6
More Select News
more at hpcwire.ans.net
Mon Jul 12 05:19:38 EST 1993
Yorktown Heights, N.Y. -- IBM researchers have devised a powerful new way
to search the human genome using an object-recognition technique borrowed
from computer vision.
IBM said that even on a single workstation, the new approach is more
efficient than existing methods -- including those developed for massively
parallel computers.
The heart of the technique, presented Tuesday at the Intelligent Systems
for Molecular Biology conference in Bethesda, Md., is an algorithm called
FLASH (Fast-Lookup Algorithm for Sequence Homology). FLASH was developed by
Andrea Califano and Isidore Rigoutsos at the IBM Research Division's
Thomas J. Watson Research Center in Yorktown Heights.
The algorithm's powerful search capabilities are currently being made
available free of charge to genome scientists around the world through the
Internet.
A genome is the complete set of DNA, or genes, that determines the
characteristics of a living being. DNA contains the blueprint for proteins,
which make up living things. Biochemical geneticists who sequence or catalog
the DNA and proteins of various living organisms routinely search public
genetic archives -- such as "Genbank," managed by the National Institutes of
Health -- to see if newly identified sequences are similar in structure to
known ones, which may be an indication of similar functional properties. Such
knowledge helps in the study of the evolutionary relatedness of living
organisms, as well as inherited and other types of diseases.
Today, databases like Genbank contain all DNA sequenced to date -- some 100
million nucleotides and amino acids from DNA and proteins. By the end of the
century, however, that number is expected to grow to more than two billion.
Today's most advanced computer search techniques employ a scanning
algorithm that must read the entire contents of a database to find similar
sequences.
"This is akin to looking up a name in a huge telephone directory with
listings assembled not alphabetically but at random," IBM noted in a release.
As Genbank's database grows, the time it takes to access its contents with
scanning-based techniques will increase significantly. It currently takes the
fastest available conventional scanning methods about five minutes to process
100 megabytes of DNA sequences, and relevant ones are often missed; to scan 2
gigabytes (the expected size of the human genome) would take many hours.
By contrast, FLASH is structured as an indexed algorithm and can find 99
percent of all sequence similarities within a few seconds, the researchers
said.
That time will not increase appreciably as the current genetic databases
approach their target sizes, they added. This could significantly help
researchers involved in the genome project to cope with the ever-increasing
rate at which new sequences are found and deciphered.
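The difference between scanning and indexed lookup can be illustrated with a
small sketch. This is not the FLASH algorithm itself, whose details are not
given here; it is a generic index over fixed-length substrings ("k-mers"),
assumed purely for illustration. The function names, the toy database, and
the choice of k are all hypothetical.

```python
from collections import defaultdict


def scan_search(database, query):
    """Scanning approach: examine every sequence in the database."""
    return [name for name, seq in database.items() if query in seq]


def build_index(database, k):
    """Map each length-k substring (k-mer) to the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def index_search(index, query, k):
    """Indexed approach: look up only the query's own k-mers.

    Returns (name, hits) pairs, most shared k-mers first.
    """
    votes = defaultdict(int)
    for kmer in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for name in index.get(kmer, ()):
            votes[name] += 1
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy database; names and sequences are made up for illustration.
database = {
    "seqA": "ACGTACGTGA",
    "seqB": "TTGACCGTAA",
    "seqC": "GGGTACGTCC",
}
index = build_index(database, k=4)

print(scan_search(database, "TACGT"))
print(index_search(index, "TACGT", k=4))
```

The scan touches every stored sequence on every query, so its cost grows with
the database. The index is built once; each query then costs a handful of
table lookups, roughly independent of how many unrelated sequences are stored.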
FLASH is being offered initially to all members of the genome project
community for searches of protein sequences on the EMBL/SwissProt protein
database. The system is currently being expanded to include the complete
Genbank database. Requests for use of the system can be directed via
electronic mail to dflash at watson.ibm.com.
The FLASH algorithm is part of a general class of algorithms, pioneered at
the IBM T.J. Watson Research Center, that can be used to search very large
databases containing diverse information. Such retrieval capabilities
include, among others, finding molecules of similar shape or structure for
drug design, text searches, and visual object recognition.
Copyright 1993 HPCwire.
Rob Harper E-mail: harper at convex.csc.fi
Center for Scientific Computing Molbio/software: harper at nic.funet.fi
Tietotie 6, P.O. Box 405 Telephone: +358 0 457 2076
SF-02101 Espoo Finland Fax: +358 0 457 2302