[Bio-software] Lucene outperforms MySQL, BerkeleyDB, and PostgreSQL for genome map searches

Don Gilbert gilbertd at bio.indiana.edu
Fri Sep 2 17:03:23 EST 2005


Lucene outperforms MySQL, BerkeleyDB, and PostgreSQL for 
genome map database searches.

GBrowse (Generic Genome Browser, http://www.gmod.org/) is a widely
used program for displaying maps of genome data in
biology/bioinformatics. One need it serves is helping biologists
quickly and easily locate features of interest among the tens of
millions of genome features for an organism.

Lucene, and the Lucegene project built on it, are well suited to
rapidly and easily searching the large, complex and diverse body of
genome data.  They are useful for searching genome sequences,
literature and experimental data, interactions among genes, and
other categories of genome information.  Lucegene leverages the
speed, high-volume capability and data-source adaptability of Lucene
for searching multi-gigabyte bioinformatics databases.

Lucene is focused more on text searches and less on numerics, the
opposite of relational databases. It is nonetheless capable at
numeric searches, including the demanding genome use of quickly
showing biologists the locations of their favorite genes and other
features among millions of features spread across hundreds of
millions of possible locations.
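
To make the numeric search concrete: a map display asks for every
feature that overlaps a reference sequence and coordinate range. The
sketch below shows that overlap test as a plain linear scan in Python.
It is only an illustration of the query, not code from Lucene, Lucegene
or GBrowse; the Feature class, the overlapping() function and the
sample genes are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One genome feature (hypothetical record layout)."""
    name: str
    ref: str    # reference sequence, e.g. a chromosome arm
    start: int  # 1-based start coordinate
    end: int    # inclusive end coordinate


def overlapping(features, ref, start, end):
    """Return features on `ref` that overlap the range [start, end]."""
    return [
        f for f in features
        if f.ref == ref and f.start <= end and f.end >= start
    ]


# Made-up sample data, not taken from any real annotation.
FEATURES = [
    Feature("geneA", "2L", 1_000, 5_000),
    Feature("geneB", "2L", 480_000, 520_000),
    Feature("geneC", "2L", 600_000, 610_000),
    Feature("geneD", "3R", 1_000, 5_000),
]

# A 500 kb window on 2L, like the map range used in the timings.
hits = overlapping(FEATURES, "2L", 1, 500_000)
print([f.name for f in hits])
```

A scan like this touches every feature, which is why an indexed
backend matters once there are millions of them.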

  Time (seconds) for GBrowse web display, 30 iterations
  at different map locations on the fruitfly (dmel) genome
  -------------------------------------------------------------
                          Server3        Server2      Relative
    GBrowse adaptor      Mean    SE     Mean    SE   time (avg.)
  dmel_lucegene_500k      5.4   0.15    1.86   0.05     100
  dmel_lucene_500k        6.1   0.13    2.23   0.05     117
  dmel_mysql_500k         7.9   0.31    2.14   0.06     128
  dmel_bdb_500k           8.3   0.53    4.10   0.32     187
  dmel_chadofc_500k      25.9   0.91    9.86   0.77     510
  -------------------------------------------------------------
  
These tests use a 500 kb map range; differences increase with map
range. All adaptors use the same data. Most of the response time is
spent drawing maps once features have been extracted from the
database, but adaptor speed is one factor that affects how quickly
maps display. There are slight differences in the displays, due to
configuration and to how each adaptor works, but no significant
differences in the data the adaptors return. The Lucene and MySQL
indices are shared across platforms here. The BerkeleyDB and Postgres
databases cannot be, and had to be regenerated for each server.
Server2 is x64-Solaris-10 (yr2005), Server3 is ppc-MacOSX-10.3
(yr2004).
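
The post does not say how the relative time column was computed. One
plausible reading, sketched below in Python, takes each adaptor's mean
time as a ratio to the lucegene mean on the same server, averages the
two ratios, and scales by 100. Working from the rounded means in the
table, this gives 100 for lucegene and about 187 for bdb, and lands
within a few percent of the other published values. The formula and the
relative_times() name are assumptions; the numbers are the table means.

```python
# Mean display times in seconds, (Server3, Server2), from the table.
MEANS = {
    "dmel_lucegene_500k": (5.4, 1.86),
    "dmel_lucene_500k": (6.1, 2.23),
    "dmel_mysql_500k": (7.9, 2.14),
    "dmel_bdb_500k": (8.3, 4.10),
    "dmel_chadofc_500k": (25.9, 9.86),
}


def relative_times(means, baseline="dmel_lucegene_500k"):
    """Average the per-server ratios to the baseline, scaled to 100.

    This is an assumed formula, not one given in the post.
    """
    base = means[baseline]
    result = {}
    for adaptor, times in means.items():
        ratios = [t / b for t, b in zip(times, base)]
        result[adaptor] = 100 * sum(ratios) / len(ratios)
    return result


rel = relative_times(MEANS)
for adaptor, value in rel.items():
    print(f"{adaptor:20s} {value:6.1f}")
```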

The fastest adaptor here, Lucegene, has algorithms tuned for genome
map range searches. The simple lucene adaptor is directly comparable
in operation to the mysql and berkeleydb adaptors: it uses Lucene as
persistent, searchable data storage without Lucene-optimized
functions.
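
The post does not describe the Lucegene range-search algorithms. As an
illustration of how a term-oriented index can be tuned for range
queries in general, here is a sketch of fixed-size binning, a common
technique in genome databases: each feature is filed under every bin it
spans, so a range query becomes a handful of exact bin lookups followed
by an overlap check. Whether Lucegene works this way is not stated; the
BinnedIndex class, the bin size and the sample genes are assumptions
made for the example.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # hypothetical bin width in bases


def bins_for(start, end, bin_size=BIN_SIZE):
    """Bin numbers touched by the inclusive range [start, end]."""
    return range(start // bin_size, end // bin_size + 1)


class BinnedIndex:
    """Toy in-memory index keyed by (reference, bin number)."""

    def __init__(self):
        self._bins = defaultdict(list)

    def add(self, name, ref, start, end):
        # File the feature under every bin it spans.
        for b in bins_for(start, end):
            self._bins[(ref, b)].append((name, start, end))

    def search(self, ref, start, end):
        """Names of features overlapping [start, end], sorted by position.

        Assumes feature names are unique within the index.
        """
        found = {}
        for b in bins_for(start, end):
            for name, s, e in self._bins.get((ref, b), ()):
                if s <= end and e >= start:  # exact overlap check
                    found[name] = (s, e)
        return sorted(found, key=lambda n: found[n])


# Made-up sample data.
index = BinnedIndex()
index.add("geneA", "2L", 1_000, 5_000)
index.add("geneB", "2L", 480_000, 520_000)
index.add("geneC", "2L", 600_000, 610_000)

print(index.search("2L", 450_000, 550_000))
```

The query above reads only two bins instead of scanning every feature,
which is the kind of saving that grows with the size of the genome.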

Apart from the slow Chado Postgres adaptor, the speed differences
here are not dramatic, but they add to the other advantages of this
cross-platform, Java-based system, even when it is combined with
Perl-based tools such as GBrowse. One important but difficult to
measure factor is the cost of management, since genome data are
frequently updated from diverse sources.  Installing Lucene for this
use is a simple matter of adding the Java library to the map
software.  Lucene databases are easy to create from source data, and
can be copied and shared across computer systems, whereas compiled
software and binary databases usually need to be regenerated by
informaticians.

GBrowse Perl Adaptor key: 
  lucegene -  lucegene.pm GFF   (Lucene v1.9; Java 1.4/1.5)
  lucene   -  simple lucene.pm GFF (Lucene v1.9; Java 1.4/1.5)
  bdb      -  berkeleydb.pm GFF (BerkeleyDB v4.2)
  mysql    -  mysqlopt.pm GFF   (MySQL v4.0x)
  chadofc  -  chado.pm DAS, modified for flybase Chado db (Postgres v7 & 8)
These are available through GMOD projects for use with GBrowse.

Preliminary tests suggest that Lucene may outperform Lion Bioscience's
SRS at basic bio-databank search and retrieval, such as with the
UniProt database.

See also
http://sourceforge.net/mailarchive/forum.php?thread_id=8094404&forum_id=31947
http://www.gmod.org/, http://www.gmod.org/lucegene/, 
and http://lucene.apache.org/

The archive at ftp://ftp.eugenes.org/eugenes/gbrowse/
has a set of Lucene indices of genomes for Worm, Yeast, Rice,
and 9 Fruitfly species, along with GBrowse configuration files. You
should be able to copy these, add the Lucene-lite and Lucegene
adaptors to GBrowse, and display the genomes from your favorite
server computer.

Example servers with these data and comparisons to other
GBrowse adaptors (Chado-Pg, MySQL, BerkeleyDB) are here:
 http://server2.eugenes.org/gbrowse/  (Sun-Solaris-x64)
 http://server3.eugenes.org/gbrowse/  (Apple-MacOSX-ppc)

--
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at indiana.edu--http://marmot.bio.indiana.edu/


