LuceGene Document/Object Search and Retrieval for Genome Databases

Don Gilbert gilbertd at
Thu Apr 22 13:06:59 EST 2004

GMOD: LuceGene
Document/Object Search and Retrieval for Genome Databases
20 April 2004


This is an open-source document/object search and retrieval system
specially tuned for bioinformatics text databases and documents. It is
part of the GMOD (Generic Model Organism Database) project, and also

LuceGene is similar in concept to the widely used, commercially
successful bioinformatics program SRS (Sequence Retrieval System).
It is built on top of the open-source Lucene package.
Though written in the Java language, it can be used from command-line
shells, and performs well that way (current uses include Perl CGIs
calling lucegene). LuceGene uses Lucene unchanged, but adds
Lucene class overrides for biology data.

It includes common text search features: booleans, phrases, word
stemming, fuzzy and field range searches, and relevance ranking. Lucene's
index/search methods are comparable to those used by web-indexing
systems such as Glimpse, Excite, AltaVista, and Google.
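
To make the index/search idea concrete, here is a minimal sketch of an
inverted index with a boolean AND search. It is an illustration only,
not Lucene or LuceGene code: the class TinyIndex, its methods, and the
sample documents are all hypothetical, and it is written for a current
JDK rather than Java 1.4. It leaves out stemming, phrases, fuzzy
matching and ranking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy class; not part of Lucene or LuceGene.
public class TinyIndex {
    // term -> sorted set of ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercase the text and split on anything that is not a letter or digit. */
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Index one document under the given id. */
    public void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Boolean AND: ids of documents that contain every term of the query. */
    public SortedSet<Integer> searchAll(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its postings
            } else {
                result.retainAll(docs);         // later terms: intersect
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        idx.add(1, "alcohol dehydrogenase gene in Drosophila");
        idx.add(2, "Daphnia heat shock gene");
        idx.add(3, "Drosophila heat shock protein");
        System.out.println(idx.searchAll("drosophila gene")); // [1]
        System.out.println(idx.searchAll("heat shock"));      // [2, 3]
    }
}
```

A real engine adds per-field indexes, term positions for phrase
queries, and scoring on top of this basic term-to-documents mapping.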

LuceGene additions include:
 Data input adaptors : HTML; XML (e.g. MedLine); FlyBase flatfile;
Biosequences (GenBank, EMBL, etc.)
 Basic output formats : XML, HTML via XSLT, Text, Spreadsheet
 Numeric Range search primitive (added April 2004)
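
As background on why numeric ranges need special handling: an index
that compares terms as strings sorts "9" after "10", so a plain string
range does not follow numeric order. One common workaround, sketched
below, is to store numbers as fixed-width, zero-padded strings. This is
only an illustration under that assumption and is not claimed to be how
LuceGene's range primitive works; the class PaddedRange, the WIDTH
value, and the gene names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration; not LuceGene's implementation.
public class PaddedRange {
    // Assumed key width: enough for values up to 9,999,999,999.
    // Larger values would exceed the width and break the ordering.
    static final int WIDTH = 10;

    /** Encode a non-negative number so that string order equals numeric order. */
    public static String encode(long n) {
        if (n < 0) {
            throw new IllegalArgumentException("only non-negative values are supported");
        }
        return String.format("%0" + WIDTH + "d", n);
    }

    /** Inclusive range lookup over a map sorted by encoded key. */
    public static List<String> range(TreeMap<String, String> index, long lo, long hi) {
        return new ArrayList<>(index.subMap(encode(lo), true, encode(hi), true).values());
    }

    public static void main(String[] args) {
        // Unpadded strings compare character by character: "9" sorts after "10".
        System.out.println("9".compareTo("10") > 0);           // true

        // Hypothetical features keyed by their start coordinate.
        TreeMap<String, String> byStart = new TreeMap<>();
        byStart.put(encode(950), "geneA");
        byStart.put(encode(10200), "geneB");
        byStart.put(encode(125000), "geneC");

        System.out.println(range(byStart, 1000, 200000));      // [geneB, geneC]
    }
}
```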

It is being tested and used to search and retrieve from hundreds of
thousands of data and document objects in the FlyBase and euGenes
collections: genes, references, sequences and XML annotations, Medline
abstracts, and HTML, PDF and text documents.

Public services using LuceGene (Apr 2004)

euGenes multi-organism gene search/retrieval

Daphnia/wFleaBase search for sequences, Medline abstracts, web documents

FlyBase Annotated sequence bulk-retrieval service

FlyBase Apollo annotation data web service


LuceGene requires Java 1.4 or later to compile and run.
The Java Ant build system is supported for compiling sources.
The Jakarta Lucene project library is included with this package, as
are other required Java libraries.  It may also be found

These alpha distribution files are currently available:
 lucegene-1.2-src.jar : sources, documents, configuration for base
lucegene software with indexing methods for biology data
 lucegene.war : binary distribution, for webapp (Tomcat) uses

See the repository for gmod/lucegene. 
It is also available as part of the ARGOS genome database
replication system at

- Don Gilbert
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at
