LuceGene Document/Object Search and Retrieval for Genome Databases
gilbertd at bio.indiana.edu
Thu Apr 22 13:06:59 EST 2004
Document/Object Search and Retrieval for Genome Databases
20 April 2004
This is an open-source document/object search and retrieval system
specially tuned for bioinformatics text databases and documents. It is
part of the GMOD (Generic Model Organism Database) project,
http://www.gmod.org/lucegene/, and also
LuceGene is similar in concept to the widely used, commercially
successful, bioinformatics program SRS (Sequence Retrieval System).
It is built on top of the open-source Lucene package,
Though written in Java, it can be used from command-line shells,
and performs well that way (current uses include Perl CGIs calling
lucegene). LuceGene uses Lucene unchanged, but adds overrides of
Lucene classes for biology data.
It includes common text search features: booleans, phrases, word
stemming, fuzzy and field range searches, and relevance ranking. Lucene
is comparable to the index/search methods used by web-indexing systems
such as Glimpse, Excite, AltaVista, and Google.
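As a rough illustration of the index/search idea behind these features, here is a minimal sketch of an inverted index with a boolean AND search and a crude relevance ranking. It does not use Lucene or LuceGene code; the class name TinyIndexSketch, its methods, and the count-based scoring are all assumptions made for illustration, and Lucene's real analysis and scoring are considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy index, for illustration only (not Lucene or LuceGene code).
public class TinyIndexSketch {
    // term -> (document id -> number of occurrences of the term)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Split text into lowercase alphanumeric tokens and record each occurrence.
    public void add(String docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, k -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Boolean AND: keep documents that contain every query term, then rank
    // them by the summed occurrence counts (ties broken by document id).
    public List<String> searchAll(String... queryTerms) {
        Map<String, Integer> scores = null;
        for (String raw : queryTerms) {
            Map<String, Integer> docs = postings.getOrDefault(
                    raw.toLowerCase(), Collections.<String, Integer>emptyMap());
            if (scores == null) {
                scores = new HashMap<>(docs);
            } else {
                scores.keySet().retainAll(docs.keySet());
                for (Map.Entry<String, Integer> e : scores.entrySet()) {
                    e.setValue(e.getValue() + docs.get(e.getKey()));
                }
            }
        }
        if (scores == null) return new ArrayList<>();
        final Map<String, Integer> finalScores = scores;
        List<String> ranked = new ArrayList<>(finalScores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(finalScores.get(b), finalScores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        index.add("doc1", "white eye color, white locus");
        index.add("doc2", "white eye");
        index.add("doc3", "eye color");
        System.out.println(index.searchAll("white", "eye"));
    }
}
```

Phrase, fuzzy, and stemmed searches need more machinery (token positions, edit distance, a stemmer), which is what a full library such as Lucene provides.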
LuceGene additions include data input adaptors for HTML, XML (e.g.
MedLine), FlyBase flatfile, and biosequences (GenBank, EMBL, etc.);
basic output formats for XML, HTML via XSLT, text, and spreadsheet;
and a numeric range search primitive (added April 2004).
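To show why numeric ranges need special handling in a text index, here is a minimal sketch of one common technique: zero-padding numbers so that string order matches numeric order. This is an assumption for illustration and not necessarily how LuceGene's numeric range primitive works; the class PaddedRangeSketch, the fixed WIDTH, and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch; not LuceGene's actual numeric range implementation.
public class PaddedRangeSketch {
    // Assumed fixed width, wide enough for the values being indexed.
    static final int WIDTH = 10;

    // Sorted term dictionary: zero-padded term -> ids of documents with that value.
    private final TreeMap<String, List<String>> terms = new TreeMap<>();

    // Encode a non-negative number as a zero-padded term, so that comparing
    // terms as strings gives the same order as comparing the numbers.
    public static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("non-negative values only");
        }
        return String.format("%0" + WIDTH + "d", value);
    }

    public void add(String docId, long value) {
        terms.computeIfAbsent(encode(value), k -> new ArrayList<>()).add(docId);
    }

    // Inclusive range search over the encoded terms; requires low <= high.
    public List<String> range(long low, long high) {
        List<String> hits = new ArrayList<>();
        for (List<String> ids
                : terms.subMap(encode(low), true, encode(high), true).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        PaddedRangeSketch index = new PaddedRangeSketch();
        index.add("geneA", 900);
        index.add("geneB", 1500);
        index.add("geneC", 12000);
        // Unpadded, "900" sorts after "1500" as a string, so a plain
        // string range from 1000 to 20000 would mishandle it.
        System.out.println("900".compareTo("1500") > 0);
        System.out.println(index.range(1000, 20000));
    }
}
```

The padding trick covers only non-negative integers of bounded size; negative or fractional values would need a different encoding.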
It is being tested and used to search and retrieve from hundreds of
thousands of data and document objects in the FlyBase and euGenes
collections: genes, references, sequences and XML annotations, Medline
abstracts, and HTML, PDF and text documents.
Public services using LuceGene (Apr 2004)
euGenes multi-organism gene search/retrieval
Daphnia/wFleaBase search for sequences, Medline abstracts, web documents
FlyBase Annotated sequence bulk-retrieval service
FlyBase Apollo annotation data web service
LuceGene requires Java 1.4 or later to compile and run.
The Java Ant build system is supported for compiling sources.
The Jakarta Lucene project library is included with this package, as
are other required Java libraries. It may also be found
Currently these alpha distribution files are available:
lucegene-1.2-src.jar : sources, documents, configuration for base
lucegene software with indexing methods for biology data
lucegene.war : binary distribution, for webapp (Tomcat) uses
See the cvs.sourceforge.net repository for gmod/lucegene.
It is also available as part of the ARGOS genome database
replication system at
- Don Gilbert
-- gilbertd at indiana.edu--http://marmot.bio.indiana.edu/