LuceGene Document/Object Search and Retrieval for Genome Databases

Don Gilbert gilbertd at bio.indiana.edu
Thu Apr 22 13:06:59 EST 2004


GMOD: LuceGene
Document/Object Search and Retrieval for Genome Databases
20 April 2004

Description

This is an open-source document/object search and retrieval system
specially tuned for bioinformatics text databases and documents. It is
part of the GMOD (Generic Model Organism Database) project,
http://www.gmod.org/lucegene/, and also
http://eugenes.org:8081/gmod/lucegene/ 

LuceGene is similar in concept to the widely used, commercially
successful, bioinformatics program SRS (Sequence Retrieval System).
It is built on top of the open-source Lucene package,
http://jakarta.apache.org/lucene/
Though written in Java, it can be used from command-line
shells, and performs well that way (current uses include Perl CGIs
calling lucegene). LuceGene uses Lucene unchanged, but
adds Lucene class overrides for biology data.

It includes common text search features: booleans, phrases, word
stemming, fuzzy and field range searches, and relevance ranking. Lucene is
comparable to the index/search methods used by web-indexing systems such
as Glimpse, Excite, AltaVista, and Google.
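
As a rough illustration of how these search features look in a query, the
sketch below composes strings in the classic Lucene query-parser syntax
(field:term, quoted phrases, a trailing ~ for fuzzy matching, [low TO high]
for ranges, AND for booleans). The class name and the field names (title,
symbol, year) are hypothetical and are not taken from LuceGene's schema;
exact syntax may also differ between Lucene versions.

```java
// Sketch only: builds query strings in Lucene query-parser syntax.
// Field names used in main() are hypothetical, not LuceGene's actual fields.
final class QuerySketch {
    // field:word matches a single term in one field
    static String term(String field, String word) {
        return field + ":" + word;
    }

    // field:"several words" matches the words as an exact phrase
    static String phrase(String field, String words) {
        return field + ":\"" + words + "\"";
    }

    // field:word~ asks for a fuzzy (approximate spelling) match
    static String fuzzy(String field, String word) {
        return field + ":" + word + "~";
    }

    // field:[low TO high] is an inclusive range over indexed terms
    static String range(String field, String low, String high) {
        return field + ":[" + low + " TO " + high + "]";
    }

    // (left AND right) requires both clauses to match
    static String and(String left, String right) {
        return "(" + left + " AND " + right + ")";
    }

    public static void main(String[] args) {
        String q = and(phrase("title", "heat shock"),
                       and(fuzzy("symbol", "hsp70"),
                           range("year", "1998", "2004")));
        System.out.println(q);
    }
}
```

Run as written, this prints
(title:"heat shock" AND (symbol:hsp70~ AND year:[1998 TO 2004])),
a string that could be handed to a Lucene query parser.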

LuceGene additions include data input adaptors for HTML, XML (e.g.
MedLine), FlyBase flatfiles, and biosequences (GenBank, EMBL, etc.);
basic output formats for XML, HTML via XSLT, text, and spreadsheet; and a
numeric range search primitive (added April 2004).
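
Why a numeric range primitive is useful: Lucene range queries of this era
compare index terms as strings, so "9" sorts after "10". This announcement
does not say how LuceGene's primitive works; the sketch below shows one
common workaround, zero-padding numbers to a fixed width so that string
order matches numeric order. The class name, the width of 10 digits, and
the restriction to non-negative integers are all assumptions made for
illustration.

```java
// Sketch only: zero-padded number encoding for string-ordered range tests.
// This is NOT LuceGene's implementation; WIDTH and the names are assumed.
final class PaddedNumber {
    static final int WIDTH = 10; // enough digits for any non-negative int

    // Encode a non-negative int as a fixed-width, zero-padded string.
    static String encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values not handled in this sketch");
        }
        return String.format("%0" + WIDTH + "d", value);
    }

    // Inclusive range test using only string comparison, as a
    // term-based index would perform it on encoded terms.
    static boolean inRange(String encodedTerm, int low, int high) {
        return encodedTerm.compareTo(encode(low)) >= 0
            && encodedTerm.compareTo(encode(high)) <= 0;
    }

    public static void main(String[] args) {
        System.out.println("9".compareTo("10") > 0);               // raw strings: 9 sorts after 10
        System.out.println(encode(9).compareTo(encode(10)) < 0);   // padded: 9 sorts before 10
        System.out.println(inRange(encode(150), 100, 2000));
    }
}
```

Padding costs a little index space and needs a width chosen in advance,
which is one reason a dedicated numeric primitive can be preferable.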

It is being tested and used to search and retrieve from hundreds of
thousands of data and document objects in the FlyBase and euGenes
collections: genes, references, sequences and XML annotations, Medline
abstracts, and HTML, PDF and text documents.

Public services using LuceGene (Apr 2004)

euGenes multi-organism gene search/retrieval
  http://eugenes.org:7072/search/

Daphnia/wFleaBase search for sequences, Medline abstracts, web documents
  http://eugenes.org:7182/search/

FlyBase Annotated sequence bulk-retrieval service 
  http://flybase.net/cgi-bin/gnoseqbatch

FlyBase Apollo annotation data web service 
  http://flybase.net/apollo/

Requirements

LuceGene requires Java 1.4 or later to compile and run.
The Java Ant build system is supported for compiling sources.
The Jakarta Lucene project library is included with this package, as
are other required Java libraries. Lucene may also be obtained
from http://jakarta.apache.org/lucene/
      
Downloads

Currently these alpha distribution files are available -
 lucegene-1.2-src.jar : sources, documents, configuration for base
lucegene software with indexing methods for biology data
 lucegene.war : binary distribution, for webapp (Tomcat) uses

See the cvs.sourceforge.net repository for gmod/lucegene. 
It is also available as part of the ARGOS genome database
replication system at
  rsync://eugenes.org/argos/common/java/lucegene/ 
  http://eugenes.org:8081/gmod/lucegene/

- Don Gilbert
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd at indiana.edu--http://marmot.bio.indiana.edu/
---




