[Bio-software] Text indexing and retrieval from Genbank files
okane at cs.uni.edu
Sun Sep 11 10:48:17 EST 2005
Demo and source code (GPL license) are at:
MARBL (Mumps Analysis and Retrieval from Bioinformatics Libraries)
See O'Kane, K.C.; & Lockner, M.J, Indexing genomic sequence libraries,
Information Processing & Management, Volume 41, Issue 2 , March 2005,
Kevin C. O'Kane
University of Northern Iowa
Cedar Falls, IA 50614-0507
okane at cs.uni.edu
Overview and Background
This system, MARBL (Mumps Analysis and Retrieval from Bioinformatics Libraries),
is an implementation and toolkit to integrate multiple, very large, genomic,
data bases into a unified data repository through open-source components and
provide fast, web-based access to the results. This system differs from existing
genomic text retrieval systems in that:
1. it derives the index terms from the text rather than manual assignment;
2. it is fully built upon open-source components;
3. it is coded in an extended implementation language that supports multi-dimensional data bases;
4. it can index in a matter of hours the largest genomic data sets.
While natural language text indexing can be approached from the perspective of assignment to
pre-existing categories and hierarchies such as the National Library of Medicine MeSH (Medical
Subject Headings) (Hazel, 1997), derivative indexing is better able to adapt to changes in a
rapidly evolving discipline as the terms are dynamically extracted directly from the source
material rather than waiting for manual analysis. Existing keyword based genomic retrieval
systems are primarily based on assignment indexing whereas the approach taken here is based
on derivative indexing, where both queries and documents are encoded into a common intermediate
representation and metrics are developed to calculate the coefficients of similarity between
queries and documents. Documents are ranked according to their computed similarity to the
query and presented to the user in rank order. Several systems employing this and related
models have been implemented such as Smart (Salton, 1968, 1971, 1983, 1988), Instruct
(Wade, 1988), Cansearch (Pollitt, 1987) and Plexus (Vickery, 1987a, a987b). More recently,
these approaches have been used to index Internet web pages and provide collaborative
filtered recommendations regarding similar texts to book buyers at Amazon.com (Linden, 2003).
In this system, genomic accessions are represented by vectors that reflect accession content
through descriptors derived from the source text by analysis of word usage (Salton,1968, 1971,
1983, 1988; Willett, 1985; Crouch, 1988). This approach can be further enhanced identifying
clusters of similar documents (El-Hamdouchi et al.,1988, 1989). Likewise, term-term co-occurrence
matrices can be developed that can be used to identify similar or related terms and these
can be automatically included into queries to enhance recall or to identify term clusters.
Other techniques based on terms and queries have also been explored (Salton, 1988; Williams, 1983).
The vector model is rooted in the construction of document vectors that consisting of the
weights of each term in each document. Taken collectively, the document vectors constitute a
document-term matrix whose rows are document vectors. A document-term matrix can have millions
of rows, more than 22 million in GenBank case, and thousands of columns (terms), more than
500,000 in GenBank. This yields a matrix with potentially trillions of possible elements which
must be quickly addressable not by numeric indices but also by text keys. Additionally, to
enhance retrieval speed, an inverted matrix of the same size is needed which doubles the storage
requirements. Fortunately, however, both matrices are very sparse.
Evaluation of retrieval effectiveness from a data set of this size is clearly difficult as there
are few benchmarks against which to compare the results. However, NCBI distributes a file of
keyword phrases with GenBank gbkey.idx (502,549,211 bytes in 2004). This file contains submission
author assigned keyword phrases and associated accession identifiers. Of the 48,023 unique keys
in gbkey.idx (after removal of special characters and words less than three characters in length),
26,814 keys were the same as the keys selected by MARBL. The 21,209 keys that differed were, for
the most part, words of very high or low frequency that the system rejected due to preset thresholds.
Alternatively, the MARBL system identified and retained a highly specific 501,614 terms,
many of which were specific codes used to identify genes.
We compared MARBL with BLAST by entering the nucleotide sequence of a Bacillus anthrtacis
bacteriophage that was of interest to a local researcher. BLAST retrieved 24 accessions.
The highest scoring accession was the correct answer, while the remainder were noise. When
we entered the phrase anthracis & bacteriophage to the MARBL retrieval package, only one
accession was retrieved, the same one that received the highest score from BLAST. BLAST
took 29 seconds, MARBL retrieval took 10 seconds. It should be noted, however, that BLAST
searches are not based on keywords but on genomic sequences.
Copies of all software in source code form are available from: http://www.cs.uni.edu/~okane.
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res., 25,
American National Standards Institute, Inc. (1995). ANSI/MDC X11.4.1995 Information Systems . Programming Languages - M,
American National Standards Institute, 11 West 42nd Street, New York, New York 10036.
Barker, W.C., et. al. (1999). The PIR-International Sequence Database, Nucleic Acids Research, 27(1) 39-43.
Barnett, G.O. & Greenes, R.A. (1970). High level programming languages, Computers and Biomedical Research, 3, 488.497.
Bowie, J & and Barnett, G. O. (1976). MUMPS . an economical and efficient time.sharing language for information management,
Computer Programs in Biomedicine, 6, 11.21.
Crouch, C.J. (1988). An analysis of approximate versus exact discrimination values, Information processing and Management,
El.Hamdouchi, A.; & Willet, P. (1988). An improved algorithm for the calculation of exact term discrimination values,
Information Processing and Management, 24(1) 17.22.
El.Hamdouchi, A.; and Willet, P. (1989). Comparison of hierarchic and agglomerative clustering methods for document
retrieval, The Computer Journal, 32(3) 220.227.
Hazel, P. (1997), Perl Compatible Regular Expression Library, Cambridge: University of Cambridge Computing Service.
Linden, G., & Smith, B. (2003). Amazon.com Recommendations: Item-to-Item Collaborative Filtering, IEEE Distributed Systems
O'Kane, K.C. (1980). An RT.11 single user standard MUMPS interpreter, MUMPS Users' Group Quarterly, 10, 5.6.
O'Kane, K.C. (1983). A portable hybrid MUMPS development system host, Proc IEEE Computer Society 7th International Computer
Software Applications Conference, 60.65.
O'Kane, K.C. (1992). A language for implementing information retrieval software, Online Review, 16(3) 127-137.
O'Kane, K.C. (1999). An M Compiler for Internet server applications, M Computing, pp 11-17, 7(1) 11-17.
Pearson, W. R. (2000). Flexible sequence similarity searching with the FASTA3 program package Methods Mol.Biol., 132, 185-219
Pollitt, S. (1987). Cansearch: an expert systems approach to document retrieval, Information Processing and Management,
Salton, G. (1968). Automatic Information Organization and Retrieval, New York: McGraw.Hill Book Company.
Salton, Gerard; and McGill, Michael J.; Introduction to Modern Information Retrieval, McGraw.Hill Book Company (New York,
Salton, G. (1988). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Reading
Salton, G. (editor) (1971). The SMART Retrieval System: Experiments in Automatic Document Processing, Englewood Cliffs, New
Shpaer, E. G., Robinson, M, et al. (1996). Sensitivity and Selectivity in Protein Similarity Searches: Comparison of
Smith-Waterman in Hardware, Genomics, 38, 179-191.
Sleepycat Software, Inc. (2001). Berkeley DB, Indianapolis, IN: New Riders, ISBN 0-7357-1064-3
Smith, T. F., & Waterman, M.S. (1981) Identification of common molecular subsequence, J. Mol. Biol., 147(1), 195-197.
Thure, E. & Argos, P. (1993a). SRS an indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci., 9,
Thure, E. & Argos, P. (1993b). Transforming a set of biological flat file libraries to a fast access network. Appl. Biosci.,
U.S. National Library of Medicine (2003), 2003 MeSH, Annotated Alphabetic List, 8600 Rockville Pike, Bethesda, MD 20894, NTIS
Order number: PB2003-96480. (http://www.nlm.nih.gov/mesh/meshhome.html)
Vickery, A.; and Brooks, H.M. (1987a). PLEXUS: an expert system for referral, Information Processing and Management, 23,
Vickery, A.; & Brooks, H.M. (1987b). A reference referral system using expert system techniques, Journal of Documentation,
Wade, S.J.; and Willett, P. (1988). INSTRUCT: a teaching package for experimental methods in information retrieval. Part III.
Browsing, clustering and query expansion, Program, 22, 44.61.
Willett, P. (1985). An algorithm for the classification of exact term discrimination values, Information Processing and
Management, 21(3), 225.232.
Williams, J.H. (1963). A discriminant method for automatically classifying documents, Proc 1963 Fall Joint Computer
Wu, C.H., et. al. (2003). The Protein Information Resource, Nucleic Acids Res., 31(1), 345-347.
More information about the Bio-soft