improvement suggestions

Don Gilbert gilbertd at bio.indiana.edu
Sat Jul 26 13:23:59 EST 1997



Suggestions for future versions of SRS

I think SRS is great software for many genome informatics needs.  

We are now using it as the main search engine in the FlyBase project 
(http://flybase.bio.indiana.edu/, see e.g., the Genes section searches).

Based on using SRS for Genbank and related sequence data at IUBio, and for 
a wide variety of Drosophila data, I have several suggestions.

These are in rough order of importance to me, and possibly to others.
Maybe some of these are possible now and I haven't looked hard enough.
If not, I will try to add some of them and pass code on to Thure 
and colleagues.

 -- Sequence output format
    -- default should always be the native format, untouched by SRS (the current
      genbank output is a bogus format that can't be interpreted well: it is
      missing the ORIGIN line, and the sequence data is in EMBL rather than
      GENBANK style; maybe part of this is due to icarus indexing mistakes).
    -- offer GENBANK and PIR/CODATA output formats as primary standard
      sequence formats
  
 -- Query symbol neutrality
    The symbols that SRS now requires in queries for operations and parsing
    clash with symbols used in the layers above it (unix and http command
    strings) and below it (biological data).  Especially because of the
    latter, it is difficult to use escape characters to do the kinds of
    queries needed.
    
    There should be query-time switches for getz, wgetz and such that let the
    caller set symbols used for query parsing, including &|![]={}-.  
    At the least offer query-time symbol swapping, so that any single parsing
    symbol can be replaced by another with the same meaning.  The high ascii set would
    make a good option.  But it would also be nice to allow strings, such
    as _AND_ for &, _OR_ for |, _OPEN_PHRASE_ for [, _CLOSE_PHRASE_ for ],
    etc. in queries.
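
    As a rough illustration of the string-alias idea, here is a minimal Python
    sketch (not SRS code) of a pre-processing step that a query driver could
    run before handing the query to the parser.  The alias table holds only
    the four strings named above; the function name and the "genes-sym"
    library/field name in the usage line are made up for the example.

```python
import re

# Alias strings taken from the suggestion above; a real switch would let
# the caller supply this table at query time.
ALIASES = {
    "_AND_": "&",
    "_OR_": "|",
    "_OPEN_PHRASE_": "[",
    "_CLOSE_PHRASE_": "]",
}

# Try longer aliases first so no alias is shadowed by a shorter one.
_ALIAS_RE = re.compile(
    "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
)


def expand_aliases(query):
    """Replace each alias string with the parsing symbol it stands for."""
    return _ALIAS_RE.sub(lambda m: ALIASES[m.group(0)], query)


if __name__ == "__main__":
    # "genes-sym" is a hypothetical library-field name.
    print(expand_aliases(
        "_OPEN_PHRASE_genes-sym:w_CLOSE_PHRASE_ _AND_ "
        "_OPEN_PHRASE_genes-sym:y_CLOSE_PHRASE_"
    ))
```

    A plain substitution like this would still misfire if an alias string
    happened to occur inside the data itself, which is one reason to let the
    caller choose the aliases.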
    
 -- Case sensitive searches 
    This should be available as a query-time user choice for any field.
    Perhaps there should be an index-time switch that marks a field as
    potentially case sensitive, if case-sensitive matching turns out to be
    too compute expensive at query time.
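
    One way to picture the index-time switch, as a minimal Python sketch
    rather than anything SRS does today: every field keeps a case-folded
    index, and only fields flagged at index time also keep an exact-case
    index.  The class, its method names, and the symbols "abc"/"Abc" are
    hypothetical.

```python
from collections import defaultdict


class FieldIndex:
    """Toy word index for one field with an optional exact-case index."""

    def __init__(self, case_sensitive_capable=False):
        # Index-time switch: keep exact-case entries only when asked.
        self.case_sensitive_capable = case_sensitive_capable
        self.folded = defaultdict(set)  # lower-cased word -> record ids
        self.exact = defaultdict(set)   # word as written -> record ids

    def add(self, record_id, word):
        self.folded[word.lower()].add(record_id)
        if self.case_sensitive_capable:
            self.exact[word].add(record_id)

    def lookup(self, word, case_sensitive=False):
        """Query-time choice between exact-case and case-folded matching."""
        if case_sensitive:
            if not self.case_sensitive_capable:
                raise ValueError("field was not indexed for case-sensitive search")
            return set(self.exact.get(word, ()))
        return set(self.folded.get(word.lower(), ()))


if __name__ == "__main__":
    idx = FieldIndex(case_sensitive_capable=True)
    idx.add("rec1", "abc")
    idx.add("rec2", "Abc")
    print(sorted(idx.lookup("abc")))
    print(sorted(idx.lookup("Abc", case_sensitive=True)))
```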
 
 -- Index numeric ranges
    For example, a map range such as "123-456" should be indexed so it
    can be queried as a numeric range. Queries such as 124, 234, or 345 should 
    all match such a range. Several ranges per field must be possible.  
    In WAIS, we just stored the text string of such a field, and did a numeric
    range test at query time.
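
    The WAIS-style approach (store the text, test numerically at query time)
    can be sketched in a few lines of Python.  This is an illustration under
    assumptions, not the WAIS code: it assumes ranges are written as
    non-negative integers joined by "-", and the function names are made up.

```python
import re

# One "low-high" pair of non-negative integers; a field may hold several.
_RANGE_RE = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_ranges(text):
    """Extract every low-high pair from a field's stored text."""
    pairs = []
    for a, b in _RANGE_RE.findall(text):
        lo, hi = sorted((int(a), int(b)))  # tolerate reversed ranges
        pairs.append((lo, hi))
    return pairs


def matches(text, value):
    """True if value falls inside any range stored in the field text."""
    return any(lo <= value <= hi for lo, hi in parse_ranges(text))


if __name__ == "__main__":
    for query in (124, 234, 345, 500):
        print(query, matches("123-456", query))
```

    Testing at query time keeps the index simple but scans each candidate
    field; indexing the ranges themselves would trade index size for speed.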
    
 -- Cache query results and use them for quick lookups of next-page data.
    wgetz, and other srs query drivers, offer a page of results for a given
    query, plus additional page links.  These additional page links redo
    the same query at a sometimes large cpu cost.  It would be nice to have
    the full match set for each query cached (for maybe an hour, in SRSTMP:)
    and used to serve multipage requests of the same query.
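
    A minimal Python sketch of the caching idea follows.  It is only an
    illustration: it keeps match sets in memory instead of in files under
    SRSTMP:, and the class name, the run_query callback, and the injectable
    clock are all assumptions made for the example.

```python
import time


class QueryCache:
    """Cache the full match set per query and serve result pages from it."""

    def __init__(self, ttl_seconds=3600.0, clock=time.time):
        self.ttl = ttl_seconds   # "maybe an hour"
        self.clock = clock       # injectable so expiry can be tested
        self._store = {}         # query string -> (expiry time, record ids)

    def get_page(self, query, page, page_size, run_query):
        """Return one page of results, running the query only on a miss."""
        now = self.clock()
        entry = self._store.get(query)
        if entry is None or entry[0] <= now:
            # Miss or expired: pay the full query cost once.
            entry = (now + self.ttl, list(run_query(query)))
            self._store[query] = entry
        start = page * page_size
        return entry[1][start:start + page_size]


if __name__ == "__main__":
    def slow_query(q):
        print("running query:", q)
        return range(10)

    cache = QueryCache()
    print(cache.get_page("example", 0, 4, slow_query))
    print(cache.get_page("example", 1, 4, slow_query))  # served from cache
```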
    
 -- Relevance ranking 
    Allow fields to store word counts per record in indexes, 
    and use these counts for one form of relevance calculation.  Relevance
    ranking can markedly improve the usability of query results by sorting
    the records with the most query words (or however relevance is defined)
    to the top of the results list.  Relevance ranking has been standard in
    WAIS and related text indexing.
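
    To make the word-count form of relevance concrete, here is a minimal
    Python sketch.  It is not the WAIS algorithm, just the simplest scoring
    that fits the description: a record's score is the total number of
    occurrences of the query words.  The function names and sample records
    are invented.

```python
from collections import Counter


def build_index(records):
    """Map record id -> Counter of lower-cased words in that record."""
    return {rid: Counter(text.lower().split()) for rid, text in records.items()}


def rank(index, query):
    """Return ids of matching records, highest word-count score first."""
    words = query.lower().split()
    scores = {}
    for rid, counts in index.items():
        score = sum(counts[w] for w in words)  # Counter gives 0 for absent words
        if score > 0:
            scores[rid] = score
    # Sort by descending score, then by id for a stable order.
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


if __name__ == "__main__":
    index = build_index({
        "r1": "eye color eye",
        "r2": "wing shape",
        "r3": "eye wing eye eye",
    })
    print(rank(index, "eye wing"))
```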

 -- Lists of words to ignore in indexing
    Use lists/files of common words to ignore at indexing (a, and, the, ...).  
    Let the icarus parsing script read such a list from a common file and
    apply it when building indices for any particular field.  Maybe we 
    can do this now in the rich icarus; if so an example would be nice. 
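
    The behaviour asked for can be sketched in Python (this is not icarus,
    whose syntax for it I have not confirmed).  The sketch assumes a stop
    list with one word per line and whitespace-separated field text; the
    function names are hypothetical.

```python
def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def index_terms(text, stop_words):
    """Return the lower-cased words of a field that should be indexed."""
    return [w for w in text.lower().split() if w not in stop_words]


if __name__ == "__main__":
    # In practice the lines would come from a shared file, e.g. open(path).
    stop_words = load_stop_words(["a\n", "and\n", "the\n"])
    print(index_terms("The wing and the eye", stop_words))
```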
    
--
-- d.gilbert--biocomputing--indiana u--bloomington--gilbertd at bio.indiana.edu



