Suggestions for SRS use w/ nonsequence gene data

Thure Etzold etzold
Thu Aug 17 10:08:52 EST 1995

Hi Don,

most of the changes you suggest will be covered by the new parser which
we are currently integrating into SRS - it will be almost a revolution ;-)
The new parser is called Icarus (Interpreter for Commands And RecUrsive Syntax) 

>Here are some suggestions for SRS that have arisen from trying to use
>it with Drosophila genome data:
>1) indexer interface: needs to permit indexing of any character/symbol set.
>   Drosophila genes use just about the full ASCII printable symbol set
>   (and would use more if possible).

this is no problem for the indices ...any character is allowed ...but
alphabetical characters are converted to lowercase - is that a problem?
to have case conversion simplifies matters greatly but i can make this
optional. Some characters can be a problem for the parser ...i think peter
rice figured out how to specify a '/' ('////'?) 
..tried this with the new parser ...the '/' is ok there

The other problem is that certain characters have a meaning in the SRS query
language, eg, "&|!" are logical operators - i will allow quoting the search
words so that these characters don't interfere with the SRS syntax. 
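
To illustrate the quoting idea, here is a minimal sketch in Python (not SRS or Icarus code). The function name, the use of double quotes as the quoting character, and the lowercasing of search words to match the lowercase index keys are all assumptions made for the example.

```python
import re

# Hypothetical tokenizer, not SRS code: a double-quoted search word is kept
# whole, so '&', '|' and '!' inside it are not read as logical operators.
TOKEN = re.compile(r'"([^"]*)"|([&|!])|([^\s&|!"]+)')

def tokenize(query):
    tokens = []
    for quoted, op, word in TOKEN.findall(query):
        if op:
            tokens.append(("op", op))
        elif quoted or word:
            # Assumption: index keys are lowercase, so fold search words too.
            tokens.append(("word", (quoted or word).lower()))
    return tokens
```

With this sketch an unquoted a&b is read as two words joined by the operator '&', while "a&b" in quotes stays a single search word.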

>   It would be nice to allow also adding data filter functions to the indexer
>   that would convert various computer format data to data suitable for
>   indexing -- e.g., convert special codes like '&bgr;-tubulin' into english 
>   equivalents 'beta-tubulin' for the indexing.

Icarus has an easy interface to call and integrate C functions 
so you can do the following:

/&..;/ <rep s:decode(:$ct)>;

starts with a regular expression for eg '&br;' - the command 'rep'
inside '<' and '>' replaces the current token ($ct - the match of the regular
expression) by the string (argument 's') that is returned by function 'decode'
which receives the current token ($ct) as argument. Decode can be your 
C function which is declared somewhere in the syntax.
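
The effect of such a rule can be sketched in Python (a stand-in for the C function, not Icarus code). The lookup table and the helper name apply_rep are hypothetical; only the '&bgr;' to 'beta' mapping comes from the example above. The pattern used here is deliberately looser than /&..;/, which as written matches exactly two characters between '&' and ';'.

```python
import re

# Hypothetical lookup table; only '&bgr;' -> 'beta' is taken from the mail.
GREEK = {"&bgr;": "beta", "&agr;": "alpha", "&ggr;": "gamma"}

def decode(token):
    """Stand-in for the C function 'decode': map one code to English text."""
    return GREEK.get(token, token)  # unknown codes pass through unchanged

def apply_rep(text):
    # Assumed pattern: '&', one or more letters, ';'. Each match plays the
    # role of the current token ($ct) and is replaced by decode's result.
    return re.sub(r"&[A-Za-z]+;", lambda m: decode(m.group(0)), text)
```

For example, apply_rep("&bgr;-tubulin") gives "beta-tubulin", which is the form one would want in the index.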

>   It would help if data parsing language for indexer was not
>   as difficult to write accurately.  If I don't spend a lot of
>   time testing a new parsing, I can't be confident that the indexer
>   is getting everything it should.  The recent example of missing
>   2,000 of 9,000 entries in the flygene data due to the symbol "\" in 
>   gene names makes this point.

yes that has been a problem ...testing in Icarus will be MUCH easier -
you can insert print statements anywhere or call a trace option - also the syntax
can be put into a single file so that the recompilation of all .sdl files
is not required after every change.

>2) query interface: needs to permit any character/symbol set to be valid
>   data in the query, and query symbols should be configurable. 
>   Use of words instead of symbols as query operators
>   should be optional at least, and by my preference they would be default.
>   E.g., a query like this should be possible:  
>      databank1  fieldA  some/*![-]()=+messy&^%!%@*#string  
>      and
>      databank2  fieldB  another%^#*&@P)!Q(@string
>      but not
>      databank3  fieldC  more#*@(#P*#strings

ok good idea ...of course the quoting of search words should help but,
yes, it could be a nice idea to let people define their own symbols or
words for the operators
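
One way to picture user-defined operator words is a small table that is consulted while reading the query. This is only a sketch in Python: the table, the function name and the mapping of "but not" onto '!' are assumptions, not how SRS does it.

```python
# Hypothetical operator table: a site could map its own words (or symbols)
# onto the logical operators. The pairing of "but not" with '!' is assumed.
OPERATORS = {"and": "&", "or": "|", "but not": "!"}

def normalize(tokens, operators=OPERATORS):
    out, i = [], 0
    # Try longer operator names first so a two-word name is matched whole.
    names = sorted(operators, key=lambda n: -len(n.split()))
    while i < len(tokens):
        for name in names:
            parts = name.split()
            if [t.lower() for t in tokens[i:i + len(parts)]] == parts:
                out.append(("op", operators[name]))
                i += len(parts)
                break
        else:
            # Not an operator word: keep it untouched as a search word.
            out.append(("word", tokens[i]))
            i += 1
    return out
```

Because the query is split on whitespace only, a search word such as messy&^%string is left alone. A search word that happens to equal an operator word would still need quoting.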

>3) output interface:  needs to allow addition of post-processor functions
>   to convert data to various human-usable formats.  This is done now in
>   part for sequence data and for adding html links, but not in a
>   general way that would allow addition output formatting per database
>   w/o rewriting the basic SRS code.

This is where we cracked our heads with Icarus ...we have added the
concept of "tasks" which let you specify things to be done only if 
a certain task is selected ...this works out beautifully for, eg, adding
hypertext links (this means that the file hyperlink.sdl will go!)
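
The idea behind tasks can be modelled roughly as follows, in Python rather than Icarus. The rule list, the task name "hyperlink" and the link format are invented for the example.

```python
# Hypothetical model of "tasks": each rule names the task it belongs to
# (None means "always run") and is skipped unless that task is selected.
def run_rules(rules, text, task=None):
    for rule_task, func in rules:
        if rule_task is None or rule_task == task:
            text = func(text)
    return text

RULES = [
    (None, str.strip),                                        # always applied
    ("hyperlink", lambda s: '<a href="#%s">%s</a>' % (s, s)),  # only on request
]
```

Calling run_rules(RULES, " Tub97EF ") only strips the blanks, while run_rules(RULES, " Tub97EF ", task="hyperlink") also wraps the result in a link, so the same syntax serves both purposes.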

>   Here is roughly how I did it for flybase data, but it is a hack
>   not a general solution.  Example outputs show this formatted output 
>   from iubio server, versus the computer "star code" output from the 
>   sanger server.

here is some Icarus code that converts the first line of your example,

>*a &bgr;Tub97EF

into

Gene symbol                  : betaTub97EF

the syntax for the original line is:

gene-sym <et wrt> = '*a /[^\n]+/;

"et" means that before parsing a token table with the name "gene-sym" is
created. "wrt" means that the entire line is written as a token into that
table. This token is essentially the whole data-field to be used for
printing, extracting indexable keys ...and so on - and you can use it as input
for the conversion:

display-gene-sym <in:gene-sym task:display>
       '*a <p:"Gene symbol          : "> /[^\n ]+/ <p:decode(:$ct)>;

this requests the gene-sym token as input and is executed only if
the task "display" is set - 'p' stands for "print". the regular expression
between the '/' describes the gene symbol itself which is then submitted
to the "decode" function and printed.
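
What the display rule does can be mimicked in a few lines of Python (a sketch, not Icarus). It assumes the regular expression is meant to match everything up to the next blank or newline; the decode stand-in knows only the one code from the example.

```python
import re

def decode(token):
    # Stand-in for 'decode'; only this one mapping is taken from the example.
    return token.replace("&bgr;", "beta")

def display_gene_sym(line, task):
    """Mimic display-gene-sym: act only for task 'display' on a '*a' line."""
    if task != "display":
        return None
    # '*a', a blank, then the gene symbol up to the next blank or newline.
    m = re.match(r"\*a ([^\n ]+)", line)
    if m is None:
        return None
    return "Gene symbol          : " + decode(m.group(1))
```

With task "display" the line "*a &bgr;Tub97EF" comes out as "Gene symbol          : betaTub97EF". With any other task nothing is printed.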

The integration of Icarus will still take some time but we hope to finish
it in ~2-3 months. The problem is of course, as you said, that it is very
tempting to move things that happen now inside C-code into Icarus!


Thure Etzold                                   | EMBL
E-mail: etzold at              | Postfach 10.2209
Tel: (49) 6221 387529                          | 69012 Heidelberg
Fax: (49) 6221 387517                          | Germany
