Degenerate/regex pattern Database Searching Software

Harry Mangalam mangalam at uci.edu
Mon Feb 17 20:25:19 EST 1997


Hi All,

   Before I spend more than a few hours coding this, does anyone know
if it exists?

To wit: a program (any platform) that will search files or databases
for very degenerate sequences (with errors) and will then do proximity
matching on the hits that are returned.  So perhaps I want to search
GenBank or another database for 3 patterns that are essentially regular
expressions (WITH a certain number of errors allowed), i.e.:

pattern a = cyrr{4,7}gcnt{,7}gat        (1 error allowed in any position)
pattern b = tgga{2,5}gyrtg              (no errors allowed)
pattern c = gtn{4,7}<tgagt>t{1,4}gc     (1 error allowed outside of <>)

where the {m,M} notation follows the regex rules (m = minimum number,
M = maximum number of repeats), as in programs like the grep family,
and <> encloses a subpattern that must be matched with no errors.
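
As a rough illustration of the notation above, here is a minimal Python
sketch that turns such a pattern into an ordinary regular expression for
EXACT matching only (no errors allowed). The names to_regex and
find_hits are made up for this example, the nucleotide letters are
assumed to be the standard IUPAC ambiguity codes, and the <> markers are
simply dropped because they only matter once errors are permitted.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "s": "[cg]", "w": "[at]",
    "k": "[gt]", "m": "[ac]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]",
}


def to_regex(pattern):
    """Translate a degenerate pattern into a regex string (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "{":
            # Copy a {m,M} repeat count through unchanged.
            j = pattern.index("}", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch in "<>":
            # <> only marks error-free regions; irrelevant for exact matching.
            i += 1
        else:
            out.append(IUPAC[ch.lower()])
            i += 1
    return "".join(out)


def find_hits(pattern, sequence):
    """Return (start, end) for each non-overlapping exact match."""
    rx = re.compile(to_regex(pattern), re.IGNORECASE)
    return [(m.start(), m.end()) for m in rx.finditer(sequence)]
```

For example, to_regex("tgga{2,5}gyrtg") gives "tgga{2,5}g[ct][ag]tg".
Python's re module reads {,7} as "0 to 7 repeats", so pattern a can be
passed through as written.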

Further, I want the hits to be reported (graphically, if possible) only
if:

pattern A is < 3000 bases from pattern B
pattern B is > 2000 bases from pattern C
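
The two distance conditions above could be checked with something like
the following Python sketch. It assumes each hit is a (start, end)
tuple of positions on the same sequence and that "distance" means the
absolute difference between start positions; both are assumptions made
for illustration, as are the function names.

```python
from itertools import product


def distance(hit1, hit2):
    """Distance between two hits, taken here as the gap between their starts."""
    return abs(hit1[0] - hit2[0])


def proximity_filter(hits_a, hits_b, hits_c, max_ab=3000, min_bc=2000):
    """Return (a, b, c) triples with A < max_ab from B and B > min_bc from C."""
    results = []
    for a, b in product(hits_a, hits_b):
        if distance(a, b) >= max_ab:
            continue  # A is too far from B
        for c in hits_c:
            if distance(b, c) > min_bc:
                results.append((a, b, c))
    return results
```

This brute-force version compares every combination of hits, which is
fine for a handful of hits per sequence; sorting the hit lists by
position first would be the obvious refinement for larger numbers.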

I know of Wu and Manber's amazing/approximate agrep, which does most of
the above, but it lacks the variable spacing and is mostly structured
for line searching, although you can define larger records.  Jim Knight
included a 'stripped down' version of it (grepseq, optimized for
biosequences) in his very nice 'seqio' package, but it doesn't do much
of the proximity matching, doesn't handle regexes, and has some odd
compiled-in restrictions on the number of patterns that it will return
(although these are easily changed and recompiled).
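
One way to get a limited form of error tolerance out of a plain regex
engine is sketched below in Python. This is NOT how agrep or grepseq
work; it is a simple expansion trick. It handles substitutions only (no
insertions or deletions), allows at most one mismatch, and only relaxes
single unrepeated bases outside <>; bases carrying a {m,M} repeat are
left exact. All names here are invented for the example.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "s": "[cg]", "w": "[at]",
    "k": "[gt]", "m": "[ac]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]",
}
ANY = "[acgt]"


def tokenize(pattern):
    """Split a pattern into (regex_class, quantifier, protected) tuples."""
    tokens = []
    protected = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "<":
            protected = True
            i += 1
        elif ch == ">":
            protected = False
            i += 1
        elif ch == "{":
            # Attach the {m,M} repeat to the preceding base.
            j = pattern.index("}", i)
            cls, _, prot = tokens[-1]
            tokens[-1] = (cls, pattern[i:j + 1], prot)
            i = j + 1
        else:
            tokens.append((IUPAC[ch.lower()], "", protected))
            i += 1
    return tokens


def one_mismatch_regex(pattern):
    """Build an alternation: the exact pattern, plus one variant per
    relaxable position with that base replaced by 'any nucleotide'."""
    tokens = tokenize(pattern)
    variants = ["".join(cls + q for cls, q, _ in tokens)]
    for k, (cls, q, prot) in enumerate(tokens):
        if prot or q or cls == ANY:
            continue  # protected, repeated, or already fully degenerate
        relaxed = tokens[:k] + [(ANY, "", prot)] + tokens[k + 1:]
        variants.append("".join(c + qq for c, qq, _ in relaxed))
    return "(?:" + "|".join(variants) + ")"
```

The resulting string can be compiled with re.compile(...,
re.IGNORECASE) and used like any other regex. The alternation grows
linearly with pattern length, so this is only practical for short
motifs like the ones above.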

There's also the 'findpatterns' program that's part of GCG, but grepseq
seems to be far superior to it, even in its unadulterated form.

Will Entrez do this?  I haven't checked lately, although I certainly will.

What else am I missing?  There are some reasonably straightforward ways
of combining some of the functions in agrep and grepseq, along with
writing the glue code that will handle the combinatorial stuff, but if
someone has already done this, I'd rather know now :)

Also, if this isn't available already (freely or not), is anyone (else)
interested in this sort of thing?

Cheers,
Harry



