Looking for some test data

Bruce W. Watson watson at wsinpi01.win.tue.nl
Wed Mar 9 09:08:42 EST 1994

Hi all,
   I've recently finished implementing a pattern-matching toolkit, all of whose
algorithms have been proven correct. I now need some input test data to measure
their performance. I've tested them on English text, and I now want to test
them on genetic data. From what I understand, the input is one long string
over the alphabet {a,c,t,g}, and the keywords to search for are shorter
strings over the same alphabet.
   What I want to know is:
- How long is a typical input string?
- How long is a typical keyword?
- In general, do you search for all occurrences of a keyword in the input, or
  just the first occurrence?
- Do you search for occurrences of any keyword from a set of keywords, or for
  just a single keyword in the input string?
- If you search with sets of keywords, are all of the keywords in the set of
  similar lengths, or of widely differing lengths?
- If you search with sets of keywords, how many keywords are in the set?
- What algorithm do you presently use in your software? (Knuth-Morris-Pratt,
  Boyer-Moore, Aho-Corasick, Commentz-Walter, etc.)
- Can you provide me with some test data?
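
To make the questions above concrete, here is a minimal Python sketch of the
search problem they describe: one input string over a,c,g,t, a set of keywords,
and a choice between reporting all occurrences or only the first. It is an
illustration only and not part of the toolkit. The names random_sequence,
find_all and find_first are hypothetical, the uniformly random input is an
assumed stand-in for real genetic data, and the naive scan stands in for the
algorithms named above (it is not Knuth-Morris-Pratt, Boyer-Moore,
Aho-Corasick or Commentz-Walter).

```python
import random

ALPHABET = "acgt"  # the alphabet named in the post


def random_sequence(length, seed=0):
    """Hypothetical stand-in for real input: a uniformly random string."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def find_all(text, keywords):
    """Naive multi-keyword search.

    Returns a sorted list of (position, keyword) pairs, one for every
    occurrence of every keyword, including overlapping occurrences.
    """
    hits = []
    for kw in keywords:
        if not kw:
            continue  # skip empty keywords; they would match everywhere
        start = text.find(kw)
        while start != -1:
            hits.append((start, kw))
            start = text.find(kw, start + 1)  # step by 1 to keep overlaps
    return sorted(hits)


def find_first(text, keywords):
    """Return the leftmost (position, keyword) match, or None if no match."""
    hits = find_all(text, keywords)
    return hits[0] if hits else None


if __name__ == "__main__":
    text = random_sequence(1000)
    keywords = ["acg", "ttga", "gg"]  # assumed example keywords
    print(len(find_all(text, keywords)), find_first(text, keywords))
```

The answers to the questions above would fix the parameters this sketch leaves
open: the input length, the keyword lengths, the size of the keyword set, and
whether find_all or find_first reflects the usual use.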

Thanks for all responses,

Bruce Watson                     || favourite oxymoron: "-- rather, it simply 
watson at win.tue.nl                ||   complicates our implementation." from  
watson at stack.urc.tue.nl          || C++ Primer, 2nd ed. (p.501) by S. Lippman 
