Looking for some test data
Bruce W. Watson
watson at wsinpi01.win.tue.nl
Wed Mar 9 09:08:42 EST 1994
I've recently finished implementing a pattern matching toolkit, all of whose
algorithms have been proven correct. I now need some input test data to measure
the performance of some of the algorithms. I've already tested them on English
input, and I now want to test them on genetic information. As I understand it,
the input is a long string over the alphabet {a, c, t, g}, and the keywords to
search for are shorter strings over the same alphabet.
What I want to know is:
- How long is a typical input string?
- How long is a typical keyword?
- In general, do you search for all occurrences of a keyword in the input, or
just the first occurrence?
- Do you search for occurrences of one of a set of keywords, or just one keyword
in the input string?
- If you search with sets of keywords, are all of the keywords in the set of
similar lengths, or of widely differing lengths?
- If you search with sets of keywords, how many keywords are in the set?
- What algorithm do you presently use in your software? (Knuth-Morris-Pratt,
Boyer-Moore, Aho-Corasick, Commentz-Walter, etc.)
- Can you provide me with some test data?
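To make the questions above concrete, here is a minimal Python sketch of the
kind of test I have in mind. It is not taken from the toolkit itself: the names
random_dna and find_all, the input length of 10,000 and the keyword lengths of
8 and 20 are all placeholder assumptions, and the uniformly random input is
only a stand-in until real sequence data is available. The matcher is a naive
scan that reports every occurrence (overlaps included) of each keyword in a
set, useful only as a reference to check faster algorithms against.

```python
import random

ALPHABET = "acgt"  # assumed alphabet, as described in the post


def random_dna(length, seed=0):
    """Return a uniformly random string over ALPHABET (synthetic stand-in data)."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def find_all(text, keywords):
    """Return sorted (position, keyword) pairs for every occurrence of every
    keyword in text, including overlapping occurrences. Naive reference scan."""
    hits = []
    for kw in keywords:
        start = text.find(kw)
        while start != -1:
            hits.append((start, kw))
            # Advance by one so overlapping matches are also reported.
            start = text.find(kw, start + 1)
    return sorted(hits)


# Hypothetical sizes; real values are exactly what the questions above ask for.
text = random_dna(10_000)
keywords = [text[100:108], text[500:520]]  # keywords guaranteed to occur
hits = find_all(text, keywords)
```

Real genetic data is unlikely to be uniformly random, so timings on such input
may not carry over; that is why actual test data would be much appreciated.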
Thanks for all responses,
Bruce Watson || favourite oxymoron: "-- rather, it simply
watson at win.tue.nl || complicates our implementation." from
watson at stack.urc.tue.nl || C++ Primer, 2nd ed. (p.501) by S. Lippman