In article <slrnbo5a5o.ho7.tploetz at korsakov.TechFak.Uni-Bielefeld.DE>,
Thomas Ploetz wrote:
> In developing a sequence classification system, I am looking for some
> data sets for training the system as well as for testing it. Are there
> any widely accepted data sets available for the protein sequence
> classification domain? I know I can create my own by taking the public
> databases of protein sequences and dividing them into disjoint training
> and test sets, but I want to compare my system to other systems, and
> therefore a standard sample set would be better. In other research fields,
> like automatic speech recognition, several standard data sets exist.
What is the PURPOSE of your sequence classification? The tools and
methods that work well for finding DNA regulatory regions may be
completely unsuited for protein fold recognition.

I am somewhat familiar with test sets for fold recognition, but they
tend to be revised frequently. In fact, the use of static benchmarks
has pretty much disappeared from the fold-recognition field, in favor
of continuous blind prediction tests like LiveBench. For static
testing when developing a method, the SCOP database has been popular.
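
For the do-it-yourself route mentioned in the question (dividing sequences
into disjoint training and test sets), here is a rough sketch in Python. It
is only an illustration under stated assumptions, not a standard benchmark
procedure: it assumes each sequence already carries some group label (for
example a fold or superfamily identifier taken from a classification such as
SCOP), and it keeps every group entirely on one side so that closely related
sequences do not end up in both sets. The function names, record layout, and
example data are all made up for the illustration.

```python
import hashlib

def side_for_group(group, test_fraction=0.2):
    """Deterministically map a group label to 'train' or 'test'."""
    digest = hashlib.sha256(group.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64)
    # and compare it against the requested fraction of that range.
    value = int.from_bytes(digest[:8], "big")
    return "test" if value < test_fraction * 2**64 else "train"

def split_by_group(records, test_fraction=0.2):
    """Split (seq_id, group, sequence) records so no group is shared."""
    train, test = [], []
    for record in records:
        _seq_id, group, _sequence = record
        if side_for_group(group, test_fraction) == "test":
            test.append(record)
        else:
            train.append(record)
    return train, test

# Hypothetical records: the IDs, group labels, and sequences are invented.
records = [
    ("seq1", "groupA", "MKTAYIAKQR"),
    ("seq2", "groupA", "MKTAYLAKQR"),
    ("seq3", "groupB", "GSHMLEDPVA"),
    ("seq4", "groupC", "MALWMRLLPL"),
    ("seq5", "groupC", "MALWTRLLPL"),
]
train, test = split_by_group(records, test_fraction=0.5)
```

Because the assignment depends only on the group label, the split is
reproducible across runs, and sequences added later fall on the same side as
the rest of their group. The realized test fraction is only approximate,
especially when there are few groups.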
--
Kevin Karplus karplus at soe.ucsc.edu http://www.soe.ucsc.edu/~karplus
life member (LAB, Adventure Cycling, American Youth Hostels)
Effective Cycling Instructor #218-ck (lapsed)
Professor of Computer Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.