datasets for sequence classification

Kevin Karplus karplus at bray.cse.ucsc.edu
Sat Oct 25 18:29:18 EST 2003


In article <slrnbo5a5o.ho7.tploetz at korsakov.TechFak.Uni-Bielefeld.DE>, 
	Thomas Ploetz wrote:
> developing a sequence classification system I am looking for some
> data sets for training the system as well as for testing it. Are there
> any widely accepted data sets available for the protein sequence classification
> domain? I know I can create my own using all the public databases of
> protein sequences and dividing them into disjoint training and test sets,
> but I want to compare my system to different systems and therefore
> a standard sample set would be better. In other research fields, like
> automatic speech recognition, several standard data sets exist.

What is the PURPOSE of your sequence classification?  The tools and
methods that work well for finding DNA regulatory regions may be
completely unsuited for protein fold recognition.  

I am somewhat familiar with test sets for fold recognition, but they
tend to be revised frequently.  In fact, the use of static benchmarks
has pretty much disappeared from the fold-recognition field, in favor
of continuous blind prediction tests like LiveBench.  For static
testing when developing a method, the SCOP database has been popular.


-- 
Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
life member (LAB, Adventure Cycling, American Youth Hostels)
Effective Cycling Instructor #218-ck (lapsed)
Professor of Computer Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.



