[Protein-analysis] Re: pdb-l: About PDB Files and Secondary Structures

Kevin Karplus via proteins%40net.bio.net (by karplus from soe.ucsc.edu)
Wed Apr 23 12:20:59 EST 2008



Narges Habibi wrote:

> I'm doing a project on "Protein Contact Map Prediction" and I use some
> features for neural network's input, including Secondary Structure of a
> given Amino Acid. There are several ways:
> 
> 1- getting dssp file for each pdb file (from ftp server)
> 2- extracting from pdb file (The HELIX and SHEET and TURN section)
> 3- getting ss file from www.pdb.org (as I see the given sequences in this
> file don't match with the pdb files, why?)
> 
> What do you suggest? What method is more accurate?

None of the above.

Predicting contact maps using known structure is cheating.  You should
be predicting the local structure, not extracting it from known
structures.  Any way that data from known structures can creep into
your inputs invalidates your testing, and makes it impossible to say
with confidence that your method does anything useful.  Given the
rather low quality of contact prediction at the current state of the
art, even small amounts of information from the real structure can
make a big difference.

The following paper by my student is a pretty good summary of the
best method as of CASP7---improvements since then have been modest:

George Shackelford and Kevin Karplus.
Contact Prediction using Mutual Information and Neural Nets.
Proteins: Structure, Function, and Bioinformatics,
69(S8):159-164, 2007. (CASP7 special issue).
doi:10.1002/prot.21791

I see a lot of "prediction" work that is complete garbage, because the
authors fooled themselves by using data that could only come from
knowing the real structures.  The even more common problem is
insufficient separation of train and test sets, in which computer
scientists assume that the random partition of a data set is all that
is needed, but the data sets we have aren't independent samples, so
one has to go to some effort to ensure that the test set does not
contain examples that are very close to training set examples.
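
As a minimal sketch of what such a separation could look like (not the
procedure from the paper above), the Python below assigns whole clusters
of related chains to either the training or the test set, so that no
cluster straddles both.  It assumes that each chain already carries a
cluster label from some sequence- or structure-based clustering done
beforehand; the function name, chain names, and cluster labels are all
hypothetical.

```python
import random
from collections import defaultdict


def split_by_cluster(chain_to_cluster, test_fraction=0.2, seed=0):
    """Split chains into train/test so that no cluster appears in both.

    chain_to_cluster maps a chain identifier to a cluster label.  The
    labels are assumed to come from an earlier clustering step that is
    not shown here.
    """
    # Group chains by their cluster label.
    clusters = defaultdict(list)
    for chain, cluster in chain_to_cluster.items():
        clusters[cluster].append(chain)

    # Shuffle the clusters (not the chains) reproducibly.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    # Move whole clusters into the test set until it is large enough;
    # everything else goes to the training set.
    target = test_fraction * len(chain_to_cluster)
    train, test = [], []
    for cid in cluster_ids:
        if len(test) < target:
            test.extend(clusters[cid])
        else:
            train.extend(clusters[cid])
    return sorted(train), sorted(test)


# Hypothetical chains and cluster labels, for illustration only.
chain_to_cluster = {
    "chainA": "fam1", "chainB": "fam1", "chainC": "fam2",
    "chainD": "fam3", "chainE": "fam3", "chainF": "fam4",
}
train, test = split_by_cluster(chain_to_cluster, test_fraction=0.3)
```

The split is only as good as the clustering behind it: if the cluster
labels miss a relationship between two chains, those chains can still
end up on opposite sides of the partition.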

------------------------------------------------------------
Kevin Karplus 	karplus from soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
Professor of Biomolecular Engineering, University of California, Santa Cruz
Undergraduate Director, Bioinformatics
(Senior member, IEEE)	(Board of Directors & Chair of Education Committee, ISCB)
Affiliations for identification only.
