complete cross validation question

Yu Wai Chen ywc at mrc-lmb.cam.ac.uk
Fri Nov 5 08:20:10 EST 1999


Dear all,

I would like to ask some questions about how one should perform a
complete cross-validation (C.V.).

I have a data set of only 2500 reflections and have to refine a ~400-atom
model.  I have been using 20% of the data (about 500 reflections) for C.V.
so that the statistics are more meaningful, and my refinement has gone OK
so far.  Now I am at a later stage where I intend to use all my data so
that I can refine individual B-factors.  I have partitioned my data set
into 10 non-overlapping C.V. sets, each omitting 10% of the reflections.
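
To make the partitioning concrete, here is a minimal Python sketch of one
way to split 2500 reflections into 10 non-overlapping test sets.  It is
only an illustration: the function name, the random seed and the use of
plain integer indices in place of real reflection records are all
assumptions, and it is not how X-PLOR itself assigns test-set flags.

```python
import random

def partition_test_sets(n_reflections, n_sets=10, seed=0):
    """Assign every reflection index to exactly one of n_sets test sets.

    Hypothetical helper: reflections are represented only by their
    integer index (0 .. n_reflections - 1).
    """
    indices = list(range(n_reflections))
    # Shuffle with a fixed seed so the partition is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin, so the sets are disjoint
    # and their sizes differ by at most one.
    return [sorted(indices[i::n_sets]) for i in range(n_sets)]

# 2500 reflections in 10 sets gives 250 test reflections per set.
test_sets = partition_test_sets(2500, n_sets=10)
```

Each refinement run would then omit one of these sets (10% of the data)
and work against the remaining 90%.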

How should I actually carry on with, say, S.A.?  I mean, if one has only
one C.V. data set, one would run S.A. with several trials and take the
model from the trial with the lowest Rfree.  Now if I do S.A. (say 5 trials
on each C.V. set) for the 10 C.V. sets, I get 50 S.A. results.  I
suppose I should use all 50 Rfree values to estimate the mean Rfree and
its s.d.?  But which model do I then pick for further refinement?  Shall
I still pick the one with the lowest Rfree?
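
For the bookkeeping part of this question, a minimal Python sketch of
pooling the Rfree values might look as follows.  The function name and
the dictionary layout (test-set label mapped to the list of Rfree values
from its S.A. trials) are assumptions made for illustration, and the
pooled s.d. treats all values as independent, which is itself an
assumption.  The sketch does not settle which model to pick.

```python
from statistics import mean, stdev

def summarize_rfree(rfree_by_set):
    """Pool Rfree values from all S.A. trials over all test sets.

    rfree_by_set: hypothetical mapping of test-set label -> list of
    Rfree values, one per S.A. trial run against that test set.
    """
    all_values = [r for trials in rfree_by_set.values() for r in trials]
    return {
        "n": len(all_values),
        "mean": mean(all_values),
        # Sample standard deviation (divides by n - 1).
        "sd": stdev(all_values),
        # Lowest Rfree within each test set, kept separately because
        # values from different test sets refer to different reflections.
        "per_set_best": {label: min(trials)
                         for label, trials in rfree_by_set.items()},
    }
```

With 10 test sets and 5 trials each, "n" would be 50 and "per_set_best"
would hold 10 entries.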

I am contemplating another approach: to switch off C.V. at this
stage and use all the data for refinement, and then compute an a
posteriori Rfree with a final cycle of S.A. when refinement is finished.

Please comment.

-- 
===================================================================
Yu Wai CHEN, Ph.D. ..................   email:ywc at mrc-lmb.cam.ac.uk
 Centre for Protein Engineering,             tel:+44-(0)1223-402148
 MRC Centre, Hills Rd, Cambridge CB2 2QH, UK fax:+44-(0)1223-402140
 WWW homepage: http://www.mrc-cpe.cam.ac.uk/~ywc



