P.G. Korning, S.M. Hebsgaard, P. Rouze and S. Brunak built a database
called Araclean for use in training their gene predictor.  They checked
that all the sequences in the database are correct and "real".

Cleaning the GenBank Arabidopsis thaliana data set,
  P.G. Korning, S.M. Hebsgaard, P. Rouze and S. Brunak,
  Nucl. Acids Res., 24, 316-320, 1996.
It is on the web at

but it is not very current.

Curt Palm

>       I'm trying to assemble a dataset of KNOWN coding sequence for Arab
>  proteins.  I need it for a class project.  I've pulled out all the ESTs
>  from GenBank for Arab but I am unsure of the quality of the sequence and
>  many of them have Ns in them.  Much of the sequence in the nrdb is for
>  putative/similar/hypothetical proteins.  Is there somewhere I can download
>  this kind of dataset?
>  						Thanks, Paul
