[Genbank-bb] Human ESTs

Cavanaugh, Mark (NIH/NLM/NCBI) [E] via genbankb At net.bio.net (by cavanaug At ncbi.nlm.nih.gov)
Fri Jan 12 16:41:27 EST 2007


One potential approach would be to use the BCP-style dumps
of the NCBI dbEST database, which can be found at:

	ftp.ncbi.nih.gov
	repository/dbEST

The approach might be:

- download the full bcp dump for the 'library' table

- identify all libs for organism "Homo sapiens"

- use the resulting id_libs to screen the full bcp dump
  of the 'est' table

  if you find that an EST's id_lib matches one of the
  Homo sapiens id_libs, then that one should be kept/processed

- from the Homo sapiens ESTs, use their id_est keys to identify
  the sequences that are of interest among the sequence.full.* files
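As a rough illustration of the first two screening steps, here is a minimal Python sketch. It is not taken from the dbEST README: the field delimiter (tab) and the column positions of id_lib, organism and id_est are assumptions made for the example, and the function names are hypothetical. Check the README for the real table layouts before relying on any of this.

```python
def read_bcp(path, delimiter="\t"):
    """Yield each row of a dump file as a list of string fields.

    The tab delimiter is an assumption; use whatever the README documents.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\r\n").split(delimiter)


def human_lib_ids(library_rows, id_lib_col=0, organism_col=1,
                  organism="Homo sapiens"):
    """Collect the id_lib of every 'library' row for the given organism.

    Column positions are assumed, not taken from the real schema.
    """
    return {
        row[id_lib_col]
        for row in library_rows
        if len(row) > max(id_lib_col, organism_col)
        and row[organism_col].strip() == organism
    }


def human_est_ids(est_rows, lib_ids, id_est_col=0, id_lib_col=1):
    """Yield the id_est of every 'est' row whose id_lib is in lib_ids."""
    for row in est_rows:
        if len(row) > max(id_est_col, id_lib_col) and row[id_lib_col] in lib_ids:
            yield row[id_est_col]
```

With the dumps downloaded locally, the two functions chain together as set(human_est_ids(read_bcp(est_path), human_lib_ids(read_bcp(library_path)))), where est_path and library_path stand for wherever you saved the files. The 'est' dump is streamed row by row, so only the id sets need to fit in memory.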

And then apply a similar approach to the daily 'delete'
and 'insert' BCP dumps.
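One way the daily updates might be folded into an existing set of human id_est keys is sketched below in Python. Several things here are assumptions rather than documented behaviour: that rows have already been split into lists of fields, that each 'delete' row carries the id_est being removed, that 'insert' rows carry id_est and id_lib in the positions shown, and that deletes should be applied before inserts. The function name is hypothetical.

```python
def apply_daily_update(human_ests, lib_ids, insert_rows, delete_rows,
                       id_est_col=0, id_lib_col=1, delete_id_col=0):
    """Update a set of human id_est keys in place from one day's dumps.

    human_ests : set of id_est keys already identified as human
    lib_ids    : set of id_lib keys for Homo sapiens libraries
    Column positions are assumed, not taken from the real schema.
    """
    # Drop anything listed in the day's 'delete' dump; ids that were
    # never human are simply ignored by discard().
    for row in delete_rows:
        if len(row) > delete_id_col:
            human_ests.discard(row[delete_id_col])

    # Keep only newly inserted ESTs that belong to a human library.
    for row in insert_rows:
        if len(row) > max(id_est_col, id_lib_col) and row[id_lib_col] in lib_ids:
            human_ests.add(row[id_est_col])

    return human_ests
```

New libraries can presumably appear over time as well, so the set of Homo sapiens id_libs would need to be refreshed before each day's ESTs are screened.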

There's a README in the directory that might be a good 
place to start.

Note that there are nearly 8 million human ESTs in dbEST.

Fairly big job... So if you're going to do it, then
you might want to consider just building a complete local
copy of dbEST, if you have the resources for it.

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS



>-----Original Message-----
>From: Seth Johnson [mailto:johnson.biotech At gmail.com] 
>Sent: Monday, December 18, 2006 11:00 AM
>To: genbankb At magpie.bio.indiana.edu
>Subject: [Genbank-bb] Human ESTs
>
>Hi all,
> 
>We are creating a local database of human sequences for a
>high-throughput pipeline.  So, I have a question regarding the 
>availability of sequences by organism.  Is there some way I 
>can get, for example, just Homo sapiens ESTs without parsing 
>the hundreds of files that comprise an EST GenBank release? 
> 
>
>-- 
>Best Regards,
>
>
>Seth Johnson
>Senior Bioinformatics Associate
>
>Fx: (775) 251-0358 
>


