Formatting EMBL full release for blast
rls at ebi.ac.uk
Mon Aug 20 06:12:22 EST 2001
There are a number of different ways of achieving this; which one suits
you depends on what you want to do. First of all you have to identify which
blast you are using. There are two: NCBI-BLAST and WU-BLAST. The former
uses a program called formatdb, as Keith pointed out. The latter uses
(depending on which version you use) pressdb or xdformat. With either of
these blast distros you don't need one large blast database for EMBL.
You create an individual blast database for each of the files in the
release and create alias or 'farm' files to search groups of these.
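Either way, each release file first has to become its own FASTA file. If you would rather not install a conversion tool, here is a minimal sketch in Python of what that step amounts to. It assumes the usual EMBL flat-file layout (ID, DE and SQ line codes, entries terminated by //) and keeps only the entry name and description in the FASTA header. The function names and the humN file names are made up for illustration.

```python
import gzip


def embl_to_fasta(lines, width=60):
    """Yield FASTA lines (without newlines) for every entry in an EMBL flat file."""
    name, description, residues, in_sequence = None, [], [], False
    for line in lines:
        if line.startswith("//"):
            # End of entry: emit the header, then the sequence wrapped to `width`.
            if name is not None:
                yield (">" + name + " " + " ".join(description)).strip()
                sequence = "".join(residues)
                for start in range(0, len(sequence), width):
                    yield sequence[start:start + width]
            name, description, residues, in_sequence = None, [], [], False
        elif in_sequence:
            # Sequence lines carry blocks of bases plus a trailing position
            # count; keep the letters only.
            residues.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            # The first token after the line code is the entry name.
            name = line[2:].split()[0].rstrip(";")
        elif line.startswith("DE"):
            description.append(line[2:].strip())
        elif line.startswith("SQ"):
            in_sequence = True


def convert_file(dat_path, fasta_path):
    """Convert one release file (plain or gzipped) into one FASTA file."""
    opener = gzip.open if dat_path.endswith(".gz") else open
    with opener(dat_path, "rt", encoding="latin-1") as src, \
            open(fasta_path, "w") as dst:
        for fasta_line in embl_to_fasta(src):
            dst.write(fasta_line + "\n")
```

Calling convert_file("hum1.dat.gz", "hum1.fasta") once per release file (hypothetical file names) gives one FASTA file per volume, which is what the per-file formatting needs.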
Information regarding how to use NCBI's formatdb to create multivolume
databanks can be found at:
There you will see how to create a .nal (for nucleic acid db's) and .pal
(for protein sequence db's) file to suit your needs wrt EMBL. These 'farm'
files are what you need. The entire human division of EMBL currently
comprises 8 files. Create one fasta file for each of them (not one
large fasta file for all of them!). Then run formatdb on each of these and
call them hum1, hum2, hum3 and so on using the -n parameter of
formatdb. Then create a human.nal file which will contain:
# HUMAN EMBL blast DB
DBLIST hum1 hum2 hum3 hum4 hum5 hum6 hum7 hum8
In order to use these w/ blast you would type:
% blastall -p blastn -d human -i myseq.na ....
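To save typing the same thing eight times, the formatdb runs and the alias file can be scripted. This is only a sketch: it assumes the FASTA files are called hum1.fasta to hum8.fasta (my naming, not anything the release dictates), that formatdb is on your PATH, and it passes -p F because these are nucleotide sequences. The function names are made up for illustration.

```python
import subprocess


def formatdb_command(fasta_path, name):
    """Argument list for formatting one nucleotide FASTA file as database `name`."""
    # -i is the input file, -p F marks it as nucleotide, -n sets the base name.
    return ["formatdb", "-i", fasta_path, "-p", "F", "-n", name]


def alias_file_text(title, volumes):
    """Contents of a .nal alias file that groups the given volumes."""
    return "# %s\nDBLIST %s\n" % (title, " ".join(volumes))


def build_human_db(volumes, alias_path="human.nal"):
    """Run formatdb once per volume, then write the alias file."""
    for volume in volumes:
        # Assumed naming: hum1.fasta becomes database hum1, and so on.
        subprocess.run(formatdb_command(volume + ".fasta", volume), check=True)
    with open(alias_path, "w") as handle:
        handle.write(alias_file_text("HUMAN EMBL blast DB", volumes))


# The eight volume names used in this post: hum1 ... hum8.
volumes = ["hum%d" % i for i in range(1, 9)]
```

Calling build_human_db(volumes) in the directory holding the FASTA files should leave you with the eight databases plus human.nal, after which the blastall line above applies unchanged.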
In the case of WU-BLAST (version 2.x): this version supports virtual
databanks. These can be referred to as groups from the command line. Say
you create one databank for each of the 8 humX.dat files, via 8 individual
fasta files (created with, for example, EMBOSS's seqretall,
see: http://www.emboss.org/ - please note that there are also blast
formatting utilities in EMBOSS such as dbiblast for producing
WU-BLAST/NCBI-BLAST style indices :-)). You can then refer to these using
WU-BLAST in the following way:
% blastn "hum1 hum2 hum3 hum4 ..." myseq.na ....
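If you drive the search from a script rather than the shell, the point to get right is that the list of databanks is a single argument (that is all the double quotes do on the command line). A small sketch, with a made-up helper name, and with the program and query file left as parameters:

```python
def wu_blast_command(program, databases, query_path, options=()):
    """Argument list for one WU-BLAST search over several databanks."""
    # All databank names go into ONE argument, separated by spaces.
    return [program, " ".join(databases), query_path] + list(options)
```

For the human division, which holds nucleotide sequences, that would be something like wu_blast_command("blastn", ["hum1", "hum2", "hum3"], "myseq.na"), and the resulting list can be handed to subprocess.run as it stands.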
Hope this helps,
krb at sanger.ac.uk (Keith Bradnam) wrote in
<Pine.OSF.4.21.0108201008350.20284-100000 at caldy.sanger.ac.uk>:
>On 16 Aug 2001, Bent Nagstrup Terp wrote:
>> Could anybody please tell me how I get from having downloaded all the
>> .dat.gz's in the full release plus the cumulative update, to having a
>> "blastable" database?
>First you need to convert your sequences from EMBL format into a format
>suitable for creating BLAST databases. E.g. FASTA format.
>When you have all your sequences in one file, you can use the formatdb
>program (which comes with BLAST) to convert them into a BLAST
>database. But if you are planning to create one BLAST database containing
>everything in EMBL then this will be very, very, big.
>~ Keith Bradnam - WormBase group: http://wormbase.sanger.ac.uk/
>~ The Sanger Centre, Wellcome Trust Genome Campus
>~ Hinxton, Cambridge, CB10 1SA, UK. Tel (01223) 497516