Formatting EMBL full release for blast
rls at ebi.ac.uk
Wed Aug 29 06:52:13 EST 2001
Ok. Into the long dark tunnel....
I use programs in the EMBOSS suite for producing all the database sets we
provide on the external services from EBI. Please visit
http://www.emboss.org/ for more details about the package. The program I
specifically use for this task is called 'seqretall'. Here is a 'bare bones'
outline of what I do to create the emblnew fasta files every day:
foreach x (cum_*.dat)
    set OUTFILE = `basename $x .dat`
    seqretall -filter -osf fasta -outseq $OUTFILE -osdbname EMNEW -auto < $x
    # you can cat them all into one if you wish:
    # seqretall -filter -osf fasta -osdbname EMNEW -auto < $x >> emblnew.fasta
    echo $x done.
end
The above will produce an individual fasta file for each cum_*.dat file. I
don't merge these together, to keep file sizes below the 2Gb limit for one
of our PC farms (size < 2Gb is a characteristic of our cum_*.dat
distribution). So these are OK for our fasta services, where a fasta farm
file is used to define emblnew. To produce the NCBI blast databases, run
formatdb on each individual file and use a .nal file, which gets created at
the same time as the blast databases, for searching. The commands would be:
# check for old .nal file: if emblnew.nal exists, move it aside
if (-e emblnew.nal) then
    mv emblnew.nal emblnew.nal.old
endif
# create the new .nal file
echo "TITLE emblnew" >> emblnew.nal
echo -n "DBLIST " >> emblnew.nal
# watch out for different envs in BSD/SYSV systems to get the above '-n'
# (suppress newline) to work
foreach x (cum*)
    # append each dbname to the .nal file
    echo -n "$x " >> emblnew.nal
    formatdb -i $x -n $x -oT -pF
    # pressdb $x $x
end
echo "" >> emblnew.nal
Your second question: I use the cum*.dat.gz files. It makes life easier for
me, but I do check which ones are new/changed in order to save bandwidth.
Sometimes all the files are new and sometimes they are not. Anyway, do check,
and only build new fasta and blast files for the ones that have actually
changed compared to the previous transfer. At the rate at which new and
updated data is coming to us, all files are new at the moment. If you want to
use the daily or weekly updates, I recommend you have a look at SynCron.
These tools will manage the emblnew database for you and create the
equivalent of the cumulative .dat files (which can get big!). You also save
a lot of bandwidth, if that is one problem you have to take into account.
Here is the short description of SynCron:
The SynCron tools were developed for maintaining synchronised copies of the
database updates. Using these tools it is possible to regenerate the
update file at your site reliably from daily updates. The principle is the
following: for each update file there is a corresponding list file, which
contains a record for each entry in the update file. A record consists of
the entry's AC, ID, Date, Version, and Division. In addition, there is a
field Action that describes the fate of the entry: whether it has to be
created (C), updated (U) or deleted (D). With the use of these lists, it is
possible to 'merge' new daily updates into an existing cumulative file.
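The merge principle described above can be sketched roughly as follows. This
is not the real SynCron code: the file names, the list-file layout
(whitespace-separated AC, ID, Date, Version, Division, Action), the
one-AC-per-entry assumption and the demo data are all made up for
illustration.

```shell
#!/bin/sh
# Tiny made-up demo inputs (EMBL-like flat files, entries end with '//').
cat > cumulative.dat <<'EOF'
ID   OLD1
AC   A00001;
SQ   aaaa
//
ID   OLD2
AC   A00002;
SQ   cccc
//
EOF
cat > daily.dat <<'EOF'
ID   OLD2
AC   A00002;
SQ   gggg
//
EOF
# list file: AC ID Date Version Division Action (U = updated)
printf 'A00002 OLD2 29-AUG-2001 2 HUM U\n' > daily.list

# ACs whose old copies must go: updated (U) or deleted (D) entries
awk '$6 == "U" || $6 == "D" { print $1 }' daily.list > drop.acs

# stream the cumulative file, dropping every entry whose AC is listed
awk 'NR == FNR { drop[$1] = 1; next }          # first file: ACs to drop
     /^ID   /  { buf = ""; skip = 0 }          # start of a new entry
                { buf = buf $0 "\n" }
     /^AC   /  { ac = $2; sub(/;$/, "", ac)
                 if (ac in drop) skip = 1 }
     /^\/\//   { if (!skip) printf "%s", buf } # end of entry: keep or drop
' drop.acs cumulative.dat > cumulative.new

# append the new (C) and updated (U) entries from the daily update file
cat daily.dat >> cumulative.new
mv cumulative.new cumulative.dat
```

After the merge, the old copy of the updated entry is gone and the new copy
from the daily file has taken its place, while untouched entries survive.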
Anyway, the advantages and disadvantages of using cum*.dat or daily/weekly
update files can only be determined if all conditions, such as available
network bandwidth, total disk space and file system file-size limits,
etc. are known.
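To make the "only rebuild what changed" advice above concrete, here is one
possible sketch: keep a checksum manifest from the previous transfer and
compare against it. The manifest name, the demo file and the echoed rebuild
step are placeholders, not what we actually run here:

```shell
#!/bin/sh
# Hypothetical sketch: rebuild only the cum_*.dat.gz files whose checksum
# differs from the previous transfer.
MANIFEST=prev.cksums
touch "$MANIFEST"

printf 'dummy data\n' > cum_demo.dat.gz   # stand-in update file for the demo

for f in cum_*.dat.gz; do
    [ -e "$f" ] || continue                      # glob matched nothing
    new=`cksum "$f" | awk '{print $1, $2}'`      # POSIX checksum and size
    old=`awk -v f="$f" '$3 == f {print $1, $2}' "$MANIFEST"`
    if [ "$new" != "$old" ]; then
        echo "rebuilding $f"    # here: gunzip, seqretall to fasta, formatdb
    fi
done

# refresh the manifest for the next transfer
for f in cum_*.dat.gz; do
    [ -e "$f" ] || continue
    sum=`cksum "$f" | awk '{print $1, $2}'`
    echo "$sum $f"
done > "$MANIFEST"
```

On the first run everything is "new" and gets rebuilt; on later runs only
files whose checksum differs from the saved manifest trigger a rebuild.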
I guess that was long, but to answer your question: don't merge fasta files
unless you have a method to reliably get rid of deleted entries and entries
which have been updated. The cum*.dat files are ready-made in this sense, and
SynCron will do this for you at your site if you are using daily/weekly
updates.
Hope this helps,
"Bent Nagstrup Terp" <terp at kisac.cgr.ki.se> wrote in message
news:c38f022a.0108280427.2aec2edc at posting.google.com...
> rls at ebi.ac.uk (Rodrigo Lopez) wrote in message
news:<91037A5F4rlsebiacuk at newscache.ebi.ac.uk>...
> > There are quite a number of different ways of achieving this but it all
> Yeah, I've noticed that :-)
> Basically, I want to create and update a local database for NCBI's
> blast-suite - easier said than done....
> > Information regarding how to use NCBI's formatdb to create multivolume
> OK, I think I'm with you so far. I can convert embl's .dat's to fasta
> (using BioPerl - please feel free to suggest a better way of doing
> it....), run formatdb to get blastable files and build .nal's to tie
> the volumes together.
> Next problem: how do I apply the updates? I see 2 types of updates:
> cumulative updates (i.e. cum_htg1.dat.gz) that only contains updates
> for some of the .dat's, and daily updates (i.e. r67u069.dat.gz). These
> I can also convert into fasta format, but how do I distribute the
> updated records across the correct blast db's?
> > Hope this helps,
> It did, thanks! But I'm not through the long, dark tunnel, yet :-)
More information about the Embl-db mailing list