seeking BLAST to fasta or similar
Todd Richmond
todd at andrew2.stanford.edu
Thu Aug 17 10:15:32 EST 2000
In article <8nfnh4$d89$1 at mirv.comms.unsw.edu.au>, "Neil Saunders"
<neil.saunders at unsw.edu.au> wrote:
>What I'd really like is a facility that would parse a BLAST file,
>extract the top sequences from the database and give them in Fasta format.
>Does anyone know of such a thing?
You can simplify your life somewhat by using Batch Entrez
(http://www.ncbi.nlm.nih.gov:80/Entrez/batch.html) to retrieve all of
the sequences at once from a list of accessions/GIs, in Fasta format.
To extract the GIs automatically, I check the "Use NCBI GIs" box on the
NCBI Blast server, save the Blast file in text format and run it through
the Perl script below. It reads the E-value of each hit in the summary
table at the top of the report and extracts the GI of any hit with an
E-value of 1e-5 or lower. The GIs are written, one per line, to a file
named after the Blast file with a .GIs extension. It's a crude hack, so
it may not work for every Blast report. If you don't have too many
hits, you can also just use a text editor to create the list of GIs to
send to Batch Entrez.
#!perl -w
use File::Basename;
use File::Path;

open(INPUT,$ARGV[0]) || die "Can't open the Blast file $ARGV[0]: $! \n";
($name,$path,undef) = fileparse($ARGV[0]);
# truncate the name a bit to add an extension
$name = substr($name,0,20);
$outfile = ">" . $path . $name . ".GIs";
open(OUTPUT,$outfile) || die "Can't open an output file $outfile: $! \n";

while(<INPUT>)
{
    # find the summary table lines and pull out the relevant values:
    # $1 = GI, $2 = accession, $3 = E-value
    if(m/^gi\|(\d{4,8})\|\w+\|([A-Z]+\d{4,8})\.\d+\|.*\.{3}\s{1,5}\d{1,5}\s\s(.*)/ ){
        $id = $2;
        $gi = $1;
        $value = $3;
        # Blast prints very small E-values without a mantissa (e.g. e-100),
        # which Perl can't read as a number, so match those by pattern
        if ($value =~ m/e-1\d\d/ || $value <= 1e-05) {
            $accessions{$id} += 1;
            $gi{$id} = $gi;
        }
    }
}

# write one GI per line to the output file
select OUTPUT;
foreach $number (sort { $accessions{$b} <=> $accessions{$a} }
                 keys(%accessions)) {
    $gi{$number} =~ s/g//;
    print $gi{$number}, "\n";
}
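
If you want the E-value cutoff as a separate, testable step, here is a
minimal sketch of one way to do it. It is not part of the script above:
the helper names (evalue_to_number, gi_if_significant) and the two
sample summary lines are made up for illustration, and the pattern is a
looser one than the script uses. It assumes the E-value is the last
whitespace-separated field on the line.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a Blast E-value string into a number.
# Very small values may be printed without a mantissa ("e-100"),
# so prepend "1" before Perl converts the string.
sub evalue_to_number {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;                  # trim surrounding whitespace
    $text = "1$text" if $text =~ /^e-\d+$/i;  # "e-100" becomes "1e-100"
    return $text + 0;
}

# Hypothetical helper: return the GI from one summary line if its
# E-value is at or below the cutoff, otherwise undef.
# Assumption: the E-value is the last field on the line.
sub gi_if_significant {
    my ($line, $cutoff) = @_;
    return unless $line =~ /^gi\|(\d+)\|.*\s(\S+)\s*$/;
    my ($gi, $evalue) = ($1, $2);
    return evalue_to_number($evalue) <= $cutoff ? $gi : undef;
}

# Made-up summary lines, for illustration only.
my @lines = (
    "gi|1234567|gb|AAA00001.1| example protein one...   372  e-102",
    "gi|2345678|gb|AAA00002.1| example protein two...    45  3e-04",
);

for my $line (@lines) {
    my $gi = gi_if_significant($line, 1e-5);
    print "$gi\n" if defined $gi;
}
```

With the two sample lines, only the first hit passes the 1e-5 cutoff,
so only its GI is printed. Converting the value once up front means the
cutoff is a single numeric comparison rather than a pattern match plus
a comparison.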
--
Dr. Todd Richmond
Carnegie Institution of Washington
260 Panama Street Email: todd at andrew2.stanford.edu
Stanford, CA 94305 Homepage: http://cellwall.stanford.edu/todd