seeking BLAST to fasta or similar

Todd Richmond todd at andrew2.stanford.edu
Thu Aug 17 10:15:32 EST 2000


In article <8nfnh4$d89$1 at mirv.comms.unsw.edu.au>, "Neil Saunders" 
<neil.saunders at unsw.edu.au> wrote:

>What I'd really like is a facility that would parse a BLAST file,
>extract the top sequences from the database and give them in Fasta format.
>Does anyone know of such a thing?

You can simplify your life somewhat by using Batch Entrez 
(http://www.ncbi.nlm.nih.gov:80/Entrez/batch.html) to retrieve all of 
the sequences at once from a list of accessions/GIs, in Fasta format.  

To extract the GIs automatically, I check the "Use NCBI GIs" box on the 
NCBI Blast server, save the Blast report in text format and run it through 
the Perl script below. It reads the E-value of each hit in the summary 
table at the top of the report and extracts the GI of any hit with an 
E-value of 1e-5 or less. The GIs are written, one per line, to a file 
named after the input file (truncated to 20 characters) with a .GIs 
extension. It's a crude hack, so it may not work for every Blast report. 
If you don't have too many hits, you can also just use a text editor to 
create the list of GIs to send to Batch Entrez.

#!perl -w
use File::Basename;
use File::Path;

open(INPUT,$ARGV[0]) || die "Can't open the Blast file $ARGV[0]: $! \n";
($name,$path,undef) = fileparse($ARGV[0]);
# truncate the name a bit to add an extension
$name = substr($name,0,20);

$outfile = ">" . $path . $name . ".GIs";
open(OUTPUT,$outfile) || die "Can't open an output file $outfile: $! \n";

while(<INPUT>)
{
   # find the summary table lines and pull out the relevant values
   if (m/^gi\|(\d{4,8})\|\w+\|([A-Z]+\d{4,8})\.\d+\|   # GI, db tag, accession.version
         .*\.{3}\s{1,5}\d{1,5}\s\s(.*)/x) {            # "...", score, then E-value
      $id = $2;
      $gi = $1;
      $value = $3;
      # Perl doesn't treat a string like "e-100" (no leading digit) as a
      # number, so match that form separately
      if ($value =~ m/e-1\d\d/ || $value <= 1e-05) {
         $accessions{$id} += 1;
         $gi{$id} = $gi;
      }
   }
}

select OUTPUT;
foreach $number (sort { $accessions{$b} <=> $accessions{$a} }
                 keys(%accessions)) {
   $gi{$number} =~ s/g//;
   print $gi{$number}, "\n";
}
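
One way to avoid special-casing the e-1xx values in the comparison would 
be to normalise the E-value string before testing it. This is only a 
minimal sketch, assuming the missing leading digit is the only oddity in 
the E-value column; the normalize_evalue name is made up for illustration 
and is not part of the script above.

```perl
#!perl -w
use strict;

# Hypothetical helper: turn a Blast E-value string into a number.
# Assumes the only problem is a missing mantissa, as in "e-100".
sub normalize_evalue {
    my ($value) = @_;
    $value =~ s/^\s+|\s+$//g;                        # trim surrounding whitespace
    $value = "1$value" if $value =~ /^e[-+]?\d+$/i;  # "e-100" becomes "1e-100"
    return $value + 0;                               # force numeric context
}

# Apply the same 1e-05 cutoff as the script above to a few sample strings
for my $raw ("e-100", "2e-30", "0.0", " 0.48") {
    my $verdict = normalize_evalue($raw) <= 1e-05 ? "keep" : "skip";
    print "$raw\t$verdict\n";
}
```

With a helper like this, the test in the main loop could shrink to a 
single numeric comparison against the cutoff.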

-- 
Dr. Todd Richmond
Carnegie Institution of Washington
260 Panama Street                   Email: todd at andrew2.stanford.edu
Stanford, CA 94305                 Homepage: http://cellwall.stanford.edu/todd





