Perl utility to extract entries from multiple-fasta?

Will Fischer wfischer at sunflower.bio.indiana.edu
Mon Feb 24 18:58:15 EST 1997


Brian Osborne (bosborne at NATURE.BERKELEY.EDU) wrote:
: 
: I can remember that someone, somewhere wrote a program in
: Perl to extract individual sequences from multiple fasta files. Does
: anyone know where I can find this work?

Here's a script, "splitfasta", to do it. It writes each entry to its own
file in the current directory, named after the sequence.

------------------------ cut here ----------------------------------------
#!/usr/local/bin/perl
eval 'exec /usr/local/bin/perl -S $0 ${1+"$@"}'
      if $running_under_some_shell;

#
# split a fasta file into separate sequence files
#

undef $/;    # no record separator: entire input is read in one slurp

$seqs = <>;  # read input, assigning to single string

while ($seqs =~ m/^(>[^>]+)/mg) { # match indiv. sequences by '>'s
	push(@seqs,$1);           # and store in array
}

for (@seqs) {
	# only allow characters A-Z,a-z,0-9,'_','-', and '.' in names;
	# change if you're more liberal
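	# (testing the match directly keeps a stale name from an earlier
	# record from being reused when the current header doesn't match)
	if (/^> *([\w\-\.]+)/) {
		$seq_name = $1;
		open(OUTFILE,">$seq_name")
			or die "can't write $seq_name: $!";
		print OUTFILE $_;
		close(OUTFILE);
	}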
	else {
		warn "couldn't recognise the sequence name in \n$_";
	}
}

__END__
------------------------ cut here ----------------------------------------
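
For inputs too large to slurp comfortably, the same splitting can be done
one record at a time by setting $/ to "\n>", so that each read stops just
before the next header line. What follows is only a sketch under stated
assumptions, not part of splitfasta: the subroutine name split_fasta_stream
and its output-directory argument are hypothetical, and the naming rule is
the one used in the script above. Splitting on "\n>" rather than on any '>'
should also tolerate a '>' inside a description line.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of splitfasta): read FASTA records one at a
# time from filehandle $in, write each to its own file in directory $dir,
# and return the list of sequence names written.
sub split_fasta_stream {
    my ($in, $dir) = @_;
    local $/ = "\n>";              # a record ends just before the next header
    my @written;

    while (my $rec = <$in>) {
        chomp $rec;                # drop the trailing "\n>" separator, if any
        $rec =~ s/^>//;            # the first record still has its leading '>'
        $rec =~ s/\s+\z//;         # drop trailing whitespace and newlines
        next unless length $rec;   # skip empty records

        # same rule as above: only A-Z, a-z, 0-9, '_', '-' and '.' in names
        if ($rec =~ /^ *([\w\-.]+)/) {
            my $name = $1;
            open(my $out, '>', "$dir/$name")
                or die "can't write $dir/$name: $!";
            print $out ">$rec\n";
            close($out) or die "can't close $dir/$name: $!";
            push @written, $name;
        }
        else {
            warn "couldn't recognise the sequence name in\n>$rec\n";
        }
    }
    return @written;
}
```

To split standard input into the current directory, one could end the
script with a call such as split_fasta_stream(\*STDIN, '.');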
____________________________________________________________
Will Fischer			

Biology Department             		wfischer at indiana.edu
Jordan Hall                   		http://www.bio.indiana.edu/~wfischer
Indiana University            		Lab:    812-855-2549
Bloomington, Indiana 47405 USA		FAX:    812-855-6705



