Consed, sequence assemble

andy law at
Fri Oct 15 03:51:09 EST 1999

In article <7u5hg0$g1l$1 at>, "Mei" <hmpeng at> wrote:

> I am interested in assembling some of the EST sequences that I have downloaded
> from Entrez.  So far, I have been using the “csplit” command in Unix, then a Perl
> script to rename the files, and finally a shell script to generate fake phd
> files for Consed.  This approach works well if I have fewer than 100
> sequences, because csplit only splits into up to 99 files.  I’d like to know how
> to split and rename the FASTA file according to the gi numbers in the
> definition lines when I have a large number of sequences to assemble.  A hint
> on how to write a Perl script for this purpose would be greatly appreciated.

The following *should* do what I *think* you need. No guarantees
whatsoever though.


Andy Law


#!/usr/bin/perl -w
# Andy Law 15th October 1999
# Do with this what you will. No restrictions. If you can make money from it,
# then good luck to you

use strict;

# Grab all the input (from STDIN) and strip any end of line characters
# Exit if nothing was supplied
# Die if the first line doesn't begin with a '>'
my (@lines) = <>; chomp(@lines);
exit 0 unless scalar (@lines);
die "First line doesn't begin with a '>'" unless $lines[0] =~ /^>/;

# For each line in turn, look for a leading '>'
# If there is one, strip out the first characters in that line
# after the >, stopping just before the first space, |, ;, : or /
# character. This is the file name.
# Open a file with that name for writing into.
# Note that this will overwrite previous versions with the same name
# (as identified by our method above). You could get smart here by looking
# for the existence of the file and adding a counter until you found an
# 'empty slot'.
# Write the contents of the line into the file we just opened
my $line;
foreach $line (@lines) {
    if ($line =~ /^>/) {
        my $seqname = $line;
        $seqname =~ s/^> *([^ |;:\/]+).*/$1/;
        # Re-opening OUTFILE implicitly closes the previous output file
        open OUTFILE, ">$seqname" or die "Can't open file '$seqname' for writing: $!\n";
    }
    print OUTFILE $line, "\n";
}
close OUTFILE;
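
One caveat with the naming rule above: on Entrez-style definition lines of the
form ">gi|NUMBER|db|ACCESSION| description", the name stops at the first '|',
so every record would be written to a file called "gi". Below is a minimal
sketch of a variant that names each file after the gi number instead. It
assumes that header layout; the helper names name_from_defline and split_fasta
are made up for illustration, and using the bare number as the file name is
just one possible choice. Lines without a gi number fall back to the original
first-token rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: derive an output file name from a FASTA definition line.
# Assumes NCBI-style headers such as ">gi|6123456|gb|AW000001.1| ..." (the
# numbers here are invented). Falls back to the first token after '>' when
# there is no gi number, and returns undef when nothing usable is found.
sub name_from_defline {
    my ($defline) = @_;
    return $1 if $defline =~ /^>\s*gi\|(\d+)/;
    return $1 if $defline =~ /^>\s*([^\s|;:\/]+)/;
    return;
}

# Split one multi-FASTA file into one output file per record.
# An existing file with the same name is overwritten.
sub split_fasta {
    my ($path) = @_;
    open my $in, '<', $path or die "Can't open '$path' for reading: $!\n";
    my $out;
    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /^>/) {
            my $name = name_from_defline($line)
                or die "No usable name in definition line: $line\n";
            close $out if $out;
            open $out, '>', $name
                or die "Can't open '$name' for writing: $!\n";
        }
        die "Sequence data before the first '>' line in '$path'\n"
            unless $out;
        print {$out} $line, "\n";
    }
    close $out if $out;
    close $in;
}

# Does nothing unless file names are given on the command line.
split_fasta($_) for @ARGV;
```

Saved as, say, splitgi.pl (a made-up name), it would be run as
"perl splitgi.pl all_ests.fasta" in the directory where the per-sequence
files should end up. Because it reads line by line rather than slurping the
whole input, there is no fixed limit on the number of sequences.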
