PDB -> SwissPROT conversion?

Tim Cutts tjrc1 at mole.bio.cam.ac.uk
Fri Jun 5 08:28:38 EST 1998

In article <6l7588$j4n$1 at netnews.upenn.edu>,
Al Wang <alwang at blue.seas.upenn.edu> wrote:
>Does anyone know what's a good way to take a Brookhaven PDB file, extract 
>the sequence information, and save it in SwissPROT format?  If it can 
>also extract the secondary structure features, even better.  

I can't help exactly, but if it gets you half way there, here is my
perl script that extracts the sequence information from a PDB file and
creates a FASTA format sequence file from it:


# pdb2fasta (C) T J R Cutts 1998

# This script extracts the SEQRES records from a PDB file, and writes out
# separate FASTA format sequence files, one for each strand in the PDB
# file.  The title line of the FASTA file is constructed from the COMPND
# records of the PDB file, and the strand letter.

	     'A' => 'A',
	     'C' => 'C',
	     'G' => 'G',
	     'T' => 'T',
	     'U' => 'U',
	     'ALA' => 'A',
	     'ASX' => 'B',
	     'CYS' => 'C',
	     'ASP' => 'D',
	     'GLU' => 'E',
	     'PHE' => 'F',
	     'GLY' => 'G',
	     'HIS' => 'H',
	     'ILE' => 'I',
	     'LYS' => 'K',
	     'LEU' => 'L',
	     'MET' => 'M',
	     'ASN' => 'N',
	     'PRO' => 'P',
	     'GLN' => 'Q',
	     'ARG' => 'R',
	     'SER' => 'S',
	     'THR' => 'T',
	     'VAL' => 'V',
	     'TRP' => 'W',
	     'XXX' => 'X',
	     'TYR' => 'Y',
	     'GLX' => 'Z',
	     '...' => '.',
	     'END' => '*' 

die "Usage: pdb2fasta <pdbfile> <outfile>\nA separate output file will be created for each strand in the PDB file.\n"
    unless ($#ARGV==1);


open(P, "<$ARGV[0]") || die "Could not open $ARGV[0]";

# Put the file, stripped of its carriage returns, in an array.
foreach (<P>)
    push(@pdb, $_);


%strand = ();

# Extract the SEQRES and COMPND records
@seqres=grep(/^SEQRES/, @pdb);
@cmpnd=grep(/^COMPND/, @pdb);

# Create a suitable FASTA title line from the COMPND records
$cmpnd='>'.join(' ', @cmpnd);
$cmpnd =~ s/COMPND//g;
$cmpnd =~ s/\s+/ /g;

foreach $line (@seqres)
    # Split the SEQRES line from positions 19-70 into an array of residues
    @residues = split(/\s+/, substr($line, 19, 51));
    # For each of these, uppercase the residue, look up its single letter
    # equivalent in the hash at the top of this script, and then append
    # that single letter to this strand's variable for output.
    foreach $res (@residues)
	$res = uc($res);
	$strand{substr($line, 11, 1)} .= $lookup{$res};

# For each strand discovered, create a FASTA format file printed nicely
# with 60 residues per line.

foreach $key (keys %strand)
    $filename = ($key eq " ") ? "${ARGV[1]}.tfa" : "${ARGV[1]}_$key.tfa";
    open (O, ">$filename") ||
	die "Could not open $filename for writing";

    print O "$cmpnd - strand $key\n";
    for ($n = 0 ; $n < length($strand{$key}); $n+=60)
	print O substr($strand{$key}, $n, 60)."\n";

    close O;

# That's it!

Dr T J R Cutts                                        Tel: +44 1223 333596
Dept. of Biochemistry, 80 Tennis Court Rd.
Cambridge, CB2 1GA, UK

More information about the Bio-soft mailing list