Iddo Friedberg wrote:
> Does anybody know of a good, robust program, for UNIX (mine's an IRIX
> 5.2) which will accept a PDB
> file as input, and for the output deliver for each amino acid it's PDB
> position number. Including such vagaries as negative numbering (some
> files have them), and non-numerical positions (as in serine
> proteinases, where there are such things as 65A).
I suggest turning to the wonderful UNIX commandline (or shell script).
Offhand I can think of the following:
cat pdb1hdt.ent | grep '^ATOM' | cut -c18-27 |
nawk 'BEGIN {print "AA number chain"} {num=substr($0,6,6);
aa=substr($0,1,3); ch=substr($0,5,1); if(num==oldnum) {getline;}
else {printf "%3s %6s %6s\n",aa,num,ch;} oldnum=num}'
This seems to work on my machine (running IRIX 6.2) although I only
tried a couple of different PDB-files.
Note that all this has to be on one line. Alternatively you can put it
in more lines with a backslash at the end of each line. This can also be
put in a shell script instead of being run on the commandline. An
example of such a scriptfile:
----------CUT--------------------------------------
#! /bin/csh -f
cat $1 | grep '^ATOM' | cut -c18-27 | \
nawk 'BEGIN {print "AA number chain"} \
{num=substr($0,6,6); \
aa=substr($0,1,3); \
ch=substr($0,5,1); \
if(num==oldnum) \
{getline;} \
else \
{printf "%3s %6s %6s\n",aa,num,ch;} oldnum=num}'
----------CUT--------------------------------------
This script is then run with the PDB file as a parameter (e.g.:
getatoms pdb1hdt.ent
The path in the first line has to reflect the actual position of your
c-shell.
As my knowledge of the PDB-format is rather superficial you may want to
change some details in the above. I just made this script based on a few
examples that I looked at. The basic idea in the script is that the
relevant information is in the lines beginning with ATOM. On these
lines, the three letter amino acid code, the amino acid number, and (if
relevant) the chain, is indicated in fixed columns (which is why I have
used a "cut" command. The "nawk" part of the script is merely used to
produce pretty output, and to make sure that only one line is printed
for each amino acid.
Generally I find that almost all my day-to-day data processing needs can
be handled by using "nawk", "grep" and a few other UNIX commands.
Hope you can use this.
Kind regards,
Anders Gorm Pedersen
------------------------------------------------------------------------
Anders Gorm Pedersen Phone: (+45) 45 25 24 84
Center for Biological Sequence Analysis Fax: (+45) 45 93 48 08
Technical University of Denmark e-mail: gorm at cbs.dtu.dk
Building 207, DK-2800 Lyngby, Denmark
WWW: http://www.cbs.dtu.dk/gorm/home.html
------------------------------------------------------------------------