Counting tripeptide frequencies

Andrew Dalke dalke at bioreason.com
Wed Feb 10 19:08:12 EST 1999


Rich Dudley asked:
> Does anyone know of a program (WW or Windows) that can enumerate the di-
> and tri-peptide frequencies in a protein?  Ideally, it would construct
> a table at the end of the input and have the sequence and number of
> occurrences.
> 

This isn't any help, but I figured my code was hard enough to
understand that I would post it anyway <grin>.

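Here's a Perl one-liner for overlapping dipeptide counts, assuming
single-letter sequences with one record per line.  The first
substitution counts the pairs that start at odd positions; the substr
then drops the first letter so the second substitution picks up the
pairs in between.  Counts are reported separately for each record.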

perl -ne '%dict=();
  s/(..)/$dict{$1}++,$1/ge;$_=substr($_,-(length)+1);
  s/(..)/$dict{$1}++,$1/ge;
  foreach $k (keys %dict) {print "$k $dict{$k}\n"}'

ANAANOPOANO
OA 1
AA 1
NO 2
OP 1
NA 1
PO 1
AN 3

(Yeah! And "O" is the 21st Beatle^H^H^H^H^H^Hamino acid :)

For tripeptides that's:
perl -ne '%dict=();
  s/(...)/$dict{$1}++,$1/ge;$_=substr($_,-(length)+1);
  s/(...)/$dict{$1}++,$1/ge;$_=substr($_,-(length)+1);
  s/(...)/$dict{$1}++/ge;
  foreach $k (keys %dict) {print "$k $dict{$k}\n"}'

ANANAPANA
ANA 3
APA 1
NAN 1
PAN 1
NAP 1

  Intuitively obvious to the most casual of observers, yes?

						Andrew



