Brookhaven PDB sequence data
T. Mark Reboul
MARK at CUCCFA.CCC.COLUMBIA.EDU
Wed Jul 28 14:23:28 EST 1993
Here is part of the answer to Bart Frank's question about the
sequences of proteins in the Brookhaven PDB.
If you connect to one of the Anonymous FTP sites where Swiss-Prot
database files live, for example:

    ftp ncbi.nlm.nih.gov
    (login as anonymous)
    cd repository/swiss-prot

you will see the pdbtosp.txt.Z file there. Enter:

    bin
    get pdbtosp.txt.Z
    quit

Back on your home system, use the uncompress utility to expand the
file into pdbtosp.txt, which is a text file correlating PDB entry
names with Swiss-Prot entry names (i.e., those Swiss-Prot entries
incorporating more or less the same sequence as found in PDB
entries).
NOTE: The correspondence between PDB entries and Swiss-Prot
entries is not one-to-one, as the Swiss-Prot database
contains sequences which are composites of multiple PDB
entries' sequences, with redundancies eliminated.
Use VMS SEARCH or Unix grep on that file, pdbtosp.txt, to scan for
individual associations.
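
For readers without SEARCH or grep at hand, the same scan can be sketched in Python. This is only an illustration: the function name find_associations is made up here, and the sample lines are hypothetical stand-ins, since the actual column layout of pdbtosp.txt is not described in this post. The sketch does a plain case-insensitive substring match, so it does not depend on that layout.

```python
def find_associations(lines, code):
    """Return every line mentioning `code`, ignoring case (like grep -i)."""
    needle = code.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


# Hypothetical sample lines; the real layout of pdbtosp.txt may differ.
sample = [
    "1HLA  B2MG$HUMAN  HA1A$HUMAN\n",
    "0XXX  DEMO$EXAMPLE\n",  # made-up placeholder entry
]

for hit in find_associations(sample, "1hla"):
    print(hit)
```

To scan the real file, pass an open file object instead of the sample list, for example open("pdbtosp.txt"), since a text file iterates line by line.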
If what you want to do is to run various GCG database-searching
programs against only those Swiss-Prot sequences cross-referencing
PDB entries as sources, then it is appropriate to create a GCG File
Of Sequence Names (FOSN) specifying indirectly that PDB subset of
Swiss-Prot. This isn't too hard, since the Swiss-Prot cross-
reference lines have a consistent format. The following has worked
for me in the past:

    strings/menu=b sw:* "DR PDB;" bhaven.fosn

(VMS GCG version of the command). Then, when you run whichever GCG
database-search program, respond to the search scope prompt with:

    @bhaven.fosn

This is the quick & dirty approach, suitable if you do not expect to
do many such searches. It is quick & dirty because of the way the
GCG software currently handles a FOSN: the GCG-style Swiss-Prot
dataset index file is scanned in full, over and over, once for each
sequence listed in the FOSN, and that adds significant CPU time to
your database search. I would estimate on the order of 1 second of
CPU time (maybe a little less) per FOSN line. With 400-500 lines in
that file now, well, I guess you can do the math!
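
Spelled out, the math looks like this. The sketch simply multiplies the FOSN line count by the rough per-line cost quoted above; the function name and the 1.0 second default are illustrative, and the true cost will depend on the machine and the GCG release.

```python
def fosn_overhead_seconds(fosn_lines, seconds_per_line=1.0):
    """Estimate extra CPU seconds from rescanning the index once per FOSN line."""
    return fosn_lines * seconds_per_line


for n in (400, 500):
    extra = fosn_overhead_seconds(n)
    print(f"{n} FOSN lines: about {extra:.0f} s ({extra / 60:.1f} min) extra CPU")
```

At the upper-bound figure of 1 second per line, that is roughly 7 to 8 extra CPU minutes on every search.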
Thus, if you will often need the PDB-referencing subset of
Swiss-Prot as your database-search scope, in the long run it will be
much more CPU-efficient to take the bhaven.fosn file and build an
independent GCG dataset from it, duplicating those several hundred
Swiss-Prot sequences' data inside it. (No details here.)
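
As a rough illustration of the first half of that job, here is a Python sketch that pulls the PDB-referencing entries out of a Swiss-Prot flat file. It is not the GCG procedure: it assumes the distributed flat-file layout in which each entry starts with an ID line, ends with a "//" line, and carries its cross-references on DR lines. The function name and the abbreviated sample records are made up for the example, and the output would still have to be converted with GCG's own tools.

```python
def entries_with_pdb_xref(lines):
    """Yield (entry_name, entry_lines) for entries that have a 'DR PDB;' line."""
    entry, name, has_pdb = [], None, False
    for line in lines:
        entry.append(line)
        if line.startswith("ID"):
            fields = line.split()
            name = fields[1] if len(fields) > 1 else None
        elif line.startswith("DR") and line[2:].lstrip().startswith("PDB;"):
            has_pdb = True
        elif line.startswith("//"):
            # End of entry: emit it only if a PDB cross-reference was seen.
            if has_pdb and name is not None:
                yield name, entry
            entry, name, has_pdb = [], None, False


# Abbreviated, hypothetical records; real entries carry many more lines.
sample = [
    "ID   B2MG$HUMAN\n",
    "DR   PDB; 1HLA;\n",
    "//\n",
    "ID   DEMO$EXAMPLE\n",  # made-up entry with no PDB cross-reference
    "DR   EMBL; X00000;\n",
    "//\n",
]

for name, entry_lines in entries_with_pdb_xref(sample):
    print(name, len(entry_lines))
```

Writing each yielded entry to a single output file gives a self-contained subset that can be indexed once, instead of rescanning the full Swiss-Prot index for every FOSN line.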
Related to this whole issue, a couple of years ago I asked Amos
Bairoch what decisions are made in the incorporation & cross-
referencing of PDB entries inside Swiss-Prot. For the interested
reader, I attach below what he told me back then.
Mark Reboul
Columbia-Presbyterian Cancer Center Computing Facility
mark at cuccfa.ccc.columbia.edu
Past-Life Discussion with Amos Bairoch
______________________________________
The Swiss-Prot database is a joint compilation between EMBL and Dr
Amos Bairoch's group at the University of Geneva. I sent Dr Bairoch
the following question, regarding our [local] concerns about keeping
a GCG-format representation of the Brookhaven PDB online:

    I know that the Swiss-Prot database contains many hundreds
    of cross-references ("DR" lines) to Brookhaven PDB entries.
    But does Swiss-Prot in fact contain sequences for every PDB
    entry, that is, every PDB entry in a recent release from
    Brookhaven?

Here is his response:

    Currently (Release 19 of August 1991), SWISS-PROT is cross-
    indexed to Release 56 of April 1991 of PDB; the next release
    of SWISS-PROT will be cross-indexed with release 57 or 58 of
    PDB.
    There are a total of 655 structure entries in release 56 of
    PDB. Out of that total, 66 are either structures of nucleic
    acid, sugar, peptide antibiotics or protein structure for
    which no sequence data exists. That leaves 589 entries, and
    there are 577 PDB entries referenced in SWISS-PROT release
    19. So a total of 12 entries are not referenced: 10 of these
    are Immunoglobulins, for which SWISS-PROT offers a "minimal"
    support (plans exist to distribute a special version of the
    Kabat data bank in SWISS-PROT format), and for which I have
    not attempted to cross-reference those entries. The two
    others, 2HVP and 4HVP, are HIV-1 protease structures which do
    not correspond exactly to an existing natural sequence (as
    is the case for 1HVP and 3HVP), but are artificial
    constructs.

It does my heart good to get a precise answer! Note, however, that
he does not actually say that Swiss-Prot policy aims to include the
sequence of every reasonable PDB entry. I brought that concern
explicitly to his attention, and here is his response:

    The intent is indeed to have all (or at least the maximum)
    of PDB sequences in SWISS-PROT. In general this is not a
    problem, as most 3D structures are entered in PDB long after
    the protein sequence has become available. There are a few
    exceptions of course, and in some cases I had to use a PDB
    entry to get a new sequence into SWISS-PROT. The big
    problems are all the errors that exist in PDB sequence
    records, which are not the fault of PDB, but are due rather
    to the wrong sequence info which is often used to fit 3D
    structure. I had a practical example of this problem this
    morning: the paper from Golosinska et al. (JBC
    266:15797(1991)) reports that the sequence of turkey
    troponin C as reported by the crystallographic studies is
    wrong; I have accordingly corrected that entry for the next
    release, but the PDB sequence records will of course not be
    modified.

I thanked him on behalf of all sequence analyzers for being so
careful in his compilation effort! Subsequently, I asked Dr Bairoch
about the multi-chain issue involved in extracting sequence from PDB
entries. He answered:
====================================================================
Well, there are in fact two different things here. It is true that a
PDB entry can represent more than one protein chain. For example,
the rhinovirus coat protein structures are made of 4 protein chains,
and many proteases are crystallized with an inhibitor whose
structure is also in the PDB entry. But there is no real problem in
the representation of such data in SWISS-PROT as there are two
different cases:
1) The different chains belong to separate proteins encoded by
   separate genes.
   In that case the different chains will be found in separate
   SPROT entries. Example:
   PDB entry 1HLA represents the structure of the A-2 HLA complex:
   this consists of a) beta-2-microglobulin (B2MG$HUMAN) and
   b) the A-2 alpha chain (HA1A$HUMAN). So in that case sequence
   data found in a PDB entry is present in two different protein
   entries.
2) The different chains are processed from the same precursor
   protein and thus are encoded by the same gene.
   In that case the separate chains are found in the same SPROT
   entry. Examples:
   - The two chains of insulin.
   - The four subunits of the HRV coat.
In the case you mention, INS$PIG, the two chains (B and A) which are
shown in the PDB entry are present in the SPROT entry, separated by
the C-peptide which is cleaved off during the post-translational
processing of the insulin pre-pro-protein.
What must be realized is that SWISS-PROT contains, whenever
possible, the precursor form of a protein (complete with signal
peptide, etc.) while PDB represents the crystallized form of
processed proteins (by definition). It is always possible to find
where a mature protein starts in a SWISS-PROT entry by looking at
the feature table (SIGNAL, PROPEP, TRANSIT, CHAIN, and PEPTIDE keys
are used for that purpose).
================= End of Bairoch's final response =================
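
Dr Bairoch's last point, locating the mature protein through the feature table, can be sketched in Python as follows. This is a minimal illustration, assuming feature ("FT") lines of the form key, start position, end position, optional description, separated by whitespace. The function names and the coordinates in the sample are invented for the example and are not taken from any real entry.

```python
MATURATION_KEYS = {"SIGNAL", "PROPEP", "TRANSIT", "CHAIN", "PEPTIDE"}


def maturation_features(lines):
    """Collect (key, start, end) from FT lines that describe processing."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue
        fields = line.split()
        if len(fields) < 4 or fields[1] not in MATURATION_KEYS:
            continue
        try:
            start, end = int(fields[2]), int(fields[3])
        except ValueError:
            continue  # skip positions that are not plain integers
        features.append((fields[1], start, end))
    return features


def first_mature_start(features):
    """Return the lowest start among CHAIN/PEPTIDE features, or None."""
    starts = [start for key, start, _ in features if key in ("CHAIN", "PEPTIDE")]
    return min(starts) if starts else None


# Hypothetical feature table for an insulin-like precursor (made-up positions).
sample = [
    "FT   SIGNAL        1     20\n",
    "FT   CHAIN        21     50       B CHAIN.\n",
    "FT   PROPEP       51     80       C PEPTIDE.\n",
    "FT   CHAIN        81    100       A CHAIN.\n",
    "FT   DISULFID     27     87\n",
]

features = maturation_features(sample)
print(features)
print(first_mature_start(features))
```

In this made-up example the two CHAIN features bracket the PROPEP feature, mirroring the B chain, C-peptide, A chain arrangement described above for insulin.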