Brookhaven PDB sequence data

T. Mark Reboul MARK at CUCCFA.CCC.COLUMBIA.EDU
Wed Jul 28 14:23:28 EST 1993


Part of the answer to Bart Frank's question about the sequences of 
proteins in the Brookhaven PDB....

If you connect to one of the Anonymous FTP sites where Swiss-Prot 
database files live, for example:

	ftp ncbi.nlm.nih.gov
	(login as anonymous)
	cd repository/swiss-prot

you will see the pdbtosp.txt.Z file there. Enter:

	bin
	get pdbtosp.txt.Z
	quit

Back on your home system, use the uncompress utility to expand the 
file into pdbtosp.txt, which is a text file correlating PDB entry 
names with Swiss-Prot entry names (i.e., those Swiss-Prot entries 
incorporating more or less the same sequence as found in PDB 
entries).

NOTE:	The correspondence between PDB entries and Swiss-Prot 
	entries is not one-to-one, as the Swiss-Prot database 
	contains sequences which are composites of multiple PDB 
	entries' sequences, with redundancies eliminated.

Use VMS SEARCH or Unix grep on that file, pdbtosp.txt, to scan for 
individual associations.
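
If neither SEARCH nor grep is handy, the same scan can be sketched in 
Python. This is only a minimal sketch: the function name 
find_associations is made up for illustration, and it assumes nothing 
about the layout of pdbtosp.txt beyond names being separated by 
whitespace and ordinary punctuation.

```python
def find_associations(lines, name):
    """Return the lines that mention `name` as a whole token.

    `lines` is any iterable of text lines (for example an open file).
    Matching is case-insensitive. Surrounding punctuation is stripped so
    that 'FOO,' still matches 'FOO'. The file layout is otherwise not
    assumed.
    """
    wanted = name.upper()
    hits = []
    for line in lines:
        tokens = [tok.strip(",;:()").upper() for tok in line.split()]
        if wanted in tokens:
            hits.append(line.rstrip("\n"))
    return hits


# Hypothetical usage, with a made-up PDB code:
#   with open("pdbtosp.txt") as handle:
#       for hit in find_associations(handle, "1abc"):
#           print(hit)
```

Unlike a bare substring search, matching on whole tokens avoids false 
hits where a four-character PDB code happens to be embedded in a longer 
word.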

If what you want to do is to run various GCG database-searching 
programs against only those Swiss-Prot sequences cross-referencing 
PDB entries as sources, then it is appropriate to create a GCG File 
Of Sequence Names (FOSN) specifying indirectly that PDB subset of 
Swiss-Prot. This isn't too hard, since the Swiss-Prot cross- 
reference lines have a consistent format. The following has worked 
for me in the past:

	strings/menu=b  sw:*  "DR   PDB;"  bhaven.fosn

(VMS GCG version of the command). Then, when you run whichever GCG 
database-search program, respond to the search scope prompt with:

	@bhaven.fosn

The FOSN approach is quick & dirty, suitable if you do not expect to 
do many such searches. It is Q & D because of the way the GCG 
software currently deals with a FOSN: the GCG-style Swiss-Prot 
dataset index file is scanned in full, over and over again, once for 
each sequence listed in the FOSN. I would say on the order of 1 
second of CPU time (maybe a little less) per FOSN line will be added 
to your overall database-search execution time. With 400-500 lines 
in that file now, that comes to something like seven or eight extra 
CPU minutes per search.

Thus, if you will need to use the PDB-referencing subset of 
Swiss-Prot as your database-search scope often, in the long run it 
will be much more CPU-efficient to first take the bhaven.fosn file 
and build an independent GCG dataset from it, duplicating those 
several hundred Swiss-Prot sequences' data inside it. (No details 
here.)
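
As a rough illustration of the first step only (pulling the 
PDB-referencing entries out of a Swiss-Prot flat file), here is a 
minimal Python sketch. It is not a GCG procedure and does not produce a 
GCG dataset. It assumes the standard flat-file conventions: each entry 
starts with an "ID" line, ends with a "//" line, and carries its PDB 
cross-references on lines beginning "DR   PDB;". The function name 
pdb_referencing_entries is hypothetical.

```python
def pdb_referencing_entries(lines):
    """Yield (entry_name, entry_text) for entries with a 'DR   PDB;' line.

    `lines` is an iterable of flat-file lines. Entries are assumed to
    begin with an 'ID   ' line and to end with a '//' terminator line.
    """
    current = []       # lines of the entry being read
    name = None        # entry name taken from the ID line
    has_pdb = False    # True once a PDB cross-reference is seen
    for line in lines:
        current.append(line)
        if line.startswith("ID   "):
            name = line.split()[1]
        elif line.startswith("DR   PDB;"):
            has_pdb = True
        elif line.startswith("//"):
            if has_pdb and name is not None:
                yield name, "".join(current)
            current, name, has_pdb = [], None, False


# Hypothetical usage: copy the subset into its own flat file.
#   with open("sprot.dat") as src, open("pdb_subset.dat", "w") as dst:
#       for _name, text in pdb_referencing_entries(src):
#           dst.write(text)
```

The resulting subset file would still have to be converted into a GCG 
dataset with whatever reformatting and indexing tools your GCG 
installation provides.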

Related to this whole issue, a couple of years ago I asked Amos 
Bairoch what decisions are made in the incorporation & cross- 
referencing of PDB entries inside Swiss-Prot. For the interested 
reader, I attach below what he told me back then.


	Mark Reboul
	Columbia-Presbyterian Cancer Center Computing Facility
	mark at cuccfa.ccc.columbia.edu


Past-Life Discussion with Amos Bairoch
______________________________________


The Swiss-Prot database is a joint compilation between EMBL and Dr 
Amos Bairoch's group at the University of Geneva. I sent Dr Bairoch 
the following question, regarding our [local] concerns about keeping 
a GCG-format representation of the Brookhaven PDB online:

	I know that the Swiss-Prot database contains many hundreds 
	of cross-references ("DR" lines) to Brookhaven PDB entries. 
	But does Swiss-Prot in fact contain sequences for every PDB 
	entry, that is, every PDB entry in a recent release from 
	Brookhaven?

Here is his response:

	Currently (Release 19 of August 1991), SWISS-PROT is cross- 
	indexed to Release 56 of April 1991 of PDB, the next release 
	of SWISS-PROT will be cross-indexed with release 57 or 58 of 
	PDB.
 
	There are a total of 655 structure entries in release 56 of 
	PDB, out of that total 66 are either structures of nucleic 
	acid, sugar, peptide antibiotics or protein structure for 
	which no sequence data exists. That leaves 589 entries and 
	there are 577 PDB entries referenced in SWISS-PROT release 
	19. So a total of 12 entries are not referenced: 10 of these 
	are Immunoglobulins for which SWISS-PROT offers a "minimal" 
	support (plans exist to distribute a special version of the 
	Kabat data bank in SWISS-PROT format), and for which I have 
	not attempted to cross-reference those entries. The two 
	others: 2HVP and 4HVP are HIV-1 protease structures which do 
	not correspond exactly to an existing natural sequence (as 
	it is the case for 1HVP and 3HVP), but are artificial 
	constructs.
 
It does my heart good to get a precise answer! Note, however, that 
he does not actually say that Swiss-Prot policy intends to replicate 
the sequences of all reasonable PDB entries. I brought that concern 
explicitly to his attention, and here is his response:

	The intent is indeed to have all (or at least the maximum) 
	of PDB sequences in SWISS-PROT, in general this is not a 
	problem as most 3D structures are entered in PDB long after 
	the protein sequence has become available. There are a few 
	exceptions of course and in some cases I had to use a PDB 
	entry to get a new sequence into SWISS-PROT. The big 
	problems are all the errors that exist in PDB sequence 
	records, which are due not to the fault of PDB, but rather 
	to the wrong sequence info which is often used to fit 3D 
	structure. I had a practical example of this problem this 
	morning: the paper from Golosinska et al. (JBC 
	266:15797(1991)) reports that the sequence of turkey 
	troponin C as reported by the crystal. studies is wrong; I 
	have accordingly corrected that entry for the next release, 
	but the PDB sequence records will of course not be modified.

I thanked him on behalf of all sequence analyzers for being so 
careful in his compilation effort! Subsequently, I asked Dr Bairoch 
about the multi-chain issue involved in extracting sequence from PDB 
entries. He answered....

====================================================================

Well there are in fact here two different things. It is true that a 
PDB entry can represent more than one protein chain. For example, 
the rhinovirus coat protein structures are made of 4 protein chains, 
and many proteases are crystallized with an inhibitor whose 
structure is also in the PDB entry. But there is no real problem in 
the representation of such data in SWISS-PROT as there are two 
different cases: 
 
 1) The different chains belong to separate proteins encoded by 
    separate genes.
 
    In that case the different chains will be found in separate 
    SPROT entries. Example:
 
     PDB entry 1HLA represents the structure of the A-2 HLA complex: 
     this consists of a) beta-2-microglobulin (B2MG$HUMAN) and the 
     A-2 alpha chain (HA1A$HUMAN). So in that case sequence data 
     found in a PDB entry is present in two diff. protein entries.
 
 2) The different chains are processed from the same precursor 
    protein and thus are encoded by the same gene.
 
    In that case the separate chains are found in the same SPROT 
    entry. Example:
 
    -The two chains of insulin.
    -The four subunits of the HRV coat.
 
In the case you mention INS$PIG, the two chains (B and A) which are 
shown in the PDB entry are present in the SPROT entry, separated by 
the C-peptide which is cleaved off during the post-translational 
processing of the insulin pre-pro-protein.
 
What must be realized is that SWISS-PROT contains whenever this is 
possible the precursor form of a protein (complete with signal 
peptide, etc.) while PDB represents the crystallized form of 
processed proteins (by definition). It is always possible to find 
where a mature protein starts in a SWISS-PROT entry by looking at 
the feature table (SIGNAL, PROPEP, TRANSIT, CHAIN, and PEPTIDE keys 
are used for that purpose).

================= End of Bairoch's final response =================
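
To make the feature-table lookup concrete, here is a minimal Python 
sketch that collects the processing-related features of one entry. It 
assumes that each feature sits on an "FT" line whose whitespace-separated 
fields are the key, the start position and the end position, and it 
skips positions that are not plain numbers. The function names are made 
up for illustration.

```python
PROCESSING_KEYS = {"SIGNAL", "PROPEP", "TRANSIT", "CHAIN", "PEPTIDE"}


def processing_features(entry_lines):
    """Return (key, start, end) tuples for processing-related FT lines."""
    features = []
    for line in entry_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != "FT":
            continue
        if fields[1] not in PROCESSING_KEYS:
            continue
        try:
            # Tolerate uncertainty markers such as '<1' or '>99'.
            start = int(fields[2].lstrip("<>?"))
            end = int(fields[3].lstrip("<>?"))
        except ValueError:
            continue  # position not given as a number; skip the line
        features.append((fields[1], start, end))
    return features


def mature_regions(entry_lines):
    """Return (start, end) for each CHAIN or PEPTIDE feature."""
    return [(start, end)
            for key, start, end in processing_features(entry_lines)
            if key in ("CHAIN", "PEPTIDE")]
```

For a precursor such as insulin, mature_regions would be expected to 
report one region per mature chain, with the signal peptide and the 
cleaved propeptide showing up only in processing_features.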



