Sequences from PDB via Entrez
Steve Bryant
bryant at ray.nlm.nih.gov
Thu Feb 17 09:59:21 EST 1994
I've seen messages on this board in the past concerning access to sequences
derived from the Brookhaven Protein Data Bank. I've recently finished putting
the latest PDB sequences into the Entrez database distributed by NCBI, and I
thought I'd take the opportunity to remind folks that Entrez can be an easy way
to retrieve a sequence from PDB.
PDB-derived sequences can be identified within Entrez by using the keyword
"pdb-structure". This will find either all PDB-derived protein sequences or
all PDB-derived nucleic acid sequences, depending on which category one
selects. Particular sequences within these groups may be found by PDB ID code
(the "accession number" in Entrez), or by looking for protein names and the like
in "text terms". The PDB-derived entries contain "text terms" derived from PDB
COMPOUND and SOURCE records, as well as from other PDB record types. One can
also find PDB-derived sequences by searching for descriptive names in the
Medline abstracts included with Entrez. About 90% of the citations in PDB are
linked to the corresponding Medline citation, and if you can find the paper
that reported a structure, you can then ask for the associated sequence.
Sequences may be written out of Entrez in different formats, including FASTA
sequence files.
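For anyone who wants to process a FASTA file once it has been saved, here is a
minimal Python sketch of a reader. It assumes only the general FASTA convention
(a ">" header line followed by one or more sequence lines); the function name
parse_fasta and the two example records are made up for illustration and are
not actual Entrez output.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, not real database entries.
example = """>example_1 hypothetical protein chain
MKTAYIAKQR
QISFVKSHFS
>example_2 hypothetical nucleic acid chain
GGCCAATT
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```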
The pdb sequence reports in Entrez combine information provided on pdb ATOM
and/or HETATM records with the explicit sequence given on SEQRES. (In about 1%
of cases ATOM/HETATM and SEQRES cannot be linked unambiguously, due to missing
data or inconsistencies. In these cases biopolymer sequences are derived from
ATOM records.) Because of this linking, the sequence reports contain a fairly
rich annotation, including secondary structure, disulfide bonds, bonds to
nonpolymer groups, and descriptions of modified biopolymer residues. They also
contain the residue numbers assigned by pdb on ATOM/HETATM records, so that one
can unambiguously identify the coordinates in the pdb file that go with each
residue in the sequence.
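To illustrate the two kinds of records being linked, here is a rough Python
sketch that reads the explicit sequence from SEQRES and the residues that
actually have coordinates from ATOM/HETATM. It is not the procedure used to
build the Entrez reports. It assumes the fixed-column layout of the present-day
PDB file format, which may differ in detail from older files, and the function
names and the three-residue sample are made up for illustration.

```python
def seqres_by_chain(pdb_lines):
    """Collect the explicit SEQRES sequence for each chain identifier."""
    chains = {}
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]              # column 12
            residues = line[19:70].split()   # residue names from column 20 on
            chains.setdefault(chain_id, []).extend(residues)
    return chains


def observed_residues(pdb_lines):
    """List residues that have coordinates, in file order.

    Each entry is (chain, residue name, residue number plus insertion code).
    """
    seen = []
    last = None
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            key = (line[21], line[17:20].strip(), line[22:27].strip())
            if key != last:  # consecutive atoms of one residue count once
                seen.append(key)
                last = key
    return seen


# Made-up sample: SEQRES lists three residues, but only two have coordinates.
sample = [
    "SEQRES   1 A    3  GLY ALA SER",
    "ATOM      1  N   GLY A  10      11.000  12.000  13.000",
    "ATOM      2  CA  GLY A  10      12.000  13.000  14.000",
    "ATOM      3  N   ALA A  11      13.000  14.000  15.000",
]

seqres = seqres_by_chain(sample)
observed = observed_residues(sample)
print(seqres)
print(observed)
```

In this sample the third SEQRES residue has no ATOM records, which is the sort
of missing data that makes the linking step necessary.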
PDB-derived sequence reports in Entrez are derived automatically from pdb
files, and I update the collection with each new release of pdb. Network
Entrez version 9.0 came out on February 10, and its database includes all
polypeptide and nucleic acid sequences on the "October, 1993" Brookhaven CD,
which I received in late January.
Steve Bryant
2/16/94