Sequences from PDB via Entrez
bryant at ray.nlm.nih.gov
Thu Feb 17 13:29:30 EST 1994
I have seen messages on this board in the past concerning access to
sequences derived from the Brookhaven Protein Data Bank. I've
recently finished putting the latest PDB sequences into the Entrez
database distributed by NCBI, and I thought I'd take the opportunity
to remind folks that Entrez can be an easy way to retrieve a sequence.
PDB-derived sequences can be identified within Entrez by using the
keyword "pdb-structure". This will find either all PDB-derived
protein sequences or all PDB-derived nucleic acid sequences, depending
on which category one selects. Particular sequences within these
groups may be found by PDB id-code (the "accession number" in Entrez), or
by looking for protein names and the like in "text terms". The
PDB-derived entries contain "text terms" derived from PDB COMPOUND and
SOURCE records, as well as from other PDB record types. One can also
find PDB-derived sequences by searching for descriptive names in the
Medline abstracts included with Entrez. About 90% of the citations in
PDB are linked to the corresponding Medline citation, and if you can
find the paper that reported a structure, you can then ask for the
associated sequence. Sequences may be written out of Entrez in
different formats, including FASTA sequence files.
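As an illustration of the FASTA layout mentioned above, here is a minimal
sketch in Python. It is not how Entrez itself writes files; the function
name to_fasta, the 60-column line width, and the id "1ABC" in the usage
note are all assumptions made for the example.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Format one sequence as a FASTA record (hypothetical helper).

    The header line starts with '>' followed by an identifier and an
    optional free-text description; the sequence is wrapped at `width`
    characters per line.
    """
    header = ">" + identifier
    if description:
        header += " " + description
    lines = [header]
    # Wrap the sequence into fixed-width chunks.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

For example, to_fasta("1ABC", "placeholder entry", "ACDEFG") gives a header
line followed by the sequence on a single line.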
The PDB sequence reports in Entrez combine information provided on PDB
ATOM and/or HETATM records with the explicit sequence given on SEQRES.
(In about 1% of cases ATOM/HETATM and SEQRES cannot be linked
unambiguously, due to missing data or inconsistencies. In these cases
biopolymer sequences are derived from ATOM records.) Because of this
linking, the sequence reports contain a fairly rich annotation,
including secondary structure, disulfide bonds, bonds to nonpolymer
groups, and descriptions of modified biopolymer residues. They also
contain the residue numbers assigned by PDB on ATOM/HETATM records, so
that one can unambiguously identify the coordinates in the PDB file
that go with each residue in the sequence.
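To make the two sources of sequence information concrete, the following
Python sketch reads the SEQRES sequence and the residues actually present
on ATOM/HETATM records from the lines of a PDB file. It is only a rough
illustration, not the procedure used to build the Entrez reports: the
function names are hypothetical, the column positions assume the usual
fixed-column PDB layout (chain id in column 12 of SEQRES and column 22 of
ATOM/HETATM, residue number plus insertion code in columns 23-27), and no
attempt is made to align the two lists or resolve inconsistencies.

```python
def read_seqres(lines):
    """Return {chain_id: [residue names]} from SEQRES records.

    Assumes residue names occupy columns 20-70 of each SEQRES line.
    """
    chains = {}
    for line in lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]
            chains.setdefault(chain_id, []).extend(line[19:70].split())
    return chains


def read_observed_residues(lines):
    """Return {chain_id: [(residue_number, residue_name)]} in file order.

    One entry per residue seen on ATOM/HETATM records. The residue number
    is kept as a string so an insertion code (e.g. "52A") is preserved.
    """
    chains = {}
    for line in lines:
        if line.startswith(("ATOM  ", "HETATM")):
            chain_id = line[21]
            key = (line[22:27].strip(), line[17:20].strip())
            residues = chains.setdefault(chain_id, [])
            # Consecutive atoms of the same residue share the same key.
            if not residues or residues[-1] != key:
                residues.append(key)
    return chains
```

Comparing the two results for a chain shows which SEQRES residues have
coordinates and which residue number in the file goes with each of them.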
PDB-derived sequence reports in Entrez are generated automatically from
PDB files, and I update the collection with each new release of PDB.
Network Entrez version 9.0 came out on February 10, and its database
includes all polypeptide and nucleic acid sequences on the "October,
1993" Brookhaven CD, which I received in late January.