[Protein-crystallography] Quick-and-dirty searches of both PDB and EMDB at PDBe

Gerard DVD Kleywegt via xtal-log@net.bio.net (by gerard from xray.bmc.uu.se)
Fri Apr 1 10:12:03 EST 2011


Hi all,

As part of its recent winter update, the Protein Data Bank in Europe (PDBe; 
http://pdbe.org/) has improved its facility for searching the PDB and EMDB in 
tandem. It is designed to let you carry out many of your day-to-day searches 
without having to fill out a complex form or learn a special query syntax. 
Simply type what you are looking for, click the SEARCH button, and we will do 
our best to dig up relevant information, be it in the PDB, in EMDB or on our 
website.

QUICK ACCESS TO ENTRIES, SERVICES, SEQUENCES
--------------------------------------------

If you go to the PDBe home page (http://pdbe.org/), you will see a Google-like 
search box in the friendly green banner near the top of the page (just below 
our motto, "Bringing Structure to Biology"). You can use this search box in a 
number of ways:

- type a PDB code (e.g., 1cbs), and you will be taken directly to the summary 
page for that entry. You can type any valid code, even if it's not in the 
current release, so you can use this facility to obtain information about the 
status of entries that have not been released yet (e.g., 2yd0) or entries that 
are no longer in the archive (e.g., theoretical models).

- type a valid EMDB code (e.g., 1607) and you will be taken straight to the 
summary page for that entry.

HINT: if, instead of being taken directly to a summary page for a certain PDB 
or EMDB code, you want to actually search PDB and/or EMDB for references to 
that particular code, simply enclose it in double quotes. For instance, 
searching for 1mi6 will take you to the summary page for PDB entry 1mi6, 
whereas searching for "1mi6" will give you a set of hits in both PDB and EMDB 
that all contain a reference to 1mi6.

- type something resembling a PDBe service or resource name and chances are 
that the name will be recognised and you will be taken straight to that service 
or resource (e.g., autodep, emdep, pdbemotif, pdbepisa, pdbefold, pdbechem, 
quips, portfolio, etc.).

- you can search the protein sequences in the PDB by entering seq: (or 
sequence:) followed by a (partial) amino-acid sequence in one-letter code 
(e.g., seq:GNAAAAKKGSEQESVKEFLAKAKEDFLKKWETPSQNTA). The sequence will be 
compared to all protein sequences in the PDB using FastA, and the results will 
be presented to you for further analysis in the PDBe sequence browser (see 
http://pdbe.org/sequence).
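
For those who think in code, the routing described above can be sketched 
roughly as follows. This is only an illustration, not the actual PDBe 
implementation: the function name, the returned labels and the code patterns 
(a PDB code taken to be a digit followed by three alphanumerics, an EMDB code 
taken to be four digits) are all assumptions made for the example.

```python
import re

# Assumed formats (not taken from PDBe's code): a PDB code is a digit followed
# by three alphanumerics; an EMDB code is four digits.
PDB_CODE = re.compile(r"[0-9][A-Za-z0-9]{3}")
EMDB_CODE = re.compile(r"[0-9]{4}")

# Service names listed in the text above.
SERVICES = {"autodep", "emdep", "pdbemotif", "pdbepisa", "pdbefold",
            "pdbechem", "quips", "portfolio"}


def classify_query(query):
    """Return a (kind, value) pair describing how a query might be routed."""
    q = query.strip()
    lowered = q.lower()
    # seq: or sequence: followed by a one-letter amino-acid sequence.
    for prefix in ("seq:", "sequence:"):
        if lowered.startswith(prefix):
            return ("sequence_search", q[len(prefix):].strip().upper())
    # Double quotes force a search, even if the content looks like a code.
    if len(q) >= 2 and q[0] == '"' and q[-1] == '"':
        return ("phrase_search", q[1:-1])
    # All-digit codes are checked first, since they also fit the PDB pattern.
    if EMDB_CODE.fullmatch(q):
        return ("emdb_entry", q)
    if PDB_CODE.fullmatch(q):
        return ("pdb_entry", lowered)
    if lowered in SERVICES:
        return ("service", lowered)
    return ("text_search", q)
```

With these assumptions, 1cbs is classified as a PDB entry, 1607 as an EMDB 
entry, and "1mi6" (in double quotes) as a search rather than a direct jump.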

TEXT-BASED SEARCHES
-------------------

Of course you can do general text-based searches of the PDB and EMDB as well - 
just type one or more search terms in the box and hit the SEARCH button.

- If you type a single search term and it gives hits in the PDB, you will get a 
results page with a tree structure on the left which shows in which categories 
the term was found. For instance, if you look for Jones, that could be an 
author, but it could also be part of the name of a molecule (e.g., Bence Jones 
protein). By clicking on an appropriate branch in the tree, you select only 
those entries for which the search term occurs in that data category (e.g., 
author or PDB compound).

- If you type more than one search term, only entries that contain all these 
terms will be selected as hits. For instance, if you search for "kleywegt po4" 
- without the quotes - you will get only one hit, 1CBQ. Note that if you 
enclose your search terms in double quotes, you will only get hits that match 
exactly (i.e., the complete search expression must occur somewhere in the 
entry, not just all of the keywords individually). For instance, searching for 
"HCV NS3 protease" yields 31 hits in the PDB if you enclose the terms in double 
quotes, but 177 hits if you don't.
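
The difference between the two modes can be illustrated with a small sketch. 
It is not how the PDBe search is implemented; the function name is made up, 
the matching is a plain case-insensitive substring test (an assumption), and 
the two record texts are invented for the example rather than taken from real 
entries.

```python
def matches(query, text):
    """True if text satisfies the query under the rules described above."""
    q = query.strip().lower()
    t = text.lower()
    if len(q) >= 2 and q[0] == '"' and q[-1] == '"':
        # Quoted: the complete expression must occur as written.
        return q[1:-1] in t
    # Unquoted: every term must occur somewhere, in any order.
    return all(term in t for term in q.split())


# Invented record texts, for illustration only.
records = [
    "Crystal structure of HCV NS3 protease with an inhibitor",
    "NS3 protease domain of HCV in complex with NS4A",
]

unquoted_hits = [r for r in records if matches("HCV NS3 protease", r)]
quoted_hits = [r for r in records if matches('"HCV NS3 protease"', r)]
```

Here the unquoted query selects both records, whereas the quoted query selects 
only the first, which is why quoting reduces the number of hits.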

Note that there are two tabs on the results page - one labelled "PDB entries" 
and the other "EMDB entries". If you do a search for Baumeister, you will get 
14 hits in the PDB. If you click on the "EMDB entries" tab, you will find that 
there are 10 hits in EMDB.

HINT: if you want the EMDB results tab to become active straightaway, prefix 
your search term(s) with "emdb:" (without the quotes). For example, search for 
emdb:saibil and you will immediately get the list of 56 EMDB hits.

SEARCH RESULTS
--------------

The search results are sorted by release date by default, with the most 
recently released entries at the top. This ensures that if you read an exciting 
paper about new ClpC structures, a search for clpc will give you the latest 
entries first. You can change the sort order and criterion with a drop-down 
menu.

Each entry that is found as a hit in a search is shown in a panel that contains 
useful summary information and allows you to launch various searches and 
services with a single mouse-click. If you do a search for hiv-1, for example, 
you will get many hits in the PDB and two dozen in EMDB:

- For each PDB hit you will see the PDB code, a small image of the structure, 
the resolution (for X-ray and EM structures), the title of the entry, and a set 
of PDBprints that provide at-a-glance information about the entry (see 
http://pdbe.org/pdbprints). Two action buttons are also shown: "Entry summary" 
and "Download PDB file" - when pressed, they will do what they promise. If you 
click on "More ..." (or on "Expand all ..." at the top of the results tab) you 
will see even more information: the release date, information about the 
publication describing the structure, possible cross-references to EMDB entries 
and four more action buttons ("Quick links to related PDBe services"), namely:
   * "Download other files" (takes you to a download page with mmCIF files, 
experimental data files, etc.),
   * "Quaternary structure" (which takes you straight to the PISA results about 
probable assemblies),
   * "Similar structures" (which will automatically launch an SSM/PDBeFold 
search of the PDB to look for structures with similar folds),
   * "Motifs and sites" (which will take you to the PDBeMotif analysis of the 
structure - this may not always work for very recent entries, but we are 
working on solving this issue).

- For each EMDB hit you will see information similar to that shown for the PDB 
hits. Instead of a PDB file, there will be an action button to "Download header 
file". If the EM map/tomogram has been released, there will also be a button to 
download it. If you click "More ..." you will sometimes see "Other EMDB entries 
from this publication" (if one paper describes more than one EMDB entry).

NOTE: if you search for hiv-1 today, you will notice that the top two EMDB hits 
have release dates of 29 March 2011, but that their maps are not yet available 
for download. This has to do with the way entries are released in practice 
(EMDB and PDB use a weekly release cycle). Once its release date has arrived, 
an EMDB entry is flagged for release on the first Thursday following that date, 
and its map then becomes available in the next weekly release, on the first 
Wednesday after that Thursday.
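
As a worked example of this timing rule, here is a minimal sketch in Python. 
The function names are made up for the illustration, and it assumes that 
"following" and "after" mean strictly later than the given day (so a release 
date that falls on a Thursday would be flagged a week later); that reading is 
an assumption, not something the release procedure is known to guarantee.

```python
from datetime import date, timedelta

THURSDAY, WEDNESDAY = 3, 2  # date.weekday() numbering: Monday is 0


def next_weekday_after(day, weekday):
    """First date strictly after `day` that falls on `weekday`."""
    offset = (weekday - day.weekday()) % 7
    # An offset of 0 means `day` is already that weekday, so jump a full week.
    return day + timedelta(days=offset or 7)


def map_available_on(release_date):
    """Date on which the map would appear, under the assumptions above."""
    flagged = next_weekday_after(release_date, THURSDAY)
    return next_weekday_after(flagged, WEDNESDAY)


print(map_available_on(date(2011, 3, 29)))  # 2011-04-06
```

For the two entries mentioned above (release date Tuesday 29 March 2011), the 
flagging would happen on Thursday 31 March and the maps would appear in the 
release of Wednesday 6 April 2011.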

LIMITATIONS AND SEARCH TIPS
---------------------------

As you have seen, the Google-like box in the PDBe banner allows you to carry 
out many standard searches quickly and accurately, but with some limitations, 
such as:

- you cannot use regular expressions or operators such as NOT, AND and OR
- you cannot use wildcards (e.g., searching for "vanil*" will not give any 
hits)
- if you search by author name, you will get better results if you only provide 
the surname (searching for "kleywegt gj" only returns one hit, and it's not a 
structure determined by that person)
- at present, there is no way of ranking the results by relevance (the search 
includes PDB keywords and the PubMed abstract, both of which can lead to false 
positives)

In general, the more search terms you enter, the fewer results you will get (as 
they are all required to occur). Useful search terms are (combinations of) the 
following:

- surnames of authors (e.g., rossmann, sixma, allerston, akke, walse)
- names of proteins (e.g., HMG CoA reductase, Lon protease, bacteriorhodopsin)
- (parts of) species names (e.g., "plasmodium falciparum")
- common names of chemical compounds (e.g., retinol, sildenafil, nadph)
- database identifiers from EC, PubMed, UniProt, etc. (e.g., entering 20890284 
will retrieve two structures that have that number as their PubMed identifier; 
searching for CH60_ECOLI will retrieve 30 hits, 28 of which have that as a 
UniProt identifier; searching for 1.1.1.27 will return 105 hits, 63 of which 
contain a lactate dehydrogenase with that EC number)
- (part of) a valid GO term (e.g., "intracellular protein transport", "signal 
sequence", "anchored to membrane")

If you want to carry out more sophisticated searches, in which you can specify 
that you want to search for a term in a particular category of information 
(e.g., looking for "parkinson" in an abstract rather than as an author, or 
looking for "cancer" as part of a reference rather than a keyword), you can use 
the PDBe advanced search facilities. An action button labelled "Advanced 
search" is available between the green PDBe banner and the search results.

                                    -----

We welcome your comments, bug reports and feature requests on the 
"quick-and-dirty" PDBe search facility. Please use the feedback button at the 
top of any PDBe web page.

--Gerard

---
Gerard J. Kleywegt, PDBe, EMBL-EBI, Hinxton, UK
gerard from ebi.ac.uk ..................... pdbe.org
Secretary: Pauline Haslam  pdbe_admin from ebi.ac.uk


