UniProt Release 2.0 Notes

Amar Kalelkar help at uniprot.org
Tue Jul 13 06:29:04 EST 2004

UniProt Release 2.0 Notes
1. Introduction
2. Database description
3. Current release contents
4. Description of changes made to UniProt since release 1.0
5. Forthcoming changes
6. How to link to UniProt
7. Feedback
8. Acknowledgments
9. Terms of use

The UniProt consortium--European Bioinformatics Institute (EBI), Swiss 
Institute of Bioinformatics (SIB) and Protein Information Resource 
(PIR)--is pleased to announce UniProt Release 2.0 (05-July-2004). 
UniProt is updated every two weeks and can be searched online or 
downloaded at http://www.uniprot.org.

UniProt is a centralized resource for protein sequences and functional 
information. It was created by combining the information from 
Swiss-Prot, TrEMBL and PIR. UniProt comprises three components, each 
optimized for different uses:

a. The UniProt Knowledgebase (UniProt) is the central access point for 
extensive curated protein information, including function, 
classification, and cross-references. The UniProt Knowledgebase contains 
two major elements: a section containing manually-annotated records, 
based on information from the literature and curator-evaluated 
computational analysis (referred to as UniProt/Swiss-Prot); and a 
section containing computationally-analyzed records awaiting manual 
annotation (referred to as UniProt/TrEMBL). PIR-PSD entries not found in 
Swiss-Prot or TrEMBL were incorporated into the UniProt Knowledgebase, 
and bi-directional cross-references between these and Swiss-Prot or 
TrEMBL records were created to allow easy tracking. By design, the 
Knowledgebase is non-redundant, with the goal of representing all known 
information regarding a particular protein. The UniProt Knowledgebase 
aims to describe in a single record all the protein products derived 
from a certain gene (or genes if the translation from different genes in 
a genome leads to indistinguishable proteins) from a certain species. 
The UniProt Knowledgebase represents a carefully selected subset of the 
sequences found in UniParc (see below). The UniProt Knowledgebase 
provides extensive cross-references to external data collections, such 
as the corresponding nucleotide entries in DDBJ/EMBL/GenBank, 2D-PAGE 
data, protein structure databases, protein domain and family 
characterization databases, post-translational modification databases, 
species-specific data collections, and disease databases. As a result of 
this extensive cross-referencing, the Knowledgebase serves as a de facto 
hub for biomolecular information about any given protein. Each entry in 
the Swiss-Prot section of the UniProt Knowledgebase is thoroughly 
analyzed and annotated. Literature-based curation is used to extract 
experimental data, which is then added to the entry. The experimental 
information is supplemented by manually confirmed results from various 
sequence analysis programs. The annotation includes a description of the 
properties of the protein, such as its function, any known 
post-translational modifications, domains, catalytic or other sites, 
secondary and quaternary structure, similarities to other proteins, 
diseases caused by mutations in the protein, pathways in which the 
protein is involved, sequence conflicts, and variants. Detailed 
information is available in the UniProt Knowledgebase user manual 
(http://us.expasy.org/sprot/userman.html), and in the UniProt/Swiss-Prot 
release notes (http://expasy.org/sprot/relnotes/) and the UniProt/TrEMBL 
release notes (ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt).

b. The UniProt Non-redundant Reference databases (UniRef) combine 
closely related sequences into a single record to accelerate sequence 
searches. While merging in the Knowledgebase is restricted to 
curator-assisted inclusion of reliable and stable sequence data for a 
single species, UniRef100 merges sequences automatically across 
different species and includes all UniProt Knowledgebase records. It 
also includes those UniParc records that represent sequences deemed 
over-represented and excluded from the Knowledgebase, such as 
DDBJ/EMBL/GenBank WGS (Whole Genome Shotgun) coding sequence 
translations, Ensembl protein translations from various organisms, and 
IPI data. The production of UniRef100 begins with the clustering of all 
records by sequence identity. Identical sequences and sub-fragments are 
presented as a single UniRef100 entry, containing the accession numbers 
of all merged entries, the protein sequence, and links to the 
corresponding Knowledgebase and archival records. UniRef90 and UniRef50 
are built from UniRef100 and are intended to provide non-redundant 
sequence collections for the scientific community to use in performing 
faster homology searches. All records with more than 90% or more than 
50% sequence identity are merged into a single UniRef90 or UniRef50 
entry, respectively.
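
The UniRef100 merging rule described above (identical sequences and 
sub-fragments collapse into one entry that lists every merged accession) 
can be illustrated with a minimal Python sketch. This is a toy 
illustration only, not the UniProt production procedure: the function 
name, the accessions ACC1 to ACC4 and the sequences are all made up, and 
a sub-fragment is assumed to mean an exact contiguous substring. The 
identity-based UniRef90 and UniRef50 clustering needs sequence alignment 
and is not covered here.

```python
def merge_identical_and_subfragments(records):
    """Toy UniRef100-style merge (hypothetical helper, not UniProt code).

    records maps an accession to its protein sequence. Returns a list of
    clusters, each holding one representative sequence and the accessions
    of all records merged into it.
    """
    # Visit longer sequences first so that a fragment is always seen
    # after any full-length sequence that could contain it.
    ordered = sorted(records.items(), key=lambda item: (-len(item[1]), item[0]))
    clusters = []
    for accession, sequence in ordered:
        for cluster in clusters:
            # Matches identical sequences and exact contiguous fragments.
            if sequence in cluster["sequence"]:
                cluster["members"].append(accession)
                break
        else:
            # No existing representative contains this sequence.
            clusters.append({"sequence": sequence, "members": [accession]})
    return clusters


# Made-up example data.
example = {
    "ACC1": "MKTAYIAKQRQISFVKSHFSRQ",
    "ACC2": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to ACC1
    "ACC3": "AKQRQISFVK",              # sub-fragment of ACC1
    "ACC4": "MSLNQWERTYPLK",           # unrelated sequence
}
clusters = merge_identical_and_subfragments(example)
for cluster in clusters:
    print(cluster["members"], cluster["sequence"])
```

In this sketch ACC1, ACC2 and ACC3 end up in one cluster and ACC4 in a 
cluster of its own. The pairwise substring scan is quadratic and would 
not scale to a real database; it is used here only for readability.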

c. UniProt Archive (UniParc) is a comprehensive repository of all 
publicly available protein sequences, consisting only of unique 
identifiers and sequence. While most protein sequence data is derived 
from the translation of DDBJ/EMBL/GenBank nucleotide sequences, a large 
amount of primary protein sequence data resulting from the direct 
sequencing of proteins is submitted directly to other sources, including 
Swiss-Prot, TrEMBL, and PIR-PSD; in addition, a large number of protein 
sequences are found in patent applications, as well as in entries from 
the Protein Data Bank (PDB). Given the wide variety of primary sources 
and variation in the degree and quality of annotation, UniParc was 
created; it is designed to capture all available protein sequence data 
from sources such as DDBJ/EMBL/GenBank, UniProt/Swiss-Prot, 
UniProt/TrEMBL, PIR-PSD, Ensembl, International Protein Index (IPI), 
PDB, RefSeq, FlyBase, WormBase, H-Inv, TROME, European Patent Office, 
United States Patent and Trademark Office and Japan Patent Office. This 
combination of sources makes UniParc the most comprehensive, publicly 
accessible, non-redundant protein sequence database available. UniParc 
represents each protein sequence once and only once, assigning it a 
unique UniParc identifier. UniParc cross-references the accession 
numbers of the source databases, providing sequence versions that are 
incremented in the usual fashion. Status flags are used to indicate the 
status of the entry in the original source database, with "active" 
indicating that the entry is still present in the source database and 
"obsolete" indicating that the entry no longer exists in the source 
database. UniParc is intended for tracking the current status and 
history of all proteins; sequence similarity search is the most reliable 
method for such retrieval. UniParc records carry no annotation, but 
annotation can be found in the UniProt Knowledgebase.
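
The UniParc bookkeeping described above (each sequence stored once under 
a unique identifier, with cross-references that carry a sequence version 
and an "active" or "obsolete" status) can be sketched as a small Python 
data structure. This is a minimal sketch under stated assumptions, not 
the UniParc implementation: the class names, the TOY identifier format 
and the rule that a changed source sequence retires the old 
cross-reference and increments the version are all assumptions made for 
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CrossReference:
    database: str
    accession: str
    version: int = 1
    status: str = "active"  # "active" or "obsolete"


@dataclass
class ArchiveEntry:
    identifier: str  # hypothetical format, not a real UniParc identifier
    sequence: str
    cross_references: list = field(default_factory=list)


class ToyArchive:
    """Stores every distinct sequence exactly once (illustrative only)."""

    def __init__(self):
        self._by_sequence = {}  # sequence -> ArchiveEntry
        self._current = {}      # (database, accession) -> (entry, xref)

    def load(self, database, accession, sequence):
        entry = self._by_sequence.get(sequence)
        if entry is None:
            identifier = f"TOY{len(self._by_sequence) + 1:06d}"
            entry = ArchiveEntry(identifier, sequence)
            self._by_sequence[sequence] = entry
        version = 1
        previous = self._current.get((database, accession))
        if previous is not None:
            old_entry, old_xref = previous
            if old_entry is entry:
                return entry  # source record unchanged, nothing to do
            # Assumption: a changed sequence retires the old cross-reference
            # and the new one gets the next sequence version.
            old_xref.status = "obsolete"
            version = old_xref.version + 1
        xref = CrossReference(database, accession, version)
        entry.cross_references.append(xref)
        self._current[(database, accession)] = (entry, xref)
        return entry

    def retire(self, database, accession):
        """Mark a source record that no longer exists as obsolete."""
        _, xref = self._current.pop((database, accession))
        xref.status = "obsolete"


# Made-up accessions and sequences.
archive = ToyArchive()
a = archive.load("Swiss-Prot", "ACC1", "MKTAYIAK")
b = archive.load("PIR-PSD", "ACC2", "MKTAYIAK")      # same sequence, same entry
c = archive.load("Swiss-Prot", "ACC1", "MKTAYIAKQ")  # sequence changed at source
```

Here the first two loads share one archive entry because the sequences 
are identical, while the third load creates a second entry and leaves an 
obsolete cross-reference on the first, so the history of ACC1 remains 
traceable.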

Additional information about UniProt databases can be obtained from 

UniProt Release 2.0

Database                    Entries
UniProt                     1,487,788
  UniProt/Swiss-Prot 44.0     153,871
  UniProt/TrEMBL 27.0       1,333,917
UniRef100                   1,306,318
UniRef90                      816,857
UniRef50                      465,394
UniParc                     3,863,370
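
As a quick consistency check on the figures above, the short Python 
sketch below confirms that the UniProt/Swiss-Prot and UniProt/TrEMBL 
counts add up to the UniProt total, and expresses the UniRef90 and 
UniRef50 sizes as a share of UniRef100. The variable and function names 
are illustrative only; the numbers are copied from this release.

```python
# Entry counts copied from the release statistics above.
swiss_prot = 153_871
trembl = 1_333_917
knowledgebase_total = 1_487_788
uniref = {"UniRef100": 1_306_318, "UniRef90": 816_857, "UniRef50": 465_394}


def share_of(count, reference):
    """Return count as a fraction of reference."""
    return count / reference


# The two Knowledgebase sections should account for every UniProt entry.
sections_match = swiss_prot + trembl == knowledgebase_total
print("Swiss-Prot + TrEMBL equals UniProt total:", sections_match)

# Size of each UniRef set relative to UniRef100.
for name, count in uniref.items():
    print(f"{name}: {share_of(count, uniref['UniRef100']):.1%} of UniRef100")
```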


UniProt Knowledgebase - Please read the UniProt/Swiss-Prot and 
UniProt/TrEMBL release notes (http://expasy.org/sprot/relnotes/ and 
ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt) and the recent 
changes webpage (http://expasy.org/sprot/relnotes/sp_news.html).

UniRef - The current UniRef100 database combines identical sequences and 
sub-fragments from any organism into a single UniRef entry. Prior to 
Release 1.8, these sequences were combined only if they were derived 
from the same species. The new DTD is available at 

You can read about forthcoming changes at 

A detailed description of how to link to UniProt entries can be found at 

We are constantly trying to improve the accuracy and representation of 
our databases, so we consider your feedback 
(http://www.uniprot.org/support/feedback.shtml) extremely valuable. 
Please contact us if you have any questions 
(http://www.uniprot.org/support/helpdesk.shtml) or comments 
(http://www.uniprot.org/support/feedback.shtml). You can also subscribe 
(http://www.uniprot.org/support/alerts.shtml) to e-mail alerts for the 
latest information on UniProt databases.

UniProt is supported mainly by the National Institutes of Health (NIH) 
grant U01 HG02712. Minor support for the EBI’s involvement in UniProt 
comes from two European Union contracts, BioBabel (QLRT-2000-00981) 
and TEMBLOR (QLRI-2001-00015), and from NIH grant R01 HG02273. 
Swiss-Prot activities at the SIB are supported by the Swiss Federal 
Government through the Federal Office of Education and Science. PIR 
activities are also supported by the National Science Foundation (NSF) 
grants DBI-0138188 and ITR-0205470.

UniProt is available for both commercial and non-commercial use. Please 
see http://www.uniprot.org/terms.shtml for details.
