[DI] YPD Release 4.0

"Jim Garrels" jg at proteome.com
Sun Jun 25 22:11:44 EST 1995


Dear Bionet Readers,

Version 4.0 of the Yeast Protein Database (YPD) is now available by ftp from
isis.cshl.org at the Cold Spring Harbor Laboratory.  See directory
/pub/yeast/YPD.

YPD is a spreadsheet containing 11 categories of information (see below)
for each of the S. cerevisiae proteins of known sequence.  These include
sequences from genomic sequencing projects and all yeast GenBank sequences
through May 22, 1995.  There are currently 4046 entries in YPD.

A version of YPD is also available on the QUEST WWW server maintained by
Jerry Latter at the Cold Spring Harbor Lab.  The address is
http://siva.cshl.org.


NEW IN RELEASE 4.0.

Release 4.0 adds gene name fields for the names used in SWISS-PROT/LISTA
and in the Saccharomyces Genome Database (SGD).  The YPD names generally
agree with the SGD names.

Release 4.0 now includes CAI (codon adaptation index) as well as codon bias.
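
For readers unfamiliar with the measure: CAI, as commonly defined by Sharp
and Li (1987), is the geometric mean of the relative adaptiveness values w
of the codons in a gene, where w for a codon is its usage in a reference set
of highly expressed genes divided by the usage of the most-used synonymous
codon.  The following is a minimal Python sketch under that definition.  The
function name and the toy weight table are hypothetical; they are not the
code or the reference values used to build YPD.

```python
import math

def codon_adaptation_index(cds, weights):
    """Geometric mean of per-codon weights over a coding sequence.

    cds     -- coding DNA string read in frame from its first base
    weights -- dict mapping codon -> relative adaptiveness w (0 < w <= 1)
    Codons missing from `weights` (e.g. Met, Trp, stop) are skipped.
    """
    cds = cds.upper()
    logs = []
    # Walk the sequence codon by codon, ignoring any trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        w = weights.get(cds[i:i + 3])
        if w is not None:
            logs.append(math.log(w))
    if not logs:
        raise ValueError("no scorable codons")
    # exp(mean(log w)) is the geometric mean of the weights.
    return math.exp(sum(logs) / len(logs))

# Toy weights for illustration only (NOT a real yeast reference table).
toy_weights = {"GCT": 1.0, "GCC": 0.5, "AAG": 1.0, "AAA": 0.25}
print(codon_adaptation_index("ATGGCTGCCAAGAAA", toy_weights))
```

Working in log space avoids underflow when many small weights are multiplied
for a long gene.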

Release 4.0 uses a different method for calculation of isoelectric point.  The
pK values used are those determined from 2D gel analysis by Bjellqvist et al.,
Electrophoresis 14:1023-1031 (1993).
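
As an illustration of how an isoelectric point can be derived from a set of
pK values, the sketch below sums Henderson-Hasselbalch charges over the
ionizable groups and finds by bisection the pH at which the net charge is
zero.  The extra_charge argument is one possible way to model the "adding 1
positive charge" and "adding 1 negative charge" fields listed under
Calculated data; treating the added charge as a fixed offset is an
assumption.  The pK numbers are placeholders for illustration and should be
replaced with the published Bjellqvist et al. values.  All names are
hypothetical, and this is not the code used to build YPD.

```python
# Placeholder pK values for illustration only; substitute the published
# values before relying on any result.
PK_POS = {"Nterm": 7.5, "K": 10.0, "R": 12.0, "H": 6.0}           # +1 when protonated
PK_NEG = {"Cterm": 3.5, "D": 4.0, "E": 4.5, "C": 9.0, "Y": 10.0}  # -1 when deprotonated

def net_charge(seq, ph, extra_charge=0.0):
    """Net charge of a protein sequence at a given pH."""
    counts = {"Nterm": 1, "Cterm": 1}  # one free terminus of each kind
    for aa in seq.upper():
        counts[aa] = counts.get(aa, 0) + 1
    pos = sum(n / (1 + 10 ** (ph - PK_POS[g]))
              for g, n in counts.items() if g in PK_POS)
    neg = sum(n / (1 + 10 ** (PK_NEG[g] - ph))
              for g, n in counts.items() if g in PK_NEG)
    return pos - neg + extra_charge

def isoelectric_point(seq, extra_charge=0.0, tol=1e-4):
    """pH in [0, 14] where net charge crosses zero, found by bisection."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Net charge falls monotonically as pH rises.
        if net_charge(seq, mid, extra_charge) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

seq = "ACDEFGHIKLMNPQRSTVWY"
print(isoelectric_point(seq), isoelectric_point(seq, +1), isoelectric_point(seq, -1))
```

If the net charge never crosses zero inside the pH range, the result simply
converges to the nearer end of the range.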

Release 4.0 has extensive annotations for most proteins, found in the
YPD_FORMATTED file.


SPREADSHEET DATA STRUCTURE

The 11 categories of data and the data columns within each category
are listed here. A more complete description of each field is found in the
file YPD.doc which is available by ftp.

   a. Gene Names
      YPD gene name
      SWISS-PROT/LISTA gene name
      Saccharomyces Genome Database name
      Synonym list

   b. Technical flags (used primarily in database development)
      Identical sequence flags (entries with same number are identical)
      Closely-related sequence flags (entries with same number are related)
      Database source (0 = PIR, not GenBank; 1 = GenBank major release;
          2 = GenBank cumulative update; 3 = GenBank daily update)
      Systematic sequence flag (1 = derived from systematic sequencing)

   c. Calculated data
      Isoelectric point
      Isoelectric point after adding 1 positive charge
      Isoelectric point after adding 1 negative charge
      Molecular weight
      Codon bias
      Codon adaptation index

   d. Genetic data
      Chromosome number
      Presence or absence of intron in gene
      Knockout mutation (L = lethal, V = viable)

   e. Accession numbers
      GenBank
      PIR-International
      SWISS-PROT
      YEPD  (2D gel database numbers)

   f. Subcellular localization and functional classification
      Major localization category (nuclear, mitochondrial, etc.)
      Minor localization category (mitochondrial inner membrane, etc.)
      Molecular environment (integral membrane, DNA-associated, etc.)
      Functional classification (protein kinase, transcription factor, etc.)

   g. Post-translational modifications and length
      N-terminal modification (acetylation, myristoylation)
      C-terminal modification (farnesylation, geranylgeranylation, etc.)
      Phosphorylation
      N- or O-linked glycosylation
      N-terminal precursor length
      Mature protein length (in amino acids after removal of N- and
         C-terminal precursor peptides)

   h. Amino acid composition
      20 amino acid fields (number of residues in mature protein)
      Met-adjust field (1 indicates N-met is predicted but not known
         to be removed)

   i. Motifs
      Potential sites for phosphorylation by Cdc28 protein kinase
      Potential sites for phosphorylation by CKII protein kinase
      Potential sites for phosphorylation by PKA protein kinase
      Potential sites for N-linked glycosylation
      Potential transmembrane domains

   j. N- and C-terminal sequence fragments
      N-terminal sequence of precursor protein
      N-terminal sequence of mature protein
      C-terminal sequence of mature protein

   k. Protein name/description and references
      Protein name and descriptive phrases
      List of references (Refers to numbered references in YPD_REFS file)
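
The coded fields in the list above lend themselves to simple lookup tables
once the spreadsheet has been read into a program.  The following is a
minimal Python sketch.  The names are hypothetical, the gene names in the
example are made up, and representing an unflagged entry as None is an
assumption; the actual column layout is described in YPD.doc.

```python
from collections import defaultdict

# Code meanings copied from the field list above.
DATABASE_SOURCE = {
    0: "PIR, not GenBank",
    1: "GenBank major release",
    2: "GenBank cumulative update",
    3: "GenBank daily update",
}
KNOCKOUT = {"L": "lethal", "V": "viable"}

def group_by_flag(entries):
    """Group gene names by identical-sequence flag number.

    `entries` is an iterable of (gene_name, flag) pairs.  Entries sharing a
    flag number have identical sequences.  Pairs whose flag is None
    (assumed here to mean "no flag") are ignored.
    """
    groups = defaultdict(list)
    for name, flag in entries:
        if flag is not None:
            groups[flag].append(name)
    return dict(groups)

# Made-up gene names, for illustration only.
rows = [("GENE_A", 1), ("GENE_B", 1), ("GENE_C", None), ("GENE_D", 2)]
print(group_by_flag(rows))
print(DATABASE_SOURCE[2], "/", KNOCKOUT["L"])
```

The same grouping function works for the closely-related sequence flags,
since they follow the same "same number means same group" convention.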


YPD_FORMATTED (text file) DATA STRUCTURE

   In YPD_FORMATTED, data for each protein is presented as a formatted
     "data sheet".

   a. Self-explanatory form containing 34 of the fields from the
         spreadsheet.
   b. Annotations (phrases that elaborate on mutant phenotypes,
         protein associations, genetic interactions, etc.)
   c. Citations list with titles.


CONTENT SUMMARY (for 4046 proteins in Release 4.0)
      Includes: GenBank entries through May 22, 1995
             SWISS-PROT entries through May 23, 1995
             PIR-International entries from Release 43

   4046 TOTAL SEQUENCES
      3516 Sequences from systematic sequencing projects
      1951 Proteins characterized through genetics or biochemistry.  Most of
             these have meaningful mnemonic names.
       667 Proteins known only by homology to characterized proteins.  These
             proteins have descriptions such as "Protein with similarity to".
      1428 Proteins of unknown function.  Some of these contain known motifs
             but no extensive homology to known proteins.  These proteins
             have descriptions starting with "Protein of unknown function".
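
The three descriptive classes partition the database: 1951 + 667 + 1428 =
4046, and the first two together give the 2618 proteins used as the
denominator in part b.  Each percentage in the tables that follow is a count
divided by the relevant subtotal.  A short Python check, using only numbers
from this summary (the variable and function names are arbitrary):

```python
characterized, by_homology, unknown = 1951, 667, 1428

# The three classes add up to the total number of entries.
assert characterized + by_homology + unknown == 4046
# Part b uses proteins known by genetics, biochemistry or homology.
assert characterized + by_homology == 2618

def percent(count, total):
    """Percentage rounded to one decimal place, as in the tables."""
    return round(100.0 * count / total, 1)

print(percent(422, characterized))                # nuclear: 21.6
print(percent(334, characterized + by_homology))  # integral membrane: 12.8
```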

   a.  Of the 1951 proteins known from genetic or biochemical studies:
         422  (21.6%) Nuclear
         385  (19.7%) Cytoplasmic
         245  (12.6%) Mitochondrial
          91   (4.7%) Plasma membrane
          60   (3.1%) Endoplasmic reticulum
          50   (2.6%) Unspecified membrane
          47   (2.4%) Cytoskeletal
          34   (1.7%) Extracellular or cell wall
          29   (1.5%) Vacuolar
          22   (1.1%) Vesicles of secretory pathway
          23   (1.2%) Golgi
          17   (0.9%) Peroxisomal
         526  (27.0%) Unknown

        Note:  The unknown category contains many metabolic and housekeeping
            proteins that are likely to be cytoplasmic, but definitive studies
            on their localization are difficult to find.

         N-terminal modifications
             81  (4.2%) Known to be N-terminally acetylated
            102  (5.2%) Known to be N-terminally unmodified
              8  (0.4%) Known to be N-myristoylated
           1760 (90.2%) N-terminal status unknown

         C-terminal modifications
              9 (0.5%) Known to be farnesylated
             10 (0.5%) Known to be geranylgeranylated
             10 (0.5%) Known to have GPI anchors

         Phosphorylation
            132 (6.8%) Known to be phosphorylated

         Glycosylation
             49 (2.5%) Known to be N-glycosylated only
             18 (0.9%) Known to be O-glycosylated only
              4 (0.2%) Known to be N- and O-glycosylated

         Precursors
            233 (11.9%) Known to have N-terminal precursor peptide
            206 (10.6%) Known to have N-met removal only
             76  (3.9%) Known to have no precursor peptide and no N-met
                          removal


   b.  Of the 2618 proteins known by genetics, biochemistry or homology:
         By molecular environment
            334 (12.8%) Integral membrane
            316 (12.1%) DNA-associated (not necessarily direct DNA-binding)
            145  (5.5%) Ribosomal
             86  (3.3%) Peripheral membrane
             88  (3.4%) RNA-associated
             40  (1.5%) Protein synthesis factors
             15  (0.6%) Actin cytoskeleton-associated
             13  (0.5%) Tubulin cytoskeleton-associated

         By functional category
            114  (4.4%) Transcription factors
             90  (3.4%) Protein kinases
             62  (2.4%) Enzymes of amino acid metabolism
             43  (1.6%) GTPases
             31  (1.2%) Heat shock
             33  (1.3%) tRNA synthetases
             28  (1.1%) Protein phosphatases
             26  (1.0%) Proteases other than proteasome subunits
             21  (0.8%) Conserved ATPase domain family (SEC18/PAS1/SUG1/YME1)
             16  (0.6%) Enzymes of glucose metabolism
             20  (0.8%) Serine-alanine-rich proteins (Srp1/Tip1p family)
             16  (0.6%) Cyclins
             14  (0.5%) Proteasome components
             10  (0.4%) Ubiquitin-conjugating enzymes
              9  (0.3%) GTPase-activating proteins
              8  (0.3%) Guanine nucleotide exchange factors


These counts of proteins by category are not necessarily indicative of the
true abundance of yeast proteins in each category, because many proteins
are still uncharacterized or uncategorized.

Best of luck with YPD.  Feedback, corrections, comments, new data, etc.
are always welcome.

Sincerely,


Jim Garrels



---------------------------------------------------------------
James I. Garrels                        Tel (508) 922-1643
PROTEOME INC.                           FAX (508) 922-3971
181 Elliott St.,  Suite 909             Email jg at proteome.com
Beverly, MA 01915
---------------------------------------------------------------



