database statistics

USA::MJA12046 MJA12046%USA.decnet at USAV01.GLAXO.COM
Mon Sep 20 08:22:00 EST 1993


Hello,
	Last Monday I asked the net the following:

: Is there a file, in Gopherspace or wherever, that keeps the statistics
: of DNA and Protein Databases with regard to species?  It would be interesting
: to know what percentage of the E. coli genome has been sequenced, what
: percent of Genbank is human DNA, etc.

I received two replies.  The first was from Amos Bairoch (thanks!) regarding
SwissProt.  He informed me that the appendix to the SwissProt release notes
contains all sorts of information about the database.  Here is just a sample:

SwissProt database of Protein Sequences

        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        2454          Human
         2        2222          Escherichia coli
         3        1439          Mouse
         4        1339          Rat
         5        1220          Baker's yeast (Saccharomyces cerevisiae)
         6         634          Bovine
         7         560          Fruit fly (Drosophila melanogaster)
         8         477          Chicken
         9         454          Bacillus subtilis
        10         362          African clawed frog (Xenopus laevis)
        11         340          Salmonella typhimurium
        12         333          Rabbit
        13         298          Pig
        14         251          Vaccinia virus (strain Copenhagen)
        15         222          Maize
        16         193          Human cytomegalovirus (strain AD169)
        17         177          Arabidopsis thaliana (Mouse-ear cress)
                   177          Rice
        19         176          Vaccinia virus (strain WR)
        20         167          Bacteriophage T4
        21         161          Pea
        22         159          Tobacco
                   159          Wheat
        24         151          Pseudomonas aeruginosa
        25         142          Caenorhabditis elegans
        26         141          Fission yeast (Schizosaccharomyces pombe)
        27         133          Barley
        28         129          Staphylococcus aureus
        29         127          Spinach
        30         125          Soybean
        31         123          Sheep
        32         122          Slime mold (Dictyostelium discoideum)
        33         119          Marchantia polymorpha (Liverwort)
        34         118          Rhodobacter capsulatus
        35         115          Dog
        36         113          Pseudomonas putida
        37         110          Neurospora crassa
                   110          Klebsiella pneumoniae
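
A note on reading the table: the rows with no number in the first column
appear to be ties, sharing the rank of the row above, with the next rank
skipped (17, blank, 19). A minimal Python sketch of that numbering, using five
rows copied from the table; the function name and the `start` parameter are
my own, not anything from the SwissProt release notes:

```python
def competition_ranks(counts, start=1):
    """Rank counts sorted in descending order; equal counts share a rank
    and the following rank is skipped (1, 2, 2, 4 style)."""
    ranks = []
    for i, count in enumerate(counts):
        if i > 0 and count == counts[i - 1]:
            ranks.append(ranks[-1])   # tie: reuse the previous rank
        else:
            ranks.append(start + i)   # position in the list sets the rank
    return ranks

# Rows 15 to 19 of the table above (species, number of sequences).
sample = [
    ("Maize", 222),
    ("Human cytomegalovirus (strain AD169)", 193),
    ("Arabidopsis thaliana (Mouse-ear cress)", 177),
    ("Rice", 177),
    ("Vaccinia virus (strain WR)", 176),
]

ranks = competition_ranks([n for _, n in sample], start=15)
for rank, (species, n) in zip(ranks, sample):
    print(f"{rank:>4} {n:>6}  {species}")
```

Here Rice gets rank 17 along with Arabidopsis, and Vaccinia virus (strain WR)
follows at 19, which matches the table.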


Dennis Benson of GenBank replied (thanks) and told me that each GenBank release
includes a file (gbrel.txt) which lists the number of entries and bases for the
top twenty organisms (excluding chloroplast and mitochondrial sequences). Here
is that table from release 78:

  Entries      Bases   Species

  36990     28328775   Homo sapiens
  11115     10665461   Mus musculus
  4427       6634841   Rattus norvegicus
  2347       5371333   Saccharomyces cerevisiae
  2606       4571085   Escherichia coli
  2246       4391333   Drosophila melanogaster
  5123       4139634   Caenorhabditis elegans
  1710       2228362   Gallus gallus
  1392       1759777   Bos taurus
  2351       1639151   Arabidopsis thaliana
  3270       1503383   Human immunodeficiency virus type 1
  1021       1399704   Xenopus laevis
  972        1371412   Oryctolagus cuniculus
  519         970769   Bacillus subtilis
  771         907555   Influenza virus type A
  1254        873268   Plasmodium falciparum
  1522        864290   Oryza sativa
  525         859881   Zea mays
  354         689647   Schizosaccharomyces pombe
  509         685265   Sus scrofa
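
To get at my original question (what percent of GenBank is human DNA), here is
a rough Python sketch that turns the release 78 table above into percentages.
One big caveat: the total size of the release is not quoted in this post, so
the shares are relative to the sum of the twenty organisms listed, not to all
of GenBank. The names `parse_rows` and `base_shares` are just illustrative.

```python
# Rows copied from the release 78 table above: entries, bases, species.
GBREL_TOP20 = """\
36990 28328775 Homo sapiens
11115 10665461 Mus musculus
4427 6634841 Rattus norvegicus
2347 5371333 Saccharomyces cerevisiae
2606 4571085 Escherichia coli
2246 4391333 Drosophila melanogaster
5123 4139634 Caenorhabditis elegans
1710 2228362 Gallus gallus
1392 1759777 Bos taurus
2351 1639151 Arabidopsis thaliana
3270 1503383 Human immunodeficiency virus type 1
1021 1399704 Xenopus laevis
972 1371412 Oryctolagus cuniculus
519 970769 Bacillus subtilis
771 907555 Influenza virus type A
1254 873268 Plasmodium falciparum
1522 864290 Oryza sativa
525 859881 Zea mays
354 689647 Schizosaccharomyces pombe
509 685265 Sus scrofa
"""

def parse_rows(text):
    """Split each non-empty line into (entries, bases, species)."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Species names contain spaces, so split off only the two numbers.
        entries, bases, species = line.split(None, 2)
        rows.append((int(entries), int(bases), species.strip()))
    return rows

def base_shares(rows):
    """Percentage of the listed bases contributed by each species."""
    total = sum(bases for _, bases, _ in rows)
    return {species: 100.0 * bases / total for _, bases, species in rows}

rows = parse_rows(GBREL_TOP20)
shares = base_shares(rows)

# Print the five largest contributors among the listed organisms.
for species, pct in sorted(shares.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{species:<30} {pct:5.1f}%")
```

With the full release total from gbrel.txt in hand, dividing by that figure
instead of the top-twenty sum would give the true percentage of the database.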





