Responses to genome project query

Paul D. Boyer pboyer at hsc.usc.edu
Thu Sep 16 18:49:24 EST 1993


This is a compilation of the responses I received by e-mail from my question
regarding the current status of the genome project.  Thanks again to everyone 
who volunteered their help, and I hope this is as helpful to those who asked 
for it...

From: Bill Pearson <wrp at cyclops.micr.virginia.edu>

     Lambda was determined almost 10 years ago.  The largest complete
sequences now available are  yeast chromosome III (170 Kb) about two
years ago and several large DNA viruses (CMV, 300K, for example).

     40% of E. coli has been sequenced in about 400 pieces.  The
largest pieces tend to be about 200 Kb.  There is an article about the
project in a recent ASM News.

     Progress on the human genome depends on who you talk to and
how you define progress.  The Expressed Sequence Tag projects are
producing very large numbers of (low quality) sequences - more than
20,000. If there are 100,000 genes, then we are already 20% done.
However, high quality genomic DNA sequence is a much smaller fraction;
someone else will have to tell you how much.  There are 25 million bases of
primate sequence in genbank, and probably 90% of that is human.
2.5*10^7/3*10^9 = 0.8%, or roughly 1%, sequenced.
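
The two estimates above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this message; the variable names are illustrative, and the 100,000-gene count and 90% human share are the assumptions stated above, not measured values.

```python
# Figures quoted in the message above (1993 estimates).
est_sequences = 20_000               # Expressed Sequence Tags produced so far
assumed_gene_count = 100_000         # assumed number of human genes
primate_bases_in_genbank = 25_000_000
assumed_human_share = 0.90           # "probably 90% of that is human"
genome_size_bases = 3_000_000_000    # approximate human genome size

# Fraction of genes touched by at least one (low quality) EST.
est_fraction = est_sequences / assumed_gene_count

# Fraction of the genome covered, counting all primate bases as in the text.
genomic_fraction_all_primate = primate_bases_in_genbank / genome_size_bases

# Same estimate, counting only the share assumed to be human.
genomic_fraction_human = (
    primate_bases_in_genbank * assumed_human_share / genome_size_bases
)

print(f"ESTs vs. genes:         {est_fraction:.1%}")
print(f"Genomic (all primate):  {genomic_fraction_all_primate:.2%}")
print(f"Genomic (human only):   {genomic_fraction_human:.2%}")
```

Either way, the genomic figure comes out a little under 1%.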

Bill Pearson

=======================================================

From: toms at ncifcrf.gov

Paul:  Lambda was sequenced YEARS ago!  It's in GenBank under accession number
J02459.  You can have it emailed to you by sending a message like this:

>From toms Sun Sep  5 18:35:01 1993
To: retrieve at ncbi.nlm.nih.gov
Content-Length: 52

MAXLINES 1000000
DATALIB genbank
BEGIN
J02459 [acc]

(Unfortunately you will get two copies - duplication is a terrible problem with
the databases, I'll have to investigate this...)

For further information, you probably can send a message to the same
server with:

help

on a line by itself.
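
If you want to request several accession numbers, the message body shown above is easy to generate. The following is a minimal sketch that only builds the text of the request and sends nothing; the function name and its parameters are hypothetical, and the MAXLINES, DATALIB and BEGIN lines are simply copied from the example message.

```python
def build_retrieve_request(accession, datalib="genbank", maxlines=1000000):
    """Return the body of a retrieval request for one accession number.

    The layout mirrors the example message above; this helper is an
    illustration, not part of any server software.
    """
    lines = [
        f"MAXLINES {maxlines}",
        f"DATALIB {datalib}",
        "BEGIN",
        f"{accession} [acc]",
    ]
    return "\n".join(lines) + "\n"


# Body for the lambda sequence mentioned above.
print(build_retrieve_request("J02459"), end="")
```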

Anyway, Kenn Rudd is collecting the entire E. coli sequence - more than 50% is
now done!  His address is rudd at bio.nlm.nih.gov.  He may have a sense of how
much human sequence is done, but I'm pretty sure it's not a large fraction yet.

To see one way that one can use the huge amounts of data, you could check out
our recent paper:

@article{Stephens.Schneider.Splice,
author = "R. M. Stephens
  and T. D. Schneider",
title = "Features of spliceosome evolution and function
inferred from an analysis of the information at human splice sites",
journal = "J. Mol. Biol.",
volume = "228",
pages = "1124-1136",
year = "1992"}

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov

=======================================================

From: gwilliam at mrc-crc.ac.uk (Gary Williams x3294)

Here's something that always amazes me....

The sizes of the gene sequence databases are increasing exponentially -
doubling about every 18 months.

European Molecular Biology Labs nucleotide database
# Source: EMBL release notes 'relnotes.doc'

# Month       Entries   Nucleotides
# -------     -------   -----------
06/1982         568        585433
04/1983         811       1114447
12/1983        1481       1654863
08/1984        1698       2147205
04/1985        2378       2874493
08/1985        4835       4567592
12/1985        5789       5622638
04/1986        6395       6353040
09/1986        7630       7813214
12/1986        8817       9766948
04/1987       11621      12189783
07/1987       12706      13638061
10/1987       14397      16023478
01/1988       15344      17272160
05/1988       17961      20318442
08/1988       19592      22625941
11/1988       20695      24211054
02/1989       22938      27249830
05/1989       24365      29066676
08/1989       26223      31240948
11/1989       28679      34748087
02/1990       31508      38165786
05/1990       34902      42923803
08/1990       37784      47354438
11/1990       41580      52900354
02/1991       43745      55859549
05/1991       46871      59915244
09/1991       54558      70448052
12/1991       57765      75400487
03/1992       63378      83574342
06/1992       72481      94390065
09/1992       79377     101292310
12/1992       89100     111413979
03/1993       99591     121420828
06/1993      108973     131880111
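
The doubling time can be checked against the first and last rows of the table above. This is a rough sketch that assumes steady exponential growth between those two releases; the function name is illustrative.

```python
import math


def doubling_time_months(start, end, months_elapsed):
    """Doubling time implied by growth from `start` to `end`,
    assuming steady exponential growth over `months_elapsed` months."""
    return months_elapsed * math.log(2) / math.log(end / start)


# First (06/1982) and last (06/1993) rows of the EMBL table above.
months = (1993 - 1982) * 12

nucleotide_doubling = doubling_time_months(585433, 131880111, months)
entry_doubling = doubling_time_months(568, 108973, months)

print(f"Nucleotides double about every {nucleotide_doubling:.1f} months")
print(f"Entries double about every {entry_doubling:.1f} months")
```

Both figures come out at roughly 17 months, in line with the "about every 18 months" quoted above.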


European Molecular Biology Labs protein sequence database
# Swissprot sizes. 
# Source: Swissprot release notes 'relnotes.doc'
#
# Release Date     Entries   Amino acids
# ------- -----    -------   -----------
3.0       11/86      4160        969641
4.0       04/87      4387       1036010
5.0       09/87      5205       1327683
6.0       01/88      6102       1653982
7.0       04/88      6821       1885771
8.0       08/88      7724       2224465
9.0       11/88      8702       2498140
10.0      03/89     10008       2952613
11.0      07/89     10856       3265966
12.0      10/89     12305       3797482
13.0      01/90     13837       4347336
14.0      04/90     15409       4914264
15.0      08/90     16941       5486399
16.0      11/90     18364       5986949
17.0      02/91     20024       6524504
18.0      05/91     20772       6792034
19.0      08/91     21795       7173785
20.0      11/91     22654       7500086
21.0      03/92     23742       7866596
22.0      05/92     25044       8375696
23.0      08/92     26706       9011391
24.0      12/92     28154       9545427
25.0      04/93     29955      10214020
26.0      07/93     31808      10875091


USA nucleotide sequence database
Genbank Database sizes

This info is available for anonymous FTP from genbank.bio.net in
pub/db/genbank.stats.


Release Date    Entries Bases

3       Dec-82  606     680338
14      Nov-83  2427    2274029
20      May-84  3665    3002088
24      Sep-84  4135    3323270
26      Nov-84  4393    3689752
32      May-85  4954    4311931
36      Sep-85  5700    5204420
40      Feb-86  6642    5925429
42      May-86  7416    6765476
44      Aug-86  8823    8442357
46      Nov-86  9978    9615371
48      Feb-87  10913   10961380
50      May-87  12534   13048473
52      Aug-87  14020   14855145
54      Dec-87  15465   16752872
55      Mar-88  17047   19156002
56      Jun-88  18226   20795279
57      Sep-88  19044   22019698
58      Dec-88  21248   24690876
59      Mar-89  22479   26382491
60      Jun-89  26317   31808784
61      Sep-89  28791   34762585
62      Dec-89  31229   37183950
63      Mar-90  33377   40127752
64      Jun-90  35100   42495893
65      Sep-90  39533   49179285
66      Dec-90  41057   51306092
67      Mar-91  43903   55169276
68      Jun-91  51418   65868799
69      Sep-91  55631   71947426
70      Dec-91  58952   77337678
71      Mar-92  65100   83894652
72      Jun-92  71280   92160761
77      Jun-93  120134  138904393

Great strides are being made in sequencing the following organisms:

E. coli, Brewer's Yeast (Saccharomyces - the whole of chromosome III of
this has been sequenced), Nematode, Puffer fish, Mouse, Pig, Human,
Thale cress (Arabidopsis thaliana), Rice.

There are probably many other smaller sequencing projects that I don't
know of. 

There are an awful lot of small bits of the human genome that have been
sequenced.  It's very difficult at present to give an accurate figure to
the number of genes sequenced or the proportion of the genome sequenced,
but there are 21748 human entries in the nucleotide databases at present
 - this will be an overestimate of the number of genes sequenced because
some genes will have been sequenced several times and some genes will be
present as fragments in different entries that have not been edited
together yet. But it's a good starting figure. Try halving it to get a 
good estimate of the number of human genes sequenced.


Regards,
Gary Williams

Computing Services Section,            Janet:       G.Williams at UK.AC.MRC.HGMP
MRC Human Genome Mapping Project,      Internet:    G.Williams at HGMP.MRC

