Responses to genome project query
Paul D. Boyer
pboyer at hsc.usc.edu
Thu Sep 16 18:49:24 EST 1993
This is a compilation of the responses I received by e-mail from my question
regarding the current status of the genome project. Thanks again to everyone
who volunteered their help, and I hope this is as helpful to those who asked
for it...
From: Bill Pearson <wrp at cyclops.micr.virginia.edu>
Lambda was determined almost 10 years ago. The largest complete
sequences now available are yeast chromosome III (170 Kb) about two
years ago and several large DNA viruses (CMV, 300K, for example).
40% of E. coli has been sequenced in about 400 pieces. The
largest pieces tend to be about 200 Kb. There is an article about the
project in a recent ASM News.
Progress on the human genome depends on who you talk to and
how you define progress. The Expressed Sequence Tag projects are
producing very large numbers of (low quality) sequences - more than
20,000. If there are 100,000 genes, then we are already 20% done.
However, high quality genomic DNA sequence is a much smaller fraction;
someone else will have to tell you the exact figure. There are 25 million
bases of primate sequence in GenBank, and probably 90% of that is human.
2.5*10^7 / (3*10^9) = about 0.8%, i.e. roughly 1% sequenced.
Bill Pearson
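
The two estimates in the message above can be checked with a few lines of
Python. This is only a sketch: the 100,000-gene count and the 3*10^9 base
genome size are the working figures quoted in the message, and the function
name is purely illustrative.

```python
def fraction_done(done, total):
    """Return done/total expressed as a percentage."""
    return 100.0 * done / total

# Expressed Sequence Tags against the assumed number of human genes.
est_pct = fraction_done(20_000, 100_000)

# All primate bases in GenBank against the assumed human genome size.
primate_pct = fraction_done(2.5e7, 3e9)

# Same calculation, counting only the ~90% of primate sequence said to be human.
human_pct = fraction_done(0.9 * 2.5e7, 3e9)

print(f"ESTs vs. assumed gene count:   {est_pct:.0f}%")
print(f"Primate bases vs. genome size: {primate_pct:.2f}%")
print(f"Human share (90%) vs. genome:  {human_pct:.2f}%")
```

Both genomic figures come out a little under the rounded 1% given in the
message.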
=======================================================
From: toms at ncifcrf.gov
Paul: Lambda was sequenced YEARS ago! It's in GenBank under accession number
J02459. You can have it emailed to you by sending a message like this:
From toms Sun Sep 5 18:35:01 1993
To: retrieve at ncbi.nlm.nih.gov
Content-Length: 52
MAXLINES 1000000
DATALIB genbank
BEGIN
J02459 [acc]
(Unfortunately you will get two copies - duplication is a terrible problem with
the databases, I'll have to investigate this...)
For further information, you probably can send a message to the same
server with:
help
on a line by itself.
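
For anyone scripting such requests, the message body shown above can be
assembled as plain text. The sketch below only formats the four lines from
the example; it does not send mail, and the function name and its defaults
are hypothetical. Whether the server still accepts this format is not
something the code can confirm.

```python
def build_retrieve_request(accession, datalib="genbank", maxlines=1_000_000):
    """Build the plain-text body of a retrieve-server query for one accession.

    The layout copies the example message: MAXLINES, DATALIB, BEGIN,
    then the accession number followed by the [acc] field tag.
    """
    lines = [
        f"MAXLINES {maxlines}",
        f"DATALIB {datalib}",
        "BEGIN",
        f"{accession} [acc]",
    ]
    # One trailing newline so every line, including the last, is terminated.
    return "\n".join(lines) + "\n"

body = build_retrieve_request("J02459")
print(body, end="")
print(f"Body length: {len(body)} bytes")
```

With the defaults, the body is 52 bytes long, which agrees with the
Content-Length header in the example message.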
Anyway, Kenn Rudd is collecting the entire E. coli sequence - more than 50% is
now done! His address is rudd at bio.nlm.nih.gov. He may have a sense for how
much human sequence is done, but I'm pretty sure it's not a large fraction yet.
To see one way that one can use the huge amounts of data, you could check out
our recent paper:
@article{Stephens.Schneider.Splice,
author = "R. M. Stephens
and T. D. Schneider",
title = "Features of spliceosome evolution and function
inferred from an analysis of the information at human splice sites",
journal = "J. Mol. Biol.",
volume = "228",
pages = "1124-1136",
year = "1992"}
Tom Schneider
National Cancer Institute
Laboratory of Mathematical Biology
Frederick, Maryland 21702-1201
toms at ncifcrf.gov
=======================================================
From: gwilliam at mrc-crc.ac.uk (Gary Williams x3294)
Here's something that always amazes me....
The sizes of the gene sequence databases are increasing exponentially -
doubling about every 18 months
European Molecular Biology Labs nucleotide database
# Source: EMBL release notes 'relnotes.doc'
# Month Entries Nucleotides
# ------- ------- -----------
06/1982 568 585433
04/1983 811 1114447
12/1983 1481 1654863
08/1984 1698 2147205
04/1985 2378 2874493
08/1985 4835 4567592
12/1985 5789 5622638
04/1986 6395 6353040
09/1986 7630 7813214
12/1986 8817 9766948
04/1987 11621 12189783
07/1987 12706 13638061
10/1987 14397 16023478
01/1988 15344 17272160
05/1988 17961 20318442
08/1988 19592 22625941
11/1988 20695 24211054
02/1989 22938 27249830
05/1989 24365 29066676
08/1989 26223 31240948
11/1989 28679 34748087
02/1990 31508 38165786
05/1990 34902 42923803
08/1990 37784 47354438
11/1990 41580 52900354
02/1991 43745 55859549
05/1991 46871 59915244
09/1991 54558 70448052
12/1991 57765 75400487
03/1992 63378 83574342
06/1992 72481 94390065
09/1992 79377 101292310
12/1992 89100 111413979
03/1993 99591 121420828
06/1993 108973 131880111
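
The "doubling about every 18 months" figure can be checked against the first
and last rows of the table above. The sketch below assumes steady exponential
growth between those two releases (06/1982 and 06/1993, 132 months apart);
the function name is illustrative only.

```python
import math

def doubling_time_months(start, end, months_elapsed):
    """Doubling time under steady exponential growth from start to end."""
    doublings = math.log(end / start) / math.log(2)
    return months_elapsed / doublings

# Nucleotide counts from the 06/1982 and 06/1993 EMBL rows.
embl_months = doubling_time_months(585_433, 131_880_111, 132)
print(f"EMBL nucleotide doubling time: {embl_months:.1f} months")
```

This comes out at roughly 17 months, in line with the stated estimate.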
European Molecular Biology Labs protein sequence database
# Swissprot sizes.
# Source: Swissprot release notes 'relnotes.doc'
#
# Release Date No. entries No. amino acids
3.0 11/86 4160 969641
4.0 04/87 4387 1036010
5.0 09/87 5205 1327683
6.0 01/88 6102 1653982
7.0 04/88 6821 1885771
8.0 08/88 7724 2224465
9.0 11/88 8702 2498140
10.0 03/89 10008 2952613
11.0 07/89 10856 3265966
12.0 10/89 12305 3797482
13.0 01/90 13837 4347336
14.0 04/90 15409 4914264
15.0 08/90 16941 5486399
16.0 11/90 18364 5986949
17.0 02/91 20024 6524504
18.0 05/91 20772 6792034
19.0 08/91 21795 7173785
20.0 11/91 22654 7500086
21.0 03/92 23742 7866596
22.0 05/92 25044 8375696
23.0 08/92 26706 9011391
24.0 12/92 28154 9545427
25.0 04/93 29955 10214020
26.0 07/93 31808 10875091
USA nucleotide sequence database
Genbank Database sizes
This info is available for anonymous FTP from genbank.bio.net in
pub/db/genbank.stats.
Release Date Entries Bases
3 Dec-82 606 680338
14 Nov-83 2427 2274029
20 May-84 3665 3002088
24 Sep-84 4135 3323270
26 Nov-84 4393 3689752
32 May-85 4954 4311931
36 Sep-85 5700 5204420
40 Feb-86 6642 5925429
42 May-86 7416 6765476
44 Aug-86 8823 8442357
46 Nov-86 9978 9615371
48 Feb-87 10913 10961380
50 May-87 12534 13048473
52 Aug-87 14020 14855145
54 Dec-87 15465 16752872
55 Mar-88 17047 19156002
56 Jun-88 18226 20795279
57 Sep-88 19044 22019698
58 Dec-88 21248 24690876
59 Mar-89 22479 26382491
60 Jun-89 26317 31808784
61 Sep-89 28791 34762585
62 Dec-89 31229 37183950
63 Mar-90 33377 40127752
64 Jun-90 35100 42495893
65 Sep-90 39533 49179285
66 Dec-90 41057 51306092
67 Mar-91 43903 55169276
68 Jun-91 51418 65868799
69 Sep-91 55631 71947426
70 Dec-91 58952 77337678
71 Mar-92 65100 83894652
72 Jun-92 71280 92160761
77 Jun-93 120134 138904393
Great strides are being made in sequencing the following organisms:
E. coli, Brewer's yeast (Saccharomyces - the whole of chromosome III of
this has been sequenced), Nematode, Puffer fish, Mouse, Pig, Human,
Thale cress (Arabidopsis thaliana), Rice.
There are probably many other smaller sequencing projects that I don't
know of.
There are an awful lot of small bits of the human genome that have been
sequenced. It's very difficult at present to give an accurate figure to
the number of genes sequenced or the proportion of the genome sequenced,
but there are 21748 human entries in the nucleotide databases at present
- this will be an overestimate of the number of genes sequenced because
some genes will have been sequenced several times and some genes will be
present as fragments in different entries that have not been edited
together yet. But it's a good starting figure. Try halving it to get a
good estimate of the number of human genes sequenced.
Regards,
Gary Williams
Computing Services Section, Janet: G.Williams at UK.AC.MRC.HGMP
MRC Human Genome Mapping Project, Internet: G.Williams at HGMP.MRC