dbEST passes 50,000 sequence mark

Mark Boguski boguski at CASTALIA.NLM.NIH.GOV
Wed Sep 28 13:40:45 EST 1994


The number of public cDNA sequences ("Expressed Sequence Tags" or
ESTs) recently exceeded the 50,000 mark*, and we wanted to assess the
usefulness of this resource for gene discovery.  We therefore compiled a
list of 32 human disease genes that had been cloned as of August 1994 by
either the positional cloning or positional candidate methods (1) and
searched dbEST, the database of expressed sequence tags (3), for
sequence homologs (2).  Thirty-eight percent of these human genes
had exact and often multiple matches in dbEST, and an additional 47%
were represented by homologs in other organisms.**  Only five human
disease genes had no convincing matches with ESTs.  Thus for 85% of the
human disease genes positionally cloned to date, there is a homologous
partial cDNA sequence in the public domain.
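
The percentages above can be checked against the gene counts with a
few lines of arithmetic.  The sketch below is only illustrative: the
text gives the total (32) and the number of unmatched genes (5), while
the counts of 12 and 15 genes are our assumption, inferred from the
rounded figures of 38% and 47%.

```python
# Stated in the text.
TOTAL_GENES = 32
NO_MATCH = 5

# Assumed counts (not stated in the text), inferred from 38% and 47% of 32.
EXACT_MATCH = 12
HOMOLOG_ONLY = 15


def percent(count, total=TOTAL_GENES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# The three groups must account for every gene on the list.
assert EXACT_MATCH + HOMOLOG_ONLY + NO_MATCH == TOTAL_GENES

exact_pct = percent(EXACT_MATCH)
homolog_pct = percent(HOMOLOG_ONLY)
represented_pct = percent(EXACT_MATCH + HOMOLOG_ONLY)

print(f"exact matches:     {exact_pct:.1f}%")
print(f"homologs only:     {homolog_pct:.1f}%")
print(f"represented total: {represented_pct:.1f}%")
```

Under these assumed counts, the 85% figure is the sum of the two
rounded percentages (38 + 47); the unrounded fraction, 27 of 32 genes,
is slightly lower.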

	These results underscore the utility of "single pass," tag/survey
cDNA sequencing (4) and demonstrate that much valuable information is
already present in the public databases if one knows how to find it (2).
These results also underscore the value of "model organisms" for
accelerating progress in the identification of human genes by homology,
an explicit goal of the U.S. Genome Program (5).  If one is searching for
exons in human genomic DNA, a statistically significant match to a
cDNA, whether it be from humans, nematodes, rice, maize or yeast, is the
best proof (apart from an experiment) that an exon has been found.

	dbEST may be searched using the BLAST (2) e-mail or network
services, and full reports on individual ESTs may be obtained via NCBI's
retrieve e-mail server (6).  The capability of retrieving ESTs based on
their chromosome assignment and map location has recently been
implemented.  Instructions on submitting new sequence and mapping
data are available (6).  World Wide Web access is also provided at
http://www.ncbi.nlm.nih.gov/.  An NCSA Mosaic interface (7) allows
complex (Boolean) queries of dbEST to be performed.
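
As an illustration of the e-mail interface, the sketch below builds
(but does not send) the documentation request described in note 6: a
message whose body has 'datalib dbest' on one line followed by 'help'
on the next.  The retrieve address is the one given in note 6, written
here with "@"; the function name and the sender address are
hypothetical, and no Subject line is set because the notes do not
mention one.

```python
from email.message import EmailMessage

# Address of the retrieve server as given in note 6.
RETRIEVE_ADDRESS = "retrieve@ncbi.nlm.nih.gov"


def build_help_request(sender, datalib=None):
    """Build a help request for the retrieve server (hypothetical helper).

    If datalib is given, a 'datalib <name>' line is placed before 'help',
    as note 6 describes for dbEST-specific documentation.
    """
    lines = []
    if datalib is not None:
        lines.append(f"datalib {datalib}")
    lines.append("help")

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = RETRIEVE_ADDRESS
    msg.set_content("\n".join(lines) + "\n")
    return msg


# The sender address is a placeholder; the message is only printed here.
msg = build_help_request("user@example.org", datalib="dbest")
print(msg.get_content())
```

Handing the resulting message to a mail transport is left out on
purpose, since the sketch is meant only to show the layout of the
request body.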

Mark S. Boguski, Carolyn M. Tolstoshev
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bldg. 38A,
8600 Rockville Pike,
Bethesda, MD  20894, USA

Douglas E. Bassett, Jr.
Johns Hopkins University
School of Medicine,
725 North Wolfe Street,
Baltimore, MD 21205, USA

*dbEST release 2.27 contained 50,214 DNA sequences from 22 different 
organisms.  Information on the current release is available via the
World Wide Web at http://www.ncbi.nlm.nih.gov/dbEST/index.html.
**A detailed summary of these homologies with dbEST sequences is
available in Postscript, GIF and HTML formats on the dbEST Home Page 
at the URL specified above.  We thank Dan Jacobson for instructing us on 
how to provide the HTML links to OMIM entries (McKusick, V.  Online 
Mendelian Inheritance in Man.  The Johns Hopkins University, 
Baltimore, MD).

References and Notes

1.	A. Ballabio, Nature Genet. 3, 277-279 (1993).

2.	S. F. Altschul, M. S. Boguski, W. Gish, J. C. Wootton, Nature Genet. 
6, 119-129 (1994).  The TBLASTN program is essential for EST homology 
searching.  TBLASTN takes a protein query sequence and compares it 
against conceptual translations of DNA sequences in all six reading 
frames.  This is much more sensitive than nucleotide vs. nucleotide 
comparisons for detecting more distant, cross-phylum relationships (D. J.
States, S. F. Altschul, Methods 3, 66-70 (1991)).  Indeed, most of the
homologs representing inexact matches would not have been detected by 
searching GenBank for nucleotide sequence similarities alone.

3.	M. S. Boguski, T. M. J. Lowe, C. M. Tolstoshev, Nature Genet. 4,
332-333 (1993).  Although all dbEST sequences are also present in the EST 
Division of GenBank (D. Benson, D. J. Lipman, J. Ostell, Nucl. Acids Res.
21, 2963-2965 (1993)), dbEST contains additional value-added annotation
such as the latest homologies, mapping data and contact information for 
obtaining physical DNA clones.  Note that in addition to cDNA data, dbEST 
contains some genomic sequences that have been isolated by exon 
"trapping" or "amplification" (e.g. A.J. Buckler, et al.  Proc. Natl. Acad. 
Sci. USA 88, 4005-4009 (1991)).

4.	M. D. Adams, et al., Science 252, 1651-1656 (1991); A. S. Khan, et al.,
Nature Genet. 2, 180-185 (1992); K. Okubo, et al., Nature Genet. 2, 173-179 
(1992); R. Waterston, et al., Nature Genet. 1, 114-123 (1992).

5.	F. Collins, D. Galas, Science 262, 43-46 (1993).

6.	The e-mail address for BLAST is blast at ncbi.nlm.nih.gov and the
address for database records is retrieve at ncbi.nlm.nih.gov.  To receive
documentation, send a message containing the word 'help' (unquoted) in
the body of the message.  For specific information on dbEST, place the
instruction 'datalib dbest' (unquoted) on a line preceding 'help'.  For
information on the BLAST network service, send e-mail to
blast-help at ncbi.nlm.nih.gov.  For information on submitting data, send
e-mail to info at ncbi.nlm.nih.gov.  For other questions, telephone
301-496-2475 and ask for the service desk.

7.	B.R. Schatz, J.B. Hardin, Science 265, 895-901 (1994).
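
The six-frame conceptual translation that note 2 attributes to TBLASTN
can be sketched in a few lines.  This is only a toy illustration under
stated assumptions: it uses the standard genetic code, translates
unrecognized codons as 'X', and looks for an exact peptide substring,
whereas TBLASTN itself performs scored local alignment.  All function
names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate dna from offset, codon by codon; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame (+1..+3 forward, -1..-3 reverse strand) to its translation."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rev, offset)
    return frames


def frames_containing(peptide, dna):
    """Toy stand-in for a protein-vs-DNA search: exact substring match only."""
    return sorted(frame for frame, protein in six_frame_translations(dna).items()
                  if peptide in protein)


example = "ATGGCCAAGTAA"  # hypothetical sequence, not taken from dbEST
for frame, protein in sorted(six_frame_translations(example).items()):
    print(f"{frame:+d}: {protein}")
print(frames_containing("MAK", example))
```

Comparing at the protein level in this way is what lets a query find
ESTs whose nucleotide sequences have diverged but whose encoded
proteins remain similar.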
