Sequence Annotation

T. Kaufman kaufman at
Tue Mar 28 12:46:01 EST 2000

I am posting this notice on behalf of the Drosophila Genome Sequencing
Consortium.  If you have any questions or comments please direct them to
the appropriate parties.  Do not reply to me.


Dear Drosophila Community Members:

The BDGP and Celera Genomics are happy to report the sequencing and
annotation of the euchromatic genome of Drosophila melanogaster. The
results appear in the March 24 issue of the journal Science.  The sequence
can be accessed from GenBank/EMBL/DDBJ or from the BDGP web site
and the annotations can be queried at the BDGP or at
FlyBase.

The sequence and its annotation are an extraordinary resource, and we want
you to make the fullest possible use of them. The pointers below will help
you to get the most out of these data.

Right now, the error rate in the Drosophila sequence is around 1 in 10,000
bp. This means that there may be on average one error - for example, a
frameshift mutation - in every 5 genes. There are also about 1,500 small
gaps. The sequence was annotated at Celera at an 'Annotation Jamboree'
using gene prediction programs and limited human curation. Over the next
six months to one year, the BDGP will finish the sequence to the Phase III
standard. As finishing is completed for an interval, that interval will be
re-annotated; when an entire arm is finished and re-annotated, there will
be a new release of that arm. Between now and then, we need your help! You
will undoubtedly notice errors in the sequence and mistakes in the
annotations. Help us to correct them by sending us an Error Report
(instructions will be on the FlyBase and BDGP web sites). We will post your
comments as additions to the gene record and, when we reach that part of
the genome, we will use them as an aid in the finishing process and to
correct gene annotations.
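
As a rough check on the error figures quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes errors are spread evenly along the sequence, and the roughly 2 kb of sequence per gene is simply what the two quoted figures imply together, not a number stated in this notice. The function names are hypothetical.

```python
# Back-of-the-envelope check of the quoted error figures.
# Assumption (not from the notice): errors are spread evenly along the sequence.

BP_PER_ERROR = 10_000    # "around 1 in 10,000 bp", as quoted
GENES_PER_ERROR = 5      # "one error ... in every 5 genes", as quoted


def implied_span_per_gene(bp_per_error, genes_per_error):
    """Sequence length per gene implied by the two quoted figures."""
    return bp_per_error / genes_per_error


def expected_errors(length_bp, bp_per_error=BP_PER_ERROR):
    """Average number of errors expected in a stretch of length_bp bases."""
    return length_bp / bp_per_error


if __name__ == "__main__":
    span = implied_span_per_gene(BP_PER_ERROR, GENES_PER_ERROR)
    print(f"Implied sequence per gene: {span:.0f} bp")
    print(f"Expected errors in that span: {expected_errors(span):.1f}")
```

The same function gives a quick estimate for any region of interest; for example, expected_errors(50_000) is the average number of errors to expect in a 50 kb interval under the same assumption.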

You can view the annotated sequence using a tool called GeneScene; details
of the annotated genes are available in a database called GadFly. These
have some limitations, and the following guidelines will help you to avoid
wasting time and energy.

a. Treat the current annotations with skepticism. The annotations were done
using a combination of gene prediction programs and limited human curation.
Gene prediction programs do a very good job of identifying exons, but are
less proficient at determining exact splice sites. It is likely that only a
minority of the predicted gene structures in the current annotated set are
completely correct. Another common problem is that two genes are merged
into one, or conversely, that one gene has been split into two. If you know
that a particular annotation is incorrect, please help us by filling out an
Error Report form.

b. The functional classifications that you see in GadFly were done
computationally as a way to manage the task of annotation and have had
limited human oversight; therefore some of the classifications of a
predicted protein's function may be wrong. You must not unquestioningly
accept them.

c. The BLAST searches reported in GadFly were run in November 1999. You must do
your own BLAST searches in order to get the most current results. When we
re-annotate the genome, we will re-run BLAST, but until the annotations are
refined, it is important to do your own BLAST searches.

d. We anticipate greatly increased use of the BDGP BLAST server. In order
to manage the increased load, we are now directing all TBLAST searches to
the NCBI BLAST server.

e. GeneScene reports evidence by whole gene, not by exon. For example, a
BLAST hit to one exon of a gene is indistinguishable from a BLAST hit along
the entire length of the gene. Please re-run BLAST to make sure that you
can correctly evaluate these pieces of evidence.

f. All predicted genes have been assigned a CG number. Each CG number has
been assigned a FlyBase Annotation number and a FlyBase gene number. You
can search FlyBase for a particular gene using its CG number, which will
lead you to a FlyBase gene report for that gene. For the time being, a
separate GadFly annotation report also exists for that gene on the BDGP
website, containing both overlapping and additional
molecular data. These two separate pages for each gene exist because we
have not yet had time to integrate the vast amount of data from the
sequence annotation jamboree (summarized on the GadFly pages) and the
traditional FlyBase gene reports. The two pages will be merged, and we
appreciate your patience.

g. The Drosophila genome sequence may change substantially over the next year
as the sequence is finished and re-annotated. This makes it problematic to
identify a particular gene in your publications according to the coordinate
system that you see in GeneScene, as gap filling will change the sequence
numbering. We are working on identifying a set of genomic sequence tags
that will serve as unique identifiers at approximately 1kb intervals
throughout the genome. Using these will require you to adjust to a
different reference system, but we hope you will appreciate the
advantage that it offers for continuity in the literature. Moreover, the
reannotation will change the predicted gene structures, and some predicted
genes may disappear entirely. For example, the number of predicted genes in
the C. elegans genome has shrunk by more than 1000 since its first release;
we can expect a comparable change in the Drosophila sequence annotation.
This means that as genes are merged or broken up, their CG and FBgn numbers
may change. However, old CG and FBgn numbers will be maintained as

Again, we are relying on the fly community to help us to make the sequence
and its annotation as accurate as possible; their value to the community at
large will increase with your efforts to refine them. We look forward to
hearing from you.

Thom Kaufman   	     	       kaufman at
--- Department of Biology, HHMI --- Indiana University
--- Jordan Hall 142 --- 1001 East Third Street --- Bloomington, IN 47405
--- 812-855-3033/Office --- 812-855-7674/Lab --- 812-855-2577/FAX

