GENEID - Online Prediction of Gene Structure

Steen Knudsen steen at darwin.bu.edu
Sun Dec 1 11:27:08 EST 1991



          GENEID ONLINE SYSTEM FOR PREDICTION OF GENE STRUCTURE
                        version 1.0 11/20/1991

GeneID is an Artificial Intelligence system for analyzing vertebrate genomic 
DNA and predicting exons and gene structure (1). A prototype is implemented
as a fast, automatic email-response system. 
 
REGISTRATION:
Before or simultaneously with submitting a sequence for analysis, you need to
register your name by sending a line with the word "register", followed by
your name and address. Example:

  register, Don Johnson,  Miami Vice,  Bayview Marina Dock A12,  Miami, FL 34566-1234, U.S.A.

(the line can be longer than 80 characters as long as it contains no 
linebreaks). Send the line in a mail to: geneid at darwin.bu.edu.  The
registration information will only be used for maintaining a file of the
number and geographic distribution of the users.


SUBMITTING SEQUENCES:
Submit only one sequence per mail, in the following format (approximately the
same format as used for FASTA, BLAST, and GRAIL). Put the sequence after the 
keyword "Genomic Sequence" as shown below:

Genomic Sequence
>seqname
TTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCC
CGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTG
GCCAACTCCATCACTA...................

(Remember that long lines get truncated by Mail, so try to keep the lines 
below 80 characters. The seqname is limited to 20 characters.) If your mail
contains neither the keyword "Genomic Sequence" nor any other keyword listed
in this file, no mail will be returned to you.
If the reply file with the results exceeds the Mail limit of 300 kB, the 
reply will be split into several files.
On a UNIX system you could send the File containing the sequence as follows:
mail -v geneid at darwin.bu.edu  <File
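
For users who prepare submissions with a script, the layout described above
can be sketched in Python as follows. This is only an illustration and not
part of GeneID; the function name format_submission and the line width of 60
(taken from the example above) are assumptions.

```python
KEYWORD = "Genomic Sequence"  # keyword required in the mail, as described above
MAX_NAME_LEN = 20             # seqname limit stated above
LINE_WIDTH = 60               # assumed; matches the example, well below 80


def format_submission(seqname: str, sequence: str) -> str:
    """Return a mail body holding one sequence in the layout shown above."""
    if len(seqname) > MAX_NAME_LEN:
        raise ValueError("seqname is limited to 20 characters")
    # Drop spaces and line breaks left over from the source file.
    cleaned = "".join(sequence.split())
    # Cut the sequence into short lines so that Mail does not truncate them.
    lines = [cleaned[i:i + LINE_WIDTH] for i in range(0, len(cleaned), LINE_WIDTH)]
    return "\n".join([KEYWORD, ">" + seqname] + lines) + "\n"
```

The returned text could be written to the File used in the mail command
above.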


LIMITS:
GeneID currently will not accept sequences shorter than 100 bp or longer 
than 20 kb.
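
A sequence can be checked against these limits before mailing it. The sketch
below is an illustration only; the name within_limits is hypothetical, and it
assumes that 20 kb means 20,000 bases.

```python
MIN_BP = 100
MAX_BP = 20_000  # assumption: 20 kb is read as 20,000 bases


def within_limits(sequence: str) -> bool:
    """Return True if the sequence length falls inside the stated limits."""
    # Count bases only, ignoring spaces and line breaks.
    length = len("".join(sequence.split()))
    return MIN_BP <= length <= MAX_BP
```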


CONFIDENTIALITY:
Your submitted sequence will be deleted automatically immediately after
reception by GeneID.


ANALYSIS:
GeneID will scan your sequence for potential splice sites, start codons, and 
stop codons. Then it will try to assemble these into potential first exons, 
internal exons, and last exons. Exons will be evaluated according to a number 
of characteristics related to coding and splicing, and only likely exons will 
be kept. Mutually exchangeable exons (normally overlapping and in the same 
frame) will be put together in classes. Only the top 15 ranking first and
last exon classes, and the top 35 ranking internal exon classes from each
sequence will be kept. These will be assembled into potential gene models 
with an open reading frame, which will be ranked according to the quality of
the exons they contain. The top 20 models will be included in the return mail. Your 
return mail will also contain lists of the sites and exons created during the 
analysis. GeneID will not analyze the reverse complement of your sequence. If 
you suspect a gene on the other strand, submit the reverse complement 
sequence separately.
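
If you do not have a tool at hand for producing the reverse complement, a
minimal Python sketch is shown here. It is not part of GeneID, and the
function name reverse_complement is only an illustration. It handles A, C, G,
T, and N in either case and leaves any other character unchanged.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the order of the bases.
    return sequence.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which is a quick
way to check the result.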


TIPS FOR USE OF GENEID:
GeneID will try to identify first, internal, and last exons in each of the
sequences you submit, and try to assemble these into models of ONE likely
gene in each sequence. To avoid missing any exons, the number of exons will
be vastly overpredicted, and only a few of them are likely to be true (they 
tend to be the top ranking exons, but a few true exons rank very low). But
these few true exons are likely to be found in the gene models because they
fit together to form a continuous open reading frame. Thus you should look 
to the gene models to find a probable coding region.
If you submit a sequence that turns out to contain two genes, the behavior 
of GeneID is unpredictable. It could either predict one large gene containing 
both, or it could predict only the gene with the most typical characteristics.
If you submit a sequence that contains only part of a gene, GeneID will try 
to identify an entire gene in this sequence. Thus the predicted first exon 
may actually be part of a true internal exon, or the predicted last exon may 
be part of a true internal exon. If GeneID fails to predict any genes, you 
might look at the potential exon lists.
Thus you can experiment with input and response by starting out with 
sequences that are not too long (for example less than 10 kb), and seeing
whether GeneID is able to extend the gene if you extend the sequence.
GeneID will not construct models with more than 22 exons.
If the sequence contains frameshift errors in exons, then that may affect the 
quality of the prediction in the current implementation.

ACCURACY:
In a test on 28 genes from GenBank, 91% of the nucleotides were correctly 
predicted as coding or non-coding. Since these two categories are unequally 
represented, a better measure of accuracy may be the correlation 
coefficient, which was found to be 0.68. See paper for details.
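
To see why the correlation coefficient can be more informative than the
fraction of correctly predicted nucleotides, consider the Python sketch
below. It is an illustration only: it computes the ordinary correlation
between two per-nucleotide labelings ('C' for coding, 'N' for non-coding),
and it is an assumption that this matches the exact definition used in the
paper. The name score_prediction is hypothetical.

```python
from math import sqrt


def score_prediction(actual: str, predicted: str):
    """Return (fraction correct, correlation coefficient) for two labelings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("labelings must be non-empty and of equal length")
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == "C" and p == "C":
            tp += 1  # coding, predicted coding
        elif a == "N" and p == "N":
            tn += 1  # non-coding, predicted non-coding
        elif a == "N" and p == "C":
            fp += 1  # non-coding, predicted coding
        elif a == "C" and p == "N":
            fn += 1  # coding, predicted non-coding
        else:
            raise ValueError("labels must be 'C' or 'N'")
    fraction_correct = (tp + tn) / len(actual)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # The correlation is undefined when one labeling is constant; report 0.0.
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    return fraction_correct, correlation
```

For a sequence that is 90% non-coding, a prediction of "non-coding"
everywhere is 90% correct, yet its correlation coefficient is reported as 0.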

ANALYSIS TIME:
Analysis time depends on the load on the system and grows approximately
linearly with the length of the input sequence. Expect at least 1 minute per
kb. Longer
response times can occur if the system is temporarily down (check with the
UNIX command: "finger geneid at darwin.bu.edu").

FURTHER INFORMATION:
A preprint of a paper describing the development and testing of GeneID is
available as a Stuffit.hqx file for Macintosh. Simply include the line:
  
  Preprint Request

in your mail to geneid at darwin.bu.edu, and the manuscript will be mailed to 
you.


REFERENCING:
Publication of output from GeneID must be referenced as follows:
(1) Guigo, R., Knudsen, S., Drake, N., and Smith, T. (1991) Prediction of 
Gene Structure. Submitted manuscript.


PROBLEMS, COMMENTS, AND SUGGESTIONS:
These can be mailed to klose at darwin.bu.edu.

Users of the MBCRR and BMERC national computer resources have direct 
online access from their account. Contact Tom Graf at tom at mbcrr.harvard.edu
for information on these accounts.

----------------------------------------
Biomolecular Engineering Research Center
Boston University



