New FGENESB - Finding genes in microbial genomes

victor at softberry.com
Wed Aug 28 09:18:47 EST 2002



New FgenesB is the fastest (E. coli genome analyzed in ~14 sec) and most 
accurate ab initio bacterial gene prediction program available.

	http://www.softberry.com/berry.phtml?topic=fgenesb

It uses parameters learned for individual bacteria by the FgenesB-train script, 
whose only input is the new bacterial sequence. The script automatically creates 
a file with gene prediction parameters for the analyzed organism. 
Creating such a file for a genome like E. coli takes only ~10 minutes, 
using just its sequence. If you need parameters for a new bacterium, 
please contact Softberry Inc.; we can include them in the web list. 


The algorithm is based on pattern recognition of different types of signals 
and Markov chain models of coding regions. The optimal combination of these 
features is then found by dynamic programming, and a set of gene models 
is constructed along the given sequence.
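
The two ingredients named above can be illustrated with a minimal sketch. 
This is not FgenesB's actual implementation: the first-order Markov model, 
the rule that chosen genes must not overlap, and the function names 
markov_log_odds and best_gene_set are all assumptions made for illustration. 
The first function scores a candidate region by its log-odds under a coding 
model versus a background model; the second uses dynamic programming to pick 
the highest-scoring set of non-overlapping candidates.

```python
import bisect
import math


def markov_log_odds(seq, coding, background):
    """Log-odds of seq under a first-order Markov coding model versus a
    background model. Each model maps a dinucleotide such as "AT" to
    P(second base | first base)."""
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        score += math.log(coding[prev + cur] / background[prev + cur])
    return score


def best_gene_set(candidates):
    """Choose the highest-scoring set of non-overlapping candidates.

    candidates: list of (start, end, score), inclusive coordinates.
    Returns (total_score, chosen_candidates_in_sequence_order).
    """
    cands = sorted(candidates, key=lambda c: c[1])
    ends = [c[1] for c in cands]
    n = len(cands)
    best = [0.0] * (n + 1)   # best[i]: best total using the first i candidates
    take = [False] * (n + 1)  # take[i]: whether candidate i is in that optimum
    prev = [0] * (n + 1)      # prev[i]: candidates ending before candidate i starts
    for i, (start, _end, score) in enumerate(cands, 1):
        p = bisect.bisect_left(ends, start)  # count of ends strictly < start
        prev[i] = p
        if best[p] + score > best[i - 1]:
            best[i] = best[p] + score
            take[i] = True
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen candidates.
    chosen = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(cands[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[-1], chosen[::-1]
```

In a real gene finder the candidate scores would also include signal scores 
(start codon context, ribosome binding site, and so on), and real bacterial 
genes may overlap slightly, so the non-overlap rule here is a simplification.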

--------------------------------------------------------------------------------
Accuracy of prediction estimated on the B. subtilis sequence: 
Genes whose start is not the first possible start codon - 19.1% 
Borodovsky et al. (see the GeneMark web pages) calculated accuracy for all genes 
and for 3 sets of difficult short genes (L <= 300 bp) with protein similarity 
support, to demonstrate that short genes can also be predicted reasonably well. 
The first set (51set) has 51 genes with at least 10 strong similarities to known 
proteins; 72set has 72 genes with at least 2 strong similarities; and 
123set has 123 genes with at least one homolog. 
Here are the data for GeneMarkS and Glimmer as calculated by Borodovsky et al., 
and for FgenesB (after 3 iterations of FgenesB-train): 

               Sn, % (exact     Sn, % (exact + overlapping
                  predictions)     predictions)

 123set: 

Glimmer         57.0            91.1 
GeneMarkS       82.9            91.9 
FgenesB         89.3            98.4

 72set: 

Glimmer         57.0            91.7 
GeneMarkS       88.9            94.4 
FgenesB         91.5            98.6

 51set: 

Glimmer         51.0            88.2 
GeneMarkS       90.2            94.1 
FgenesB         02.0            98.0

 All genes set: 

Glimmer         62.4            98.1 
GeneMarkS       83.9            96.7 
FgenesB         83.8            98.7 

(PS: note that many genes in GenBank are annotated using the GeneMark
     program, which should lead to an overestimate of accuracy for GeneMark).
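
For readers who want to compute comparable figures on their own data, here 
is a minimal sketch of the two sensitivity (Sn) measures. It rests on 
assumptions that the comparison above does not spell out: "exact" is taken 
to mean identical start, end and strand, and "overlapping" is taken to mean 
any same-strand overlap with an annotated gene. The function name 
sensitivity is hypothetical.

```python
def sensitivity(annotated, predicted):
    """Return (sn_exact, sn_exact_or_overlapping) as percentages.

    Genes are (start, end, strand) tuples with inclusive coordinates.
    Assumption: "exact" = same start, end and strand; "overlapping" =
    any overlap with a predicted gene on the same strand.
    """
    if not annotated:
        raise ValueError("annotated gene list is empty")
    predicted_set = set(predicted)
    exact = 0
    found = 0
    for start, end, strand in annotated:
        if (start, end, strand) in predicted_set:
            exact += 1
            found += 1
        elif any(s == strand and ps <= end and start <= pe
                 for ps, pe, s in predicted):
            found += 1
    total = len(annotated)
    return 100.0 * exact / total, 100.0 * found / total
```

With four annotated genes, of which one is predicted exactly and one more 
is only overlapped by a same-strand prediction, this returns (25.0, 50.0).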



FgenesB output: 
bact  Tue Aug 27 00:12:46 EDT 2002
 FgenesB:  Finding genes in microbial genomes (Softberry Inc.)
 Time:   Tue Aug 27 00:12:46 2002
 Seq name: Softberry SERVER PAST Sequence 
 Length of sequence - 12780 bp   Parameters:  Escherichia_coli_K-12.dat 
 Number of predicted genes - 12
     N   S             Start         End    Score

     1   +    CDS        190 -       255    100.0 
     2   +    CDS        337 -      2799   2467.0 
     3   +    CDS       2801 -      3733    785.0 
     4   +    CDS       3734 -      5020   1493.0 
     5   +    CDS       5234 -      5530    161.0 
     6   -    CDS       5683 -      6459    870.0 
     7   -    CDS       6529 -      7959   1033.0 
     8   +    CDS       8238 -      9191   1319.0 
     9   +    CDS       9306 -      9893    544.0 
    10   -    CDS       9928 -     10479    775.0 
    11   -    CDS      10643 -     11356    594.0 
    12   -    CDS      11382 -     11786    394.0 

.................................

Predicted protein(s):
>GENE     1       190  -       255     21 aa, chain +
MKRISTTITTTITITTGNGAG
>GENE     2       337  -      2799    820 aa, chain +
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
LPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINA
ALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIP
ADHMVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV
PDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD
EDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEYSISF
CVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGAL
LEQLKRQQSWLKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRL
VKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSR
RKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLA
REMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNAEGDVAAFMA
NLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKVKNGENALAF
YSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
>GENE     3      2801  -      3733    310 aa, chain +
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEP
RENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLND
TRLLALMGELEGRISGSIHYDNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGI
KVSTAEARAILPAQYRRQDCIAHGRHLAGFIHACYSRQPELAAKLMKDVIAEPYRERLLP
GFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVADWLGKNYLQNQEGFVHICRL
DTAGARVLEN
...............................
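
The gene table in the output above is easy to read programmatically. The 
following is a minimal sketch of a parser, assuming the column layout is 
exactly as in this sample (number, strand, "CDS", start, "-", end, score); 
the names Gene and parse_gene_table are hypothetical and not part of 
FgenesB. The protein_length property assumes that the CDS coordinates 
include the stop codon, which matches the sample (gene 1 spans 66 bp and 
is reported as 21 aa).

```python
import re
from typing import NamedTuple


class Gene(NamedTuple):
    number: int
    strand: str
    start: int
    end: int
    score: float

    @property
    def protein_length(self):
        # Assumes the CDS coordinates include the stop codon.
        return (self.end - self.start + 1) // 3 - 1


# Matches table rows such as "     6   -    CDS       5683 -      6459    870.0"
GENE_LINE = re.compile(
    r"^\s*(\d+)\s+([+-])\s+CDS\s+(\d+)\s+-\s+(\d+)\s+(\d+(?:\.\d+)?)\s*$"
)


def parse_gene_table(text):
    """Return a list of Gene records for every table row found in text."""
    genes = []
    for line in text.splitlines():
        match = GENE_LINE.match(line)
        if match:
            number, strand, start, end, score = match.groups()
            genes.append(
                Gene(int(number), strand, int(start), int(end), float(score))
            )
    return genes
```

Header lines, the protein FASTA records and the dotted separators do not 
match the pattern and are simply skipped.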

