New FGENESH with GC-donor exon gene prediction program

Victor Solovyev solovyev at sanger.ac.uk
Fri Dec 17 06:18:14 EST 1999


We have installed a new version of the HMM-based gene-finding program
FGENESH (GC), which predicts multiple genes in genomic DNA, including
exons with GC donor sites, at
                     http://genomic.sanger.ac.uk/

 
FGENESH (with possible donor GC) - Prediction of multiple genes in genomic
DNA sequences
A new version of the FGENESH program that allows the noncanonical GC
dinucleotide in donor splice sites.
This is the first program to include noncanonical exons in its predictions.
GC donor sites account for the major part of nonstandard splice sites in
human genes. They represent about 0.6% of all splice sites and are observed
in more than 5% of human genes.
We recently investigated noncanonical splice sites (Burset, Seledtsov and
Solovyev, 1999, in preparation) and obtained about 20000 EST-verified splice
sites.
We obtained a very strong GC-donor site weight matrix, which is used in the
gene prediction program.
We developed this variant of the program to predict GC-donor exons in
addition to standard exons while preserving the program's accuracy on
standard genes. Testing the program on 68 human genes with at least one GC
donor site shows that FGENESH (GC) provides a 10% higher rate of exact exon
prediction for this group and 5% higher accuracy at the nucleotide level.
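
To illustrate what a donor-site weight matrix does in general, here is a
minimal Python sketch. It is not the FGENESH matrix or its training procedure:
the four aligned sites are invented toy data, and the function names, the
pseudocount and the uniform background frequency are assumptions made only
for this example.

```python
import math

def build_log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log2((n / total) / background) for base, n in counts.items()}
        )
    return matrix

def score_site(matrix, candidate):
    """Sum the per-position log-odds scores of a candidate site."""
    return sum(column[base] for column, base in zip(matrix, candidate))

# Invented toy sites (NOT real data): 3 exon bases, the GC dinucleotide,
# then 4 intron bases.
TOY_GC_DONOR_SITES = ["AAGGCAAGT", "CAGGCAAGT", "AAGGCAAGA", "GAGGCGAGT"]

matrix = build_log_odds_matrix(TOY_GC_DONOR_SITES)
consensus_like = score_site(matrix, "AAGGCAAGT")
unrelated = score_site(matrix, "TTTTTTTTT")
print(consensus_like > unrelated)
```

A candidate that resembles the training sites scores higher than an unrelated
one; a gene finder can use such a score as one piece of evidence for a donor
site.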

Paste your sequence into the first window or load a file with your
nucleotide sequence in FASTA format

Paste your protein sequence into the second window
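
If you want to check a file before submitting it, the following Python sketch
reads FASTA-formatted text and reports characters that are not nucleotide
codes. It is only an illustration: the function names and the accepted
alphabet (A, C, G, T, N) are assumptions, and the server's own input checks
are not described here.

```python
def read_fasta(text):
    """Return a dict mapping each record name to its concatenated sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}

def unexpected_characters(sequence, alphabet="ACGTN"):
    """Return the set of characters that are not in the assumed alphabet."""
    return set(sequence) - set(alphabet)

example = ">seq1 test sequence\nacgtnn\nGGCC\n"
sequences = read_fasta(example)
print(sequences)
print(unexpected_characters(sequences["seq1"]))
```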

     References: Salamov A.A., Solovyev V.V. (1999), unpublished data.
     Please cite the CGG WEB server:
     http://genomic.sanger.ac.uk/

     Fgenesh+ output: 

             
      G - number of the predicted gene (counted from the sequence start)
      Str - DNA strand (+ for direct, - for complementary)
      Feature - type of coding sequence (CDSf - first
                (starting with the start codon);
                 CDSi - internal (internal exon);
                 CDSl - the last coding segment,
                        ending with the stop codon)
      TSS - position of transcription start (TATA-box position and score)

      Start and End - positions of the feature
      Weight - log likelihood*10 score for the feature
      ORF-start/end - positions where the complete codons start and end
      The last 3 values: Length of exon, positions in protein, % of
similarity w
n 

          FGENESH+ Prediction of potential genes in Human      genomic DNA
          Time:   Mon Jul 26 21:38:41 1999
          Seq name: Adh_and_cact.1 (2919020 bases) 848501 853000 Protein - gi|234 Length  215 Sim: 90
          Length of sequence:  4500  GC content: 40 Zone: 1
          Number of predicted genes 1 in +chain 1 in -chain 0
          Number of predicted exons 4 in +chain 4 in -chain 0
          Positions of predicted genes and exons:
           G Str Feature    Start     End   Score        ORF           Len
           1 +   1 CDSi    2577 -    2690    197.66    2579 -    2689    111
           1 +   2 CDSi    2756 -    2936    312.35    2758 -    2934    177
           1 +   3 CDSi    2991 -    3173    307.82    2992 -    3171    180
           1 +   4 CDSl    3242 -    3419    301.90    3243 -    3419    177


         Predicted protein(s):
         >FGENESH   1   4 exon (s)   2577  -   3419    217 aa, chain +
         PNMTAAPYNYNYIFKYIIIGDMGVGKSCLLHQFTEKKFMANCPHTIGVEFGTRIIEVDDK
         KIKLQIWDTAGQERFRAVTRSYYRGAAGALMVYDITRRSTYNHLSSWLTDTRNLTNPSTV
         IFLIGNKSDLESTREVTYEEAKEFADENGLMFLEASAMTGQNVEEAFLETARKIYQNIQE
         GRLDLNASESGVQHRPSQPSRTSLSSEATGAKDQCSC
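
For further processing, the exon rows of the output above can be read with a
short script. The Python sketch below is based only on the sample shown here;
the regular expression, the field names and the function name are assumptions,
and other FGENESH versions may print different columns.

```python
import re

# One exon row: gene, strand, exon number, feature, start - end, score,
# ORF start - ORF end, length. Layout assumed from the sample output only.
EXON_ROW = re.compile(
    r"^\s*(?P<gene>\d+)\s+(?P<strand>[+-])\s+(?P<exon>\d+)\s+(?P<feature>CDS\w)"
    r"\s+(?P<start>\d+)\s+-\s+(?P<end>\d+)\s+(?P<score>-?\d+\.\d+)"
    r"\s+(?P<orf_start>\d+)\s+-\s+(?P<orf_end>\d+)\s+(?P<length>\d+)\s*$"
)

INT_FIELDS = ("gene", "exon", "start", "end", "orf_start", "orf_end", "length")

def parse_exon_rows(report):
    """Return one dict per exon row; lines that do not match are skipped."""
    exons = []
    for line in report.splitlines():
        match = EXON_ROW.match(line)
        if match is None:
            continue
        row = match.groupdict()
        for field in INT_FIELDS:
            row[field] = int(row[field])
        row["score"] = float(row["score"])
        exons.append(row)
    return exons

SAMPLE = """\
 G Str Feature    Start     End   Score        ORF           Len
 1 +   1 CDSi    2577 -    2690    197.66    2579 -    2689    111
 1 +   2 CDSi    2756 -    2936    312.35    2758 -    2934    177
 1 +   3 CDSi    2991 -    3173    307.82    2992 -    3171    180
 1 +   4 CDSl    3242 -    3419    301.90    3243 -    3419    177
"""

exons = parse_exon_rows(SAMPLE)
for exon in exons:
    print(exon["feature"], exon["start"], exon["end"], exon["score"])
```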