Prot_map - a fast tool to align proteins with genome and reconstruct exon-intron structure

webmaster at softberry.com
Fri May 7 02:40:45 EST 2004


Prot_map, a fast tool to align proteins with a genome and reconstruct
exon-intron structure, has been developed recently and is available to
run at:

 http://sun1.softberry.ru/berry.phtml?topic=prot_map&group=programs&subgroup=xmap

The Prot_map program maps a set of protein sequences onto a genomic
sequence, producing gene structures and the corresponding alignments of
coding exons with similar or identical protein queries. Its input is a
genomic sequence and a set of protein sequences. Prot_map reconstructs the
gene structure on the basis of an identical or similar protein, rather than
returning the set of unordered alignment fragments that the Blast program
generates. The program is very fast and produces gene structures with
accuracy similar to that of the slow GeneWise program, which in practice
requires knowing the genomic location of the protein (Table 1). You can
further significantly improve the accuracy of gene reconstruction by running
the Fgenesh+ program on the results of Prot_map (i.e. a fragment of genomic
sequence and the protein sequence mapped onto it) (Table 2).
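As an illustration of passing "a fragment of genomic sequence" on to a second
program, the sketch below cuts a window around a mapped protein from its
alignment bounds. It is not part of Prot_map or Fgenesh+; the function name
cut_fragment, the 1-based inclusive coordinates and the flank size are all
assumptions made for the example.

```python
from typing import Tuple


def cut_fragment(genome: str, start: int, end: int,
                 flank: int = 2000) -> Tuple[str, int]:
    """Cut the region [start, end] plus flanks out of a genomic sequence.

    Coordinates are assumed to be 1-based and inclusive, like the
    "Alignment bounds" reported for the genomic sequence.
    Returns the fragment and the 1-based genomic position of its first base.
    """
    if not 1 <= start <= end <= len(genome):
        raise ValueError("alignment bounds fall outside the sequence")
    lo = max(1, start - flank)            # clip the left flank at the sequence start
    hi = min(len(genome), end + flank)    # clip the right flank at the sequence end
    return genome[lo - 1:hi], lo


if __name__ == "__main__":
    toy_genome = "ACGT" * 5               # 20-base toy sequence
    fragment, offset = cut_fragment(toy_genome, 9, 12, flank=2)
    print(fragment, offset)
```

The returned offset lets positions predicted on the fragment be converted
back to genomic coordinates by adding offset - 1.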

Prot_map is used (1) in the Fgenesh++ pipeline for automatic annotation of
new genomic sequences and (2) to generate a set of genes in new genomes
(without known genes) for training the parameters of gene-finding programs.
(3) It is also very useful for finding pseudogenes, by selecting the
corrupted gene structures that result from mapping a set of known proteins.

Figure 1. Example of mapping a protein sequence onto human chromosome 19.

L:3000000    Sequence Chr19 [cut:1 3000000]
[DD] Sequence:       1(      1), S:      105.56, L:1739
IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1 Tax_Id=9606 Splice isoform 2 of Q8TEK3
Summ of block lengths: 1284, Alignment bounds:
On first  sequence: start   2146727, end   2167197, length 20471
On second sequence: start       263, end      1682, length 1420
Blocks of alignment: 21       
    1 E: 2146727      70 [ca GT] P: 2146727     263 L: 23, G: 101.574  S:14.75
    2 E: 2147573     107 [AG GT] P: 2147575     287 L: 35, G: 103.465, S:18.56
    3 E: 2148934      42 [AG GT] P: 2148934     322 L: 14, G: 103.043, S:11.68
    4 E: 2150399     111 [AG GT] P: 2150399     336 L: 37, G: 102.130, S:18.82
    5 E: 2150620     235 [AG GT] P: 2150620     373 L: 78, G: 101.500, S:27.15
    6 E: 2151098     114 [AG GT] P: 2151100     452 L: 37, G: 106.924, S:19.76
    7 E: 2151750      92 [AG GT] P: 2151752     490 L: 30, G: 101.424, S:16.82
    8 E: 2153538     102 [AG GT] P: 2153538     520 L: 34, G: 100.496, S:17.73
    9 E: 2153848     138 [AG GT] P: 2153848     554 L: 46, G:  99.003, S:20.30
   10 E: 2154470     126 [AG GT] P: 2154470     600 L: 42, G: 101.283, S:19.87

          1        11   2146713   2146723   2146739   2146769
          gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
          ---------------(..)evdhqlkerfanmke  GGRIVSSKPFAPLNFRINSRNLS-
        248       248       249       259       267       277

    2146797   2146806   2147558   2147568   2147581   2147611
          ]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
           ---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
        286       286       286       286       289       299

    2147641   2147671   2147686   2148919   2148926   2148937
          PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
          PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
        309       319       322       322       322       323

    2148967   2148982   2150384   2150391   2150402   2150432
          KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
          KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
        333       336       336       336       337       347

    2150462   2150492   2150513   2150523   2150609   2150619
          TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
          TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
        357       367       373       373       373       373
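
The numbered block lines in Figure 1 can be read programmatically. The sketch
below is a hypothetical parser, not part of Prot_map. The field meanings are
inferred from the numbers in the figure and are assumptions: "E:" is taken as
the exon start on the genome and its length in nucleotides, "P:" as the
genomic and protein start of the aligned segment, and "L:" as the aligned
length in amino acids. The G and S values are stored without interpretation.
The names Block, parse_block and codon_offset are invented for this example.

```python
import re
from typing import NamedTuple, Optional


class Block(NamedTuple):
    index: int
    exon_start: int   # assumed: genomic start of the exon (first number after "E:")
    exon_len: int     # assumed: exon length in nucleotides
    left_site: str    # two letters before the exon, e.g. "AG"
    right_site: str   # two letters after the exon, e.g. "GT"
    aln_start: int    # assumed: genomic start of the aligned segment ("P:")
    prot_start: int   # assumed: start position in the protein
    aa_len: int       # assumed: aligned length in amino acids ("L:")
    g: float          # value after "G:", kept as reported
    s: float          # value after "S:", kept as reported


# The comma after the G value is optional because block 1 in Figure 1 lacks it.
BLOCK_RE = re.compile(
    r"^\s*(\d+)\s+E:\s*(\d+)\s+(\d+)\s+\[(\w+)\s+(\w+)\]\s+"
    r"P:\s*(\d+)\s+(\d+)\s+L:\s*(\d+),\s*G:\s*([\d.]+),?\s*S:\s*([\d.]+)"
)


def parse_block(line: str) -> Optional[Block]:
    """Parse one block line; return None if the line has another format."""
    m = BLOCK_RE.match(line)
    if m is None:
        return None
    idx, es, el, ls, rs, ps, pp, al, g, s = m.groups()
    return Block(int(idx), int(es), int(el), ls, rs,
                 int(ps), int(pp), int(al), float(g), float(s))


def codon_offset(block: Block) -> int:
    """Nucleotides between the exon start and the start of the aligned segment."""
    return block.aln_start - block.exon_start


if __name__ == "__main__":
    lines = [
        "    1 E: 2146727      70 [ca GT] P: 2146727     263 L: 23, G: 101.574  S:14.75",
        "    2 E: 2147573     107 [AG GT] P: 2147575     287 L: 35, G: 103.465, S:18.56",
    ]
    for line in lines:
        print(parse_block(line))
```

Under these assumptions, three times the aligned amino-acid length plus the
offset never exceeds the exon length for the blocks shown, which is what one
would expect when leftover nucleotides belong to codons split across introns.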


Table 1. Speed of processing sequences by Prot_map, Fgenesh+ and GeneWise.

                                 Fgenesh+   Prot_map   GeneWise
88 sequences of genes < 20 kb    ~1 min     ~1 min     ~90 min
8 sequences of genes > 400 kb    ~1 min     ~1 min     ~1200 min

Table 2. Comparison of the accuracy of gene identification programs: ab
initio Fgenesh, and prediction with protein support by Fgenesh+, GeneWise
and Prot_map, on a set of human genes using homologous mouse or Drosophila
proteins. %CG (correct genes) is the percentage of exactly predicted genes.

Mouse homologs: 60% < similarity level < 80% - 1425 sequences 

           Sn ex   Sno ex  Sp ex   Sn nuc  Sp nuc  CC      %CG
Fgenesh    83.4    90.9    86.8    93.2    94.9    0.937   30
GeneWise   88.1    96.5    90.5    97.8    99.2    0.984   43
Fgenesh+   93.9    97.9    94.9    98.4    99.3    0.988   65
Prot_map   87.0    96.5    86.6    97.0    98.5    0.976   40

Drosophila homologs: similarity level > 80% - 66 sequences.

           Sn ex   Sno ex  Sp ex   Sn nuc  Sp nuc  CC      %CG
Fgenesh    90.5    93.8    95.1    97.9    96.9    0.950   55
GeneWise   79.3    83.9    86.8    97.3    99.5    0.985   23
Fgenesh+   95.1    97.8    97.0    98.9    99.5    0.9914  70
Prot_map   86.4    95.3    88.1    97.6    99.0    0.982   41
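
For readers unfamiliar with the column names, the sketch below shows how
exon-level sensitivity (Sn ex) and specificity (Sp ex) are commonly computed
in gene-prediction evaluation: Sn is the share of annotated exons predicted
exactly, and Sp is the share of predicted exons that are exactly right. That
Table 2 was computed this way is an assumption; the post does not define the
columns beyond %CG. The function names and the toy coordinates are invented
for the example.

```python
from typing import Set, Tuple

Exon = Tuple[int, int]  # (start, end), genomic coordinates, both inclusive


def exon_accuracy(annotated: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (Sn, Sp) in percent at the exon level.

    An exon counts as correct only if both of its boundaries match exactly.
    """
    true_pos = len(annotated & predicted)
    sn = 100.0 * true_pos / len(annotated) if annotated else 0.0
    sp = 100.0 * true_pos / len(predicted) if predicted else 0.0
    return sn, sp


def gene_is_correct(annotated: Set[Exon], predicted: Set[Exon]) -> bool:
    """A gene is 'exactly predicted' when every exon matches and none is extra."""
    return bool(annotated) and annotated == predicted


if __name__ == "__main__":
    # Toy data: four annotated exons, three predicted, one with a shifted start.
    annotated = {(100, 200), (300, 400), (500, 600), (700, 800)}
    predicted = {(100, 200), (300, 400), (510, 600)}
    sn, sp = exon_accuracy(annotated, predicted)
    print(f"Sn ex = {sn:.1f}%  Sp ex = {sp:.1f}%  exact gene: "
          f"{gene_is_correct(annotated, predicted)}")
```

Averaging gene_is_correct over a test set gives a figure comparable in spirit
to the %CG column.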
 
