New WWW, Email service: TSSG and TSSW, prediction of PolII promoter regions in human DNA

Victor V. Solovyev solovyev at cmb.bcm.tmc.edu
Wed Aug 30 21:11:32 EST 1995


New BCM Gene-Finder service: 
==========================================================================
TSSG  and TSSW  Recognition of human PolII promoter region and start of 
                           transcription
==========================================================================
	    (TSS - Transcription Start Site)

   Department of Cell Biology, Baylor College of Medicine

	Analysis of uncharacterized human sequences is available 
through WWW: http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

or by sending a file containing your sequence (the sequence format is 
described below) to the University of Houston Email service and, soon, 
to the Weizmann Institute of Science Email service:

service at bchs.uh.edu   or services at bioinformatics.weizmann.ac.il

Examples: mail -s tssg service at bchs.uh.edu < test.seq

mail -s tssg services at bioinformatics.weizmann.ac.il < test.seq

where test.seq is a file containing the sequence.

The same applies to the TSSW program.
 
METHOD DESCRIPTION (TSSG):

   The algorithm predicts potential transcription start positions using a 
linear discriminant function that combines characteristics describing 
functional motifs with the oligonucleotide composition of these sites. TSSG 
uses the promoter.dat file of selected factor binding sites (TFD, Ghosh, 1993), 
developed by Dan Prestridge, to calculate the density of functional sites as in 
(J.Mol.Biol.,1995,249,923-932). In addition to the parameters of Prestridge's 
method, we use the oligonucleotide composition around the start of 
transcription, which allows us to locate the TSS (transcription start site) 
more accurately.
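
The scoring step described above can be sketched as a weighted sum compared with a threshold. This is only an illustration of a generic linear discriminant function: the three features, their values and the weights are hypothetical and are not the parameters used by TSSG. The default threshold of 4.00 is the value shown in the sample output further below.

```python
from typing import Sequence


def ldf_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Linear discriminant score: the weighted sum of the feature values."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features))


def is_predicted_tss(features: Sequence[float],
                     weights: Sequence[float],
                     threshold: float = 4.0) -> bool:
    """Call a candidate position a TSS when its score reaches the threshold."""
    return ldf_score(features, weights) >= threshold


# Hypothetical example: motif density, TATA-box score, oligonucleotide score.
example_weights = [2.0, 1.5, 0.5]
example_features = [3.0, 2.0, 4.0]
example_score = ldf_score(example_features, example_weights)
```

With these made-up numbers the score is 2.0*3.0 + 1.5*2.0 + 0.5*4.0 = 11.0, which is above the threshold, so the position would be reported.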
		
  At a true promoter region recognition level of approximately 50-55%,
the TSSG program gives about one false positive prediction per 5000 bp (this
accuracy is similar to that of Prestridge's method on the test sequences).

We estimated the accuracy of locating the TSS position on 10 test genes where both 
algorithms (ours and Prestridge's) found the promoter region:

Deviation of predicted TSS from the real TSS (number of test genes): 
______________________________________________________________________
  Method / deviation    I  5 b  I  50 b I 150 b I mean of observed
________________________I_______I_______I_______I deviations__________
  Prestridge's          I   0   I   3   I   7   I  81.2 bases
________________________I_______I_______I_______I_____________________
  TSSG                  I   7   I   3   I   0   I   7.3 bases
________________________I_______I_______I_______I_____________________
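
A comparison of this kind could be computed roughly as follows. The sketch assumes that the three columns are non-overlapping bins (deviation up to 5 b, up to 50 b, up to 150 b), which is a reading of the table and not something stated explicitly. The positions in the example are hypothetical and are not the 10 test genes.

```python
from typing import List, Sequence, Tuple


def summarize_deviations(predicted: Sequence[int],
                         actual: Sequence[int],
                         bins: Sequence[int] = (5, 50, 150)
                         ) -> Tuple[List[int], float]:
    """Count absolute deviations per bin and return the mean deviation.

    Each deviation is counted once, in the first bin whose upper limit
    it does not exceed; deviations above the last limit are not counted.
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    deviations = [abs(p - a) for p, a in zip(predicted, actual)]
    counts = [0] * len(bins)
    for d in deviations:
        for i, upper in enumerate(bins):
            if d <= upper:
                counts[i] += 1
                break
    mean = sum(deviations) / len(deviations)
    return counts, mean


# Hypothetical positions (not the real test set).
real_tss = [100, 200, 300, 400]
predicted_tss = [102, 230, 300, 520]
bin_counts, mean_deviation = summarize_deviations(predicted_tss, real_tss)
```

For these made-up positions the deviations are 2, 30, 0 and 120 bases, giving counts of 2, 1 and 1 and a mean deviation of 38.0 bases.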

         
 METHOD DESCRIPTION (TSSW):

   The algorithm predicts potential transcription start positions using a 
linear discriminant function that combines characteristics describing 
functional motifs with the oligonucleotide composition of these sites. TSSW 
uses a file of selected factor binding sites from the currently supported 
functional site database of (E.Wingender, J.of Biotechnology,1994, 35, 273-280).
In addition to the parameters of Prestridge's method (J.Mol.Biol.,1995,249,923-932),
we use some oligonucleotide composition characteristics around the start of 
transcription and within the promoter region.
  At a true promoter region recognition level of approximately 50-55%,
the TSSW program gives about one false positive prediction per 4000 bp.
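
As a rough guide to what these rates mean for a single sequence, the expected number of false positives can be estimated by dividing the sequence length by the quoted spacing. This is a back-of-the-envelope sketch that assumes false positives occur at a constant average rate along the sequence; the function name is hypothetical.

```python
def expected_false_positives(length_bp: int, bp_per_false_positive: int) -> float:
    """Expected false positive count, assuming a constant average rate."""
    if length_bp < 0 or bp_per_false_positive <= 0:
        raise ValueError("length must be >= 0 and spacing must be > 0")
    return length_bp / bp_per_false_positive


# The 7637 bp sequence from the examples, at the rates quoted above.
tssg_expected = expected_false_positives(7637, 5000)  # TSSG: one per 5000 bp
tssw_expected = expected_false_positives(7637, 4000)  # TSSW: one per 4000 bp
```

For a 7637 bp sequence this gives about 1.5 expected false positives for TSSG and about 1.9 for TSSW.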

SUBMITTING SEQUENCES VIA EMAIL:

  For email submission the sequences must have the following format:  

Name of your  sequence
ccatctctgtcttgcaggacaatgccgtcttctgtctcgtggggcatcctcctgctggca
ggcctgtgctgcctggtccctgtctccctggctgaggatccccagggagatgctgcccag
aagacagatacatcccaccatgatcaggatcacccaaccttcaacaagatcacccccaac
ctggctgagttcgccttcagcctataccgccagctggcacaccagtccaacagcaccaat
atcttcttctccccagtgagcatcg...............

   (Each line must be shorter than 80 characters.)
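
A file in this format could be prepared with a small helper such as the one below. It is a minimal sketch: the function name and the default width of 60 letters per line (the width used in the example above) are assumptions, and the only rule taken from the service description is that lines stay under 80 characters.

```python
def format_submission(name: str, sequence: str, width: int = 60) -> str:
    """Return a name line followed by the sequence in fixed-width lines."""
    if not 0 < width < 80:
        raise ValueError("width must be between 1 and 79")
    if len(name) >= 80 or "\n" in name:
        raise ValueError("the name must fit on one line under 80 characters")
    # Remove any whitespace or line breaks already present in the sequence.
    compact = "".join(sequence.split())
    lines = [compact[i:i + width] for i in range(0, len(compact), width)]
    return "\n".join([name] + lines) + "\n"


# Example with a made-up 160-letter sequence.
submission_text = format_submission("test sequence", "acgt" * 40)
```

The resulting text can be written to a file such as test.seq and mailed as shown in the examples.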



   Send the file containing the sequence to: 
   service at theory.bchs.uh.edu
   The subject line must be: tssg

   EXAMPLE: mail -s tssg service at bchs.uh.edu < test.seq

TSSG output:		

1st line - name of your sequence; 2nd and 3rd lines - the length of the 
	presented sequence and the LDF threshold
4th line - the number of predicted promoter regions
Next lines - positions of predicted sites, their 'weights' (LDF scores) and 
	the TATA box position (if found)
The position is that of the first nucleotide of the transcript (TSS position).
After that, the functional motifs are listed for each predicted region;
(+) or (-) indicates the direct or complementary strand; S... is the 
identifier of a particular motif in the Ghosh database.

   FOR EXAMPLE: (TSSG)	

 HSCALCAC     7637 bp    DNA             PRI       14-MAR-1995
 Length of sequence-      7637
 Threshold for LDF-  4.00
     1 promoter(s)  were predicted
 Pos.:   1820 LDF- 16.65 TATA box predicted at   1804
 Transcription factor binding sites:
for promoter at position -    1820
  1764 (-) S00098       AACCAAT
  1608 (-) S01152       AAGTGA
  1741 (+) S01153       AARKGA
  1608 (-) S01153       AARKGA
  1657 (+) S01090       AATGA
  1617 (-) S01027       ACGCCC
  1577 (+) S00534       ACGTCA
  1580 (-) S00534       ACGTCA
  1580 (-) S01257       ACGTCAT
..............................
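
The "Pos.:" lines of such a report can be read back with a short script. The sketch below is based only on the sample output shown here; the exact layout of a line when no TATA box is found is not shown, so treating the TATA part as simply absent is an assumption, as are the names Promoter and parse_promoters.

```python
import re
from typing import List, NamedTuple, Optional


class Promoter(NamedTuple):
    position: int          # predicted TSS position
    ldf: float             # LDF score ("weight") of the prediction
    tata: Optional[int]    # TATA box position, or None if not reported


# Matches e.g. " Pos.:   1820 LDF- 16.65 TATA box predicted at   1804"
_PROMOTER_RE = re.compile(
    r"Pos\.:\s*(\d+)\s+LDF-\s*(\d+(?:\.\d+)?)"
    r"(?:\s+TATA box predicted at\s+(\d+))?"
)


def parse_promoters(report: str) -> List[Promoter]:
    """Extract every predicted promoter line from a TSSG/TSSW report."""
    promoters = []
    for line in report.splitlines():
        match = _PROMOTER_RE.search(line)
        if match:
            pos, ldf, tata = match.groups()
            promoters.append(
                Promoter(int(pos), float(ldf), int(tata) if tata else None)
            )
    return promoters


sample_report = """ Threshold for LDF-  4.00
     1 promoter(s)  were predicted
 Pos.:   1820 LDF- 16.65 TATA box predicted at   1804
"""
sample_promoters = parse_promoters(sample_report)
```

For the sample above this yields a single prediction at position 1820 with score 16.65 and a TATA box at 1804.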

EXAMPLE: (TSSW)

HSCALCAC     7637 bp    DNA             PRI       14-MAR-1995    
     Length of sequence-      7637
 Threshold for LDF-  4.00
     2 promoter(s)  were predicted
 Pos.:   1834 LDF- 11.08 TATA box predicted at   1804
 Pos.:   7031 LDF-  4.64 TATA box predicted at   7001
 Transcription factor binding sites:
             for promoter at position -    1834
  1752 (+) CHICK$ACRA   CCGCCC
  1762 (-) HS$BAC_03    CCAAT
  1764 (-) RAT$ALBU_2   AACCAAT
  1757 (-) HS$APOE_08   GGGCGG
  1575 (+) HS$ACHGON_   TGACGTCA
  1582 (-) HS$ACHGON_   TGACGTCA
  1758 (+) MOUSE$A21C   ATTGG
  1745 (+) MOUSE$A21C   gcccagccctcccATTGGtggagacg
  1609 (+) Y$CYC1_09    ctcatttggcgagcGTTGGt
  1724 (+) AD$E2L_04    TGACgcA
  1577 (+) AD$E4_16     ACGTCA
  1580 (-) AD$E4_16     ACGTCA
  1580 (-) AD$E4_18     ACGTCAT
  1655 (+) HS$EGFR_15   TCAAT
..............................

   HS$EGFR_15 etc. are identifiers of particular motifs in the 
Wingender database.
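
The binding-site lines in both examples share one layout (position, strand, motif identifier, motif sequence), so they can be parsed in the same way for TSSG and TSSW. The following is a sketch based on the sample lines only; the names BindingSite and parse_binding_sites are hypothetical.

```python
import re
from typing import List, NamedTuple


class BindingSite(NamedTuple):
    position: int
    strand: str      # "+" for the direct strand, "-" for the complementary one
    motif_id: str    # e.g. S00098 (Ghosh) or HS$EGFR_15 (Wingender)
    motif: str       # motif sequence as printed in the report


# Matches e.g. "  1764 (-) S00098       AACCAAT"
_SITE_RE = re.compile(r"^\s*(\d+)\s+\(([+-])\)\s+(\S+)\s+(\S+)\s*$")


def parse_binding_sites(report: str) -> List[BindingSite]:
    """Extract the transcription factor binding site lines from a report."""
    sites = []
    for line in report.splitlines():
        match = _SITE_RE.match(line)
        if match:
            pos, strand, motif_id, motif = match.groups()
            sites.append(BindingSite(int(pos), strand, motif_id, motif))
    return sites


sample_sites = parse_binding_sites(
    "  1764 (-) S00098       AACCAAT\n"
    "  1655 (+) HS$EGFR_15   TCAAT\n"
    "..............................\n"
)
```

Lines that do not follow this layout, such as headers or the row of dots, are skipped.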


Reference:

Solovyev V.V., Salamov A.A., Lawrence C.B. Recognition
	of PolII promoter region and start of transcription position in human genes.
	(1995) (in preparation). 

Questions: solovyev at cmb.bcm.tmc.edu

===============================================================
The other services are 
===============================================================
FGENEH - search for gene structure with exon assembly by dynamic programming 
FEXH   - search for 5'-, internal and 3'-exons
HEXON  - search for internal exons 
HSPL   - search for splice sites
RNASPL - prediction of exon-exon junctions in cDNA sequences
CDSB   - prediction of bacterial coding regions
HBR    - recognition of human and bacterial sequences to test a library 
         for E. coli contamination by sequencing example clones
TSSG   - recognition of human promoter regions (Ghosh/Prestridge motif data)
TSSW   - recognition of human promoter regions (Wingender motif database) 
POLYAH - recognition of the 3'-end cleavage and polyadenylation region
         of human mRNA precursors
SSP    - prediction of a-helix and b-strand in globular proteins
	 by segment-oriented approach.
NSSP   - prediction of a-helix and b-strand segments in globular proteins
         by nearest-neighbor algorithm.



