Announcement: new gene (exon/intron) revealing system for PC

STRELETS at SCRI.FSU.EDU
Thu Oct 21 08:00:20 EST 1993


		SAGITTARIUS DNA Block Marker
		****************************
  
  SAGITTARIUS DNA Block Marker is a package for revealing exon/intron 
structure on the basis of protein k-tuple statistics. It is intended 
for MS DOS PC-compatibles (386 or 486 recommended).

   Input: any sequence file (.SEQ). Free-format sequence files,
GCG-style files, standard sequence extractions from the GENBANK or
EMBL databases etc. can probably also be used (not well tested).

   Output: on screen and in the file EXONS.RES. The output includes
a list of probable coding regions (start-end, frame#, weight) with
the corresponding amino acid sequences, and a suggested variant of
the assembled gene sequence (start-end and frame# for all included
regions) with the full encoded amino acid sequence. A special
feature of this method is that even when region boundaries are
predicted imperfectly, an error in the coding frame is practically
impossible.

   Alternative input: if SAGITTARIUS GENBANK is present on
your computer, you can use as input any file with bank numbers
(.NOM, the standard output of buffer content from SAGITTARIUS databanks).
The program will predict coding regions and genes for the corresponding
database sequences and show the real GENBANK CDS features for each
sequence alongside the predictions, which lets you test prediction
quality on a set of well-known genes (learning mode).

   In the case of user sequence input (from a .SEQ file) the program
will search for coding regions in both the direct and the inverted
strands, one after the other (independent predictions).
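
   As an illustration only (this is not the package's own code), the
two-strand search can be pictured as running the same predictor on the
sequence and on its reverse complement. The function names below are
hypothetical, and the convention of mapping inverted-strand coordinates
back to the direct strand is an assumption; the announcement does not
say how the program reports them.

```python
# Translation table for complementing nucleotides (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Return the inverted strand read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]

def strands(dna):
    """Yield both strands so that each can be analysed independently."""
    yield "direct", dna
    yield "inverted", reverse_complement(dna)

def to_direct_coords(start, end, length):
    """Map a half-open region [start, end) found on the inverted strand
    back to half-open coordinates on the direct strand (assumed convention)."""
    return length - end, length - start

for name, seq in strands("ATGC"):
    print(name, seq)
```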
-------------------------------------------------------------
		Short algorithm description
		---------------------------
     Taking into account a certain proximity of the k-tuple (peptide)
organization of proteins with similar structure and/or functions, a
new method is proposed for detecting coding regions of a genome. It is
based on mapping newly sequenced DNA regions with a subsequence
"admissibility" function, which is obtained by k-tuple analysis of a
protein database. In the simplest case, the algorithm marks all
k-tuples not found in real proteins as structural prohibitions, which
are forbidden in the coding regions of analogous proteins. The longest
region without such prohibitions can be regarded as the best exon (for
a eukaryotic genome) or an active gene (for a prokaryotic genome).
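
     The simplest case described above might be sketched in Python as
follows. This is an illustration under stated assumptions, not the
package's implementation: the function names, the toy one-protein
"database" and the value k=3 are all invented for the example, and the
standard genetic code is assumed. Stop codons are translated to "*",
which never occurs in a protein k-tuple and therefore always acts as a
prohibition.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def allowed_ktuples(proteins, k):
    """Collect every k-tuple (peptide of length k) seen in the proteins."""
    return {p[i:i + k] for p in proteins for i in range(len(p) - k + 1)}

def translate(dna, frame):
    """Translate one reading frame; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def longest_admissible_region(dna, frame, allowed, k):
    """Return (dna_start, dna_end, peptide) for the longest stretch of the
    frame in which every overlapping k-tuple is found in `allowed`.
    Coordinates are 0-based and half-open."""
    peptide = translate(dna, frame)
    best = (0, 0)
    run_start = None
    for i in range(len(peptide) - k + 1):
        if peptide[i:i + k] in allowed:
            if run_start is None:
                run_start = i
            if i + k - run_start > best[1] - best[0]:
                best = (run_start, i + k)
        else:
            run_start = None  # a prohibited k-tuple breaks the region
    start, end = best
    return frame + 3 * start, frame + 3 * end, peptide[start:end]

# Toy example: a one-protein "database" and a short DNA fragment.
allowed = allowed_ktuples(["MKTAYIAKQR"], k=3)
dna = "CCC" + "ATGAAAACTGCTTATATTGCTAAA" + "TAA"
print(longest_admissible_region(dna, 0, allowed, 3))
```

     In this toy run the leading CCC codon and the trailing stop codon
both produce k-tuples absent from the database, so the reported region
covers only the stretch between them.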
     To ensure a statistically significant level in the comparison of
different sequences, the length k of the k-tuples used is chosen
taking into account the general incidence of the respective peptides
in the database. The requirement is that, for random sequences with
the same amino acid composition as the database under study, all the
k-tuples used would be expected to occur in the database at least
once. Even so, the absence of a given k-tuple may reflect either a
limitation on primary protein structure or a mere lack of statistical
data.
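
     One way to picture this choice of k is a rough back-of-envelope
estimate like the sketch below. It is not the authors' procedure: the
function names are hypothetical, residues are treated as independent,
and the second helper additionally assumes a uniform 20-letter
alphabet, which real databases do not have.

```python
import math
from collections import Counter

def composition(proteins):
    """Relative frequency of each amino acid in the database."""
    counts = Counter("".join(proteins))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def expected_occurrences(ktuple, freqs, proteins):
    """Expected number of times `ktuple` would occur by chance in random
    sequences with the given composition (independent residues assumed)."""
    k = len(ktuple)
    positions = sum(max(len(p) - k + 1, 0) for p in proteins)
    return positions * math.prod(freqs.get(aa, 0.0) for aa in ktuple)

def largest_k_uniform(total_residues, alphabet_size=20):
    """Largest k for which an average k-tuple is expected at least once,
    under the crude assumption of a uniform alphabet."""
    k = 0
    while alphabet_size ** (k + 1) <= total_residues:
        k += 1
    return k

toy_db = ["ACAC", "AACC"]
freqs = composition(toy_db)
print(expected_occurrences("AA", freqs, toy_db))
print(largest_k_uniform(8000))
```

     A k-tuple whose expected count is well above one but which is
never observed is a stronger candidate for a real structural
prohibition than one whose expected count is close to zero.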
     The proposed algorithm has been tested on the latest releases of
the EMBL and SwissProt databases. Human genes with complicated
exon/intron structures were chosen as the best test examples. In
practically all cases, the longest region without any k-tuple
prohibitions was present in the EMBL database as a CDS feature. The
same result was observed for all the other, shorter real exons,
although with a lower signal/noise ratio. The accuracy of the position
forecast for the longest exon was much higher than with other coding
potentials (up to 95% instead of 60-70%).
     The field of application of the proposed algorithm is somewhat
limited: a structural or functional analog of the encoded protein must
exist among the already sequenced proteins in the database. The
probability that no such analog exists has been estimated by counting
the new superfamily proteins appearing between two consecutive
releases of the PIR database, and was found to be 2-5%. In all other
cases the algorithm may be used as a tool for rapid and accurate
detection of at least one best exon in a eukaryotic genome (with
subsequent more detailed study of the complete exon structure by means
of the usual algorithms) or of a complete active gene in a prokaryotic
one.
-------------------------------------------------------------

  SAGITTARIUS DNA Block Marker is available by anonymous FTP from:

 - FTP.SCRI.FSU.EDU, directory /pub/genetics/exons/

  SAGITTARIUS DNA Block Marker is probably also available by anonymous 
FTP from some of the well-known bio-servers (iubio etc.).

-------------------------------------------------------------
			Installation
			------------
  The distribution includes ready-to-use data files and executables 
(2.5 Mb in total), all in a self-extracting archive 
(file exonsall.exe).

  All decompressed SAGITTARIUS DNA Block Marker files must be placed in  
one directory (including Borland .BGI and huge .WRK files). 

  The program reads data intensively from the huge .WRK file, which slows 
down the prediction process. If your computer has a virtual (in-memory) 
disk of at least 2.1 Mb capacity, you may copy the huge .WRK file to this 
disk before running the program and assign this disk the drive letter G:. 
The program will then find the copy of the .WRK file on the virtual disk 
(G:), which speeds up the process at least 5 times. In the absence of such 
a large virtual disk the program will use the .WRK file from the current 
directory.
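
  The lookup order described above (virtual disk G: first, then the 
current directory) can be pictured with the short Python sketch below. 
It only illustrates the documented behaviour and is not the program's 
code; the function name and the example file name are hypothetical, 
since the actual name of the .WRK file is not given here.

```python
from pathlib import Path

def locate_wrk(filename, ram_disk="G:/", current_dir="."):
    """Return the first existing copy of `filename`, preferring the
    virtual disk over the current directory; None if neither has it."""
    for directory in (ram_disk, current_dir):
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None

# "EXAMPLE.WRK" is a placeholder name used only for this illustration.
print(locate_wrk("EXAMPLE.WRK"))
```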

--------------------------------------------------------------

SAGITTARIUS is a FREE DOMAIN software.

This package (with compressed data files) can be redistributed
freely without any limitations but only free of charge and for 
non-commercial usage. No changes in data files and/or executables 
allowed.

--------------------------------------------------------------

For HELPFUL comments and discussions please contact:

	Dr. Victor B.Strelets (strelets at scri.fsu.edu)
	or Dr. Hwa A.Lim (hlim at scri.fsu.edu)

	Supercomputer Computations Research Institute, 
	Florida State University, B-186, 
	Tallahassee, FL 32306-4052, USA

---------------------------------------------------------------

Common SAGITTARIUS information
------------------------------

  SAGITTARIUS is a family of free domain application packages for 
molecular biologists, intended for MS DOS PC-compatibles:

 - Compressed sequence databases with dialog shells for
   fast and easy data manipulation (PIR and GENBANK variants)

 - Fast programs for cross-bank homology searches with a user 
   sequence (SEQ), sensitive to short subregion homologies 
   (closely connected with compressed SAGITTARIUS databases)

 - Fast sensitive programs for pairwise alignments (both amino acid
   and nucleotide), including cross-bank alignments of a user 
   sequence (SEQ), made possible by a close connection with the 
   compressed databases
   
 - Packages for fast tree-based multiple alignments
 
 - Package for contig joining that is robust to sequencing errors
   (on the basis of tree-based multiple alignment)

 - Automated system for revealing coding regions (exon/intron structure) 
   in new nucleotide sequences (including learning mode access to
   the nucleotide sequences in compressed SAGITTARIUS GENBANK
   database)
   
 - Personal Reference Database dialog shell for manipulation 
   of data from standard BIO-JOURNALS(BIOSCI), SEQANALREF(Bairoch),
   JOURNALS-TOC(multiple sources) databases	   

--------------------------------------------------------------

Author(s) will in no way be held liable for any loss of profit or 
any other commercial damage including but not limited to special,  
incidental, consequential or other damages from use of this 
package. You may use this package only with the understanding that 
you use it at your own risk and that your use of the software 
constitutes your agreement to this disclaimer. 



