Announcement: new gene (exon/intron) revealing system for PC
STRELETS at SCRI.FSU.EDU
Thu Oct 21 08:00:20 EST 1993
SAGITTARIUS DNA Block Marker
****************************
SAGITTARIUS DNA Block Marker is a package for revealing exon/intron
structure on the basis of protein k-tuple statistics, oriented toward
MS-DOS PC compatibles (386 or 486 recommended).
Input: any sequence file (.SEQ). Free-format sequence representation,
GCG-style files, standard sequence extractions from the GENBANK or EMBL
databases etc. can probably be used as well (not well tested).
Output: on-screen and in the file EXONS.RES. The output includes
a list of probable coding regions (start-end, frame#, weight) with the
corresponding amino acid sequences (a special feature of this
method: even when region boundaries are predicted imperfectly, an
error in the coding frame is practically impossible), and a suggested
variant of the assembled gene sequence (start-end and frame# for all
included regions) with the full encoded amino acid sequence.
Alternative input: if SAGITTARIUS GENBANK is present on your
computer, you can use as input any file with bank numbers (.NOM, the
standard output of buffer content from SAGITTARIUS databanks).
The program will predict coding regions and genes for the corresponding
database sequences and show the real GENBANK CDS features for each
sequence alongside, which allows prediction quality to be tested
on a set of well-known genes (learning mode).
In the case of user sequence input (from a .SEQ file) the program
will search for coding regions in both the direct and the inverted
strands consecutively (independent predictions).
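As an illustration of the two-strand search, here is a minimal Python
sketch. It assumes that "inverted strand" means the reverse complement
of the input sequence; the function names are hypothetical and are not
part of the SAGITTARIUS package.

```python
# Hypothetical helpers for illustration only; not taken from SAGITTARIUS.
# Characters outside ACGT/acgt (e.g. N) are passed through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def both_strands(dna: str) -> dict:
    """Return the two sequences that would be scanned independently."""
    return {"direct": dna, "inverted": reverse_complement(dna)}
```

Each of the two returned sequences would then be passed separately
through the same coding-region prediction, giving two independent sets
of results.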
-------------------------------------------------------------
Short algorithm description
---------------------------
Taking into account a certain proximity of the k-tuple (peptide)
organization of proteins with similar structure and/or function, a
new method is proposed for detecting coding regions of a genome,
based on mapping newly sequenced DNA regions with a subsequence
"admissibility" function. This function is obtained by k-tuple
analysis of a protein database. In the simplest case, the algorithm
marks all k-tuples not found in real proteins as structural
prohibitions, which are forbidden in the coding regions of
analogous proteins. The longest region without such prohibitions can
be taken as the best exon (for a eukaryotic genome) or an active gene
(for a prokaryotic genome).
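The simplest case described above can be sketched in Python as follows.
This is only an illustration under stated assumptions, not the
program's actual implementation: the standard genetic code is assumed,
the "database" is a plain list of protein strings, and all function
names are hypothetical.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering ('*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate one reading frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def known_ktuples(proteins, k):
    """Collect every k-tuple (peptide of length k) seen in the proteins."""
    return {p[i:i + k] for p in proteins for i in range(len(p) - k + 1)}


def longest_admissible_region(peptide, known, k):
    """Return (start, end), end exclusive, of the longest stretch of
    `peptide` in which every k-tuple occurs in `known`."""
    best = (0, 0)
    run_start = None
    n = len(peptide) - k + 1          # number of k-tuples in the peptide
    for i in range(n + 1):            # one extra step closes a final run
        admissible = i < n and peptide[i:i + k] in known
        if admissible:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            end = i - 1 + k           # last admissible k-tuple starts at i-1
            if end - run_start > best[1] - best[0]:
                best = (run_start, end)
            run_start = None
    return best
```

A stop codon ('*') never occurs inside a database protein, so any
k-tuple containing it acts as a prohibition automatically. For a frame
f, peptide position p corresponds to nucleotide offset f + 3*p in the
translated strand.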
To ensure a statistically significant level in the comparison of
different sequences, the length k of the k-tuples used is chosen
taking into account the general incidence of the respective peptides
in the database: for random sequences with the same amino acid
composition as the database under study, every k-tuple used must
occur in the database at least once. In this case, the absence
of some k-tuple can reflect either a limitation on primary
protein structure or a mere lack of statistical data.
The proposed algorithm has been tested on the latest releases of
the EMBL and SwissProt databases. Human genes with complicated
exon/intron structures were tested as the best examples. In
practically all cases, the longest region without any k-tuple
prohibitions was present in the EMBL database as a CDS feature.
The same result was observed for all the other, shorter real exons,
although with a lower signal/noise ratio. The accuracy of position
prediction for the longest exon was much higher than with other
coding potentials (up to 95% instead of 60-70%).
The field of application of the proposed algorithm is somewhat
limited by the need for a structural or functional analog of the
encoded protein to exist among the already sequenced proteins in the
database. The probability of failure for lack of such an analog has
been estimated by counting the proteins from new superfamilies
appearing in two consecutive releases of the PIR database and was
found to be 2-5%. In all other cases the algorithm may be used as a
tool for rapid and accurate detection of at least one best exon in a
eukaryotic genome (with subsequent more detailed study of the complete
exon structure by means of the usual algorithms) or of a complete
active gene in a prokaryotic one.
-------------------------------------------------------------
SAGITTARIUS DNA Block Marker is available by anonymous FTP from:
- FTP.SCRI.FSU.EDU, directory /pub/genetics/exons/
SAGITTARIUS DNA Block Marker is probably also available by anonymous
FTP from some of the well-known bio-servers (iubio etc.).
-------------------------------------------------------------
Installation
------------
The distribution includes ready-to-use information files
and executables (2.5 Mb in total), all in a self-extracting archive
(file exonsall.exe).
All decompressed SAGITTARIUS DNA Block Marker files must be placed in
one directory (including Borland .BGI and huge .WRK files).
The program reads data intensively from the huge .WRK file, which slows
down the prediction process. If your computer has a virtual (in-memory)
disk of at least 2.1 Mb capacity, you may copy the huge .WRK file to
this disk before running the program and assign it the drive letter G:.
The program will then find the copy of the .WRK file on this virtual
disk (G:), which speeds up the process at least 5 times. In the absence
of such a large virtual disk the program will use the .WRK file from
the current directory.
--------------------------------------------------------------
SAGITTARIUS is a FREE DOMAIN software.
This package (with compressed data files) can be redistributed
freely without any limitations, but only free of charge and for
non-commercial use. No changes to the data files and/or executables
are allowed.
--------------------------------------------------------------
For HELPFUL comments and discussions please contact:
Dr. Victor B.Strelets (strelets at scri.fsu.edu)
or Dr. Hwa A.Lim (hlim at scri.fsu.edu)
Supercomputer Computations Research Institute,
Florida State University, B-186,
Tallahassee, FL 32306-4052, USA
---------------------------------------------------------------
Common SAGITTARIUS information
------------------------------
SAGITTARIUS is a family of free domain application packages for
molecular biologists, oriented toward MS-DOS PC compatibles:
- Compressed sequence databases with dialog shells for
fast and easy data manipulation (PIR and GENBANK variants)
- Fast programs for cross-bank homology searches with a user SEQ
(sensitive to short subregion homologies) (closely connected with
compressed SAGITTARIUS databases)
- Fast sensitive programs for pairwise alignments (both amino acid
and nucleotide), including cross-bank user SEQ alignments (closely
connected with compressed databases to allow cross-bank user SEQ
alignments)
- Packages for fast tree-based multiple alignments
- Package for contig joining that is stable to sequencing errors
(on the basis of tree-based multiple alignment)
- Automated system for revealing coding regions (exon/intron structure)
in new nucleotide sequences (including learning mode access to
the nucleotide sequences in compressed SAGITTARIUS GENBANK
database)
- Personal Reference Database dialog shell for manipulation
of data from standard BIO-JOURNALS(BIOSCI), SEQANALREF(Bairoch),
JOURNALS-TOC(multiple sources) databases
--------------------------------------------------------------
Author(s) will in no way be held liable for any loss of profit or
any other commercial damage including but not limited to special,
incidental, consequential or other damages from use of this
package. You may use it only with the understanding that
you use it at your own risk and that your use of the software
is your agreement to this disclaimer.