Software for finding specified sequence elements
Ethan.Hack at newcastle.ac.uk
Wed Aug 14 07:14:42 EST 1991
This is a follow-up to a request that I posted some time ago for
information about programs with the ability to find loosely-specified
nucleotide sequence elements, such as prokaryotic ribosome binding
sites, in larger sequences. This request led me to a package called
OVERSEER, written by Peter Sibbald et al. of the EMBL Biocomputing
Group, which I highly recommend. (A description is in press in
CABIOS.) The particular virtue of this program is its ability to find
sequences whose specification is complex. It can find specified fixed
length sequences, repeats, and palindromes, all with or without
mismatches, but most importantly, it can find sequences containing any
combination you like of the above.
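
To make the idea of "with mismatches" and "any combination" concrete, here is a minimal Python sketch. It is not OVERSEER's own method (OVERSEER is written in Pascal and its search language is not shown here); the function names, the motif AGGAGG, the codon ATG and the spacing limits are all assumptions chosen for illustration. It looks for a fixed-length motif allowing a set number of substitutions, then requires an exact second element at a permitted distance downstream.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_approx(seq, motif, max_mismatches=0):
    """Return start positions where motif matches seq with at most
    max_mismatches substitutions (no insertions or deletions)."""
    m = len(motif)
    return [i for i in range(len(seq) - m + 1)
            if mismatches(seq[i:i + m], motif) <= max_mismatches]


def find_pair(seq, motif, max_mismatches, second, min_gap, max_gap):
    """Return (motif_start, second_start) pairs where an approximate motif
    is followed, after min_gap..max_gap bases, by an exact match to second."""
    hits = []
    for i in find_approx(seq, motif, max_mismatches):
        motif_end = i + len(motif)
        for gap in range(min_gap, max_gap + 1):
            j = motif_end + gap
            # A slice running past the end is shorter than `second`,
            # so it simply fails the comparison.
            if seq[j:j + len(second)] == second:
                hits.append((i, j))
    return hits


# Hypothetical example: motif with up to 1 mismatch, then ATG 5 to 9 bases on.
example = "TTAGGAGGTTTTTTATGAAA"
print(find_pair(example, "AGGAGG", 1, "ATG", 5, 9))
```

The scan is deliberately brute-force; it is meant to show how a loose element and an exact element are chained, not to be fast on database-sized input.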
The prokaryotic ribosome binding site provides a simple example: a
somewhat loosely defined consensus sequence must be followed at a
certain distance by an exactly-defined initiation codon. To
illustrate the possible complexity of the target, here is another
example, taken from the program manual:
"Find two stem loop structures that are within 30 nucleotides of each
other. The last position of the first loop must be a g. The stems
can have a maximum length of 10 and a minimum length of 8. The loop
sizes must be between 5 and 7. The first 3 positions of loop 1 must
pair with the first 3 positions of loop 2. . .
Strategy: first find a stem loop with a stem of 8 to 10 and a loop of 5 to 7.
second, when that is found, ask if last position of loop is a g
third, if it is then find second stem loop 8-10, 5-7
fourth, compare the 2 three-base strands and see if they pair
if all four then a hit will be reported."
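
The four-step strategy quoted above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the manual's or OVERSEER's implementation: stems must pair perfectly (no mismatches), the alphabet is lowercase DNA (a, c, g, t), "within 30 nucleotides" is taken to mean that the second structure starts 0 to 30 bases after the first one ends, and the two three-base strands are assumed to pair in antiparallel orientation. All function names are hypothetical.

```python
COMP = {"a": "t", "t": "a", "g": "c", "c": "g"}


def pairs(left, right):
    """True if left base-pairs with right read in the opposite direction.
    Symbols outside a/c/g/t never pair."""
    return len(left) == len(right) and all(
        COMP.get(x) == y for x, y in zip(left, reversed(right)))


def stem_loops(seq, stems=(8, 10), loops=(5, 7)):
    """Return (start, stem_len, loop_len) for every perfect stem loop."""
    found = []
    for start in range(len(seq)):
        for stem in range(stems[0], stems[1] + 1):
            for loop in range(loops[0], loops[1] + 1):
                end = start + 2 * stem + loop
                if end > len(seq):
                    continue
                if pairs(seq[start:start + stem], seq[end - stem:end]):
                    found.append((start, stem, loop))
    return found


def loop_of(seq, hit):
    """Return the loop bases of a (start, stem_len, loop_len) hit."""
    start, stem, loop = hit
    return seq[start + stem:start + stem + loop]


def paired_stem_loops(seq, max_gap=30):
    """Apply the four steps: stem loop 1, its loop ends in g,
    stem loop 2 nearby, and the first 3 loop bases of each pair."""
    seq = seq.lower()
    candidates = stem_loops(seq)
    hits = []
    for first in candidates:
        loop1 = loop_of(seq, first)
        if not loop1.endswith("g"):          # step 2
            continue
        end1 = first[0] + 2 * first[1] + first[2]
        for second in candidates:            # step 3
            if not 0 <= second[0] - end1 <= max_gap:
                continue
            loop2 = loop_of(seq, second)
            if pairs(loop1[:3], loop2[:3]):  # step 4
                hits.append((first, second))
    return hits
```

Each hit is a pair of (start, stem length, loop length) tuples, one per stem loop. Allowing mismatches in the stems, as the package is said to do, would mean replacing the exact test in pairs with a count of non-pairing positions.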
The OVERSEER programs (one to describe the sequence to be found, the
other to find it) are supplied as source code, written in Pascal, not
as compiled programs, and the authors encourage users to modify them.
The package is available for UNIX and VAX by e-mail (not ftp at the
moment) from the EMBL File Server, where it is rather uninformatively
described as a "package for searching nucleic acid sequence data
bases". Both UNIX and VAX versions are in uuencoded (i.e.,
e-mailable) files called OVERSEER.UUE, which include clear
instructions for using the package. I had to remove one line to get
one of the programs to work correctly on a Sun 4/470 running SunOS
4.1; I can provide details to anyone who is interested.
For information on decoding the file, see the accompanying posting, headed
"Uudecoding files with SunOS 4.1"
A further comment may be helpful to those who, like myself, are new to
the UNIX world. The instructions for obtaining UNIX software from the
EMBL File Server (to get these send the message HELP UNIX_SOFTWARE to
netserv @ earn.embl) tell one to obtain from the File Server a program
called UUD.C to decode uuencoded files. This program is supplied as
source code in ANSI 'C' and cannot be compiled with Sun 'C'. However,
our computer has a program called "uudecode", which does the job with
no trouble. According to our computer service, "uudecode is part of
the UUCP (UNIX-to-UNIX-copy) suite of programs, which is falling into
disuse in many areas. Since UUCP is often an installation option you
may find that many machines, especially small ones, don't have
uudecode. However there are readily available re-writes in C which
are likely to be more portable than UUD.c."
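
Where neither uudecode nor a C compiler is to hand, the decoding step itself is simple enough to sketch in a few lines. The following Python example is an illustration only, not a replacement for the tools mentioned above: it assumes a single well-formed uuencoded file in the text, relies on the standard library function binascii.a2b_uu to decode each data line, and uses a hypothetical function name.

```python
import binascii


def uudecode_text(text):
    """Decode one uuencoded file embedded in text.
    Returns (filename, data) where data is a bytes object."""
    lines = iter(text.splitlines())

    # Skip mail headers and other preamble until the "begin MODE NAME" line.
    name = None
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "begin" and parts[1].isdigit():
            name = parts[2]
            break
    if name is None:
        raise ValueError("no 'begin' line found")

    chunks = []
    for line in lines:
        if line.strip() == "end":
            return name, b"".join(chunks)
        if line:
            # Each data line carries its own length in its first character.
            chunks.append(binascii.a2b_uu(line))
    raise ValueError("truncated input: no 'end' line")


# Round trip on a small made-up sample.
sample = ("begin 644 hello.txt\n"
          + binascii.b2a_uu(b"Hello, OVERSEER\n").decode("ascii")
          + "`\nend\n")
print(uudecode_text(sample))
```

The caller decides what to do with the returned filename and bytes; writing them to disk is left out so that the sketch cannot overwrite anything by accident.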
Ethan Hack, Department of Biological and Nutritional Sciences,
The University, Newcastle upon Tyne NE1 7RU, England
Phone: +44 91 222 6000 ext. 8576 Fax: +44 91 222 6720