How do I find sites?

mangalam at SALK-SC2.SDSC.EDU mangalam at SALK-SC2.SDSC.EDU
Sat Sep 26 20:28:42 EST 1992

Brian Osborne writes:
>I am a C beginner learning how to program on my NeXT (!). I thought a
>useful program for me to attempt to make would find all restriction
>sites in a sequence and print it. What sort of methods (structures, hash
>tables, lookup tables, etc.) should I study? How does a program like
>(the great) DNA Strider for the Mac perform this operation so quickly?
>I donUt mean specifically, just the principles I should pay attention to.
>As you can guess, IUm much more biologist than programmer

   As one in a similar state, here are some remarks:
   Strider achieves its exceptional speed by using a prehashed table of
hexamer restriction sites (that's why there's that annoying pause when it
loads - it's recalculating the hash table each time).  Using this method,
when Strider goes to check if a sequence is recognized by a restriction
enzyme (RE), it only has to check its hash table to see if any RE will cut
there; if there is none (and about 75% of the time there isn't) then it can
quickly proceed to the next 6mer.  If the 6mer is recognized, Strider then
checks the list of REs that _can_ cut there until it is finished with the
list.  (Strider itself is slightly more complex than this might imply; in
his intro, Marck says that V1.2 is up to about 22K lines of code - even
with NeXT's wondrous NeXT-Step, it will be more than an afternoon's work to
dupe it)
   While this technique certainly wasn't discovered by Marck, he describes
it briefly and clearly in his paper about Strider (1988, NAR 16(5):p1829) -
that would probably be a good place to start; he also apparently makes a
lot of use of quicksort for his lists.
   Despite the fact that hashing is not a new technique, and the tremendous
advantage in speed, many commercial developers of molbio softwware have
either not implemented it or botched the coding very badly (see the timing
chart at the end of my mini pseudo review).  There is a spread of over 2
orders of magnitude in speed from Strider (fastest) to the slowest
(Hitachi's DNAsis and MacMolly Tetra).  
   On the other hand, as Don Gilbert and others have mentioned string
matching as a simpler alternative method and since they have much more
experience than I, I'd tend to believe them, but if you want a good
exercise in C, the hashing approach is certainly faster.
   Study everything you mentioned - you'll use it all eventually.  Though
it's not nearly as weighty as some of the other books, the 2nd edition of
K+R is still a great little book and tho it pains me to say so, most of the
documentation I've read from Microsoft Press is also pretty good (altho
their compiler products are considerably less so).  

   I find it takes about 4 passes thru a procedure - once in plain english,
once in very garbled pseudo-code with lots of arrows and graffiti, once in
slightly better pseudo-code and finally in real code. (And then a lot of
time debugging).
   I don't know what compiler you're using, but if it's the Objective C
that ships with the NeXT, beware that it's not ANSI C nor C++, so code
probably won't port very easily unless it has a checking function (if
that's a concern - probably not since you're working on a NeXT 8)).
   Some other points:
   Despite its multiple and manifest shortcomings, I've chosen DOS as a
programming platform simply because of the incredible number of cheap tools
and tricks.  One example is Instant C - an incremental compiler that allows
you to do all kinds of astonishing things with code - very quick debugging
because of its incremental nature (it only recompiles what you've changed),
on the fly functions and pseudo functions (you can call a function that you
haven't written yet; just tell it what to return so that you can go on
coding), etc, etc.  And it breaks the 640K barrier because of a builtin
extender - your programs can go to 16MB!  Of course, it's not an optimizing
compiler, so you still have to go thru one (it's set up for MS C 6, but
will accept others with some arm twisting) and it will enforce ANSI
compliance and all kinds of error checking if you tell it to.  Because of
the 16 MB memory extension and the ANSI/error checking, you can use it to
write portable code relatively easily (and cheaply - it was only $100 as an
introductory offer).  It also does not do C++ or Windows, but that's
another story altogether.  Sold by Rational Systems (800-666-2585).

Good Luck with your project...

Harry Mangalam                                   Vox:(619) 453-4100, x250
Dept of Biocomputing                                   Fax:(619) 552-1546
The Salk Institute                             mangalam at
10010 N Torrey Pines Rd                        mangalam at
La Jolla CA 92037                                    mangalam at salk.bitnet

More information about the Bio-soft mailing list