Internal repeats Q & A

CSHL wchang at phage.cshl.org
Tue May 18 07:57:36 EST 1993


In article <930514075122.5345 at CUCCFA.CCC.COLUMBIA.EDU> MARK at CUCCFA.CCC.COLUMBIA.EDU (T. Mark Reboul) writes:
>Regarding Gaston Gonnet's solution to Wentian Li's internal repeats problem....
>
>Beautiful, not to mention elegant, solution!
>
>However, I wonder if that solution remains "trivial" in practice for 
>substantially longer sequences, say, sequences with thousands of elements.
>Then there is a question of which sorting algorithm to use, how much 
>computing will be required, and how much dynamic memory is needed for 
>temporary data.
>
>For one of these larger repeat-finding exercises, is it not possible that 
>the computing resources required (CPU + memory) can be lessened by exploiting
>the specific goal (i.e., that an identification of internal repeats is all
>that is sought).
>
>I'm no expert, but there seems to have been a lot of work done on "suffix
>trees" and their efficient construction, in the pure c.s. field, related 
>to lexicographic ordering/sorting problems, and perhaps some of those 
>algorithms might become useful for handling longer sequences.
>
>Mark Reboul
>Columbia University
>Columbia-Presbyterian Cancer Center Computing Facility
>mark at cuccfa.ccc.columbia.edu

Having done some of the suffix tree work, I can say that the answer depends...
on what you want to do.  The radix sorting method will in all likelihood be
fast enough (as Gaston pointed out, N log_b N to sort, where b is the alphabet 
size--but the slow part is outputing the repeats given the sorted list). 
It requires very little memory to sort, just a permutation of the indices.  

The real questions are how should the output itself be sorted, and should
it be represented succinctly, since the set of repeats can be larger than
the original sequence?

If performance is really a problem, there is a fantastically engineered 
speedup by Myers and Manber called "suffix arrays".  The details are 
only slightly complicated :-) and code is probably obtainable from them.

But if more functionality is required (one has to be specific), or if 
"real-time" performance is critical on very large sequences (say all of 
SwissProt), then suffix tree may be the answer.  I used it for sublinear 
expected time approximate string matching, on-line substring queries, etc.
If someone is interested in comparing the two approaches, I can probably 
dust off my old suffix tree code; to my knowledge, it was used by folks
at Bell for finding duplicate source code in very large libraries.

As for reverse complement, one simply finds the repeats in the concatenation
of the input and its reverse complement.  Alternatively one can "feed" the r.c.
to a suffix tree of the sequence in linear time.

Sorry for the jargons in this post.  If there's interest I can elaborate
in plainer terms.

Aside: is there general interest in software such as this, or blast/fasta/
sim (Smith-Waterman-Miller-Huang) output parsers, etc.?  The last time I 
submitted a very meager proposal (NIH postdoc training grant) to turn 
algorithms into software, it got clobbered ("very weak... just an extension 
of PhD work" :-)


-- Bill Chang (wchang at cshl.org)  Cold Spring Harbor Lab, NY




More information about the Bio-soft mailing list