# Internal repeats Q & A

CSHL wchang at phage.cshl.org
Tue May 18 07:57:36 EST 1993

```
In article <930514075122.5345 at CUCCFA.CCC.COLUMBIA.EDU> MARK at CUCCFA.CCC.COLUMBIA.EDU (T. Mark Reboul) writes:
>Regarding Gaston Gonnet's solution to Wentian Li's internal repeats problem....
>
>Beautiful, not to mention elegant, solution!
>
>However, I wonder if that solution remains "trivial" in practice for
>substantially longer sequences, say, sequences with thousands of elements.
>Then there is a question of which sorting algorithm to use, how much
>computing will be required, and how much dynamic memory is needed for
>temporary data.
>
>For one of these larger repeat-finding exercises, is it not possible that
>the computing resources required (CPU + memory) can be lessened by exploiting
>the specific goal (i.e., that an identification of internal repeats is all
>that is sought).
>
>I'm no expert, but there seems to have been a lot of work done on "suffix
>trees" and their efficient construction, in the pure c.s. field, related
>to lexicographic ordering/sorting problems, and perhaps some of those
>algorithms might become useful for handling longer sequences.
>
>Mark Reboul
>Columbia University
>Columbia-Presbyterian Cancer Center Computing Facility
>mark at cuccfa.ccc.columbia.edu

Having done some of the suffix tree work, I can say that the answer depends...
on what you want to do.  The radix sorting method will in all likelihood be
fast enough (as Gaston pointed out, N log_b N to sort, where b is the alphabet
size--but the slow part is outputting the repeats given the sorted list).
It requires very little memory to sort, just a permutation of the indices.

The real questions are how should the output itself be sorted, and should
it be represented succinctly, since the set of repeats can be larger than
the original sequence?

If performance is really a problem, there is a fantastically engineered
speedup by Myers and Manber called "suffix arrays".  The details are
only slightly complicated :-) and code is probably obtainable from them.

But if more functionality is required (one has to be specific), or if
"real-time" performance is critical on very large sequences (say all of
SwissProt), then a suffix tree may be the answer.  I used it for sublinear
expected time approximate string matching, on-line substring queries, etc.
If someone is interested in comparing the two approaches, I can probably
dust off my old suffix tree code; to my knowledge, it was used by folks
at Bell for finding duplicate source code in very large libraries.

As for reverse complement, one simply finds the repeats in the concatenation
of the input and its reverse complement.  Alternatively one can "feed" the r.c.
to a suffix tree of the sequence in linear time.

Sorry for the jargon in this post.  If there's interest I can elaborate
in plainer terms.

Aside: is there general interest in software such as this, or blast/fasta/
sim (Smith-Waterman-Miller-Huang) output parsers, etc.?  The last time I
submitted a very meager proposal (NIH postdoc training grant) to turn
algorithms into software, it got clobbered ("very weak... just an extension
of PhD work" :-)

-- Bill Chang (wchang at cshl.org)  Cold Spring Harbor Lab, NY

```
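
For readers who want the sort-based idea from the post in concrete form, here is a minimal Python sketch. It is an illustration under stated assumptions, not Gaston Gonnet's code and not the Manber-Myers suffix array construction: it uses a naive comparison sort of suffixes where the post describes a radix sort, and all function names, the `min_len` parameter, and the `#` separator are hypothetical choices made for this example. It sorts the suffix start positions, then reports the shared prefix of each pair of neighbours in sorted order. The reverse-complement case is handled as the post suggests, by running the same search on the input concatenated with its reverse complement.

```python
def suffix_order(seq):
    """Return suffix start positions of seq in lexicographic order.

    Naive sketch: slicing copies each suffix, so this is only suitable
    for short inputs. A radix sort or a real suffix array construction
    would keep just the permutation of indices.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def common_prefix_len(seq, i, j):
    """Length of the longest common prefix of seq[i:] and seq[j:]."""
    n = len(seq)
    k = 0
    while i + k < n and j + k < n and seq[i + k] == seq[j + k]:
        k += 1
    return k


def internal_repeats(seq, min_len=2):
    """List (length, pos_a, pos_b, substring) for repeated substrings.

    Only neighbours in sorted suffix order are compared, so a substring
    that occurs many times is reported as a chain of adjacent pairs
    rather than as every possible pair.
    """
    order = suffix_order(seq)
    hits = []
    for a, b in zip(order, order[1:]):
        k = common_prefix_len(seq, a, b)
        if k >= min_len:
            i, j = sorted((a, b))
            hits.append((k, i, j, seq[i:i + k]))
    # Longest repeats first, then by position (one possible output order).
    hits.sort(key=lambda h: (-h[0], h[1], h[2]))
    return hits


# Complement table for upper-case DNA only (assumption: no ambiguity codes).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case ACGT string."""
    return dna.translate(_COMPLEMENT)[::-1]


def repeats_with_reverse_complement(dna, min_len=2, sep="#"):
    """Find repeats in dna + sep + reverse_complement(dna).

    The separator is an assumption added here: because it occurs once,
    no reported repeat can span the junction between the two halves.
    """
    joined = dna + sep + reverse_complement(dna)
    return internal_repeats(joined, min_len)


if __name__ == "__main__":
    for hit in internal_repeats("ABCABC"):
        print(hit)
```

How the hits are ordered and whether overlapping or nested repeats should be collapsed are exactly the open output questions raised in the post; the ordering used here (longest first) is just one choice.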