Internal repeats Q & A

Gaston Gonnet gonnet at inf.ethz.ch
Fri May 14 11:19:03 EST 1993


In article <930514075122.5345 at CUCCFA.CCC.COLUMBIA.EDU> MARK at CUCCFA.CCC.COLUMBIA.EDU (T. Mark Reboul) writes:
>Regarding Gaston Gonnet's solution to Wentian Li's internal repeats problem....
>
>Beautiful, not to mention elegant, solution!
>
>However, I wonder if that solution remains "trivial" in practice for 
>substantially longer sequences, say, sequences with thousands of elements.
>Then there is a question of which sorting algorithm to use, how much 
>computing will be required, and how much dynamic memory is needed for 
>temporary data.
>
>For one of these larger repeat-finding exercises, is it not possible that 
>the computing resources required (CPU + memory) can be lessened by exploiting
>the specific goal (i.e., that an identification of internal repeats is all
>that is sought).
=========================================================
You are right, you would be building a suffix tree, and this
was shown to be possible to do in linear time.  Just reading
the data will take linear time! so you can hardly improve this
method.

Sorting will be O(n*log(n)) (for a sequence with n amino acids)
and this will be acceptable even for the largest databases
around (even if they were considered a single sequence).  Darwin
does create such an index for all its internal computation, and
the building for the entire SwissProt25 (10M aa) takes about
30 minutes on a Sparc workstation.

We have found that the suffix tree algorithms are too complicated
and will give no gains for the present day databases.  We use
a minor variation of plain sorting.  A suffix tree algorithm will
require a more complicated data structure, which will consume
a much higher proportion of memory.

Gaston Gonnet.  ETH Zurich.




More information about the Bio-soft mailing list