internal repeats

Gaston Gonnet gonnet at inf.ethz.ch
Fri May 14 04:51:02 EST 1993


In article <1survlINNdfl at phage.cshl.org> wli at phage.cshl.org ( CSHL) writes:
>
>i'm looking for a C code which lists *all* internal repeats
>in a sequence. no restriction on the repeating unit size,
>the distance between the repeat units, or anything. this
>list has to be complete.
>
>is it a trivial task or a tricky one? i'm new to this topic, 
>so i have no idea. any pointer is appreciated...( public
>domain code, reference... or your own program-- i don't need
>a complete code, only the subroutine dealing with this task
>would be fine).
>
>thanks in advance! -- w. li
>
>wli at cshl.org
>
===================================
This is a trivial task.  Consider your sequence with n
amino acids as n sequences, one starting at pos 1, one
at pos 2, etc. until one starting at pos n.  This is
done without any copying, just pointing to each position.
Now sort the pointers in lexicographical order by the pointed
text.  All the information that you want will be directly
available from the sorted list.

An example helps.

If the sequence is: MARRARRPRGRFYAFRRG  (20 aa  long)

these are the 20 subsequences
	MARRARRPRGRFYAFRRGRW
	ARRARRPRGRFYAFRRGRW
	RRARRPRGRFYAFRRGRW
	RARRPRGRFYAFRRGRW
	ARRPRGRFYAFRRGRW
	RRPRGRFYAFRRGRW
	RPRGRFYAFRRGRW
	PRGRFYAFRRGRW
	RGRFYAFRRGRW
	GRFYAFRRGRW
	RFYAFRRGRW
	FYAFRRGRW
	YAFRRGRW
	AFRRGRW
	FRRGRW
	RRGRW
	RGRW
	GRW
	RW
	W

These are the above strings sorted in lexicographical order
	AFRRGRW
	ARRARRPRGRFYAFRRGRW
	ARRPRGRFYAFRRGRW
	FRRGRW
	FYAFRRGRW
	GRFYAFRRGRW
	GRW
	MARRARRPRGRFYAFRRGRW
	PRGRFYAFRRGRW
	RARRPRGRFYAFRRGRW
	RFYAFRRGRW
	RGRFYAFRRGRW
	RGRW
	RPRGRFYAFRRGRW
	RRARRPRGRFYAFRRGRW
	RRGRW
	RRPRGRFYAFRRGRW
	RW
	W
	YAFRRGRW

Notice that all the repeats are next to each other.  For example,
both instances of ARR, three instances of RR, nine instances of R, etc.

Gaston H. Gonnet, ETH Zurich.




More information about the Bio-soft mailing list