internal repeats

Fri May 14 09:56:00 EST 1993

>i'm looking for a C code which lists *all* internal repeats
>in a sequence. no restriction on the repeating unit size,
>the distance between the repeat units, or anything. this
>list has to be complete.
This is a trivial task.  Consider your sequence with n
amino acids as n sequences, one starting at pos 1, one
at pos 2, etc. until one starting at pos n.  This is
done without any copying, just pointing to each position.
Now sort the pointers in lexicographical order by the pointed
text.  All the information that you want will be directly
available from the sorted list.

	I wouldn't be so sure it is such a 'trivial task'.
Althought that algorithm may result complete, it has some
little disadvantages as I see.

	The major ones derive of the statement of the problem
itself, which I find somehow undefined. For instance,
	- are you interested only in direct repeats or in reverse
or inverted repeats too? When working with sequences, this is very
	- what kind of sequences do you want to work out, nt or
aa sequences or both?
	- what do you consider an 'internal repeat'? when working
with sequences you usually define a 'match' not as an exact one
but as a set of characters with at least some proportion in common.

	There are also some minor problems arising: for instance,
the length of the sequence may be a problem when using the sorting
algorithm. But also when showing the results:
	- if you only want to see the matching region, you'll need
additional comparisons after the sort to find it (assuming you use
the above algorithm).
	- if yoou want to see them by match-extent or by position,
(the usual case) then you need to know, not only the position where
the match begins, but also where it ends (the last problem again)...

	In my opinion it is best to look for an algorithm that
outputs at least this additional information so that you can rearrange
the repeats appropriately and automatically (in the mentioned algo-
rithm you decide that by visual inspection).

	Oh, and surely there may be other considerations... But depending
on wath you intend, in the simplest case, this could be the most inmediate

	Hope this helps

			J. R. Valverde
		Biomedical Research Insitute
			Madrid - SPAIN

More information about the Bio-soft mailing list