[Bio-software] sequence partitioning
Michal Walicki
via bio-soft%40net.bio.net
(by Michal.Walicki from ii.uib.no)
Wed Feb 3 13:38:54 EST 2010
Gentle people,
I started to work with sequences a couple of months ago and
encountered a problem which may be different from those faced by
biological sequence analysis but, hopefully, you can give me an
answer or, at least, point me in the right direction.
The problem:
partition a given sequence S into a minimal number of patterns.
(We assume some finite alphabet and, of course, S is finite. The
problem gets a bit different flavour when S may not have a partition,
so let us assume that a partitioning of S is possible.)
A pattern P is any sequence with all symbols distinct and length at least 2.
P occurs in a sequence S if P is a subsequence of S (not necessarily
a substring).
A subsequence of S is an
- do(P), disjoint occurence of P, if it consists of (at least 2)
disjoint occurrences of pattern P
and it is an
- mdo(P), maximal do(P), if it is do(P) and there are no occurrences
of P in the rest of S, i.e., in
S \ mdo(P) = the subsequence of S obtained by removing the subsequence mdo(P).
A partition (exact cover) of a sequence S is given by
a) a set of mutually distinct (not necessarily disjoint) patterns
P1...Pn, with
b) mutually disjoint, nonempty subsequences do(P1)...do(Pn) of S which,
c) together, constitute the whole S.
It is minimal if there is no partition with a smaller set a) (with
fewer patterns).
It is MDO if all do(Pi) in point b) are, actually, mdo(Pi).
Question 1.
The problem seems - obviously - NP-hard but does anybody know of a
proof of that?
Question 2.
Suppose that
i) there is a minimal partition with N patterns, and
ii) there is some MDO partition (obviously, with at least N patterns).
Is there, then, an MDO partition with N patterns?
Conjecture 2 (the answer Yes to question 2) is quite intriguing
since, on the one hand, it seems plausible - a minimal partition
should contain mdo's of involved patterns but, on the other hand,
there seems to be no reason why it should happen to be the case. (If
we drop the assumption ii), it is, in fact, wrong, i.e., there are
sequences which can be partitioned with N patterns, but which have no
MDO partition.) We have run tens of tests (on random sequences (with
exact partitions) up to 20-30 symbols) and never met a counterexample
to Conjecture 2.
Any hints are appreciated
Michal
--
--------------------
Assoc. Prof. Michal Walicki
Institute of Informatics
University of Bergen
5020 Bergen, Norway
http://www.ii.uib.no/~michal
(on leave at IST, Lisbon, until 8.2010)
More information about the Bio-soft
mailing list