[Bio-software] sequence partitioning

Michal Walicki via bio-soft%40net.bio.net (by Michal.Walicki from ii.uib.no)
Wed Feb 3 13:38:54 EST 2010


Gentle people,
  I started to work with sequences a couple of months ago and 
encountered a problem which may be different from those faced by 
biological sequence analysis but, hopefully, you can give me an 
answer or, at least, point me in the right direction.

The problem:
partition a given sequence S into a minimal number of patterns.

(We assume some finite alphabet and, of course, S is finite. The 
problem gets a bit different flavour when S may not have a partition, 
so let us assume that a partitioning of S is possible.)

A pattern P is any sequence with all symbols distinct and length at least 2.
P occurs in a sequence S if P is a subsequence of S (not necessarily 
a substring).
A subsequence of S is an
- do(P), disjoint occurence of P, if it consists of (at least 2) 
disjoint occurrences of pattern P
and it is an
- mdo(P), maximal do(P), if it is do(P) and there are no occurrences 
of P in the rest of S, i.e., in
S \ mdo(P) = the subsequence of S obtained by removing the subsequence mdo(P).

A partition (exact cover) of a sequence S is given by
   a) a set of mutually distinct (not necessarily disjoint) patterns 
P1...Pn, with
   b) mutually disjoint, nonempty subsequences do(P1)...do(Pn) of S which,
   c) together, constitute the whole S.
It is minimal if there is no partition with a smaller set a) (with 
fewer patterns).
It is MDO if all do(Pi) in point b) are, actually, mdo(Pi).

Question 1.
The problem seems - obviously - NP-hard but does anybody know of a 
proof of that?

Question 2.
Suppose that
    i) there is a minimal partition with N patterns, and
   ii) there is some MDO partition (obviously, with at least N patterns).
Is there, then, an MDO partition with N patterns?

Conjecture 2 (the answer Yes to question 2) is quite intriguing 
since, on the one hand, it seems plausible - a minimal partition 
should contain mdo's of involved patterns but, on the other hand, 
there seems to be no reason why it should happen to be the case. (If 
we drop the assumption ii), it is, in fact, wrong, i.e., there are 
sequences which can be partitioned with N patterns, but which have no 
MDO partition.) We have run tens of tests (on random sequences (with 
exact partitions) up to 20-30 symbols) and never met a counterexample 
to Conjecture 2.

Any hints are appreciated

	Michal
-- 

--------------------
Assoc. Prof. Michal Walicki
Institute of Informatics
University of Bergen
5020 Bergen, Norway
http://www.ii.uib.no/~michal
(on leave at IST, Lisbon, until 8.2010)



More information about the Bio-soft mailing list