how to find optimal test coverage of protein sequences
arzewski at hotmail.com
Fri Apr 12 18:05:23 EST 2002
I am trying to work out how to define the protein sequences for a
possible vaccine. The vaccine needs to cover about 1000 variants of a
protein that differ only slightly (by mutations) from one another, and
I am looking for an algorithm that can find a set of sequences that
cumulatively covers, say, 90% of those protein sequences. Each protein
sequence is about 1500 amino acids long. The sequences to be used in
the vaccine are each about 300 amino acids long and have at least 40
distinct amino acids in each.
The general idea is to find the minimal set of amino-acid sequences,
each 300 residues long, that as a whole gives the most coverage of an
existing set of about 100 protein sequences, each about 1500 amino
acids long.
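Stated this way, the problem resembles the classic set-cover (or
maximum-coverage) problem, which is NP-hard in general but has a simple
greedy heuristic. The following is only a rough sketch of that greedy
idea, not an established vaccine-design method. It assumes, purely for
illustration, that a fragment "covers" a protein when it occurs in it
as an exact substring; the function names candidate_windows and
greedy_cover and the window and target parameters are hypothetical. A
realistic version would need a similarity-based definition of coverage.

```python
import math
from typing import Dict, List, Set, Tuple


def candidate_windows(sequences: List[str], window: int) -> Dict[str, Set[int]]:
    """Map every distinct substring of length `window` to the indices
    of the sequences that contain it (assumed coverage rule: exact match)."""
    covers: Dict[str, Set[int]] = {}
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - window + 1):
            covers.setdefault(seq[start:start + window], set()).add(idx)
    return covers


def greedy_cover(sequences: List[str], window: int,
                 target: float = 0.9) -> Tuple[List[str], float]:
    """Greedily pick fragments until `target` fraction of sequences is covered.

    Returns the chosen fragments and the fraction actually covered.
    """
    if not sequences:
        return [], 0.0
    covers = candidate_windows(sequences, window)
    needed = math.ceil(target * len(sequences))
    covered: Set[int] = set()
    chosen: List[str] = []
    while len(covered) < needed:
        # Pick the fragment that adds the most not-yet-covered sequences.
        best = max(covers, key=lambda f: len(covers[f] - covered), default=None)
        if best is None or not (covers[best] - covered):
            break  # no remaining fragment improves coverage
        chosen.append(best)
        covered |= covers[best]
    return chosen, len(covered) / len(sequences)


if __name__ == "__main__":
    toy = ["AAAB", "AAAC", "AAAD", "XYZW"]  # made-up toy data
    print(greedy_cover(toy, window=3, target=0.75))
```

With real data the window would be about 300 and the inputs about 1500
residues each. The greedy choice does not guarantee the smallest
possible set, only a reasonable approximation to it.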
Something like this was done years ago for testing microchips: to
exercise the possible transistor states, a set of input patterns such
as "01010101" is fed into the chip's IN pin, and the result is read
from the chip's OUT pin and compared. Because the number of possible
transistor states grows exponentially, and large chips may have
thousands or millions of transistors, the goal is to find a "minimal"
set of input test sequences that provides "most" of the test coverage.
I expect something like this has probably been done in the field of
bioinformatics.
Any leads or tips?