Algorithm for MicroArray Problem
Original
Problem Description
(somewhat abbreviated) Problem
Description:
Input: a dna sequence (composed of
nucleotides A, C, G & T) of a certain length, eg.
TCGGATTAGCG.
Output: a set of subsequences of the input
(let's call them strands) that can be used to verify the original
sequence.
How does verification work?
If an output strand
occurs in the original sequence, a light shows in a given color. Where in the
sequence the strand is and how long the strand is (position 3-6, position 8-15,
etc.) is irrelevant...except:
- if the original sequence starts
with the given strand, the light shows in a special distinguishable
color.
- if a given strand occurs both at the start and somewhere
else in the middle too, the light shows in another special distinguishable
color.
Also allowed to factor in the knowledge of how many times each
nucleotide occurs in the sequence (eg. 7 As, 5 Cs, etc.).
What's a
good solution?
Lexicographic order on (length of longest strand, #
strands used), lowest wins.
My algorithm runs in 2 stages:
What length should the
strands be?
We want the length of the longest strand to be as
small as possible. I figured if we're going to use 1 strand of length say 7,
then they may as well all be of length 7. Having the others be of lower length
doesn't make our solution any better (since the length of the longest strand is
the same), and if they're shorter, we'd probably have to use more strands to
cover the whole sequence, which would make our solution worse (since it would
increase the # strands used).
That being said, what is this shortest
length?
There are two trivial cases:
0 strands of length
0: can be used to verify a sequence composed of one nucleotide only
(eg. AAAAA). why? because we know the # of times each nucleotide occurs.
1
strand of length 1: can be used to verify a sequence of the pattern
CTTTTTTT. Specify the C, and the rest can be determined from the nucleotide
counts.
OK, those were trivial. Where do we go from here? Start with
strand length = 2, and keep incrementing by 1 as long as we can prove that
strands of the current length cannot uniquely determine the original
sequence.
So how do we prove this cannot uniquely determine
'ness?
Check 1
Find all subsequences of the original
sequence of length 2*lenStrand-1. Exclude the one starting at the
beginning of the original sequence (cuz this lights differently and hence has a
uniquely determined position). Build a list of all pairs of these subsequences
(1st & 2nd, 1st & 3rd, 1st & 4th,...,2nd & 3rd, 2nd & 4th,
...). Some of these pairs will have member subsequences that were overlapping in
their positions in the original sequence, while most of them will not (the 3rd
& the 4th probaly overlap, but the 3rd & the 19th probably
dont).
For a pair that does not overlap in the original sequence, if the
members of the pair are identical except for their central letters, we can
switch these central letters to create an alternate sequence that will respond
to strands of length lenStrand exactly the same way as the original:
lenStrand = 4
subsequence length = 2*lenStrand-1 = 7
seq: ...gttCaga...gttTaca...
alt: ...gttTaga...gttCaca...
For a pair that does overlap in the original sequence, if the members of
the pair are identical except at the indices corresponding to each of
their centers, we can switch the chars at positions in the sequence
corresponding to their centers to create an alternate...:
lenStrand = 3
subsequence length = 2*lenStrand-1 = 5
seq: C G A A A T A A A
indices in original seq: 0 1 2 3 4 5 6 7 8
x = subsequence # 2: A A A T A
y = subsequence # 3: A A T A A
center of x = idxMidX = 4
center of y = idxMidY = 5
check identical'ness
1st char: x(0) == seq(2) == A y(0) == seq(3) == A equal
2nd char: x(1) == seq(3) == A y(1) == seq(4) == A equal
3rd char: x(2) == seq(4) == seq(idxMidX) allowed to be different
4th char: x(3) == seq(5) == seq(idxMidY) allowed to be different
5th char: x(4) == seq(6) == A y(4) == seq(7) == A equal
if we switch the chars at idxMidX & idxMidY, we get:
alt: C G A A T A A A A
this alt cannot be distinguished from the original seq by strands of length 3
Check 2
Find all subsequences of the original sequence
of length lenStrand-1. Exclude the one starting at the beginning. These
represent the maximal overlaps between strands of length
lenStrand. eg. if lenStrand=3, maximal overlap is of length
2.
If any of these maximal overlaps occur three times: maximal overlap occuring 3 times: x
seq: AxBxCxD
alt: AxCxBxD
eg:
x: ac
seq: CTacTacGGacCG
alt: CTacGGacTacGG
At least one of B or C must be non-empty, otherwise the rearrangement
would not have changed the sequence. If they were the same, the sequence would
have already failed Check 1.
If there are two of these maximal
overlaps that occur twice each and the positions they occur at are alternating: maximal overlaps occuring twice in alternating order: x & y
seq: AxByCxDyE
alt: AxDyCxByE
eg:
x: ac
y: gt
seq: AGacGTCgtCCAGATacTgtTAA
alt: AGacTgtCCAGATacGTCgtTAA
B and D must be different and C must be non-empty. If B and D were the
same, the rearrangement would not have changed the sequence. If C was non-empty,
the sequence would have already failed Check 1.
Check
3
Make sure the end of the sequence is determinable. To ensure this,
the last unique strand must cover the last character in the sequence, or all
characters after the last character of the last unique strand must be the same.
Otherwise..
chars at the end not covered by unique strand set: x & y
seq: ...xy
alt: ...yx
Everything earlier is the same, the rearranging x & y doesn't change
the nucleotide counts.
If a candidate strand length passes all these
checks, I think that if a verifier were fed every single
subsequence of this length (which would mean maximal overlaps everywhere),
there would be no way to rearrange the strands to find an alternate sequence
that matched every input strand.
Try minimize the # of strands
used
Now that we know what the minimum size for strands that can
uniquely verify the sequence is, we want a minimal subset of these.
We
could try every possible combination of these, but this is a search space of
size O(maxStrands!) which is a bit too large. So instead, we try to drop
strands selectively:
When we drop a strand, we still want to ensure that
the region of the sequence it was covering is still in some way uniquely
specified. I say in some way becuase it doesn't actually have to be
unique. What we really need is, going rightwards from the start strand (whose
position is uniquely determined because it lights in a special way), if at any
point we tried to connect a strand in a way that doesn't match the original
sequence, it wouldn't work because either
- a change in overlaps
would make the resulting alternate sequence too long
- the
nucleotide counts would be wrong
So I start with every possible strand of
the length we're working with. Then I build this 3d matrix containing every
possible overlap of every one of these strands with each other. Matrix
dimensions are:
maxStrands * maxStrands * maxOverlap
If
non-consecutive strands do not overlap correctly for any of the possible
maxOverlap char-length overlaps, that entry would be null. This matrix
now contains a mix of real (in the original sequence) overlaps, and
fake ones (which could be exploited to rearrange the strands to form an
alternate sequence). We end up having tons of entries for single-char overlaps,
and lower counts as the length of the overlap increases. Higher char-length
yields more overlaps that are unique.
Now I look for minimal-length
unique real overlaps of length less than maxOverlap. When I find
one that's unique, all super-overlaps of that must be unique
too:
Suppose lenStrand = 5 and we have a section of the original
sequence ...AGCATTA... which yields strands AGCAT, GCATT, &
CATTA.
Suppose the overlap of CAT b/w AGCAT &
CATTA is unique. Then the super-overlaps GCAT &
CATT must be unique too. (if there was another GCAT
somewhere else, then there would be another CAT too,
right?).
So this unique CAT tells us that going rightwards
from the start, once we add AGCAT and we're looking for the next
strand to add, if we wanted to do an overlap of 4 using the GCAT,
there is only one choice - the real one. If we wanted to do an overlap of 3
using the CAT, there is also only one choice - the real one. We could
do a fake overlap (trying to find an alternate sequence) of 2 or 1 (with
some other strand starting with AT or just T), or even 0, but then the alternate
sequence would turn out too long.
So...we have uniquely specified what
needs to come after AGCAT.
And...since the connection is
uniquely specified with just the CATTA, the GCATT is
redundant!! So we can drop the GCATT, and our strand count has
dropped by one.
Clear all GCATT's entries from the matrix
(since it is no longer part of our strand set), and if we're lucky, the removal
of these entries will make some more overlaps unique opening up more chances for
eliminating redundant strands.
Repeat...until we can't find any more
redundant strands.
What's left is our minimal subset.