This page is to discourage someone from trying the same thing. :) Turns out that hoping that a single binding of the DNA strand on each side is insufficient. (Duh.. i don't know why i didn't see it.)
I think there may be hope to recover though, b/c when this does produce an answer that works, it produces a really nice answer. (low strand count, and lots of short strands). However, this might just be luck.

source code here

DNA MicroArray matching

Step 1:   Find the "language" of the potential solution

Make a window of size W=2.   
Label:  WINDOW
Slice up the reference DNA into window size chunks.
Put these chunks in a Set.   (Sets don't allow duplicates)
If the size of the Set < the number of possible windows of size W on the reference DNA,
    add this Set to a master Set
    W++;
    goto WINDOW
else move on

This slices up the string
into substring size pieces, adding them to the potential solution set.  
It stops adding when the window size is so large that no more repetition is possible.  (sort of like
not being able to compress the string with any larger window)


CHANGE:    before starting, run down the string to find the longest repeated sequence.  The length of this +1 is the "window break"
size.    if in the process of increasing the windows size, we hit window break, then move on to step 2.

GCCCGTGCCA

CA
TG
CG
CC
GT
GC
CGT
CCC
CCG
CCA
TGC
GTG
GCC

Step 2: Reduce the language set until a solution presents itself

Convert the language Set into an array of strands.
Sort the array of strands by length, then by the first index in which they appear in the list.

Now, make a counter matrix with the strands going down one size and the reference DNA on the other side.
Going down the matrix on the strand size; wherever the strand overlaps the reference DNA, place a 1.

GCCCGTGCCA
111.......   GCC
11........   GC
.11.......   CC
.111......   CCC
..111.....   CCG
...111....   CGT
...11.....   CG
....111...   GTG
....11....   GT
.....11...   TG
.....111..   TGC
.......111   CCA
........11   CA


Also keep another matrix which indicates the head and tail points of each substring
GCCCGTGCCA
1.1.......   GCC
11........   GC
.11.......   CC
.1.1......   CCC
..1.1.....   CCG
...1.1....   CGT
...11.....   CG
....1.1...   GTG
....11....   GT
.....11...   TG
.....1.1..   TGC
.......1.1   CCA
........11   CA


Now, go down the columns and sum up each column of the counter matrix.
  replace the 1's in the column with the sum

GCCCGTGCCA
244.......   GCC
24........   GC
.44.......   CC
.444......   CCC
..445.....   CCG
...455....   CGT
...45.....   CG
....553...   GTG
....55....   GT
.....53...   TG
.....532..   TGC
.......222   CCA
........22   CA


Somewhere in this matrix is the answer to the problem,
we just have to keep chipping at the marble to get it out

Starting from the top:
Go across every row.   if the row does not contain a 1, that means that this strand's
contribution is duplicated by another strand further down the list.  This makes it a candidate for
removal. 

In order to remove a candidate, we check the columns that it covers in the head/tail matrix.   if
it covers the head or the tail of another string, and if head or tail of this other string has a
count of 2, we do not remove the candidate.  This is because the candidate acts as a "binder" between
2 other strings  EX:

......2221....
....2222......  Can't remove this string without breaking continuitiy.
..1122........

If the candidate is not a "binder", then I remove it.  I do this by zero'ing out its line, and for
every column it covers, decrement every non-zero value in the column by 1.

When one hits the bottom of the list, this leads to a selection of strings which are the answer:  (I hope!)

TCCATCTCTAGAGTCCGTCTCACACGGCTCTCGCACACGGAGATAGCTC
12...............................................   TC
.21112...........................................   CCATC
.....211112......................................   CTCTAG
..........2111112................................   GAGTCCG
................2111112..........................   GTCTCAC
......................2111112....................   CACGGCT
............................2111112..............   TCTCGCA
..................................2111112........   ACACGGA
........................................2112.....   AGAT
...........................................211111   TAGCTC


This doesn't always work.   Algorithim fails for
the test string (i don't know why)
CACCATTACGTCTAATTACCCCCTCGTGTTCCTAC
12.................................   CA
.22................................   AC
..212..............................   CCA
....212............................   ATT
......212..........................   TAC
........22.........................   CG
.........21112.....................   GTCTA
.............212...................   AAT
...............21112...............   TTACC
...................21112...........   CCCCT
.......................21112.......   TCGTG
...........................21112...   GTTCC
...............................2111   CTAC





john lin (jjl364@nyu.edu)