Microarray Design

Dennis Shasha

Omniheurist Course

Computer Science


 

Description

Given a set of strings, you want the closest matches of a certain length to distinguish one string from the other.

We are cloning, er, higher mammals on our floating laboratory in the South Pacific. The ship is called Noah's Arc, but you must allow a certain poetic license. We don't need two parents for each species, just the sex of interest. We give our mammals certain genetic advantages by inserting specific DNA sent to us by suppliers. We don't completely trust those suppliers, however, so we want to verify the DNA they send.

To do that, we will use a microarray. In each cell of the microarray, we will place short sequences of DNA, called strands. If we expose the supplied DNA S to the chip, any cell whose strand is a consecutive subsequence of S will light up; no others will light up. (Technically, we put the reverse complement of those strands on the microarrays, but the mathematics doesn't change, so let's forget that.)

For example, if the sequence S is AGGTCACGTGG, then a strand consisting of CACG will light up but one consisting of CTCG will not. Similarly, GG will light up but TT will not.

Also, GG will not light up with greater intensity just because it appears twice.

In general, if I give you a sequence, I would like to know what to put in the cells of my microarray. I would like the maximum length strand to be as short as possible and within that length, I'd like to use as few strands as possible.

Strands that match the leftmost end of the DNA sequence will light up with a special color, so you'll know which ones those are. If a strand matches both leftmost and an interior sequence, then it will light up in two colors. Also, by independent means, we can tell you the individual totals of As, Cs, Ts, and Gs. Here are some questions to train your intuition.

What is the smallest number of strands needed for an arbitrarily long sequence?

Answer: Any sequence consisting entirely of a single nucleotide requires no strands at all.

Here's a harder one. What is the shortest sequence S you can imagine such that even if strands can be 6 long, you won't be able to verify S?

Answer: Suppose S were AAAAAATAAAAAA. With strands of length 6, there would be no way to distinguish between S and S' = AAAAAAATAAAAA. Both S and S' have the same number of As, the same number of Ts and no strand of length 6 can distinguish one from the other.

Even small strands can do a good job. For example, if S = ACGAC, then strands of length 2 are enough. Can you see how?

Answer: You place AC and GA. The AC will display in two colors. The GA will display in one. You already know how many of each kind of nucleotide there are. So this distinguishes ACACG from ACGAC.

Here is an example sequence we want to verify: TCACTCGGCTCTCGCACACGGAGATAGCTC. Goal is to get the smallest maximum size strand and given that, the smallest number of strands of that size or less. So, lexicographic ordering based on (max strand size, number of strands).

Your program is to answer this question in general. I will generate a sequence for you of length 100 before class starts. The verifier will confirm that your solution gives a unique path or will find ambiguities. Here is a 49 nucleotide strand with its solution due to Kenji Yoshihira. Here is a strategy idea

The architecture group will provide that verifier.