Algorithmic Issues in Computational Genomics
2:00 p.m., Wednesday, July 29, 1998
12th fl. conference room
This thesis explores the application of algorithmic techniques in understanding and solving computational problems arising in Genomics (called Computational Genomics). In the first part of the thesis we focus on the problem of reconstructing physical maps from data, related to "reading" the genome of an organism, and in the second part we focus on problems related to "interpreting" (in a very limited sense) the genome. The main contributions of the thesis are understanding the computational complexity of, and designing algorithms for some key problems in both these domains.
The primary goal of the Human Genome Project is to determine the entire three billion base pair sequence of the human genome and locate roughly 100,000 genes on the DNA. Recently, a set of single molecule methods (such as optical mapping) have been developed that allow one to create physical maps (a set of landmarks on the DNA whose locations are well defined), but can only do so by combining a population of data in the presence of errors from various sources. In the first part of the thesis, we focus on the problem of computing physical maps from data that arise in single molecule methods. We describe two combinatorial models of the problem termed Exclusive Binary Flip Cut (EBFC) and Weighted Consistency Graph (WCG) problems. We show that both the problems are MAX SNP hard and give bounds on the approximation factors achievable. We give polynomial time 0.878-approximation algorithm for the EBFC problem and 0.817-approximation algorithm for the WCG problem, using the maxcut approximation algorithm due to Goemans and Williamson. We also give a low polynomial time practical algorithm that works well on simulated and real data. Naksha is an implementation of this algorithm and a demonstration is available at http://www.cs.nyu.edu/parida/naksha.html. We also have similar results on complexity for generalizations of the problem which model various other sources of errors. We have generalized our complexity and algorithmic results to the case where there is more than one population in the data (which we call the K-populations problem). In the second part of the thesis, we focus on "interpreting" the genome. We consider the problem of discovering patterns (or motifs) in strings on a finite alphabet: we show that by appropriately defining irredundant motifs, the number of irredundant motifs is only quadratic in the input size. We use these irredundant motifs in designing algorithms to align multiple genome or protein sequences. Alignment of sequences aids in comparing similarities, in structure and function of the proteins.