DOCTORAL DISSERTATION DEFENSE

Candidate: Laxmi Parida

Advisor: B. Mishra

Candidate: Laxmi Parida

Advisor: B. Mishra

Algorithmic Issues in Computational Genomics

2:00 p.m., Wednesday, July 29, 1998

12th fl. conference room

719 Broadway

**Abstract**

This thesis explores the application of algorithmic techniques
in understanding and solving computational problems
arising in Genomics (called *Computational Genomics*).
In the first part of the thesis we focus on the problem of
reconstructing physical maps from data, related to
"reading" the genome of an organism, and in the second part
we focus on problems related to "interpreting" (in a very
limited sense) the genome. The main contributions of the thesis
are understanding the computational complexity of, and designing
algorithms for some key problems in both these domains.

The primary goal of the Human Genome Project is to determine
the entire three billion base pair sequence of the human
genome and locate roughly 100,000 genes on the DNA.
Recently, a set of single molecule methods (such as optical
mapping) have been developed that allow one to create physical
maps (a set of landmarks on the DNA whose locations are well
defined), but can only do so by combining a population of
data in the presence of errors from various sources. In the
first part of the thesis, we focus on the problem of computing
physical maps from data that arise in single molecule methods.
We describe two combinatorial models of the problem termed
Exclusive Binary Flip Cut (EBFC) and Weighted Consistency Graph
(WCG) problems. We show that both the problems are MAX SNP hard
and give bounds on the approximation factors achievable. We
give polynomial time 0.878-approximation algorithm for the EBFC
problem and 0.817-approximation algorithm for the WCG problem,
using the maxcut approximation algorithm due to Goemans and
Williamson. We also give a low polynomial time practical algorithm
that works well on simulated and real data. *Naksha* is an
implementation of this algorithm and a demonstration is available
at `http://www.cs.nyu.edu/parida/naksha.html`. We also have
similar results on complexity for generalizations of the problem
which model various other sources of errors. We have
generalized our complexity and algorithmic results to the case
where there is more than one population in the data (which
we call the *K*-populations problem).
In the second part of the thesis, we focus on "interpreting"
the genome. We consider the problem of discovering patterns
(or motifs) in strings on a finite alphabet: we show that by
appropriately defining irredundant motifs, the number of irredundant
motifs is only quadratic in the input size. We use these irredundant
motifs in designing algorithms to align multiple genome or protein
sequences. Alignment of sequences aids in comparing similarities, in
structure and function of the proteins.