The recent advances in DNA sequencing technology and their many
potential applications to Biology and Medicine have rekindled enormous
interest in several classical algorithmic problems at the core of
Genomics and Computational Biology: primarily, the whole-genome
sequence assembly problem (WGSA). Two decades back, in the context of
the Human Genome Project, the problem had received unprecedented
scientific prominence: its computational complexity and intractability
were thought to have been well understood; various competitive
heuristics, thoroughly explored and the necessary software, properly
implemented and validated. However, several recent studies, focusing
on the experimental validation of de novo assemblies, have highlighted
several limitations of the current assemblers.
Intrigued by these negative results, this dissertation reinvestigates the algorithmic techniques required to correctly and efficiently assemble genomes. Mired by its connection to a well-known NP-complete combinatorial optimization problem, historically, WGSA has been assumed to be amenable only to greedy and heuristic methods. By placing efficiency as their priority, these methods opted to rely on local searches, and are thus inherently approximate, ambiguous or error-prone. This dissertation presents a novel sequence assembler, SUTTA, that dispenses with the idea of limiting the solutions to just the approximated ones, and instead favors an approach that could potentially lead to an exhaustive (exponential-time) search of all possible layouts but tames the complexity through constrained search (Branch-and-Bound) and quick identification and pruning of implausible solutions.
Complementary to this problem is the task of validating the generated assemblies. Unfortunately, no commonly accepted method exists yet and widely used metrics to compare the assembled sequences emphasize only size, poorly capturing quality and accuracy. This dissertation also addresses these concerns by developing a more comprehensive metric, the Feature-Response Curve, that, using ideas from classical ROC (receiver-operating characteristic) curve, more faithfully captures the trade-off between contiguity and quality.
Finally, this dissertation demonstrates the advantages of a complete pipeline integrating base-calling (TotalReCaller) with assembly (SUTTA) in a Bayesian manner.