Theses & Reports

Instructions for submitting a technical report or thesis.

You can find technical reports published prior to 1990 archived here.

Title

Authors

Year

TR2010-931 2010 Design and Results of the 4th Annual Satisfiability Modulo Theories Competition (SMT-COMP 2008) Barrett, Clark; Deters, Morgan; Oliveras, Albert; Stump, Aaron Abstract | PDF

Title: Design and Results of the 4th Annual Satisfiability Modulo Theories Competition (SMT-COMP 2008)

Author(s): Barrett, Clark; Deters, Morgan; Oliveras, Albert; Stump, Aaron

Abstract:

The Satisfiability Modulo Theories Competition (SMT-COMP) is an annual competition aimed at stimulating the advance of the state-of-the-art techniques and tools developed by the Satisfiability Modulo Theories (SMT) community. As with the first three editions, SMT-COMP 2008 was held as a satellite event of CAV 2008, held July 7-14, 2008. This report gives an overview of the rules, competition format, benchmarks, participants and results of SMT-COMP 2008.
M.S. Thesis 2010 DTAC: A method for planning to claim in Bridge Bethe, Paul Abstract | PDF

Title: DTAC: A method for planning to claim in Bridge

Candidate: Bethe, Paul

Advisor(s): Davis, Ernest

Abstract:

The DTAC program uses depth-first search to find an unconditional claim in bridge; that is, a line of play that is guaranteed to succeed whatever the distribution of the outstanding cards among the defenders. It can also find claims that are guaranteed to succeed under specified assumptions about the distribution of the defenders. cards. Lastly, DTAC can find a claim which requires losing a trick at some point. Using transposition tables to detect repeated positions, DTAC can carry out a complete DFS to find an unconditional ordered claim in less than 0.001 seconds on average, and less than 1 second for claims which lose a trick. The source code for DTAC is available from: http://cs.nyu.edu/~pmb309/DTAC.html
Ph.D. Thesis 2010 On the Randomness Requirements for Privacy Bosley, Carleton Abstract | PDF

Title: On the Randomness Requirements for Privacy

Candidate: Bosley, Carleton

Advisor(s): Dodis, Yevgeniy

Abstract:

Most cryptographic primitives require randomness (for example, to generate secret keys). Usually, one assumes that perfect randomness is available, but, conceivably, such primitives might be built under weaker, more realistic assumptions. This is known to be achievable for many authentication applications, when entropy alone is typically sufficient. In contrast, all known techniques for achieving privacy seem to fundamentally require (nearly) perfect randomness. We ask the question whether this is just a coincidence, or, perhaps, privacy inherently requires true randomness?

We completely resolve this question for information-theoretic private-key encryption, where parties wish to encrypt a b-bit value using a shared secret key sampled from some imperfect source of randomness S. Our technique also extends to related primitives which are sufficiently binding and hiding, including computationally secure commitments and public-key encryption.

Our main result shows that if such n-bit source S allows for a secure encryption of b bits, where b > log n, then one can deterministically extract nearly b almost perfect random bits from S . Further, the restriction that b > log n is nearly tight: there exist sources S allowing one to perfectly encrypt (log n - log log n) bits, but not to deterministically extract even a single slightly unbiased bit.

Hence, to a large extent, true randomness is inherent for encryption: either the key length must be exponential in the message length b, or one can deterministically extract nearly b almost unbiased random bits from the key. In particular, the one-time pad scheme is essentially "universal".
Ph.D. Thesis 2010 Machine Learning Approaches to Gene Duplication and Transcription Regulation Chen, Huang-Wen Abstract | PDF

Title: Machine Learning Approaches to Gene Duplication and Transcription Regulation

Candidate: Chen, Huang-Wen

Advisor(s): Shasha, Dennis

Abstract:

Gene duplication can lead to genetic redundancy or functional divergence, when duplicated genes evolve independently or partition the original function. In this dissertation, we employed machine learning approaches to study two different views of this problem: 1) Redundome, which explored the redundancy of gene pairs in the genome of Arabidopsis thaliana, and 2) ContactBind, which focused on functional divergence of transcription factors by mutating contact residues to change binding affinity.

In the Redundome project, we used machine learning techniques to classify gene family members into redundant and non-redundant gene pairs in Arabidopsis thaliana, where sufficient genetic and genomic data is available. We showed that Support Vector Machines were two-fold more precise than single attribute classifiers, and performed among the best within other machine learning algorithms. Machine learning methods predict that about half of all genes in Arabidopsis showed the signature of predicted redundancy with at least one but typically less than three other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks>1), suggesting that redundancy is stable over long evolutionary periods. The genome-wide predictions were plot with similarity trees based on ClustalW alignment scores, and can be accessed at http://redundome.bio.nyu.edu .

In the ContactBind project, we use Bayesian networks to model dependences between contact residues in transcription factors and binding site sequences. Based on the models learned from various binding experiments, we predicted binding motifs and their locations on promoters for three families of transcription factors in three species. The predictions are publicly available at http://contactbind.bio.nyu.edu . The website also provides tools to predict binding motifs and their locations for novel protein sequences of transcription factors. Users can construct their Bayesian networks for new families once such a familial binding data is available.
Ph.D. Thesis 2010 New Privacy-Preserving Architectures for Identity-/Attribute-based Encryption Chow, Sze Ming Abstract | PDF

Title: New Privacy-Preserving Architectures for Identity-/Attribute-based Encryption

Candidate: Chow, Sze Ming

Advisor(s): Dodis, Yevgeniy; Shoup, Victor

Abstract:

The notion of identity-based encryption (IBE) was proposed as an economical alternative to public-key infrastructures. IBE is also a useful building block in various cryptographic primitives such as searchable encryption. A generalization of IBE is attribute-based encryption (ABE). A major application of ABE is fine-grained cryptographic access control of data. Research on these topics is still actively continuing.

However, security and privacy of IBE and ABE are hinged on the assumption that the authority which setups the system is honest. Our study aims to reduce this trust assumption.

The inherent key escrow of IBE has sparkled numerous debates in the cryptography/security community. A curious key generation center (KGC) can simply generate the user's private key to decrypt a ciphertext. However, can a KGC still decrypt if it does not know the intended recipient of the ciphertext? This question is answered by formalizing KGC anonymous ciphertext indistinguishability (ACI-KGC). All existing practical pairing-based IBE schemes without random oracles do not achieve this notion. In this thesis, we propose an IBE scheme with ACI-KGC, and a new system architecture with an anonymous secret key generation protocol such that the KGC can issue keys to authenticated users without knowing the list of users' identities. This also matches the practice that authentication should be done with the local registration authorities. Our proposal can be viewed as mitigating the key escrow problem in a new dimension.

For ABE, it is not realistic to trust a single authority to monitor all attributes and hence distributing control over many attribute-authorities is desirable. A multi-authority ABE scheme can be realized with a trusted central authority (CA) which issues part of the decryption key according to a user's global identifier (GID). However, this CA may have the power to decrypt every ciphertext, and the use of a consistent GID allowed the attribute-authorities to collectively build a full profile with all of a user's attributes. This thesis proposes a solution without the trusted CA and without compromising users' privacy, thus making ABE more usable in practice.

Underlying both contributions are our new privacy-preserving architectures enabled by borrowing techniques from anonymous credential.
TR2010-930 2010 Coordination Mechanisms for Weighted Sum of Completion Times Cole, Richard; Gkatzelis, Vasilis; Mirrokni, Vahab Abstract | PDF

Title: Coordination Mechanisms for Weighted Sum of Completion Times

Author(s): Cole, Richard; Gkatzelis, Vasilis; Mirrokni, Vahab

Abstract:

We study policies aiming to minimize the weighted sum of completion times of jobs in the context of coordination mechanisms for selfish scheduling problems. Our goal is to design local policies that achieve a good price of anarchy in the resulting equilibria for unrelated machine scheduling. In short, we present the first constant-factor-approximate coordination mechanisms for this model.

First, we present a generalization of the ShortestFirst policy for weighted jobs, called SmithRule; we prove that it achieves an approximation ratio of 4 and we show that any set of non-preemptive ordering policies can result in equilibria with approximation ratio at least 3 even for unweighted jobs. Then, we present ProportionalSharing, a preemptive strongly local policy that beats this lower bound of 3; we show that this policy achieves an approximation ratio of 2.61 for the weighted sum of completion times and that the EqualSharing policy achieves an approximation ratio of 2.5 for the (unweighted) sum of completion times. Furthermore, we show that ProportionalSharing induces potential games (in which best-response dynamics converge to pure Nash equilibria).

All of our upper bounds are for the robust price of anarchy, defined by Roughgarden [36], so they naturally extend to mixed Nash equilibria, correlated equilibria, and regret minimization dynamics. Finally, we prove that our price of anarchy bound for ProportionalSharing can be used to design a new combinatorial constant-factor approximation algorithm minimizing weighted completion time for unrelated machine scheduling.
Ph.D. Thesis 2010 Tools and Techniques for the Sound Verification of Low Level Code Conway, Christopher L. Abstract | PDF

Title: Tools and Techniques for the Sound Verification of Low Level Code

Candidate: Conway, Christopher L.

Advisor(s): Barrett, Clark

Abstract:

Software plays an increasingly crucial role in nearly every facet of modern life, from communications infrastructure to the control systems in automobiles, airplanes, and power plants. To achieve the highest degree of reliability for the most critical pieces of software, it is necessary to move beyond ad hoc testing and review processes towards verification---to prove using formal methods that a piece of code exhibits exactly those behaviors allowed by its specification and no others.

A significant portion of the existing software infrastructure is written in low-level languages like C and C++. Features of these language present significant verification challenges. For example, unrestricted pointer manipulation means that we cannot prove even the simplest properties of programs without first collecting precise information about potential aliasing relationships between variables.

In this thesis, I present several contributions. The first is a general framework for combining program analyses that are only conditionally sound. Using this framework, I show it is possible to design a sound verification tool that relies on a separate, previously-computed pointer analysis.

The second contribution of this thesis is Cascade, a multi-platform, multi-paradigm framework for verification. Cascade includes a support for precise analysis of low-level C code, as well as for higher-level languages such as SPL.

Finally, I describe a novel technique for the verification of datatype invariants in low-level systems code. The programmer provides a high-level specification for a low-level implementation in the form of inductive datatype declarations and code assertions. The connection between the high-level semantics and the implementation code is then checked using bit-precise reasoning. An implementation of this datatype verification technique is available as a Cascade module.
Ph.D. Thesis 2010 Probabilistic and Topological methods in Computational Geometry Dhandapani, Raghavan Abstract | PDF

Title: Probabilistic and Topological methods in Computational Geometry

Candidate: Dhandapani, Raghavan

Advisor(s): Pach, Janos

Abstract:

We consider four problems connected by the common thread of geometry. The first three involve problems and algorithms that arise in applications that apriori do not involve geometry but this turns out to be the right language for visualizing and analyzing them. In the fourth, we generalize some well known results in geometry to the topological plane. The techniques we use come from probability and topology.

First, we consider two algorithms that work well in practice but the theoretical mechanism behind whose success is not very well understood.

Greedy routing is a routing mechanism that is commonly used in wireless sensor networks. While routing on the Internet uses standard established protocols, routing in ad-hoc networks with little structure (like sensor networks) is more difficult. Practitioners have devised algorithms that work well in practice, however they were no known theoretical guarantees. We provide the first such result in this area by showing that greedy routing can be made to work on Planar triangulations.

Linear Programming is a technique for optimizing a linear function subject to linear constraints. Simplex Algorithms are a family of algorithms that have proven quite successful in solving Linear Programs in practice. However, examples of Linear Programs on which these algorithms are very inefficient have been obtained by researchers. In order to explain this discrepancy between theory and practice, many authors have shown that Simplex Algorithms are efficient in expectation on randomized Linear Programs. We strengthen these results by proving a partial concentration bound for the Shadow Vertex Simplex Algorithm.

Next, we point out a limitation in an algorithm that is used commonly by practitioners and suggest a way of overcoming this.

Recommendation Systems are algorithms that are used to recommend goods (books, movies etc.) to users based on the similarities between their past preferences and those of other users. Low Rank Approximation is a common method used for this. We point out a common limitation of this method in the presence of ill-conditioning: the presence of multiple local minima. We also suggest a simple averaging based technique to overcome this limitation.

Finally, we consider some basic results in convexity like Radon's, Helly's and Caratheodory's theorems and generalize them to the topological plane, i.e., a plane which has the concept of a linear path which is analogous to a straight line but no notion of metric or distances.
TR2010-936 2010 An Iterative Substructuring Algorithm for Two-dimensional Problems in H(curl) Dohrmann, Clark R.; Widlund, Olof B. Abstract | PDF

Title: An Iterative Substructuring Algorithm for Two-dimensional Problems in H(curl)

Author(s): Dohrmann, Clark R.; Widlund, Olof B.

Abstract:

A domain decomposition algorithm, similar to classical iterative substructuring algorithms, is presented for two-dimensional problems in the space H0(curl). It is defined in terms of a coarse space and local subspaces associated with individual edges of the subdomains into which the domain of the problem has been subdivided. The algorithm differs from others in three basic respects. First, it can be implemented in an algebraic manner that does not require access to individual subdomain matrices or a coarse discretization of the domain; this is in contrast to algorithms of the BDDC, FETIâDP, and classical twoâlevel overlapping Schwarz families. Second, favorable condition number bounds can be established over a broader range of subdomain material properties than in previous studies. Third, we are able to develop theory for quite irregular subdomains and bounds for the condition number of our preconditioned conjugate gradient algorithm, which depend only on a few geometric parameters.

The coarse space for the algorithm is based on simple energy minimization concepts, and its dimension equals the number of subdomain edges. Numerical results are presented which confirm the theory and demonstrate the usefulness of the algorithm for a variety of mesh decompositions and distributions of material properties.
Ph.D. Thesis 2010 Semi-Supervised Learning via Generalized Maximum Entropy Erkan, Ayse Naz Abstract | PDF

Title: Semi-Supervised Learning via Generalized Maximum Entropy

Candidate: Erkan, Ayse Naz

Advisor(s): LeCun, Yann

Abstract:

Maximum entropy (MaxEnt) framework has been studied extensively in the supervised setting. Here, the goal is to find a distribution p, that maximizes an entropy function while enforcing data constraints so that the expected values of some (pre-defined) features with respect to p, match their empirical counterparts approximately. Using different entropy measures, different model spaces for p and different approximation criteria for the data constraints yields a family of discriminative supervised learning methods (e.g., logistic regression, conditional random fields, least squares and boosting). This framework is known as the generalized maximum entropy framework.

Semi-supervised learning (SSL) has emerged in the last decade as a promising field that enables utilizing unlabeled data along with labeled data so as to increase the accuracy and robustness of inference algorithms. However, most SSL algorithms to date have had trade-offs, for instance in terms of scalability or applicability to multi-categorical data.

In this thesis, we extend the generalized MaxEnt framework to develop a family of novel SSL algorithms using two different approaches: i. Introducing Similarity Constraints We incorporate unlabeled data via modifications to the primal MaxEnt objective in terms of additional potential functions. A potential function stands for a closed proper convex function that can take the form of a constraint and/or a penalty representing our structural assumptions on the data geometry. Specifically, we impose similarity constraints as additional penalties based on the semi-supervised smoothness assumption; i.e., we restrict the generalized MaxEnt problem such that similar samples have similar model outputs. ii. Augmenting Constraints on Model Features We incorporate unlabeled data to enhance the estimates on the model and empirical expectations based on our assumptions on the data geometry.

In particular, we derive the semi-supervised formulations for three specific instances of the generalized MaxEnt on conditional distributions, namely logistic regression and kernel logistic regression for multi-class problems, and conditional random fields for structured output prediction problems. A thorough empirical evaluation on standard data sets that are widely used in the literature demonstrates the validity and competitiveness of the proposed algorithms. In addition to these benchmark data sets, we apply our approach to two real-life problems: i. vision based robot grasping, and ii. remote sensing image classification, where the scarcity of the labeled training samples is the main bottleneck in the learning process. For the particular case of grasp learning, we propose a combination of semi-supervised learning and active learning, another sub-field of machine learning that is focused on the scarcity of labeled samples, when the problem setup is suitable for incremental labeling.

The novel SSL algorithms proposed in this thesis have numerous advantages over the existing semi-supervised algorithms as they yield convex, scalable, inherently multi-class loss functions that can be kernelized naturally.
TR2010-929 2010 Information Extraction on High-School Level Chemistry Labs Galron, Daniel Abstract | PDF

Title: Information Extraction on High-School Level Chemistry Labs

Author(s): Galron, Daniel

Abstract:

In this report we present a feasibility study on automatically interpreting instructions found in a set of high school chemistry labs, and discuss the role of deep domain knowledge in the interpretation. We define the task of sentence-level interpretation as the extraction of symbolic representations of the sentence semantics. In the broader scope, the sentence-level semantics of a particular sentence will be resolved with semantics from other sentences in the lab along with domain knowledge to disambiguate and reason about a physical system. The task of general automatic sentence-level interpretation is a difficult one. The general problem is not very well defined in the natural language processing research community, and few researchers have studied the problem. The common practice is to decompose the problem into subtasks, such as resolving coreferences of noun phrases, labeling the semantic roles of arguments to predicates, and identifying word categories. We describe a pipeline combining the subtasks described, along with parsing, to create a system capable of extracting sentence-level semantics. All the systems used for the subtask are found off-the-shelf, and we should stress that such a system will be highly-error prone for reasons we discuss. Finally, we do a close study of the chemistry lab corpus, and analyze each instruction to determine the feasibility of its automatic interpretation and the role of deep domain knowledge in its disambiguation and understanding.
Ph.D. Thesis 2010 Solving Quantified First Order Formulas in Satisfiability Modulo Theories Ge, Yeting Abstract | PDF

Title: Solving Quantified First Order Formulas in Satisfiability Modulo Theories

Candidate: Ge, Yeting

Advisor(s): Barrett, Clark

Abstract:

Design errors in computer systems, i.e. bugs, can cause inconvenience, loss of data and time, and in some cases catastrophic damages. One approach for improving design correctness is formal methods: techniques aiming at mathematically establishing that a piece of hardware or software satisfies certain properties. For some industrial cases in which formal methods are utilized, quantified first order formulas in satisfiability modulo theories (SMT) are useful. This dissertation presents several novel techniques for solving quantified formulas in SMT.

In general, deciding a quantified formula in SMT is undecidable. The practical approach for general quantifier reasoning in SMT is heuristics-based instantiation. This dissertation proposes a number of new heuristics that solves several challenges. Experimental results show that with the new heuristics a significant number of more benchmarks can be solved than before.

When only consider formulas within certain fragments of first order logic, it is possible to have complete algorithms based on instantiation. We propose several new fragments, and we prove that formulas in these fragments can be solved by a complete algorithm based on instantiation. For satisfiable quantified formulas in these fragments, we show how to construct the models.

As SMT solvers grow in complexity, the correctness of SMT solvers become questionable. A practical method to improve the correctness is to check the proofs from SMT solvers. We propose a proof translator that translates proofs from SMT solver CVC3 into a trusted solver HOL Light that actually checks the proofs. Experiments with the proof translator discover a faulty proof rule in CVC3 and two MIT-labeled quantified benchmarks in the SMT benchmark library SMT-LIB.
TR2010-922 2010 Polite Theories Revisited Jovanovic, Dejan; Barrett, Clark Abstract | PDF

Title: Polite Theories Revisited

Author(s): Jovanovic, Dejan; Barrett, Clark

Abstract:

The classic method of Nelson and Oppen for combining decision procedures requires the theories to be stably-infnite. Unfortunately, some important theories do not fall into this category (e.g. the theory of bit-vectors). To remedy this problem, previous work introduced the notion of polite theories. Polite theories can be combined with any other theory using an extension of the Nelson-Oppen approach. In this paper we revisit the notion of polite theories, fxing a subtle flaw in the original definition. We give a new combination theorem which specifies the degree to which politeness is preserved when combining polite theories. We also give conditions under which politeness is preserved when instantiating theories by identifying two sorts. These results lead to a more general variant of the theorem for combining multiple polite theories.
M.S. Thesis 2010 TestRig: A Platform independent system testing tool Kaul, Vaibhav Abstract | PDF

Title: TestRig: A Platform independent system testing tool

Candidate: Kaul, Vaibhav

Advisor(s): Shasha, Dennis

Abstract:

The goal of the TestRig software is to give a test engineer a fixed interface to help him with system/integration testing of software systems. TestRig is platform independent and can be utilized to test software systems coded with any programming language. In addition to doing that, it provides templates and examples of using various Open Source testing tools to help a user design their test cases. TestRig has been designed keeping in mind the current scenario in software development where complex systems are often created using multiple programming languages across different platforms. The challenge is to have a defined set of rules that are able to test any such system. The software makes use of various open source testing tools to run tests and verify results, which enables a user to test a system at different levels such as Performance Testing, Blackbox Testing, and User Acceptance Testing. TestRig is open source and utilizes a programmer’s creativity to test across multiple scenarios. The thesis will show how different software systems have been tested using TestRig.
Ph.D. Thesis 2010 An Algorithmic Enquiry Concerning Causality Kleinberg, Samantha Abstract | PDF

Title: An Algorithmic Enquiry Concerning Causality

Candidate: Kleinberg, Samantha

Advisor(s): Mishra, Bhubaneswar

Abstract:

In many domains we face the problem of determining the underlying causal structure from time-course observations of a system. Whether we have neural spike trains in neuroscience, gene expression levels in systems biology, or stock price movements in finance, we want to determine why these systems behave the way they do. For this purpose we must assess which of the myriad possible causes are significant while aiming to do so with a feasible computational complexity. At the same time, there has been much work in philosophy on what it means for something to be a cause, but comparatively little attention has been paid to how we can identify these causes. Algorithmic approaches from computer science have provided the first steps in this direction, but fail to capture the complex, probabilistic and temporal nature of the relationships we seek.

This dissertation presents a novel approach to the inference of general (type-level) and singular (token-level) causes. The approach combines philosophical notions of causality with algorithmic approaches built on model checking and statistical techniques for false discovery rate control. By using a probabilistic computation tree logic to describe both cause and effect, we allow for complex relationships and explicit description of the time between cause and effect as well as the probability of this relationship being observed (e.g. "a and b until c, causing d in 10-20 time units"). Using these causal formulas and their associated probabilities, we develop a novel measure for the significance of a cause for its effect, thus allowing discovery of those that are statistically interesting, determined using the concepts of multiple hypothesis testing and false discovery control. We develop algorithms for testing these properties in time-series observations and for relating the inferred general relationships to token-level events (described as sequences of observations). Finally, we illustrate these ideas with example data from both neuroscience and finance, comparing the results to those found with other inference methods. The results demonstrate that our approach achieves superior control of false discovery rates, due to its ability to appropriately represent and infer temporal information.
TR2010-926 2010 The Temporal Logic of Token Causes Kleinberg, Samantha; Mishra, Bud Abstract | PDF

Title: The Temporal Logic of Token Causes

Author(s): Kleinberg, Samantha; Mishra, Bud

Abstract:

While type causality helps understand general relationships such as the etiology of a disease (smoking causing lung cancer), token causality aims to explain causal connections in specific instantiated events, such as the diagnosis of a specific patient (Ravi's developing lung cancer after a 20-year smoking habit). Understanding why something happened, as in these examples, is central to reasoning in such diverse cases as the diagnosis of patients, understanding why the US financial market collapsed in 2007 and finding a causal explanation for Obama's victory over Clinton in the US primary. However, despite centuries of work in philosophy and decades of research in computer science, the problem of how to rigorously formalize token causality and how to automate such reasoning has remained unsolved. In this paper, we show how to use type-level causal relationships, represented as temporal logic formulas, together with philosophical principles, to reason about these token-level cases. Finally, we show how this method can correctly reason about examples that have traditionally proven difficult for both computational and philosophical theories to handle.
TR2010-932 2010 An overlapping domain decomposition method for the Reissner-Mindlin Plate with the Falk-Tu Elements Lee, Jong Ho Abstract | PDF

Title: An overlapping domain decomposition method for the Reissner-Mindlin Plate with the Falk-Tu Elements

Author(s): Lee, Jong Ho

Abstract:

The Reissner-Mindlin plate theory models a thin plate with thickness t. The condition numbers of finite element approximations of this model deteriorate badly as the thickness t of the plate converges to 0. In this paper, we develop an overlapping domain decomposition method for the Reissner-Mindlin plate model discretized by the Falk-Tu elements with the convergence rate which does not deteriorate when t converges to 0. It is shown that the condition number of this overlapping method is bounded by C(1+ H/delta)^3(1 +logH/h)^2. Here H is the maximum diameter of the subdomains, delta the size of overlap between subdomains, and h the element size. Numerical examples are provided to confirm the theory.
Ph.D. Thesis 2010 Time Series Modeling with Hidden Variables and Gradient-Based Algorithms Mirowski, Piotr Abstract | PDF

Title: Time Series Modeling with Hidden Variables and Gradient-Based Algorithms

Candidate: Mirowski, Piotr

Advisor(s): LeCun, Yann

Abstract:

We collect time series from real-world phenomena, such as gene interactions in biology or word frequencies in consecutive news articles. However, these data present us with an incomplete picture, as they result from complex dynamical processes involving unobserved state variables. Research on state-space models is motivated by simultaneously trying to infer hidden state variables from observations, as well as learning the associated dynamic and generative models.

I have developed a tractable, gradient-based method for training Dynamic Factor Graphs (DFG) with continuous latent variables. A DFG consists of (potentially nonlinear) factors modeling joint probabilities between hidden and observed variables. The DFG assigns a scalar energy to each configuration of variables, and a gradient-based inference procedure finds the minimum-energy state sequence for a given observation sequence. We approximate maximum likelihood learning by minimizing the expected energy over training sequences with respect to the factors' parameters. These alternated inference and parameter updates constitute a deterministic EM-like procedure.

Using nonlinear factors such as deep, convolutional networks, DFGs were shown to reconstruct chaotic attractors, to outperform a time series prediction benchmark, and to successfully impute motion capture data where a large number of markers were missing. In a joint work with the NYU Plant Systems Biology Lab, DFGs have been subsequently employed to the discovery of gene regulation networks by learning the dynamics of mRNA expression levels.

DFGs have also been extended into a deep auto-encoder architecture, and used on time-stamped text documents, with word frequencies as inputs. We focused on collections of documents that exhibit a structure over time. Working as dynamic topic models, DFGs could extract a latent trajectory from consecutive political speeches; applied to news articles, they achieved state-of-the-art text categorization and retrieval performance.

Finally, I used an embodiment of DFGs to evaluate the likelihood of discrete sequences of words in text corpora, relying on dynamics on word embeddings. Collaborating with AT&T; Labs Research on a project in speech recognition, we have improved on existing continuous statistical language models by enriching them with word features and long-range topic dependencies.
TR2010-933 2010 An Overlapping Schwarz Algorithm for Raviart-Thomas Vector Fields with Discontinuous Coefficients Oh, Duk-Soon Abstract | PDF

Title: An Overlapping Schwarz Algorithm for Raviart-Thomas Vector Fields with Discontinuous Coefficients

Author(s): Oh, Duk-Soon

Abstract:

Overlapping Schwarz methods form one of two major families of domain decomposition methods. We consider a two-level overlapping Schwarz method for Raviart-Thomas vector fields. The coarse part of the preconditioner is based on the energy-minimizing extensions and the local parts are based on traditional solvers on overlapping subdomains. We show that the condition number grows linearly with the logarithm of the number of degrees of freedom in the individual subdomains and linearly with the relative overlap between the overlapping subdomains. The condition number of the method is also independent of the values and jumps of the coefficients. Numerical results for 2D and 3D problems, which support the theory, are also presented.
TR2010-928 2010 BDDC preconditioners for spectral element discretizations of almost incompressible elasticity in three dimensions Pavarino, Luca F.; Widlund, Olof B.; Zampini, Stefano Abstract | PDF

Title: BDDC preconditioners for spectral element discretizations of almost incompressible elasticity in three dimensions

Author(s): Pavarino, Luca F.; Widlund, Olof B.; Zampini, Stefano

Abstract:

BDDC algorithms are constructed and analyzed for the system of almost incompressible elasticity discretized with Gauss-Lobatto-Legendre spectral elements in three dimensions. Initially mixed spectral elements are employed to discretize the almost incompressible elasticity system, but a positive definite reformulation is obtained by eliminating all pressure degrees of freedom interior to each subdomain into which the spectral elements have been grouped. Appropriate sets of primal constraints can be associated with the subdomain vertices, edges, and faces so that the resulting BDDC methods have a fast convergence rate independent of the almost incompressibility of the material. In particular, the condition number of the BDDC preconditioned operator is shown to depend only weakly on the polynomial degree \(n\), the ratio \(H/h\) of subdomain and element diameters, and the inverse of the inf-sup constants of the subdomains and the underlying mixed formulation, while being scalable, i.e., independent of the number of subdomains and robust, i.e., independent of the Poisson ratio and Young's modulus of the material considered. These results also apply to the related FETI-DP algorithms defined by the same set of primal constraints. Numerical experiments carried out on parallel computing systems confirm these results.
Ph.D. Thesis 2010 Structure Prediction and Visualization in Molecular Biology Poultney, Christopher Abstract | PDF

Title: Structure Prediction and Visualization in Molecular Biology

Candidate: Poultney, Christopher

Advisor(s): Shasha, Dennis

Abstract:

The tools of computer science can be a tremendous help to the working biologist. Two broad areas where this is particularly true are visualization and prediction. In visualization, the size of the data involved often makes meaningful exploration of the data and discovery of salient features difficult and time-consuming. Similarly, intelligent prediction algorithms can greatly reduce the lab time required to achieve significant results, or can reduce an intractable space of potential experiments to a tractable size.

Whereas the thesis discusses both a visualization technique and a machine learning problem, the thesis presentation will focus exclusively on the machine learning problem: prediction of temperature-sensitive mutations from protein structure. Temperature-sensitive mutations are a tremendously valuable research tool particularly for studying genes such as yeast essentially genes. To date, most methods for generating temperature-sensitive mutations involve large-scale random mutations followed by an intensive screening and characterization process. While there have been successful efforts to improve this process by rational design of temperature-sensitive proteins, surprisingly little work has been done in the area of predicting those mutations that will exhibit a temperature-sensitive phenotype. We describe a system that, given the structure of a protein of interest, uses a combination of protein structure prediction and machine learning to provide a ranked "top 5" list of likely candidates for temperature-sensitive mutations.
TR2010-934 2010 An Empirical Bayesian Interpretation and Generalization of NL-means Raphan, Martin; Simoncelli, Eero P. Abstract | PDF

Title: An Empirical Bayesian Interpretation and Generalization of NL-means

Author(s): Raphan, Martin; Simoncelli, Eero P.

Abstract:

A number of recent algorithms in signal and image processing are based on the empirical distribution of localized patches. Here, we develop a nonparametric empirical Bayesian estimator for recovering an image corrupted by additive Gaussian noise, based on fitting the density over image patches with a local exponential model. The resulting solution is in the form of an adaptively weighted average of the observed patch with the mean of a set of similar patches, and thus both justifies and generalizes the recently proposed nonlocal-means (NL-means) method for image denoising. Unlike NL-means, our estimator includes a dependency on the size of the patch similarity neighborhood, and we show that this neighborhood size can be chosen in such a way that the estimator converges to the optimal Bayes least squares estimator as the amount of data grows. We demonstrate the increase in performance of our method compared to NL-means on a set of simulated examples.
Ph.D. Thesis 2010 Theoretical Foundations and Algorithms for Learning with Multiple Kernels Rostamizadeh, Afshin Abstract | PDF

Title: Theoretical Foundations and Algorithms for Learning with Multiple Kernels

Candidate: Rostamizadeh, Afshin

Advisor(s): Mohri, Mehryar

Abstract:

Kernel-based algorithms have been used with great success in a variety of machine learning applications. These include algorithms such as support vector machines for classification, kernel ridge regression, ranking algorithms, clustering algorithms, and virtually all popular dimensionality reduction algorithms, since they are special instances of kernel principal component analysis.

But, the choice of the kernel, which is crucial to the success of these algorithms, has been traditionally left entirely to the user. Rather than requesting the user to commit to a specific kernel, multiple kernel algorithms require the user only to specify a family of kernels. This family of kernels can be used by a learning algorithm to form a combined kernel and derive an accurate predictor. This is a problem that has attracted a lot of attention recently, both from the theoretical point of view and from the algorithmic, optimization, and application point of view.

This thesis presents a number of novel theoretical and algorithmic results for learning with multiple kernels.

It gives the first tight margin-based generalization bounds for learning kernels with Lp regularization. In particular, our margin bounds for L1 regularization are shown to have only a logarithmic dependency on the number of kernels, which is a significant improvement over all previous analyses. Our results also include stability-based guarantees for a class of regression algorithms. In all cases, these guarantees indicate the benefits of learning with a large number of kernels.

We also present a family of new two-stage algorithms for learning kernels based on a notion of alignment and give an extensive analysis of the properties of these algorithms. We show the existence of good predictors for the notion of alignment we define and give efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP.

Finally, we also report the results of extensive experiments with our two-stage algorithms in classification and regression tasks, which show an improvement both over the uniform combination of kernels and over other state-of-the-art learning kernel methods for L1 and L2 regularization. These might constitute the first series of results for learning with multiple kernels that demonstrate a consistent improvement over a uniform combination of kernels.
Ph.D. Thesis 2010 Creating collections and evaluating viewpoints: Selection techniques for interface design Secord, Adrian Abstract | PDF

Title: Creating collections and evaluating viewpoints: Selection techniques for interface design

Candidate: Secord, Adrian

Advisor(s): Zorin, Denis

Abstract:

In computer graphics and user interface design, selection problems are those that require the user to select a collection consisting of a small number of items from a much larger library. This dissertation explores selection problems in two diverse domains: large personal multimedia collections, containing items such as personal photographs or songs, and camera positions for 3D objects, where each item is a different viewpoint observing an object. Multimedia collections have by discrete items with strong associated metadata, while camera positions form a continuous space but are weak in metadata. In either domain, the items to be selected have rich interconnections and dependencies, making it difficult to successfully apply simple techniques (such as ranking) to aid the user. Accordingly, we develop separate approaches for the two domains.

For personal multimedia collections, we leverage the semantic metadata associated with each item (such as song title, artist name, etc.) and provide the user with a simple query language to describe their desired collection. Our system automatically suggests a collection of items that conform to the userâs query. Since any query language has limited expressive power, and since users often create collections via exploration, we provide various refinement techniques that allow the user to expand, refine and explore their collection directly through examples.

For camera positioning, we do not have the advantage of having semantic metadata for each item, unlike in media collections. We instead create a proxy viewpoint goodness function which can be used to guide the solution of various selection problems involving camera viewpoints. This function is constructed from several different attributes of the viewpoint, such as how much surface area is visible, or how "curvy" the silhouette is. Since there are many possible viewpoint goodness functions, we conducted a large user study of viewpoint preference and use the results to evaluate thousands of different functions and find the best ones. While we suggest several goodness functions to the practitioner, our user study data and methodology can be used to evaluate any proposed goodness function; we hope it will be a useful tool for other researchers.
TR2010-924 2010 Henrique Andrade, Vibhore Kumar, and Kun-Lung Wu, A Universal Calculus for Stream Processing Languages Soulé, Robert; Hirzel, Martin; Grimm, Robert; Gedik, Buğra Abstract | PDF

Title: Henrique Andrade, Vibhore Kumar, and Kun-Lung Wu, A Universal Calculus for Stream Processing Languages

Author(s): Soulé, Robert; Hirzel, Martin; Grimm, Robert; Gedik, Buğra

Abstract:

Stream processing applications such as algorithmic trading, MPEG processing, and web content analysis are ubiquitous and essential to business and entertainment. Language designers have developed numerous domain-specific languages that are both tailored to the needs of their applications, and optimized for performance on their particular target platforms. Unfortunately, the goals of generality and performance are frequently at odds, and prior work on the formal semantics of stream processing languages does not capture the details necessary for reasoning about implementations. This paper presents Brooklet, a core calculus for stream processing that allows us to reason about how to map languages to platforms and how to optimize stream programs. We translate from three representative languages, CQL, StreamIt, and Sawzall, to Brooklet, and show that the translations are correct. We formalize three popular and vital optimizations, data-parallel computation, operator fusion, and operator re-ordering, and show under which conditions they are correct. Language designers can use Brooklet to specify exactly how new features or languages behave. Language implementors can use Brooklet to show exactly under which circumstances new optimizations are correct. In ongoing work, we are developing an intermediate language for streaming that is based on Brooklet. We are implementing our intermediate language on System S, IBM's high-performance streaming middleware.
Ph.D. Thesis 2010 Analysis of Mass Spectrometry Data for Protein Identification In Complex Biological Mixtures Spivak, Marina Abstract | PDF

Title: Analysis of Mass Spectrometry Data for Protein Identification In Complex Biological Mixtures

Candidate: Spivak, Marina

Advisor(s): Greengard, Leslie

Abstract:

Mass spectrometry is a powerful technique in analytical chemistry that was originally designed to determine the composition of small molecules in terms of their constituent elements. In the last several decades, it has begun to be used for much more complex tasks, including the detailed analysis of the amino acid sequence that makes up an unknown protein and even the identification of multiple proteins present in a complex mixture. The latter problem is largely unsolved and the principal subject of this dissertation.

The fundamental difficulty in the analysis of mass spectrometry data is that of ill-posedness. There are multiple solutions consistent with the experimental data and the data is subject to significant amounts of noise. In this work, we have developed application-specific machine learning algorithms that (partially) overcome this ill-posedness. We make use of labeled examples of a single class of peptide fragments and of the unlabeled fragments detected by the instrument. This places the approach within the broader framework of semi-supervised learning.

Recently, there has been considerable interest in classification problems of this type, where the learning algorithm only has access to labeled examples of a single class and unlabeled data. The motivation for such problems is that in many applications, examples of one of the two classes are easy and inexpensive to obtain, whereas the acquisition of examples of a second class is difficult and labor-intensive. For example, in document classification, positive examples are documents that address specific subject, while unlabeled documents are abundant. In movie rating, the positive data are the movies chosen by clients, while the unlabeled data are all remaining movies in a collection. In medical imaging, positive (labeled) data correspond to images of tissue affected by a disease, while the remaining available images of the same tissue comprise the unlabeled data. Protein identification using mass spectrometry is another variant of such a general problem.

In this work, we propose application-specific machine learning algorithms to address this problem. The reliable identification of proteins from mixtures using mass spectrometry would provide an important tool in both biomedical research and clinical practice.
Ph.D. Thesis 2010 Matrix Approximation for Large-scale Learning Talwalkar, Ameet Abstract | PDF

Title: Matrix Approximation for Large-scale Learning

Candidate: Talwalkar, Ameet

Advisor(s): Mohri, Mehryar

Abstract:

Modern learning problems in computer vision, natural language processing, computational biology, and other areas are often based on large data sets of thousands to millions of training instances. However, several standard learning algorithms, such as kernel-based algorithms, e.g., Support Vector Machines, Kernel Ridge Regression, Kernel PCA, do not easily scale to such orders of magnitude. This thesis focuses on sampling-based matrix approximation techniques that help scale kernel-based algorithms to large-scale datasets. We address several fundamental theoretical and empirical questions including:

What approximation should be used? We discuss two common sampling-based methods, providing novel theoretical insights regarding their suitability for various applications and experimental results motivated by this theory. Our results show that one of these methods, the Nystrom method, is superior in the context of large-scale learning.

Do these approximations work in practice? We show the effectiveness of approximation techniques on a variety of problems. In the largest study to-date for manifold learning, we use the Nystrom method to extract low-dimensional structure from high-dimensional data to effectively cluster face images. We also report good empirical results for kernel ridge regression and kernel logistic regression.

How should we sample columns? A key aspect of sampling-based algorithms is the distribution according to which columns are sampled. We study both fixed and adaptive sampling schemes as well as a promising ensemble technique that can be easily parallelized and generates superior approximations, both in theory and in practice.

How well do these approximations work in theory? We provide theoretical analyses of the Nystrom method to understand when this technique should be used. We present guarantees on approximation accuracy based on various matrix properties and analyze the effect of matrix approximation on actual kernel-based algorithms.

This work has important consequences for the machine learning community since it extends to large-scale applications the benefits of kernel-based algorithms. The crucial aspect of this research, involving low-rank matrix approximation, is of independent interest within the field of numerical linear algebra.
TR2010-935 2010 Learning Image Decompositions with Hierarchical Sparse Coding Zeiler, Matthew D.; Fergus, Rob Abstract | PDF

Title: Learning Image Decompositions with Hierarchical Sparse Coding

Author(s): Zeiler, Matthew D.; Fergus, Rob

Abstract:

We present a hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling. When trained on natural images, the layers of our model capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts and complete objects. To build our model we rely on a novel inference scheme that ensures each layer reconstructs the input, rather than just the output of the layer directly beneath, as is common with existing hierarchical approaches. This scheme makes it possible to robustly learn multiple layers of representation and we show a model with 4 layers, trained on images from the Caltech-101 dataset. We use our model to produce image decompositions that, when used as input to standard classification schemes, give a significant performance gain over low-level edge features and yield an overall performance competitive with leading approaches.