Instructions for submitting a technical report or thesis.
Title: General Algorithms for Testing the Ambiguity of Finite Automata
Author(s): Allauzen, Cyril; Mohri, Mehryar; Rastogi, Ashish
Abstract:
This paper presents efficient algorithms for testing the finite, polynomial, and exponential ambiguity of finite automata with $\epsilon$-transitions. It gives an algorithm for testing the exponential ambiguity of an automaton $A$ in time $O(|A|_E2)$, and finite or polynomial ambiguity in time $O(|A|_E3)$. These complexities significantly improve over the previous best complexities given for the same problem. Furthermore, the algorithms presented are simple and are based on a general algorithm for the composition or intersection of automata. We also give an algorithm to determine the degree of polynomial ambiguity of a finite automaton $A$ that is polynomially ambiguous in time $O(|A|_E3)$. Finally, we present an application of our algorithms to an approximate computation of the entropy of a probabilistic automaton.
Title: Competitive Hybridization Model
Author(s): Cherepinsky, Vera; Hashmi, Ghazala; Seul, Michael; Mishra, Bud
Abstract:
Microarray technology, in its simplest form, allows one to gather abundance data for target DNA molecules, associated with genomes or gene-expressions, and relies on hybridizing the target to many short probe oligonucleotides arrayed on a surface. While for such multiplexed reactions conditions are optimized to make the most of each individual probe-target interaction, subsequent analysis of these experiments is based on the implicit assumption that a given experiment gives the same result regardless of whether it was conducted in isolation or in parallel with many others. It has been discussed in the literature that this assumption is frequently false, and its validity depends on the types of probes and their interactions with each other. We present a detailed physical model of hybridization as a means of understanding probe interactions in a multiplexed reaction. The model is formulated as a system of ordinary di.erential equations (ODE.s) describing kinetic mass action and conservation-of-mass equations completing the system.
We examine pair-wise probe interactions in detail and present a model of .competition. between the probes for the target.especially, when target is in short supply. These e.ects are shown to be predictable from the a.nity constants for each of the four probe sequences involved, namely, the match and mismatch for both probes. These a.nity constants are calculated from the thermodynamic parameters such as the free energy of hybridization, which are in turn computed according to the nearest neighbor (NN) model for each probe and target sequence.
Simulations based on the competitive hybridization model explain the observed variability in the signal of a given probe when measured in parallel with di.erent groupings of other probes or individually. The results of the simulations are used for experiment design and pooling strategies, based on which probes have been shown to have a strong e.ect on each other.s signal in the in silico experiment. These results are aimed at better design of multiplexed reactions on arrays used in genotyping (e.g., HLA typing, SNP or CNV detection, etc.) and mutation analysis (e.g., cystic .brosis, cancer, autism, etc.).
Title: Friendshare: A decentralized, consistent storage repository for collaborative file sharing
Candidate: Chiang, Frank
Advisor(s): Li, Jinyang
Abstract:
Data sharing has become more and more collaborative with the rise of Web 2.0, where multiple writers jointly write and organize the content in a repository. Current solutions use a centralized entity, such as Wikipedia or Google Groups, to serve the data. However, centralized solutions may be undesirable due to privacy concerns and censorship, which are problems that can be alleviated by switching to decentralized solutions.
The challenge of building a decentralized collaborative repository is achieving high data availability, durability, and consistency. Attaining these goals is difficult because peer nodes have limited bandwidth and storage space, low availability, and the repository has high membership churn.
This thesis presents Friendshare, a decentralized multiple-writer data repository. Separating the metadata from the data allows for efficient metadata replication across privileged admin nodes, thus increasing availability and durability. The primary commit scheme, where a primary node is responsible for determining the total order of writes in the repository, is employed to ensure eventual consistency. If the primary leaves the system unexpectedly, the remaining admin nodes run Paxos, a consensus protocol, to elect a new primary.
The Paxos protocol requires high node availability in order to be run efficiently, a criteria that is rarely met in typical peer-to-peer networks. To rectify this problem, we offer two optimizations to improve Paxos performance in low availability environments.
Friendshare has been implemented and deployed to gather real-world statistics. To offer theoretical predictions, we built a simulator to demonstrate the performance and service availability of Friendshare at various node online percentages. In addition, we show the performance improvements of our Paxos optimizations in comparison with the basic Paxos protocol.
Title: Factor Graphs for Relational Regression
Author(s): Chopra, Sumit; Thampy, Trivikaraman; Leahy, John; Caplin, Andrew; LeCun, Yann
Abstract:
Traditional methods for supervised learning involve treating the input data as a set of independent, identically distributed samples. However, in many situations, the samples are related in such a way that variables associated with one sample depend on other samples. We present a new form of relational graphical model that, in addition to capturing the dependence of the output on sample specific features, can also capture hidden relationships among samples through a non-parametric latent manifold. Learning in the proposed graphical model involves simultaneously learning the non-parametric latent manifold along with a non-relational parametric model. Efficient inference algorithms are introduced to accomplish this task. The method is applied to the prediction of house prices. A non-relational model predicts an ``intrinsic" price of the house which depends only on its individual characteristics, and a relational model estimates a hidden surface of ``desirability'' coefficients which links the price of a house to that of similar houses in the neighborhood.
Title: Verification of Transactional Memories and Recursive Programs
Candidate: Cohen, Ariel
Advisor(s): Pnueli, Amir
Abstract:
Transactional memory is a programming abstraction intended to simplify the synchronization of conflicting concurrent memory accesses without the difficulties associated with locks. In the first part of this thesis we provide a framework and tools that allow to formally verify that a transactional memory implementation satisfies its specification. First we show how to specify transactional memory in terms of admissible interchanges of transaction operations, and give proof rules for showing that an implementation satisfies its specification. We illustrate how to verify correctness, first using a model checker for bounded instantiations, and subsequently by using a theorem prover, thus eliminating all bounds. We provide a mechanical proof of the soundness of the verification method, as well as mechanical proofs for several implementations from the literature, including one that supports non-transactional memory accesses.
Procedural programs with unbounded recursion present a challenge to symbolic model-checkers since they ostensibly require the checker to model an unbounded call stack. In the second part of this thesis we present a method for model-checking safety and liveness properties over procedural programs. Our method performs by first augmenting a concrete procedural program with a well founded ranking function, and then abstracting the Procedural programs with unbounded recursion present a challenge to symbolic model-checkers since they ostensibly require the checker to model an unbounded call stack. In the second part of this thesis we present a method for model-checking safety and liveness properties over procedural programs. Our method performs by first augmenting a concrete procedural program with a well founded ranking function, and then abstracting the augmented program by a finitary state abstraction. Using procedure summarization the procedural abstract program is then reduced to a finite-state system, which is model checked for the property.
Title: Pointer Analysis, Conditional Soundness, and Proving the Absence of Errors
Author(s): Conway, Christopher L.; Dams, Dennis; Namjoshi, Kedar S.; Barrett, Clark
Abstract:
It is well known that the use of points-to information can substantially improve the accuracyof a static program analysis. Commonly used algorithms for computing points-to information are known to be sound only for memory-safe programs. Thus, it appears problematic to utilize points-to information to verify the memory safety property without giving up soundness. We show that a sound combination is possible, even if the points-to information is computed separately and only conditionally sound. This result is based on a refined statement of the soundness conditions of points-to analyses and a general mechanism for composing conditionally sound analyses.
Title: STUMP: Stereo Correspondence in the Cyclopean Eye under Belief Propagation
Candidate: Distler, George
Advisor(s): Geiger, Davi
Abstract:
The human visual system sees at any moment a static scene in three dimensions. This 3D view of the world is acquired by two images, one from the left eye and the other by the right eye. Fusing the left and right stereo pair of images yields a single cyclopean view portraying depth. Stereo vision can be applied to the field of computer vision via calibrated stereo cameras to capture the left and right images. Given a stereo pair of images, one can compute the field of depth via a stereo correspondence algorithm. We present a new approach to computing the disparity (depth) by means the STUMP algorithm.
The STUMP algorithm presents a solution to the stereo correspondence problem. We propose to solve the problem of discontinuities in disparity within epipolar lines by modeling geometric constraints of smooth, tilted, and occluded surfaces as well as unicity and opaqueness. Our algorithm runs upon a framework built upon the BP-TwoGraphs belief propagation estimation [17]. As a result, we provide a disparity map in the cyclopean coordinate system determined by a probability distribution computed in polynomial time.
Title: An Overlapping Schwarz Algorithm for Almost Incompressible Elasticity
Author(s): Dohrmann, Clark R.; Widlund, Olof B.
Abstract:
Overlapping Schwarz methods are extended to mixed finite element approximations of linear elasticity which use discontinuous pressure spaces. The coarse component of the preconditioner is based on a low-dimensional space previously developed for scalar elliptic problems and a domain decomposition method of iterative substructuring type, i.e., a method based on non-overlapping decompositions of the domain, while the local components of the preconditioner are based on solvers on a set of overlapping subdomains.
A bound is established for the condition number of the algorithm which grows in proportion to the square of the logarithm of the number of degrees of freedom in individual subdomains and the third power of the relative overlap between the overlapping subdomains, and which is independent of the Poisson ratio as well as jumps in the Lam\'e parameters across the interface between the subdomains. A positive definite reformulation of the discrete problem makes the use of the standard preconditioned conjugate gradient method straightforward. Numerical results, which include a comparison with problems of compressible elasticity, illustrate the findings.
Title: Learning Long-Range Vision for an Offroad Robot
Candidate: Hadsell, Raia
Advisor(s): LeCun, Yann
Abstract:
Teaching a robot to perceive and navigate in an unstructured natural world is a difficult task. Without learning, navigation systems are short-range and extremely limited. With learning, the robot can be taught to classify terrain at longer distances, but these classifiers can be fragile as well, leading to extremely conservative planning. A robust, high-level learning-based perception system for a mobile robot needs to continually learn and adapt as it explores new environments. To do this, a strong feature representation is necessary that can encode meaningful, discriminative patterns as well as invariance to irrelevant transformations. A simple realtime classifier can then be trained on those features to predict the traversability of the current terrain.
One such method for learning a feature representation is discussed in detail in this work. Dimensionality reduction by learning an invariant mapping (DrLIM) is a weakly supervised method for learning a similarity measure over a domain. Given a set of training samples and their pairwise relationships, which can be arbitrarily defined, DrLIM can be used to learn a function that is invariant to complex transformations of the inputs such as shape distortion and rotation.
The main contribution of this work is a self-supervised learning process for long-range vision that is able to accurately classify complex terrain, permitting improved strategic planning. As a mobile robot moves through offroad environments, it learns traversability from a stereo obstacle detector. The learning architecture is composed of a static feature extractor, trained offline for a general yet discriminative feature representation, and an adaptive online classifier. This architecture reduces the effect of concept drift by allowing the online classifier to quickly adapt to very few training samples without overtraining. After experiments with several different learned feature extractors, we conclude that unsupervised or weakly supervised learning methods are necessary for training general feature representations for natural scenes.
The process was developed and tested on the LAGR mobile robot as part of a fully autonomous vision-based navigation system.
Title: Modal Logic, Temporal Models and Neural Circuits: What Connects Them
Author(s): Kleinberg, Samantha; Antoniotti, Marco; Ramakrishnan, Naren; Mishra, Bud
Abstract:
Traditional methods for supervised learning involve treating the input data as a set of independent, identically distributed samples. However, in many situations, the samples are related in such a way that variables associated with one sample depend on other samples. We present a new form of relational graphical model that, in addition to capturing the dependence of the output on sample specific features, can also capture hidden relationships among samples through a non-parametric latent manifold.
Learning in the proposed graphical model involves simultaneously learning the non-parametric latent manifold along with a non-relational parametric model. Efficient inference algorithms are introduced to accomplish this task. The method is applied to the prediction of house prices. A non-relational model predicts an ``intrinsic" price of the house which depends only on its individual characteristics, and a relational model estimates a hidden surface of ``desirability'' coefficients which links the price of a house to that of similar houses in the neighborhood.
Title: Extension of Two-level Schwarz Preconditioners to Symmetric Indefinite Problems
Author(s): Leong, Alan
Abstract:
Two-level overlapping Schwarz preconditioners are extended for use for a class of large, symmetric, indefinite systems of linear algebraic equations. The focus is on an enriched coarse space with additional basis functions built from free space solutions of the underlying partial differential equation. GMRES is used to accelerate the convergence of preconditioned systems. Both additive and hybrid Schwarz methods are considered and reports are given on extensive numerical experiments.
Title: Nonlinear extraction of 'Independent Components' of elliptically symmetric densities using radial Gaussianization
Author(s): Lyu, Siwei; Simoncelli, Eero P.
Abstract:
We consider the problem of efficiently encoding a signal by transforming it to a new representation whose components are statistically independent (also known as factorial). A widely studied family of solutions, generally known as independent components analysis (ICA), exists for the case when the signal is generated as a linear transformation of independent non-Gaussian sources. Here, we examine a complementary case, in which the signal density is non-Gaussian but elliptically symmetric. In this case, no linear transform suffices to properly decompose the signal into independent components, and thus, the ICA methodology fails. We show that a simple nonlinear transformation, which we call radial Gaussianization (RG), provides an exact solution for this case. We then examine this methodology in the context of natural image statistics, demonstrating that joint statistics of spatially proximal coefficients in a multi-scale image representation are better described as elliptical than factorial. We quantify this by showing that reduction in dependency achieved by RG is far greater than that achieved by ICA, for local spatial neighborhoods. We also show that the RG transformation may be closely approximated by divisive normalization transformations that have been used to model the nonlinear response properties of visual neurons, and that have been shown to reduce dependencies between multi-scale image coefficients.
Title: Synthesizing Executable Programs from Requirements
Candidate: Plock, Cory
Advisor(s): Goldberg, Benjamin
Abstract:
Automatic generation of correct software from requirements has long been a ``holy grail'' for system and software development. According to this vision, instead of implementing a system and then working hard to apply testing and verification methods to prove system correctness, a system is rather built correctly by construction. This problem, referred to as synthesis, is undecidable in the general case. However, by restricting the domain to decidable subsets, it is possible to bring this vision one step closer to reality.
The focus of our study is reactive systems, or non-terminating programs that continuously receive input from an external environment and produce output responses. Reactive systems are often safety critical and include applications such as anti-lock braking systems, auto-pilots, and pacemakers. One of the challenges of reactive system design is ensuring that the software meets the requirements under the assumption of unpredictable environment input. The behavior of many of these systems can be expressed as regular languages over infinite strings, a domain in which synthesis has yielded successful results.
We present a method for synthesizing executable reactive systems from formal requirements. The object-oriented requirements language of Live Sequence Charts (LSCs) is considered. We begin by establishing a mapping between various subsets of the language and finite-state formal models. We also consider LSCs which can express time constraints over a dense-time domain. From one of these models, we show how to formulate a winning strategy that is guaranteed to satisfy the requirements, provided one exists. The strategy is realized in the form of a controller which guides the system in choosing only non-violating behaviors. We describe an implementation of this work as an extension of an existing tool called the Play-Engine.
Title: Theory and Algorithms for Modern Machine Learning Problems and an Analysis of Markets
Candidate: Rastogi, Ashish
Advisor(s): Cole, Richard; Mohri, Mehryar
Abstract:
The unprecedented growth of the Internet over the past decade and of data collection, more generally, has given rise to vast quantities of digital information, ranging from web documents and images, genomic databases to a vast array of business customer information. Consequently, it is of growing importance to develop tools and models that enable us to better understand this data and to design data-driven algorithms that leverage this information. This thesis provides several fundamental theoretical and algorithmic results for tackling such problems with applications to speech recognition, image processing, natural language processing, computational biology and web-based algorithms.
Probabilistic automata provide an efficient and compact way to model sequence- oriented data such as speech or web documents. Measuring the similarity of such automata provides a way of comparing the objects they model, and is an essential first step in organizing this type of data. We present algorithmic and hardness results for computing various discrepancies (or dissimilarities) between probabilistic automata, including the relative entropy and the Lp distance; we also give an efficient algorithm to determine if two probabilistic automata are equivalent. In addition, we study the complexity of computing the norms of probabilistic automata.
Organizing and querying large amounts of digitized data such as images and videos is a challenging task because little or no label information is available. This motivates transduction, a setting in which the learning algorithm can leverage unlabeled data during training to improve performance. We present novel error bounds for a family of transductive regression algorithms and validate their usefulness through experiments.
Widespread success of search engines and information retrieval systems has led to large scale collection of rating information which is being used to provide personalized rankings. We examine an alternate formulation of the ranking problem for search engines motivated bythe requirement that in addition to accurately predicting pairwise ordering, ranking systems must also preserve the magnitude of the preferences or the difference between ratings. We present algorithms with sound theoretical properties, and verify their efficacy through experiments.
Finally, price discovery in a market setting can be viewed as an (ongoing) learning problem. Specifically, the problem is to find and maintain a set of prices that balance supply and demand, a core topic in economics. This appears to involve complex implicit and possibly large-scale information transfers. We show that finding equilibrium prices, even approximately, in discrete markets is NP-hard and complement the hardness result with a matching polynomial time approximation algorithm.We also give a new way of measuring the quality of an approximation to equilibrium prices that is based on a natural aggregation of the dissatisfaction of individual market participants.
Title: Measuring biomolecules: an image processing and length estimation pipeline using atomic force microscopy to measure DNA and RNA with high precision
Candidate: Sundstrom, Andrew
Advisor(s): Mishra, Bud
Abstract:
Background. An important problem in molecular biology is to determine the complete transcription profile of a single cell, a snapshot that shows which genes are being expressed and to what degree. Seen in series as a movie, these snapshots would give direct, specific observation of the cell.s regulation behavior. Taking a snapshot amounts to correctly classifying the cell.s ~300 000 mRNA molecules into ~30 000 species, and keeping accurate count of each species. The cell.s transcription profile may be affected by low abundances (1-5 copies) of certain mRNAs; thus, a sufficiently sensitive technique must be employed. A natural choice is to use atomic force microscopy (AFM) to perform single-molecule analysis. Reed et al. ("Single molecule transcription profiling with AFM", Nanotechnology , 18:4 , 2007) developed such an analysis that classifies each mRNA by first multiply cleaving its corresponding synthesized cDNA with a restriction enzyme, then constructing its classification label from ratios of the lengths of its resulting fragments. Thus, they showed the transcription profiling problem reduces to making high-precision measurements of cDNA backbone lengths . correct to within 20-25 bp (6-7.5 nm).
Contribution. We developed an image processing and length estimation pipeline using AFM that can achieve these measurement tolerances. In particular, we developed a biased length estimator using James-Stein shrinkage on trained coefficients of a simple linear regression model, a formulation that subsumes the models we studied.
Methods. First, AFM images were processed to extract molecular objects, skeletonize them, select proper backbone objects from the skeletons, then compute initial lengths of the backbones. Second, a linear regression model was trained on a subset of molecules of known length, namely their computed image feature quantities. Third, the model.s coefficients underwent James-Stein shrinkage to create a biased estimator. Fourth, the trained and tuned model was applied to the image feature quantities computed for each test molecule, giving its final, corrected backbone length.
Results. Training data: one monodisperse set of cDNA molecules of theoretical length 75 nm. Test data: two monodisperse sets of cDNA molecules of unknown length. Corrected distributions of molecular backbone lengths were within 6-7.5 nm from the theoretical lengths of the unknowns, once revealed.
Conclusions. The results suggest our pipeline can be employed in the framework specified by Reed et al. to render single-molecule transcription profiles. The results reveal a high degree of systematic error in AFM measurements that suggests image processing alone is insufficient to achieve a much higher measurement accuracy.
Title: Geometric Modeling with High Order Derivatives
Candidate: Tosun, Elif
Advisor(s): Zorin, Denis
Abstract:
Modeling of high quality surfaces is the core of geometric modeling. Such models are used in many computer-aided design and computer graphics applications. Irregular behavior of higher-order differential parameters of the surface (e.g. curvature variation) may lead to aesthetic or physical imperfections. In this work, we consider approaches to constructing surfaces with high degree of smoothness.
One direction is based on a manifold-based surface definition which ensures well-defined high-order derivatives that can be explicitly computed at any point. We extend previously proposed manifold-based construction to surfaces with piecewise-smooth boundary and explore trade-offs in some elements of the construction. We show that growth of derivative magnitudes with order is a general property of constructions with locally supported basis functions and derive a lower bound for derivative growth and numerically study flexibility of resulting surfaces at arbitrary points.
An alternative direction to using high-order surfaces is to define an approximation to high-order quantities for meshes, with high-order surface implicit. These approximations do not necessarily converge point-wise, but can nevertheless be successfully used to solve surface optimization problems. Even though fourth-order problems are commonly solved to obtain high quality surfaces, in many cases, these formulations may lead to reflection-line and curvature discontinuities. We consider two approaches to further increasing control over surface properties.
The first approach is to consider data-dependent functionals leading to fourth-order problems but with explicit control over desired surface properties. Our fourth-order functionals are based on reflection line behavior. Reflection lines are commonly used for surface interrogation and high-quality reflection line patterns are well-correlated with high-quality surface appearance. We demonstrate how these can be discretized and optimized accurately and efficiently on general meshes.
A more direct approach is to consider a poly-harmonic function on a mesh, such as the fourth-order biharmonic or the sixth-order triharmonic. The biharmonic and the triharmonic equations can be thought of as a linearization of curvature and curvature variation Euler-Lagrange equations respectively. We present a novel discretization for both problems based on the mixed finite element framework and a regularization technique for solving the resulting, highly ill-conditioned systems of equations. We show that this method, compared to more ad-hoc discretizations, has higher degree of mesh independence and yields surfaces of better quality.