Instructions for submitting a technical report or thesis.
Title: An Efficient Reduction of Ranking to Classification
Author(s): Ailon, Nir; Mohri, Mehryar
Abstract:
This paper describes an efficient reduction of the learning problem of ranking to binary classification. As with a recent result of Balcan et al. (2007), the reduction guarantees an average pairwise misranking regret of at most $2r$ using a binary classifier with regret $r$. However, our reduction applies to a broader class of ranking loss functions, admits a simpler proof, and the expected running time complexity of our algorithm in terms of number of calls to a classifier or preference function is improved from $\Omega(n2)$ to $O(n \log n)$. Furthermore, when the top $k$ ranked elements only are required ($k \ll n$), as in many applications in information extraction or search engines, the time complexity of our algorithm can be further reduced to $O(k \log k + n)$. Our reduction and algorithm are thus practical for realistic applications where the number of points to rank exceeds several thousands. Much of our results also extend beyond the bipartite case previously studied.
Title: N-Way Composition of Weighted Finite-State Transducers
Author(s): Allauzen, Cyril; Mohri, Mehryar
Abstract:
Composition of weighted transducers is a fundamental algorithm used in many applications, including for computing complex edit-distances between automata, or string kernels in machine learning, or to combine different components of a speech recognition, speech synthesis, or information extraction system. We present a generalization of the composition of weighted transducers, \emph{$n$-way composition}, which is dramatically faster in practice than the standard composition algorithm when combining more than two transducers. The expected worst-case complexity of our algorithm for composing three transducers $T_1$, $T_2$, and $T_3$\ignore{ depending on the strategy used, is $O(|T_1|_E|T_2|_Q|T_3|_E + |T|)$ or $(|T_1|_Q|T_2|_E|T_3|_Q + |T|)$, } is $O(\min(|T_1|_E|T_2|_Q|T_3|_E, |T_1|_Q|T_2|_E|T_3|_Q) + |T|)$, where $T$ is the result of that composition and $|T_i| = |T_i|_Q + |T_i|_E$ with $|T_i|_Q$ the number of states and $|T_i|_E$ the number of transitions of $T_i$, $i = 1, 2, 3$. In many cases, this significantly improves on the complexity of standard composition. Our algorithm also leads to a dramatically faster composition in practice. Furthermore, standard composition can be obtained as a special case of our algorithm. We report the results of several experiments demonstrating this improvement. These theoretical and empirical improvements significantly enhance performance in the applications already mentioned.
Title: Scaling Data Servers via Cooperative Caching
Candidate: Annapureddy, Siddhartha
Advisor(s): Mazieres, David
Abstract:
In this thesis, we present design techniques -- and systems that illustrate and validate these techniques -- for building data-intensive applications over the Internet. We enable the use of a traditional bandwidth-limited server in these applications. A large number of cooperating users contribute resources such as disk space and network bandwidth, and form the backbone of such applications. The applications we consider fall in one of two categories. The first type provide user-perceived utility in proportion to the data download rates of the participants; bulk data distribution systems is a typical example. The second type are usable only when the participants have data download rates above a certain threshold; video streaming is a prime example.
We built Shark, a distributed file system, to address the first type of applications. It is designed for large-scale, wide-area deployment, while also providing a drop-in replacement for local-area file systems. Shark introduces a novel locality-aware cooperative-caching mechanism, in which clients exploit each other's file caches to reduce load on an origin file server. Shark also enables sharing of data even when it originates from different servers. In addition, Shark clients are mutually distrustful in order to operate in the wide-area. Performance results show that Shark greatly reduces server load and reduces client-perceived latency for read-heavy workloads both in the wide and local areas.
We built RedCarpet, a near-Video-on-Demand (nVoD) system, to address the second type of applications. nVoD allows a user to watch a video starting at any point after waiting for a small setup time. RedCarpet uses a mesh-based peero-peer (P2P) system to provide the nVoD service. In this context, we study the problem of scheduling the dissemination of chunks that constitute a video. We show that providing nVoD is feasible with a combination of techniques that include network coding, avoiding resource starvation for different chunks, and overay topology management algorithms. Our evaluation, using a simulator as well as a prototype, shows that systems that do not optimize in all these dimensions could deliver significantly worse nVoD performance.
Title: Shape Analysis by Abstraction, Augmentation, and Transformation
Candidate: Balaban, Ittai
Advisor(s): Pnueli, Amir; Zuck, Lenore
Abstract:
The goal of shape analysis is to analyze properties of programs that perform destructive updates of linked structures (heaps). This thesis presents an approach for shape analysis based on program augmentation (instrumentation), predicate abstraction, and model checking, that allows for verification of safety and liveness properties (which, for sequential programs, usually corresponds to program invariance and termination).
One of the difficulties in abstracting heap-manipulating programs is devising a decision procedure for a sufficiently expressive logic of graph properties. Since graph reachability (expressible by transitive closure) is not a first order property, the challenge is in showing that a decision procedure exists for a rich enough subset of first order logic with transitive closure.
Predicate abstraction is in general too weak to verify liveness properties. Thus an additional issue dealt with is how to perform abstraction while retaining enough information. The method presented here is domain-neutral, and applies to concurrent programs as well as sequential ones.
Title: On the Computation of the Relative Entropy of Probabilistic Automata
Author(s): Cortes, Corinna; Mohri, Mehryar; Rastogi, Ashish; Riley, Michael
Abstract:
We present an exhaustive analysis of the problem of computing the relative entropy of two probabilistic automata. We show that the problem of computing the relative entropy of unambiguous probabilistic automata can be formulated as a shortest-distance problem over an appropriate semiring, give efficient exact and approximate algorithms for its computation in that case, and report the results of experiments demonstrating the practicality of our algorithms for very large weighted automata. We also prove that the computation of the relative entropy of arbitrary probabilistic automata is PSPACE-complete.
The relative entropy is used in a variety of machine learning algorithms and applications to measure the discrepancy of two distributions. We examine the use of the symmetrized relative entropy in machine learning algorithms and show that, contrarily to what is suggested by a number of publications, the symmetrized relative entropy is neither positive definite symmetric nor negative definite symmetric, which limits its use and application in kernel methods. In particular, the convergence of training for learning algorithms is not guaranteed when the symmetrized relative entropy is used directly as a kernel, or as the operand of an exponential as in the case of Gaussian Kernels.
Finally, we show that our algorithm for the computation of the entropy of an unambiguous probabilistic automaton can be generalized to the computation of the norm of an unambiguous probabilistic automaton by using a monoid morphism. In particular, this yields efficient algorithms for the computation of the Lp -norm of a probabilistic automaton.
Title: Magnitude-Preserving Ranking Algorithms
Author(s): Cortes, Corinna; Mohri, Mehryar; Rastogi, Ashish
Abstract:
This paper studies the learning problem of ranking when one wishes not just to accurately predict pairwise ordering but also preserve the magnitude of the preferences or the difference between ratings, a problem motivated by its crucial importance in the design of search engines, movie recommendation, and other similar ranking systems. We describe and analyze several algorithms for this problem and give stability bounds for their generalization error, extending previously known stability results to non- bipartite ranking and magnitude of preference-preserving algorithms. We also report the results of experiments comparing these algorithms on several datasets and contrast these results with those obtained using an AUC-maximization algorithm.
Title: Domain Decomposition for Less Regular Subdomains: Overlapping Schwarz in Two Dimensions
Author(s): Dohrmann, Clark R.; Klawonn, Axel; Widlund, Olof B.
Abstract:
In the theory of domain decomposition methods, it is often assumed that each subdomain is the union of a small set of coarse triangles or tetrahedra. In this study, extensions to the existing theory which accommodates subdomains with much less regular shape are presented; the subdomains are only required to be John domains. Attention is focused on overlapping Schwarz preconditioners for problems in two dimensions with a coarse space component of the preconditioner which allows for good results even for coefficients which vary considerably. It is shown that the condition number of the domain decomposition method is bounded by C(1 + H/δ)(1 + log(H/h))
^{2}
, where the constant C independent of the number of subdomains and possible jumps in coefficients between subdomains. Numerical examples are provided which confirm the theory and demonstrate very good performance of the method for a variety of subregions including those obtained when a mesh partitioner is used for the domain decomposition.
Title: Democratizing Content Distribution
Candidate: Freedman, Michael
Advisor(s): Mazieres, David
Abstract:
In order to reach their large audiences, today's Internet publishers primarily use content distribution networks (CDNs) to deliver content. Yet the architectures of the prevalent commercial systems are tightly bound to centralized control, static deployments, and trusted infrastructure, inherently limiting their scope and scale to ensure cost recovery.
To move beyond such shortcomings, this thesis contributes a number of techniques that realize cooperative content distribution. By federating large numbers of unreliable or untrusted hosts, we can satisfy the demand for content by leveraging all available resources. We propose novel algorithms and architectures for three central mechanisms of CDNs: content discovery (where are nearby copies of the client's desired resource?), server selection (which node should a client use?), and secure content transmission (how should a client download content efficiently and securely from its multiple potential sources?).
These mechanisms have been implemented, deployed, and tested in production systems that have provided open content distribution services for more than three years. Every day, these systems answer tens of millions of client requests, serving terabytes of data to more than a million people.
This thesis presents five systems related to content distribution. First, Coral provides a distributed key-value index that enables content lookups to occur efficiently and returns references to nearby cached objects whenever possible, while still preventing any load imbalances from forming. Second, CoralCDN demonstrates how to construct a self-organizing CDN for web content out of unreliable nodes, providing robust behavior in the face of failures. Third, OASIS provides a general-purpose, flexible anycast infrastructure, with which clients can locate nearby or unloaded instances of participating distributed systems. Fourth, as a more clean-slate design that can leverage untrusted participants, Shark offers a distributed file system that supports secure block-based file discovery and distribution. Finally, our authentication code protocol enables the integrity verification of large files on-the-fly when using erasure codes for efficient data dissemination.
Taken together, this thesis provides a novel set of tools for building highly-scalable, efficient, and secure content distribution systems. By enabling the automated replication of data based on its popularity, we can make desired content available and accessible to everybody. And in effect, democratize content distribution.
Title: Declarative Syntax Tree Engineering* Or, One Grammar to Rule Them All
Author(s): Grimm, Robert
Abstract:
Grammars for many parser generators not only specify a language's syntax but also the corresponding syntax tree. Unfortunately, most parser generators pick a somewhat arbitrary combination of features from the design space for syntax trees and thus lock in specific trade-offs between expressivity, safety, and performance. This paper discusses the three major axes of the design space---specification within or outside a grammar, concrete or abstract syntax trees, and dynamically or statically typed trees---and their impact. It then presents algorithms for automatically realizing all major choices from the same, unmodified grammar with inline syntax tree declarations. In particular, this paper shows how to automatically (1) extract a separate syntax tree specification, (2) embed an abstract syntax tree within a concrete one, and (3) infer a strongly typed view on a dynamically typed tree. All techniques are implemented in the Rats! parser generator and have been applied to real-world C and Java grammars and their syntax trees.
Title: Typical: Taking the Tedium Out of Typing
Author(s): Grimm, Robert; Harris, Laune; Le, Anh
Abstract:
The implementation of real-world type checkers requires a non-trivial engineering effort. The resulting code easily comprises thousands of lines, which increases the probability of software defects in a component critical to compiler correctness. To make type checkers easier to implement and extend, this paper presents Typical, a domain-specific language and compiler that directly and concisely captures the structure of type systems. Our language builds on the functional core for ML to represent syntax trees and types as variants and to traverse them with pattern matches. It then adds declarative constructs for common type checker concerns, such as scoping rules, namespaces, and constraints on types. It also integrates error checking and reporting with other constructs to promote comprehensive error management. We have validated our system with two real-world type checkers written in Typical, one for Typical itself and the other for C.
Title: Joint Inference for Information Extraction and Translation
Candidate: Ji, Heng
Advisor(s): Grishman, Ralph
Abstract:
The traditional natural language processing pipeline incorporates multiple stages of linguistic analysis. Although errors are typically compounded through the pipeline, it is possible to reduce the errors in one stage by harnessing the results of the other stages.
This thesis presents a new framework based on component interactions to approach this goal. The new framework applies all stages in a suitable order, with each stage generating multiple hypotheses and propagating them through the whole pipeline. Then the feedback from subsequent stages is used to enhance the target stage by re-ranking these hypotheses, and then produce the best analysis.
The effectiveness of this framework has been demonstrated by substantially improving the performance of Chinese and English entity extraction and Chinese-to-English entity translation. The inference knowledge includes mono-lingual interactions among information extraction stages such as name tagging, coreference resolution, relation extraction and event extraction, as well as cross-lingual interaction between information extraction and machine translation.
Such symbiosis of analysis components allows us to incorporate information from a much wider context, spanning the entire document and even going across documents, and utilize deeper semantic analysis; it will therefore be essential for the creation of a high- performance NLP pipeline.
Title: An analysis of a FETI--DP algorithm on irregular subdomains in the plane
Author(s): Klawonn, Axel; Rheinbach, Oliver; Widlund, Olof B.
Abstract:
In the theory for domain decomposition algorithms of the iterative substructuring family, each subdomain is typically assumed to be the union of a few coarse triangles or tetrahedra. This is an unrealistic assumption, in particular, if the subdomains result from the use of a mesh partitioner in which case they might not even have uniformly Lipschitz continuous boundaries.
The purpose of this study is to derive bounds for the condition number of these preconditioned conjugate gradient methods which depend only on a parameter in an isoperimetric inequality and two geometric parameters characterizing John and uniform domains. A related purpose is to explore to what extent well known technical tools previously developed for quite regular subdomains can be extended to much more irregular subdomains.
Some of these results are valid for any John domains, while an extension theorem, which is needed in this study, requires that the subdomains are uniform. The results, so far, are only complete for problems in two dimensions. Details are worked out for a FETI--DP algorithm and numerical results support the findings. Some of the numerical experiments illustrate that care must be taken when selecting the scaling of the preconditioners in the case of irregular subdomains.
Title: Degeneracy Proof Predicates for the Additively Weighted Voronoi Diagram
Candidate: Millman, David
Advisor(s): Yap, Chee
Abstract:
This thesis focuses on the Additively Weighted Voronoi diagram. It begins by presenting the history of the diagram and some of the early algorithms used for its generation [OBSC00, Aur91]. The paper then addresses the more recent incremental approach to calculating the diagram, as seen in the 2D Apollonius Graphs (Delaunay Graphs of Disks) package of CGAL [KY06]. Next, the algorithm of Boissonnat et al. [BD05] for calculating Convex Hulls is presented. We then apply the predicates presented by Bossonnat to the CGAL implementation of the AW-Voronoi diagram, and the results are discussed. The main contribution of this paper results in predicates of the AW-Voronoi diagram, with a lower algebraic degree which also handle degeneracies in such a way as to produce a conical result.
Title: Cellstorm: A bioinformatics software system to visualize subcellular networks
Candidate: Neves, Ana
Advisor(s): Shasha, Dennis
Abstract:
Cellstorm is a software system that allows a rapid visualization of genes and subcellular networks. Given a set of genes, expression levels, structural hierarchy and network's data, Cellstorm displays the networks at any level of the hierarchy and provides a set of user options such as zooming, network selection and list filtering.
Although Cellstorm is mainly aimed at biological applications, it can be used in any field that needs to display networks. Cellstorm achieves this by avoiding application-specific assumptions.
Title: Authentication Mechanisms for Open Distributed Systems
Candidate: Nicolosi, Antonio
Advisor(s): Mazieres, David; Shoup, Victor
Abstract:
While authentication within organizations is a well-understood problem, traditional solutions are often inadequate at the scale of the Internet, where the lack of a central authority, the open nature of the systems, and issues such as privacy and anonymity create new challenges. For example, users typically establish dozens of web accounts with independently administered services under a single password, which increases the likelihood of exposure of their credentials; users wish to receive email from anyone who is not a spammer, but the openness of the email infrastructure makes it hard to authenticate legitimate senders; users may have a rightful expectation of privacy when viewing widely-accessed protected resources such as premium website content, yet they are commonly required to present identifying login credentials, which permits tracking of their access patterns.
This dissertation describes enhanced authentication mechanisms to tackle the challenges of each of the above settings. Specifically, the dissertation develops: 1) a remote authentication architecture that lets users recover easily in case of password compromise; 2) a social network-based email system in which users can authenticate themselves as trusted senders without disclosing all their social contacts; and 3) a group access-control scheme where requests can be monitored while affording a degree of anonymity to the group member performing the request.
The proposed constructions combine system designs and novel cryptographic techniques to address their respective security and privacy requirements both effectively and efficiently.
Title: New Design Criteria for Hash Functions and Block Ciphers
Candidate: Puniya, Prashant
Advisor(s): Dodis, Yevgeniy
Abstract:
Cryptographic primitives, such as hash functions and block ciphers, are integral components in several practical cryptographic schemes. In order to prove security of these schemes, a variety of security assumptions are made on the underlying hash function or block cipher, such as collision-resistance, pseudorandomness etc. In fact, such assumptions are often made without much regard for the actual constructions of these primitives. In this thesis, we address this problem and suggest new, and possibly better, design criteria for hash functions and block ciphers.
We start by analyzing the design criteria underlying hash functions. The usual design principle here involves a two-step procedure: First, come up with a heuristically-designed and ``hopefully strong'' fixed-length input construction (i.e. the compression function), then use a standard domain extension technique, usually the cascade construction, to get a construction that works for variable-length inputs. We investigate this design principle from two perspectives:
We next move on to discuss the Feistel network, which is used in the design of several popular block ciphers such as DES, Triple-DES, Blowfish etc. Currently, the celebrated result of Luby-Rackoff (and further extensions) is regarded as the theoretical basis for using this construction in block cipher design, where it was shown that a four-round Feistel network is a (strong) pseudorandom permutation (PRP) if the round functions are independent pseudorandom functions (PRFs). We study the Feistel network from two different perspectives:
We give a positive answer to the first question and a partial positive answer to the second question. In the process, we undertake a combinatorial study of the Feistel network, that might be useful in other scenarios as well. We provide several practical applications of our results for the Feistel network.
Title: Empirical Bayes least squares estimation without an explicit prior
Author(s): Raphan, Martin; Simoncelli, Eero P.
Abstract:
Bayesian estimators are commonly constructed using an explicit prior model. In many applications, one does not have such a model, and it is difficult to learn since one does not have access to uncorrupted measurements of the variable being estimated. In many cases however, including the case of contamination with additive Gaussian noise, the Bayesian least squares estimator can be formulated directly in terms of the distribution of noisy measurements. We demonstrate the use of this formulation in removing noise from photographic images. We use a local approximation of the noisy measurement distribution by exponentials over adaptively chosen intervals, and derive an estimator from this approximate distribution. We demonstrate through simulations that this adaptive Bayesian estimator performs as well or better than previously published estimators based on simple prior models.
Title: DNA Hash Pooling and its Applications
Author(s): Shasha, Dennis; Amos, Martyn
Abstract:
In this paper we describe a new technique for the characterisation of populations of DNA strands. Such tools are vital to the study of ecological systems, at both the micro (e.g., individual humans) and macro (e.g., lakes) scales. Existing methods make extensive use of DNA sequencing and cloning, which can prove costly and time consuming. The overall objective is to address questions such as: (i) (Genome detection) Is a known genome sequence present at least in part in an environmental sample? (ii) (Sequence query) Is a specific fragment sequence present in a sample? (iii) (Similarity Discovery) How similar in terms of sequence content are two unsequenced samples?
We propose a method involving multiple filtering criteria that result in ``pools" of DNA of high or very high purity. Because our method is similar in spirit to hashing in computer science, we call the method {\it DNA hash pooling}. To illustrate this method, we describe examples using pairs of restriction enzymes. The {\it in silico} empirical results we present reflect a sensitivity to experimental error. The method requires minimal DNA sequencing and, when sequencing is required, little or no cloning.
Title: Being Lazy and Preemptive at Learning toward Information Extraction
Candidate: Shinyama, Yusuke
Advisor(s): Sekine, Satoshi
Abstract:
This thesis proposes a novel approach for exploring Information Extraction scenarios. Information Extraction, or IE, is a task aiming at finding events and relations in natural language texts that meet a user's demand. However, it is often difficult to formulate, or even define such events that satisfy both a user's need and technical feasibility. Furthermore, most existing IE systems need to be tuned for a new scenario with proper training data in advance. So a system designer usually needs to understand what a user wants to know in order to maximize the system performance, while the user has to understand how the system will perform in order to maximize his/her satisfaction.
In this thesis, we focus on maximizing the variety of scenarios that the system can handle instead of trying to improve the accuracy of a particular scenario. In traditional IE systems, a relation is defined a priori by a user and is identified by a set of patterns that are manually crafted or acquired in advance. We propose a technique called Unrestricted Relation Discovery, which defers determining what is a relation and what is not until the very end of the processing so that a relation can be defined a posteriori. This laziness gives huge flexibility to the types of relations the system can handle. Furthermore, we use the notion of recurrent relations to measure how useful each relation is. This way, we can discover new IE scenarios without fully specifying definitions or patterns, which leads to Preemptive Information Extraction, where a system can provide a user a portfolio of extractable relations and let the user choose them.
We used one year news articles obtained from the Web as a development set. We discovered dozens of scenarios that are similar to the existing scenarios tried by many IE systems, as well as new scenarios that are relatively novel. We have evaluated the existing scenarios with Automatic Content Extraction (ACE) event corpus and obtained reasonable performance. We believe this system will shed new light on IE research by giving various experimental IE scenarios.
Title: Constituent Parsing by Classification
Candidate: Turian, Joseph
Advisor(s): Melamed, I. Dan
Abstract:
We present an approach to constituent parsing, which is driven by classifiers induced to minimize a single regularized objective. It is the first discriminatively-trained constituent parser to surpass the Collins (2003) parser without using a generative model. Our primary contribution is simplifying the human effort required for feature engineering. Our model can incorporate arbitrary features of the input and parse state. Feature selection and feature construction occur automatically, as part of learning. We define a set of fine-grained atomic features, and let the learner induce informative compound features. Our learning approach includes several novel approximations and optimizations which improve the efficiency of discriminative training. We introduce greedy completion, a new agenda-driven search strategy designed to find low-cost solutions given a limit on search effort. The inference evaluation function was learned accurately enough to guide the deterministic parsers to the optimal parse reasonably quickly without pruning, and thus without search errors. Experiments demonstrate the flexibility of our approach, which has also been applied to machine translation (Wellington et. al, AMTA 2006; Turian et al., NIPS 2007).
Title: Enhanced Security Models for Network Protocols
Candidate: Walfish, Shabsi
Advisor(s): Dodis, Yevgeniy
Abstract:
Modeling security for protocols running in the complex network environment of the Internet can be a daunting task. Ideally, a security model for the Internet should provide the following guarantee: a protocol that "securely" implements a particular task specification will retain all the same security properties as the specification itself, even when an arbitrary set of protocols runs concurrently on the same network. This guarantee must hold even when other protocols are maliciously designed to interact badly with the analyzed protocol, and even when the analyzed protocol is composed with other protocols. The popular Universal Composability (UC) security framework aims to provide this guarantee.
Unfortunately, such strong security guarantees come with a price: they are impossible to achieve without the use of some trusted setup. Typically, this trusted setup is global in nature, and takes the form of a Public Key Infrastructure (PKI) and/or a Common Reference String (CRS). However, the current approach to modeling security in the presence of such setups falls short of providing expected security guarantees. A quintessential example of this phenomenon is the deniability concern: there exist natural protocols that meet the strongest known security notions (including UC) while failing to provide the same deniability guarantees that their task specifications imply they should provide.
We introduce the Generalized Universal Composability (GUC) framework to extend the UC security notion and enable the re-establishment of its original intuitive security guarantees even for protocols that use global trusted setups. In particular, GUC enables us to guarantee that secure protocols will provide the same level of deniability as the task specification they implement. To demonstrate the usefulness of the GUC framework, we first apply it to the analysis and construction of deniable authentication protocols. Building upon such deniable authentication protocols, we then prove a general feasibility result showing how to construct protocols satisfying our security notion for a large class of two-party and multi-party tasks (assuming the availability of some reasonable trusted setup). Finally, we highlight the practical applicability of GUC by constructing efficient protocols that securely instantiate two common cryptographic tasks: commitments and zero-knowledge proofs.
Title: Tree-Structured Models of Multitext: Theory, Design and Experiments
Candidate: Wellington, Benjamin
Advisor(s): Melamed, I. Dan
Abstract:
Statistical machine translation (SMT) systems use empirical models to simulate the act of human translation between language pairs. This dissertation surveys the ability of currently popular syntax-aware SMT systems to model real-world multitext, and shows different types of linguistic phenomena occurring in natural language translation that these popular systems cannot capture. It then proposes a new grammar formalism, Generalized Multitext Grammar (GMTG), and a generalization of Chomsky Normal Form, that allows us to build an efficient SMT system using previously developed parsing techniques. The dissertation addresses many software engineering issues that arise when doing syntax-based SMT using large corpora and lays out a object-oriented design for a translation toolkit. Using the toolkit, we show that a tree-transduction based SMT system, which uses modern machine learning algorithms, outperforms a generative baseline.
Title: Formal Verification Using Static and Dynamic Analyses
Candidate: Zaks, Aleksandr
Advisor(s): Pnueli, Amir
Abstract:
One of the main challenges of formal verification is the ability to handle systems of realistic size, which is especially exacerbated in the context of software verification. In this dissertation, we suggest two related approaches that, while both rely on formal method techniques, they can still be applied to larger practical systems. The scalability is mainly achieved by restricting the types of properties we are considering and guarantees that are given.
Our first approach is a novel run-time monitoring framework. Unlike previous work on this topic, we expect the properties to be specified using Property Specification Language (PSL). PSL is a newly adopted IEEE P1850 standard and is an extension of Linear Temporal Logic (LTL). The new features include regular expressions and finite trace semantics, which make the new logic very attractive for run-time monitoring of both software and hardware designs. To facilitate the new logic we have extended the existing algorithm for LTL tester construction to cover the PSL specific operators. Another novelty of our approach is the ability to use partial information about the program that is being monitored while the existing tools only use the information about the observed trace and the property under consideration. This allows going beyond the focus of traditional run-time monitoring tools -- error detection in the execution trace, towards the focus of static analysis -- bug detection in programs.
In our second approach, we employ static analysis to compute SAT-based function summaries to detect invalid pointer accesses. To compute function summaries, we propose new techniques for improving the precision and performance in order to reduce the false error rates. In particular, we use BDDs to represent a symbolic simulation of functions, where BDDs allow an efficient representation of path-sensitive information and high level simplification. In addition, we use light-weight range analysis technique for determining lower and upper bounds for program variables, which can further offload the work form the SAT solver. Note that while in our current implementation the analysis happens at compile time, we can also use the function summaries as a basis for run-time monitoring.