Dennis Shasha's Research Summary


Goals

I work on quite a few different projects. Most end up resembling puzzles involving large data, pattern matching, or machine learning. Areas of interest include computational biology (mostly on plants) and biomedicine (data analysis, experimental design), time series (fast algorithms for fundamental problems such as correlation and burst detection, as well as applications like forecasting), and pattern matching in trees and labeled graphs.

Since 2013, I've become interested in millimeter-wave wireless problems under the auspices of NYU WIRELESS and in close collaboration with Sundeep Rangan, Aditya Dhananjay, and Marco Mezzavilla. This has included projects on high-performance digital radios, robotics for finding the best placement of base stations, and possibly fire-fighting using drones.

Meta-Algorithms for Machine Learning to Improve Accuracy

In addition to applied machine learning work, I have worked on the problem of designing a meta-algorithm called SafePredict that is allowed to refuse the predictions of an underlying machine learning algorithm if it determines that accepting them would incur an error rate higher than a user-specified bound. Our SafePredict framework can asymptotically achieve a higher correctness rate for any machine learning algorithm on the non-refused predictions. This is the thesis work of Anil Kocak, whom I co-advised with Elza Erkip. David Ramirez is also a collaborator in that work. Shantanu Jain is working on making the software usable by users of machine learning packages. With Ben Peherstorfer, we are extending that work to numerical problems. Alexis Joly, Titouan Lorieul, and I are applying similar reasoning to inferring sets of labels for an object when finding the exact label of that object is too uncertain.
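
To make the refusal idea concrete, here is a minimal sketch in Python (not the published SafePredict algorithm): a wrapper that accepts a base model's prediction only when doing so is unlikely to push the running error rate on accepted predictions above a user-specified bound. The confidence input and the bookkeeping are assumptions of the sketch.

    class RefusalWrapper:
        def __init__(self, target_error=0.1):
            self.target_error = target_error   # user-specified error bound
            self.accepted = 0                  # number of accepted predictions
            self.errors = 0.0                  # running estimate of errors among accepted ones

        def predict(self, label, confidence):
            """Return the label if accepted, or None to refuse.
            `confidence` is the base model's estimated probability that
            `label` is correct (an assumption of this sketch)."""
            expected_errors = self.errors + (1.0 - confidence)
            if expected_errors / (self.accepted + 1) <= self.target_error:
                self.accepted += 1
                self.errors = expected_errors  # accept and update the estimate
                return label
            return None                        # refuse: accepting is too risky

    wrapper = RefusalWrapper(target_error=0.05)
    print(wrapper.predict("cat", confidence=0.99))   # accepted -> "cat"
    print(wrapper.predict("dog", confidence=0.60))   # refused  -> None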

Biological Computing

I've worked on several molecular biology projects since the early 2000s, notably with the plant biology labs of Gloria Coruzzi, Ken Birnbaum, Rich Bonneau, Philip Benfey, and Rodrigo Gutierrez.

Graph Algorithms

Alfredo Ferro, Rosalba Giugno, Alfredo Pulvirenti, Giovanni Micale, Vincenzo Bonnici, Antonio Di Maria, and I have worked on two kinds of graph matching problems: (i) given a small graph, find all instances of that subgraph in a large graph, a well-known NP-complete problem but one with many good heuristics. Lately we have been able to deal with graphs in which nodes can have zero or more labels, as can edges. We have dealt with both exact and inexact matching. (ii) Given a large graph, find the patterns in that graph that appear "unusually often". What this means can depend on the graph generating process. This ongoing project has had the delightful side effects of my visiting the beautiful island of Lipari and keeping me from forgetting all my Italian.
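
As a tiny illustration of problem (i), the sketch below uses networkx's built-in VF2 matcher to find all occurrences of a triangle pattern in a small graph. It is only a stand-in for the systems developed in this project, and the graphs are made up.

    import networkx as nx
    from networkx.algorithms import isomorphism

    # A small "large" graph and a triangle pattern to search for.
    big = nx.Graph()
    big.add_edges_from([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 1)])

    pattern = nx.Graph()
    pattern.add_edges_from([("a", "b"), ("b", "c"), ("c", "a")])

    # Enumerate node-induced subgraphs of `big` isomorphic to `pattern`.
    matcher = isomorphism.GraphMatcher(big, pattern)
    for mapping in matcher.subgraph_isomorphisms_iter():
        print(mapping)   # maps nodes of `big` to nodes of `pattern`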

Finding Bugs in Black Box Workflows

Raoni Lourenco, Juliana Freire, and I have worked on the problem of finding the root causes of bugs in complex workflows. Given a setting where an execution configuration can be characterized by parameter-value settings, where new configurations can be executed at will, and where there is some application-based notion of success or failure, our method can find minimal root causes consisting of disjunctions of conjunctions of parameter-value pairs, often in time linear in the number of parameters.
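
The sketch below gives the flavor under simplified assumptions (it is not the published algorithm): given a failing configuration and a black-box run function, it greedily drops parameters whose removal keeps the run failing, ending with a small set of parameter-value pairs that suffice to trigger the bug. Both run_pipeline and its failure condition are hypothetical.

    def run_pipeline(config):
        # Hypothetical black box: fails whenever solver == "fast" and tolerance == 0.1.
        return not (config.get("solver") == "fast" and config.get("tolerance") == 0.1)

    def minimize_failure(config):
        """Greedily drop parameters whose removal keeps the pipeline failing."""
        culprit = dict(config)
        for key in list(culprit):
            trial = {k: v for k, v in culprit.items() if k != key}
            if not run_pipeline(trial):        # still fails without this parameter
                culprit = trial
        return culprit

    failing = {"solver": "fast", "tolerance": 0.1, "seed": 42, "plot": True}
    print(minimize_failure(failing))           # {'solver': 'fast', 'tolerance': 0.1}

Note that the loop makes one trial run per parameter, which echoes the linear-time behavior mentioned above.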

Version Climber

Christophe Pradal, Sarah Cohen-Boulakia, Patrick Valduriez, and I have worked on the problem of updating software configurations. Imagine a software system containing many packages, each at a certain version, e.g., some version of Python 3, some version of a graphics package, etc. Now, we would like to update one or more of these packages to a newer version. This can be a time-intensive trial-and-error process. Version Climber does an automatic systematic exploration, using heuristics and parallelism to do version upgrades "without tears". We built this on top of Conda.
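
As a toy illustration of the search space (not the Version Climber heuristics themselves), the sketch below enumerates version combinations newest-first and returns the first one that passes a hypothetical build-and-test check; the package names, version lists, and incompatibility are all made up.

    from itertools import product

    candidate_versions = {                     # hypothetical, newest versions first
        "python": ["3.10", "3.9"],
        "numpy": ["1.24", "1.21"],
        "plotlib": ["2.0", "1.5"],
    }

    def builds_and_passes_tests(config):
        # Hypothetical incompatibility: plotlib 2.0 requires numpy 1.24.
        return not (config["plotlib"] == "2.0" and config["numpy"] != "1.24")

    def find_working_config():
        names = list(candidate_versions)
        for combo in product(*(candidate_versions[n] for n in names)):
            config = dict(zip(names, combo))
            if builds_and_passes_tests(config):
                return config                  # first (newest) working combination
        return None

    print(find_working_config())
    # {'python': '3.10', 'numpy': '1.24', 'plotlib': '2.0'}

Brute-force enumeration like this grows exponentially in the number of packages; heuristics and parallelism are what make the exploration practical.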

Formal Verification of Concurrent Search Structures

Thomas Wies, Siddharth Krishna, Nisarg Patel, and I have worked on formal verification of concurrent search structure algorithms using separation logic and a framework I developed some time around the stone age. The approach can already automatically verify many concurrent algorithms that previously had to be verified by hand.

Data Science Support for Linguistics

Chris Collins and Richard Kayne from NYU Linguistics, along with Linguistics doctoral student Michael Taylor, worked with the computer science team consisting of Sangeeta Vishwanath, Hiral Rajani, Jillian Kozyra, and me to build a system called Syntactic Structures of the World's Languages (SSWL) several years ago. The idea is to allow the comparison of the syntax of hundreds or even thousands of languages; the system already permits questions to be answered on a scale never before possible in linguistics. Hilda Koopman has led and greatly expanded the linguistics effort for the last few years, bringing the number of participating linguists into the hundreds. Ross Affenberger and Alex Lobascio, then Marco Liberati, then Hannan Butt along with Shailesh Vasandani have worked on the implementation, resulting in Terraling, a more flexible version of the software.

Acronym Expansion

Haven't we all had the problem that we cannot understand an acronym in an article, even when it might have been defined earlier? Our acronym expander project uses parsing to find acronym definitions within a paper p and document similarity to find definitions in other papers p' that are similar to p. We are working on an end-to-end system that I hope will be used by millions. This is joint work with Helena Galhardas and Joao Pereira currently, and earlier with Kshitiz Sethia and Ben Turtel.
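
Here is a minimal sketch of the in-document step only: a regular expression finds patterns of the form "long form (ACRONYM)" and checks that the preceding words spell out the acronym. The real system goes further, using document similarity to borrow definitions from related papers; the regex and example text here are illustrative assumptions.

    import re

    def find_acronym_definitions(text):
        defs = {}
        # Look for a parenthesized acronym preceded by up to six words.
        for match in re.finditer(r"((?:\w+[-\s]){1,6})\(([A-Z]{2,})\)", text):
            words, acronym = match.group(1).split(), match.group(2)
            candidate = words[-len(acronym):]          # last |acronym| words
            if len(candidate) == len(acronym) and all(
                w[0].upper() == a for w, a in zip(candidate, acronym)
            ):
                defs[acronym] = " ".join(candidate)
        return defs

    text = "We use a support vector machine (SVM) and latent Dirichlet allocation (LDA)."
    print(find_acronym_definitions(text))
    # {'SVM': 'support vector machine', 'LDA': 'latent Dirichlet allocation'}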

Finding Correlated Time Series

Alexandra Levchenko, Boyan Kolev, Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas, Patrick Valduriez, and I have created a system that, given a query time series, finds the closest time series to it in a (possibly large) collection of time series. In order to avoid a linear scan of that collection, the system, BestNeighbor, uses either a derivation of the iSAX system or a random projection approach. It turns out that random projections work best when the time series have a lot of energy in their high frequency components.
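
The sketch below illustrates the random-projection idea in isolation (it is not the BestNeighbor system): each series is compressed into a short sketch via random vectors, candidates are filtered by sketch distance, and the survivors are verified with the exact distance. The sizes and data are invented for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n_series, length, sketch_size = 1000, 256, 16

    collection = rng.standard_normal((n_series, length))
    query = collection[123] + 0.05 * rng.standard_normal(length)   # near series 123

    # Project every series onto a few random directions to get short sketches.
    projections = rng.standard_normal((length, sketch_size)) / np.sqrt(sketch_size)
    sketches = collection @ projections
    query_sketch = query @ projections

    # Filter: keep the 20 series whose sketches are closest to the query's sketch.
    candidates = np.argsort(np.linalg.norm(sketches - query_sketch, axis=1))[:20]

    # Verify: compute exact distances only for the candidates.
    exact = np.linalg.norm(collection[candidates] - query, axis=1)
    print(candidates[np.argmin(exact)])        # expected: 123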

Short Review of Other Projects

A general pattern? I like puzzles. A second general pattern is that I program a lot in a fast, extremely expressive language called K. A third pattern is that I work with excellent people -- undergraduates, master's students, doctoral students, post-docs, other profs, and non-academics.

If you have skill, energy, and initiative and if you like what you have seen here, then drop me a line. I might have a project for you. My philosophy is to try to find something that is close to your heart and close to mine. All projects take work. You have to believe in the goal and take pleasure in the means to get there.

Look at Marianne Winslett's interview of me in which I try to explain the way I work.


Older Work

Pattern Recognition

Tamper-resistant file systems

With David Mazieres, I worked on a network file system that supports the following scenario (among others): a group of people work together in a distributed fashion but trust neither one another nor their system administrator. For example, they might outsource their system administration to an organization that they don't necessarily trust. Historically, that organization could make subtle changes to their data, read their data, and so on. The system we have designed makes tampering changes quickly detectable. (Secrecy can be achieved with straightforward cryptographic techniques.) The only assumption we make is that each client has a secret signature key. Please see our PODC paper here. Some promising performance results due to the great efforts of David, Jinyuan Li, and Maxwell Krohn appear in OSDI. With Radu Sion and Peter Worth of Stony Brook, and then with Arthur Meacham, then at NYU, we did work on provably secure database outsourcing. Suppose a mutually trusting group of clients wants to use the software provided by an outsourcer. The guarantee is that the outsourcer will not be able to understand the client data (because it is encrypted when the outsourcer sees it), nor will the outsourcer know which data any client accesses, and the clients will enjoy full transactional guarantees.
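
The detection idea can be sketched as follows. This is only a toy, using a shared keyed MAC as a stand-in for the per-client signature keys of the real design, and ignoring the consistency issues the actual system handles; the key, paths, and storage are hypothetical.

    import hmac, hashlib

    client_key = b"secret-key-known-only-to-clients"
    server_store = {}                                  # untrusted storage: path -> (data, tag)

    def client_write(path, data):
        tag = hmac.new(client_key, data, hashlib.sha256).digest()
        server_store[path] = (data, tag)

    def client_read(path):
        data, tag = server_store[path]
        expected = hmac.new(client_key, data, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("tampering detected for " + path)
        return data

    client_write("/notes.txt", b"meet at noon")
    # The untrusted server alters the data but cannot forge a matching tag.
    server_store["/notes.txt"] = (b"meet at midnight", server_store["/notes.txt"][1])
    try:
        client_read("/notes.txt")
    except ValueError as err:
        print(err)                                     # tampering detected for /notes.txt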

AQuery: a database system for querying ordered data

An order-dependent query is one whose result (interpreted as a multi-set) changes if the order of the input records is changed. In a stock-quotes database, for instance, retrieving all quotes concerning a given stock for a given day does not depend on order, because the collection of quotes does not depend on order. By contrast, finding the five-price moving average in a trade table gives a result that depends on the order of the table. Query languages based on the relational data model can handle order-dependent queries only through add-ons. SQL:1999, for example, permits the use of a data ordering mechanism called a ``window'' in limited parts of a query. As a result, order-dependent queries become difficult to write in those languages, and optimization techniques for these features, applied as pre- or post-enumerating phases, are generally crude. The goal of our work is to show that when order is a property of the underlying data model and algebra, writing order-dependent queries in a language can be natural, as can their optimization. We introduce AQuery, an SQL-like query language and algebra that has from-the-ground-up support for order. We also present a framework for optimization of the categories of order-dependent queries it expresses. The framework is able to take advantage of the large body of query transformations on relational systems while incorporating new ones described here. We show by experiment that the resulting system is orders of magnitude faster than current SQL:1999 systems on many natural order-dependent queries. You can see our paper here. You can see a PowerPoint presentation here. Joint work with Alberto Lerner.
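
To see why such a query is order-dependent, here is a plain-Python rendering (not AQuery syntax) of the five-price moving average: sorting by timestamp first is essential, and permuting the rows without it would change the answer. The trade data are invented.

    trades = [
        {"ts": 1, "price": 10.0},
        {"ts": 2, "price": 10.5},
        {"ts": 3, "price": 11.0},
        {"ts": 4, "price": 10.8},
        {"ts": 5, "price": 11.2},
        {"ts": 6, "price": 11.4},
    ]

    def five_price_moving_average(rows):
        prices = [r["price"] for r in sorted(rows, key=lambda r: r["ts"])]
        out = []
        for i in range(len(prices)):
            window = prices[max(0, i - 4): i + 1]   # up to the last five prices
            out.append(sum(window) / len(window))
        return out

    print(five_price_moving_average(trades))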

AJAX: a data cleaning system

Le Subscribe: a publish-subscribe system

Fault Tolerant Parallel Programming

Database Internals Work

Database Tuning and Wall Street

Real-Time Scheduling

Thinksheet and StratPal

Dave Tanzer's thesis is on efficient backwards reasoning in the thinksheet context. The work has been completed.

Recently, we (Stacey Kuznetsov and I) have designed a new system called StratPal. StratPal is a simplified thinksheet that can be used to model laws and strategies. Its key feature is that, given a linear document, it is easy to create a StratPal application that can be improved incrementally over time.

Benchmarks

K. Jacob of Morgan Stanley and I have designed a benchmark for financial time series queries, called FinTime, which database vendors and customers may find interesting. Also on the subject of benchmarks, Yunyue Zhu and I have designed a benchmark for bitemporal database management systems, called SpyTime.

Software

Statstream (incremental correlation in time series) software description

Tree searching (finding a small tree in a database of large trees, where the order among siblings doesn't matter) software description

Tree difference (finding the difference between trees, where the order among siblings does matter) software description

Graph comparison (heuristic techniques for comparing graphs) software description

Graph clustering (finding interesting motifs in graphs) software description

SchedMatch (find the differences between partial orders) software description

Fun Stuff


Last modified July 2020. The November 2018 version was translated to French by Deepak Khanna. The January 2015 version was translated to Estonian by Weronika Pawlak.