I
work on quite a few different projects  most having
to do with large data and pattern matching or machine learning.
Area interests include computational biology and
biomedicine (data analysis,
visualization, experimental design), time series (fast algorithms
for fundamental problems such as correlation and burst detection
as well as applications like query by humming),
and pattern matching in trees and graphs.
A general pattern? I like puzzles.
A second general pattern is that I program a lot in a fast,
extremely exressive language
called
K.
A third pattern is that I work with excellent people  undergraduates,
master's students, doctoral students, postdocs, other profs, and
nonacademics.
If you have skill, energy, and initiative and if you like what you see below,
then drop me a line.
I might have a project for you.
Look at
Marianne Winslett's interview of me
in which I try to express my research philosophy.
Biological Computing
I've worked on several projects over the last few years,
notably with the plant biology labs of Gloria Coruzzi, Ken Birnbaum,
Philip Benfey, and Rodrigo Gutierrez.
More recently with Rich Bonneau's computational group.

Software to contribute to the visualization of the intersections
and unions of collections of
multiple experiments, multiple genomes or even multiple
baseball players.
Sungear
has been called a Venn diagram on steroids.
It should be useful for social scientists, cancer researchers
and sports fanatics  anyone concerned with trying
to derive interesting information from several long lists of
items (genes, proteins, people, players).
Besides supporting set intersection and unions on items,
Sungear relates those items to functional categories.
This is joint work with Chris Poultney, Rodrigo Gutierrez, Manpreet Katari,
Miriam Gifford, Brad Paley, and Gloria Coruzzi.

Combinatorial
design software
to specify the design of experiments over
several input variables where most of those variables
are considered to be unimportant.
The goal is to explore a large search space with few experiments
while guaranteeing certain properties.
Joint work with Gloria Coruzzi, Peter Palenchar, Rodrigo Gutierrez,
Andrei Kouranov, Laurence Lejay,
and Michael Chou.
(A followon paper with Charles Colbourne gave me an Erdos number of 2.)

Analysis software to find analog circuits given the results
of experiments.
This is also joint work with the Coruzzi lab.

Machine learning work to infer the function of genes, the redundancy
of genes, and regulatory networks.
This is joint work with Gloria Coruzzi, Ken Birnbaum, Huangwen Chen,
Aris Tsirigos, Lee Parnell, as well as help from Mehryar Mohri and
Corrina Cortez.

Work to aid the first stages of protein docking.
We call this protein speeddating.
This is joint work with Chris Poultney and the Bonneau lab.

Software to find the functionality of genes based on
species traits.
This has discovered flagella genes for example,
but I've also applied it to plants.
Please
see a description here.
Joint work with Mitchell Levesque,
Wook Kim, Michael G. Surette, and Philip Benfey.

Software to try to infer binding sites of transcription factors
as well as causality and repression among transcription factors.
We have started with yeast, but the software is quite general.
It uses a mix of pattern analysis and heuristics based on statistics.
Joint work with Philip Benfey and Ken Birnbaum and Aris Tsirigos.
Ken Birnbaum has led a project concerned with
cellsorting and analysis for Arabidopsis that was published in
Science.
Ken, Sunayan Bandyopadhyay, YuanChien Yang,
and I have also embarked on a project
to predict which genes are likely to
be redundant in a collection.

Algorithms for phylogenetic reconstruction based on graphs (a.k.a., networks)
instead of trees.
Our
algorithm
takes as input the best phylogenetic trees for
some set of genes and forms a parsimonious graph.
Joint work primarily with Ken Birnbaum, Matt Olim, and ChienI Liao,
though with helpful summer projects by
Chandni Rajan Valiathan
and David Almond.
Linguistics
Chris Collins and Richard Kayne from NYU Linguistics, along with Linguistics
doctoral student Michael Taylor, have worked
with the computer science team consisting of Sangeeta Vishwanath,
Hiral Rajani, Jillian Kozyra, and I to build a
website to enable
the syntactic comparison of the world's languages.
The design ensures flexibility and high functionality
and already permits questions to be answered on a scale never before possible
in linguistics.
Pattern Recognition

Software and algorithms to find
the highest correlated streams
among thousands of streams extremely efficiently.
That is joint work with
awardwinning
Yunyue Zhu, Xiaojian Zhao,
and Zhihua Wang.
We have a paper in VLDB 2002 describing the algorithms.
More recent work has involved Richard Cole,
Tyler Neylon, and LeeAd Gotlieb.
You can find Zhihua's and Xiaojian's theses on the
CS department's thesis web site
.
Work on "uncooperative" time series (time series in which the
power of the time series
is spread over all Fourier coefficients) that uses random projections
is discussed
here.
Tyler Neylon has extended the pairwise correlation work
to finding approximate linear dependencies among multiple
time series as you can see in his
thesis
Further work along those lines allows us to do query by humming
We have a paper in SIGMOD 2003 describing the algorithms.
Our demo was attacked by hackers but will hopefully return.
Students who have worked on query by humming are:
Zhihua Wang, Steve Toub, Michael Schidlowsky, Kevin Cox, and Megan McNulty.
These algorithms are summarized in the book
High Performance Discovery in Time Series: techniques and case studies
published by Springer Verlag.
There are a few
>
errata
in the book.
We currently have more than 1000 songs, including most of the Beatles
and other popular folk and melodious rock songs.
The demo makes use of dynamic time warping, statisticsbased filters,
and adaptive boosting.
You can download both midi files and lyrics.
If you hum a tune that has a match in our database
and you upload both your humming and the match information,
we will pay $2 to Nature Conservancy.

Software and algorithms to
find bursts in time series data.
We have a paper in KDD 2003 describing the algorithm.
That is joint work with Zhihua Wang, Xiaojian Zhao, and Yunyue Zhu.
Xin Zhang is working on a very nice thesis to improve this further.

Software to make tree, graph and structure searching as fast
as keyword searching.
A paper describing that was published in ACM Pods 2002 as an invited
tutorial
in pdf.
This uses a combination of geometric hashing, combinatorial
techniques from approximate tree matching, and generalizations
of suffix trees.
To download software for the current version
of the graph search software, please see
graphblast (an improvement on graphgrep).
We have deployed a similar module for use within the cytoscape software
called
To cluster graphs, please see
GraphClust.
If you need to compare schedules or partial orders please see
SchedMatch.
If you need tree comparison for ordered trees (where
sibling order matters), please see
treegrep .
If you need searching among unordered trees
(with applications to XML among others), please
see
unordered tree searcher .
If you want to compare unordered trees, please see
our cousindistance based unordered comparison software
You can find our PODS 2002 presentation
here.
An important application of this work has been to perform
structural searches in phylogenetic databases. As of November 2003,
about 500 users worldwide have accessed the tools over 7000 times.
The
tools
have been integrated into
Joint work with Jason Wang, Kaizhong Zhang, Rosalba Giugno,
Diego Reforgiato Recupero
and Alfredo Ferro.
Here is
Rosalba Giugno's very nice thesis.
In addition, we have written several papers:
1. A review of approximate tree matching algorithms by
(the final version is in the book
Pattern Matching in Strings, Trees, and Arrays
by Apostolico
and Galil published by Oxford University Press)
paper in reverse order, just for fun.
2. Approximate graph matching for acyclic graphs
postscript.
3. Discovering patterns in protein sequences
postscript.

With the group of Alfredo Ferro (including Rosalba Giugno for
GraphGrep, and
Domenico Cantone, Alfredo Pulvirenti, Tarcisio
Maugeri, and Giuseppe Piqola for the computer science
side of clustering), we have developed
a set of tools for
finding multiple alignments that we believe improve on the
Clustal package as well as innovative clustering algorithms.

Our goal is to make it possible to discover patterns (i.e.
do data mining) in strings,
trees, and graphs given a pattern metric.
On the way we have worked out or borrowed algorithms
to match a pattern against data and find a distance between the pattern
and the data.
We are really good on trees
and are getting good on graphs.
We have software available by anonymous ftp
and certain experimental software is available from me.
Collaborators: Kaizhong Zhang (U of Western Ontario),
Jason Wang (NJ Institute of Technology),
and Bruce Shapiro (National Cancer Institute).
We have edited a book on this subject:
Pattern Discovery in Biomolecular Data:
Tools, Techniques, and Applications
Jason Wang, Bruce Shapiro, and Dennis Shasha (Eds.)
Oxford University Press, 1999.
Partly on the strength of that book, I have become
the editor of a book series in Genomics
and Bioinformatics . The intent of the series
is to publish graduate level texts for working researchers
in the field.
We are honored to have a superb advisory board:
Michael Ashburner,
Amos Bairoch,
David Botstein,
Charles Cantor,
Lee Hood,
Minoru Kanehisa,
Raju Kucherlapati, and
Craig Venter.
More recently, we have edited a second book on the subject of data mining
Data Mining in Bioinformatics
by J. T. L. Wang, M. J. Zaki, H. T. T. Toivonen and D. Shasha.
published by SpringerVerlag in 2005.
Tamperresistant file systems
With David Mazieres, I have been working on a network file system
that supports the following scenario (among others): a group of
people work together in a distributed fashion but trust neither one
another nor their system administrator.
For example, they might outsource their
system administration to an organization that
they don't necessarily trust.
Historically, that organization could do subtle changes to their
data, read their data, and so on.
The system we have designed makes
makes tampering changes quickly
detectable.
(Secrecy can be achieved with straightforward cryptographic techniques.)
The only assumption we make is that each client has a secret signature key.
Please
see our PODC
paper here .
Some promising performance results
due to the great efforts of David, Jinyuan Li, and Maxwell Krohn
appear
in OSDI.
With Radu Sion and Peter Worth of Stony Brook, we are doing related
work
having to do with provably secure database outsourcing.
Suppose a mutually trusting group of clients want to use
the software provided by an outsourcer.
The guarantee is that the outsourcer will not be able to understand
the client data (because it is encrypted when
the outsourcer sees it), nor will the outsourcer know which
data any client accesses and the clients will enjoy full transactional
guarantees.
AQuery: a database system for querying ordered data
An orderdependent query is one whose result
(interpreted as a multiset)
changes if the order of the input records is changed.
In a stockquotes database, for instance, retrieving all quotes
concerning a given stock for a given day does not depend on order, because
the collection of quotes does not depend on order.
By contrast, finding the five price moving average in a trade table
gives a result that depends on the order of the table.
Query languages based on the relational data model
can handle orderdependent queries only through addons.
SQL:1999, for example,
permits the use of a data ordering mechanism called
a ``window''
in limited parts of a query.
As a result, orderdependent queries
become difficult to write in those languages and
optimization techniques for these features,
applied as pre or postenumerating phases, are generally crude.
The goal of our work is to show that when order
is a property of the underlying data model and algebra, writing
orderdependent queries in a language can be natural as is their optimization.
We introduce AQuery, an SQLlike query language and algebra
t has fromthegroundup support for order.
We also present a framework for optimization of the orderdependent
queries categories it expresses.
The framework is able to take advantage of the large body of
query transformations on relational systems
while incorporating new ones
described here.
We show by experiment that the resulting system is orders of magnitude
faster than current SQL:1999 systems on many natural orderdependen
You can see our paper here.
You can see a power point presentation here.
Joint work with Alberto Lerner.
AJAX: a data cleaning system

Ajax is a framework for data cleaning.
It includes an implementation of comparison, clustering, and
schema tracking for all aspects of data cleaning.
It's also a framework for future additions to data cleaning.
Joint work with Helena Galhardas, Dana Florescu, and Eric Simon of Inria.
Le Subscribe: a publishsubscribe system

We (this is joint work with Francoise Fabret and Francois
Llirbat, and Joao Pereira at INRIA)
have implemented a publishsubscribe system for extremely high
performance and distributed functionality.

Our subscriptions are conjunctions of the form (attribute; value; relop)
e.g.
(movie; toy story II; =),
(price; < ; $10)
Our events are also conjunctions but of the form (attribute; value)
and implicitly on equality, e.g.
(movie; toy story II), (city; paris)

Our performance is the following:
For 400,000 subscriptions having
5 attributes of which one is inequality and four are equality,
and
events with 5 attributes, we
can process events at 5 milliseconds per event
on a machine with Linux SO, i686 CPU at 500MHz with 1G of RAM.
Fault Tolerant Parallel Programming

Our Persistent Linda project extends the Linda system
developed primarily by Dave Gelernter and Nick Carriero at Yale.
We use a slightly weakened form of transaction combined with
checkpoints to support fault tolerance (appeared in Proc of 13th
Symp on Fault Tolerant Distributed Systems)
postscript.
You can get a copy of PLinda from
our web site.
Collaborators: Brian Anderson, Karp Jeong, Suren Talla, Peter Wyckoff, Bin Li,
all students or former students at NYU
In addition, Ekkart Kindler has done
a formal proof method
for verifying longrunning parallel computations.
Database Internals Work
Database Tuning and Wall Street

Database tuning is the activity of making your database system run faster.
Though each vendor will tell you a different story about the subject,
it turns out that the underlying principles are the same.
(As a consultant, I've applied these principles for companies in
telecommunications, finance, online travel companies, and
online gaming.)
If you are interested in
notes on the subject,
then please go ahead and download
My latest book on the subject, coauthored with Philippe Bonnet,
is called
Database Tuning: principles, experiments, and troublieshooting
techniques
published by MorganKaufmann in 2002.
You can find the experiments from that work
here.
With Arthur Whitney, Steve Apter, and the
rest of the K community,
I have been working on simplifying transaction
processing in a large main memory setting.
The technique uses a very fast, interpreted vectorprocessing language.
It avoids concurrency control but allows concurrency.
It is tailored to financial applications.
postscript version and
pdf version .
In Sigmod 1997, I present some lessons learned from
my experience on Wall Street.
The lessons have to do with configuration for global
distributed systems, tuning, and language issues.
Lessons from Wall Street (postscript) .
In Sigmod 2001, Arthur Whitney and I have a paper describing
time series databases for finance that he built.
Lots o' Ticks: realtime high
performance time series queries on billions of trades and quotes
(pdf) .
RealTime Scheduling

We have developed algorithms for scheduling overloaded
sporadic realtime tasks in a uniprocessor setting (in Siam J. comp)
postscript
and in a multiprocessor setting (appeared
in Theoretical Computer Science, July 1994)
postscript.
We have found algorithms and bounds for scheduling
sporadically arriving periodic tasks.
That is, the instances of
each task arrive at regular periods but the task
can arrive at any instant.
Our twist on this problem is that we allow certain instances of such tasks
to be skipped. We look at schedulability in this context.
pdf
.
Collaborator: Gilad Koren
Thinksheet and StratPal

A tool to tailor information flow for readers of complex (or boring) documents
such as laws and a problemsolving tool generalizing spreadsheets.
The tool integrates spreadsheets, rules, databases, and hypermedia.
Collaborators: Roman Yangarber, Peter Piatko, Daoi Lin, Minna Cha,
Dave Tanzer, Alex Shenker, Mike Leder, Julia Tolpin, Mirella Shannon,
and Chris Jones,
all students at NYU.
Dave Tanzer's thesis is on efficient
backwards reasoning in the thinksheet context.
The work has been completed.
Recently, we (Stacey Kuznetsov and I) have designed a new system called
Stratpal.
Stratpal is a simplified thinksheet that can be used to model laws
and strategies. Its key features is that given a linear document,
it is easy to create a StratPal application that can be improved
incrementally over time.
Benchmarks
K. Jacob of Morgan Stanley and I have designed a benchmark
for financial time series queries, called
FinTime
which database vendors and customers may find interesting.
Also on the subject of benchmarks, Yunyue Zhu
and I have designed a benchmark for bitemporal database management
systems, called
SpyTime .
Software
Statstream (incremental correlation in time series)
software description
Tree searching (finding a small tree in a database of large trees;
where the order among siblings doesn't matter)
software description
Tree difference (finding the difference between trees, where
the order among siblings does matter)
software description
Graph comparison (heuristic techniques for comparing graphs)
software description
Graph clustering (finding interesting motifs in graphs)
software description
SchedMatch (find the differences between partial orders)
software description
Fun Stuff

I have written a game to teach children arithmetic
and elementary algebra called Superply.
One of my kids beats me at it regularly, alas.
Chris Poultney is the primary author of a game I designed
called the
Voronoi game.
It was written about in France because it appeared in many
science fairs.
A game to teach statistics in a fun way
requires a player to find the causes of a pandemic
so it's called the
Pandemic Game

Dr. Ecco, a mathematical detective cracks
mysteries by solving puzzles.
Some are combinatoric, e.g. what is the smallest number of people
who could be at a party in which everyone has shaken hands with three
other people, except one person who has only shaken hands with one other
person?
Others involve algorithmic aspects, including the
simplest zero knowledge protocols known to (wo)man.
The first books about him were first published by
W. H. Freeman, (1)2125769400:
The Puzzling Adventures of Dr. Ecco in 1988
(republished by Dover in 1998) and
Codes, Puzzles, and Conspiracy in 1992 (now retitled
Dr. Ecco: mathematical detective in the Dover edition).
See
Professor Scarlet's Notebook
a companion book to teach real mathematics through puzzles.
Dr. Ecco's Cyberpuzzles : 36 Puzzles for Hackers
and Other Mathematical Detectives
published by W. W. Norton in 2002.
This had the first collection of puzzles from Dr. Dobb's Journal
Puzzling Adventures
published by W. W. Norton in January 2005.
This had the first collection of puzzles from my Scientific American column.
The Puzzler's Elusion
published by Avalon Press, March 2006.
A combination of puzzles
from Scientific American and Dr. Dobb's Journal.
Puzzles for Programmers and Pros
published by Wiley in May 2007.
 As suggested by
these books, I have had the pleasure of writing a mathematical puzzle
column for Dr. Dobb's Journal
and currently
write the monthly puzzle
column for Scientific American
(look under recreations).

You can see a talk called
Upstart Puzzles
that I gave at the Canadian Mathematical Society
summer meeting at Edmonton in June 2003.

You can also hear puzzles on the radio
in Arkansas.

Out of Their Minds: the lives and discoveries of 15
great computer scientists is a book of biographies
of 15 great computer scientists.
You can see me on realplayer video
talking about the book.

My latest book Natural Computing
is a book about the work of computer scientists, roboticists, and
other innovators about the future of computing.
Here are some of the
reviews.

All the smart Russian students I've had has inspired
me to collaborate with playwright Marina Shron
in a book about recent Russian immigrants entitled
Red Blues: voices from the last wave of Rusian immigrants.
You can find excerpts of the book
here.

Here is a list of the publishers who have
translated my puzzle books in several countries:

China, People's Republic: Hunan Science and Technology press.

Czech Republic: Mlada

France: Odile Jacob

Germany: MVG Verlag

Hungary: Typotex

Japan: Nikkei Scientific

Korea: Kyungmoon

Poland: Spolddzielnia WydawniczoHandlowa Ksiazka i Wiedza

Portugal: Gradiva

Spain: Labor and Gredisa

Slovenia: Drzavna Zalozba Slovenije, zbirka z logiko

Taiwan, Republic of China: The Eurasian Publishing Group
and Chiu Chang Math. Books & Puzzles Co.

Turkey: Tubitak

These publishers have translated {\em Database Tuning: principles, experiments,
and troubleshooting techniques}
by Philippe Bonnet and me:

China, People's Republic: Publishing House of Electronics

Korea: KCC (Brain Korea Publishing Co.)

Russia: Kudits Obraz

These publishers translated {\em Out of their Minds: the lives
and discoveries of 15 great computer scientists}
by Cathy Lazere and me:

China, People's Republic: Hebei University Publishers

Japan: Nikkei business publications

Korea: Sejong

Taiwan: YuanLiou