Computer Science Colloquium
Robust Analytics on Data Streams
Flip Korn, AT&T Labs
February 19, 2014
Warren Weaver Hall, 1302
251 Mercer Street
New York, NY 10012
How can one make sense of fast and voluminous data?
How can the value of Big Data be extracted when data
As the quantity of digitized data explodes,
the quality of this data can be poor when generated
by fallible users (e.g., crowd sourcing) and
unreliable hardware (e.g., sensors),
sent across volatile networks (e.g., wireless)
and stored in complex systems (e.g., "the cloud").
Existing data cleansing techniques are aimed at solving
specific problems, such as record linkage, but it is the
unknown data quality problems that are hardest to detect
and often the most pernicious. Hence, analytics queries
must be applied robustly to avoid misleading answers.
In this talk I will first discuss how to perform robust,
complex analytics, built from quantiles and frequent items
primitives, for IP network traffic data at streaming speeds.
These queries are implemented as a library of user-defined
aggregate functions (UDAFs) in a Data Stream Management
System developed at AT&T called GS Tool.
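
To make the "frequent items" primitive concrete, here is a minimal sketch of one well-known streaming summary, the Misra-Gries algorithm. It illustrates the general technique only and is not taken from the GS Tool library; the function name, the parameter k, and the sample addresses are assumptions made for this example.

```python
def misra_gries(stream, k):
    """Summarize a stream with at most k - 1 counters (Misra-Gries).

    Any item occurring more than n / k times in a stream of length n
    is guaranteed to survive in the returned dictionary. The stored
    counts are underestimates, and the summary may also contain some
    items that are not actually frequent.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop
            # the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical destination addresses seen in a traffic stream.
packets = ["10.0.0.1"] * 6 + ["10.0.0.2", "10.0.0.3", "10.0.0.4"]
heavy = misra_gries(packets, k=2)
```

The memory used depends only on k, not on the stream length, which is what makes summaries of this kind suitable for evaluation at streaming speeds.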
In the second part, I will discuss an exploratory approach
to data quality where the user poses hypotheses to test,
in the form of constraints such as functional dependencies,
and the system performs multidimensional analysis to summarize
when and where the data satisfies (or fails) the hypotheses.
This approach is useful only if it scales to interactive
speeds, and I will describe a fast lazy evaluation strategy
for this purpose. Then I will mention novel constraints
(e.g., Sequential Dependencies and Conservation Rules)
that exploit structural properties found in many
data warehouses to discover potential errors.
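
As a rough illustration of testing a hypothesis stated as a functional dependency, the sketch below checks a dependency lhs -> rhs separately within each value of a grouping attribute and reports where it fails. It is a naive, eager version written for this example, not the lazy evaluation strategy from the talk; the function name, column names, and sample rows are all hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs, group_by):
    """Summarize, per group, how many lhs values break lhs -> rhs.

    Within a group, an lhs value violates the dependency when it is
    associated with more than one distinct rhs value.
    """
    # (group value, lhs value) -> set of rhs values seen with it
    seen = defaultdict(set)
    for row in rows:
        seen[(row[group_by], row[lhs])].add(row[rhs])

    summary = defaultdict(lambda: {"keys": 0, "violating_keys": 0})
    for (group, _), rhs_values in seen.items():
        summary[group]["keys"] += 1
        if len(rhs_values) > 1:
            summary[group]["violating_keys"] += 1
    return dict(summary)


# Hypothesis: each router belongs to exactly one site (router -> site).
rows = [
    {"day": "mon", "router": "r1", "site": "nyc"},
    {"day": "mon", "router": "r1", "site": "nyc"},
    {"day": "tue", "router": "r1", "site": "nyc"},
    {"day": "tue", "router": "r1", "site": "sfo"},
    {"day": "tue", "router": "r2", "site": "sfo"},
]
report = fd_violations(rows, lhs="router", rhs="site", group_by="day")
```

In this sample the dependency holds for every router on "mon" and fails for one of the two routers on "tue", which is the kind of "when and where" summary described above.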
Flip Korn is a member of the Database Research Department
at AT&T Shannon Labs. His Ph.D. is from the University of
Maryland, College Park. His background is in data mining
and data streams, and his current research focuses
on data quality.