Better Burst detection

Xin Zhang
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
xinzhang@cs.nyu.edu

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html



Motivation

Many aspects of our lives are described by events. An unexpectedly large number of events occurring within some certain temporal or spatial region is called a burst, suggesting unusual behaviors or activities. Bursts, as noteworthy phenomena, come up in many natural and social processes. Detecting a burst over any of $k$ window sizes, a problem we call elastic burst detection, in a stream of length $N$ naively requires $O(kN)$ time. Our previous work showed that a simple Shifted Binary Tree structure can reduce this time substantially (in very favorable cases near to $O(N)$) by filtering away obvious non-bursts. Unfortunately, for certain data distributions, the filter marks many windows of events as possible bursts, even though a detailed check shows them to be non-bursts.

This work explores efficient data structures and algorithms for high performance burst detection. We present a family of data structures, called Shifted Aggregation Trees (SAT), and heuristic algorithm to search for an efficient data structure given the input time series and the window thresholds. The advantage of this framework is that it is adaptive to different inputs. A technical report can be found here (TR2005-876).


Software

The software package can be found here. The code is written in standard C++ and the binary is built using Microsoft Visual Studio 6.0.

Usage

  • To train a shifted aggregation tree from a small sample data:

    search thresholdFile sampleDataFile satFile [maxNumberActiveState numberFinalStateToStop]

    Where The formats of these files are explained below. Please refer to the technical report for these two state space algorithm parameters.

  • To detect bursts:

    detect thresholdFile dataFile satFile
    dynamicDetect thresholdFile dataFile satFile

    Where dataFile contains the time series to be detected, and satFile contains the shifted aggregation tree used for the detection.
    The output is printed to the standard output.
  • Semantics of the input

    The input time series is stored in a binary date file of the following format:
    x1
    x2
    x3
    ...
    The input thresholds are of the following text format:
    w1 f(w1)
    w2 f(w2)
    w3 f(w3)
    ... ...
    Where w1 < w2 < w3 < ... and f(w1) < f(w2) < f(w3) < ...
    If the sum of the subsequence in a window with size w_i is larger than f(w_i), we say that the window experiences a burst.
    An example:
    4 10.4
    9 21.5
    18 43.2
    40 90
    ... ...

    Semantics of a SAT file

    A SAT file defines the structure of a shifted aggregation tree. It has the following format:

    N W_N
    S_1 W_1 C^1_1 C^2_1 ...
    S_2 W_2 C^1_2 C^2_2 ...
    ... ... ... ... ...
    S_N W_N C^1_N C^2_N ...
    Where N is the number of levels in this SAT, S_i is the shift at level i, W_i is the window size of level i and C^i_j is the window size of the ith child's at level j.

    For example, the following SAT file corresponds to a shifted binary tree of maximum window size of 32:
    6 32
    1 1 1
    1 2 1 1
    2 4 2 2
    4 8 4 4
    8 16 8 8
    16 32 16 16

    Semantics of the output

    The output of the program is a list of tuples as follows.
    Starting Position, Window Size, Aggregate within this window
    Each tuple means that in the time series, the window starting at Starting Position with size Window Size experiences a burst having Aggregate value.

    Contact

    Please email shasha@cs.nyu.edu or xinzhang@cs.nyu.edu for any questions or comments.