\documentclass{acm_proc_article-sp}
\usepackage[dvipdfm,bookmarks]{hyperref}
\usepackage{epsfig}\usepackage{amssymb}\usepackage{amsmath}\usepackage{amsfonts}
\setlength{\voffset}{0.5in}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{problem}{Problem}
\newtheorem{corollary}{Corollary}


\begin{document}
\title{Efficient Elastic Burst Detection in Data Streams
\thanks{Work supported in part by U.S. NSF grants IIS-9988636, MCB-0209754 and
N2010-0115586.}
}

\toappear{
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. \\
SIGKDD '03, August 24-27, 2003, Washington, DC, USA.\\
Copyright 2003 ACM 1-58113-737-0/03/0008...\$5.00. 
}

\numberofauthors{2}\author{
\alignauthor Yunyue Zhu\\ \affaddr{Department of Computer Science} \\
      \affaddr{Courant Institute of Mathematical Sciences} \\  \affaddr{New York University, New York, NY, 10012} \\ \email{yunyue@cs.nyu.edu}
\alignauthor Dennis Shasha\\ \affaddr{Department of Computer Science} \\ 
	\affaddr{Courant Institute of Mathematical Sciences} \\ \affaddr{New York University, New York, NY, 10012} \\ \email{shasha@cs.nyu.edu}}
\maketitle

\begin{abstract}
Burst detection is the activity of finding abnormal aggregates in data streams. Such aggregates are based on sliding windows over data streams. In some applications, we want to monitor many sliding window sizes simultaneously and to report those windows with aggregates significantly different from other periods. We present a general data structure for detecting interesting aggregates over such elastic windows in near linear time. We present applications of the algorithm for detecting Gamma Ray Bursts in large-scale astrophysical data. Detection of periods with high volumes of trading activity and high stock price volatility is also demonstrated using real-time Trade and Quote (TAQ) data from the New York Stock Exchange (NYSE). Our algorithm beats the direct computation approach by several orders of magnitude.
\end{abstract}

\section{Introduction}
Consider the following application that motivates this research. An astronomical telescope, Milagro\cite{milagro}, was built in New Mexico by a group of astrophysicists from the Los Alamos National Laboratory and many universities. This telescope is actually an array of light-sensitive detectors covering a large pool of water about the size of a football field. It is used to observe high-energy photons from the universe continuously. When many photons are observed, the scientists assert the existence of a Gamma Ray Burst. The scientists hope to discover primordial black holes or completely new phenomena by detecting Gamma Ray Bursts. The occurrences of Gamma Ray Bursts are highly variable, flaring on timescales of minutes to days. Once such a burst happens, it should be reported immediately; other telescopes can then point to that portion of the sky to confirm the new astrophysical event. The data rate of the observations is extremely high: hundreds of photons can be recorded within a second from a tiny spot in the sky\cite{Atkins00,Smith01}.

There are also many applications in data stream mining and monitoring when people are interested in discovering time intervals with unusually high numbers of events. For example:
\begin{itemize}
\item  In telecommunications, a network anomaly might be indicated if the number of packets lost within a certain time period exceeds some threshold.
\item  In finance, stocks with unusually high trading volumes attract the notice of traders (or regulators). Stocks with unusually high price fluctuations within a short time period also provide more opportunity for speculation, and are therefore watched more closely.
\end{itemize}

Intuitively, given an aggregate function $F$ (such as sum or count), the problem of interest is to discover subsequences $s$ of a time series stream such that $F(s)\gg F(s')$ for most subsequences $s'$ of size $|s|$. In the case of burst detection, the aggregate is sum. If we know the duration of the time interval, we can maintain the sum over a sliding window of that known size and sound an alarm when the moving sum exceeds a threshold. Unfortunately, in many cases we cannot predict the length of the burst period; in fact, discovering that length is part of the problem to be solved. In the above example of Gamma Ray Burst detection, a burst of photons associated with a special event might last a few milliseconds, a few hours, or even a few days. Different durations have different thresholds. A burst of 10 events within 1 second could be very interesting; at the same time, a longer burst with a lower density of events, say 50 events within 10 seconds, could be of interest too.

Suppose that we want to detect bursts in a time series of size $n$ and we are interested in all $n$ sliding window sizes. A brute-force search has to examine all sliding window sizes and starting positions. Because there are $O(n^2)$ windows, such direct computation takes $\Omega(n^2)$ time, which is too slow for many applications. Fortunately, because we are interested only in the few windows that experience bursts, it is possible to design a nearly linear time algorithm. In this paper we present a burst detection algorithm whose running time is approximately proportional to the size of the input plus the size of the output, i.e., the number of windows with bursts.

\subsection{Problem Statement}
There are two categories of time series data stream monitoring: {\em point monitoring} and {\em aggregate monitoring}. In point monitoring, only the latest data item in the stream is of interest: when the latest item falls in some predefined domain, an alarm is sounded. For example, a stock trader who places a limit sell order on Enron informs the stock price stream monitor to raise an alarm (or automatically sell the stock) once the price of the stock falls below \$10, to avoid further losses. Since only the latest data item in the stream needs to be considered, point monitoring can be implemented without much effort.

Aggregate monitoring is much more challenging. Aggregates of time series are computed based on certain time intervals (windows). There are three well-known window models that are the subjects of many research projects \cite{GGR00,GKS01,GKMS01,ZS02}.

\begin{enumerate}
\item {\bf Landmark windows:}
Aggregates are computed based on the values between a specific time point called the landmark and the present. For example, the average stock price of IBM from Jan 1st, 2003 to today is based on a landmark window.

\item {\bf Sliding windows:}
In a sliding window model, aggregates are computed based on the last $w$ values in the data stream. The size of a sliding window $w$ is predefined. For example, the running maximum stock price of IBM during the previous 5 days is based on sliding windows.

\item {\bf Damped window:}
In a damped window model the weights of data decrease exponentially into the past. For example, the damped moving average $avg_{new}$ after a new data item $x$ is inserted can be updated as follows:
$$avg_{new}=avg_{old}\cdot p+x\cdot(1-p),\quad 0<p<1$$
\end{enumerate}
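To make the damped-window update above concrete, here is a minimal Python sketch (the helper name \texttt{damped\_average} is ours, not from any library; $p$ is the decay factor from the formula):

```python
def damped_average(stream, p=0.9):
    """Exponentially damped moving average: with each new item x,
    old data are discounted by p and x contributes weight (1 - p)."""
    avg = None
    for x in stream:
        avg = x if avg is None else avg * p + x * (1 - p)
    return avg
```

Only one running value per stream is kept, which is why the damped model is the cheapest of the three to maintain.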

The sliding window model is the most widely used in data stream monitoring. Motivated by the Gamma Ray example, we generalize it to the {\em elastic window model}. In an elastic window model, the user specifies only the range of sliding window sizes, the aggregate function and the alarm domains, and is notified of all window sizes in the range whose aggregates fall in the corresponding alarm domains.

Here we give the formal definition of the problem of monitoring data stream over elastic windows.
\begin{problem}
For a time series $x_1,x_2,...,x_n$, given a set of window sizes $w_1,w_2,...,w_m$, an aggregate function $F$ and a threshold $f(w_j), j=1,2,...,m,$ associated with each window size, monitoring elastic window aggregates of the time series is to find all the subsequences of all the window sizes whose aggregates reach their window sizes' thresholds, i.e., all pairs $(i,j)$ with
\[ i \in 1..n,\ j \in 1..m,\ s.t.\ F(x[i\ ..\ i\!+\!w_j\!-\!1]) \ge f(w_j).
\]
\end{problem}

The threshold above can be estimated from the historical data or the model of the time series. Elastic burst detection is a special case of monitoring data streams on elastic windows. In elastic burst detection, the alarm domain is $[f(w_j),\infty)$. Note that it is also possible for the alarm domain to be $(-\infty, f(w_j)]$. 
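For reference, Problem 1 can be solved by direct computation with running sums. The following sketch (the function name \texttt{naive\_elastic\_monitor} is ours) is the $O(nm)$ baseline that our algorithm improves on:

```python
def naive_elastic_monitor(x, windows, f):
    """Direct computation of Problem 1: for every start i and window
    size w, test whether the sum over x[i..i+w-1] reaches the
    threshold f(w).  O(n*m) time using running sums."""
    n = len(x)
    alarms = []
    for w in windows:
        s = sum(x[:w])                    # sum of the first window
        for i in range(n - w + 1):
            if i > 0:                     # slide: add new item, drop old
                s += x[i + w - 1] - x[i - 1]
            if s >= f(w):
                alarms.append((i, w))     # burst of size w starting at i
    return alarms
```

Every window size pays a full pass over the series, which is exactly the cost the shifted wavelet tree avoids.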

\subsection{Our Contributions}
The contributions of the paper are as follows.
\begin{itemize}
\item We introduce the concept of monitoring data streams on elastic windows and show several important applications of this model.
\item We design a novel data structure, called the Shifted Wavelet Tree, for efficient elastic burst monitoring. This data structure is also applied to general aggregate monitoring and to burst detection in higher dimensions.
\item We apply our algorithm to real world data including the Milagro Gamma Ray data stream, NYSE real-time tick data and text data. Our method is up to several orders of magnitude faster than direct computation, which means that a multi-day computation can be done in a matter of minutes.
\end{itemize}

\section{Data Structure and Algorithm}
In this section, we first give some background on the wavelet data structure. In section \ref{subsec:alg} we discuss the Shifted Wavelet Tree and the algorithm for efficient elastic burst detection in an offline setting. This is extended to a streaming algorithm in section \ref{subsec:streaming}. Our algorithm will also be generalized to other problems in data stream monitoring over elastic windows in section \ref{subsec:general} and to higher dimensions in section \ref{subsec:2d}.

\subsection{Wavelet Data Structure}\label{subsec:wavelet}
In wavelet analysis, the wavelet coefficients of a time series are maintained in a hierarchical structure. Let us consider the simplest wavelet, the Haar wavelet. For simplicity of notation, suppose that the size of the time series $n$ is a power of two; this does not affect the generality of the results. The original time series makes up level $0$ of a wavelet tree. The pairwise (normalized) averages and differences of adjacent data items at level $0$ produce the wavelet coefficients at level $1$. The process is repeated on the averages at level $i$ to get the averages and differences at level $i+1$, until there is only one average and one difference at the top level. Table \ref{tabel:wavdec} shows the process of computing the Haar wavelet decomposition. The Haar wavelet coefficients are the average at the highest level and the differences at each level. From these wavelet coefficients, the original time series can be reconstructed without loss of information. Usually a few wavelet coefficients suffice to represent the trend of the time series, and they are selected as a compact representation of the original series.
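The decomposition just described can be sketched in a few lines of Python (the function name \texttt{haar\_decompose} is ours; the input length is assumed to be a power of two):

```python
import math

def haar_decompose(a):
    """Full Haar decomposition of a length-2^k series, as in Table 1:
    each level keeps pairwise normalized averages and differences,
    and the next level is built from the averages."""
    levels = []            # one (averages, differences) pair per level
    while len(a) > 1:
        avg = [(a[i] + a[i + 1]) / math.sqrt(2) for i in range(0, len(a), 2)]
        dif = [(a[i] - a[i + 1]) / math.sqrt(2) for i in range(0, len(a), 2)]
        levels.append((avg, dif))
        a = avg            # recurse on the averages
    return levels
```

The kept coefficients are the differences at every level plus the single average at the top; the averages at lower levels are only scaffolding.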

\begin{table*}
\centering
\caption{Haar Wavelet decomposition} \label{tabel:wavdec}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline
Level 3&\multicolumn{4}{c|}{$\frac{a_1+a_2+a_3+a_4+a_5+a_6+a_7+a_8}{2\sqrt 2}$}
 &\multicolumn{4}{c|}{$\frac{(a_1+a_2+a_3+a_4)-(a_5+a_6+a_7+a_8)}{2\sqrt 2}$}\\ \hline
Level 2&\multicolumn{2}{c|}{$\frac{a_1+a_2+a_3+a_4}{2}$}&\multicolumn{2}{c|}{$\frac{a_5+a_6+a_7+a_8}{2}$}
             &\multicolumn{2}{c|}{$\frac{(a_1+a_2)-(a_3+a_4)}{2}$}&\multicolumn{2}{c|}{$\frac{(a_5+a_6)-(a_7+a_8)}{2}$}
             \\ \hline
Level 1&$\frac{a_1+a_2}{\sqrt 2}$&$\frac{a_3+a_4}{\sqrt 2}$&$\frac{a_5+a_6}{\sqrt 2}$&$\frac{a_7+a_8}{\sqrt 2}$
       &$\frac{a_1-a_2}{\sqrt 2}$&$\frac{a_3-a_4}{\sqrt 2}$&$\frac{a_5-a_6}{\sqrt 2}$&$\frac{a_7-a_8}{\sqrt 2}$\\ \hline
Level 0&$a_1$&$a_2$&$a_3$&$a_4$&$a_5$&$a_6$&$a_7$&$a_8$ \\ \hline
\end{tabular}
\end{table*}

\begin{figure*}
\begin{center}
\includegraphics[width=0.45\textwidth]{wavtree}
\includegraphics[width=0.45\textwidth]{shiftree}
\end{center}
\caption{(a) Wavelet Tree (left) and (b) Shifted Wavelet Tree (right)} \label{tree}
\end{figure*}

The wavelet coefficients above can also be viewed as the aggregates of the time series at different time intervals. Figure~\ref{tree}-a shows the time interval hierarchy in the Haar wavelet decomposition. At level $i$, there are $n2^{-i}$ consecutive windows with size $2^i$. All the windows at the same level are disjoint. The aggregates that the Haar wavelet maintains are the (normalized) averages and differences. In our discussion of burst detection, the aggregate of interest is the sum. Obviously, such a wavelet tree can be constructed in $O(n)$ time.

The first few top levels of a wavelet tree yield concise multi-resolution information about the time series, which makes wavelets useful in many applications. However, for our purpose of burst detection, this data structure has a serious problem. Because the windows at the same level are non-overlapping, a window of arbitrary start position and arbitrary size might not be included in any window of the wavelet tree, except the window at the highest level, which includes everything. For example, the window consisting of the three time points in the middle, $(n/2-1,n/2,n/2+1)$, is not contained in any window of the wavelet tree except the largest one. This makes wavelets inconvenient for discovering properties of arbitrary subsequences.

\subsection{Shifted Wavelet Tree} \label{subsec:alg}

In a {\it shifted wavelet tree} (SWT) (figure~\ref{tree}-b), adjacent windows at the same level are half overlapping. In figure~\ref{tree}, we can see that the size of a SWT is approximately double that of a wavelet tree, because at each level there is an additional ``line'' of windows. These additional windows provide valuable overlapping information about the time series. They can be maintained either explicitly or implicitly: if we keep only the aggregates of a traditional wavelet data structure, the aggregates of the overlapping windows at level $i$ can be computed from the aggregates at level $i-1$.

To build a SWT, we start from the original time series and compute the pairwise aggregate (sum) of each two consecutive data items. This produces the aggregates at level $1$. Downsampling this level produces the input for the next higher level of the SWT. Downsampling is simply taking every second item in the series of aggregates; in figure \ref{tree}-b, it retains the upper row of consecutive non-overlapping windows at each level. This process is repeated until we reach the level where a single window includes every data point. Figure \ref{build_tree} gives pseudo-code to build a SWT. Like a regular wavelet tree, the SWT can be constructed in $O(n)$ time.

\begin{figure}\begin{center}
\begin{tt}
\begin{tabbing}
1 \= 1 \= 1 \= 1 \= 1 \= \kill
Given : x[1..n],n=$2^a$\\
Return: shifted wavelet tree SWT[1..a][1..]\\
\\
b=x;\\
FOR i = 1 TO a //remember $a=\log_2 n$\\
\>//merge consecutive windows and form\\
\>//level $i$ of the shifted wavelet tree\\
\>FOR j = 1 TO size(b)-1\\
\>\> SWT[i][j]=b[j]+b[j+1];\\
\>ENDFOR\\
\>//downsample, retain a non-overlapping cover\\
\>FOR j = 1 TO size(b)/2\\
\>\> b[j]=SWT[i][2*j-1];\\
\>ENDFOR\\
ENDFOR
\end{tabbing}
\end{tt}
\end{center}\caption{Algorithm to construct shifted wavelet tree} \label{build_tree}
\end{figure}
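The construction of figure \ref{build_tree} translates directly into Python (0-based indices; a sketch, not the authors' implementation):

```python
def build_swt(x):
    """Build a shifted wavelet tree of sums.  List index i-1 holds
    level i: sums of half-overlapping windows of size 2^i, sliding
    by 2^(i-1).  Downsampling keeps the non-overlapping cover as
    input for the next level."""
    swt = []
    b = list(x)
    while len(b) > 1:
        level = [b[j] + b[j + 1] for j in range(len(b) - 1)]  # overlapping pairs
        swt.append(level)
        b = level[::2]        # downsample: every second window
    return swt
```

For example, `build_swt([1, 2, 3, 4])` yields `[[3, 5, 7], [10]]`: level 1 holds the three half-overlapping size-2 sums, level 2 the single size-4 sum. Each level does $O(\mathrm{size})$ work on a geometrically shrinking input, hence $O(n)$ total.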

For a subsequence starting and ending at arbitrary positions, there is always a window in the SWT that tightly includes the subsequence as figure \ref{spot_demo} shows and the following lemma proves.

\begin{figure*}
\begin{center}
\includegraphics[width=0.80\textwidth]{spot_demo}
\end{center}
\caption{Examples of the windows that include subsequences in the shifted wavelet tree} \label{spot_demo}
\end{figure*}

\begin{lemma}\label{include}
Given a time series of length $n$ and its shifted wavelet tree, any subsequence of length $w, w \le 2^i$ is included in one of the windows at level $i+1$ of the shifted wavelet tree.
\end{lemma}

\begin{proof}
The windows at level $i+1$ of the shifted wavelet tree are:
$$[(j\!-\!1)2^i\!+\!1\ ..\ (j\!+\!1)2^i],j=1,2,...,\frac{n}{2^i}-1.$$
A subsequence with length $2^i$ starting from an arbitrary position $c$ will be included in at least one of the above windows, because
$$[c\ ..\ c\!+\!2^i\!-\!1]\subseteq [(j\!-\!1)2^i\!+\!1\ ..\ (j\!+\!1)2^i],j=\lfloor \ \frac{c-1}{2^i} \rfloor+1.$$
Any subsequence with length $w, w \le 2^i$ is included in some subsequence(s) with length $2^i$, and therefore is included in one of the windows at level $i+1$. We say that windows with size $w,2^{i-1} <w \le 2^i$, are monitored by level $i+1$ of the SWT.
\end{proof}

Because for a time series of non-negative numbers the aggregate sum is monotonically increasing with the window, the sum of the time series within a sliding window of any size is bounded by the sum of its including window in the shifted wavelet tree. This fact can be used as a filter to eliminate those subsequences whose including windows fall below the thresholds.

\begin{figure}\begin{center}
\begin{tt}
\begin{tabbing}
1 \= 1 \= 1 \= 1 \= 1 \= \kill
Given : time series x[1..n], $n=2^a$, \\
\> shifted wavelet tree SWT[1..a][1..],\\
\> window size $w$, threshold $f(w)$\\
Return: Subsequences of x with burst \\
\\
i = $\lceil \log_2w \rceil$;\\
FOR j = 1 TO size(SWT[i+1])\\
\> IF (SWT[i+1][j]>f[w]) \\
\> //possible burst in subsequence x[$(j\!-\!1)2^i\!+\!1\ ..\ (j\!+\!1)2^i$],\\
\> //first we compute the moving sums with  \\
\> //window size $2^i$ within this subsequence. \\
\>\> FOR c = $(j-1)2^i+1$ TO $j2^i $\\
\>\>\>y=sum(x[$c \ ..\ c\!+\!2^i\!-\!1$]);\\
\>\>\>IF y>f[w] \\
\>\>\>\> detailed search in x[$c \ ..\ c\!-\!1\!+\!2^i$]\\
\>\>\>ENDIF \\
\>\>ENDFOR\\
\>ENDIF\\
ENDFOR\\
\end{tabbing}
\end{tt}
\end{center}\caption{Algorithm to search for burst} \label{alg_search}\end{figure}

Figure \ref{alg_search} gives the pseudo-code for spotting potential subsequences of size $w,w\le 2^i,$ with sums above the threshold $f(w)$. The algorithm searches for bursts in two stages. First, a potential burst is detected at level $i+1$ of the SWT, corresponding to the subsequence $x[(j\!-\!1)2^i\!+\!1\ ..\ (j\!+\!1)2^i]$. In the second stage, those subsequences of size $2^i$ within $x[(j\!-\!1)2^i\!+\!1\ ..\ (j\!+\!1)2^i]$ with sums less than $f(w)$ are further eliminated. The moving sums for sliding window size $2^i$ can be reused for burst detection of other window sizes $w'\ne w, w'\le 2^i$. A {\it detailed search} is then performed on the surviving subsequences: the moving sums with window size $w$ are computed directly within each surviving subsequence and checked against the burst threshold.
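The search can be sketched in Python for a single window size $w$. For brevity, this sketch recomputes the covering-window sums of level $i+1$ directly rather than reading them from a prebuilt SWT, and it folds the intermediate size-$2^i$ filter of figure \ref{alg_search} into the detailed scan (the function name is ours):

```python
import math

def elastic_burst_search(x, w, threshold):
    """Burst search for one window size w (0-based indices).
    Stage 1: skip covering windows of size 2^(i+1), i = ceil(log2 w),
    whose sum is below the threshold (lemma 2: they cannot contain a
    size-w burst).  Stage 2: detailed size-w scan inside survivors."""
    i = math.ceil(math.log2(w))
    half = 2 ** i
    # level i+1 of the SWT: window j covers x[j*half : j*half + 2*half]
    level = [sum(x[j * half: j * half + 2 * half])
             for j in range(len(x) // half - 1)]
    alarms = set()
    for j, s in enumerate(level):
        if s < threshold:
            continue                      # filtered out, no false negatives
        lo = j * half
        hi = min(lo + 2 * half, len(x))
        for k in range(lo, hi - w + 1):   # detailed search
            if sum(x[k: k + w]) >= threshold:
                alarms.add(k)
    return sorted(alarms)
```

On quiet data almost every covering window fails the first test, so the inner scans (the only super-linear part) run roughly in proportion to the number of alarms.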

In the spirit of the original work of \cite{AFS93}, which uses a lower-bounding technique for fast time series similarity search, we have the following lemma that guarantees the correctness of our algorithm.
\begin{lemma}
The above algorithm guarantees no false negatives in elastic burst detection on a time series of non-negative numbers.
\end{lemma}
\begin{proof}

From lemma~\ref{include}, any subsequence of length $w,w\le 2^i$ is contained within a window in the SWT:
$$[c\ ..\ c+w-1] \subseteq  [c\ ..\ c+2^i-1] \subseteq [(j\!-\!1)2^i\!+\!1\ ..\ (j\!+\!1)2^i]$$
Because the sum over a time series of non-negative numbers is monotonically increasing with the window, we have 
$$\sum(x[c\ ..\ c\!+\!w\!-\!1])\le \sum(x[c\ ..\ c\!+\!2^i\!-\!1])\le \sum(x[(j\!-\!1)2^i\!+\!1\ ..\ (j\!+\!1)2^i]).$$
By eliminating sequences with lengths larger than $w$ but with sums less than $f(w)$, we do not introduce false negatives because 
$$\sum(x[(j\!-\!1)2^i\!+\!1\ ..\ (j\!+\!1)2^i])<f(w) \Rightarrow \sum(x[c\ ..\ c\!+\!w\!-\!1])<f(w). $$
\end{proof}

In most applications, the algorithm performs a detailed search seldom, usually only when there is a burst of interest. For example, suppose that the moving sum of a time series is a random variable from a normal distribution. Let the sum within a sliding window of size $w$ be $S_o(w)$ and its expectation be $S_e(w)$. We assume that $$\frac{S_o(w)-S_e(w)}{S_e(w)} \sim Norm(0,1).$$ We set the burst threshold $f(w)$ for window size $w$ such that the probability that the observed sum exceeds the threshold is less than $p$, i.e., $Pr(S_o(w)\ge f(w))\le p$. Letting $\Phi(x)$ be the normal cumulative distribution function, we have $$f(w)=S_e(w)(1- \Phi^{-1}(p)).$$ Because our algorithm monitors bursts based on windows of size $W=Tw, 1\le T < 2$, a real burst will always raise an alarm. However, an alarm may also be raised because there are more than $f(w)$ events in a window of size $W$. We call this probability the false alarm rate $p_f$. Supposing that $S_e(W)=TS_e(w)$, we have $$\frac{S_o(W)-S_e(W)}{S_e(W)} \sim Norm(0,1).$$ 

\begin{align}  
p_f&=Pr(S_o(W)\ge f(w))= Pr\Big(\frac{S_o(W)-S_e(W)}{S_e(W)}\ge \frac{f(w)-S_e(W)}{S_e(W)}\Big)\notag\\
&=\Phi \Big(-\frac{f(w)-S_e(W)}{S_e(W)}\Big)=\Phi \big(1-\frac{f(w)}{TS_e(w)}\big)\notag\\
&=\Phi \big(1-\frac{1- \Phi^{-1}(p)}{T}\big)\notag
\end{align}

The false alarm rate is very small for small $p$, the primary case of interest. For example, for $p=10^{-6}$ and $T=1.5$, $p_f$ is $0.006$. In this model, an upper bound on the false alarm rate is thus guaranteed.
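The bound is easy to evaluate numerically. This stdlib-only Python sketch (function names ours) computes $p_f=\Phi\big(1-\frac{1-\Phi^{-1}(p)}{T}\big)$, approximating the quantile $\Phi^{-1}$ by bisection:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_inv(q, lo=-10.0, hi=10.0):
    """Standard normal quantile by bisection (adequate for a check)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def false_alarm_rate(p, T):
    """p_f = Phi(1 - (1 - Phi^{-1}(p)) / T), the bound derived above."""
    return phi(1.0 - (1.0 - phi_inv(p)) / T)
```

As expected, a looser covering window (larger $T$) yields a larger false alarm rate, and $p_f$ stays small for small $p$.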
%matlab code:p=10e-6;T=2;1-normcdf((norminv(1-p)+1)/T-1)

%For example, suppose we are given a probability $p$ and a data stream whose background noise is modeled by a Poisson arrival counting process with arrival rate $\lambda$. To detect a burst we set the threshold $f(w)$ such that the probability of events within window having size $w$ is less that $p$. In other words, when there is no burst activity, the probability that the number of events within a window $w$ exceeds the threshold $f(w)$ by chance is less than $p$, $\sum_{k=f(w)}^{\infty}\frac{e^{-w\lambda}(w\lambda)^k}{k!} \le p$. Because our algorithm monitors based on windows with size $W, W\ge w$, a real burst will always result in an alarm. However, it is possible that an alarm will be raised because there are more than $f(w)$ events in a window of size $W$, even though no window of size $w$ has $f(w)$ events. The probability of such a false alarm is $p(W)=\sum_{k=f(w)}^{\infty}\frac{e^{-W\lambda}(W\lambda)^k}{k!}$. False alarm rates are very small for small $\lambda$, the primary case of interest. For example, let $w\lambda=1,2,...,10, p=0.0001,W=2w$, $p(W)$ is bounded by $0.1$. 

The time for a detailed search in a subsequence of size $W$ is $O(W)$.
The total time for all detailed searches is linear in the number of false alarms and true alarms (the output size $k$). The number of false alarms depends on the data and the setting of the thresholds, and it is approximately proportional to the output size $k$. So the total time for detailed searches is bounded by $O(k)$. Building the SWT takes $O(n)$ time, thus the total time complexity of our algorithm is approximately $O(n+k)$, linear in the total of the input and output sizes.

%In fact, the number of alarm raised can measure the quality of the threshold function. If the system is overwhelmed by alarms, that means the setting of the thresholds is too permissive. Suppose that for a time series stream of length $n$ there are $h$ subsequences that experience bursts. When we use the SWT to detect bursts, if the sum of a subsequence is not above its threshold $\delta$ while the permissive estimation based on its including window exceeds the threshold $\delta$, we call it a ``false alarm''. Note that it is not actually false alarm because we will zero into that subsequence to examine it further. Suppose that the ratio of the number of the true alarms to that of all the alarms (true and ``false'') is approximately fixed to be $p$. Our method will zero in the time series $h/p$ times. To build the SWT takes time $O(n)$, thus the total time complexity of our algorithm is $O(n+\frac{h}{p})$, which is linear in the total of the input and output size.

\subsection{Streaming Algorithm}\label{subsec:streaming}
The SWT data structure of the previous section can also support a streaming algorithm for elastic burst detection. Suppose that the set of window sizes in the elastic window model is $2^L<w_1<w_2<...<w_m\le 2^U$. For simplicity of explanation, assume that a new data item becomes available at every time unit.

Without the use of SWT, a naive implementation of elastic burst detection has to maintain the $m$ sums over the sliding windows. When a new data item becomes available, for each sliding window, the new data item is added to the sum and the corresponding expiring data item of the sliding window is subtracted from the sum. The running sums are then checked against the monitoring thresholds. This takes time $O(m)$ for each insertion of new data. The response time is one time unit if enough computing resources are available.

By comparison, a streaming algorithm based on the SWT data structure is much more efficient. For the set of window sizes $2^L<w_1<w_2<...<w_m\le 2^U$, we need to maintain levels $L+2$ through $U+1$ of the SWT, which monitor those windows. There are two methods, providing a tradeoff between throughput and response time.

\begin{itemize}
\item {\bf Online Algorithm:} The online algorithm has a response time of one time unit. In the SWT data structure, each data item is covered by two windows at each level. Whenever a new data item becomes available, we update those $2(U-L)$ window aggregates in the SWT immediately. Associated with each level there is a minimum threshold. For level $i$, the minimum threshold $\delta_i$ is the minimum of the thresholds of all the window sizes monitored by level $i$, that is, $\delta_i=\min f(w_j), 2^{i-2} < w_j\le 2^{i-1}$. If the sum in the last window at level $i$ exceeds $\delta_i$, it is possible that the sum over one of the window sizes monitored by level $i$ exceeds its threshold, so we perform a detailed search on those time intervals. Otherwise, the data stream monitor awaits further insertions. This online algorithm provides a response time of one time unit, and each insertion into the data stream requires $2(U-L)$ updates plus possible detailed searching.

\item {\bf Batch Algorithm:} The batch algorithm is lazy in updating the SWT. Recall that the aggregates at level $i$ can be computed from the aggregates at level $i-1$. If we maintain an extra level of consecutive windows of size $2^{L+1}$, the aggregates at levels $L+2$ through $U+1$ can be computed in batch. The aggregate in the last window of the extra level is updated every time unit. An aggregate of a window at an upper level of the SWT is not computed until all the data in that window are available. Once an aggregate at an upper level is updated, we also check alarms for the time intervals monitored by that level. The batch algorithm gives higher throughput but longer response time (with a guaranteed bound of about the window size) than the online algorithm, as the following lemmas state.
\end{itemize}
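The online variant can be sketched as follows: each monitored level keeps the sums of the two windows currently covering the newest item, and when a window closes, its sum is checked against that level's minimum threshold $\delta_i$ to decide whether a detailed search is needed (class and field names here are ours, not the authors'):

```python
class OnlineSWTMonitor:
    """Online algorithm sketch.  Level i has windows of size 2^i
    sliding by 2^(i-1); we keep the two windows covering the newest
    item and flag a closing window whose sum reaches deltas[i]."""
    def __init__(self, levels, deltas):
        self.deltas = deltas                     # deltas[i]: min threshold of level i
        self.sums = {i: [0, 0] for i in levels}  # [older, newer] open windows
        self.t = 0                               # items seen so far
        self.alarms = []                         # (level, end_time) for detailed search

    def insert(self, x):
        self.t += 1
        for i, pair in self.sums.items():
            size, stride = 2 ** i, 2 ** (i - 1)
            pair[0] += x
            pair[1] += x
            if self.t % stride == 0:             # the older window closes here
                if self.t >= size and pair[0] >= self.deltas[i]:
                    self.alarms.append((i, self.t))  # detailed search would go here
                pair[0], pair[1] = pair[1], 0    # rotate: open a fresh window
        return self.alarms
```

Each insertion touches two running sums per level, matching the $2(U-L)$ updates per item stated above.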

\begin{lemma}
The amortized processing time per insertion into the data stream for the batch algorithm is at most $2$. 
\end{lemma}
\begin{proof}
At level $i, L+2\le i\le U+1,$ of the SWT, every $2^{i-1}$ time units there is a window for which all the data are available. The aggregate of that window can be computed in time $O(1)$ from the aggregates at level $i-1$. Therefore the amortized update time for level $i$ is $\frac{1}{2^{i-1}}$.
The total amortized update time for all levels (including the extra level) is $$1+\sum_{i=L+2}^{U+1}\frac{1}{2^{i-1}}\le 2.$$
\end{proof}

\begin{lemma}
The burst activity of a window with size $w$ will be reported with a delay less than $2^{\lceil\log_{2}w\rceil}$.
\end{lemma}
\begin{proof}
A window with size $w, 2^{i-1}< w\le 2^i,$ is monitored by level $i+1$ of the SWT. The aggregates of windows at level $i+1$ are updated every $2^i$ time units. When the aggregates of windows at level $i+1$ are updated, the burst activity of window with size $w$ can be checked. So the response time is less than $2^i=2^{\lceil\log_{2}w\rceil}$.
\end{proof}

\subsection{Other Aggregates}\label{subsec:general}
It should be clear that, in addition to sum, the monitoring of many other aggregates over elastic windows can benefit from our data structure, as long as the following holds.
\begin{enumerate}
\item The aggregate $F$ is monotonically increasing or decreasing with respect to the window, i.e., if window $[a_1..b_1]$ is contained in window $[a_2..b_2]$, then $F(x[a_1..b_1]) \le F(x[a_2..b_2])$ (or $F(x[a_1..b_1]) \ge F(x[a_2..b_2])$) always holds.
\item The alarm domain is one sided, that is, $[threshold,\infty)$ for monotonic increasing aggregates and $(-\infty,threshold]$ for monotonic decreasing aggregates.
\end{enumerate}

The most important and widely used aggregates are all monotonic: {\em Max} and {\em Count} are monotonically increasing and {\em Min} is monotonically decreasing. Another monotonic aggregate is {\em Spread}, which measures the volatility or degree of surprise of a time series. The spread of a time series $\vec{x}$ is
$$Spread(\vec{x})=Max(\vec{x})-Min(\vec{x}).$$
Spread is monotonically increasing: the spread within a small time interval is less than or equal to that within a larger containing interval. A large spread within a small time interval is of interest in many data stream applications because it indicates that the time series has experienced a large movement.
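The monotonicity that makes the SWT filter sound for Spread is easy to illustrate in Python (a tiny sketch; the function name is ours):

```python
def spread(x):
    """Spread = Max - Min.  Growing the window can only widen
    (or keep) the max-min range, so Spread is monotonically
    increasing under window containment."""
    return max(x) - min(x)

# Containment implies ordering, so an SWT of spreads is a valid
# filter exactly as the SWT of sums is for burst detection:
x = [3, 1, 4, 1, 5, 9, 2, 6]
assert spread(x[2:4]) <= spread(x[1:5]) <= spread(x)
```

An SWT whose nodes store per-window (max, min) pairs therefore supports elastic monitoring of Spread with the same two-stage search.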

\subsection{Extension to Two Dimensions}\label{subsec:2d}
\begin{figure}
\begin{center}
\includegraphics[width=0.075\textwidth]{wavtree2d}
\includegraphics[width=0.39\textwidth]{shiftree2d}
\end{center}
\caption{(a) Wavelet Tree (left) and (b) Shifted Wavelet Tree (right)} \label{tree2d}
\end{figure}

The one-dimensional shifted wavelet tree for time series extends naturally to higher dimensions. In this section we consider the problem of discovering elastic spatial bursts using a two-dimensional shifted wavelet tree. Given an image of scattered dots, we want to find the regions of the image with unexpectedly high density. In an image of the sky with many dots representing stars, such regions might indicate galaxies or supernovas. As with elastic bursts in time series, the problem is to report the positions of (spatial) sliding windows of different sizes within which the density exceeds some predefined threshold.

The two-dimensional shifted wavelet tree is based on the two-dimensional wavelet structure. The basic wavelet structure separates a two-dimensional space into a hierarchy of windows, as shown in figure \ref{tree2d}-a. Aggregate information is computed recursively over those windows to get a compact representation of the data. Our two-dimensional shifted wavelet tree extends the wavelet tree in a fashion similar to the one-dimensional case, as demonstrated in figure \ref{tree2d}-b. At each level of the wavelet tree, in addition to the group of disjoint windows that is the same as in the wavelet tree, there are three additional groups of disjoint windows: one group offsets the original group in the horizontal direction, one in the vertical direction and the third in both directions.

Any square spatial sliding window of size $w\times w$ is included in one window of the two-dimensional SWT, and the size of that including window is at most $2w\times 2w$. Using the techniques of section \ref{subsec:alg}, burst detection based on the two-dimensional SWT can report all high-density regions efficiently.
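One convenient way to realize the four shifted window groups is a summed-area table, from which the sum of any axis-aligned square window is four lookups; the level-$i$ windows of the two-dimensional SWT are then the size-$2^i$ squares whose corners lie on the $2^{i-1}$ grid. A hedged Python sketch (helper names ours):

```python
def prefix2d(grid):
    """Summed-area table: P[r][c] = sum of grid[0:r][0:c]."""
    h, w = len(grid), len(grid[0])
    P = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            P[r + 1][c + 1] = (grid[r][c] + P[r][c + 1]
                               + P[r + 1][c] - P[r][c])
    return P

def window_sum(P, r, c, size):
    """Sum of the size x size window with top-left corner (r, c),
    in O(1) via inclusion-exclusion on the summed-area table."""
    return (P[r + size][c + size] - P[r][c + size]
            - P[r + size][c] + P[r][c])
```

With this helper, the first-stage filter over the four half-offset groups at each level costs $O(1)$ per window, just as in one dimension.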

\section{Empirical Results}
Our empirical study first demonstrates the desirability of elastic burst detection for some applications. We then study the performance of our algorithm by comparing it with a brute-force search algorithm in section \ref{subsec:exp2}.

\begin{figure}
\begin{center}
\includegraphics[width=0.49\textwidth]{japan}
\includegraphics[width=0.49\textwidth]{russia}
\includegraphics[width=0.49\textwidth]{iraq}
\end{center}
\caption{Bursts in the number of times that countries were mentioned in presidential State of the Union addresses} \label{union}
\end{figure}

\begin{figure*}
\begin{center}
\includegraphics[width=0.3\textwidth]{vis_1}
\includegraphics[width=0.3\textwidth]{vis_2}
\includegraphics[width=0.3\textwidth]{vis_3}
\end{center}
\caption{Bursts in Gamma Ray data for different sliding window sizes}
\label{vis}
\end{figure*}

\begin{figure*}
\begin{center}
\includegraphics[width=0.3\textwidth]{mapvis_1}
\includegraphics[width=0.3\textwidth]{mapvis_2}
\includegraphics[width=0.3\textwidth]{mapvis_3}
\end{center}
\caption{Bursts in population distribution data for different spatial sliding window sizes}
\label{mapvis}
\end{figure*}

\subsection{Effectiveness Study}\label{subsec:exp1}
As an emotive example, we monitor bursts of interest in countries in the presidential State of the Union addresses from 1913 to 2003. The same example was used by Kleinberg \cite{Klei02} to show the bursty structure of text streams. In figure \ref{union} we show the number of times certain countries were mentioned in the addresses. There are clearly bursts of interest in certain countries. An interesting observation is that these bursts have different durations, varying from years to decades.

The rationale behind elastic burst detection is that a single predefined sliding window size is insufficient for data stream aggregate monitoring in many applications. The same data aggregated at different time scales can give very different pictures, as figure \ref{vis} shows. In figure \ref{vis} we plot the moving sums of the number of events for about an hour's worth of Gamma Ray data, with sliding window sizes of $0.1$, $1$ and $10$ seconds respectively. For better visualization, we show only those positions with bursts. Naturally, bursts at small time scales that are extremely high also produce bursts at larger time scales. More interestingly, bursts at large time scales are not necessarily reflected at smaller time scales, because a burst at a large time scale may be composed of many consecutive ``bumps'': positions where the number of events is high, but not high enough to qualify as a burst. Therefore, by examining different time scales simultaneously, elastic burst detection gives more insight into the data stream.
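The multi-scale moving sums underlying figure \ref{vis} can be computed with a single prefix-sum pass per window size. A minimal Python sketch (illustrative only, not the system's code):

```python
import numpy as np

def moving_sums(x, window_sizes):
    """Moving sums of a count series for several sliding-window sizes.

    One prefix-sum pass makes each window size O(n): the sum of
    x[s:s+w] is ps[s+w] - ps[s]."""
    ps = np.concatenate(([0.0], np.cumsum(x, dtype=float)))
    return {w: ps[w:] - ps[:-w] for w in window_sizes}
```

Positions where the moving sum for window size $w$ exceeds the threshold $f(w)$ are the bursts plotted at that scale; a run of sub-threshold bumps can push only the larger-scale sums over their thresholds.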

We also show in figure \ref{mapvis} an example of spatial elastic bursts, using the 1990 census data for the population of the continental United States. The population on the map is aggregated in a grid of $0.2^\circ \times 0.2^\circ$ in latitude/longitude. We compute the total population within sliding spatial windows of sizes $1^\circ \times 1^\circ$, $2^\circ \times 2^\circ$ and $5^\circ \times 5^\circ$. Regions with population above the 98th percentile at each scale are highlighted. The different sizes of sliding windows reveal the distribution of high-population regions at different scales.


\subsection{Performance Study}\label{subsec:exp2}
Our experiments were performed on a 1.5GHz Pentium 4 PC with 512 MB of main memory running Windows 2000. We tested the algorithm on two different types of data sets:
\begin{itemize}
\item The Gamma Ray data set: This data set includes 12 hours of data from a small region of the sky, where Gamma Ray bursts were actually reported during that time. The data are a time series of the number of photons observed (events) every $0.1$ second. There are 19,015 events in total in this time series of length 432,000.

\item The NYSE TAQ Stock data set: This data set includes four years of tick-by-tick trading activity for IBM stock between July 1st, 1998 and July 1st, 2002. There are 5,331,145 trading records (ticks) in total. Each record contains the trading time (precise to the second), trading price and trading volume.
\end{itemize}

In the following experiments, we set the thresholds for the different window sizes as follows. We use the first few hours of Gamma Ray data and the first year of stock data as training data, respectively. For a window size $w$, we compute the aggregates on the training data with a sliding window of size $w$, yielding another time series $\vec{y}$. The threshold is set to $f(w)=avg(\vec{y})+\xi\, std(\vec{y})$, where $avg(\vec{y})$ and $std(\vec{y})$ are the average and standard deviation of $\vec{y}$ respectively. The threshold factor $\xi$ is set to $8$. The list of window sizes is $5, 10, \ldots, 5N_w$ time units, where $N_w$ is the number of windows; $N_w$ varies from $5$ to $50$. The time unit is $0.1$ seconds for the Gamma Ray data and $1$ minute for the stock data.
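To make the threshold-setting procedure concrete, the following Python sketch (ours, for illustration; the function and variable names are not from the paper) derives $f(w)$ from a training series:

```python
import numpy as np

def train_thresholds(train, window_sizes, xi=8.0):
    """Burst thresholds f(w) = avg(y) + xi * std(y), where y is the
    series of moving sums of the training data with window size w."""
    ps = np.concatenate(([0.0], np.cumsum(train, dtype=float)))
    thresholds = {}
    for w in window_sizes:
        y = ps[w:] - ps[:-w]          # moving sums with window size w
        thresholds[w] = y.mean() + xi * y.std()
    return thresholds
```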

First we compare the wall-clock processing time of elastic burst detection on the Gamma Ray data in figure \ref{exp_nw_phy}. Our algorithm based on the SWT data structure is more than ten times faster than the direct algorithm, and its advantage grows as more window sizes are examined. The processing time of our algorithm is output-dependent. This is confirmed in figure \ref{exp_na_phy}, where we examine the relationship between the processing time of our algorithm and the number of alarms. Naturally, the number of alarms increases as more window sizes are examined, and we observed that the processing time tracks the number of alarms closely. Recall that the processing time of the SWT algorithm has two parts: building the SWT and the detailed search of the potential burst regions. Building the SWT takes only 200 milliseconds for this data set, which is negligible compared to the time for the detailed search. For demonstration purposes, we intentionally, to our disadvantage, set the thresholds low and therefore got many more alarms than physicists are interested in. If alarms are scarce, as is the case in Gamma Ray burst detection, our algorithm will be even faster. In figure \ref{exp_thresh} we fix the number of windows at $25$ and vary the threshold factor $\xi$. The larger $\xi$ is, the higher the thresholds are and the fewer alarms are sounded. Because our algorithm's cost depends on the output size, the higher the thresholds, the faster it runs. In contrast, the processing time of the direct algorithm does not change.
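The two-stage procedure measured above can be sketched in Python as follows. This is an illustrative one-dimensional rendering of the filter-then-search idea of section \ref{subsec:alg}, not the measured implementation; it assumes a non-negative count series, so that the sum over an SWT window upper-bounds the sum over any subwindow it contains:

```python
import numpy as np

def swt_burst_detect(x, thresholds):
    """Elastic burst detection on a non-negative series x.
    thresholds maps each window size w to its threshold f(w).

    Stage 1: sums over half-overlapping SWT windows act as an
    upper-bound filter.  Stage 2: a detailed search runs only
    inside the SWT windows that raised an alarm."""
    n = len(x)
    ps = np.concatenate(([0.0], np.cumsum(x, dtype=float)))

    def winsum(s, w):                       # sum of x[s:s+w]
        return ps[s + w] - ps[s]

    bursts = []
    for w, thresh in thresholds.items():
        # pick the level whose shift 2^(j-1) is at least w, so that any
        # size-w window is contained in one SWT window of size 2^j
        j = 1
        while 2 ** (j - 1) < w:
            j += 1
        size, shift = 2 ** j, 2 ** (j - 1)
        for p in range(0, n, shift):
            cover = min(size, n - p)        # truncate at the series end
            if cover < w:
                continue
            if winsum(p, cover) >= thresh:          # stage 1: alarm
                for s in range(p, p + cover - w + 1):
                    if winsum(s, w) >= thresh:      # stage 2: real burst
                        bursts.append((s, w))
    return sorted(set(bursts))              # overlap causes duplicates
```

When alarms are rare, stage 2 touches only a small fraction of the series, which is why the measured running time tracks the number of alarms.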

In the next experiments, we test the elastic burst detection algorithm on the IBM stock trading volume data. Figure \ref{exp_nw_ibmbu} shows that our algorithm is up to 100 times faster than a brute-force method. We also zoom in to show the processing time for different output sizes in figure \ref{exp_na_ibmbu}.

\begin{figure}\begin{center}
\includegraphics[width=0.38\textwidth]{exp_nw_phy}
\end{center}
\caption{The processing time of elastic burst detection on Gamma Ray data for different numbers of windows}\label{exp_nw_phy}
\end{figure}

\begin{figure}\begin{center}
\includegraphics[width=0.38\textwidth]{exp_na_phy}
\end{center}
\caption{The processing time of elastic burst detection on Gamma Ray data for different output sizes}\label{exp_na_phy}
\end{figure}

\begin{figure}\begin{center}
\includegraphics[width=0.38\textwidth]{exp_thresh}
\end{center}
\caption{The processing time of elastic burst detection on Gamma Ray data for different thresholds}\label{exp_thresh}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=0.38\textwidth]{exp_nw_ibmbu}
\end{center}
\caption{The processing time of elastic burst detection on Stock data for different numbers of windows}\label{exp_nw_ibmbu}
\end{figure}

\begin{figure}\begin{center}
\includegraphics[width=0.38\textwidth]{exp_na_ibmbu}
\end{center}
\caption{The processing time of elastic burst detection on Stock data for different output sizes}
\label{exp_na_ibmbu}
\end{figure}

In addition to elastic burst detection, our SWT data structure supports monitoring other elastic aggregates too. In the following experiments, we search for large spreads in the IBM stock data. Figures \ref{exp_nw_ibmsp} and \ref{exp_na_ibmsp} confirm the performance advantage of our algorithm. Note that for the aggregates Min and Max, and thus Spread, there is no known deterministic algorithm that updates the aggregates over sliding windows incrementally in constant time, so the filtering property of the SWT data structure gains even more by avoiding unnecessary detailed searches. In this case our algorithm is up to 1,000 times faster than the direct method, reflecting the advantage of a near-linear algorithm over a quadratic one.
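The same filter applies to spread because shrinking a window can only decrease its maximum and increase its minimum, so the spread of an enclosing SWT window upper-bounds the spread of every subwindow. A minimal Python sketch (ours, for illustration only):

```python
def swt_spread_detect(x, w, thresh):
    """Report start positions of size-w sliding windows whose spread
    (max - min) reaches thresh, filtering with half-overlapping SWT
    windows before doing any detailed search."""
    n = len(x)
    # level whose shift 2^(j-1) is at least w, so each size-w window
    # lies inside one SWT window of size 2^j
    j = 1
    while 2 ** (j - 1) < w:
        j += 1
    size, shift = 2 ** j, 2 ** (j - 1)
    hits = []
    for p in range(0, n, shift):
        chunk = x[p:p + size]               # truncated at the series end
        if len(chunk) < w:
            continue
        # stage 1: the chunk's spread bounds every subwindow's spread
        if max(chunk) - min(chunk) >= thresh:
            # stage 2: detailed search inside the alarmed window
            for s in range(p, p + len(chunk) - w + 1):
                win = x[s:s + w]
                if max(win) - min(win) >= thresh:
                    hits.append(s)
    return sorted(set(hits))
```

The stage-2 scan here recomputes max and min from scratch per window, which is exactly the quadratic detailed search that the filter keeps to the alarmed regions.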

\begin{figure}
\begin{center}
\includegraphics[width=0.38\textwidth]{exp_nw_ibmsp}
\end{center}
\caption{The processing time of elastic spread detection on Stock data for different numbers of windows}\label{exp_nw_ibmsp}
\end{figure}

\begin{figure}\begin{center}
\includegraphics[width=0.38\textwidth]{exp_na_ibmsp}
\end{center}
\caption{The processing time of elastic spread detection on Stock data for different output sizes}\label{exp_na_ibmsp}
\end{figure}

\section{Related work}
There has been much recent interest in data stream mining and monitoring. An excellent survey of models and issues in data streams can be found in \cite{BBDMW02}. The sliding window is recognized as an important model for data streams. Based on the sliding window model, previous research has studied the computation of different aggregates over data streams, for example correlated aggregates \cite{GKS01}, counts and other aggregates \cite{DGIM02}, frequent itemsets and clusters \cite{GGR00}, and correlation \cite{ZS02}. The work in \cite{HSD02} studies the problem of learning models from time-changing streams without explicitly applying the sliding window model. The Aurora project \cite{CC+02} considers the systems aspects of monitoring data streams. Algorithmic issues in monitoring statistics over time series streams are addressed in StatStream \cite{ZS02}. In this paper we extend the sliding window model to the elastic sliding window model, making the choice of sliding window size automatic.

Wavelets are heavily used in the context of data management and data mining, including selectivity estimation \cite{MVW98}, approximate query processing \cite{VW99,CGRS00}, dimensionality reduction \cite{CF99} and streaming data analysis \cite{GKMS01}. However, their use in elastic burst detection is novel.

Data mining of bursty behavior has attracted more attention recently. Wang et al. \cite{WMC+02} study fast algorithms that use self-similarity to model bursty time series; such models can generate more realistic time series streams for testing data mining algorithms. Kleinberg \cite{Klei02} also discusses the problem of burst detection in data streams, but the focus of his work is on modeling and extracting structure from text streams. Our work is different in that we focus on the algorithmic issues of counting over different sliding windows.

We have extended the data structure for burst detection to high-spread detection in time series. Spread measures how surprising a stretch of a time series is. There is other work on finding surprising patterns in time series data, but the definition of surprise is application dependent and it is up to domain experts to choose the appropriate one for their application. Jagadish et al. \cite{JKM99} use optimal histograms to mine deviants in time series; in their work, deviants are points whose values differ greatly from those of surrounding points. Shahabi et al. \cite{STZ00} use a wavelet-based structure (the TSA-tree) to find both trends and surprises in large time series datasets, where a surprise is defined as a large difference between two consecutive averages of a time series. In very recent work, Keogh et al. \cite{KLC02} propose an algorithm based on suffix trees to find surprising patterns in time series databases: they learn a model from previously observed time series and declare surprising those patterns with a small chance of occurrence. By comparison, an advantage of our definition of surprise based on spread is that it is simple, intuitive, and scales to massive and streaming data.

\section{Conclusions and Future Work}
This paper introduces the concept of monitoring data streams based on an elastic window model and demonstrates the desirability of the new model. The beauty of the model is that the sliding window size is left for the system to discover rather than specified in advance. We also propose a novel data structure for efficient detection of elastic bursts and other aggregates. Experiments on real data sets show that our algorithm is faster than a brute-force algorithm by several orders of magnitude. We are currently collaborating with physicists to deploy our algorithm for online Gamma Ray burst detection. The monitoring of non-monotonic aggregates is a topic for future work.

\section{Acknowledgments} 
We are grateful to Prof. Allen I. Mincer of the Milagro Project for giving us a preliminary tutorial on astrophysics.
We also thank the Milagro collaboration for making the Gamma Ray data available to us.

\bibliographystyle{abbrv}
\bibliography{timeseries}
\end{document}
