SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12, 2008
Vancouver
Canada
Paper ID: 257
Title: Another Outlier Bites the Dust: Computing Meaningful Aggregates in Sensor Networks
Track Name: Research
Review Count: 3

Review: 1
Reviewer: Nicolas Anciaux
Email: nancia@prism.uvsq.fr
Organization: INRIA
Review:
1. Overall Rating: Accept
2. Reject due to technical incorrectness: No
3. Novelty: Medium
4. Technical Depth: High
5. Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? Yes, allowing for space limitations
6. Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? Adequate
7. If experiments were presented, would you want them checked for repeatability (as might be required in a future SIGMOD)? Yes
8. Presentation: Adequate
9. Reviewer Confidence: High
10. Name of External Reviewer (if applicable):
11. Summary of the paper's main contributions and impact (up to one paragraph): The goal of the paper is to efficiently take into account dirty and suspicious readings when processing (aggregate) queries over sensor networks. The proposed technique builds on previous work [14] in three ways: (i) it extends the notion of outliers (and witnesses); (ii) outlier detection is specified per query in the SQL statement (the "MINIMUM SUPPORT" clause gives the number of concordant nodes that must witness each monitored value, and the "CONSTRAIN TEST GROUP" clause indicates whether those witnesses must belong to the same group defined by the "GROUP BY" clause; see the illustrative sketch after this summary); (iii) the collection tree that routes the data (outlier and witness sets, and statistics) during query execution is reorganized periodically at given epochs. The proposed techniques are compatible with dirty-node detection as introduced in previous work [18]. The proposed techniques are evaluated through a simulator.
The problem studied is convincing (detecting dirty/abnormal readings is indeed crucial), and the proposed techniques address it well.
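To make the semantics of these two clauses concrete, here is a minimal, purely illustrative sketch; the query text, function name, field layout, and similarity rule are assumptions of this sketch, not the paper's actual implementation.

    # Illustrative only: hypothetical rendering of the extended query described above.
    #   SELECT AVG(temp) FROM sensors GROUP BY room
    #     MINIMUM SUPPORT 3 CONSTRAIN TEST GROUP

    def classify(reading, candidate, witnesses, min_support, constrain_to_group, is_similar):
        """Return 'valid' if at least `min_support` witnesses report a similar value,
        otherwise flag the candidate node's reading as an outlier."""
        support = 0
        for w in witnesses:
            # With CONSTRAIN TEST GROUP, only witnesses in the candidate's GROUP BY
            # group may vouch for its reading.
            if constrain_to_group and w["group"] != candidate["group"]:
                continue
            if is_similar(w["value"], reading):
                support += 1
        return "valid" if support >= min_support else "outlier"

    # Toy usage: two concordant witnesses are not enough when MINIMUM SUPPORT is 3.
    witnesses = [{"group": "room1", "value": 21.2}, {"group": "room1", "value": 21.4},
                 {"group": "room2", "value": 21.3}]
    similar = lambda a, b: abs(a - b) <= 0.5
    print(classify(21.0, {"group": "room1"}, witnesses, 3, True, similar))  # -> 'outlier'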
12. Three strong points of the paper (please number them S1, S2, S3): S1. The problem (dirty/outlier detection) and the limits of related work are nicely presented.
S2. The paper is well constructed and written, and the presentation of the solution is pedagogical and clear, beginning with a global view on an example, followed by the details of the proposed algorithms and the experiments.
S3. The study is comprehensive, considering many facets of the problem: number of transmitted messages and energy consumption, but also space/memory consumption (in particular, "compressed" forms are also used).
13. Three weak points of the paper (please number them W1, W2, W3): W1. In the experiments, the size of the internal structures is not considered (nor mentioned), which is surprising since it had been addressed in the technical analysis.
W2. The impact of tree reconstruction on the cache of statistics is not clear.
W3. Some figures are difficult (if not impossible) to read.
14. Detailed comments (please number each point): 1. Section 3.4 explains the kind of data transmitted between nodes. At this point the reader may wonder whether the technique would suffer too much from the transmission and storage of histories. This is to some extent alleviated in the following sections, but it is not considered in the experiments.
2. Section 4 introduces the first proposed algorithm, called SensibleAggr-supp. This algorithm is in charge of transmitting aggregates/witnesses/outliers among the sensor nodes. Similarity functions are used to determine whether monitored values can be considered similar. A question that may remain unclear for the reader is the proportion of values that may be validated (i.e., given support) using the same witness value, and the impact (or at least an intuition of the impact) this may have.
3. Section 4.3 deals with the management of the histories required to perform the witness test. A structure is required to store those values. It is difficult here to get a clear idea of the impact on the overall model. A better explanation of this global storage/transmission cost would be valuable.
4. The body of the algorithm is described in Section 4.4. Values are transmitted from one node to the next until a witness test can be performed (when enough values exist in the cache). Further discussion of the number of tests performed on the cache per qualified measure would be interesting.
5. Another remark is that when the tree is reorganized, the existing cache seems to lose much of its value (a significant amount of values would have to be rebuilt/retransmitted). The impact of tree reconstruction on the cache should be clarified.
6. In the experiments, the presented solution is compared with SELECT * queries (collect everything and perform the outlier verification centrally), TAG (without outlier detection), and Robust (the algorithm presented in [14]). However, the measurements are sometimes not detailed enough to understand the impact of each choice made in the algorithms (e.g., Figures 2 and 4 give the total number of transmitted bits without decomposing the cost into outliers/witnesses/statistics/results).
7. The figures are nearly impossible to read in a printed version of the file.
15. Comments for the Program Committee:
16. Is this paper a candidate for the Best Paper Award? No
17. Would author feedback be useful for this review? (If "Yes", please answer Q. 18.) No
18. List specific clarifications you seek from the authors (if you have answered "Yes" to Q. 17):

Review: 2
Reviewer: Jianzhong Li
Email: lijzh@hit.edu.cn
Organization: Harbin Institute of Technology, China
Review:
1. Overall Rating: Reject
2. Reject due to technical incorrectness: No
3. Novelty: Low
4. Technical Depth: Medium
5. Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? Yes, allowing for space limitations
6. Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? Adequate
7. If experiments were presented, would you want them checked for repeatability (as might be required in a future SIGMOD)? Yes
8. Presentation: Adequate
9. Reviewer Confidence: High
10. Name of External Reviewer (if applicable):
11. Summary of the paper's main contributions and impact (up to one paragraph): This paper investigates the problem of detecting outliers while aggregating values in sensor networks. The authors design an algorithm to detect outliers during the aggregation process that can find outliers under a flexible minimum support without significantly increasing the communication cost. In addition, the paper presents an algorithm to rebuild the aggregation tree using simple statistics from outlier detection, which can improve the effectiveness of aggregation.
12. Three strong points of the paper (please number them S1, S2, S3): S1. Improves the algorithm proposed in reference [14]. The proposed algorithm for detecting outliers during the aggregation process can deal with a flexible, user-specified minimum support. It also supports the group-by operation well.
S2. Designs another algorithm to rebuild the aggregation tree according to the statistics from outlier detection, which can decrease the communication cost of aggregation.
S3. Shows the efficiency of the proposed algorithm with a number of detailed experiments.
13. Three weak points of the paper (please number them W1, W2, W3): W1. Considering the work done in reference [14], the contribution of this paper is very limited (it merely extends the algorithm in [14] to support a flexible minimum support and adds an algorithm to rebuild the aggregation tree).
W2. There is no proof of the correctness of the outlier detection algorithm. In fact, when the user-specified minimum support is greater than 1, the algorithm is wrong, because it keeps only one of the similar values as a witness even though the number of these similar values can exceed the specified threshold.
W3. When the distribution of the sensed data changes frequently, rebuilding the aggregation tree according to the last few statistics wastes a lot of communication compared with a fixed structure.
14. Detailed comments (please number each point): (1) The main ideas of this paper come from reference [14]. The contributions of this paper are to extend the algorithm in [14] to detect outliers with a flexible minimum support and to give an algorithm to rebuild the aggregation tree. However, the extended outlier detection algorithm does not correctly handle outliers with a minimum support greater than 1. Moreover, frequently rebuilding the aggregation tree according to the last few statistics may waste too much communication. So, considering reference [14], the contribution of this paper is very limited.
(2) The outlier detection algorithm (Algorithm 1) does not correctly handle outliers with a minimum support greater than 1. For instance, in Figure 1 of the paper, suppose S10 has a relatively high reading that can be supported only by S12, S4, and S5. Since only one witness is maintained and transmitted up to the root during the aggregation process at S3, S10 is treated as an outlier by the root when the minimum support is set to 3. This case is a counterexample to the correctness of Algorithm 1 (a small sketch of this undercounting effect follows this list).
(3) The description of Algorithm 2 is really hard to understand, especially the meaning of the variable upwardDistance[p]. In my opinion, line 2 of Algorithm 2 and the last paragraph on page 7 give different meanings to upwardDistance[p]. Generally speaking, if the data from Si and Sk always support each other and their nearest common ancestor is not too far away, the algorithm rebuilds the tree so that they become descendants of that nearest ancestor. A simpler and clearer description of this idea would help.
(4) Algorithm 2 uses only the last few statistics to decide how to rebuild the aggregation tree. The algorithm must therefore frequently rebuild different tree structures as the distribution of the sensed data changes. On the other hand, rebuilding the tree costs a large amount of communication. So the benefit of adopting the optimal aggregation tree may not compensate for the cost of rebuilding it.
(5) Paragraphs 2, 3, and 4 are hard to understand.
(6) The descriptions of all the algorithms are not very clear.
(7) All the figures in Section 6 are hard to understand.
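To illustrate the undercounting described in (2), the following toy sketch shows how keeping a single witness per candidate leads the root to misclassify a supported reading; the node values, similarity rule, and the one-witness aggregation step are illustrative assumptions, not the paper's code.

    # Hypothetical illustration of comment (2): if an intermediate node forwards only
    # ONE witness per candidate reading, the root sees support 1 even though three
    # nodes (S12, S4, S5) actually agree with S10, so with MINIMUM SUPPORT 3 the
    # reading of S10 is wrongly classified as an outlier.

    MIN_SUPPORT = 3

    def aggregate_keep_one_witness(candidate_value, supporting_values):
        """Toy aggregation step: keep only one witness for the candidate (the behaviour
        the reviewer criticises), discarding the rest of the concordant readings."""
        witnesses = supporting_values[:1]          # only one witness survives
        return candidate_value, witnesses

    s10_value = 30.5                               # relatively high reading of S10
    concordant = [30.6, 30.4, 30.3]                # readings of S12, S4, S5

    value, witnesses_at_root = aggregate_keep_one_witness(s10_value, concordant)
    support_seen_by_root = len(witnesses_at_root)  # = 1, although the true support is 3

    print("classified as outlier:", support_seen_by_root < MIN_SUPPORT)  # True (wrongly)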
15. Comments for the Program Committee:
16. Is this paper a candidate for the Best Paper Award? No
17. Would author feedback be useful for this review? (If "Yes", please answer Q. 18.) No
18. List specific clarifications you seek from the authors (if you have answered "Yes" to Q. 17):

Review: 3
Reviewer: Alfredo Ferro
Email: ferro@dmi.unict.it
Organization: Catania University
Review:
1. Overall Rating: Weak Accept
2. Reject due to technical incorrectness: No
3. Novelty: Medium
4. Technical Depth: Medium
5. Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? To a limited extent
6. Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? Adequate
7. If experiments were presented, would you want them checked for repeatability (as might be required in a future SIGMOD)? No, or not applicable
8. Presentation: Very Good
9. Reviewer Confidence: High
10. Name of External Reviewer (if applicable):
11. Summary of the paper's main contributions and impact (up to one paragraph): In this paper the authors present a model for sensor networks that results from the combination of two previous works. 1) It supports a query aggregation framework with group-by predicates able to detect suspicious measurements produced by outlier nodes. 2) The tree topology in which the sensor nodes are organized is updated in order to save bandwidth.
12. Three strong points of the paper (please number them S1, S2, S3): S1) The method supports a query aggregation framework with group-by predicates able to detect suspicious measurements produced by outlier nodes.

S2) The tree topology in which the sensor nodes are organized is updated in order to save bandwidth.

S3) The paper is easy to read.
13. Three weak points of the paper (please number them W1, W2, W3): W1) The experiments show no significant improvement with respect to the previous work.

W2) It would be much more interesting to implement the simulation with a tool like ns-2.

W3) In significantly large networks (e.g., thousands of nodes), if the table F[i][j] is not sparse, the method may prove inadequate.
14. Detailed comments (please number each point): 1) The query aggregation method, by making use of a "minimum support" strategy, maintains candidate outlier nodes together with their suspicious measurements at a given time. A list of witness nodes is also maintained. A witness is a representative node selected among those detecting similar measurements and whose count satisfies the minimum support. When the measurements of two or more sensors in the same group are similar, each sensor can witness the others. Similarity is measured in several different ways using the correlation coefficient, the extended Jaccard coefficient, and regression-based approximation. Furthermore, to increase the significance of the measurements, the most recent ones obtained within a temporal window are maintained in a cache. With this strategy the measurements are aggregated into one value together with a witness node, and this pair is propagated to the root node (an illustrative sketch of such a cached witness test follows this list).
2) The collection tree guides the propagation of measurements to the root node. The authors propose a periodic reorganization of the tree topology based on simple statistics. For example, if the witness test between two nodes often succeeds, these nodes can select the same parent in the collection tree, through which they expect to find the most witnesses at relatively short distances (in number of hops). This saves bandwidth.

3) The proposed model is a combination of techniques described in two cited papers ([14] and [15] in the bibliography).

4) The paper is easy to read.

5) The novelty of the paper is the introduction of group-by predicates together with the definition of the minimum support parameter, which allows a node to be classified as an outlier. Moreover, the "collection tree" is periodically updated by means of statistical analysis.

6) The experiments show no significant improvement with respect to other models.
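As an illustrative aid to comment 1), the following minimal sketch shows a cached witness test that compares two nodes' recent histories with the correlation coefficient; the window size, similarity threshold, and function names are assumptions of this sketch, not the paper's implementation.

    from collections import deque

    WINDOW = 8            # assumed size of the temporal cache
    SIM_THRESHOLD = 0.9   # assumed similarity threshold for the witness test

    def pearson(xs, ys):
        """Plain Pearson correlation coefficient between two equal-length series."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy) if vx and vy else 0.0

    class NodeCache:
        """Keeps the last WINDOW readings of a node, as in the temporal cache above."""
        def __init__(self):
            self.history = deque(maxlen=WINDOW)
        def add(self, value):
            self.history.append(value)

    def witness_test(a, b):
        """Node b witnesses node a if their cached histories are sufficiently correlated."""
        if len(a.history) < WINDOW or len(b.history) < WINDOW:
            return False  # not enough cached values to run the test yet
        return pearson(list(a.history), list(b.history)) >= SIM_THRESHOLD

    # Toy usage: two sensors in the same group tracking a similar trend witness each other.
    s1, s2 = NodeCache(), NodeCache()
    for t in range(WINDOW):
        s1.add(20.0 + 0.5 * t)
        s2.add(20.2 + 0.5 * t)
    print(witness_test(s1, s2))  # True: s2 can act as a witness for s1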
15. Comments for the Program Committee:
16. Is this paper a candidate for the Best Paper Award? No
17. Would author feedback be useful for this review? (If "Yes", please answer Q. 18.) Yes
18. List specific clarifications you seek from the authors (if you have answered "Yes" to Q. 17): 1) It is not clear in the experiments section how the AGGREGATION values are obtained.

2) If an intermediate node takes a measurement, must it wait for its children before sending the measurement to its parent?

3) How is message propagation realized when multiple parent nodes are used?