|
SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12, 2008
Vancouver
Canada
|
Review: 1 |
Reviewer:
| Karl Aberer |
Email:
| karl.aberer@epfl.ch |
Organization:
| EPFL Lausanne |
Review:
| Question | Response |
1 | Overall Rating |
Weak Reject
|
2 | Reject due to technical incorrectness |
No
|
3 | Novelty |
Low
|
4 | Technical Depth |
Medium
|
5 | Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? |
yes, allowing for space limitations
|
6 | Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? |
not good
|
7 | If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? |
yes
|
8 | Presentation |
Not Good
|
9 | Reviewer Confidence |
High
|
10 | Name of External Reviewer (if applicable) |
Philippe Cudre-Mauroux
|
11 | Summary of the paper's main contributions and impact (up to one paragraph) |
The paper proposes a new method, based on cumulative probability distributions, to estimate data distributions in a DHT. The proposed scheme is claimed to work equally well regardless of the data distribution. Several optimizations are proposed to limit both the communication cost and the convergence latency. Simulations are presented to illustrate the validity of the approach.
|
12 | Three strong points of the paper (please number them S1,S2,S3) |
S1: The main point of the paper (distribution-free density estimation) makes sense in P2P settings and is, generally speaking, interesting.
S2: The paper proposes several optimizations to limit both the communication cost and the convergence latency of the proposed method.
S3: An analysis giving the error bound of the method is provided.
|
13 | Three weak points of the paper (please number them W1,W2,W3) |
W1: Poor presentation.
W2: Limited performance evaluation.
W3: In the end, the method proposed only performs marginally better than state-of-the-art methods.
|
14 | Detailed comments (please number each point) |
The main idea presented in the paper is, in my opinion, very interesting: sampling the output of a cumulative probability distribution function in order to reduce the sampling bias for skewed data distributions. The paper, however, suffers in my view from two main problems: its presentation does not reach SIGMOD standards, and the performance evaluation section is insufficient.
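For concreteness, the kind of distribution-free sampling summarized above resembles plain inverse-transform sampling over an empirical CDF. The sketch below is only an illustration of that general idea under a centralized view of the data; it is not the paper's distributed algorithm, and all names in it are mine.

import numpy as np

def empirical_cdf_samples(data, k, seed=None):
    # Sort once to obtain the empirical CDF: sorted_vals[i] is the
    # (i+1)/n-quantile of the data.
    sorted_vals = np.sort(np.asarray(data, dtype=float))
    n = len(sorted_vals)
    rng = np.random.default_rng(seed)
    # Draw uniformly on the CDF axis, then invert the empirical CDF.
    # This spreads samples evenly over the quantiles, so a skewed value
    # distribution does not concentrate all samples in its dense region.
    u = rng.random(k)
    idx = np.minimum((u * n).astype(int), n - 1)
    return sorted_vals[idx]

# Example: sampling from heavily skewed (Zipf) data.
zipf_data = np.random.default_rng(0).zipf(a=2.0, size=10000)
print(empirical_cdf_samples(zipf_data, k=10))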
Detailed Comments:
1. The general introduction (Section 1) is from my perspective inappropriate: the authors seem to have very limited knowledge of P2P systems, and both the applications ("load balancing", "query processing") and new systems ("data mining") that are cited there sound out of place.
2. There has been quite some research on distribution-free approximation methods (e.g., the book "Distribution-Free Statistical Methods" by J. S. Maritz). It would be good to give a more formal definition of your notion of distribution-free estimation (end of page 1). What is the exact difference between distribution-free estimation and standard nonparametric estimation? Does distribution-free estimation imply systematically smaller deviation for skewed data distributions, and if so, can you prove it?
3. The series of optimizations you propose (Sections 5 and 6) seems quite convoluted for the relatively standard problem you consider. In the end, the problem boils down to some hierarchical aggregation of the LCVs, plus interpolation and random sampling. It would be interesting to compare your approach to a more naïve approach using binary trees to sample and aggregate the LCVs in a parallel manner (a minimal sketch of such a baseline is given after these detailed comments).
4. About the indexing method: How is the amplify parameter A chosen and what is the value you used for the experiments?
5. Property 2: what do you mean by "consistent"?
6. Why is the latency of the pull method (Section 5.3) smaller? Given an efficient update scheme, the push method should have a smaller latency, as it requires fewer messages.
7. Also, why does it reduce the propagation error?
8. The discussion on the index update threshold is in my opinion inadequate. First, you should clearly define the notion of "updating message". You should differentiate the cases when a peer disconnects from the network (lots of updates on close values in the hash space), enters the network (series of inserts based on the hash values of the tuples the peer brings), or updates one of its entries.
9. Also, contrary to what you say, insertions and deletions are typically not "counteractive operators" (both result in an update on the density function, except on the very rare occasions when an insertion coincides with a deletion on the same key).
10. In the end, what is the communication cost for computing CFDF with the concurrent method given a fixed update threshold?
The experimental evaluation section is not detailed enough, and the experiments are therefore not repeatable. You argue that one cannot apply standard density estimation to P2P systems because (i) it is impossible to maintain a global index, (ii) index keys change over time, and (iii) peers join and leave the system over time. However, none of these reasons is examined in detail in the experimental evaluation.
11. In the simulation setting considered (10000 peers, each holding 20 tuples), a centralized index would be easy to build (an index holding 200000 entries, built for example using one broadcast query).
12. The method used to determine the time when the peers join / leave the network is simply not explained.
13. Basically, there is no result taking churn into account (all the methods always converge).
14. The communication cost (e.g., Figures 4 and 6) seems to grow linearly with the number of peers.
15. The number of tuples per peer (twenty) is unrealistically low.
The comparison study is extremely limited.
16. Only one alternative method is implemented (reference [26]), without giving any detail on the exact setting.
17. It is not clear what is taken into account in the number of routing hops ("time").
18. Other naïve approaches (e.g., simplistic sampling based on histograms, [3]) should be implemented and compared to the proposed method.
19. Also, you should explain in detail under which circumstances your method performs better than other methods (Figure 11 shows a slightly faster convergence than [26]; does this hold for other distributions or settings? Is the relative error in the end always comparable to the one given by [26], or can you achieve better results in highly dynamic environments?).
Finally, the presentation of the paper is well below SIGMOD standards.
20. There are too many typos, grammatical mistakes, presentation flaws, and random or somewhat disconnected sentences. The technical sections of the paper should be rewritten, and a native speaker should revise the entire paper. The performance evaluation section, in particular, should be rewritten to provide all the details necessary for a third party to repeat the experiments.
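To make point 3 more concrete, the naïve baseline I have in mind looks roughly like the sketch below. It assumes each peer exposes its LCV as a histogram over a shared set of bins, and it ignores routing cost, churn, and message loss; the function names and bin choices are my own and are not taken from the paper.

import numpy as np

def tree_aggregate_lcvs(lcvs):
    # Merge equally-binned local count vectors pairwise, level by level,
    # as in a binary aggregation tree: ceil(log2(p)) rounds for p peers.
    level = [np.asarray(v, dtype=np.int64) for v in lcvs]
    rounds = 0
    while len(level) > 1:
        merged = []
        for i in range(0, len(level) - 1, 2):
            merged.append(level[i] + level[i + 1])  # one merge = one message
        if len(level) % 2 == 1:                     # an odd peer carries over
            merged.append(level[-1])
        level = merged
        rounds += 1
    return level[0], rounds

# Example: 16 peers, each holding 20 values, histogrammed over 10 shared bins.
rng = np.random.default_rng(1)
bins = np.linspace(0.0, 1.0, 11)
peer_lcvs = [np.histogram(rng.random(20), bins=bins)[0] for _ in range(16)]
global_counts, rounds = tree_aggregate_lcvs(peer_lcvs)
print(global_counts, "aggregated in", rounds, "rounds")

Such a baseline converges in O(log p) rounds and would give a useful reference point for the communication cost and latency figures.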
|
15 | Comments for the Program Committee |
The main idea of this paper is interesting, though not revolutionary. The paper suffers from three main problems: a very poor presentation, an incomplete performance evaluation, and unimpressive results compared to state-of-the-art methods.
|
16 | Is this paper a candidate for the Best Paper Award? |
No
|
17 | Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) |
Yes
|
18 | List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17) |
- Please give a formal definition for distribution-free estimation w.r.t. nonparametric estimation techniques.
- How is the amplify parameter A chosen?
- What do you mean by "consistent" in Property 2?
- Why is the latency of the pull method (Section 5.3) smaller?
- What is an "updating message" exactly?
- What is the communication cost for computing CFDF with the concurrent method given a fixed update threshold?
- How do the peers join / leave the network in the experiments?
- What exactly is taken into account in the number of routing hops?
- Under what circumstances does your method perform better than other methods?
|
|
Review: 2 |
Reviewer:
| Y.C. Tay |
Email:
| mattyc@nus.edu.sg |
Organization:
| National University of Singapore |
Review:
| Question | Response |
1 | Overall Rating |
Accept
|
2 | Reject due to technical incorrectness |
No
|
3 | Novelty |
High
|
4 | Technical Depth |
High
|
5 | Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? |
yes, allowing for space limitations
|
6 | Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? |
not good
|
7 | If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? |
no or not applicable
|
8 | Presentation |
Adequate
|
9 | Reviewer Confidence |
High
|
10 | Name of External Reviewer (if applicable) |
|
11 | Summary of the paper's main contributions and impact (up to one paragraph) |
This is the first distribution-free method for estimating data density in a P2P network. The authors analyzed the complexity and cost/accuracy tradeoff.
|
12 | Three strong points of the paper (please number them S1,S2,S3) |
S1: First distribution-free sampling method for P2P data.
S2: O(log n) complexity (Thm 1), using Chord-like finger tables.
S3: Error bound (Thm 3) that guides choice of threshold delta.
|
13 | Three weak points of the paper (please number them W1,W2,W3) |
W1: Data must be numerical and continuous (Sec3.1/para1); peers must have non-overlapping index keys (p1/col1/para2).
W2: The paper does not say how the non-overlapping index, DCD table (Fig. 3), etc. are to be updated when peers join/leave/fail.
W3: The comparison study (Sec7.3) has just one experiment (Fig. 11).
|
14 | Detailed comments (please number each point) |
W3:
(a) According to Fig. 7, the relative GCV error is smallest under Zipf, so Fig. 11 (with Zipf) compares the algorithm's best case against the gossip approach. Even so, Global CD Sampling is not obviously better than gossip (slightly better for small #hops and slightly worse for large #hops). How would the comparison look if, say, Normal is used instead of Zipf?
(b) An additional comparison against parametric methods (e.g. Kowalczyk and Vlassis's Newscast EM, not cited) would help the reader understand the power/limitation of this paper's technique.
Other comments:
* How many PCV senders for Fig. 11?
* h in Eqn. 5 is undefined.
* p4/col2/para3 needs rewriting.
* Reference [13] needs fixing (title and Modle).
|
15 | Comments for the Program Committee |
|
16 | Is this paper a candidate for the Best Paper Award? |
No
|
17 | Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) |
No
|
18 | List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17) |
|
|
Review: 3 |
Reviewer:
| Patrick Valduriez |
Email:
| Patrick.Valduriez@inria.fr |
Organization:
| INRIA |
Review:
| Question | Response |
1 | Overall Rating |
Weak Reject
|
2 | Reject due to technical incorrectness |
No
|
3 | Novelty |
Medium
|
4 | Technical Depth |
Medium
|
5 | Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? |
yes, allowing for space limitations
|
6 | Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? |
not good
|
7 | If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? |
no or not applicable
|
8 | Presentation |
Adequate
|
9 | Reviewer Confidence |
Medium
|
10 | Name of External Reviewer (if applicable) |
|
11 | Summary of the paper's main contributions and impact (up to one paragraph) |
The paper proposes a new method, called distribution-free estimation, to estimate the global data distribution in a P2P network. The method is independent of the data distribution and makes use of sampling over the global cumulative distribution. The paper proposes algorithms to compute the global cumulative distribution and gives a theoretical analysis of their cost. A performance evaluation based on simulation shows the effectiveness and efficiency of the method.
|
12 | Three strong points of the paper (please number them S1,S2,S3) |
S1. The distribution-free data density estimation method, using an intermediate cumulative probability distribution function, is effective.
S2. The theoretical analysis of the algorithms is solid.
S3. Given the problem definition and the assumptions regarding the indexing method in a P2P context, the algorithms are effective.
|
13 | Three weak points of the paper (please number them W1,W2,W3) |
W1. The paper hesitates between two contributions: distribution-free estimation, which is a general method, and its application to a particular P2P network.
W2. The problem definition is very restricted, assuming single-attribute tuples, partitioning of the data domain into disjoint, continuous intervals, and a particular Chord-like overlay.
W3. The performance evaluation is simulation only, with no validation on arbitrary overlay networks using real or synthetic benchmarks.
|
14 | Detailed comments (please number each point) |
1. The first contribution of the paper (distribution-free estimation) is independent of the P2P context and could apply to other contexts as well. Thus, it should be compared with competing methods in more standard environments.
2. The second contribution (application to P2P) is restricted to a Chord-like network, and it is hard to see how it can be generalized to other networks, as it depends heavily on the fact that peers store index keys from disjoint, continuous intervals (in order to avoid having to look at all other peers to do distribution estimation). Also, how can data replication, which is an important feature of P2P systems, be taken into account?
3. Assuming that the method is used to do load balancing in the overlay network used in the paper, it is hard to see how index keys could be easily moved to different peers, since this would probably mean propagating moves from peer to peer to maintain the continuity of index keys.
4. There is one important missing reference which deals with a similar problem: Y. Zhu and Y. Hu. Towards efficient load balancing in structured P2P systems. IPDPS, 2004.
|
15 | Comments for the Program Committee |
|
16 | Is this paper a candidate for the Best Paper Award? |
No
|
17 | Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) |
No
|
18 | List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17) |
|
|
|