|
SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12 2008
Vancouver
Canada
|
Review: 1 |
Reviewer:
| Johann-Christoph Freytag |
Email:
| freytag@dbis.informatik.hu-berlin.de |
Organization:
| Humboldt-Universitaet zu Berlin |
Review:
| Question | Response |
1 | Overall Rating |
Weak Accept
|
2 | Reject due to technical incorrectness |
No
|
3 | Novelty |
Medium
|
4 | Technical Depth |
High
|
5 | Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? |
yes, allowing for space limitations
|
6 | Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? |
adequate
|
7 | If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? |
yes
|
8 | Presentation |
Adequate
|
9 | Reviewer Confidence |
Medium
|
10 | Name of External Reviewer (if applicable) |
Frank Huber
|
11 | Summary of the paper's main contributions and impact (up to one paragraph) |
The paper presents a novel approach to the problem of selectivity
estimation for set similarity selection queries. The authors
propose a selectivity estimation technique based on samples built a
priori. They show what the pitfalls and problems are when creating
samples dynamically or a priori, and how each of these problems
can be solved by their approach. The impact of the novel approach
in this paper is potentially better plan selection by the optimizer,
due to better selectivity estimates, and an easy integration into
existing systems.
|
12 | Three strong points of the paper (please number them S1,S2,S3) |
1. The paper has a very good introduction and motivation.
2. The new approach is easy to implement in existing systems.
3. The experiments are presented well, which is a strong point.
|
13 | Three weak points of the paper (please number them W1,W2,W3) |
1. The paper has some minor errors in the presented algorithms.
2. The presentation is sometimes confusing due to the overloading of symbols.
3. Information about the environment setup for the experiments is missing.
|
14 | Detailed comments (please number each point) |
1. The introduction motivates the problem of selectivity estimation for set similarity selection queries. It gives all necessary background on set similarity. Further, it presents a simplistic approach of taking a random sample at run time. Finally, the authors argue that it is not an easy task to do sampling in
real time. The introduction is very well written and easy to read.
2. The second section, "Essential Background", gives the reader an introduction to the TF/IDF similarity measure and the TA/NRA algorithms, and shows how all sets similar to a query can be retrieved using the inverted index. With Algorithm 1 the authors try to show how the NRA algorithm works, but the algorithm has some minor errors. First, in line 10 it is unclear what b_i = 1 means; presumably it should be b_i(s) = 1, but then line 13 would be redundant. Second, with line 14 the authors want to say that completed sets with a score larger than tau are reported and removed. However, a
completed set s does not necessarily have b_{[1,n]}(s) = 1, since not every s has an entry in each q^i, as defined in "Formally, construct one inverted list per token q^i ∈ U that consists of one pair ... per set s containing q^i." (A sketch of how such a threshold-based merge could look is given after this list.) Also, the use of r instead of s in the second forall loop is confusing and needless; line 14 also says I(q,s), which
should be I(q,r).
3. Preliminary ideas are given in Section 3. In this section the authors present their ideas and show the related problems when using dynamically computed samples and a priori computed samples. Like the introduction, this section is very well written and easy to follow.
4. In Section 4 the authors present their approaches. In the first part they introduce the K Minimum Values (KMV) approach, followed by the Fixed Percent Minimum Values (FMV) approach. Both approaches select parts of the inverted index, randomly permuted on their tokens; to achieve this permutation they use a family of hash functions (see the second sketch after this list). The authors introduce the domain of set ids as U, even though U is already assigned to the universe in Section 2, which may lead to some confusion as well. To overcome some of the problems with KMV, FMV is introduced as a slightly different variant. The end of Section 4 introduces stratification and post-
stratification.
5. Performance improvements are briefly introduced in Section 5. The presented Algorithm 4 has the same problem as Algorithm 1 in Section 2.
6. The experimental section, Section 6, gives an overview of what kinds of experiments were performed and what results were found. It lacks information about the environment setup of the experiments. However, overall it is well written and the results are comprehensible.
7. Section 7 gives an overview of related work, followed by the authors' conclusions.
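To make the point in comment 2 about Algorithm 1's completion test concrete, here is a minimal sketch (in Python, using my own names and structure such as nra_threshold_sketch, inverted_lists, and tau, not the authors' pseudocode) of a threshold-based NRA-style merge over the query's inverted lists. The point is only that a set need not appear in every list, so completion cannot be tested via b_{[1,n]}(s) = 1.

    from collections import defaultdict

    def nra_threshold_sketch(inverted_lists, tau):
        """Report all set ids whose total score over the query's lists is >= tau.

        Each inverted list is [(set_id, partial_score), ...], sorted by
        descending partial_score; partial scores are assumed non-negative.
        A set is safe to report once its lower bound (the partial scores seen
        so far) reaches tau; it need not appear in every list, which is the
        subtlety raised in comment 2.
        """
        n = len(inverted_lists)
        lower = defaultdict(float)    # partial scores accumulated per set id
        seen_in = defaultdict(set)    # lists in which each set id has been seen
        cursors = [0] * n
        frontier = [lst[0][1] if lst else 0.0 for lst in inverted_lists]

        while any(cursors[i] < len(inverted_lists[i]) for i in range(n)):
            # Round-robin: read the next entry from every non-exhausted list.
            for i, lst in enumerate(inverted_lists):
                if cursors[i] < len(lst):
                    s, p = lst[cursors[i]]
                    lower[s] += p
                    seen_in[s].add(i)
                    cursors[i] += 1
                    frontier[i] = lst[cursors[i]][1] if cursors[i] < len(lst) else 0.0
            # Early termination: stop once neither an unseen set nor any seen
            # set that is still below tau can possibly reach tau.
            if sum(frontier) < tau and all(
                lower[s] + sum(frontier[i] for i in range(n) if i not in seen_in[s]) < tau
                for s in lower if lower[s] < tau
            ):
                break

        return [s for s, score in lower.items() if score >= tau]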
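Similarly, for comment 4, the following rough sketch (again Python, with assumed names kmv_synopsis, set_ids, and k) shows how hashing set ids can simulate the random permutation behind a KMV-style sample of one inverted list. The hash function and the (k - 1)/h_(k) distinct-value estimate are the standard KMV ingredients, not necessarily the exact construction used in the paper.

    import hashlib

    def kmv_synopsis(set_ids, k):
        """Keep the k smallest hashed ids appearing in one inverted list.

        Hashing ids to (0, 1] simulates a random permutation of the id domain;
        the k retained ids form the sampled portion of the list, and
        (k - 1) / h_(k) is the usual KMV estimate of the list's number of
        distinct set ids.
        """
        def h(x):
            # SHA-1 as a stand-in for the paper's hash family, mapped to (0, 1].
            digest = hashlib.sha1(str(x).encode()).hexdigest()
            return (int(digest, 16) + 1) / float(16 ** 40)

        hashed = sorted((h(s), s) for s in set(set_ids))
        synopsis = hashed[:k]
        if not synopsis:
            return [], 0.0
        if len(synopsis) < k:
            # Fewer than k distinct ids: the synopsis is exact.
            return [s for _, s in synopsis], float(len(synopsis))
        dv_estimate = (k - 1) / synopsis[-1][0]
        return [s for _, s in synopsis], dv_estimate

Applied per token, this would give each inverted list a small sample plus a distinct-value estimate, which appears to be roughly what the sampled inverted index of Section 4 builds on.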
|
15 | Comments for the Program Committee |
|
16 | Is this paper a candidate for the Best Paper Award? |
No
|
17 | Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) |
No
|
18 | List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17) |
|
|
Review: 2 |
Reviewer:
| Davood Rafiei |
Email:
| drafiei@cs.ualberta.ca |
Organization:
| University of Alberta & Google |
Review:
| Question | Response |
1 | Overall Rating |
Weak Reject
|
2 | Reject due to technical incorrectness |
No
|
3 | Novelty |
Medium
|
4 | Technical Depth |
Low
|
5 | Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? |
to a limited extent
|
6 | Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? |
not good
|
7 | If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? |
no or not applicable
|
8 | Presentation |
Adequate
|
9 | Reviewer Confidence |
Medium
|
10 | Name of External Reviewer (if applicable) |
Fan Deng
|
11 | Summary of the paper's main contributions and impact (up to one paragraph) |
This paper studies the problem of estimating the selectivity of set-similarity
queries under TF/IDF similarity measure. To solve this problem, a few sample-based
techniques are proposed. Some theoretical and experimental results are presented.
|
12 | Three strong points of the paper (please number them S1,S2,S3) |
S1: The problem is worth studying.
S2: The presentation of the paper is ok.
|
13 | Three weak points of the paper (please number them W1,W2,W3) |
W1: The proposed techniques are not analyzed adequately.
W2: It's not clear if the techniques will work well in practice.
|
14 | Detailed comments (please number each point) |
W1: Most techniques are presented conceptually, without deeper analysis.
For example, KMV may not work well in practice in some cases. It would be
better if those cases were formally identified and discussed. If that is not
possible, more experiments need to be done to draw some conclusion. The other
methods are in the same situation. The theorems presented in the paper help to
some degree, but not enough. In my opinion, analyzing one or two techniques
deeply is better than proposing more solutions without sufficient analysis.
W2: To use the techniques in practice, one may have to know the relationship
between sample size and error. The paper does not provide this knowledge for
most of the techniques. If a theoretical analysis is not achievable, meaningful
experimental results should be provided.
Comments for possible revision:
The paper proposes and discusses some preliminary ideas (Sec. 3), which are not
expected to work well (the experimental results also confirm this).
It is not clear why they are needed in the paper. One can understand the other methods
(KMV, FMV, ...) well without Sec. 3. The same question applies to KMV: if it will not work well in practice, why present it in the paper? By removing those negative results, the authors would have more space for the more promising techniques proposed in Sec. 4, which, in my opinion, are not sufficiently analyzed analytically/theoretically.
|
15 | Comments for the Program Committee |
|
16 | Is this paper a candidate for the Best Paper Award? |
No
|
17 | Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) |
No
|
18 | List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17) |
|
|
Review: 3 |
Reviewer:
| Kyuseok Shim |
Email:
| shim@ee.snu.ac.kr |
Organization:
| Seoul National University |
Review:
| Question | Response |
1 | Overall Rating |
Weak Accept
|
2 | Reject due to technical incorrectness |
No
|
3 | Novelty |
Low
|
4 | Technical Depth |
Low
|
5 | Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? |
yes, allowing for space limitations
|
6 | Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? |
adequate
|
7 | If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? |
yes
|
8 | Presentation |
Adequate
|
9 | Reviewer Confidence |
Medium
|
10 | Name of External Reviewer (if applicable) |
|
11 | Summary of the paper's main contributions and impact (up to one paragraph) |
This paper studies the problem of selectivity estimation for set
similarity selection queries. The main contribution of the paper
is the sampled inverted index based on the recently proposed DV synopses [5].
Although the ideas beyond that seem rather
straightforward, the paper considers practical issues well and proposes
reasonable solutions. The experimental results seem okay.
|
12 | Three strong points of the paper (please number them S1,S2,S3) |
S1: This is the first paper on this issue, which is very timely.
S2: The pitfalls of the alternatives are well considered, with appropriate use
of related work.
S3: KMV and FMV can easily be integrated with standard merge techniques
like TA.
|
13 | Three weak points of the paper (please number them W1,W2,W3) |
W1: The main idea and the results come from other papers. Applying the
idea to the inverted index, and adding other heuristics like FMV, is less
significant and rather straightforward.
W2: The performance benefit (runtime) of FMV over NRA (actual query
processing) is not good enough.
|
14 | Detailed comments (please number each point) |
D1: The problem is defined with set semantics in the introduction;
multiset semantics would be correct.
D2: I am not totally convinced that post-stratification is very
practical. It might work for some data sets, but we could easily devise a
situation where the exponential size of the lattice matters. A more
principled algorithm is desirable.
D3: Use consistent notation. For example, q = {q^1, ..., q^n} is
defined as the query, but q^i is also used throughout to denote the
inverted list of q^i. What do you mean by |q^1 ∪ ... ∪ q^n|? Is q^i there
a query term or just an inverted list?
D4: 'Universe' is also used ambiguously. In Section 2 it is the domain
of elements, but in Section 3.1 the meaning is different.
D5: Shouldn't 'universe size |U|' be '|D|' in
'provided that the universe size |U| is known in advance' in 3.2?
D6: The writing of Section 4.3 needs to be improved; the notation is confusing.
D7: Many heuristics have been developed to efficiently merge
inverted lists, including [27]. Appropriate references, analysis, and
comparison are required in Section 5.
|
15 | Comments for the Program Committee |
|
16 | Is this paper a candidate for the Best Paper Award? |
No
|
17 | Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) |
No
|
18 | List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17) |
|
|
|