SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12 2008
Vancouver
Canada
Paper ID: 233
Title: The FIA: a new automaton for mining frequent itemsets in Data Streams
Track Name: Research
Review Count: 3

Review: 1
Reviewer: Vasant Dhar
Email: vdhar@stern.nyu.edu
Organization: New York University
Review:
 QuestionResponse
1Overall Rating Accept
2Reject due to technical incorrectness No
3Novelty Medium
4Technical Depth High
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? yes, allowing for space limitations
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? adequate
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? yes
8Presentation Adequate
9Reviewer Confidence Medium
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) Authors present an approach to mining frequent itemsets in a datastream, which is a problem of increasing importance. The authors make a good case for a rich representation for enabling updates, and for using a sound statistical basis for determining the relevance of the itemsets.
12Three strong points of the paper (please number them S1,S2,S3) S1. Good articulation of the problem and of the state of the art.
S2. Good presentation of their model.
S3. Adequate empirical results.
13Three weak points of the paper (please number them W1,W2,W3) W1. Uneven writing. It appears like section 5.3 onwards is written by a different authors and all kinds of writing errors and clumsiness creep in!
W2. The dataset used for experiments isn't described, nor is the context, making the interpretation of this section a little opaque unless the reader invests time on familiarizing him/her self with that data.
W3. Better empirical validation with other datasets would greatly enhance the credibility of the model.
14Detailed comments (please number each point) I thought this was a good paper, but the writing became a little uneven towards the end. As stated above, the dataset used for experiments isn't described, nor is the context, making the interpretation of this section a little opaque unless the reader invests time on familiarizing him/her self with that data. I think that better empirical validation with other datasets would greatly enhance the credibility of the model.
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)

Review: 2
Reviewer: Rajeev Rastogi
Email: rastogi@bell-labs.com
Organization: Bell Labs Research, India
Review:
 QuestionResponse
1Overall Rating Weak Reject
2Reject due to technical incorrectness No
3Novelty Low
4Technical Depth Low
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? to a limited extent
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? adequate
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? yes
8Presentation Not Good
9Reviewer Confidence High
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) The paper presents a new automaton structure called FIA to detect frequent itemsets in data streams. The FIA has a DAG structure and is more compact/informative.
12Three strong points of the paper (please number them S1,S2,S3) S1: The FIA is an elegant structure and is much more compact than previous data structures like the FP-Tree, etc. In this respect, the paper has presented an innovative structure for representing frequent itemsets as an automaton that leads to minimal redundancy.

13Three weak points of the paper (please number them W1,W2,W3) W1: The topic of the paper is not that interesting. Frequest itemset mining was a topic that was introduced almost 15 years ago, and there has been a considerable amount of work on the topic. So yet another paper on mining frequent itemsets may be of limited interest to conference attendees.

W2: The authors apply the results of Toivonen (reference [19]) to prune states from the FIA that have support greater than theta-epsilon. But these results cannot be applied to our stream setting since such pruning only applies to a sample of the entire data stream, and a stream prefix is not really a sample of the stream.

W3: The Fiasco algorithm for constructing the FIA is very poorly described. Is it identical to previous algorithms like Apriori or FP-Growth? In that case, what is the algorithmic contribution of this paper. Even for the case of computing the FIA over a stream, the authors the algorithm of Hoshino et al. [12].

W4: Some analytical results showing the compactness of the FIA compared to the FP-Tree would be nice. This would help to establish that the FIA is provably more compact compared to the FP-Tree.
14Detailed comments (please number each point) See strengths and weaknesses mentioned above.
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)

Review: 3
Reviewer: Karine Zeitouni
Email: Karine.Zeitouni@prism.uvsq.fr
Organization: University of Versailles-Saint-Quentin
Review:
 QuestionResponse
1Overall Rating Weak Reject
2Reject due to technical incorrectness No
3Novelty Low
4Technical Depth High
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? to a limited extent
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? not good
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? no or not applicable
8Presentation Adequate
9Reviewer Confidence Medium
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) This paper proposes a new data structure, FIA standing for Frequent Itemset Automaton, and an algorithm for mining frequent itemsets in data streams. The authors demonstrate some interesting features of the FIA that is (i) informative: indexes frequent (as well as closed) itemset ; (ii) adapts to incremental updates while maximizing the recall. This approach is based on an existing paper in the context of text mining and many algorithms exist for data streams. The only comparison in the experimentation is made with FP-growth and A-priori in static database (no comparison in the context of data streams or with the most recent approaches).
12Three strong points of the paper (please number them S1,S2,S3) S1: the clarity of the proposed approach
S2: the formalism
S3: the use of real datasets.
13Three weak points of the paper (please number them W1,W2,W3) W1: the lack of originality (using an automaton is not sufficient)
W2: the experiments are incomplete: evaluate different cases and datasets to show when your algorithm performs better than other.
W3: a large part of the paper and the experiments is given to the static database case, not to data streams.
14Detailed comments (please number each point) C1: The main objectives and contributions of this paper remain unclear. You should focus on data stream mining, otherwise the title has to be changed.

C2: The section "Problem definition" does not state any new problem, but introduces the notation for well known concepts. It is better to transfer it to the preliminaries section.
In the "previeus work" section , I suggest to reference the recent book:
C. Aggarwal, Data Streams: Models and Algorithms, Ed. Charu Aggarwal, Springer, 2007.

C3: Your paper does not discuss the limitations of the previous algorithms: those allowing false positives seem similar to your work, since you also need to adjust the epsilon value. By the way, the impact of this parameter does not appear in the experimentation.

C4: In section 5.1, it should be useful to provide some illustration examples. You may refer to the examples already included in section 5.2 after the third and the fourth definition.

C5: There is a contradiction in page 5 about the definition of DS*: in the top, you refer to infrequent itemsets, while in Definition 5, it becomes frequent itemsets.

C6: The comparison of the number of nodes in FP-tree and the number of states in FIA (in figure 10) is not relevant, since the required sizes of each is very different. Actually, this is confirmed by the non correlation with the total memory size in figure 9.

C7: The experimentation should rather focus on the data stream case and compare with the algorithms mentioned in section 3. It should study deeply the trade-off between the recall and the FIA size.
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)