SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12 2008
Vancouver
Canada
Paper ID: 255
Title: REDUS: Finding Reducible Subspaces in High Dimensional Data
Track Name: Research
Review Count: 3

Review: 1
Reviewer: Margaret Dunham
Email: mhd@engr.smu.edu
Organization: Southern Methodist University
Review:
 QuestionResponse
1Overall Rating Accept
2Reject due to technical incorrectness No
3Novelty High
4Technical Depth High
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? yes, allowing for space limitations
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? adequate
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? yes
8Presentation Very Good
9Reviewer Confidence Medium
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) The authors propose a new algorithm to find reducible subspaces in high dimensional data.
12Three strong points of the paper (please number them S1,S2,S3) S1. The approach to find the reducible subspace is interesting and appears to be novel.

S2. Motivating examples are good.

S3. The paper is very well written and technically sound.
13Three weak points of the paper (please number them W1,W2,W3) W1. The experiments seemed somewhat simple. I would have liked to see more experiments reported on larger datasets.

W2. There are many more algorithms that could have been included in the related works section. However, the space limitations and detailed content of the rest of the paper limited this coverage.
14Detailed comments (please number each point) 1. Although I can not think of specific recommendations to improve the performance study, I had some issues with the approach. First of all, the datasets were somewhat limited in size (both in number of features and in number of data points). Secondly, the evaluation in terms of run time simply did not seem to be needed. Is run time really an important issue here?
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? Yes
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)

Review: 2
Reviewer: Eamonn Keogh
Email: eamonn@cs.ucr.edu
Organization: UC Riverside
Review:
 QuestionResponse
1Overall Rating Reject
2Reject due to technical incorrectness No
3Novelty Low
4Technical Depth Medium
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? to a limited extent
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? not good
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? no or not applicable
8Presentation Adequate
9Reviewer Confidence High
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) The authors investigate the problem of finding strong linear and nonlinear correlations hidden in feature subspaces of high dimensional data. The approach is reasonable, but there are other simple alternatives that might work as well.
12Three strong points of the paper (please number them S1,S2,S3)
13Three weak points of the paper (please number them W1,W2,W3)
14Detailed comments (please number each point) I find the experiments unsatisfying. You need to consider some reasonable strawman. I just cannot understand why you do not.

Consider the strawman which I have just coded up in matlab (14 lines of code). It is greedy forward search. It is slow of course, but on the 3 real datasets you have it terminates in less than a minute. It finds the linear correlations you find, and I suspect that I could tweak it to find the non-linear "correlations" without too much effort. Your synthetic datasets are not much bigger.

I am surprised you did not reference Elke [a], she does work on highly related stuff, with related approaches.
[a] Elke Achtert, Christian Böhm, Hans-Peter Kriegel, Peer Kröger, Ina Müller-Gorman, Arthur Zimek: Detection and Visualization of Subspace Cluster Hierarchies. DASFAA 2007: 152-163
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)

Review: 3
Reviewer: Y.C. Tay
Email: mattyc@nus.edu.sg
Organization: National University of Singapore
Review:
 QuestionResponse
1Overall Rating Weak Reject
2Reject due to technical incorrectness No
3Novelty High
4Technical Depth Medium
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? yes, allowing for space limitations
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? not good
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? no or not applicable
8Presentation Adequate
9Reviewer Confidence Medium
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) The authors decompose the problem of low-dimensional structures in high-dimensio
nal data into: (Step1) identifying the small number of dimensions that are corre
lated through each structure and (Step2) applying standard low-dimensional techn
iques to the correlated dimensions. REDUS addresses Step1 for the case where th
e structures partition the dimensions into disjoint sets.
Weak theory and experiments.
12Three strong points of the paper (please number them S1,S2,S3) S1: A neat decomposition of the structure identification problem that works with
any intrinsic dimensionality metric ID.

S2: The authors propose definitions and desirable properties, and begin to deve
lop a theory for subspace reduction (Sec. 3).

S3: REDUS scales linearly with the number of features (Figs. 6b and 7b), and wor
ks for small but real data sets (Sec6.3).
13Three weak points of the paper (please number them W1,W2,W3) W1: No evidence that REDUS will work for the motivating example.

W2: Rigor is lacking.

W3: Experimental support is weak.
14Detailed comments (please number each point) S3: The real data sets are small (200 points with 28 features; 534 points with 1
1 features; 569 points with 30 features) in comparison to the "hundreds to thous
ands" of features that motivated this work. There are no experiments on gene da
ta.

W1: All the experiments are for cases where the number of data points N is bigge
r than the number of features M. However, the motivating example for this paper
(gene expression) typically has N < < M (ref. [9,15]). The authors provide no evi
dence that REDUS works for the problem (in a sense, an underdetermined system) t
hat motivated their work.

W2: Property3.5 is not well-defined.
(i) It implicitly assumes that two non-redundant feature subspaces must have the
same number of features (m).
(ii) The "iff" requires justification: intuitively, if f_a is correlated with U
and all f_u in U are correlated with V, then f_a is correlated with V; but why s
hould the converse hold (if f_a is correlated with V and all f_u in U are correl
ated with V, then f_a is correlated with U)?
The proof of the crucial Thm4.1 relies on this converse; without justifying the
converse, all we have is a "proof by definition".

Defn3.6 and Defn3.7 refer to "the core space of Y", which is misleading since co
re spaces are not unique (Thm4.1). This non-uniqueness also renders Defn3.7 amb
iguous, unless part(2) of the definition means "where V is any core space of Y".

Does the overall reducible subspace identified by REDUS depend on how the featur
es are ordered in Algo1's first loop (lines2-7)?


W3:
How does one choose epsilon? The authors set it between 0.002 and 0.25 (Sec6.1)
, a range of over 3 orders of magnitude! What would be the precision and recall
in Table 1 for dataset 1, if epsilon=0.25 (as per dataset 2)?

How does one choose the sampling size n? Table 2b shows that, if n/N is too sma
ll, the subspaces are erroneously fragmented. On the other hand, the point dist
ribution method (Sec5.2) to identify correlation relies on the number of sample
points in (d+1)-dimensional space being less than (ell^d)N, so a big n will erro
neously merge the subspaces.

How does one choose tau? Table 2a shows a slight deviation from 0.92 will fragm
ent or merge subspaces.

Real data are noisy, and one expects these parameters to be sensitive to the noi
se level. Unfortunately, Sec6.2 does not examine the impact of noise in the syn
thetic data, and Sec6.3 does not report the parameter values used for real data.


Other comments:

Is SIGMOD the right conference for this paper?

p4: V\in\Omega_F?

p7: Let 0 < ell < 1 be a small natural number?

p8: How are data values for f_31 to f_50 generated in synthetic dataset 2?
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)