
Time Period: September 2006 – present.
 Participants: Sumit Chopra, Yann LeCun (Courant Institute/CBLL),
Trivikraman Thampy, John Leahy, Andrew Caplin (Economics Dept, NYU).
Description: We are developing a new type of relational graphical model that can be applied to "structured regression problems". A prime example of a structured regression problem is the prediction of house prices. The price of a house depends not only on the characteristics of the house itself, but also on the prices of similar houses in the neighborhood, or perhaps on hidden features of the neighborhood that influence them. Our relational regression model infers a hidden "desirability surface" from which house prices are predicted.
Sumit Chopra, Trivikraman Thampy, John Leahy, Andrew Caplin and Yann LeCun: Discovering the hidden structure of house prices with a non-parametric latent manifold model, Proc. Knowledge Discovery in Databases (KDD'07), 2007.

A majority of supervised learning algorithms process each input point independently of the others, based on the assumption that the input data is sampled independently and identically from a fixed underlying distribution. However, in a number of real-world problems, the value of the variables associated with a sample depends not only on features specific to that sample, but also on the features and variables of other samples related to it. We say that such data possesses a relational structure. Real estate prices are one such example. The price of a house is a function of its individual features, such as the number of bedrooms. In addition, the price is also influenced by features of the neighborhood in which it lies. Some of these features are measurable, such as the quality of the local schools. However, most of the features that make a particular neighborhood desirable are very difficult to measure directly, and are merely reflected in the prices of houses in that neighborhood. Hence the "desirability" of a location or neighborhood can be modeled as a latent variable that must be estimated as part of the learning process, and efficiently inferred for unseen samples.

A number of authors have recently proposed architectures and learning algorithms that make use of relational structure. The earlier techniques were based on the idea of influence propagation [1, 3, 5, 6]. Probabilistic Relational Models (PRMs) were introduced in [4, 2] as an extension of Bayesian networks to relational data. Their discriminative extensions, called Relational Markov Networks (RMNs), were later proposed in [7].
This paper introduces a general framework for prediction in relational data. An architecture is presented that allows efficient inference algorithms for continuous variables with relational dependencies. The class of models introduced is novel in several ways:

1. it pertains to relational regression problems in which the answer variable is continuous;
2. it allows inter-sample dependencies through hidden variables as well as through the answer variables;
3. it allows log-likelihood functions that are non-linear in the parameters (non-exponential family), which leads to non-convex loss functions but is considerably more flexible;
4. it eliminates the intractable partition function problem through appropriate design of the relational and non-relational factors.

The idea behind a relational factor graph is to have a single factor graph that models the entire collection of data samples and their dependencies.
The relationships between samples are captured by the factors that connect the variables associated with multiple samples. We are given a set of N training samples, each of which is described by a sample-specific feature vector X^{i} and an answer to be predicted Y^{i}. Let the collection of input variables be denoted by X = {X^{i}, i = 1 … N}, the output variables by Y = {Y^{i}, i = 1 … N}, and the latent variables by Z. The Energy-Based Relational Factor Graph (EBRFG) is defined by an energy function of the form

E(W, Z, Y, X) = E(W, Z, Y^{1}, …, Y^{N}, X^{1}, …, X^{N}),

in which W is the set of parameters to be estimated by learning.
Given a test sample feature vector X^{0}, the model is used to predict the value of the corresponding answer variable Y^{0}. One way to do this is by minimizing the energy function augmented with the test sample (X^{0}, Y^{0}):

Y^{0*} = argmin_{Y^{0}} E(W, Z, Y^{0}, …, Y^{N}, X^{0}, …, X^{N}).   (1)
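In code, this prediction step is a one-dimensional minimization of the augmented energy over the test answer variable. Below is a minimal sketch assuming a generic energy callable; the function names and argument layout are illustrative, not the authors' implementation:

    import numpy as np
    from scipy.optimize import minimize_scalar

    def predict_answer(energy, W, Z, Y_train, X_train, x_test):
        """Predict Y^{0} for a test sample by minimizing the augmented
        energy E(W, Z, Y^0, Y^1..Y^N, X^0, X^1..X^N) over Y^0 alone;
        all training answers and hidden variables are held fixed."""
        X_all = np.vstack([x_test[None, :], X_train])
        result = minimize_scalar(
            lambda y0: energy(W, Z, np.concatenate([[y0], Y_train]), X_all))
        return result.x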
For the model to be usable on new test samples without excessive work, the energy function must be carefully constructed in such a way that the addition of a new sample to its arguments does not require retraining the entire system, or re-estimating some high-dimensional hidden variables. Moreover, the parameterization must be designed in such a way that its estimation on the training samples actually results in good predictions on test samples. Training an EBRFG can be performed by minimizing the negative log conditional probability of the answer variables with respect to the parameters W. We propose an efficient training and inference algorithm for the general model.
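For context, under a Gibbs distribution over the answers this criterion takes the standard energy-based form (a generic identity written in our notation, not an equation from the paper, with the latent variables handled by minimization):

−log P(Y | X, W) = min_{Z} E(W, Z, Y, X) + log ∫ exp( −min_{Z} E(W, Z, y, X) ) dy.

The second term is the log partition function; point 4 above refers to designing the relational and non-relational factors so that this term does not require an intractable integral over all answer configurations.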
The architecture of the factor graph used for predicting the prices of real estate properties is shown in Figure 1 (top).
The price of a house is modeled as the product of two quantities: 1. its "intrinsic" price, which depends only on its individual features, and 2. the desirability of its location. A pair of factors E_{xyz}^{i} and E_{zz}^{i} is associated with every house. E_{xyz}^{i} is non-relational and captures the sample-specific dependencies. It is modeled as a parametric function with learnable parameters W_{xyz}, which are shared across all the instances of E_{xyz}^{i}. The factor E_{zz}^{i} is relational and captures the dependencies between the samples via the "hidden" variables Z^{i}. These dependencies influence the answer for a sample through the intermediary hidden variable d^{i}. The variable Z^{i} can be interpreted as the desirability of the location of the i-th house, and d^{i} can be viewed as the estimated desirability of the house, computed from the desirabilities of the houses related to it (those that lie in its vicinity). This factor is modeled as a non-parametric function: in particular, we use locally weighted linear regression, with weights given by a Gaussian kernel.
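As a concrete illustration of the relational factor, the following is a minimal sketch of a Gaussian-kernel estimate of d^{i} from the desirabilities of nearby houses. It is a kernel-smoothing simplification of the locally weighted linear regression described above, and the function name and bandwidth parameter are ours:

    import numpy as np

    def estimate_desirability(loc, neighbor_locs, neighbor_Z, bandwidth=1.0):
        """Estimate d^{i} at location `loc` as a Gaussian-kernel weighted
        average of the latent desirabilities Z of neighboring houses
        (a simplification of locally weighted linear regression)."""
        sq_dist = np.sum((neighbor_locs - loc) ** 2, axis=1)
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return np.dot(weights, neighbor_Z) / np.sum(weights)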
The model is trained by maximizing the likelihood of the training data, which is realized by minimizing the negative log-likelihood function with respect to W and Z. We show that this minimization reduces to minimizing

L(W, Z) = ∑_{i=1}^{N} ( Y^{i} − ( G(W_{xyz}, X^{i}) + H(Z_{N_i}, X^{i}) ) )^{2} + R(Z),   (2)

where G(W_{xyz}, X^{i}) is the parametric (intrinsic price) factor, H(Z_{N_i}, X^{i}) is the non-parametric estimate of desirability computed from the hidden variables of the neighbors N_i of sample i, and R(Z) is a regularizer on Z (an L_{2} regularizer in the experiments); with prices represented on a logarithmic scale, the product decomposition above becomes additive.
This is achieved by applying a type of deterministic generalized EM algorithm, which iterates through two phases. Phase 1 keeps the parameters W fixed and minimizes the loss with respect to Z. The loss is quadratic in Z, and we show that its minimization reduces to solving a large-scale sparse quadratic system, which we solve using the conjugate gradient method with an adaptive threshold. Phase 2 fixes the hidden variables Z and minimizes with respect to W. Since the parameters W are shared among the factors, this can be done using gradient descent.
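The alternating procedure can be sketched as follows. This is a schematic rendering under stated assumptions, namely a helper that assembles the sparse linear system arising from the quadratic-in-Z loss and a gradient function for the parametric part; it is not the authors' implementation:

    import numpy as np
    from scipy.sparse.linalg import cg  # conjugate gradient solver

    def train_ebrfg(assemble_system, loss_grad_W, W, Z,
                    outer_iters=50, lr=1e-3):
        """Deterministic generalized EM: alternate between an exact solve
        for Z (the loss is quadratic in Z) and gradient descent on W."""
        for _ in range(outer_iters):
            # Phase 1: W fixed. The minimizer of the quadratic loss in Z
            # satisfies a large sparse linear system A Z = b.
            A, b = assemble_system(W)       # assumed helper, builds A and b
            Z, _ = cg(A, b, x0=Z)           # conjugate gradient solve
            # Phase 2: Z fixed. Update the shared parameters W.
            for _ in range(10):             # a few gradient steps per phase
                W = W - lr * loss_grad_W(W, Z)
        return W, Z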
Inference on a new sample X^{0} involves: computing its neighboring training samples, and using the learnt values of their hidden variables Z_{N_0} to obtain an estimate of its "desirability" d^{0}; passing the house-specific features X_{h}^{0} through the learnt parametric model to obtain its "intrinsic" price; and combining the two to obtain its predicted price.
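Putting these steps together, test-time inference might look like the following sketch, reusing the hypothetical estimate_desirability above; intrinsic_price stands for the learnt parametric model:

    import numpy as np

    def predict_price(x_house, loc, train_locs, train_Z, intrinsic_price,
                      k=20, bandwidth=1.0):
        """Predict the price of an unseen house: estimate its desirability
        from the k nearest training houses' latent variables, compute the
        intrinsic price from the house-specific features, and combine the
        two (the price is modeled as their product)."""
        sq_dist = np.sum((train_locs - loc) ** 2, axis=1)
        idx = np.argsort(sq_dist)[:k]       # indices of nearest houses
        d0 = estimate_desirability(loc, train_locs[idx], train_Z[idx],
                                   bandwidth)
        return intrinsic_price(x_house) * d0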
The model was trained and tested on a challenging and diverse real-world dataset comprising 42,025 sale transactions of houses in Los Angeles county in the year 2004. Each house was described by a set of 18 house-specific attributes, such as GPS coordinates, living area, year built, and number of bedrooms. In addition, for each house, a number of neighborhood-specific attributes obtained from census tract data and school district data were also used, including attributes such as the average household income of the area and the percentage of owner-occupied homes.
The performance of the proposed model was compared with a number of standard non-relational techniques that have been used in the literature for this problem, namely nearest neighbor, locally weighted regression, linear regression, and a fully connected neural network. The EBRFG gives the best prediction accuracy by far (Table 1). In addition, we also plot the "desirabilities" learnt by our model (Figure 1, bottom). The plot shows that the model is able to learn the "desirabilities" of various areas in a way that reflects the real-world situation.
References

[1] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proc. ACM SIGMOD'98, pages 307-318, 1998.

[2] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proc. IJCAI'99, pages 1300-1309, 1999.

[3] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

[4] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proc. AAAI'98, pages 580-587, 1998.

[5] J. Neville and D. Jensen. Iterative classification in relational data. In Proc. AAAI'00 Workshop on Learning Statistical Models From Relational Data, pages 13-20, 2000.

[6] S. Slattery and T. Mitchell. Discovering test set regularities in relational domains. In Proc. ICML'00, pages 895-902, 2000.

[7] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI'02), 2002.
Table 1: Prediction accuracies of various algorithms on the test set. The absolute relative forecasting error (fe) was measured; the error fe_{i} on the i-th sample is defined as

fe_{i} = |A_{i} − Pr_{i}| / A_{i},

where A_{i} is the actual selling price and Pr_{i} is the predicted price. Three performance quantities are reported on the test set: the percentage of houses with a forecasting error of less than 5%, less than 10%, and less than 15%.
Model Class       Model                            < 5%     < 10%    < 15%
non-parametric    Nearest Neighbor                 25.41    47.44    64.72
non-parametric    Locally Weighted Regression      32.98    58.46    75.21
parametric        Linear Regression                26.58    48.11    65.12
parametric        Fully Connected Neural Network   33.79    60.55    76.47
hybrid            Relational Factor Graph          39.47    65.76    81.04
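For concreteness, the reported quantities can be computed with a few lines of code (a small utility of our own, not from the paper):

    import numpy as np

    def forecasting_error_buckets(actual, predicted):
        """Percentage of samples with absolute relative forecasting error
        fe_i = |A_i - Pr_i| / A_i below 5%, 10%, and 15%."""
        fe = np.abs(actual - predicted) / actual
        return {t: 100.0 * np.mean(fe < t) for t in (0.05, 0.10, 0.15)}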
Figure 1: (Top) A typical Energy-Based Relational Factor Graph showing the connections between three samples. The factors E_{xyz}^{i} capture the dependencies between the features of individual samples and their answer variables Y^{i}, as well as the dependence on the local latent variables d^{i}. The factors E_{zz}^{i} capture the dependencies between the hidden variables of multiple samples. Connections to these two factors may exist only from the subset of samples that are related to sample i. The energy of the factor E_{zz} is quadratic in d. (Bottom) The color-coded values of the desirability surface at the locations of the test samples. For every test sample, the estimate of its desirability is computed and color-coded according to its value: blue indicates low desirability, red high desirability.