Sliced Inverse Regression


Introduction

Regression is a popular way of studying the relationship between a response variable \,y and its explanatory variable \underline{x}, which is a \,p-dimensional vector. Several approaches come under the term regression: parametric methods, such as multiple linear regression, but also non-parametric techniques, such as local smoothing. With high-dimensional data, the number of observations needed for local smoothing methods grows exponentially with the dimension. We therefore need a tool for dimension reduction that reveals the most important directions of the data, onto which the data can be projected, in the best case, without losing any information. Sliced Inverse Regression (SIR) is such a tool. SIR uses the inverse regression curve, E(\underline{x}\,|\,y), which under certain conditions falls into the effective dimension reducing space, to perform a weighted principal component analysis, with which one identifies the effective dimension reducing directions. This article first introduces dimension reduction and how it is expressed in our model, then gives a short review of inverse regression, and afterwards brings these pieces together. After showing how to estimate the EDR-directions, it closes with an example that applies the techniques learned.

Model for Dimension Reduction

First of all, we have to set up the model under which the theoretical properties of SIR are investigated.

Model

Given a response variable \,Y and a (random) vector X \in \R^p of explanatory variables, SIR is based on the model


Y=f(\beta_1^\top X,\ldots,\beta_k^\top X,\varepsilon)\quad\quad\quad\quad\quad(1)

where \beta_1,\ldots,\beta_k are unknown projection vectors. \,k is an unknown number (the dimensionality of the space we try to reduce our data to) and, since we want to reduce the dimension, smaller than \,p. \;f is an unknown function on \R^{k+1}, as it depends only on the \,k projections and the error, and \varepsilon is the error with E[\varepsilon|X]=0 and finite variance \sigma^2. The model describes an ideal solution, where \,Y depends on X \in \R^p only through a \,k-dimensional subspace. That is, one can reduce the dimension of the explanatory variable from \,p to a smaller number \,k without losing any information.

An equivalent version of \,(1) is: the conditional distribution of \,Y given \,X depends on \,X only through the \,k-dimensional variable (\beta_1^\top X,\ldots,\beta_k^\top X). This perfectly reduced variable is as informative as the original \,X in explaining \,Y.

The unknown \,\beta_i's are called the effective dimension reducing directions (EDR-directions). The space that is spanned by these vectors is denoted the effective dimension reducing space (EDR-space).
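To make model (1) concrete, the following minimal sketch simulates data from it. The dimensions, the directions in B, the link function f and the noise level are all hypothetical choices made for illustration; none of them comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility

n, p, k = 500, 5, 2              # sample size, original dimension, reduced dimension

# Hypothetical EDR-directions beta_1, beta_2 (the columns of B).
B = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])

X = rng.standard_normal((n, p))       # explanatory variables
eps = 0.1 * rng.standard_normal(n)    # error term with mean zero

# Hypothetical link f(u_1, u_2, eps): Y depends on X only through U = X B.
U = X @ B                             # the k projections beta_i^T x for each observation
Y = U[:, 0] / (0.5 + (U[:, 1] + 1.5) ** 2) + eps

# Moving X along a direction orthogonal to beta_1 and beta_2 leaves the
# projections, and hence the distribution of Y given X, unchanged.
d = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
X_moved = X + 3.0 * d
print(np.allclose(X_moved @ B, U))    # True
```

The last lines illustrate the equivalent formulation of the model: only the \,k projections matter, so changes of \,X outside the EDR-space carry no information about \,Y.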

Some Basic Linear Algebra

To help visualize the model, here is a short review of vector spaces:

For the definition of a vector space and some further properties, I refer to the article [Linear Algebra and Gram-Schmidt Orthogonalization] or any textbook on linear algebra, and mention only the facts most important for understanding the model.

As the EDR-space is a \,k-dimensional subspace, we need to know what a subspace is. A non-empty subset U \subseteq \R^n is called a subspace of \R^n if it holds that


\underline{a},\underline{b} \in U \Rightarrow \underline{a}+\underline{b} \in U

\underline{a} \in U, \lambda \in \R \Rightarrow \lambda \underline{a} \in U


Given \underline{a}_1,\ldots,\underline{a}_r \in \R^n, the set V:=L(\underline{a}_1,\ldots,\underline{a}_r) of all linear combinations of these vectors is called a linear subspace and is therefore a vector space. One says that the vectors \underline{a}_1,\ldots,\underline{a}_r span \,V. The vectors that span a space \,V are, however, not unique. This leads us to the concept of a basis and the dimension of a vector space:

A set B=\{\underline{b}_1,\ldots,\underline{b}_r\} of linearly independent vectors of a vector space \,V is called a basis of \,V if it holds that

V=L(\underline{b}_1,\ldots,\underline{b}_r)

The dimension of \,V (\subseteq \R^n) is equal to the maximum number of linearly independent vectors in \,V. Any set of \,n linearly independent vectors of \R^n forms a basis of \R^n. The dimension of a vector space is unique, whereas the basis itself is not: several bases can span the same space. Linearly dependent vectors also span a space, but its dimension is smaller than the number of vectors; two linearly dependent nonzero vectors, for example, only give rise to the vectors lying on a straight line. As we are searching for a \,k-dimensional subspace, we are interested in finding \,k linearly independent vectors that span the \,k-dimensional subspace we want to project our data on.
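These facts can be checked numerically. The sketch below is only an illustration with hypothetical vectors in \R^3; it uses the fact that the dimension of a span equals the rank of the matrix whose columns are the spanning vectors.

```python
import numpy as np

a1 = np.array([1.0, 0.0, 0.0])
a2 = np.array([0.0, 1.0, 0.0])
a3 = a1 + 2.0 * a2                 # a linear combination of a1 and a2

# Three vectors, but only two of them are linearly independent: a plane.
dim_span = np.linalg.matrix_rank(np.column_stack([a1, a2, a3]))

# Two linearly dependent nonzero vectors span only a straight line.
dim_line = np.linalg.matrix_rank(np.column_stack([a1, 3.0 * a1]))

# A different basis of the same plane: stacking both bases does not raise the rank.
b1, b2 = a1 + a2, a1 - a2
same_space = np.linalg.matrix_rank(np.column_stack([a1, a2, b1, b2])) == 2

print(dim_span, dim_line, same_space)
```

Here (a1, a2) and (b1, b2) are two different bases of the same 2-dimensional subspace, which is why the EDR-directions can only be identified up to the space they span.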

Curse of Dimensionality

We want to reduce the dimension of the data because of the curse of dimensionality and, of course, for graphical purposes. The curse of dimensionality refers to the rapid increase in volume when more dimensions are added to a (mathematical) space. For example, consider 100 observations from the support [0,1], which cover the interval quite well, and compare them to 100 observations from the corresponding 10-dimensional unit hypercube, which are isolated points in a vast empty space. It is easy to draw inferences about the underlying properties of the data in the first case, whereas in the latter it is not. For more information about the curse of dimensionality, see http://en.wikipedia.org/wiki/Curse_of_dimensionality.
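The comparison of 100 points in one and in ten dimensions can be reproduced with a few lines. This is a minimal sketch, assuming uniformly distributed observations; the helper name mean_nearest_neighbour_distance is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility

def mean_nearest_neighbour_distance(points):
    """Average Euclidean distance from each point to its closest other point."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # ignore the distance of a point to itself
    return dist.min(axis=1).mean()

n = 100
d1 = mean_nearest_neighbour_distance(rng.uniform(size=(n, 1)))    # interval [0,1]
d10 = mean_nearest_neighbour_distance(rng.uniform(size=(n, 10)))  # unit hypercube in R^10

print(d1, d10)
```

In one dimension the nearest neighbour of a point is very close, whereas in ten dimensions the same number of points leaves each observation far away from all others, so local averages have almost nothing to average over.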

Inverse Regression

Computing the inverse regression curve (IR) means that instead of looking for

  • \,E[Y|X=x], which is a surface over \R^p (a function of the \,p-dimensional argument \,x)

we calculate

  • \,E[X|Y=y], which is a curve in \R^p (parametrized by \,y), consisting of \,p one-dimensional regressions.

The center of the inverse regression curve is located at \,E[E[X|Y]]=E[X]. Therefore, the centered inverse regression curve is

  • \,E[X|Y=y]-E[X]

which is a \,p-dimensional curve in \R^p. In what follows we will consider this centered inverse regression curve, and we will see that it lies on a \,k-dimensional subspace spanned by the \,\Sigma_{xx}\beta_i's.

But before seeing that this holds true, we look at how the inverse regression curve is computed within the SIR algorithm, which is introduced in detail later. This is the "sliced" part of SIR. We estimate the inverse regression curve by dividing the range of \,Y into \,H nonoverlapping intervals (slices) and then computing the sample mean \,\hat{m}_h of the observations of \,X in each slice. These sample means are used as a crude estimate of the IR-curve, denoted \,m(y). There are several ways to define the slices: either so that each slice contains equally many observations, or with a fixed range for each slice, so that different proportions of the \,y_i's fall into each slice. (The Algorithm section writes \,S for the number of slices and \,H_s for the slices themselves.)
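The slicing step can be sketched as follows. This is a minimal illustration of the equal-count variant on simulated data; the function name slice_means and the toy model are assumptions made here, and the slice means are taken from the centered observations so that they estimate the centered inverse regression curve.

```python
import numpy as np

def slice_means(X, y, n_slices):
    """Slice the observations by y (equal counts) and average centered X per slice."""
    order = np.argsort(y)                        # indices of the observations sorted by y
    slices = np.array_split(order, n_slices)     # index sets of the slices
    Xc = X - X.mean(axis=0)                      # center X
    means = np.array([Xc[idx].mean(axis=0) for idx in slices])
    counts = np.array([len(idx) for idx in slices])
    return means, counts

rng = np.random.default_rng(1)                   # fixed seed for reproducibility
X = rng.standard_normal((200, 3))
y = X[:, 0] + 0.1 * rng.standard_normal(200)     # hypothetical model: y depends on x_1 only

m_hat, n_h = slice_means(X, y, n_slices=5)
print(n_h)              # number of observations per slice
print(m_hat.round(2))   # one row per slice: the crude estimate of the IR-curve
```

In this toy example only the first coordinate of the slice means moves systematically with the slice, which is the pattern SIR exploits. Slices of fixed width could be built instead, for example with np.digitize on equally spaced break points.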

Inverse Regression vs. Dimension Reduction

As mentioned above, the centered inverse regression curve lies on a \,k-dimensional subspace spanned by the \,\Sigma_{xx}\beta_i's (and therefore so does, approximately, the crude estimate we compute). This is the connection between our model and inverse regression. We shall see that this is true under only one condition on the design distribution. This condition is that:

\forall\,\underline{b} \in\R^p:\quad E[\underline{b}^\top X\,|\,\beta_1^\top X=\beta_1^\top x,\ldots,\beta_k^\top X=\beta_k^\top x]=c_0+\sum_{i=1}^{k} c_i\beta_i^\top x

That is, the conditional expectation is linear in \beta_1^\top X,\ldots,\beta_k^\top X for some constants c_0,\ldots,c_k. This condition is satisfied when the distribution of \,X is elliptically symmetric (e.g. the normal distribution). This seems to be a pretty strong requirement. In practice it can help to examine the distribution of the data more closely, so that outliers can be removed or clusters can be separated before the analysis.

Given this condition and \,(1), it is indeed true that the centered inverse regression curve \,E[X|Y=y]-E[X] is contained in the linear subspace spanned by \,\Sigma_{xx}\beta_i\;(i=1,\ldots,k), where \,\Sigma_{xx}=Cov(X). The proof is provided by Duan and Li in Journal of the American Statistical Association (1991).
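As a worked instance of the linearity condition, consider the normal case. The sketch assumes X \sim N(\mu,\Sigma_{xx}) with positive definite \Sigma_{xx} and linearly independent directions collected in the matrix B=(\beta_1,\ldots,\beta_k); the symbols \mu and B are notation introduced here for illustration. The standard formula for conditional expectations in a joint normal distribution then gives the constants explicitly:

```latex
E[b^\top X \,|\, B^\top X = B^\top x]
  = b^\top \mu + b^\top \Sigma_{xx} B \,(B^\top \Sigma_{xx} B)^{-1} B^\top (x - \mu)
  = c_0 + \sum_{i=1}^{k} c_i \beta_i^\top x,
\qquad
(c_1,\ldots,c_k) = b^\top \Sigma_{xx} B \,(B^\top \Sigma_{xx} B)^{-1},
\quad
c_0 = b^\top \mu - \sum_{i=1}^{k} c_i \beta_i^\top \mu .
```

The right-hand side is linear in the \beta_i^\top x, so the condition holds for every \,b when \,X is normal.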

Estimation of the EDR-directions

Having looked at the theoretical properties, our aim is now to estimate the EDR-directions. For that purpose, we conduct a (weighted) principal component analysis of the sample means \,\hat{m}_h, after having standardized \,X to \,Z=\Sigma_{xx}^{-1/2}\{X-E(X)\}. By the result above, the IR-curve \,m_1(y)=E[Z|Y=y] lies in the space spanned by \,(\eta_1,\ldots,\eta_k), where \,\eta_i=\Sigma^{1/2}_{xx} \beta_i. (In line with the terminology introduced before, the \,\eta_i's are called the standardized effective dimension reducing directions.) As a consequence, the covariance matrix \,Cov[E[Z|Y]] is degenerate in any direction orthogonal to the \,\eta_i's. Therefore, the eigenvectors associated with the \,k largest eigenvalues of this matrix span the standardized EDR-space and are taken as the standardized EDR-directions.

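Back to PCA: we calculate the estimate for \,Cov\{m_1(y)\},

\hat{V}=n^{-1}\sum_{s=1}^S n_s \bar{z}_s \bar{z}_s^\top

where \,S is the number of slices, \,n_s the number of observations in slice \,s and \,\bar{z}_s the mean of the standardized observations in that slice. The slice proportions \,n_s/n act as weights, which is what makes this a weighted principal component analysis. We then identify the eigenvalues \hat{\lambda}_i and the eigenvectors \hat{\eta}_i of \hat{V}, which are the standardized EDR-directions. (For more details see the next section: Algorithm.) Remember that the main idea of the PC transformation is to find the most informative projections, i.e. those that maximize variance!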


Note that in some situations SIR does not find the EDR-directions. One can overcome this difficulty by considering the conditional covariance \,Cov(X|Y). The principle remains the same as before, but one investigates the IR-curve with the conditional covariance instead of the conditional expectation. For further details and an example where SIR fails, see Applied Multivariate Statistical Analysis (Härdle and Simar 2003).
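A minimal sketch of such a failure is a link that is symmetric in the projection: by symmetry \,E[Z|Y=y]=0 for all \,y, so the slice means carry no information about the direction. The simulated data, the two link functions and the helper sir_eigenvalues below are illustrative assumptions, not the example from Härdle and Simar.

```python
import numpy as np

def sir_eigenvalues(X, y, n_slices=10):
    """Eigenvalues, in descending order, of the weighted covariance of the slice means."""
    n, p = X.shape
    # Standardize X with the symmetric inverse square root of the sample covariance.
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    Z = (X - X.mean(axis=0)) @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)
    V_hat = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):   # equal-count slices
        z_bar = Z[idx].mean(axis=0)
        V_hat += len(idx) / n * np.outer(z_bar, z_bar)
    return np.linalg.eigvalsh(V_hat)[::-1]

rng = np.random.default_rng(2)             # fixed seed for reproducibility
X = rng.standard_normal((1000, 5))
noise = 0.1 * rng.standard_normal(1000)
u = X[:, 0]                                # the single projection, with beta = e_1

lam_odd = sir_eigenvalues(X, u ** 3 + noise)    # monotone link: SIR works
lam_even = sir_eigenvalues(X, u ** 2 + noise)   # symmetric link: SIR fails

print(lam_odd.round(3))
print(lam_even.round(3))
```

With the monotone link one eigenvalue clearly dominates, whereas with the symmetric link all eigenvalues stay at noise level, although \,Y depends on the same direction in both cases. The conditional covariance \,Cov(X|Y) still varies with \,y in the symmetric case, which is why it is the natural second quantity to look at.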

Algorithm

The algorithm to estimate the EDR-directions via SIR is as follows. It is taken from the textbook Applied Multivariate Statistical Analysis (Härdle and Simar 2003).


1. Let \,\Sigma_{xx} be the covariance matrix of \,X. Standardize \,X to


\,Z=\Sigma_{xx}^{-1/2}\{X-E(X)\}


(We can therefore rewrite \,(1) as

Y=f(\eta_1^\top Z,\ldots,\eta_k^\top Z,\varepsilon)

where \,\eta_i=\Sigma_{xx}^{1/2}\beta_i\quad\forall\; i. For the standardized variable \,Z it holds that \,E[Z]=0 and \,Cov(Z)=I.)


2. Divide the range of the \,y_i into \,S nonoverlapping slices \,H_s\;(s=1,\ldots,S). \,n_s is the number of observations within slice \,H_s and \,I_{H_s} the indicator function for this slice: n_s=\sum_{i=1}^n I_{H_s}(y_i)


3. Compute the mean of the \,z_i within each slice, which gives a crude estimate \,\hat{m}_1 of the inverse regression curve \,m_1:

\,\bar{z}_s=n_s^{-1}\sum_{i=1}^n z_i I_{H_s}(y_i)


4. Calculate the estimate for \,Cov\{m_1(y)\}:

\,\hat{V}=n^{-1}\sum_{s=1}^S n_s \bar{z}_s \bar{z}_s^\top


5. Identify the eigenvalues \,\hat{\lambda}_i and the eigenvectors \,\hat{\eta}_i of \,\hat{V}, which are the standardized EDR-directions.


6. Transform the standardized EDR-directions back to the original scale. The estimates for the EDR-directions are given by:


\,\hat{\beta}_i=\hat{\Sigma}_{xx}^{-1/2}\hat{\eta}_i


(which are not necessarily orthogonal)
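The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the textbook implementation: population quantities are replaced by sample estimates, the slices hold (almost) equally many observations, and the function name sir, its parameters and the simulated data are introduced here for illustration.

```python
import numpy as np

def sir(X, y, n_slices=10, k=1):
    """Return (eigenvalues in descending order, estimated EDR-directions as columns)."""
    n, p = X.shape

    # Step 1: standardize X with the sample mean and the sample covariance.
    Sigma_hat = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma_hat)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric Sigma^{-1/2}
    Z = (X - X.mean(axis=0)) @ Sigma_inv_sqrt

    # Step 2: slices H_s holding (almost) equally many observations, ordered by y.
    slices = np.array_split(np.argsort(y), n_slices)

    # Steps 3 and 4: slice means z_bar_s and their weighted covariance V_hat.
    V_hat = np.zeros((p, p))
    for idx in slices:
        z_bar = Z[idx].mean(axis=0)
        V_hat += len(idx) / n * np.outer(z_bar, z_bar)

    # Step 5: eigh returns ascending eigenvalues, so reverse to descending order.
    lam, eta = np.linalg.eigh(V_hat)
    lam, eta = lam[::-1], eta[:, ::-1]

    # Step 6: transform the first k standardized directions back to the original scale.
    beta_hat = Sigma_inv_sqrt @ eta[:, :k]
    return lam, beta_hat

# Hypothetical test case: one true direction b and a monotone link.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
b = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
y = (X @ b) ** 3 + 0.5 * rng.standard_normal(1000)

lam, beta_hat = sir(X, y, n_slices=10, k=1)
cosine = abs(beta_hat[:, 0] @ b) / (np.linalg.norm(beta_hat[:, 0]) * np.linalg.norm(b))
print(lam.round(3), round(cosine, 3))
```

The cosine compares the estimated direction with the true one up to sign and length, since an EDR-direction is only identified up to scale. The weights len(idx) / n in steps 3 and 4 are the slice proportions \,n_s/n of the weighted principal component analysis.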

Example

SIR is applied to the Boston Housing data set, which was collected by Harrison and Rubinfeld (1978). It comprises 506 observations, one for each census district of the Boston metropolitan area. For the purpose of illustration, only the variables \,X_6,X_{13} and the response variable \,X_{14} are considered, where

  • \,X_6, the average number of rooms per dwelling
  • \,X_{13}, the percentage of lower status people of the population
  • \,X_{14}, median value of homes

SIR is then used to find the EDR-directions, which here are 2-dimensional vectors. The estimated EDR-directions for our example are

\,\hat\beta_1=(-0.93476;0.35528)^\top and \,\hat\beta_2=(0.99727;0.073889)^\top


[Figure: Piconline.jpg]

The upper right panel of the figure shows a three-dimensional plot of the variables. The left panels show the response versus the estimated EDR-directions. The lower right panel shows the eigenvalues \,\hat{\lambda}_i, denoted by crosses, and their cumulative sum, denoted by circles. If the upper right plot were interactive, as it is in XploRe, you would see a spiral in the data. This structure is found well by SIR and is hinted at by the left panels.

For further examples, see again Applied Multivariate Statistical Analysis (Härdle and Simar 2003). (More examples will be added to this article at a later date.)

References

  • Sliced Inverse Regression for Dimension Reduction, Li, Journal of the American Statistical Association (1991)
  • Applied Multivariate Statistical Analysis, Härdle and Simar, Springer Verlag (2003)
  • Kurzfassung zur Vorlesung Mathematik II im Sommersemester 2005, A. Brandt
  • http://en.wikipedia.org/wiki/Curse_of_dimensionality

Comments

  • The first part of the work is very mathematical
  • The weighted principal components could have been shown more clearly in the algorithm
  • A different projection in which the spiral is more visible would also have been good