Independent Component Analysis

Motivation

Figure 1: The cocktail party problem. The original signals s(t) are recorded as x(t) and then estimated using ICA.

The goal of independent component analysis (ICA) is to find a linear representation of nongaussian data that captures, for example, the essential structure of the data. A famous example is the cocktail party problem, in which n signals are recorded with n microphones. The objective is to reconstruct the original signals s(t) = (s_1(t), ..., s_n(t)) from the recorded signals x(t) = (x_1(t), ..., x_n(t)) given a mixing matrix A:
x(t) = A * s(t).
For illustration, consider the signals in the figure. The original wave signals on top are recorded with the microphones and result in the signals on the bottom. The problem is to recover the original signals using only the recorded signals. In fact, given the mixing matrix A the problem would be solvable with classical methods. Without the mixing matrix, however, the problem becomes more difficult. Independent component analysis is one technique to estimate the mixing matrix in order to separate the original source signals.

Method

Definition

The ICA model can be formulated with the n-dimensional observations x(t), the independent components (ICs) s(t) and the nonsingular transformation matrix W = A^{-1} as  
s(t) = W * x(t).
This statistical model is called independent component analysis and describes how the observed data are generated by a process of mixing the components s_i. In this setup the independent components are assumed to be unobservable and the mixing matrix is unknown. Only the random vector x(t) is observed, and both A and s(t) have to be estimated from it.
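As a small illustration of the model (a minimal sketch with made-up source signals and a made-up 2x2 mixing matrix, not part of the original study), the mixtures x(t) = A * s(t) can be generated and, if A were known, inverted exactly:

  # Toy illustration of x(t) = A * s(t) and its inversion s(t) = W * x(t);
  # the sources and the mixing matrix A are hypothetical examples.
  import numpy as np

  t = np.linspace(0, 1, 1000)
  s = np.vstack([np.sin(2 * np.pi * 5 * t),            # source 1: sine wave
                 np.sign(np.sin(2 * np.pi * 3 * t))])   # source 2: square wave

  A = np.array([[1.0, 0.5],                             # mixing matrix (unknown in practice)
                [0.3, 1.0]])
  x = A @ s                                             # observed mixtures x(t)

  W = np.linalg.inv(A)                                  # with A known, W = A^{-1} recovers s(t)
  s_hat = W @ x
  print(np.allclose(s, s_hat))                          # True: exact recovery

ICA addresses the realistic case in which only x(t) is available and W has to be estimated from the data itself.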

Properties

Figure 2: Examples of non-gaussian and gaussian distributions of two variables x1 and x2.

From the model, several properties can be derived. First, the variances of the ICs are not identifiable, since both s(t) and A are unknown: any scalar multiplier in one of the sources s_j could always be cancelled by dividing the corresponding column a_j of A by the same scalar. Therefore we can fix the magnitudes of the ICs; as they are random variables, the most natural way is to assume unit variance, E[s_i^2] = 1. This still leaves the ambiguity of the sign, which fortunately is insignificant in most applications. Secondly, the order of the ICs is not determinable, since a permutation matrix P can be substituted into the model: x(t) = A P^{-1} P s(t). Thirdly, gaussian variables are forbidden, since any orthogonal transformation of independent gaussian variables has exactly the same joint distribution, so the mixing matrix cannot be identified.

For illustration consider the distributions displayed in the figure. The non-gaussian distribution of two variables x1 and x2 contains information on the directions of the columns of the mixing matrix A: the columns of the mixing matrix could for example be located along the edges of the joint density of the variables. The gaussian distribution of two variables x1 and x2 on the bottom, however, does not contain any information on the directions of the columns of the mixing matrix.

Maximizing nongaussianity

Assume that maximizing nongaussianity leads to the independent components (ICs). Then the landscape of nongaussianity could be searched in the n-dimensional space of the recorded signal. This space would have 2n local maxima, two for each IC. Measures of nongaussianity could be kurtosis or negentropy. Kurtosis is calculated as the fourth-order cumulant  kurt(y) = E[y^4] - 3(E[y^2])^2 . However, Huber (1985) showed that kurtosis is very sensitive to outliers and therefore not a robust measure of nongaussianity.
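A small sketch of the sample version of this measure (the test distributions below are illustrative, not from the study):

  # Sample estimate of kurt(y) = E[y^4] - 3 (E[y^2])^2 for a zero-mean signal
  import numpy as np

  def kurtosis(y):
      y = y - y.mean()                              # center the signal first
      return np.mean(y**4) - 3 * np.mean(y**2)**2

  rng = np.random.default_rng(0)
  print(kurtosis(rng.standard_normal(100000)))      # approximately 0 for gaussian data
  print(kurtosis(rng.uniform(-1, 1, 100000)))       # negative: sub-gaussian (flat) distribution
  print(kurtosis(rng.laplace(size=100000)))         # positive: super-gaussian (peaked) distribution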

Among all random variables of equal variance, entropy is largest for gaussian variables, so entropy can be used as a measure of nongaussianity. The entropy H of a random vector y with density f(y) is calculated as


H(y) = -\int f(y) \log f(y) \, dy.

Negentropy is a slightly modified version of differential entropy. The negentropy of a random vector y, relative to a gaussian random vector y_{gauss} with the same covariance matrix as y, is  J(y) = H(y_{gauss}) - H(y). However, calculating the negentropy exactly would require an estimate of the density f(y). Hyvärinen (1998) therefore proposed an approximation of negentropy with


J(y) \approx \sum_{i=1}^{p} k_i (E[G_i(y)] - E[G_i(v)] )^2,

where the k_i are some positive constants, v is a gaussian variable with zero mean and unit variance, y is assumed to be standardized to zero mean and unit variance, and the functions G_i are some nonquadratic functions, e.g.  G_1(u) = \frac{1}{a_1} \log \cosh(a_1 u) or  G_2(u) = -\exp(-u^2/2) , where  1 \le a_1 \le 2 is some suitable constant.
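A minimal sketch of the one-term version of this approximation (dropping the constant k_1, which only rescales the measure; the test distributions are again illustrative):

  # One-term negentropy approximation J(y) ~ (E[G(y)] - E[G(v)])^2
  # with G(u) = (1/a1) * log(cosh(a1 * u)) and a gaussian reference sample v
  import numpy as np

  def negentropy_approx(y, a1=1.0, n_gauss=100000, seed=0):
      y = (y - y.mean()) / y.std()                              # standardize y
      G = lambda u: np.log(np.cosh(a1 * u)) / a1
      v = np.random.default_rng(seed).standard_normal(n_gauss)  # gaussian reference sample
      return (G(y).mean() - G(v).mean())**2

  rng = np.random.default_rng(1)
  print(negentropy_approx(rng.standard_normal(50000)))   # close to 0 for gaussian data
  print(negentropy_approx(rng.laplace(size=50000)))      # clearly larger for nongaussian data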

The justification for maximization of nongaussianity follows from the principle of mutual information. The mutual information between m random variables y_i is


I(y_1, y_2, ... , y_m) = \sum_{i=1}^m H(y_i) - H(y).

The mutual information is equivalent to the Kullback-Leibler divergence between the joint density f(y) and the product of its marginal densities,  I(y_1, y_2, ..., y_m) = D_{KL}(f(y) || f(y_1) f(y_2) ... f(y_m)), which is always nonnegative, and zero if and only if the variables are statistically independent. Negentropy is then fundamentally related to mutual information by

 
I(y_1,y_2,...,y_m) = C - \sum_{i=1}^m J(y_i).
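
A short sketch of why this relation holds (assuming whitened data x and components y = Wx constrained to be uncorrelated with unit variance, so that W is orthogonal, as in Hyvärinen and Oja, 2000): for an invertible linear transform, H(y) = H(x) + \log|\det W| = H(x), and hence

I(y_1, ..., y_m) = \sum_{i=1}^m H(y_i) - H(x) = \sum_{i=1}^m (H(y_{gauss}) - J(y_i)) - H(x) = C - \sum_{i=1}^m J(y_i),

where C = m H(y_{gauss}) - H(x) does not depend on W, since all y_i have the same unit variance. Minimizing mutual information is therefore equivalent to maximizing the sum of the negentropies, i.e. the nongaussianities, of the estimated components.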

Preprocessing for ICA

Figure 3: Preprocessing for ICA. The joint distribution of the original signals s1 and s2 is recorded, leading to the joint distribution of the observed mixtures x1 and x2. With whitening, the joint distribution of the whitened mixtures x1' and x2' is a clearly rotated version of the original distribution.

In practical applications it is very useful to preprocess the data before applying ICA, e.g. by centering, whitening or band-pass filtering the recorded signals. Centering subtracts the mean vector in order to make the recorded signal x(t) a zero-mean variable. Whitening is performed so that the components of x(t) are uncorrelated and their variances equal unity. This can be done e.g. with an eigenvalue decomposition of the covariance matrix E[xx^T] = EDE^T, giving


x' = ED^{-1/2}E^T x

with the diagonal matrix of eigenvalues D, where  D^{-1/2} = diag(d_1^{-1/2}, ..., d_n^{-1/2}), and the orthogonal matrix of eigenvectors E. Further preprocessing could e.g. include band-pass filtering. The advantage can be seen in the figure: the joint distribution of the signals s1 and s2 on the left is recorded and observed as the joint distribution of the recorded signals x1 and x2 in the center. With whitening, the joint distribution of x1' and x2' again defines a square, which is now clearly a rotated version of the original one.
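A minimal sketch of centering and whitening along these lines (data layout assumed: one row per variable, one column per observation):

  # Whitening x' = E D^{-1/2} E^T x, using the eigendecomposition of the
  # sample covariance matrix of the centered data
  import numpy as np

  def whiten(x):
      x = x - x.mean(axis=1, keepdims=True)      # centering: subtract the mean vector
      cov = np.cov(x)                            # sample covariance E[x x^T]
      d, E = np.linalg.eigh(cov)                 # eigenvalues d and orthogonal eigenvectors E
      D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
      return E @ D_inv_sqrt @ E.T @ x

  rng = np.random.default_rng(2)
  x = np.array([[2.0, 0.5], [0.5, 1.0]]) @ rng.standard_normal((2, 5000))
  x_white = whiten(x)
  print(np.round(np.cov(x_white), 2))            # approximately the identity matrix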

The FastICA Algorithm

In the FastICA algorithm (Hyvärinen and Oja, 2000), the approximation of the negentropy of w^Tx is maximized at certain optima of  E[G(w^Tx)] , with  W = (w_1, ..., w_n)^T denoting  A^{-1} . According to the Kuhn-Tucker conditions (Luenberger, 1969), the optima under the constraint  E[(w^Tx)^2] = ||w||^2 = 1 are at points where


E[x g(w^Tx)] - \beta w = 0

with g the derivative of G and \beta a constant. The constraint follows since the variance of  w^Tx is constrained to unity, and for whitened data this is equivalent to constraining the norm of w to be unity. Given this optimality condition, the FastICA algorithm follows as (a small code sketch is given after the list):

  1. Choose an initial (e.g. random) weight vector w_j.
  2. Let  w_j^+ = E[xg(w_j^Tx)] - E[g'(w_j^Tx)]w_j : the update follows from a Newton iteration on the optimality condition and moves w_j in a direction such that the projection  w_j^Tx becomes maximally nongaussian.
  3. Orthogonalization:  w_j^+ = w_j^+ - \sum_{k=1}^{j-1}(w_j^{+T}w_k)w_k : in order to prevent different vectors from converging to the same maximum, each new vector is decorrelated from the previously found ones after every iteration.
  4. Normalization:  w_j = \frac{w_j^+}{||w_j^+||}
  5. If not converged, i.e. if the old and new w_j do not point in (almost) the same direction, go back to 2.
  6. Set j = j + 1. For  j \le n , go back to step 1.
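
The following is a minimal sketch of this deflation scheme (assuming the data matrix X has already been centered and whitened, with one row per mixture and one column per sample; it uses g(u) = tanh(u), the derivative of G_1 with a_1 = 1):

  # Deflation-based FastICA sketch for centered, whitened data X (mixtures x samples)
  import numpy as np

  def fastica(X, n_components, max_iter=200, tol=1e-6, seed=0):
      rng = np.random.default_rng(seed)
      n, m = X.shape
      W = np.zeros((n_components, n))
      for j in range(n_components):
          w = rng.standard_normal(n)                       # 1. random initial weight vector
          w /= np.linalg.norm(w)
          for _ in range(max_iter):
              wx = w @ X
              w_new = (X * np.tanh(wx)).mean(axis=1) \
                      - (1 - np.tanh(wx)**2).mean() * w    # 2. Newton-type update
              w_new -= W[:j].T @ (W[:j] @ w_new)           # 3. decorrelate against earlier vectors
              w_new /= np.linalg.norm(w_new)               # 4. normalize
              if np.abs(np.abs(w_new @ w) - 1) < tol:      # 5. converged (up to sign)
                  w = w_new
                  break
              w = w_new
          W[j] = w                                         # 6. proceed to the next component
      return W                                             # estimated unmixing matrix

  # Example with the whitened toy mixtures from the earlier sketches:
  # S_hat = fastica(x_white, n_components=2) @ x_white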

Study results

Weighted ICs

Figure 4: 5 minute closing prices of the Allianz share between April 2000 and August 2007 on the top. In the middle the normed price difference is shown on the left, and cumulated on the right. The bottom shows the weighted ICs for the normed price difference on the left, and cumulated on the right.

The data set consists of 5 minute closing prices p(t) for 28 stocks in the Deutsche Aktien Index (DAX). The missing stocks are TUI and Hypo Real Estate due to data unavailability. With a time frame of nearly 7 years between April 2000 and August 2007, the data set comprises 140,000 data points per stock. For illustration of the method, the Allianz share was chosen; its price in this period is plotted at the top of the figure. For further analysis, the price difference was calculated as x(t) = p(t) - p(t-1) and the mean of the resulting series was then subtracted. The result is shown for the Allianz share in the second row of the figure on the left. On the right of the second row, the cumulated normed difference is plotted. Note that the cumulated normed difference does not(!) equal the original time series, because the mean of all price differences was subtracted from every price difference. Since the centered differences have mean  \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i = 0 , it follows immediately that  \sum_{i=1}^n x_i = 0 , so the transformed time series cumulates to zero over the full period. Nevertheless, the transformed time series still exhibits some of the properties of the original time series, like the large drawdown of the Allianz share price in 2002.
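A minimal sketch of this preparation step (the variable names and data layout are assumptions; P is taken to hold the 5 minute closing prices with one row per stock):

  # Normed price differences: difference each price series and subtract its mean
  import numpy as np

  def normed_differences(P):
      X = np.diff(P, axis=1)                       # x(t) = p(t) - p(t-1) per stock
      return X - X.mean(axis=1, keepdims=True)     # centering: subtract the mean difference

  # np.cumsum(normed_differences(P), axis=1) ends at zero for every stock,
  # which is why the cumulated series in the figure returns to zero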

After calculation of the independent components, they are weighted with the mixing coefficients  a_{ji} for stock j and IC i. On the bottom left of the figure, the 28 independent components weighted for Allianz are shown. The cumulated sum of all weighted independent components is shown on the bottom right, which again equals the transformed time series of the Allianz closing prices. This is intuitive because no information about the price movement is lost in the ICA algorithm; it is merely redistributed among the ICs.
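A sketch of this bookkeeping (assuming the centered differences X have been decomposed as X \approx A S, with the whitening step folded into A; the rows of S are the ICs):

  # Weighted ICs a_ji * s_i(t) for stock j, and their sum (the reconstructed series)
  import numpy as np

  def weighted_ics(A, S, j):
      return A[j][:, None] * S                     # shape (n_ics, T): one weighted IC per row

  def reconstruct(A, S, j):
      return weighted_ics(A, S, j).sum(axis=0)     # equals X[j] up to estimation error

  # np.cumsum(reconstruct(A, S, j)) gives the cumulated curve shown in the figure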

Noise and thresholded ICs

Figure 5: Study results. In the left column, the cumulated weighted ICs are used to reconstruct the Allianz share price. In the right column the summed weighted ICs for the Allianz share price are thresholded to reconstruct the signal.

With the information on market movement now distributed over the independent components, some interesting applications arise. Firstly, considering only the most dominant ICs for reconstruction reveals some of the driving forces of market movement while filtering out noise, as plotted in the left column of the figure. The top shows the cumulated normed price difference of the Allianz share. The center shows the reconstruction of the time series using only the 4 most dominant ICs. That is, the reconstruction weights the ICs with the mixing vector for the Allianz share normed price difference and then only considers the 4 ICs with the highest absolute weights. These ICs are then summed at each point in time and cumulated to produce the plot shown. Interestingly, these ICs already show the main properties of the time series. On the bottom, the reconstruction from the remaining ICs is shown, obtained with the same procedure; these ICs contribute mainly noise to the original signal.
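One way to read this selection as a sketch (an assumption of this sketch: dominance is measured by the absolute value of the mixing weights a_{ji}):

  # Reconstruct stock j from its k most dominant ICs and, separately, from the rest
  import numpy as np

  def split_reconstruction(A, S, j, k=4):
      order = np.argsort(-np.abs(A[j]))                     # ICs sorted by |weight| for stock j
      dominant, rest = order[:k], order[k:]
      signal = (A[j, dominant][:, None] * S[dominant]).sum(axis=0)
      noise = (A[j, rest][:, None] * S[rest]).sum(axis=0)
      return np.cumsum(signal), np.cumsum(noise)            # cumulated, as in the figure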

Another analysis, shown in the right column of the figure, considers the summed weighted ICs for the Allianz share price illustrated on the top. In this case, the ICs are thresholded at 0.3, which means that only signals above the threshold of 0.3 are considered for further analysis, as shown in the center of the figure. The bottom finally shows the cumulated signal of the thresholded ICs compared to the cumulated transformed time series. Interestingly, the few but strong signals from the ICA already capture some of the original signal, like the large drawdown in 2002.
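A sketch of one reading of this thresholding step (an assumption of this sketch: values of the summed weighted IC signal with absolute value below 0.3 are set to zero before cumulating):

  # Keep only strong movements of the summed weighted ICs and cumulate them
  import numpy as np

  def thresholded_reconstruction(A, S, j, threshold=0.3):
      y = (A[j][:, None] * S).sum(axis=0)            # summed weighted ICs for stock j
      y_thr = np.where(np.abs(y) > threshold, y, 0.0)
      return np.cumsum(y_thr)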

Conclusion

This study introduced the ICA algorithm following Hyvärinen and Oja (2000) and applied ICA to decompose the price changes of 28 DAX stocks into statistically independent components. The estimated ICs fall mainly into two categories:

  1. ICs responsible for the major changes in the stock prices
  2. ICs contributing only little to the overall level of the stocks

The analysis, which focused on modeling the stock price of the Allianz share, showed that the 4 most dominant ICs already capture the main characteristics of the original time series. Introducing a threshold to the IC analysis also showed that the strongest signals seem to model the main price movements. The remaining ICs contribute mainly noise to the time series.

References

  • A.D. Back, A.S. Weigend, A first application of independent component analysis to extracting structure from stock returns, Int. J. on Neural Systems 8(4):473-484 (1997)
  • T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley (1991)
  • P. Huber, Projection Pursuit, The Annals of Statistics 13(2):435-475 (1985)
  • A. Hyvärinen, E. Oja, Independent Component Analysis: Algorithms and Applications, Neural Networks 13(4-5):411-430 (2000)
  • A. Hyvärinen, New approximations of differential entropy for independent component analysis and projection pursuit, In Advances in Neural Information Processing Systems 10:273-279, MIT Press (1998)
  • D. Luenberger, Optimization by Vector Space Methods, Wiley (1969)