Analysis of Mortality

--Draco 10:41, 29 June 2007 (CEST)

Abstract

In this paper, we evaluate the relationship between the MORTALITY rate and a wide range of factors, including pollution indices, economic factors, social factors, and so on. First, we present some descriptive statistics of the dataset. Then we perform several multivariate statistical analyses, including data transformation, regression, principal component analysis, and cluster analysis.

All of the above analyses are performed using XploRe.

Key Words: Mortality, XploRe, Transformation, Regression, Principal Component Analysis, Cluster Analysis

Introduction

The mortality rates of different cities differ dramatically, and the question of which factors drive these differences attracts a great deal of attention. In this paper, we are interested in the relationship between the mortality rate and a wide range of factors, including pollution indices, economic factors, social life conditions, and so on. The dataset we study here is from McDonald, G.C. and Schwing, R.C. (1973). It covers 60 US cities in 1960 and contains 16 variables, which are shown in the following table. MORT is the response variable; all other variables are independent variables. Data source: [1]

Table 1. The dataset
Variable  Abbreviation  Detailed Explanation
X1   PREC    Average annual precipitation in inches
X2   JANT    Average January temperature in degrees F
X3   JULT    Average July temperature in degrees F
X4   OVER65  % of 1960 SMSA population aged 65 or older
X5   POPN    Average household size
X6   EDUC    Median school years completed by those over 22
X7   HOUS    % of housing units which are sound & with all facilities
X8   DENS    Population per sq. mile in urbanized areas, 1960
X9   NONW    % non-white population in urbanized areas, 1960
X10  WWDRK   % employed in white-collar occupations
X11  POOR    % of families with income < $3000
X12  HC      Relative hydrocarbon pollution potential
X13  NOX     Relative nitric oxides pollution potential
X14  SO2     Relative sulphur dioxide pollution potential
X15  HUMID   Annual average % relative humidity at 1 pm
X16  MORT    Total age-adjusted mortality rate per 100,000

Descriptive Analysis

We first examine the raw dataset using a range of descriptive techniques. These can be classified into two groups: univariate (boxplots, histograms and kernel density estimates) and multivariate (scatterplots, Chernoff-Flury faces, Andrews' curves, and parallel coordinate plots). The plots are shown below.

Boxplots

We can see from the boxplots that there are relatively many outliers beyond the upper bounds of variables 11-15, which are the low-income variable, the three pollution factors, and the humidity level, respectively. Although there is no statistical outlier in the mortality variable, this variable clearly spreads more widely than the others. Because of these outliers, we will apply data transformations later.
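
The original plots were produced with XploRe. As an illustration only, a comparable boxplot could be drawn with the following minimal Python sketch; the file name pollution.csv, the Table 1 column names, and the use of pandas/matplotlib are our assumptions, not part of the original study.

 import pandas as pd
 import matplotlib.pyplot as plt

 df = pd.read_csv("pollution.csv")   # hypothetical file name

 # Standardize each column so all 16 variables share one scale,
 # then draw side-by-side boxplots to compare spread and outliers.
 standardized = (df - df.mean()) / df.std()
 standardized.boxplot(rot=90)
 plt.ylabel("standardized value")
 plt.show()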

Histograms and Kernel Density

The histograms and kernel density estimates confirm what we observed in the boxplots: the distributions of variables 11-15 are skewed to the right. Moreover, the distributions of variables 2, 8, and 9 are also right-skewed, while those of variables 4-6 are slightly left-skewed. Later we will transform the data according to the skewness. Generally speaking, if a variable is right-skewed, a log transformation is applied; if it is left-skewed, a power (exponent) transformation is used.
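
This rule can be made concrete via the sample skewness. The following Python helper is a sketch of the idea only; the zero threshold, the particular power, and the helper name are illustrative choices, and the logarithm assumes strictly positive data.

 import numpy as np
 from scipy.stats import skew

 def suggest_transform(x):
     """Suggest a symmetry-improving transform from the sample skewness."""
     s = skew(x)
     if s > 0:
         # Right-skewed: compress large values with a logarithm
         # (assumes all values are positive).
         return np.log(x), "log"
     else:
         # Left-skewed: stretch large values with a power > 1.
         return x ** 2, "square"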

Scatterplots

It is impractical to put scatterplots of all pairs of the 16 variables into one graph. Hence we choose some of the variables which have clearer relations with mortality. The five independent variables we choose here are X1 (precipitation), X2 (January temperature), X6 (education level), X9 (non-white population), and X13 (nitric oxides pollution), together with the response variable X16 (mortality). We mark the data points whose mortality is higher than the mean in red, and the others in black. The plots in the bottom row show the relation between the five selected independent variables and the response variable. We can see that the first and ninth variables are slightly positively related to the response variable, while the sixth variable is slightly negatively related to it. Therefore, it appears from the data that mortality is higher with higher precipitation, a larger non-white population, or a lower education level. We will interpret this result further if it is confirmed by more rigorous statistical tests.
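
A plot of this kind might be produced as follows (again a Python sketch under the same assumptions as before: hypothetical pollution.csv, column names from Table 1).

 import pandas as pd
 import matplotlib.pyplot as plt

 df = pd.read_csv("pollution.csv")   # hypothetical file name
 chosen = ["PREC", "JANT", "EDUC", "NONW", "NOX"]

 # Red for cities with above-average mortality, black otherwise.
 colors = ["red" if m > df["MORT"].mean() else "black" for m in df["MORT"]]

 fig, axes = plt.subplots(1, len(chosen), figsize=(15, 3), sharey=True)
 for ax, var in zip(axes, chosen):
     ax.scatter(df[var], df["MORT"], c=colors, s=10)
     ax.set_xlabel(var)
 axes[0].set_ylabel("MORT")
 plt.show()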

Chernoff-Flury Faces

In the Chernoff-Flury faces plot, we assign the response variable (mortality) to the darkness of the hair and to the size and curvature of the mouth, and assign the independent variables to the other face features. We do so because the hair and mouth are relatively salient features of a human face, so tying the response variable to them makes it easier to spot relations between the response and the other variables. Still, it is difficult to identify outliers from the face plot, possibly because there are too many variables.

Andrews' Curves

From the Andrews' curves it is hard to find any clear relationship, since all curves are intertwined with one another. Together with the Chernoff-Flury faces, this shows that once we move to higher dimensions, it becomes difficult to spot clear structure in the data at a glance.
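
For completeness, Andrews' curves can be drawn with pandas; since the plotting function needs a class column, we reuse the above/below-mean mortality flag from the scatterplots. This is an illustrative sketch, not the original XploRe code.

 import pandas as pd
 from pandas.plotting import andrews_curves
 import matplotlib.pyplot as plt

 df = pd.read_csv("pollution.csv")   # hypothetical file name
 # Standardize so no variable dominates the curves, then flag high mortality.
 scaled = (df - df.mean()) / df.std()
 scaled["HIGH_MORT"] = (df["MORT"] > df["MORT"].mean()).map(
     {True: "high", False: "low"})
 andrews_curves(scaled.drop(columns="MORT"), "HIGH_MORT",
                color=["red", "black"])
 plt.show()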

Parallel Coordinate Plots

From the parallel coordinate plot we can detect some positive and negative relationships. For example, X4 (OVER65) and X5 (POPN) are negatively correlated, while X6 (EDUC) and X7 (HOUS) are positively correlated. The problem remains that most of the relationships are not clear. We can also spot some outliers. A transformation is therefore necessary.
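
A parallel coordinate plot can be sketched the same way; as before, the file name, column names, and the above/below-mean class flag are our illustrative choices.

 import pandas as pd
 from pandas.plotting import parallel_coordinates
 import matplotlib.pyplot as plt

 df = pd.read_csv("pollution.csv")   # hypothetical file name
 # Rescale to [0, 1] so all axes are comparable, then flag high mortality.
 scaled = (df - df.min()) / (df.max() - df.min())
 scaled["HIGH_MORT"] = (df["MORT"] > df["MORT"].mean()).map(
     {True: "high", False: "low"})
 parallel_coordinates(scaled, "HIGH_MORT", color=["red", "black"])
 plt.show()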

Transformation and Linear Regression

Linear regression using original data

In order to find the relationship between the independent variables and the response variable MORT, our first method is linear regression. We begin by taking all the variables together and running one linear regression. The XploRe output is the following:


We can see that the performance of this regression is not good. Only two of the sixteen estimated parameters have p-values lower than or around 0.001, and, worse, the correlations between the independent variables are unacceptably large. The result is therefore implausible. We think the reasons may be the asymmetry of the data and strong multicollinearity among some of the variables.
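
For readers without XploRe, a regression of MORT on all 15 candidate variables could be reproduced along the following lines. This is a Python sketch with statsmodels; the file name and column names are assumptions carried over from Table 1.

 import pandas as pd
 import statsmodels.api as sm

 df = pd.read_csv("pollution.csv")   # hypothetical file name
 X = sm.add_constant(df.drop(columns="MORT"))
 model = sm.OLS(df["MORT"], X).fit()
 print(model.summary())              # coefficients, p-values, R^2

 # Large pairwise correlations among predictors hint at multicollinearity:
 print(df.drop(columns="MORT").corr().round(2))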

Dataset Transformation

We have seen from the histograms and kernel density estimates that most of the variables are asymmetric. This harms the performance of our regression model, so we transform the data to make the regression better. We apply the transformations mainly according to the following rule: if a variable is right-skewed, we take the logarithm; if it is left-skewed, we raise it to a power greater than one.

Taking the logarithm, or raising a variable to a power smaller than one, helps to reduce right-skewness. This is because lower values move further away from each other, whereas the distances between greater values are reduced by these transformations; conversely, a power greater than one reduces left-skewness. Figure 8 displays histograms of the transformed variables. The transformed variables are more symmetric and have fewer outliers than the original ones.


X(1) = X(1)^{1.5}/10
X(4) = X(4)
X(5) = e^{X(5)}
X(6) = X(6)^{2}/10
X(7) = X(7)^{2.5}/1000
X(8) = log(X(8))
X(9) = log(X(9))
X(11) = log(X(11))
X(12) = log(X(12))
X(13) = log(X(13))
X(14) = log(X(14))
X(15) = X(15)^{2/3}
X(16) = log(X(16))

We can see that the kernel densities look much better after the transformation.
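
As an illustration, the transformations listed above could be applied as follows. This is a Python sketch rather than the original XploRe code; variables not listed above (X2, X3, X10) are left unchanged, and the file name is again hypothetical.

 import numpy as np
 import pandas as pd

 df = pd.read_csv("pollution.csv")   # hypothetical file name
 t = df.copy()
 # Transformations as listed above; X4 is the identity and is omitted.
 t["PREC"]  = df["PREC"] ** 1.5 / 10
 t["POPN"]  = np.exp(df["POPN"])
 t["EDUC"]  = df["EDUC"] ** 2 / 10
 t["HOUS"]  = df["HOUS"] ** 2.5 / 1000
 t["DENS"]  = np.log(df["DENS"])
 t["NONW"]  = np.log(df["NONW"])
 t["POOR"]  = np.log(df["POOR"])
 t["HC"]    = np.log(df["HC"])
 t["NOX"]   = np.log(df["NOX"])
 t["SO2"]   = np.log(df["SO2"])
 t["HUMID"] = df["HUMID"] ** (2 / 3)
 t["MORT"]  = np.log(df["MORT"])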

Backward Regression

Since we have 15 independent variables in total, even after the transformations it is hard to claim that they are truly independent of each other; multicollinearity probably exists among them. To mitigate multicollinearity, we do not simply run one linear regression; we use stepwise regression. Stepwise regression refers to regression models in which the choice of predictive variables is carried out by an automatic procedure. Usually this takes the form of a sequence of F-tests, but other criteria are possible, such as t-tests, adjusted R-squared, the Akaike information criterion, the Bayesian information criterion, Mallows' Cp, or the false discovery rate. To keep it simple, we use the F-test as the benchmark here. Backward regression starts with all candidate variables and tests them one by one for statistical significance, deleting any that are not significant; the F-test serves as the threshold that determines which variable is eliminated from the model. The XploRe output is the following:
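
As an illustration of the procedure (not the XploRe output itself), here is a minimal Python sketch of backward elimination. It uses coefficient p-values rather than the F-statistic as the criterion, and the helper name is our own.

 import statsmodels.api as sm

 def backward_elimination(y, X, alpha=0.05):
     """Drop the least significant predictor until all p-values < alpha."""
     cols = list(X.columns)
     while cols:
         fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
         pvals = fit.pvalues.drop("const")
         worst = pvals.idxmax()
         if pvals[worst] < alpha:
             return fit, cols        # every remaining variable is significant
         cols.remove(worst)          # eliminate the weakest variable
     return None, []

 # Usage (with the transformed data frame t from the sketch above):
 # fit, kept = backward_elimination(t["MORT"], t.drop(columns="MORT"))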

Forward Regression

Forward regression starts with no variables in the model, tries the candidate variables one by one, and includes each one that is statistically significant; the procedure stops when the next added variable would not be significant. Stepwise regression proceeds similarly, but also allows variables that are already in the model to leave if they are no longer significant after the next variable is added. The XploRe output of the forward regression is:
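
The forward step itself can be sketched similarly (again in Python, as an illustration rather than the XploRe output; p-values stand in for the F-test):

 import statsmodels.api as sm

 def forward_selection(y, X, alpha=0.05):
     """Add the most significant remaining predictor while its p-value < alpha."""
     chosen, remaining = [], list(X.columns)
     while remaining:
         best_p, best_var = 1.0, None
         for var in remaining:
             fit = sm.OLS(y, sm.add_constant(X[chosen + [var]])).fit()
             if fit.pvalues[var] < best_p:
                 best_p, best_var = fit.pvalues[var], var
         if best_p >= alpha:
             break                    # next candidate is not significant: stop
         chosen.append(best_var)
         remaining.remove(best_var)
     return chosen

 # Usage: chosen = forward_selection(t["MORT"], t.drop(columns="MORT"))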

We can see that backward and forward regression give the same result. The fitted model says that X1 (precipitation), X9 (non-white population), and X13 (nitric oxides pollution) are positively related to mortality, while X2 (January temperature) and X6 (education level) are negatively related to mortality. The regression result (in the transformed variables) is:  X16 = 6.8059 + 0.0035\times X1 - 0.0024\times X2 - 0.0079\times X6 + 0.0372\times X9 + 0.222\times X13

Principal Components Analysis

Now we are going to use another popular multivariate method, principal component analysis (PCA). The main objective of PCA is to reduce the dimension of the observations by finding the most informative projections, i.e. those that maximize the variance. Figure 10 presents the ability of the PCs to explain the variation.

It is easy to see that the first few PCs cumulatively explain a majority of the total variance. From the table below (Table 2), we find that the first five PCs already explain more than 80% of the total variance. As a result, we will focus on the first five PCs in what follows.
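
The screeplot and Table 2 come from XploRe; an equivalent computation could be sketched in Python as follows. We standardize the variables first and exclude the response MORT from the decomposition, both of which are our assumptions rather than stated choices of the original analysis.

 import pandas as pd
 from sklearn.decomposition import PCA
 from sklearn.preprocessing import StandardScaler

 df = pd.read_csv("pollution.csv")   # hypothetical file name
 X = StandardScaler().fit_transform(df.drop(columns="MORT"))

 pca = PCA()
 scores = pca.fit_transform(X)
 # Proportion and cumulative proportion of variance per PC (cf. Table 2).
 print(pca.explained_variance_ratio_.round(3))
 print(pca.explained_variance_ratio_.cumsum().round(3))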


Table 2. Proportion of variance of the PCs



Now we move further and analyze the correlations between the original variables and the first five PCs.


Table 3. Correlation between the original variables and the PCs

First, let us focus on the correlations between the PCs and the response variable X16 (mortality). It is obvious that X16 is strongly correlated with PC1 and PC2, and only weakly correlated with the other PCs. As a result, we can instead examine the correlations between PC1, PC2, and the other 15 variables, i.e. the first two columns. We find that X1, X3, and X11 have more or less similar correlations with PC1 and PC2; X2 stands apart from all the other variables; X6, X7, X4, and X10 have similar correlations with PC1 and PC2; X9, X5, and X16 have similar correlations with PC1 and PC2; and X13, X12, X8, and X14 are similar. Please take a special look at the red circles: the variables in the red circles are exactly the variables we used in the final regression, and they act as "seeds" within their own PC groups. Figure 13 gives a better view of the groups and "seeds".
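
A table in the spirit of Table 3 could be computed by correlating each original variable (including X16) with the component scores. Again a Python sketch under the same assumptions as the PCA sketch above.

 import numpy as np
 import pandas as pd
 from sklearn.decomposition import PCA
 from sklearn.preprocessing import StandardScaler

 df = pd.read_csv("pollution.csv")   # hypothetical file name
 scores = PCA(n_components=5).fit_transform(
     StandardScaler().fit_transform(df.drop(columns="MORT")))

 # Correlation of every original variable with each of the first five PCs.
 corr = pd.DataFrame(
     {f"PC{k + 1}": [np.corrcoef(df[c], scores[:, k])[0, 1]
                     for c in df.columns]
      for k in range(5)},
     index=df.columns)
 print(corr.round(2))                # cf. Table 3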

Cluster Analysis

Cluster analysis is a set of tools for building groups (clusters) from multivariate data objects. The aim is to construct groups with homogeneous properties out of large heterogeneous samples. Here we apply the Ward algorithm to the transformed data. The result is shown in Figure 14.
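
A sketch of this step in Python (instead of XploRe) follows; we standardize the variables so that none dominates the Euclidean distances, whereas the original analysis used the transformed data.

 import pandas as pd
 from scipy.cluster.hierarchy import dendrogram, linkage
 import matplotlib.pyplot as plt

 df = pd.read_csv("pollution.csv")   # hypothetical file name
 standardized = (df - df.mean()) / df.std()

 Z = linkage(standardized.values, method="ward")   # Ward linkage hierarchy
 dendrogram(Z)                                     # cf. Figure 14
 plt.xlabel("city index")
 plt.ylabel("Ward distance")
 plt.show()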

Summary

As presented above, the regression method and the PCA give us more or less the same results. With regression, we used backward/forward stepwise regression in order to avoid multicollinearity, and finally found that mortality can be regressed on five variables: X1 (PREC), X2 (JANT), X6 (EDUC), X9 (NONW), and X13 (NOX). With PCA, we found that the 15 variables can be divided into 5 subgroups, (X1, X3, X11), (X2), (X6, X4, X7, X10, X15), (X9, X5), and (X13, X8, X12, X14), with X1, X2, X6, X9, X13 the "seeds" of their respective groups. With cluster analysis we can also divide the observations into five clusters, though less distinctly. We take the regression equation as our final conclusion, that is:


 X16 = 6.8059 + 0.0035\times X1 - 0.0024\times X2 - 0.0079\times X6 + 0.0372\times X9 + 0.222\times X13


We find that, on top of the baseline (the constant 6.8059), more precipitation, a higher density of non-white population, and a higher level of nitric oxides pollution come with higher mortality, while a higher winter temperature and a higher educational level come with lower mortality. This is plausible. Heavier pollution is worse for people's health, and more precipitation helps the pollutants accumulate rather than be gone with the wind; a higher density of non-white population may come with more conflicts and crime. These factors raise the mortality rate. A higher temperature in January helps people get through a harsh winter and enjoy a warm new year, and people with a higher educational level not only know how to take care of themselves but also tend to help each other. These factors reduce the mortality rate.

We should note that some of the variables here are transformed. We take logarithms of the non-white population and nitric oxides variables, since these variables appear to have "economies of scale": if the density of the non-white population doubles, mortality may more than double, since there may be more crime or conflict in the communities; likewise for nitric oxides, whose harm grows faster than its magnitude. We raise the precipitation and education variables to a power, since their effects grow more slowly than the variables themselves, i.e. they have "diseconomies of scale". For example, a community with 16 years of education on average (all university graduates) may have a lower mortality rate than a community with 8 years (middle school), but not half as low; the same holds for precipitation. These two variables must change proportionally more than the others to produce the same proportional effect on mortality, which is why we raise them to a power.

Many factors affect mortality, and anyone can name a host of candidates and examples. With the support of solid statistical evidence, we argue that mortality is positively related to precipitation, the density of the non-white population, and nitric oxides, and negatively related to the January temperature and the educational level.

References

[Härdle, Klinke, Müller 2000] Härdle, W.; Klinke, S.; Müller, M.: XploRe Learning Guide. Springer Verlag Berlin-Heidelberg, 2000

[Härdle, Simar 2003] Härdle, W.; Simar, L.: Applied Multivariate Statistical Analysis. Springer Verlag Berlin-Heidelberg, 2003

[Härdle, Hlavka, Klinke 2000] Härdle, W.; Hlavka, Z.; Klinke, S.: XploRe Application Guide. Springer Verlag Berlin-Heidelberg, 2000

[McDonald, Schwing 1973] McDonald, G.C.; Schwing, R.C.: Instabilities of Regression Estimates Relating Air Pollution to Mortality. Technometrics 15 (1973), 463-481

XploRe Help, http://www.xplore-stat.de/help/_Xpl_Start.html

Data source: [2]

Comments

  • Where does the data come from? Your hint leads to a description of the data, but not the data itself.
  • Your variable list could include a classification of the variables, e.g. "pollution", etc.
  • Not all programs
  • Why do you choose these five variables; why not four or seven?
  • Plots: variable names?
  • Where is the result of the correlation?
  • How do you arrive at the specific transformations?
  • I would have liked to see the AIC as the decision criterion in the backward regression!
  • What is the R^2 for your reduced model?
  • Typos
  • PCA: I would have chosen only 3 components
  • What is the reason for the cluster analysis? What do you learn?
  • If you use modified variables, then this should show in your notation