Wage Statistics
The Dataset
Our dataset comes from the webpage http://lib.stat.cmu.edu/datasets/CPS_85_Wages. The datafile consists of 534 observations on 11 variables sampled from the Current Population Survey of 1985. Because there are too many variables and most of them are indicators, we concentrate on four non-indicator variables, plus the indicator variable Sex, which we use only to split the sample:
Wag – Wage (dollars per hour)
Edu – Number of years of education
Exp – Number of years of experience
Age – Age (years)
Sex – Indicator variable for sex (1 = female, 0 = male)
To compare the genders and find out whether there is a gender gap in wages, we use the variable "Sex" to divide our dataset into two groups: 288 male observations and 246 female observations.
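A minimal sketch of this split (the field names and the four sample rows are made up for illustration; only the sex coding, 1 = female and 0 = male, is from the dataset description):

```python
# Illustrative records only; the real file has 534 rows and 11 variables.
records = [
    {"wage": 5.10, "edu": 8,  "exp": 21, "age": 35, "sex": 1},
    {"wage": 4.95, "edu": 9,  "exp": 42, "age": 57, "sex": 1},
    {"wage": 6.67, "edu": 12, "exp": 1,  "age": 19, "sex": 0},
    {"wage": 4.00, "edu": 12, "exp": 4,  "age": 22, "sex": 0},
]

# Split on the indicator variable, as in the report.
female = [r for r in records if r["sex"] == 1]
male   = [r for r in records if r["sex"] == 0]
# On the full dataset this split gives 288 male and 246 female observations.
```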
Univariate analysis
Boxplot
For each variable, we draw the male and female boxplots on a common scale, which makes the differences between the genders easier to compare.
In the upper left is the boxplot for wage. The mean and the median for the females are lower than those for the males. There is one extreme outlier: a female with a very high wage. The spread of the male wages is larger than that of the female wages.
In the upper right is the education level. The means and medians of both genders' education levels are almost the same, but in the lower part of the plot there are more outliers among the males than among the females.
The lower two plots show experience and age. In both plots the mean and the median for the females are larger than those for the males. For experience, the spread of the males is smaller than that of the females; for age, the spreads are about the same.
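The quantities each box encodes can be computed directly; a sketch with made-up wage samples (the 44.5 value mimics the extreme female outlier mentioned above):

```python
import statistics as st

# Illustrative wage samples (dollars per hour), not the real CPS-85 values.
male_wage   = [4.0, 6.7, 8.9, 9.5, 11.2, 13.1, 26.0]
female_wage = [3.4, 4.6, 5.1, 6.3, 7.9, 9.0, 44.5]

def five_number(xs):
    """min, Q1, median, Q3, max: the five numbers a boxplot draws."""
    q1, med, q3 = st.quantiles(xs, n=4)
    return min(xs), q1, med, q3, max(xs)

print(five_number(male_wage))
print(five_number(female_wage))
```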
Histogram
From the boxplots, we get a general idea of the distribution of the dataset. Now we use histograms to look at the distributions in more detail.
On the left are the plots for the females. In the wage graph there is one outlier with an abnormally high value; most observations are concentrated around $5/h. In the education graph, most observations have around 12 years of education.
On the right are the plots for the males. In the wage graph, most observations are concentrated around $5/h and there is no abnormally high wage. In the education graph, most observations have around 12 years of education.
Kernel Density
To smooth and compare the distributions, we draw kernel density estimates of these variables, putting the same variable for males and females into one graph. To differentiate the genders, the male curves are colored blue and the female curves red.
In the wage graph, the most common wage for both genders lies around $5/h, but more females than males earn this wage. At higher wage levels, there are more males than females. For the education level, most observations lie around 12 years of education, with more female than male observations in this area. Some male observations have less than 5 years of education, which does not happen among the females. From the graph, we can say that the females have a higher education level than the males in this sample.
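The kernel density estimate behind these curves is just a sum of Gaussian bumps, one per observation; a minimal sketch (the bandwidth and the sample values are illustrative):

```python
import math

def kde(sample, x, h):
    """Gaussian kernel density estimate at point x with bandwidth h."""
    n = len(sample)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) \
        / (n * h * math.sqrt(2 * math.pi))

wages = [4.0, 4.5, 5.0, 5.5, 6.0, 9.0, 12.0]
# The density should peak near the cluster around $5/h and be tiny far away.
print(kde(wages, 5.0, h=1.0) > kde(wages, 20.0, h=1.0))  # → True
```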
Multivariate analysis
Chernoff-Flury Faces
In this part we use Chernoff-Flury faces to visualize the relationships among the variables. The facial features used to represent the variables are as follows:
Wag – face line and darkness of hair (wage is our main variable, so we use the two most distinguishable features to represent it)
Edu – upper and lower hair line
Exp – eye slant
Age – eye size
Because we have over 200 observations for each gender, the face graph contains too many faces, so we only chose the most representative one from each gender.
The graph on the left is the face plot for the females. In the middle of the graph we find a fat face with very dark hair and a normal hair line, but small, slanted eyes, which means this person earns a very high wage although she is very young, with little experience and a normal education level. She is really an outlier, just as we found in the univariate analysis before.
The graph on the right is the face plot for the males. It is easy to find some fat faces with dark hair, longer hair lines and medium-sized, non-slanted eyes, which means these persons have a high wage together with a good education, middle age and some experience. All these characteristics fit the image of people who earn a high wage, so these observations should not be discarded as outliers.
PCPs
To examine the relationships among the variables, we use PCPs. To highlight the relationship between the main variable Wag (X1) and the rest of the variables, we colored all observations whose X1 value is larger than the median of X1 as blue lines.
On the left is the PCP for the females. We can see that X2 (edu) has a strong positive relation with X1 (wag), because almost no blue lines reach the lower part of X2. For X3 (exp) and X4 (age), black and blue lines are mixed, but the black lines seem to outnumber the blue ones in the lower part. So there may be a positive relation between X1 and X3, X4, but it is really weak. We will check this later.
On the right is the PCP for the males. It shows a strong positive dependence between X1 and X2, X3, X4, since no blue lines reach the lower part of these three variables.
Up to now, all our conclusions come from plots. They rest on visual inspection and are therefore not precise. To reach precise conclusions, we must use numerical methods.
Z test
As can be seen from the previous analysis, the distributions of the male and the female data are in some respects very similar, but small differences exist. A natural question arises: does gender matter much in deciding the wage level? We choose the Z test to examine this question, because the Z test can best show whether the difference between the population mean and the sample mean is significantly large.
Generally speaking, when we use the Z test we need to know the population standard deviation of the variable, since

    Z = (x̄ - μ) / SE    (1)

    SE = σ / √n    (2)

Note: x̄ represents the sample mean, μ is the population mean, SE is the standard error, σ is the population standard deviation, and n is the number of observations.
In our case, we do not know the exact distribution of the variable wage, but fortunately our sample is large: the male data has 288 observations and the female data 246, which allows us to use the sample standard deviation to estimate the population standard deviation. So we get the following equation:
    Z ≈ (x̄ - μ) / (S / √n)    (3)

Note: x̄ is the sample mean, μ is the population mean, S is the sample standard deviation, and n is the number of observations.
From this equation, we can derive a confidence interval:

    [x̄ - z_{5%} · S/√n,  x̄ + z_{5%} · S/√n]    (4)
We choose 5% as the significance level; according to the Z table, z_{5%} = 1.645. For the male wage data: n = 288, x̄ = 9.9949, S = 5.2859. For the female wage data: n = 246, S = 4.1024.
Plugging these numbers into equation (4), we obtain the confidence intervals of the male and female wages respectively:

    Male:   9.9949 ± 1.645 · 5.2859/√288 = [9.4825, 10.5073]    (5)

    Female: x̄_f ± 1.645 · 4.1024/√246 = [x̄_f - 0.4303, x̄_f + 0.4303]    (6)

where x̄_f denotes the female sample mean.
Comparing these two confidence intervals, we draw the conclusion that gender does play an important role in wage determination: the wage of the females is on average about 2 dollars per hour less than that of the males.
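The interval computation is easy to check; a short sketch using the male numbers from the text (x̄ = 9.9949, S = 5.2859, n = 288) and the stated 1.645 quantile:

```python
import math

def z_interval(mean, s, n, z=1.645):
    """Interval mean ± z * s/sqrt(n), as in equation (4)."""
    half = z * s / math.sqrt(n)
    return mean - half, mean + half

# Male wage data as reported in the text.
lo, hi = z_interval(9.9949, 5.2859, 288)
print(round(lo, 4), round(hi, 4))  # ≈ 9.4825 10.5073
```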
Regression
Univariate analysis
Two scatterplot matrices are shown here: the first for the male dataset, the second for the female dataset. The relationship between each pair of variables can be seen clearly. Three assumptions can be drawn from them:
- Education plays a positive role in wage determination, since the slope between wage and education appears to be positive.
- Experience and age also influence wage positively, but this relationship seems weaker than the one between wage and education.
- Experience and age seem to be almost perfectly correlated with each other.
The correlation matrices also support these three assumptions:
Male:
            Wage      Education  Experience  Age
Wage        1         0.35768    0.18576     0.28353
Education   0.35768   1          0.35741     0.13717
Experience  0.18576   0.35741    1           0.97415
Age         0.28353   0.13717    0.97415     1
Female:
            Wage      Education  Experience  Age
Wage        1         0.49311    0.055818    0.1581
Education   0.49311   1          0.35129     0.1677
Experience  0.055818  0.35129    1           0.9817
Age         0.1581    0.1677     0.9817      1
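The near-unit correlation between experience and age in both tables is no surprise if experience is (approximately) potential experience, age minus education minus 6; a sketch of the Pearson coefficient on data built that way (the construction is an assumption for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical workers; experience built as age - education - 6.
age = [22, 30, 38, 45, 60]
edu = [12, 16, 12, 12, 10]
exp_ = [a - e - 6 for a, e in zip(age, edu)]
print(round(pearson(exp_, age), 3))  # close to 1
```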
For the male data, the correlation coefficient between wage and education is 0.35768, between wage and experience 0.18576, and between wage and age 0.28353. All three are positive, which means education, experience and age all influence wage positively. The ordering of the three coefficients shows that, for males, education plays the most important role in wage determination. Besides that, the correlation coefficient between experience and age is 0.97415, which is quite high. This can be explained as follows: the older a person is, the more working experience he has. Because most education values in our sample are around 12 years (high school graduation), a case like a 30-year-old with a doctoral degree and no working experience is rare in our sample and can be ignored.
For the female data, similar conclusions can be drawn: all three variables influence wage positively, and experience and age are almost perfectly correlated. Two points deserve attention. First, the correlation coefficient between wage and education is 0.49311, higher than in the male case (0.35768), which means that for females education plays an even more important role in wage determination. Second, the correlation coefficient between wage and experience is only 0.055818, which means that for females experience seems not to be very important.
Simple Regression
We use the data to run the simple regressions. The three pictures above show the male data, the three pictures below the female data. The left two pictures show the relationship between wage and education, the middle two between wage and experience, and the right two between wage and age.
These six pictures confirm the three assumptions stated above. All three variables have a positive influence on wage. For the male case, the slope of the education regression is steeper than those of the other two variables, so we can guess that for males education plays the most important role in wage determination. For the female case, the education slope is even steeper than in the male case, while the slope of the experience regression is almost flat: education plays an even more important role, while experience is not so important. This is consistent with the conclusion we made before.
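Each of these simple regressions fits a line by least squares; a sketch of the slope computation on made-up (education, wage) pairs:

```python
def simple_ols(xs, ys):
    """Intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Illustrative pairs: wage rises with education, so the slope is positive.
edu  = [8, 10, 12, 12, 14, 16, 18]
wage = [4.5, 5.0, 6.5, 7.0, 8.5, 11.0, 13.0]
a, b = simple_ols(edu, wage)
print(b > 0)  # → True
```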
Multiple Regression
Multiple regression using original data
We first use the original data for the regression. After removing the outliers from the dataset, we take the variable "Wage" as the dependent variable, while the other three variables are treated as independent variables.
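The fit itself can be sketched in plain Python through the normal equations (the data below is synthetic, generated exactly as y = 1 + 2*x1 + 0.5*x2, so the fit should recover those coefficients; this is not the CPS sample):

```python
def ols(X, y):
    """Least squares: solve the normal equations (X'X) beta = X'y by
    Gaussian elimination with partial pivoting.
    X is a list of rows; each row starts with a 1 for the intercept."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
         for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    for col in range(k):                      # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):            # back substitution
        beta[r] = (b[r] - sum(A[r][c] * beta[c]
                              for c in range(r + 1, k))) / A[r][r]
    return beta

X = [[1, 0, 0], [1, 1, 2], [1, 2, 1], [1, 3, 5]]
y = [1.0, 4.0, 5.5, 9.5]
beta = ols(X, y)
print([round(v, 6) for v in beta])  # → [1.0, 2.0, 0.5]
```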
For the male case, we have:
Contents of out

The regression equation could be written from this result:
Wage = 32.0000 - 2.0000*edu - 1.0000*exp + 0.0000*age
This result does not make much sense: the coefficients of both education and experience are negative, while the coefficient of age is 0, which would mean age has no influence on the wage level. The t-tests as well as the p-values confirm this: the t-statistics of all four coefficients are 0 and their p-values are 1, so the regression on the original data is a failure.
For the female case, we have:
Contents of out

The regression equation could be written from this result:
Wage = 5.4121 + 1.1003*edu + 0.1895*exp - 0.1218*age
Although this equation shows positive coefficients for education and experience, the coefficient of age is negative. Moreover, the t-statistics are too small to reject the hypotheses that b_i = 0, and the p-values are much larger than the 5% level. So the regression for the female case on the original data also fails.
Why does the regression fail?
Taking the male data as an instance, we have a closer look at why the previous regression fails. Table 5 shows some statistical characteristics of this dataset: the mean, median, skewness, kurtosis, variance and standard deviation.
       Mean     Median  Skewness  Kurtosis  Variance  Std. dev.
Wag    9.9949   8.93    0.99984   3.5295    27.94     5.2859
Edu    13.014   12      0.3512    4.0891    7.6595    2.7676
Exp    16.965   14      0.81731   2.9701    147.25    12.135
Age    35.979   34      0.60789   2.5273    130.9     11.441
The table above shows that the variance and the standard deviation of all four variables are large, which means these variables fluctuate strongly. This disturbs the regression. To stabilize the fluctuation, we need to transform the dataset. We tried three transformation methods:
A. Exponential
Table 6 shows the statistical characteristics of the first candidate transformation of the male dataset: we take the exponential of each variable. The result shows that this is not a good transformation, since the variance and the standard deviation become even larger than in the original dataset.
          Mean        Median      Skewness  Kurtosis  Variance    Std. dev.
Exp(Wag)  2.9571e+09  7555.3      9.4415    101.99    4.497e+20   2.1206e+10
Exp(Edu)  7.2864e+06  1.6275e+05  2.8001    9.4323    3.0691e+14  1.7519e+07
Exp(Exp)  3.6466e+21  1.2026e+06  14.576    224.9     2.3208e+45  4.8175e+22
Exp(Age)  8.1037e+25  5.8346e+14  9.013     84.661    4.2056e+53  6.485e+26
B. Divided by 100
The second candidate transformation is to divide every variable by 100. The result is shown in Table 7: the variance and standard deviation become much smaller than before, but this does not make it a good transformation. Comparing the skewness and kurtosis in Table 7 with Table 5, we find that the numbers do not change at all. The nature of the dataset is unchanged; the variance only shrinks because every observation is 100 times smaller than before. The relative fluctuation of the dataset remains the same, so this transformation is not good either.
         Mean      Median  Skewness  Kurtosis  Variance    Std. dev.
Wag/100  0.099949  0.0893  0.99984   3.5295    0.002794    0.052859
Edu/100  0.13014   0.12    0.3512    4.0891    0.00076595  0.027676
Exp/100  0.16965   0.14    0.81731   2.9701    0.014725    0.12135
Age/100  0.35979   0.34    0.60789   2.5273    0.01309     0.11441
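The point about dividing by 100 can be verified directly: skewness (and kurtosis) are computed on standardized deviations, so rescaling every observation leaves them unchanged. A small sketch with made-up wage values:

```python
def skewness(xs):
    """Third standardized moment (population version)."""
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

wages = [4.0, 5.0, 5.5, 6.0, 9.0, 26.0]   # illustrative, right-skewed
scaled = [w / 100 for w in wages]
# Rescaling changes the variance but not the shape of the distribution.
print(abs(skewness(wages) - skewness(scaled)) < 1e-9)  # → True
```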
C. Log transformation
The third transformation method is to take the logarithm of each variable. One point needs illustration: for the variable experience, some people enter the labor market as soon as they graduate from school, so their experience is 0. Taking the logarithm of 0 is undefined and would make the transformed dataset meaningless, so we use log(Exp+0.1) to ensure every value is defined. The result is shown in Table 8: not only do the variance and standard deviation become much smaller, but the skewness and kurtosis change as well. That means the nature of the dataset has changed and the fluctuation really is stabilized by this transformation.
              Mean    Median  Skewness  Kurtosis  Variance  Std. dev.
log(Wag)      2.1653  2.1894  0.20718   3.0469    0.28564   0.53445
log(Edu)      2.538   2.4849  2.445     15.702    0.066591  0.25805
log(Exp+0.1)  2.4608  2.6462  1.8977    8.4007    1.2115    1.1007
log(Age)      3.5336  3.5264  0.053342  2.2107    0.09913   0.31485
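The offset trick for zero experience can be sketched as follows (the 0.1 offset is the one chosen in the report):

```python
import math

def safe_log(x, offset=0.1):
    """log(x + offset): keeps zero-experience observations defined,
    since log(0) itself is undefined."""
    return math.log(x + offset)

experience = [0, 1, 5, 20, 44]
transformed = [safe_log(e) for e in experience]  # log(0.1) for the zeros
```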
Multiple regression using transformed data
Using the transformed data, we run the multiple regression again.
For the male case, we have:
Contents of out

The regression equation from this result is:
log(Wag) = -0.8383 + 0.8214*log(edu) + 0.1794*log(exp+0.1) + 0.1351*log(age)
All three independent variables' coefficients are positive, which means all three variables exert a positive influence on wage. The coefficients are ordered log(edu) > log(exp+0.1) > log(age), so education plays the most important role in determining the wage, followed by experience and then age. This is consistent with the assumptions made at the beginning of the simple regression.
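As a rough check, the fitted equation can be evaluated and back-transformed to dollars per hour; a sketch using the printed coefficients (the intercept is taken as -0.8383, consistent with the mean log wage in Table 8) for a hypothetical worker with 12 years of education, 10 years of experience and age 30:

```python
import math

def predicted_log_wage(edu, exp_, age):
    """Fitted male equation from the report (coefficients from the text)."""
    return (-0.8383 + 0.8214 * math.log(edu)
            + 0.1794 * math.log(exp_ + 0.1)
            + 0.1351 * math.log(age))

wage = math.exp(predicted_log_wage(12, 10, 30))  # back to dollars per hour
print(round(wage, 2))  # ≈ 7.98
```

Since the equation is log-log, the education coefficient reads roughly as an elasticity: a 1% increase in education is associated with about a 0.82% higher wage.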
For the female case, we have:
Contents of out

The regression equation from this result is:
log(Wag) = -2.2458 + 1.2686*log(edu) + 0.0712*log(exp+0.1) + 0.2105*log(age)
In the female case, the three independent variables' coefficients are also all positive, which means all three variables exert a positive influence on wage. But the coefficients are ordered log(edu) > log(age) > log(exp+0.1): education plays the most important role in determining the wage, then age, while experience seems not so important for the wage level. This is consistent with the assumptions made at the beginning of the simple regression, and it can be explained from two aspects. First, the p-value of the experience coefficient is as high as 16.5%, so this estimate may be distorted by random noise and not so accurate. Second, some empirical examples suggest that, to a certain degree, the older a woman is, the more attention she devotes to family issues, while younger women with less working experience may have a stronger motivation to do their job well. This difference sometimes makes it easier for a young woman to earn a higher salary.
Conclusion
To sum up, four conclusions can be drawn from the above analysis:
- In general, males earn a higher wage than females with similar education and experience levels.
- All three factors examined in this presentation have a positive effect on wage.
- For males the order of effects is education > experience > age; for females it is education > age > experience.
- Our work is just a pilot study; further research is needed.
References
Data source: Data and Story Library (DASL) http://lib.stat.cmu.edu/datasets/CPS_85_Wages
Härdle, W., Klinke, S. and Müller, M. (2000). XploRe – Learning Guide. Springer-Verlag, Berlin Heidelberg.
Härdle, W., Hlavka, Z. and Klinke, S. (2000). XploRe – Application Guide. Springer-Verlag, Berlin Heidelberg.
Härdle, W. and Simar, L. (2003). Applied Multivariate Statistical Analysis. Springer-Verlag, Berlin Heidelberg.
Comments
- Is sex not an indicator variable?
- No programs
- Binwidth of histogram badly chosen
- Graphics could have been a little bit larger
- Typos
- What do you mean with "good" experience?
- is usually used to indicate the set of integer numbers
- What is the importance of giving the correlation with five digits after the decimal point?
- Regression plots: which variables are used?
- Which outliers have you removed from the data?
- Multiple regression: which variable is which?
- How should dividing the explanatory variables by 100 help in the regression?