Wage Statistic

From Teachwiki

The Dataset[edit]

Our dataset comes from the webpage http://lib.stat.cmu.edu/datasets/CPS_85_Wages. The data file consists of 534 observations on 11 variables sampled from the Current Population Survey of 1985. Because there are many variables and most of them are indicator variables, we concentrate on five of them: four non-indicator variables plus the indicator variable Sex, which we need to split the sample.

Wag  Wage (dollars per hour)
Edu  Number of years of education
Exp  Number of years of work experience
Age  Age (years)
Sex  Indicator variable for sex (1 = female, 0 = male)


In order to compare the effect of gender and find out whether there is a gender gap in wages, we use the variable Sex to divide the dataset into two groups: 288 observations for males and 246 for females.

Univariate analysis[edit]

Boxplot[edit]

Figure: boxplots of wage, education, experience and age, by gender (Bpl.jpg)


We put the same variable for males and females on one scale to make the differences between the genders easier to compare.

The upper left panel is the boxplot for wage. The mean and the median for females are lower than those for males. There is one extreme outlier, a female with a very high wage. The spread of male wages is larger than that of female wages.

The upper right panel shows the education level. The mean and the median of the education level are almost the same for both genders, but in the lower part of the plot there are more outliers among the males than among the females.

The lower two panels show experience and age. In both plots the mean and the median for females are larger than those for males. For experience, the spread for males is smaller than for females; for age, the spreads are about the same.
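The quantities a boxplot displays (quartiles, median, and the usual 1.5 * IQR outlier fences) can be computed directly. The following Python sketch uses NumPy on a small set of hypothetical wage values, not our actual dataset:

```python
import numpy as np

def boxplot_stats(x):
    """Five-number summary plus the 1.5*IQR outlier fences,
    i.e. the quantities a boxplot displays."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = x[(x < lo_fence) | (x > hi_fence)]
    return {"min": x.min(), "q1": q1, "median": med,
            "q3": q3, "max": x.max(), "outliers": outliers}

# Hypothetical wages (dollars/hour); one extreme value, like the
# female outlier visible in the wage boxplot.
wages = [4.5, 5.0, 5.5, 6.0, 7.5, 8.0, 9.0, 10.0, 44.5]
stats = boxplot_stats(wages)
print(stats["median"], stats["outliers"])
```

The fence rule flags the extreme value as an outlier while leaving the bulk of the sample untouched, which is exactly how the boxplots above identify the high-wage female observation.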



Histogram[edit]

Histogram for female
Histogram for male

From the boxplots we have a general idea of the distribution of the dataset. Now we use histograms to see the distributions in more detail.

On the left are the plots for females. In the wage graph there is an outlier with an abnormally high value; most observations concentrate around 5 $/h. In the education graph, most observations have around 12 years of education.

On the right are the plots for males. In the wage graph, most observations concentrate around 5 $/h and there is no abnormally high wage. In the education graph, most observations have around 12 years of education.

Kernel Density[edit]

densities of wage and education
densities of experience and age


To smooth and compare the distributions, we draw the kernel density of each variable and put the same variable for males and females in one graph. To differentiate the genders, the male curves are colored blue and the female curves red.

In the wage graph, although for both genders most observations are located around 5 $/h, more females than males earn this wage. At higher wage levels there are more males than females. For the education level, most observations are located around 12 years of education, but more female than male observations lie in this area. Some male observations have fewer than 5 years of education, while this does not happen for females. From the graph we can say that, in this sample, females have a somewhat higher education level than males.
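A kernel density estimate of this kind takes only a few lines. The sketch below implements the standard Gaussian-kernel estimator, f_hat(x) = 1/(n*h) * sum_i K((x - X_i)/h), on hypothetical wage values; the data and the bandwidth are illustrative assumptions, not our dataset:

```python
import numpy as np

def gaussian_kde(data, grid, bandwidth):
    """Gaussian-kernel density estimate evaluated on a grid:
    f_hat(x) = 1/(n*h) * sum_i phi((x - X_i) / h)."""
    data = np.asarray(data, dtype=float)
    u = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(data) * bandwidth)

# Hypothetical hourly wages clustered around 5 $/h (illustrative only)
wages = np.array([4.0, 4.5, 5.0, 5.0, 5.5, 6.0, 9.0, 12.0])
grid = np.linspace(0.0, 15.0, 151)
dens = gaussian_kde(wages, grid, bandwidth=1.0)

# a density estimate should integrate to roughly one
area = dens.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother but flatter curve; the qualitative comparison between the male and female curves is not very sensitive to this choice.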

Multivariate analysis[edit]

Chernoff-Flury Faces[edit]

In this part we use Chernoff-Flury faces to demonstrate the relationships among the variables. The characteristics we use to represent the variables are as follows:


Wag  Face lines and darkness of hair (this is our main variable, so we use the two most distinguishable characteristics to represent it)
Edu  Upper and lower hair line
Exp  Eye slant
Age  Eye size

Because we have over 200 observations for each gender, the face graphs contain too many pictures to show them all. We therefore chose only the most representative graph for each gender.


Chernoff-Flury face for female
Chernoff-Flury face for male

The graph on the left is the face plot for females. In the middle of the graph we find a fat face with very dark hair and a normal hair line but small, slanted eyes, which means that this person earns a very high wage but is very young, with little experience and a normal education level. She is indeed the outlier we already found in the univariate analysis.


The graph on the right is the face plot for males. It is easy to find some fat faces with dark hair, longer hair lines and medium-sized, non-slanted eyes, which means that they have a high wage together with a good education, middle age and some experience. All these characteristics fit the image of people who earn a high wage, so these observations should not be discarded as outliers.


PCPs[edit]

To examine the relationships among the variables we use parallel coordinate plots (PCPs). To highlight the relationship between the main variable Wag (X1) and the rest of the variables, we colored blue all observations whose X1 value is larger than the median of X1.


PCPs for female
PCPs for male

On the left is the PCP for females. We can see that X2 (Edu) has a strong positive relation with X1 (Wag), because almost no blue lines are drawn in the lower part of X2. For X3 (Exp) and X4 (Age), black and blue lines are mixed together, but there seem to be more black than blue lines in the lower part. So there may be a positive relation between X1 and X3 or X4, but it is really weak; we will check this later.


On the right is the PCP for males. It shows a strong positive dependence between X1 and X2, X3 and X4, since no blue lines are drawn in the lower part of these three variables.


Until now all our conclusions come from plots. They rest on visual impressions and are not precise. To obtain precise conclusions we must use numerical methods.

Z test[edit]

As the previous analysis shows, the distributions of the male and the female data are in some respects very similar, but small differences persist. It is natural to ask: does gender matter much in deciding the wage level? We use the Z test to answer this question, because the Z test shows whether the difference between the population mean and the sample mean is significantly large or not.

Generally speaking, when we use the Z test, we need to know the population standard deviation of the variable, since

  Z=\frac{\bar{X}-\mu}{SE}    (1)
  SE=\frac{\sigma}{\sqrt{n}}      (2)

Note: \bar{X} represents the sample mean, \mu the population mean, SE the standard error and n the number of observations.

In our case we do not know the exact distribution of the variable wage, but fortunately our sample is large (288 male and 246 female observations), which makes it possible to use the sample standard deviation to estimate the population standard deviation. So we get the following equation:

  Z=\frac{\bar{X}-\mu}{S/\sqrt{n}}     (3)

Note: \bar{X} is the sample mean, \mu the population mean, S the sample standard deviation and n the number of observations.

From this equation, we could get a confidence interval, that is

 \bar{X}-{\frac{S}{\sqrt{n}}}*Z_\alpha\le\mu\le\bar{X}+{\frac{S}{\sqrt{n}}}*Z_\alpha       (4)

We choose 5% as the significance level; according to the Z table, Z_{5%} = 1.645. For the male wage data, n = 288, \bar{X} = 9.9949 (see table 5) and S = 5.2859. For the female wage data, n = 246 and S = 4.1024.

Plugging these numbers into equation (4), we get the confidence intervals for the male and the female wage respectively:

  9.4825 \le\mu_{male}\le 10.5096        (5)
  7.8796 \le\mu_{female}\le 8.3092       (6)

Comparing these two confidence intervals, we conclude that gender does play an important role in wage determination: on average, the wage of females is about 2 dollars per hour less than that of males.
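Equation (4) is easy to verify numerically. This Python sketch recomputes the male interval from the figures quoted in the text (n = 288, S = 5.2859, with the sample mean 9.9949 taken from table 5 below):

```python
import math

def z_conf_interval(xbar, s, n, z=1.645):
    """Interval xbar +/- z * s/sqrt(n), as in equation (4);
    z = 1.645 is the tabled value used in the text."""
    half = z * s / math.sqrt(n)
    return xbar - half, xbar + half

# Male sample figures quoted in the text / table 5
lo, hi = z_conf_interval(9.9949, 5.2859, 288)
print(round(lo, 4), round(hi, 4))
```

The same call with the female figures reproduces interval (6); since the two intervals do not overlap, the difference in mean wages is significant at this level.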

Regression[edit]

Univariate analysis[edit]

scatterplot for male
scatterplot for female


Two scatterplot matrices are shown here: the first for the male dataset, the second for the female dataset. The detailed relationship between each pair of variables can be seen clearly. Three assumptions can be drawn from them:

  1. Education plays a positive role in wage determination, since the slope between wage and education seems to be positive.
  2. Experience and age also have a positive influence on wage determination, since their slopes are also positive, but these relationships seem to be weaker than that between wage and education.
  3. Experience and age seem to be perfectly correlated with each other.

The correlation matrices also support these three assumptions:

table 3 matrix of correlation of male data
× Wage Education Experience Age
Wage 1 0.35768 0.18576 0.28353
Education 0.35768 1 -0.35741 -0.13717
Experience 0.18576 -0.35741 1 0.97415
Age 0.28353 -0.13717 0.97415 1


table 4 matrix of correlation of female data
× Wage Education Experience Age
Wage 1 0.49311 0.055818 0.1581
Education 0.49311 1 -0.35129 -0.1677
Experience 0.055818 -0.35129 1 0.9817
Age 0.1581 -0.1677 0.9817 1


For the male data, the correlation coefficient between wage and education is 0.35768, between wage and experience 0.18576, and between wage and age 0.28353. All three numbers are positive, which means that education, experience and age all have a positive influence on wage determination; the order of the three coefficients shows that, for males, education plays the most important role. Besides that, the correlation coefficient between experience and age is 0.97415, which is very high. This can be explained simply: the older a person is, the more working experience he will have. Because most education values in our sample are around 12 years, i.e. high-school graduation, a case like a 30-year-old doctor with 0 years of working experience is rare in our sample and can be ignored.

For the female data, similar conclusions can be drawn: all three variables play a positive role in wage determination, and experience and age are almost perfectly correlated. But two points deserve attention. First, the correlation coefficient between wage and education is 0.49311, higher than in the male case (0.35768), which means that for females education plays an even more important role in wage determination. Second, the correlation coefficient between wage and experience is only 0.055818, which means that for females experience seems not to be so important.
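Such correlation matrices can be reproduced with np.corrcoef. The sketch below uses a small hypothetical sample in which age is constructed as education + experience + 6; this is exactly the mechanism that makes experience and age almost perfectly correlated in the real data:

```python
import numpy as np

# Hypothetical mini-sample (rows: wage, education, experience, age).
# Age is tied to education and experience by construction, which is
# why the exp/age correlation comes out close to one.
edu = np.array([12, 12, 16, 12, 14, 18, 12, 10], dtype=float)
exp_ = np.array([5, 20, 3, 10, 8, 15, 30, 12], dtype=float)
age = edu + exp_ + 6
wage = (4 + 0.6 * edu + 0.1 * exp_
        + np.array([0.5, -0.3, 0.2, 0.0, -0.4, 0.3, -0.2, 0.1]))  # noise

corr = np.corrcoef(np.vstack([wage, edu, exp_, age]))
print(np.round(corr, 3))
```

The off-diagonal entries play the role of tables 3 and 4: the wage/education entry is clearly positive, and the experience/age entry is close to one.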

Simple Regression[edit]

Figure: simple regressions of wage on education, experience and age, for males (top row) and females (bottom row) (Simple-regression.JPG)

We use the data to run simple regressions. The three pictures in the upper row show the male data, the three in the lower row the female data. The left pair shows the relationship between wage and education, the middle pair between wage and experience, and the right pair between wage and age.

These six pictures support the three assumptions stated above. All three variables have a positive influence on wage determination. For the male case, the slope of the education regression is steeper than those of the other two variables, so we can guess that, for males, education plays the most important role in wage determination. For the female case, the education slope is even steeper than in the male case, while the slope of the experience regression is almost flat; that means education plays an even more important role in wage determination, while experience is not so important. This is consistent with the conclusion we drew before.
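Each of these simple regressions is an ordinary least-squares line fit; in Python, np.polyfit(x, y, 1) returns its slope and intercept. The wage/education pairs below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical wage/education pairs; polyfit with degree 1 fits the
# least-squares regression line and returns (slope, intercept).
edu = np.array([10, 12, 12, 14, 16, 18], dtype=float)
wage = np.array([5.0, 7.0, 6.5, 8.5, 10.0, 12.0])

slope, intercept = np.polyfit(edu, wage, 1)
print(round(slope, 4), round(intercept, 4))
```

A positive slope here corresponds to assumption 1 above; comparing slopes across the three regressors is what the six panels do visually.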

Multiple regression[edit]

Multiple regression using original data[edit]

We use the original data for the regression. First we remove the outliers from the dataset; then we take the variable Wage as the dependent variable and treat the other three variables as independent variables.

For the male case, we have:

Contents of out

[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                164707.288     3 54902.429      84.286   0.0000"
[ 5,] "Residuals                 185644.862   285   651.385"
[ 6,] "Total Variation             8046.794   288    27.940"
[ 7,] ""
[ 8,] "Multiple R      = 4.52423"
[ 9,] "R^2             = 20.46868"
[10,] "Adjusted R^2    = -22.31351"
[11,] "Standard Error  = 25.52225"
[12,] ""
[13,] ""
[14,] "PARAMETERS         Beta         SE         StandB        t-test   P-value"
[15,] "________________________________________________________________________"
[16,] "b[ 0,]=         32.0000     476055093.2051       0.0000         0.000   1.0000"
[17,] "b[ 1,]=         -2.0000     79342515.5342      -1.0472         0.000   1.0000"
[18,] "b[ 2,]=         -1.0000     79342515.5342      -2.2957         0.000   1.0000"
[19,] "b[ 3,]=          0.0000     79342515.5342       0.0000         0.000   1.0000"

The regression equation could be written from this result:

Wage = 32.0000 - 2.0000 * edu - 1.0000 * exp + 0.0000 * age

This result does not make much sense: the coefficients of both education and experience are negative, while the coefficient of age is 0, which would mean that age has no influence on the wage level. The t-tests and P-values confirm this: the t statistics of all four coefficients equal 0 and their P-values are 1, so the regression on the original data is a failure.

For the female case, we have:

Contents of out

[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                  1209.596     3   403.199      22.990   0.0000"
[ 5,] "Residuals                   4226.595   241    17.538"
[ 6,] "Total Variation             5436.191   244    22.279"
[ 7,] ""
[ 8,] "Multiple R      = 0.47171"
[ 9,] "R^2             = 0.22251"
[10,] "Adjusted R^2    = 0.21283"
[11,] "Standard Error  = 4.18781"
[12,] ""
[13,] ""
[14,] "PARAMETERS         Beta         SE         StandB        t-test   P-value"
[15,] "________________________________________________________________________"
[16,] "b[ 0,]=         -5.4121       6.5940       0.0000        -0.821   0.4126"
[17,] "b[ 1,]=          1.1003       1.0561       0.5663         1.042   0.2985"
[18,] "b[ 2,]=          0.1895       1.0571       0.5065         0.179   0.8578"
[19,] "b[ 3,]=         -0.1218       1.0552      -0.3097        -0.115   0.9082"

The regression equation could be written from this result:

Wage = -5.4121 + 1.1003 * edu + 0.1895 * exp - 0.1218 * age

Although this equation shows positive coefficients for education and experience, the coefficient of age is negative. Moreover, the t statistics are too small to reject the hypotheses that b_i = 0, and the P-values are much bigger than the 5% level. So the regression for the female case on the original data also fails.

Why does the regression fail[edit]

Taking the male data as an example, we have a further look at why the previous regression fails. Table 5 shows some statistical characteristics of this dataset: the mean, median, skewness, kurtosis, variance and standard deviation (the square root of the variance, labeled Sqrt).

table 5 statistical characteristics of male dataset
× Mean Median Skewness Kurtosis Var Sqrt
Wag 9.9949 8.93 0.99984 3.5295 27.94 5.2859
Edu 13.014 12 -0.3512 4.0891 7.6595 2.7676
Exp 16.965 14 0.81731 2.9701 147.25 12.135
Age 35.979 34 0.60789 2.5273 130.9 11.441


The table above shows that the variance and the standard deviation of these four variables are quite large, meaning that the variables fluctuate strongly. This disturbs the regression. To stabilize the fluctuation we need to transform the dataset. We tried three transformation methods:

A. Exponential

Table 6 shows the statistical characteristics of the first possible transformation of the male dataset: we take the exponential of each variable. The result shows that this is not a good transformation, since the variance and the standard deviation become even bigger than in the original dataset.

table 6 statistical characteristics of exponential male dataset
× Mean Median Skewness Kurtosis Var Sqrt
Exp(Wag) 2.9571e+09 7555.3 9.4415 101.99 4.497e+20 2.1206e+10
Exp(Edu) 7.2864e+06 1.6275e+05 2.8001 9.4323 3.0691e+14 1.7519e+07
Exp(Exp) 3.6466e+21 1.2026e+06 14.576 224.9 2.3208e+45 4.8175e+22
Exp(Age) 8.1037e+25 5.8346e+14 9.013 84.661 4.2056e+53 6.485e+26


B. Divided by 100

The second possible transformation is to divide every variable by 100. The result is shown in table 7. The variance and the standard deviation become much smaller than before, but that is not enough to call this a good transformation. Comparing the skewness and kurtosis in table 7 with those in table 5, we find that they do not change at all. The nature of the dataset therefore does not change: the variance shrinks only because every observation is divided by 100, while the relative fluctuation of the dataset remains the same. So this transformation is not good either.


table 7 statistical characteristics of male dataset divided by 100
× Mean Median Skewness Kurtosis Var Sqrt
Wag/100 0.099949 0.0893 0.99984 3.5295 0.002794 0.052859
Edu/100 0.13014 0.12 -0.3512 4.0891 0.00076595 0.027676
Exp/100 0.16965 0.14 0.81731 2.9701 0.014725 0.12135
Age/100 0.35979 0.34 0.60789 2.5273 0.01309 0.11441


C. Log-transformation

The third transformation method is to take the logarithm of each variable. One point needs illustration: some people enter the labor market as soon as they graduate, so their experience is 0, and taking the logarithm would then be undefined. We therefore use log(Exp + 0.1) to avoid undefined values. The result is shown in table 8. Not only do the variance and the standard deviation become much smaller than before, but the skewness and kurtosis also change. That means the nature of the dataset has changed and the fluctuation really is stabilized by this transformation.


table 8 statistical characteristics of logarithm male dataset
× Mean Median Skewness Kurtosis Var Sqrt
log(Wag) 2.1653 2.1894 -0.20718 3.0469 0.28564 0.53445
log(Edu) 2.538 2.4849 -2.445 15.702 0.066591 0.25805
log(Exp+0.1) 2.4608 2.6462 -1.8977 8.4007 1.2115 1.1007
log(Age) 3.5336 3.5264 0.053342 2.2107 0.09913 0.31485
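The effect of the log-transformation on the shape of a distribution can be checked directly. This sketch computes the sample skewness (the third standardized central moment) of a hypothetical right-skewed experience variable before and after applying the log(Exp + 0.1) device from the text; as table 8 also shows, the zeros can even push the skewness from positive to negative:

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized central moment."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

# Hypothetical right-skewed experience values, including zeros;
# the 0.1 offset mirrors the text's log(Exp + 0.1) trick.
exp_years = np.array([0, 0, 1, 2, 3, 5, 8, 14, 20, 40], dtype=float)
log_exp = np.log(exp_years + 0.1)

print(round(skewness(exp_years), 3), round(skewness(log_exp), 3))
```

The transformation compresses the long right tail, but log(0 + 0.1) = log(0.1) creates a short heavy left tail, which explains the negative skewness of log(Exp+0.1) in table 8.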

Multiple regression using transformed data[edit]

Using the transformed data, we run the multiple regression again.

For the male case, we have:

Contents of out

[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                    22.874     3     7.625      36.588   0.0000"
[ 5,] "Residuals                     59.391   285     0.208"
[ 6,] "Total Variation               82.264   288     0.286"
[ 7,] ""
[ 8,] "Multiple R      = 0.52731"
[ 9,] "R^2             = 0.27805"
[10,] "Adjusted R^2    = 0.27045"
[11,] "Standard Error  = 0.45650"
[12,] ""
[13,] ""
[14,] "PARAMETERS         Beta         SE         StandB        t-test   P-value"
[15,] "________________________________________________________________________"
[16,] "b[ 0,]=         -0.8383       0.5476       0.0000        -1.531   0.1269"
[17,] "b[ 1,]=          0.8214       0.1079       0.3966         7.610   0.0000"
[18,] "b[ 2,]=          0.1794       0.0500       0.3694         3.586   0.0004"
[19,] "b[ 3,]=          0.1351       0.1718       0.0796         0.786   0.4323"

The regression equation from this result is:

log(Wag) = -0.8383 + 0.8214 * log(edu) + 0.1794 * log(exp+0.1) + 0.1351 * log(age)

All three independent variables' coefficients are positive, so all three variables exert a positive influence on wage determination. The order of the coefficients, log(edu) > log(exp+0.1) > log(age), means that education plays the most important role in determining the wage, followed by experience and then age; the standardized coefficients (StandB: 0.3966, 0.3694, 0.0796) give the same ordering. This is consistent with the assumptions made at the beginning of the simple regression.
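A multiple regression of this form can be reproduced with ordinary least squares via np.linalg.lstsq. The sketch below generates hypothetical log-transformed data whose true coefficients are set to the fitted male values quoted above (purely as a synthetic ground truth), and checks that OLS recovers them:

```python
import numpy as np

# Synthetic data mimicking log(Wag) ~ log(edu) + log(exp+0.1) + log(age);
# the "true" coefficients are borrowed from the fitted male equation
# above only to make the example concrete.
rng = np.random.default_rng(0)
n = 200
edu = rng.uniform(8.0, 18.0, n)
exp_ = rng.uniform(0.0, 30.0, n)
age = edu + exp_ + 6                    # age tied to edu and exp

X = np.column_stack([np.ones(n), np.log(edu),
                     np.log(exp_ + 0.1), np.log(age)])
true_beta = np.array([-0.8383, 0.8214, 0.1794, 0.1351])
y = X @ true_beta + rng.normal(0.0, 0.02, n)   # small noise

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 3))
```

Note that because age is built from education and experience, the regressors are correlated, just as in the real data; OLS still identifies the coefficients here because the relation is not exactly collinear on the log scale.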

For the female case, we have:

Contents of out

[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                    16.031     3     5.344      32.400   0.0000"
[ 5,] "Residuals                     39.582   240     0.165"
[ 6,] "Total Variation               55.613   243     0.229"
[ 7,] ""
[ 8,] "Multiple R      = 0.53690"
[ 9,] "R^2             = 0.28826"
[10,] "Adjusted R^2    = 0.27936"
[11,] "Standard Error  = 0.40611"
[12,] ""
[13,] ""
[14,] "PARAMETERS         Beta         SE         StandB        t-test   P-value"
[15,] "________________________________________________________________________"
[16,] "b[ 0,]=         -2.2458       0.5932       0.0000        -3.786   0.0002"
[17,] "b[ 1,]=          1.2686       0.1379       0.5197         9.202   0.0000"
[18,] "b[ 2,]=          0.0712       0.0511       0.1609         1.393   0.1650"
[19,] "b[ 3,]=          0.2105       0.1705       0.1404         1.235   0.2181"


The regression equation from this result is:

log(Wag) = -2.2458 + 1.2686 * log(edu) + 0.0712 * log(exp+0.1) + 0.2105 * log(age)

In the female case the three independent variables' coefficients are again all positive, so all three variables exert a positive influence on wage determination. But the order of the coefficients is log(edu) > log(age) > log(exp+0.1): education plays the most important role in determining the wage, followed by age, while experience seems not to be so important in deciding the wage level. This is consistent with the assumptions made at the beginning of the simple regression. It can be explained from two angles. First, the P-value of the experience coefficient is as high as 16.5%, so because of random noise this part of the regression may not be very accurate. Second, some empirical examples suggest that, to a certain degree, the older a woman is, the more attention she pays to family issues, while young women with less working experience often have a stronger motivation to do their jobs well; this difference sometimes makes it easier for a young woman to earn a higher salary.

Conclusion[edit]

To sum up, four conclusions can be drawn from the above analysis:

  1. In general, males have a higher wage than females, even at similar education and experience levels.
  2. All three factors we examined have a positive effect on wage.
  3. For males the order of the effects is education > experience > age, while for females it is education > age > experience.
  4. Our work is just a pilot study; further research is needed.

References[edit]

Data source: Data and Story Library (DASL) http://lib.stat.cmu.edu/datasets/CPS_85_Wages

Härdle, W., Klinke, S. and Müller, M. (2000). XploRe Learning Guide. Springer-Verlag, Berlin Heidelberg.

Härdle, W., Hlavka, Z. and Klinke, S. (2000). XploRe Application Guide. Springer-Verlag, Berlin Heidelberg.

Härdle, W. and Simar, L. (2003). Applied Multivariate Statistical Analysis. Springer-Verlag, Berlin Heidelberg.

Comments[edit]

  • Is sex not an indicator variable?
  • No programs
  • Bin width of the histogram badly chosen
  • Graphics could have been a little bit larger
  • Typos
  • What do you mean with "good" experience?
  • \Z is usually used to indicate the set of integer numbers
  • What is the importance of giving the correlation with five digits after the decimal point?
  • Regression plots which variables are used?
  • Which outliers have you removed from the data?
  • Multiple regression: Which variable is which?
  • How should dividing the explanatory variables by 100 help in the regression?