Analysis of U.S. wages

From Teachwiki


There are many questions connected with wages: for instance, whether men earn more than women. Other questions concern race and so on. It can therefore be interesting to analyse data related to this topic.

This topic is the subject of great controversy, so we should be careful with our results. We should not state anything in general terms, because we only have data from a particular time and place.

Another problem concerns interpretation. For instance, if it turns out that men earn more money than women, this need not mean that women are discriminated against. It could be explained by women working in jobs which are paid worse, and so on. But then another question appears: why are these jobs paid worse? Is it not because women work there?

I do not want to answer these ethical questions, only to find some interesting relationships. But of course, the results someone chooses to show reveal a little of his opinions.

We are also going to show two algorithms that use interesting pruning techniques.

Description of the data[edit]

The data were collected in 1985 in the USA, so we cannot apply the results of this analysis to today's situation. The data contain 534 instances and 11 attributes, with no missing values. They include only people who have a job, i.e. who are employed.


This part describes the variables, because knowing what the attributes mean is necessary for understanding the results.

  • EDUCATION: Number of years of education.
  • SOUTH: Indicator variable for Southern region (person lives in the South, person lives elsewhere).
  • SEX: Indicator variable for sex (Female, Male).
  • EXPERIENCE: Number of years of work experience.
  • UNION: Indicator variable for union membership (Union member, Not union member).
  • WAGE: Wage (dollars per hour).
  • AGE: Age (years).
  • RACE: Race (Other, Hispanic, White).
  • OCCUPATION: Occupational category (Management, Sales, Clerical, Service, Professional, Other).
  • SECTOR: Sector (Other, Manufacturing, Construction).
  • MARR: Marital status (Unmarried, Married).

Outlier problem[edit]

Boxplot of Wage

Most of my attributes are nominal, or take only a small number of values (experience, education, etc.), and every value of these attributes has some observations. But we have to check the attribute wage for outliers. We see one extreme outlier, which we remove. No other outlier appeared in the scatter plot matrix. Some outliers are natural, e.g. for education, because some people studied for only 2 or 3 years and can still work.

Histograms of metric attributes

What our data look like[edit]

Red colour stands for men and blue for women. Most of the people studied for 12 years (mode = 12), which means most people in this data set completed high school. It is interesting that the wage distribution is decreasing, i.e. many people have a low income, so the typical employee earns less than the median or mean of this attribute. In other words, someone who earns the mean wage has a higher standard of living than usual. We also see more young people in our data set than usual. This can have the following causes.

  1. U.S. society in 1985 differed from current European society. This hypothesis seems plausible, because current European society tends to shrink and age.
  2. Another possibility is that the data contain only people who work. In some professions one can work only for a shorter time and then retire.

Of course, both of these reasons can contribute. One could think that another possible reason is that there are more jobless among older people, but other statistics show that this is not generally true.


The main part of the analysis was carried out with Weka, R and XploRe.

Akaike algorithm[edit]

We will estimate wage using linear regression with the Akaike information criterion (AIC) for model selection. In the general case, the AIC is

AIC = 2k - 2 \ln L \,

where k is the number of parameters in the regression model (how many attributes we use), and L is the maximized value of the likelihood function in the general linear model EY = G(X^\top \beta). Increasing the number of free parameters to be estimated improves the goodness of fit. Hence AIC not only rewards goodness of fit, but also includes a penalty that is an increasing function of the number of estimated parameters.


Wage is a numerical attribute, so linear regression is a good way to estimate its values. We fitted all possible linear regression models, i.e. models with different combinations of attributes and, where applicable, indicators of nominal attributes (this method is feasible only when there are not too many attributes; otherwise we would have to choose only some of them). We then choose the model with the lowest Akaike information criterion.
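The all-subsets search can be sketched as follows. This is an illustration assuming a numeric design matrix `X` and target `y` (the wage data themselves are not bundled here), not the exact Weka/R/XploRe code used in the analysis; the error variance is counted as one extra parameter in `k`.

```python
from itertools import combinations
import numpy as np

def aic_ols(X, y):
    """AIC = 2k - 2 ln L for an OLS fit with Gaussian errors,
    counting the error variance as one extra parameter."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                    # ML estimate of the variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * (k + 1) - 2 * log_lik

def best_subset_by_aic(X, y, names):
    """Fit every non-empty subset of attributes and keep the lowest AIC."""
    best_score, best_names = np.inf, None
    for r in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), r):
            design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])  # intercept
            score = aic_ols(design, y)
            if score < best_score:
                best_score, best_names = score, [names[c] for c in cols]
    return best_score, best_names
```

With p attributes this fits 2^p - 1 models, which is exactly why the text notes the method only works when there are not too many attributes.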

Table of regression coefficients
Attribute | Regression coefficient | Mean or incidence | Multiplication
Education | 0.8468 | 13.017 | 11.0677
I{sex=male} | 2.2094 | 0.5422 | 1.198
Experience | 0.2675 | 17.854 | 4.7759
I{union=member} | 1.6249 | 0.1801 | 0.2927
Age | -0.1674 | 36.863 | -6.1709
I{race=white} | 0.808 | 0.8236 | 0.6655
I{occupation=professional or management} | 2.4457 | 0.2983 | 0.7295
I{sector=manufacturing} | 1.2601 | 0.1857 | 0.234
Constant | -3.7893 | - | -3.7893

The first column lists the attributes found by the Akaike algorithm, the second the regression coefficients of the linear regression, and the third the mean if the variable is metric, or the incidence for indicators (the fraction of observations that satisfy the indicator's condition). Multiplication is the product of the two previous columns. This last number measures the importance of an attribute, in the sense that it shows the influence on the wage of the average person (not the typical person; for that we should use the mode!).

If we sum up the last column we obtain approximately 9, which can serve as another estimator of the average income. From the histograms above we see that the mode is around 3.
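The sum can be recomputed directly from the coefficient and mean/incidence columns. A minimal sketch; the values are copied from the table above and the products recomputed, so tiny rounding differences from the printed Multiplication column are possible.

```python
# Coefficient and mean/incidence pairs copied from the table above.
rows = {
    "Education":                   (0.8468, 13.017),
    "I{sex=male}":                 (2.2094, 0.5422),
    "Experience":                  (0.2675, 17.854),
    "I{union=member}":             (1.6249, 0.1801),
    "Age":                         (-0.1674, 36.863),
    "I{race=white}":               (0.808, 0.8236),
    "I{occupation=prof. or mgmt}": (2.4457, 0.2983),
    "I{sector=manufacturing}":     (1.2601, 0.1857),
}
constant = -3.7893

# Each contribution is coefficient times mean (metric) or incidence (indicator).
contributions = {name: coef * mean for name, (coef, mean) in rows.items()}

# Summing all contributions plus the constant predicts the wage of the
# "average person": roughly 9 dollars per hour, as stated in the text.
predicted_avg_wage = sum(contributions.values()) + constant
```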

Our table shows that the attribute which mainly influences the wage is education, but only globally (meaning that many people obtain a higher income because they studied 12 or more years). If we want to talk about local importance (how my income changes if I study for 13 years instead of 12), then we have to use the regression coefficients. But the problem is that the attributes have different ranges (the difference between being a man or a woman is much larger than the difference between 12 and 13 years of experience). We can solve this problem as follows: we multiply each coefficient by the attribute's IQR. Then we can treat all variables as taking values between 0 and 1 (the IQR of an indicator is, in our case, 1 if the incidence lies between 0.25 and 0.75 and 0 otherwise, but we will take it to be 1 in all cases).

Table of regression coefficients
Attribute | Regression coefficient | IQR | Multiplication
Education | 0.8468 | 3 | 2.5404
I{sex=male} | 2.2094 | 1 | 2.2094
Experience | 0.2675 | 18 | 4.815
I{union=member} | 1.6249 | 1 | 1.6249
Age | -0.1674 | 16 | -2.6784
I{race=white} | 0.808 | 1 | 0.808
I{occupation=professional or management} | 2.4457 | 1 | 2.4457
I{sector=manufacturing} | 1.2601 | 1 | 1.2601
Constant | -3.7893 | - | -3.7893

How can we interpret these products? For example, the product for age says how your income changes over your career. But here a problem appears: our model also contains experience, and that attribute has a positive sign. It follows that age influences the wage through more attributes than age alone.

Now the most important factor is experience. It tells us that people who have worked for a long time earn much more than people at the beginning of their career. These products describe how our wage changes across all situations, but another question remains: I can change my wage quite a lot within 3 years (if I study), but I need many more years of experience to change it by the same amount.

One may be surprised that gender influences the wage so much, but this can have reasons other than discrimination alone.

Goodness of our fit[edit]

Wage x predicted wage

Wage is on the x-axis and predicted wage on the y-axis. We can see that in many cases our model estimates quite well, because many instances lie around the line of identity. However, the model is not able to estimate the wage of people who earn a lot of money. We should mention that the two axes have different scales.

Association rules[edit]

We choose only the nominal attributes and then mine association rules (we mention only some of them).

E.g.: if south=south, sex=male, union=not member and marr=married, then race=white. This rule is satisfied in 83 of 89 cases. A very similar rule can be found without the attribute sex.

If occupation=clerical then union=not member. This rule is satisfied in 89 of 97 cases.

Looking at other association rules, we can see that whites are often not union members.

These association rules very often involve the attribute union. It seems that whites and clerical workers avoid union membership. For whites this may have a historical context; clerical work is done mainly by women (it is possible that women are less interested in union membership).
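The confidence of a rule such as "occupation=clerical ⇒ union=not member" (89 of 97 cases above) is simply the fraction of records covered by the antecedent that also satisfy the consequent. A minimal sketch on toy records; the dictionary field names are assumptions, not the data set's actual column labels.

```python
def confidence(records, antecedent, consequent):
    """Fraction of records matching `antecedent` that also match `consequent`.

    Both conditions are dicts of attribute -> required value.
    """
    covered = [r for r in records
               if all(r.get(k) == v for k, v in antecedent.items())]
    if not covered:
        return 0.0
    hits = [r for r in covered
            if all(r.get(k) == v for k, v in consequent.items())]
    return len(hits) / len(covered)
```

On the real data, `confidence(data, {"occupation": "clerical"}, {"union": "not member"})` would come out as 89/97.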

RIPPER algorithm[edit]

For the last analysis we need an algorithm that finds rules to determine classes. We picked this algorithm because it uses a very interesting pruning technique based on the description length (DL).

Minimum description length (MDL)[edit]

Any set of data can be represented by a string of symbols. The basic idea of the MDL principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally. Since we want to select the hypothesis that captures the most regularity in the data, we look for the hypothesis with which the best compression can be achieved. This hypothesis should, of course, be true.

This means we do not store the whole encoding of our data, but only the rules that we found. E.g. if we have a sequence 00010001001....0001, then we write only the rule that our sequence consists of the block 0001, together with how many of these blocks the sequence contains.

But one problem appears: how to write the encoding of our rules as short as possible. This question is really complicated, and we are not able to answer it in general.
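The compression idea can be illustrated with a toy two-part code for the 0001 example above. The encoding scheme here (the pattern spelled out literally, plus an integer for the repeat count) is a simplification chosen for illustration, not a complete MDL code.

```python
import math

def literal_bits(s):
    """Bits needed to write a binary string symbol by symbol."""
    return len(s)

def rule_bits(pattern, repeats):
    """Bits for a toy two-part code: the pattern written literally,
    plus an integer giving how many times it repeats."""
    return len(pattern) + math.ceil(math.log2(repeats + 1))

# A highly regular sequence, as in the example above.
data = "0001" * 100
```

Here `literal_bits(data)` is 400 bits, while `rule_bits("0001", 100)` is only 11, so the rule-based description is far shorter, which is exactly what MDL rewards.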


Now we can concentrate on the algorithm. We split the data into two parts: the pruning set contains \frac{1}{3} of the examples and the growing set contains the rest. A positive instance is one that belongs to the specific class of our attribute (the class for which we are looking for the rule set); a negative instance does not. We want to find rules for positive instances.

Initialise two rule sets RI, RS := \emptyset, and for each class, from the least prevalent to the most frequent, DO:

1. Building stage:
Repeat 1.2 and 1.3 until the description length (DL) of the rule set and examples is 64 bits greater than the smallest DL found so far, or there are no positive examples left, or the error rate is \geq 50\%.

1.2 Grow phase:
Grow one rule by greedily adding antecedents (or conditions) to the rule until the rule is perfect (i.e. 100% accurate).

  • The procedure tries every possible value of each attribute and selects the condition with highest information gain.

1.3. Prune phase:
Prune the rule: choose the deletion that maximizes the function \frac{p-n}{p+n}, where p and n are the numbers of examples in PrunePos and PruneNeg, respectively, covered by the rule. PrunePos and PruneNeg are defined as the sets of all positive and negative instances, respectively, in the pruning set.

  • I.e. try all possible combinations of antecedents of the rule and then keep the one with the highest value of the function mentioned above.

Now insert the rule into RI and go back to 1.
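The prune phase (1.3) can be sketched as follows. The rule and record representations are illustrative, not Weka's internal ones, and the sketch greedily drops one antecedent at a time rather than trying every combination at once, a common simplification of the search described above.

```python
def covers(rule, record):
    """A rule is a list of (attribute, value) antecedents; all must match."""
    return all(record.get(attr) == val for attr, val in rule)

def prune_value(rule, prune_pos, prune_neg):
    """RIPPER's pruning metric (p - n) / (p + n) on the pruning set."""
    p = sum(covers(rule, r) for r in prune_pos)
    n = sum(covers(rule, r) for r in prune_neg)
    return (p - n) / (p + n) if p + n else -1.0

def prune_rule(rule, prune_pos, prune_neg):
    """Repeatedly delete the antecedent whose removal improves the metric most."""
    best, best_val = rule, prune_value(rule, prune_pos, prune_neg)
    improved = True
    while improved and len(best) > 1:
        improved = False
        for i in range(len(best)):
            cand = best[:i] + best[i + 1:]
            val = prune_value(cand, prune_pos, prune_neg)
            if val > best_val:
                best, best_val, improved = cand, val, True
    return best
```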

2. Optimisation stage:
After generating the initial rule set RI, generate and prune two variants of each rule R_i using procedures 1.2 and 1.3. One variant is generated from an empty rule, while the other is generated by greedily adding antecedents to the original rule R_i. Moreover, the pruning metric used here is \frac{TP-TN}{P+N}, where TP (TN) is the number of examples in PrunePos (PruneNeg) covered by all rules in RI (with R_i replaced by the variant under consideration), and P+N is the number of all instances in the pruning set. (This different metric is the reason why the variant built now can differ from the original rule.) Then the smallest possible DL for each variant and for the original rule is computed, and the variant with the minimal DL is selected as the final representative of R_i in the rule set.

  • After all the rules in RI have been examined, if there are still residual positives, more rules are generated from them using the building stage again, i.e. go back to 1. (If no rule can be added in 1, go to 3.)

3. Final stage:
Delete from the rule set every rule that would increase the DL of the whole rule set, and add the resulting rule set to RS.

It was interesting that when this algorithm ran in Weka, one rule split on the same attribute twice where it was not necessary: it contained both (wage \leq 12.5) and (wage \leq 5.8) \ldots This means the optimisation stage was not able to find a more optimal solution with a shorter description length, and the algorithm does not catch such redundancies.


Now we focus on the problem of how to describe gender in our data set. Nowadays many governments try to adopt a gender-blind approach. If we find a good set of rules that describes gender well, it can mean two things:

  1. There is a difference between the genders (e.g. women are interested in other jobs), but remember that we only have data about the U.S.A. in 1985!
  2. The gender-blind approach was not applied.

And now we turn to our results.
The RIPPER algorithm predicted female in these cases:

  • (Occupation = clerical) (97/21)
  • (Wage less than 5.4) and (Age greater than 32) (50/12)
  • (Occupation = professional) and (Education less than 17) and (Wage less than 12.05) (54/17)
  • (Occupation = service) and (Wage greater than 6.88) (33/13)
  • (Wage less than 4.75) and (Marr = married) (17/5)
  • Otherwise male (282/61)

The first number in brackets denotes all observations for which the conditions are satisfied. The second number denotes the errors (how many of the observations which fulfil the antecedents do not satisfy the conclusion).

This algorithm classifies 404 of 533 cases correctly.
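The rule set above can be transcribed directly as code. Field names and value spellings here are assumptions (the data set's actual labels may differ), and the rules are tested in the order listed, first match wins, as in a RIPPER decision list.

```python
def predict_sex(r):
    """Apply the RIPPER rules in the order listed; first match wins."""
    if r["occupation"] == "clerical":
        return "female"
    if r["wage"] < 5.4 and r["age"] > 32:
        return "female"
    if r["occupation"] == "professional" and r["education"] < 17 and r["wage"] < 12.05:
        return "female"
    if r["occupation"] == "service" and r["wage"] > 6.88:
        return "female"
    if r["wage"] < 4.75 and r["marr"] == "married":
        return "female"
    return "male"   # default rule
```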


We compare our results from RIPPER with other, simpler algorithms. First we use the easiest method (0R) for finding a rule: it chooses the median or the most frequent value; in our case it chooses male. The 1R classifier uses the minimum-error attribute for prediction. For men it found the occupations other, management and sales; for women, service and clerical.

Comparison of methods
Method Number of correctly classified instances
0R 289
1R 359

We chose these simple methods because they show how much our fit improves when we use something more complicated. RIPPER improves on 0R by about 22 percentage points. We can also compute another quotient measuring the correctness of our rule set: the number of instances correctly classified by RIPPER minus those correctly classified by 0R, divided by the number of all instances minus those correctly classified by 0R. This comes to 47%. So in our data one can find good criteria for distinguishing between the genders.
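Both comparison figures quoted above can be recomputed from the counts in the table:

```python
# 533 instances remain after removing the wage outlier.
n_total = 533
correct = {"0R": 289, "1R": 359, "RIPPER": 404}

# Improvement of RIPPER over 0R as a share of all instances
# (percentage points of accuracy): about 22.
gain_points = 100 * (correct["RIPPER"] - correct["0R"]) / n_total

# Share of 0R's errors that RIPPER corrects: about 47%.
error_reduction = 100 * (correct["RIPPER"] - correct["0R"]) / (n_total - correct["0R"])
```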

Cluster analysis[edit]

Two clusters

Now we turn our attention to cluster analysis, because we want to see if there are some important groups. For the cluster analysis we select only the metric variables: education, experience, age and wage (if we added a nominal variable, the clusters would be divided according to it). We use the farthest-first algorithm. We can see that the red cluster contains people with little experience but quite a good income considering that experience. We can ask how these instances look in the other attributes: they are mostly young people, but they usually studied for more than 12 years, and they earn more than others of the same age. The conclusion agrees with our intuition: it is usually better to study if we would like a higher income (the linear regression model supports this as well). But in our data there are some people who studied a lot and do not earn much money. So this cluster shows successful people.
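The centre-selection step of farthest-first clustering can be sketched as below. This is a simplification: Weka's FarthestFirst additionally assigns every point to its nearest centre afterwards, and typically picks the starting point at random, whereas here we simply start from the first point.

```python
import math

def farthest_first_centres(points, k, start=0):
    """Pick k centres: begin at `start`, then repeatedly take the point
    farthest from its nearest already-chosen centre."""
    centres = [points[start]]
    while len(centres) < k:
        def dist_to_centres(p):
            return min(math.dist(p, c) for c in centres)
        # The farthest point from the current centres becomes the next centre.
        centres.append(max(points, key=dist_to_centres))
    return centres
```

With k = 2, as in our analysis, this puts the two centres in the two most separated regions of the (education, experience, age, wage) space.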

Very short conclusion[edit]

These data show some relationships, but we would need more current data and more instances to draw general conclusions about this topic. What we can say is that, according to our data, it pays to study.


References[edit]

[1] Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques.
[2] Peter Grünwald: A Tutorial Introduction to the Minimum Description Length Principle. Amsterdam.
[3] William W. Cohen (1995): Fast Effective Rule Induction.

