An Analysis of the U.S. System of Tertiary Education

Introduction

The German system of tertiary education is still mainly state-run and is often characterized as encrusted and not very competitive. But as fewer public funds are available to finance universities, the education market is changing. More and more private institutions are being founded, and existing universities have to rethink their concepts to meet the demands of future students. The U.S. system of tertiary education is often cited as very competitive and is mainly privately financed. It may therefore serve as a projection for the German situation.

In this paper we use the College Data Set to compare the main characteristics of the two main U.S. school types offering tertiary education. The aim of the analysis is to detect patterns that characterize a “successful” college or university. Once uncovered, these patterns will help to give advice on how to better manage German universities.

Table 1. Relative proportions of public and private expenditure on tertiary education, as a percentage (2003)
Country          Public sources   Private sources
Germany          87.1             12.9
United States    42.8             57.2
OECD average     76.4             23.6

Description of the Data Set

The College Data Set is provided by the Data and Story Library (DASL). It contains information on the top 25 liberal arts colleges and the top 25 research universities of the United States. For these 50 cases, information is stored in 8 variables. There are no missing values in this data set. The first variable is a text variable that contains the name of each school. The second variable is a binary variable coding the school type as text. The other 6 variables are metric variables. The following table provides an overview of the content of all 8 variables.

Table 2. Variables
No. Variable Name Type Content
1. [School] Text Contains the name of each school
2. [School_Type] Text Coded 'LibArts' for liberal arts and 'Univ' for university
3. [SAT] Metric Median combined Math and Verbal SAT score of students
4. [Acceptance] Metric  % of applicants accepted
5. [$/Student] Metric Money spent per student in dollars
6. [Top_10%] Metric  % of students in the top 10% of their high school graduating class
7. [PhD%] Metric  % of faculty at the institution that have PhD degrees
8. [Grad%] Metric  % of students at institution who eventually graduate


Explanations: In the U.S. both liberal arts colleges and universities offer tertiary education. Liberal arts colleges enroll fewer students than universities. These students are expected not to focus on one academic discipline but to study a wide variety of courses. At liberal arts colleges, research is less important than at universities. The SAT is a standardized test used by U.S. colleges and universities to assess their applicants. It was formerly called the Scholastic Aptitude Test and later the Scholastic Assessment Test, but since 1994 the name has been declared not to be an initialism.

General Considerations

Having learned the content of the College Data Set in the last section, we can now reflect on how to use the variables to analyze the “success” of colleges and universities. In our study we focus on a school’s function as a provider of education. In this sense, the percentage of students who eventually graduate will serve as a crude measure of success, i.e. [Grad%] will be considered the dependent variable. The graduation rate is seen as influenced by the teaching conditions offered by the school as well as by the ambition and qualification of the students. If more money is spent on better technical equipment in classrooms, computer labs etc. and more qualified faculty is hired, teaching conditions will improve. Here we assume that the percentage of faculty at the institution having PhD degrees is an indicator of their qualification as teachers. Moreover, one expects students who did very well in high school or in a generalized knowledge test to cope better with the diverse challenges at a university or college than other students. We also suppose that the qualification of the admitted applicants increases as the acceptance rate decreases. Therefore we will consider [Top_10%], [SAT], [Acceptance], [$/Student] and [PhD%] as explanatory variables.

Univariate Data Analysis

In this section the variables are studied in detail in the order of their appearance in the data file. The variable [School] stores the abbreviated names of the colleges and universities. Table 3 lists all colleges and universities in alphabetical order. The variable [School_Type] indicates that the first 25 observations of the data set are liberal arts colleges and the following 25 are universities. A first comparison of the two groups is drawn in this section.

Before the metric variables are analyzed, some general remarks:

  • In all graphs and plots colleges of liberal arts are colored blue and universities are marked in red.
  • Ideas of statistical tests, e.g. on the difference of parameters of the two groups, were dismissed. The observations do not represent a good sample of the population, since they are the top 25 universities and the top 25 liberal arts colleges. Therefore our data set will be treated as a population and not as a sample.
  • In some cases outliers are replaced in order to make possible conclusions more robust. Because of the relatively small size of the data set (only 25 observations per group), outliers were replaced by the mean of the remaining observations of the respective group rather than deleted. In the case of [PhD%] it was necessary to repeat this process twice, because new outliers appeared after replacing the first ones.
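The replacement procedure described in the last remark can be sketched as follows. This is only an illustration under stated assumptions: the text says that outliers were read off the boxplots, so the usual 1.5 × IQR boxplot fences are assumed here, and the function name `replace_outliers` and the sample values are hypothetical, not taken from the College Data Set.

```python
import numpy as np

def replace_outliers(values, k=1.5, max_rounds=10):
    """Replace boxplot outliers by the mean of the remaining values.

    Assumption: a value is an outlier if it lies outside
    [Q1 - k*IQR, Q3 + k*IQR]. The check is repeated because new
    outliers can appear once the first ones have been replaced.
    """
    x = np.asarray(values, dtype=float).copy()
    replaced = np.zeros(x.size, dtype=bool)  # positions replaced so far
    for _ in range(max_rounds):
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        outside = (x < q1 - k * iqr) | (x > q3 + k * iqr)
        new_outliers = outside & ~replaced
        if not new_outliers.any():
            break
        replaced |= new_outliers
        # every replaced position gets the mean of the never-flagged values
        x[replaced] = x[~replaced].mean()
    return x, replaced

# Hypothetical percentages, for illustration only.
demo = [90, 92, 94, 95, 96, 97, 98, 99, 58]
cleaned, flags = replace_outliers(demo)
print(cleaned, flags)
```

Replacing instead of deleting keeps all 25 observations per group, at the price of shrinking the variance of the corrected variable.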


Figure 1. Boxplot and Histogram for [SAT]

Analyzing the third variable [SAT], one finds that the median (1280) and mean (1271) SAT scores of universities are slightly higher than their counterparts in the group of colleges (1255 and 1257, respectively). The distribution for universities is skewed to the left, whereas that for colleges is almost symmetric. University students scored in a wider range on the SAT, which shows that the group of students accepted at colleges is more homogeneous in terms of SAT scores. This is also reflected in the standard deviations of 43.674 for colleges and 76.845 for universities.



Figure 2. Boxplot and Histogram for [Acceptance]

A rather large difference can be found in the percentage of applicants accepted, stored in the variable [Acceptance]. While the median liberal arts college accepts 38% of all applicants, this rate is only 31% for the median university. This might indicate a more thorough selection process at universities. Both distributions are skewed to the right, visible through the difference between mean and median. The variability is nearly identical, indicated by standard deviations of 12.517 for colleges and 13.875 for universities. This variable has the potential to mislead, since it does not give any information about the number of applicants and the number of applicants accepted. For example, one could suspect that a very renowned university such as Harvard has many more applicants than Fresno State University, so that Harvard, although it is much larger in the number of students, can only accept a lower percentage of applicants, or vice versa.

Figure 3. Boxplot and Histogram for [$/Student]

The boxplot for the fifth variable indicates an outlier in the group of universities. At the California Institute of Technology US$ 102,262 are spent per student, while the median university spends US$ 37,137 per student. The median liberal arts college spends US$ 20,377. Not only is the median spending per student at a college only slightly more than half of the amount spent at a university, the range is also much smaller (10,359 compared to 82,897, or 42,556 without the outlier). Both distributions are somewhat skewed to the right. In the boxplot and histogram without the outlier, the distribution for universities is much more symmetric.


Figure 4. Boxplot and Histogram for [Top_10%]

From the variable [Top_10%] we learn that at the median university 85% of the students were in the top 10% of their high school graduating class, while this figure is only 68% for the liberal arts colleges. Two outliers can be identified with the help of the boxplots. At Carnegie Mellon University in Pittsburgh and the University of Rochester only 52% of the students were in the top 10% of their high school graduating class. The two outliers drastically influence the skewness of the university distribution, which is skewed more strongly to the left than the distribution with replaced outliers.



Figure 5. Boxplot and Histogram for [PhD%]

Next, we find a slightly higher percentage of faculty with a PhD degree at the median university (96%) than at the median college (91%). Here, the two outliers among the universities are Johns Hopkins University in Baltimore with 58% and Northwestern University with 79%. Both distributions are skewed to the left. The picture changes drastically when the outliers are replaced. The group of universities is then much more homogeneous and the range decreases to only 9 percentage points, although the distribution is still skewed to the left. Furthermore, a relationship between the education of the teaching staff and the money spent is expected.


A summary of the results so far suggests that access to universities is harder because of lower acceptance rates and the higher qualification of the students accepted. Moreover, we find that more money is invested per student at universities and that the teaching staff is more highly educated. The financial indicator has to be treated with caution, because universities spend more money on research than liberal arts colleges. This spending does not directly benefit students, but it could explain a higher share of teaching staff with a PhD degree.


Figure 6. Boxplot and Histogram for [Grad%]

The most interesting point of this analysis lies in the similar graduation percentages obtained in both groups of schools. According to the last variable [Grad%], 86% of all students at the median university eventually graduate, compared to 85% of the students at the median liberal arts college. Considering the mean, the colleges even have a slight advantage with 84% compared to 83%. Again, the variability in the group of universities is much higher and the distribution is more strongly skewed to the left. One reason for the higher variability may be athletes who train at universities but turn to a professional career early and thus never graduate. This variable may also contain some inaccuracies, as universities and colleges have an incentive to inflate their graduation percentage, because it is widely considered a measure of their quality and excellence.


Since we defined [Grad%] as the variable that should ideally be maximized, the following table lists the three highest and the three lowest cases for both colleges and universities.

Table 4. Colleges and universities: highest and lowest cases
Rank in group   Liberal arts college   [Grad%]   University       [Grad%]
1.              Williams               93        Dartmouth        95
2.              Amherst                93        Yale             93
3.              Middlebury             92        Princeton        93
23.             Claremont McKenna      74        U of Rochester   73
24.             Grinnell               73        Berkeley         68
25.             Occidental             72        UCLA             61

Multivariate Data Analysis

In this section relationships between variables in the College Data Set are studied. The central task is to determine how the other variables can explain [Grad%], which we defined as the measure of success of colleges and universities. To optimize the regression model, we will first analyze the outliers in more detail and take a closer look at the relation between [Grad%] and single explanatory variables to gain a better understanding of the data set.

Outlying values

Helpful devices to visualize outliers and bilateral relationships between all variables in the data set are the following scatterplot matrices. Figures 7 and 9 plot the original data set separately for universities and colleges. Outliers appear only in the group of universities; they are marked in Figure 7. The outlier in [$/Student], the California Institute of Technology, is marked as a black x. The outlying values of [Top_10%] (Carnegie Mellon University and University of Rochester) are shown as blue triangles. Green stars represent Johns Hopkins University and Northwestern University, the outliers in [PhD%]. Comparing Figure 7 and Figure 9, one gets the impression that the university data is less spread out than the data for colleges. But this appearance is deceptive, since outliers heavily influence the scale of the plots. Figure 8 shows the university data corrected for outliers.

Figure 7. Scatterplot matrix for the group of universities with marked outliers
Figure 8. Scatterplot matrix for the group of universities with replaced outliers
Figure 9. Scatterplot matrix for the group of colleges (the group did not contain any outliers)
Table 5. Universities: outliers
[School]                   [SAT]   [Acceptance]   [$/Student]   [Top_10%]   [PhD%]   [Grad%]
Average without outliers   1,271   35             36,092        84          95       83
Cal Tech                   129     -4             66,170        14          3        -8
U of Rochester             -116    21             2,505         -32         1        -10
Carnegie Mellon            -46     29             -2,485        -32         -11      -6
Northwestern               -41     12             -7,241        -7          -16      -1
Johns Hopkins              19      13             9,368         -15         -37      3

In total, five outliers in three variables were identified with the help of the boxplots shown above. Table 5 compares the observations in which outliers were identified to the rest of the universities. The first row lists the mean of every variable without outliers. The following rows contain each variable's deviation from this "corrected" mean. Deviations relating to outliers are printed in bold. As we already saw from the boxplots, only Cal Tech is an outlier at the upper bound; the other four cases are negative deviations. It is interesting to see that although Cal Tech spends far more money than the average university and reaches the highest [SAT] and [Top_10%] values of all colleges and universities, only 75% of its students eventually graduate. We guess that a lot of the money is spent on technological research equipment that does not necessarily improve the learning conditions of students and therefore the graduation rate. Nevertheless, the school is attractive to highly gifted students. A look at the downward outliers in [Top_10%], U of Rochester and Carnegie Mellon, holds no surprise. Both also have low [SAT] values and graduation rates of only 73% and 77%. Analyzing Northwestern and Johns Hopkins reveals that although both deviate strongly downwards from the mean in the percentage of faculty having a PhD degree, they reach average values in [Grad%]. That might indicate that other qualifications of the faculty are relevant for good teaching performance.

We also see that all five cases are outliers in only one dimension of the observation. Excluding the observations would reduce the group of universities by one fifth. Therefore, we decided to replace each outlying value by the mean value of the remaining observations in the group of universities, e.g. for Cal Tech the value for [$/Student] of 102,262 is replaced by 36,092. As explained above, this procedure had to be repeated twice for [PhD%], since new outliers kept appearing. In total, four values were replaced in [PhD%]. Comparing the scatterplot matrix for the "corrected" university data in Figure 8 to Figure 9, one can detect a positive relationship between [SAT] and [Grad%] as well as a negative relationship between [Acceptance] and [Grad%], both for colleges and universities.
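As a plausibility check, the deviations in Table 5 can be added back to the "corrected" means to recover the raw values quoted in the text. The short sketch below does this for the outlying entries and for [Grad%]; the Cal Tech deviation in [$/Student] is read as 66,170, and the dictionary layout and the helper `raw_value` are purely illustrative.

```python
# Means of the universities without outliers (first row of Table 5).
mean_without_outliers = {
    "SAT": 1271, "Acceptance": 35, "$/Student": 36092,
    "Top_10%": 84, "PhD%": 95, "Grad%": 83,
}

# Selected deviations from that mean, as listed in Table 5.
deviations = {
    "Cal Tech":        {"$/Student": 66170, "Grad%": -8},
    "U of Rochester":  {"Top_10%": -32, "Grad%": -10},
    "Carnegie Mellon": {"Top_10%": -32, "Grad%": -6},
    "Northwestern":    {"PhD%": -16, "Grad%": -1},
    "Johns Hopkins":   {"PhD%": -37, "Grad%": 3},
}

def raw_value(school, variable):
    """Raw value = mean without outliers + deviation from Table 5."""
    return mean_without_outliers[variable] + deviations[school][variable]

for school, devs in deviations.items():
    for variable in devs:
        print(f"{school:16s} {variable:10s} {raw_value(school, variable)}")
```

The recovered values agree with the figures in the text: 102,262 dollars per student at Cal Tech, 52% in [Top_10%] for Rochester and Carnegie Mellon, and 79% and 58% in [PhD%] for Northwestern and Johns Hopkins.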

Bivariate Data Analysis

As a preparation for the multiple regression we will now study the impact of every single variable on [Grad%] via a simple regression. To this end, we plotted [Grad%] against each of the other metric variables, again separately for the original university data and the college data. We also included linear regression lines in the graphs. Clearly, none of the regression lines fits the data very well. For universities, this is especially true for the last three variables, whose values are widely spread above and below the line. In the case of [$/Student] and [PhD%] the regression line seems to be influenced heavily by outliers. Therefore, we made similar plots for the data set with replaced outliers. As there were only replacements in the last three variables, changes are only observed there. The largest change affects the relationship between [Grad%] and [Top_10%], where a change in sign is observed.

Figure 10. Simple regressions on [Grad%] for universities with outliers
Figure 11. Simple regressions on [Grad%] for universities with replaced outliers
Figure 12. Simple regressions on [Grad%] for liberal arts colleges

To improve the approximation of the data by a regression, we applied several transformations to the regressors. The following two tables provide an overview on how the transformation improved the goodness of fit of the regression measured by the adjusted R-squared. For the variables [SAT] and [$/Student] the exponential transformation was not applied since their values are too large.

Table 6. Universities: regression of [Grad%] on x, adjusted R^2
Variable       Outliers replaced   x          ln(x)      sqrt(x)    e^x        x^2
[SAT]          no                  0.39968    0.41472    0.40736    .          0.38340
[Acceptance]   no                  0.36804    0.41463    0.39557    -0.02306   0.30002
[$/Student]    no                  -0.03011   -0.00149   -0.01602   .          -0.04342
[Top_10%]      no                  -0.01904   -0.00598   -0.01263   0.03057    -0.03027
[PhD%]         no                  -0.04331   -0.04347   -0.04342   -0.01223   -0.04298
[$/Student]    yes                 0.10518    0.08824    0.09796    .          0.11032
[Top_10%]      yes                 -0.03977   -0.04187   -0.04093   0.03057    -0.03669
[PhD%]         yes                 -0.04311   -0.04315   -0.04313   -0.01181   -0.04307

Table 7. Colleges: regression of [Grad%] on x, adjusted R^2
Variable       x          ln(x)      sqrt(x)    e^x        x^2
[SAT]          0.14380    0.14339    0.14360    .          0.14417
[Acceptance]   0.27732    0.29738    0.28850    0.10754    0.25122
[$/Student]    0.02956    0.02122    0.02547    .          0.03712
[Top_10%]      0.12856    0.10544    0.11704    0.11956    0.15080
[PhD%]         -0.02770   -0.03068   -0.02923   0.06005    -0.02441

Table 6 shows the results for universities, both for the original data and for the data corrected for outliers. For each variable the best adjusted R^2 is highlighted in red, indicating the optimal transformation. The column called "x" relates to the untransformed explanatory variables. We see that [Grad%] can be explained best by [SAT] and [Acceptance]. Applying the logarithm to them gives the two largest adjusted R^2 of all regressions for universities. For the other three variables [$/Student], [Top_10%] and [PhD%] only very small values of adjusted R^2 are achieved, even after applying transformations. This corresponds to the impression we got from the plots. For the variable [$/Student], the adjusted R^2 improved considerably when the data set with replaced outliers was used. In that case, a quadratic transformation became optimal. For [Top_10%] no improvement could be observed, while there was a small improvement for [PhD%].

Analyzing colleges, we see that ln([Acceptance]) explains the graduation rate best, followed by [SAT]^2 and the untransformed [Top_10%]. While the influence of [Acceptance] and [SAT] is much lower than for universities, a higher impact of [Top_10%] can be found.
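The comparison behind Tables 6 and 7 can be sketched in a few lines: each regressor is transformed, a simple regression of the response on the transformed regressor is fitted, and the adjusted R^2 values are compared. The sketch below is not the software used in this paper. The function names, the cut-off of 100 above which e^x is skipped, and the demonstration data are assumptions made for illustration only.

```python
import numpy as np

def adjusted_r2(x, y):
    """Adjusted R^2 of the simple regression y = a + b*x (one regressor)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (intercept + slope * x)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 2)

TRANSFORMS = {
    "x": lambda v: v,
    "ln(x)": np.log,
    "sqrt(x)": np.sqrt,
    "e^x": np.exp,
    "x^2": np.square,
}

def compare_transforms(x, y, exp_limit=100.0):
    """Adjusted R^2 for every transformation of a positive regressor x.

    e^x is skipped when x exceeds exp_limit (a hypothetical cut-off),
    mirroring the remark that it was not applied to very large values.
    """
    x = np.asarray(x, dtype=float)
    results = {}
    for name, transform in TRANSFORMS.items():
        if name == "e^x" and x.max() > exp_limit:
            continue
        results[name] = adjusted_r2(transform(x), y)
    return results

# Hypothetical data in which y depends exactly on ln(x).
x_demo = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y_demo = 2.0 + 3.0 * np.log(x_demo)
scores = compare_transforms(x_demo, y_demo)
print(max(scores, key=scores.get), scores)
```

With only one regressor, the adjusted R^2 can become negative when the fit is poor, which is why several entries in Tables 6 and 7 are below zero.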

Advanced Graphical Tools

As an introduction to the following multiple regression, now two advanced graphical tools will be applied to visualize all variables from all observations at once. Again, colleges are marked in blue and universities appear red.

Figures 13 and 14 show two parallel coordinates plots. The first one shows the original data set. In the second one, outliers in the group of universities have been replaced by the mean of the remaining observations. In these figures the coordinates of each observation are plotted in an orthogonal coordinate system and connected by a curve. On the x-axis the variables are plotted in the order used before: 1-[SAT], 2-[Acceptance], 3-[$/Student], 4-[Top_10%], 5-[PhD%] and 6-[Grad%]. In accordance with the results from the univariate data analysis, both plots show that the data for universities is more heterogeneous than for colleges. In Figure 13 one easily detects the upper outlier in the third variable [$/Student] and the lower outlier in the fifth variable [PhD%]. They disappear in the second parallel coordinates plot. Furthermore, the two plots confirm that universities and colleges differ most in the variables [$/Student], [Top_10%] and [PhD%].

Figure 13. Parallel coordinates plot
Figure 14. Parallel coordinates plot with replaced outliers
Figure 15. Star diagram

Having a look at the star diagram in Figure 15, where each axis of a star represents one variable of the observation, one finds more large stars in the group of universities, i.e. higher variable values. The large variety of star shapes once again confirms the higher variability in the group of universities.

Regression

As the overall goal of this paper is to find a model that allows us to predict the success of an institution of higher learning, measured by [Grad%], this section is about finding this model via regression analysis. From the univariate and the bivariate data analysis we learned that there are rather large differences between universities and colleges, both in the variable values and in the impact of the variables on [Grad%].

The results from the sections above are used to determine the optimal regression model. The best model is the one with the best fit, which means we try to maximize adjusted R^2 and keep the constant parameter and all variables included in the model significant. We used the methods of forward selection, backward elimination and stepwise regression to select the optimal set of regressors. Moreover, models obtained by the listed methods were improved by manual adjustments.
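To make the selection idea concrete, the following sketch implements a plain forward selection that adds, step by step, the regressor that raises the adjusted R^2 the most. It is a simplification and not the procedure actually run for this paper: it ignores the significance requirement and the manual adjustments mentioned above, and the function names and the synthetic demonstration data are hypothetical.

```python
import numpy as np

def fit_adjusted_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on the columns of X plus a constant."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_selection(X, y, names):
    """Greedily add the regressor that improves adjusted R^2 the most."""
    chosen, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: fit_adjusted_r2(X[:, chosen + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break  # no candidate improves the fit any further
        best = scores[j_best]
        chosen.append(j_best)
        remaining.remove(j_best)
    return [names[j] for j in chosen], best

# Synthetic demonstration data: y depends on x0 and x2 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=50)
selected, score = forward_selection(X_demo, y_demo, ["x0", "x1", "x2"])
print(selected, round(score, 3))
```

Backward elimination works the other way round, starting from the full model and dropping the regressor whose removal hurts the criterion least.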

Universities

  • The first option is a regression with the original data set without any transformation including outliers. The best model for this case is the following:
Contents of ANOVA
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                  1211.276     2   605.638      19.708   0.0000"
[ 5,] "Residuals                    676.084 2e+01    30.731"
[ 6,] "Total Variation                 1887    24    78.640"
[ 7,] ""
[ 8,] "Multiple R      = 0.80111"
[ 9,] "R^2             = 0.64178"
[10,] "Adjusted R^2    = 0.60922"
[11,] "Standard Error  = 5.54356"
Contents of Summary
[1,] "Variables in the Equation for Y:"
[2,] " "
[3,] ""
[4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[5,] "  __________________________________________________________________________________"
[6,] "b[ 0,]=        -65.4547      23.6628       0.0000     -2.7661   0.0113   Constant   "
[7,] "b[ 1,]=          0.1265       0.0204       1.0967      6.2152   0.0000   [SAT]"
[8,] "b[ 2,]=         -0.0003       0.0001      -0.6443     -3.6514   0.0014   [$/Student]"

It shows that the model achieves an adjusted R^2 of about 61% and includes only the variables [SAT] and [$/Student] besides a constant. It is surprising that [$/Student] has a negative coefficient. This probably results from the exceptionally high outlier California Institute of Technology, which shows a below-average graduation percentage.
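The summary statistics of the first ANOVA table can be re-derived from its sums of squares, which also clarifies the residual degrees of freedom that the output prints as "2e+01" (presumably 22 = 25 - 2 - 1 shown with one significant digit). The sketch below uses only numbers from the listing above; the variable names are illustrative.

```python
# Sums of squares from the first ANOVA table (universities, original data).
ss_regression = 1211.276
ss_residual = 676.084
n_obs = 25         # universities
n_regressors = 2   # [SAT] and [$/Student]

ss_total = ss_regression + ss_residual
df_residual = n_obs - n_regressors - 1  # 22
r2 = ss_regression / ss_total
# adjusted R^2 compares mean squares instead of sums of squares
adj_r2 = 1 - (ss_residual / df_residual) / (ss_total / (n_obs - 1))
f_stat = (ss_regression / n_regressors) / (ss_residual / df_residual)

print(f"R^2 = {r2:.5f}, adjusted R^2 = {adj_r2:.5f}, F = {f_stat:.3f}")
```

The results match the printed R^2 of 0.64178, adjusted R^2 of 0.60922 and F-test value of 19.708 up to rounding.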

  • Applying the results from the bivariate analysis, a transformation of some variables seems appropriate. We will therefore include the natural logarithms of [SAT], [Acceptance] and [$/Student] as well as e^[Top_10%] and e^[PhD%] in the regression. Outliers have not been replaced yet.
Contents of ANOVA
[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                  1404.812     4   351.203      14.556   0.0000"
[ 5,] "Residuals                    482.548 2e+01    24.127"
[ 6,] "Total Variation                 1887    24    78.640"
[ 7,] ""
[ 8,] "Multiple R      = 0.86274"
[ 9,] "R^2             = 0.74433"
[10,] "Adjusted R^2    = 0.69319"
[11,] "Standard Error  = 4.91196"

Contents of Summary
[ 1,] "Variables in the Equation for Y:"
[ 2,] " "
[ 3,] ""
[ 4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[ 5,] "  __________________________________________________________________________________"
[ 6,] "b[ 0,]=       -536.1789     170.1138       0.0000     -3.1519   0.0050   Constant   "
[ 7,] "b[ 1,]=         89.8796      22.7090       0.6218      3.9579   0.0008   [SAT]"
[ 8,] "b[ 2,]=         -6.1150       3.3567      -0.2743     -1.8217   0.0835   [Acceptance]"
[ 9,] "b[ 3,]=          0.0000       0.0000      -0.4397     -3.6234   0.0017   [Top_10%]"
[10,] "b[ 4,]=          0.0000       0.0000      -0.1932     -1.7067   0.1034   [PhD%]"

The model's fit has improved by roughly 8 percentage points, but [Acceptance] and [PhD%] are not significant at the 5% level. This does not conform to the standards set in the beginning. In the following, both are therefore excluded from the regression model.
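The significance statements can be checked by hand, because each t-value is simply the coefficient divided by its standard error. The sketch below compares the t-values of the four-regressor model with the two-sided 5% critical value of Student's t distribution for 25 - 4 - 1 = 20 degrees of freedom (about 2.086). For the two exponential terms the printed Beta and SE round to 0.0000, so their t-values are taken directly from the listing. The variable names are illustrative.

```python
T_CRIT_5PCT_DF20 = 2.086  # two-sided 5% critical value, 20 degrees of freedom

# (Beta, SE) pairs as printed in the summary of the four-regressor model.
coefficients = {
    "ln([SAT])": (89.8796, 22.7090),
    "ln([Acceptance])": (-6.1150, 3.3567),
}
t_values = {name: beta / se for name, (beta, se) in coefficients.items()}

# Beta and SE of the e^x terms round to 0.0000, so use the printed t-values.
t_values["e^[Top_10%]"] = -3.6234
t_values["e^[PhD%]"] = -1.7067

significant = {name: abs(t) > T_CRIT_5PCT_DF20 for name, t in t_values.items()}
for name, t in t_values.items():
    print(f"{name:18s} t = {t:8.4f}  significant at 5%: {significant[name]}")
```

Under this check only the [SAT] and [Top_10%] terms pass the 5% threshold, which matches the two regressors kept in the reduced model that follows.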

Contents of ANOVA
[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                  1253.320     2   626.660      21.744   0.0000"
[ 5,] "Residuals                    634.040 2e+01    28.820"
[ 6,] "Total Variation                 1887    24    78.640"
[ 7,] ""
[ 8,] "Multiple R      = 0.81490"
[ 9,] "R^2             = 0.66406"
[10,] "Adjusted R^2    = 0.63352"
[11,] "Standard Error  = 5.36843"
Contents of Summary
[1,] "Variables in the Equation for Y:"
[2,] " "
[3,] ""
[4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[5,] "  __________________________________________________________________________________"
[6,] "b[ 0,]=       -745.7651     133.0529       0.0000     -5.6050   0.0000   Constant   "
[7,] "b[ 1,]=        116.1058      18.6299       0.8033      6.2322   0.0000   [SAT]"
[8,] "b[ 2,]=          0.0000       0.0000      -0.4947     -3.8382   0.0009   [Top_10%]"

The adjusted R^2 is now a little lower at 63%, but all variables included in the model are significant at the 5% level.

  • Now we repeat the regression analysis for universities as above, but with replaced outliers. First, again without transformations.
Contents of ANOVA
[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                  1293.761     3   431.254      15.257   0.0000"
[ 5,] "Residuals                    593.599 2e+01    28.267"
[ 6,] "Total Variation                 1887    24    78.640"
[ 7,] ""
[ 8,] "Multiple R      = 0.82794"
[ 9,] "R^2             = 0.68549"
[10,] "Adjusted R^2    = 0.64056"
[11,] "Standard Error  = 5.31664"
Contents of Summary
[1,] "Variables in the Equation for Y:"
[2,] " "
[3,] ""
[4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[5,] "  __________________________________________________________________________________"
[6,] "b[ 0,]=         75.0541      28.4411       0.0000      2.6389   0.0153   Constant   "
[7,] "b[ 1,]=          0.0506       0.0179       0.4386      2.8333   0.0100   [SAT]"
[8,] "b[ 2,]=         -0.3776       0.1074      -0.5908     -3.5169   0.0020   [Acceptance]"
[9,] "b[ 3,]=         -0.5138       0.1495      -0.4801     -3.4367   0.0025   [Top_10%]"

This model's adjusted R^2 of 64% is marginally better than those of the models above. All variables are significant at the 5% level.

  • Now we use ln([SAT]), ln([Acceptance]), [$/Student]^2, e^[Top_10%] and e^[PhD%] as suggested by the bivariate analysis.
Contents of ANOVA
[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                  1013.774     2   506.887      12.765   0.0002"
[ 5,] "Residuals                    873.586 2e+01    39.708"
[ 6,] "Total Variation                 1887    24    78.640"
[ 7,] ""
[ 8,] "Multiple R      = 0.73290"
[ 9,] "R^2             = 0.53714"
[10,] "Adjusted R^2    = 0.49506"
[11,] "Standard Error  = 6.30146"
Contents of Summary
[1,] "Variables in the Equation for Y:"
[2,] " "
[3,] ""
[4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[5,] "  __________________________________________________________________________________"
[6,] "b[ 0,]=       -304.4200     203.4767       0.0000     -1.4961   0.1488   Constant   "
[7,] "b[ 1,]=         58.5971      27.1347       0.4054      2.1595   0.0420   [SAT]"
[8,] "b[ 2,]=         -9.0348       4.1854      -0.4052     -2.1586   0.0421   [Acceptance]"

This is the best model with replaced outliers and transformed variables. It is a disappointing result, as the adjusted R^2 is about 15 percentage points lower than that of the best model obtained so far and the constant is not significant.

  • To sum up the regression analysis for universities, the best model is obtained by replacing outliers and not transforming the regressors. It is characterized by the following equation:

[Grad%] = 75.0541 + 0.0506*[SAT] - 0.3776*[Acceptance] - 0.5138*[Top_10%]

This model yields an adjusted R^2 of 64.06%, whereas the best model (constant, ln([SAT]), e^[Top_10%]) for the data set including the outliers reaches an adjusted R^2 of 63.35%. We see that replacing the outliers does not improve the fit of the model drastically. Furthermore, despite all the transformations and outlier replacements, the difference from the model that just uses the original data (adjusted R^2 of 60.92%; constant, [SAT], [$/Student]) is not large either.
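To illustrate how the final university equation is applied, the sketch below evaluates it at values close to the university medians reported in the univariate analysis (SAT 1280, acceptance rate 31%, 85% of students from the top 10% of their class). The function name is hypothetical; the coefficients are those of the equation above.

```python
def predict_grad_university(sat, acceptance, top_10):
    """Predicted [Grad%] from the final university model (outliers replaced)."""
    return 75.0541 + 0.0506 * sat - 0.3776 * acceptance - 0.5138 * top_10

# Inputs close to the university medians quoted in the univariate analysis.
prediction = predict_grad_university(sat=1280, acceptance=31, top_10=85)
print(round(prediction, 1))

# Effect of a 10-point higher median SAT score, all else being equal.
print(round(predict_grad_university(1290, 31, 85) - prediction, 3))
```

The prediction of roughly 84% lies between the mean (83%) and the median (86%) graduation rate of the universities. Being a linear model fitted to only 25 schools, it should not be extrapolated far outside the observed range of the regressors.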

It seems logical that [SAT] has a positive coefficient, because higher SAT scores indicate that the students are more talented and therefore more likely to graduate. Moreover, a negative coefficient for [Acceptance] also makes sense, as it is usually harder to get into a better and more renowned university, which offers very good education. The negative coefficient for [Top_10%] seems implausible, since one expects students with high SAT scores also to have done well in high school, and top high school students to achieve high graduation percentages at a university.

Liberal Arts Colleges

As we saw in the bivariate analysis, there are less clear relations between the explanatory variables and [Grad%] in the group of liberal arts colleges. This result will also become visible in the following regressions. Since the group of colleges does not contain any outliers, only two different options of a regression are studied.

  • Regression without transformation
Contents of ANOVA
[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                   273.812     1   273.812      10.210   0.0040"
[ 5,] "Residuals                    616.828 2e+01    26.819"
[ 6,] "Total Variation                  891    24    37.110"
[ 7,] ""
[ 8,] "Multiple R      = 0.55447"
[ 9,] "R^2             = 0.30743"
[10,] "Adjusted R^2    = 0.27732"
[11,] "Standard Error  = 5.17867"
Contents of Summary
[1,] "Variables in the Equation for Y:"
[2,] " "
[3,] ""
[4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[5,] "  __________________________________________________________________________________"
[6,] "b[ 0,]=         95.0651       3.5786       0.0000     26.5651   0.0000   Constant   "
[7,] "b[ 1,]=         -0.2699       0.0845      -0.5545     -3.1953   0.0040   [Acceptance]"

The model reaches an adjusted R^2 of only 27.7%, far below the value achieved by the optimal model for universities. It includes a constant and [Acceptance], both of which are significant.
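The summary statistics in the output above follow directly from the ANOVA sums of squares. The following minimal sketch recomputes them from the reported values, assuming n = 25 colleges and one regressor; the function name `anova_summary` is our own and not part of XploRe. Up to rounding, the results agree with the reported R^2, adjusted R^2, F-test and standard error, and the squared t-value of [Acceptance] matches the F-test, as it must in a regression with a single regressor.

```python
import math


def anova_summary(ss_reg, ss_res, n, p):
    """Derive fit statistics from ANOVA sums of squares.

    n is the number of observations, p the number of regressors
    (not counting the constant).
    """
    ss_tot = ss_reg + ss_res
    df_res = n - p - 1                      # residual degrees of freedom
    r2 = ss_reg / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df_res
    ms_res = ss_res / df_res                # residual mean square
    f_stat = (ss_reg / p) / ms_res
    return {
        "df_res": df_res,
        "r2": r2,
        "adj_r2": adj_r2,
        "f": f_stat,
        "se": math.sqrt(ms_res),            # standard error of the regression
    }


# Values copied from the untransformed regression for liberal arts colleges.
stats = anova_summary(ss_reg=273.812, ss_res=616.828, n=25, p=1)
for name, value in stats.items():
    print(f"{name:7s} = {value:.5f}")

# t-value of [Acceptance] recomputed from the rounded Beta and SE columns.
t_stat = -0.2699 / 0.0845
print(f"t^2     = {t_stat ** 2:.3f}")
```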

  • Regression with the transformations advised by the bivariate analysis: ([SAT])^2, ln([Acceptance]), ([$/Student])^2 and e^[PhD%]; [Top_10%] remains unchanged.
Contents of ANOVA
[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                   290.936     1   290.936      11.158   0.0028"
[ 5,] "Residuals                    599.704    23    26.074"
[ 6,] "Total Variation                  891    24    37.110"
[ 7,] ""
[ 8,] "Multiple R      = 0.57154"
[ 9,] "R^2             = 0.32666"
[10,] "Adjusted R^2    = 0.29738"
[11,] "Standard Error  = 5.10628"
Contents of Summary
[1,] "Variables in the Equation for Y:"
[2,] " "
[3,] ""
[4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[5,] "  __________________________________________________________________________________"
[6,] "b[ 0,]=        124.5905      12.1586       0.0000     10.2471   0.0000   Constant   "
[7,] "b[ 1,]=        -11.0691       3.3138      -0.5715     -3.3404   0.0028   [Acceptance]"

As expected, both regressions yield a much lower adjusted R^2 than the regressions for the group of universities. The transformation improves the result only slightly, from ~28% to ~30%. Both models rely on the most controversial variable of the data set, since [Acceptance] does not give unbiased information, as mentioned before. We conclude that other factors, which are not included in the data set, must contribute to the graduation percentage of colleges. We suspect that, for example, class size or the number of students supervised by a single counselor plays a much more important role at liberal arts colleges.
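Choosing a transformation by its adjusted R^2 in a simple regression can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (generated from a logarithmic relation plus a small deterministic disturbance), the names `adjusted_r2` and `candidates` are our own, and the code is not the XploRe procedure used for the results above.

```python
import math


def adjusted_r2(x, y):
    """Return (adjusted R^2, slope) of a simple regression of y on x with a constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 2), slope


# Synthetic data, NOT the College Data Set: 25 made-up acceptance rates
# and graduation percentages that depend on ln(acceptance).
acceptance = [10.0 + 70.0 * i / 24 for i in range(25)]
grad = [120.0 - 10.0 * math.log(a) + 0.8 * math.sin(3 * i)
        for i, a in enumerate(acceptance)]

# Candidate transformations of the explanatory variable.
candidates = {
    "identity": acceptance,
    "ln": [math.log(a) for a in acceptance],
    "square": [a ** 2 for a in acceptance],
}

results = {name: adjusted_r2(x, grad) for name, x in candidates.items()}
best = max(results, key=lambda name: results[name][0])

for name, (adj, slope) in results.items():
    print(f"{name:8s} adjusted R^2 = {adj:.4f}")
print("selected transformation:", best)
```

Because all candidate models have the same number of regressors, ranking them by adjusted R^2 is equivalent to ranking them by R^2. The adjustment only matters when models of different size are compared.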

Dummy Variable Regression

To test the importance of other factors for the group of colleges, we applied the following regression: we used all 5 explanatory variables to estimate [Grad%] for the entire data set and included a binary variable that assigns a "0" to colleges and a "1" to universities. If this variable is highly significant, one can conclude that important factors not included in the model strongly influence the graduation percentage at liberal arts colleges.

  • First, we used the untransformed data.
Contents of ANOVA
[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                  1502.855     4   375.714      13.049   0.0000"
[ 5,] "Residuals                   1295.625    45    28.792"
[ 6,] "Total Variation                 2798    49    57.112"
[ 7,] ""
[ 8,] "Multiple R      = 0.73282"
[ 9,] "R^2             = 0.53703"
[10,] "Adjusted R^2    = 0.49587"
[11,] "Standard Error  = 5.36579"
Contents of Summary
[ 1,] "Variables in the Equation for Y:"
[ 2,] " "
[ 3,] ""
[ 4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[ 5,] "  __________________________________________________________________________________"
[ 6,] "b[ 0,]=         16.3248      24.9442       0.0000      0.6545   0.5161   Constant   "
[ 7,] "b[ 1,]=          0.0732       0.0183       0.6041      4.0102   0.0002   [Dummy Var]"
[ 8,] "b[ 2,]=         -0.2540       0.0840      -0.4491     -3.0225   0.0041   [SAT]"
[ 9,] "b[ 3,]=         -0.0001       0.0001      -0.2877     -2.1157   0.0399   [Acceptance]"
[10,] "b[ 4,]=         -0.1545       0.0802      -0.2764     -1.9260   0.0604   [$/Student]"
  • Second, the variables were transformed according to the highest adjusted R^2 of simple regressions explaining [Grad%] for the entire data set. Accordingly, the logarithms of [SAT], [Acceptance] and [$/Student] as well as e^[Top_10%] and e^[PhD%] are used.
Contents of ANOVA
[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                  1649.833     4   412.458      16.159   0.0000"
[ 5,] "Residuals                   1148.647    45    25.525"
[ 6,] "Total Variation                 2798    49    57.112"
[ 7,] ""
[ 8,] "Multiple R      = 0.76782"
[ 9,] "R^2             = 0.58955"
[10,] "Adjusted R^2    = 0.55306"
[11,] "Standard Error  = 5.05228"
Contents of Summary
[ 1,] "Variables in the Equation for Y:"
[ 2,] " "
[ 3,] ""
[ 4,] "PARAMETERS         Beta         SE         StandB      t-test   P-value  Variable"
[ 5,] "  __________________________________________________________________________________"
[ 6,] "b[ 0,]=       -423.4510     146.1641       0.0000     -2.8971   0.0058   Constant   "
[ 7,] "b[ 1,]=         74.4061      19.6244       0.4888      3.7915   0.0004   [Dummy Var]"
[ 8,] "b[ 2,]=         -6.5097       2.5811      -0.3148     -2.5221   0.0153   [SAT]"
[ 9,] "b[ 3,]=          0.0000       0.0000      -0.3537     -3.5201   0.0010   [$/Student]"
[10,] "b[ 4,]=          0.0000       0.0000      -0.2168     -2.2539   0.0291   [Top_10%]"

The dummy variable [Dummy Var] is the most significant regressor in both models, which supports our thesis that other influences in the group of colleges must determine the graduation percentage. The regressions shown above were run without replacing outliers, because our usual replacement process becomes nearly impossible here. Since the group of universities is generally less homogeneous than the group of colleges, especially in the case of [$/Student], new outliers appear again and again at the upper bound when the two groups are combined.
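The idea behind the dummy variable regression can be sketched as follows. The example is a minimal illustration on synthetic data with a single explanatory variable plus the school-type dummy (0 for colleges, 1 for universities). The helper names `solve_linear_system` and `ols`, the coefficients and the group shift of -4 points are all assumptions made for the illustration; none of them come from the College Data Set or from XploRe. The sketch only estimates the coefficients and does not compute the t-tests shown in the output above.

```python
import math


def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * pv for v, pv in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(design, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data: both groups share the same slope on the made-up
# acceptance rate, but the second group is shifted by -4 points.
rows, y = [], []
for dummy, shift in ((0, 0.0), (1, -4.0)):
    for i in range(10):
        acceptance = 20.0 + 3.0 * i
        noise = 0.3 * math.sin(7 * (i + 10 * dummy))  # small deterministic disturbance
        rows.append([1.0, acceptance, float(dummy)])  # constant, regressor, dummy
        y.append(95.0 - 0.25 * acceptance + shift + noise)

intercept, slope, dummy_effect = ols(rows, y)
print(f"constant   = {intercept:.3f}")
print(f"acceptance = {slope:.3f}")
print(f"dummy      = {dummy_effect:.3f}")
```

The dummy coefficient estimates the difference in level between the two groups that the other regressors leave unexplained. A large and significant value therefore points to group-specific influences that the model does not capture.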

Conclusion

In this paper we analyzed data on the top U.S. universities and liberal arts colleges. We applied several methods to investigate the data and to study the relationships in the College Data Set. The univariate analysis showed that universities and liberal arts colleges are characterized by different distributions of the variable values. We therefore decided to treat them separately in the subsequent steps of the analysis. Nevertheless, it was interesting to see that both school types reach the same results in the variable [Grad%], which we defined as the measure of success. Consequently, we think that both are valuable and should be taken into account when deciding how to organize tertiary education in Germany.

Next, we studied outliers, which appear only in the group of universities, and decided to create a new data set in which the outliers are replaced. In the subsequent regressions of [Grad%] on single explanatory variables we saw that the measure of success can be explained better for universities than for colleges. Furthermore, we found that transformations of the variables and replacement of the outliers can improve the results of the regression.

All the knowledge gained so far and three stepwise regression methods were then combined to develop the optimal regression model for [Grad%], separately for universities and colleges.

We found that the success of universities is explained best by the SAT scores of the students, the acceptance rate and the percentage of students who were among the top 10% of their high school graduating class, [Top_10%]. We were surprised that lower values of [Top_10%] result in a higher graduation rate. Furthermore, the acceptance rate has to be evaluated critically as an indicator, since it depends on the number of applicants and does not necessarily reflect a higher qualification of the accepted students.

For colleges we found that it is very hard to find a satisfactory model with the given variables. On the one hand, this confirms that universities and colleges should be managed in different ways. On the other hand, it shows that important variables for high graduation rates at colleges might be missing, for example the ratio of professors to students. A regression that included all observations and a dummy for the school type led to a model with a highly significant [Dummy Var]. This supports the aforementioned thesis that other causes strongly influence the success of liberal arts colleges.

To give more substantial advice on how to manage institutions of higher learning in Germany, one needs a data set of higher quality. Either the sample size needs to be larger or information on a larger number of variables has to be collected.

References

  • Härdle, W., Klinke, S. and Müller, M. (2000). XploRe – Learning Guide. Springer-Verlag Berlin Heidelberg.
  • Härdle, W. and Simar, L. (2003). Applied Multivariate Statistical Analysis. Springer-Verlag Berlin Heidelberg.

Comments

  • No XploRe programs (not even in the appendix)
  • Remark: public institutions often finance private institutions :(
  • Data source given
  • Links to external terms given
  • SAT?
  • Outlier replacement procedure is at least doubtful
  • Bandwidths?
  • Too colorful, especially the green :)
  • How did you compare the adjusted R^2?
  • Outlier problem in linear regression
  • A final overview about all regression models would have been nice
  • How should collecting more information (observations) on variables which do not help to explain [Grad%] help to increase R^2?