What Factors Determine Salaries in Professional Basketball? - Evidence from the National Basketball Association
Please do not cite work from this wiki, since these are mainly students theses which may contain errors!
The idea for this paper emerged from asking a friend of mine who knows basketball well: “What determines a basketball player’s salary?” He answered briefly with a laugh: “No one knows.”
Professional sports is an interesting research area of economics. Professional sports can represent large markets with significant revenues and numerous stakeholders such as sport associations, players, textile companies, viewers and many more. From a labor economic viewpoint basketball teams represent the demand side as employers and players are suppliers of labor offering their skills and performance. Furthermore, professional sports is interesting as a research field, because it provides detailed data about performance measures, physical variables and income data of players publicly accessible for several years collected by various institutions such as television networks and web pages.
The purpose of the analysis is to apply the statistical techniques and models taught in the course "Computergestützte Statistik" at Humboldt-Universität zu Berlin in the winter term 2009/10 rather than to provide a complete empirical analysis. Nevertheless, the question of what determines salaries in professional basketball is considered throughout the analysis with a focus on the North American National Basketball Association (NBA). The analysis considers player salaries of the season 2009-10 and how they are influenced by variables of the season 2008-09, using a variety of statistical techniques. SPSS 17.0 is primarily used for the analysis. Additionally, EViews 5.0 is used to complement some analysis techniques.
Data Set and Variable Description
The data used in this paper are extracted and combined from the following web pages:
The statistics published by these websites are mostly collected and provided by the Elias Sports Bureau (ESB).
The observation units of the data set are basketball players of the NBA season 2009-10, for whom characteristic, salary and performance variables are recorded. Table 2.1 gives a detailed variable list. The variables for the subsequent analysis, especially for the regression analysis, are chosen with reference to Yu et al. (2008, p.196). The most important variables for the analysis are highlighted in grey. Table 2.1 contains various performance measures. Field Goals Made, Assists and Steals are the main variables of interest to evaluate a player’s performance for most of the analysis. All performance measures, including these three, are analyzed with the help of factor analysis to gain some insight into how the complex question of what performance is can be reduced to meaningfully interpretable factors. The reader should note that most of the performance variables are averages, indicated by p.G. (= per Game). For convenience the abbreviation is not used throughout the text, but it must be kept in mind. Average values are used instead of total values since averages tend to be normally distributed by the Central Limit Theorem and can be meaningfully interpreted as the “true” performance of a player.
|Tab. 2.1 Variable List|
Figure 2.1 shows a small snapshot of the data sheet in SPSS to get an impression of the data set.
Before the subsequent analysis could be undertaken, the data provided by the above-mentioned sources had to be edited. Since teams are allowed to trade players within a season, some players occurred twice in the data set. These players were identified with SPSS using the variable "Name". A player who occurred twice in the data had the same values for characteristic and salary variables, but different values for performance variables. Consequently, one of the two entries must be deleted to avoid a misleading analysis. To choose which entry to delete, the variable “Games Played” was used. The case with the most games played is retained in the data set, because a player is more likely to show his ability when he has played more games. 70 redundant double cases were identified out of 515 cases and subsequently deleted. A total number of 445 cases remained for the analysis.
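The deduplication rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the SPSS procedure actually used; the field names `Name`, `Team` and `GamesPlayed` and the three example rows are hypothetical.

```python
def keep_most_played(rows):
    """Keep one row per player name: the one with the most games played."""
    best = {}
    for row in rows:
        name = row["Name"]
        # Replace the stored row only if this one has strictly more games.
        if name not in best or row["GamesPlayed"] > best[name]["GamesPlayed"]:
            best[name] = row
    return list(best.values())


# Hypothetical rows; the names and numbers are made up for illustration.
rows = [
    {"Name": "Player A", "Team": "X", "GamesPlayed": 30},
    {"Name": "Player A", "Team": "Y", "GamesPlayed": 45},
    {"Name": "Player B", "Team": "Z", "GamesPlayed": 82},
]
deduplicated = keep_most_played(rows)
```

In this sketch a traded player keeps only the entry from the team for which he played more games, which mirrors the reduction from 515 to 445 cases.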
Mainly, the variables Salary0910, Salary0809, Height, Weight, Experience, Field Goals Made, Assists and Steals are taken into consideration throughout the analysis. The aim is to understand, as mentioned above, how salaries of the season 2009-10 are influenced by different variables. The variable Salary0809 is a proxy for rigid wage setting institutions in the NBA since it reflects a player’s current performance only to a minor extent. The height and weight of a player are proxies for a player’s physical ability. The interpretation of the player’s experience is straightforward and can be compared with traditional labor economics applications. The variable Field Goals Made reflects a player’s offensive ability, the variable Assists the player’s team play capability and the variable Steals the player’s defensive ability. All three variables describe the performance of a player.
Explorative Data and Extreme Value Analysis
This section looks at descriptive statistics of the aforementioned NBA variables. Furthermore, extreme values are analyzed with the help of different techniques, and variable distributions are graphically analyzed and tested, since some extreme value techniques require the assumption of a normal distribution. The metric variables Salary0910, Salary0809, Height, Weight, Experience, Field Goals Made, Assists and Steals are analyzed separately. Extreme value tests are applied to the variable Weight, because it can be transformed with the help of a power transformation such that a normal distribution cannot be rejected by the Kolmogorov-Smirnov and Jarque-Bera tests. Unfortunately, some of the tests do not provide a test decision, because critical values are not tabulated for large sample sizes. The Jarque-Bera test statistic and the corresponding p-value are calculated with the help of EViews 5.0.
Figures 3.1-3.5 illustrate the distribution of the variable Salary0910. The histogram in figure 3.1 shows that the distribution is skewed to the right. The additionally plotted normal distribution already indicates that the variable is not normally distributed. Figure 3.2 gives the stem-and-leaf plot. It also shows that most of the observations lie in lower salary regions. Furthermore, it already indicates that there are 23 high extreme values (salaries above a value of $15,779,912). This is confirmed by the boxplot in figure 3.3. Again, one can see that most of the distribution mass is located in lower salary categories and that the distribution is skewed to the right. There are also high extreme values indicated by circles. Figures 3.4 and 3.5 show the Normal Q-Q plot and the Detrended Normal Q-Q plot. The distribution clearly deviates from a normal distribution in its tails as well as in its main body. Key figures are presented in table 3.1. The skewness with a value of 1.4 and the kurtosis with a value of 1.7 deviate from the values of a normal distribution (0 and 3 respectively). The Kolmogorov-Smirnov and the Shapiro-Wilk tests confirm that the distribution is not a normal one, since the reported p-value is 0.00 in each case and thus below every reasonable significance level. The Jarque-Bera test confirms this result with a test statistic of 248.66 and a p-value of 0.00. Hence, extreme value tests such as the Grubbs test, which assume a normal distribution, cannot be applied. Nevertheless, comparing the variable’s mean and M-estimators shows that high extreme values might have a significant influence. The mean, which is susceptible to extreme values, is about $1,000,000 higher than the M-estimators, which give extreme values a lower weight. Furthermore, the median is $1,600,000 lower than the mean and the 5 percent trimmed mean is about $500,000 lower than the mean.
However, no values are excluded from the analysis, since the aim is to explain differences in salaries and the extreme values are not marked by stars in the boxplot. It should be noted that neither a logarithmic nor a power transformation of the variable yields a normal distribution that would enable the application of outlier tests.
As with the variable Salary0910, the variable Salary0809 does not seem to be normally distributed. The histogram in figure 3.6 shows the right skewness of the distribution, which is confirmed by the stem-and-leaf plot in figure 3.7. Additionally, the stem-and-leaf plot indicates that there are 31 extreme values. The boxplot in figure 3.8 reveals the right skewness since its upper whisker is much farther away from the main body of the distribution than the lower one. As in the stem-and-leaf plot, there are high extreme values indicated by circles and even one value indicated by a star which might be an outlier. This star value belongs to the player Kevin Garnett, who is known to be one of the best players in the NBA. Unfortunately, as shown in table 3.2 by the Kolmogorov-Smirnov and Shapiro-Wilk tests, the variable is not normally distributed either. The Jarque-Bera test provides the same result with a test statistic of 266.78 and a p-value of 0.00. The Normal Q-Q plots in figures 3.9 and 3.10 confirm this result. Again, skewness and kurtosis clearly deviate from their theoretical normal counterparts. A normal distribution cannot be assumed. As before, the mean clearly deviates from the median and the M-estimator values. The extreme value indicated by a star might therefore be a problem, but as the value cannot be tested to be an outlier it is kept for the analysis. Again, as for the variable Salary0910, neither a logarithmic nor a power transformation of the variable Salary0809 yields a normal distribution.
Height is usually known to be normally distributed within populations. However, basketball players in the NBA are on average much taller than people who are not professional basketball players. This means that the sample analyzed is a selection of extraordinarily tall persons and consequently not representative of a country's population as a whole. As can be seen in table 3.3, the players’ average height is 2.01m. Therefore, the variable Height in this data set might not be normally distributed. The histogram in figure 3.11 suggests that the distribution is bimodal with two peaks at 1.90m and 2.05m. Nevertheless, the distribution does not seem to be too far away from a normal one. The stem-and-leaf plot in figure 3.12 reveals only one peak and detects one low and two high extreme values, which is confirmed by the boxplot in figure 3.13. The distribution seems to be roughly symmetric with a slight left skew. The Normal Q-Q plot in figure 3.14 seems to have an almost perfect fit with some deviation at the right tail. However, the Detrended Q-Q plot indicates that the deviations of the observed values from the theoretical normal distribution are not random. Table 3.3 gives the tests and key numbers for normality. The skewness is close to its theoretical counterpart of zero with a value of 0.275, but the kurtosis is much smaller than three with a value of 0.355. Furthermore, both tests reject the assumption of a normal distribution since the p-value of 0.00 is below any commonly used significance level. This is confirmed by the Jarque-Bera test with a test statistic of 8.03 and a p-value of 0.02. Unlike for the salary variables, extreme values do not seem to have an impact on the mean. The 5 percent trimmed mean, the median and the M-estimators deviate from the mean only by 1-2cm. A transformation to obtain a normal distribution is not advisable since the spread of the variable is too small (the ratio of the maximum and minimum of the variable is 2.02 which is lower than 20).
In any case, the observed extreme values do not need to be eliminated, because they do not seem to have an influence on the mean as shown by the M-estimators.
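The Jarque-Bera p-values quoted in this section can be checked by hand, because the statistic is asymptotically chi-square distributed with two degrees of freedom, whose upper tail is simply exp(-JB/2). The following minimal sketch uses the textbook definition of the statistic; whether EViews applies exactly the same small-sample conventions is an assumption here, and the function names are illustrative.

```python
import math


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n, as in the usual JB definition.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, jb_p_value(jb)


def jb_p_value(jb):
    """Upper tail of a chi-square distribution with 2 degrees of freedom."""
    return math.exp(-jb / 2.0)


# The test statistic reported for Height reproduces the reported p-value.
height_p = round(jb_p_value(8.03), 2)  # 0.02
```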
The variable Weight might be normally distributed as indicated by the histogram in figure 3.16, since its distribution seems to be symmetric and well fitted by the plotted normal distribution. This impression is confirmed by the stem-and-leaf plot in figure 3.17. Two high extreme values are detected in the stem-and-leaf plot and the boxplot in figure 3.18. The boxplot confirms the symmetry of the distribution. The whiskers seem to have almost the same length and the median is in the center of the lower and upper fourths. The Normal Q-Q plots in figures 3.19 and 3.20 show that the distribution is close to a normal one except for the tails. The skewness is close to zero, but the kurtosis is far away from three as shown in table 3.4. The Kolmogorov-Smirnov and Shapiro-Wilk tests reject a normal distribution. However, the p-value of the Kolmogorov-Smirnov test with a value of 0.044 is close to the significance level of 0.05. The Jarque-Bera test does not reject the assumption of a normal distribution with a test statistic of 2.85 and a p-value of 0.45. So far, a normal distribution must be rejected since at least one test rejects it. However, since the variable's distribution seems to be close to a normal one, a power transformation with the parameters c = 0 and p = 1.4 is undertaken. The value c = 0 is chosen since there are no negative original or transformed values, and p = 1.4 is chosen because it gives the highest p-value of the Kolmogorov-Smirnov test in a trial and error procedure with different values of p. The test in table 3.5 gives a p-value for the Kolmogorov-Smirnov test of 0.118, which is higher than a significance level of 0.05. The Kolmogorov-Smirnov test therefore does not reject a normal distribution of the transformed variable Weight. The Jarque-Bera test does not reject a normal distribution for the untransformed variable, and it does not for the transformed variable either.
Although the test statistic increased to a value of 5.09, the corresponding p-value of 0.08 is still larger than 0.05. Unfortunately, the Shapiro-Wilk test rejects a normal distribution of the transformed variable. Nevertheless, outlier tests are performed, because the variable's distribution does not seem to be too far away from a normal distribution as indicated by the other tests. Subsequent test results must be seen as an approximation and should therefore be interpreted with caution. To test whether the highest extreme value (the player Shaquille O’Neal) is an outlier or not, the Grubbs test is applied. To get the test statistic, the mean of the transformed variable is subtracted from the highest value of the transformed variable and the difference is divided by the standard deviation. The values for the calculation are not taken from table 3.5 since the rounding there is too rough for the transformed data. Instead, the exact values with more decimals are used, because the test statistic seems to be sensitive to truncation. Since critical values are not tabulated for a large sample size of 445 (see Grubbs/Beck 1972), the approximate p-value is calculated as follows (see GraphPad):
- 1. The test statistic is calculated: $Z = \frac{x_{\max} - \bar{x}}{s}$.
- 2. The value $T = \sqrt{\frac{N(N-2)Z^2}{(N-1)^2 - NZ^2}}$ is calculated.
- 3. The T value is used to obtain the two-tailed p-value of a t-distribution with N-2 degrees of freedom, for example with the Excel function TDIST(T, N-2, 2).
- 4. The obtained p-value of the t-distribution is multiplied by N which gives the approximate p-value of the Grubbs test.
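The four steps can be reproduced with a short Python sketch. It assumes the approximation published by GraphPad, T = sqrt(N(N-2)Z² / ((N-1)² - NZ²)) with N-2 degrees of freedom, and replaces the Excel call with a simple numerical integration of the t density so that no external library is needed. The function names are illustrative, and the inputs Z = 3.895 and N = 445 are the values reported in the text.

```python
import math


def t_two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value of Student's t via Simpson integration of the density."""
    log_norm = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2.0 * math.log1p(x * x / df))

    # Simpson's rule on [0, |t|]; steps must be even.
    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = total * h / 3.0  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_mass)


def grubbs_approx_p(z, n):
    """Approximate p-value for a Grubbs statistic z in a sample of size n."""
    t = math.sqrt(n * (n - 2) * z * z / ((n - 1) ** 2 - n * z * z))
    return min(1.0, n * t_two_tailed_p(t, n - 2))


p = grubbs_approx_p(3.895, 445)
```

With these inputs the sketch gives a value close to the approximate p-value of 0.038 reported for the Grubbs test.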
The test statistic value of 3.895 is larger than the critical value of 3.832 at the 5 percent significance level (the approximate p-value is 0.038). The critical value and the test statistic are almost identical, but according to the test the observation seems to be an outlier. Additionally, since the boxplot detected two extreme values, the Grubbs-Beck test should be applied to assess whether both values are outliers or not. Unfortunately, the test cannot be performed since critical values are usually reported only up to a sample size of 100 (see Grubbs/Beck 1972). Thus, a test decision is not possible. Dixon’s r-statistic test suffers from the same problem as the Grubbs-Beck test: the sample is too large for the tabulated critical values. Consequently, that test is not undertaken either. Fortunately, critical values of the David-Hartley-Pearson test are tabulated for larger sample sizes (see David et al. 1954). Since SPSS does not provide the test, the key figures for calculating the test statistic are given for illustration:
- Sum of Squared Deviation = 5,651,663.762
- Range = 680.009
- Test Statistic = 6.034
The critical values of the David-Hartley-Pearson test at the 5 percent significance level for a sample size of 500 (values for a sample size of 445 are not tabulated) are 5.37 for the lower bound and 6.94 for the upper bound. The test statistic lies between both values, so the null hypothesis that the highest observed value is not an outlier cannot be rejected. This contradicts the result of the Grubbs test, which evaluated the highest value to be an outlier. Nevertheless, the observation is kept for the analysis since it seems to have no impact on the mean, as shown above by the mean and the corresponding M-estimators. Furthermore, the critical value and the test statistic of the Grubbs test are almost identical and the David-Hartley-Pearson test does not detect the value as an outlier.
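The David-Hartley-Pearson statistic is the range divided by the standard deviation, so it can be recomputed from the key figures above. In the sketch below the standard deviation uses the divisor N rather than N-1; this is an assumption, made because it reproduces the reported value of 6.034. The variable names are illustrative.

```python
import math

n = 445                          # sample size from the text
sum_sq_dev = 5_651_663.762       # sum of squared deviations
value_range = 680.009            # maximum minus minimum

# Assumption: standard deviation with divisor n, since this reproduces 6.034.
std_dev = math.sqrt(sum_sq_dev / n)
statistic = value_range / std_dev

lower_crit, upper_crit = 5.37, 6.94   # tabulated 5 percent bounds for n = 500
outlier_suspected = not (lower_crit < statistic < upper_crit)
```

Because the statistic falls between the two bounds, `outlier_suspected` is false, in line with the test decision described above.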
The variable Experience is a metric variable, but not a continuous one, since it is measured in natural numbers (also referred to as count data). That is why a bar chart is used instead of a histogram to illustrate the variable’s distribution. The bar chart is given in figure 3.21. Low values are observed more often than high values. The stem-and-leaf plot in figure 3.22 might look odd, because all leaves are coded with zero. This is a natural result since the variable is measured in natural numbers as stated above. Like the boxplot in figure 3.23, it detects one high extreme value. As can be seen in table 3.6, the value does not seem to distort the analysis, because the mean is only slightly higher than the median, the 5 percent trimmed mean and the M-estimators. Consequently, the value is not deleted.
Field Goals Made
Figures 3.24-3.26 show that the variable’s distribution is right skewed with one high extreme value. Figures 3.27 and 3.28 suggest that the variable is not normally distributed, which is confirmed by the estimated skewness and kurtosis and the Kolmogorov-Smirnov test in table 3.7. Comparing the mean with the 5 percent trimmed mean and the median, the high extreme value does not seem to have much influence. The M-estimators, however, are different. It should be noted that excluding the high extreme value does not change the results reported in table 3.7 significantly, so the value is kept for the analysis.
Figures 3.29-3.31 show that the distribution of the variable Assists is right skewed. The stem-and-leaf plot and the boxplot indicate that there are a couple of high extreme values (the boxplot even detects eight star values). The Normal Q-Q plots in figures 3.32 and 3.33 show that the data might not be normally distributed, which is confirmed by the skewness, the excess kurtosis and the Kolmogorov-Smirnov test in table 3.8. As the M-estimators and their deviation from the mean show, extreme values might be an issue, but they are not tested since the variable is not normally distributed.
Figures 3.34-3.36 indicate that the distribution of the variable Steals is right skewed. The stem-and-leaf plot and the boxplot detect 5 high extreme values with one star value. The data are not normally distributed as shown by the Normal Q-Q plots in figures 3.37 and 3.38 and the Kolmogorov-Smirnov test in table 3.9. Extreme values do not seem to be important since the mean and the M-estimators deviate only slightly.
Missing Value Analysis
Missing values are reported in table 4.1 for the variables Salary0910 and Salary0809. 1.1 percent of values are missing for Salary0809 and 11.7 percent for Salary0910. The reason is that the missing Salary0910 values have not been published yet (which presumably will happen after the season has ended). Missing values of Salary0809 might be due to the fact that some players of the current season did not play in the previous one.
|Tab. 4.1 Missing Value Counts|
However, a more detailed analysis is needed to evaluate whether there are observable patterns of missing values and whether they are missing completely at random (MCAR) or not. To analyze patterns of missing values, categorical variables are used to evaluate mean differences and frequencies within categories. Additionally, the metric variables Salary0809, Salary0910, Height, Weight and Field Goals Made are used. Table 4.2 gives the interrelationship between the considered quantitative variables and the separate variance t-tests. The test evaluates whether the means of the quantitative indicator variables differ significantly between cases with missing and cases with existing values of the variable of interest. A significance level of 5 percent is chosen. Since the number of degrees of freedom is high, the two-sided critical values of a t-distribution can be approximated by those of a standard normal distribution, which gives critical values of ±1.96. Table 4.2 indicates that the means for the indicator variables Salary0809, Salary0910 and Field Goals Made differ significantly since the t-statistics are above 1.96 with values from 2.3 up to 16.8. This might already be a hint that the data are not missing completely at random. Tables 4.3-4.5 show cross tabulations between the categorical variables of the data set and both salary variables with missing values. According to tables 4.3 and 4.4, missing values are approximately equally distributed across categories of the variables Conference and Division. Percentages between Conference categories range from 0.5 to 1.8 and from 11.2 to 12.2 for the two salary variables. Percentages between Division categories range from 0.0 to 2.7 and from 10.5 to 13.9 respectively. These patterns might be due to chance. In contrast, considering the distribution of missing values with the help of the indicator variable U.S. College reveals a pattern for the variable Salary0910 as can be seen in table 4.5.
Players who attended foreign colleges or no college at all have only 5.5 percent of missing values, whereas players who attended a U.S. college have 13.3 percent of missing values. This is another hint that some values are not missing completely at random.
|Tab. 4.2 Missing Values Quantitative Variable Patterns|
|Tab. 4.3 Missing Values Categorical Variable Conference|
|Tab. 4.4 Missing Values Categorical Variable Divisions|
|Tab. 4.5 Missing Values US college|
Since there are several hints that the data are not missing completely at random, Little’s MCAR test is performed. Its null hypothesis is that the data are missing completely at random (see SPSS 2007). Table 4.6 gives the test result in its footnote. The p-value is 0.00 and thus smaller than every reasonable significance level. The null hypothesis is rejected and the data do not seem to be missing completely at random. Hence, most imputation methods such as mean substitution and regression would result in biased values.
|Tab. 4.6 Missing Values Little's MCAR test|
The EM algorithm seems to be the most advisable choice to reduce the bias (see SPSS 2007) and is used for the further analysis. The EM algorithm is an iterative likelihood-based method which assumes a normal distribution. It is performed even though the variables themselves are not normally distributed as shown in the previous section. Table 4.7 gives some basic descriptive statistics to assess the impact of the imputed values. The range of values for the variable Salary0809 remains unaffected. The decrease in the mean of the variable Salary0809 is less than 1 percent and therefore negligible. In contrast, the mean of the variable Salary0910 decreased by 6.5 percent. The maximum is the same as before, but the minimum changes from $736,420 to -$292,981, which is impossible since salaries can only have positive values; in addition, the range of the untreated variable should be retained. This problem is a hint that the normal distribution assumed by the EM algorithm to calculate imputed values is not correct, since a lot of values of the original Salary0910 variable are observed at the lower tail of its empirical distribution. Nevertheless, table 4.8 shows that only 0.9 percent of the imputed values are below $736,420. These values are substituted by $736,420 (the mean remained considerably stable with an increase of less than 1 percent). Estimation techniques using truncated normal distributions might improve the quality of imputed values.
|Tab. 4.7 Means of Salary Variables before and after Missing Value Treatment|
|Tab. 4.8 Values Below Former Minimum Variable Salary0910|
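The substitution of imputed values that fall below the smallest observed salary can be illustrated with a minimal Python sketch. The function name and the example list are hypothetical; only the observed minimum of $736,420 and the impossible value of -$292,981 are taken from the text.

```python
def floor_at_minimum(values, minimum):
    """Replace every value below `minimum` with `minimum`; return (values, share)."""
    clamped = [max(v, minimum) for v in values]
    replaced = sum(1 for v in values if v < minimum)
    return clamped, replaced / len(values)


OBSERVED_MIN = 736_420  # smallest observed Salary0910 value from the text

# Hypothetical salaries after imputation; only -292,981 is taken from the text.
imputed = [-292_981, 900_000, 2_500_000, 736_420]
clamped, share_replaced = floor_at_minimum(imputed, OBSERVED_MIN)
```

The returned share corresponds to the kind of figure reported in table 4.8, where 0.9 percent of the values had to be substituted.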
Testing Equality of Means
This section examines differences in means between various variables. Inference is made for professional basketball players in general based on the NBA sample. First, the influence of U.S. college attendance on a player’s salary is analyzed with the help of a t-test. Second, ANOVA and post-hoc tests are used to analyze if the mean weight differs significantly across positions. Third, the non-parametric Kruskal-Wallis test is used to evaluate whether mean salaries differ across positions or not.
The role of U.S. colleges for an NBA player’s mean salary in 2009-10
College basketball is important for professional basketball in the U.S., because it provides the NBA with young talents. This section analyzes the means and variances of salaries in 2009-10 to answer the question of whether attendance of a U.S. college plays a significant role. Table 5.1 shows that the mean salary of players who played for U.S. colleges is lower than that of players with no or foreign college attendance. This might be due to the fact that young inexperienced players are relatively more often represented in the group of players with a U.S. college background than in the group of players with a foreign or no college background. However, table 5.1 reveals that a significant gap in mean salaries remains among players with higher experience, even if players with low experience are not taken into account. The mean differences must have another explanation, but are not further analyzed here.
|Tab. 5.1 Descriptive Statistics US College|
It is now tested whether the mean differences are significant or not. Figure 5.1 gives the error bar plot for salaries by categories of U.S. college attendance with a confidence interval of 95 percent. The intervals clearly do not overlap, which might be a hint that the means are different for both groups.
|Fig. 5.1 Error Bar Plot Salary US College|
To test whether the means are different, the variances must first be tested for equality. Beforehand, the distribution of both groups must be tested for normality. The Kolmogorov-Smirnov test in table 5.2 rejects a normal distribution with a p-value of 0.00.
|Tab. 5.2 Kolmogorov-Smirnov Test Salary0910 groups: US college|
Consequently, the F-test cannot be performed to assess the equality of variances, and the Levene test is applied instead. The result in table 5.3 is that the variances are not equal, because all p-values based on the four different measures of center are below a significance level of 5 percent. This result is confirmed by the spread-and-level plot in figure 5.2. The two dots, which represent the two groups of the variable U.S. college, do not lie on a horizontal line, which means that the variances within the two groups are different.
|Tab. 5.3 Levene-Test||Fig. 5.2 Spread-and-Level Plot Salary US College|
To compare the means of both groups, the t-test for unequal and unknown variances is performed. The result is shown in table 5.4. The null hypothesis of equal means is rejected, because the p-value is below a significance level of 0.05. Consequently, the salary means of both groups are different from each other, which means that it does play a role whether a player attended a U.S. college or not.
|Tab. 5.4 T-test US college|
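For readers without SPSS, the t-test for unequal variances (often called Welch's test) can be sketched as follows. The statistic divides the mean difference by the square root of the summed per-group squared standard errors, and the degrees of freedom follow the Welch-Satterthwaite formula. The function name and the two small samples are hypothetical and are not the NBA data.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    # Per-group squared standard errors, using sample variances (divisor n - 1).
    se_a = variance(sample_a) / na
    se_b = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df


# Hypothetical salaries in million dollars, made up for illustration.
us_college = [1.0, 2.0, 3.0, 4.0, 5.0]
other = [2.0, 6.0, 10.0, 14.0]
t_stat, df = welch_t(us_college, other)
```

The resulting statistic would then be compared with a t-distribution with `df` degrees of freedom, which is the step SPSS reports as the p-value in the "equal variances not assumed" row.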
The relationship of game positions and mean weight
The question of whether the mean weight, as a proxy for a player’s athleticism, differs significantly across positions is analyzed with the help of ANOVA and post-hoc tests. Light players are likely to be fast whereas heavy players are likely to play a strong post-up and defense game. Figure 5.3 gives an overview of the different basketball positions. Positions near the basket usually require heavy players. Thus, the mean weight should differ across positions. Figure 5.4 gives the boxplots of weight categorized by positions. It indicates that the means are different, especially between Point Guards and Centers. The error bar plot in figure 5.5 corroborates the differences since the 95 percent confidence intervals do not overlap for any pair of groups.
To test differences in means simultaneously, the ANOVA model is applied. Beforehand, the model assumptions of normality within groups and a homogeneous variance across groups must be tested. Table 5.5 shows the Kolmogorov-Smirnov and Shapiro-Wilk tests for normality within groups. Some p-values are above a significance level of 0.05 and some are below. Thus, normality cannot be assumed for some groups. However, as some of the p-values are close to or above 0.05, the distribution within groups might not be too far away from a normal one. To verify the result of the ANOVA model (which strictly speaking should not be applied due to the violated assumptions), the Kruskal-Wallis test is performed additionally below. Table 5.6 gives the Levene test with different estimators used for the first moment. The tests based on the median and on the median with adjusted degrees of freedom do not reject equal variances since their p-values are above a significance level of 0.05. In contrast, the Levene tests based on the mean and the trimmed mean reject the assumption of equal variances, which contradicts the result of the median-based tests. Nevertheless, equal variances are assumed since two p-values are clearly above a significance level of 0.05 and the other two are relatively close to it and above other commonly used significance levels such as 0.025.
Table 5.7a gives the result of the ANOVA model. The hypothesis of equal means is rejected as indicated by a p-value of 0.00. This result could be meaningless, because model assumptions are violated as stated above. Nevertheless, the ANOVA test result is confirmed by the non-parametric Kruskal-Wallis test in table 5.7b. The p-value of the test is 0.00 and thus below 0.05. Furthermore, the result of unequal means is confirmed by the Bonferroni post-hoc test in table 5.8 and the Scheffé post-hoc test in table 5.9. Both tests report a p-value of 0.00 for every pair of positions, rejecting the hypothesis of equal mean weight. Table 5.10 illustrates this result by the attempt to form homogeneous subgroups. Every position appears in only one subgroup, and thus five homogeneous subgroups are found which reflect the five original positions. Consequently, the weight of a player is important for the position he plays.
The role of game positions for an NBA player’s mean salary in 2009-10
The influence of a player’s position on his mean salary is analyzed in this paragraph. Table 5.11 gives some key statistics for different positions. Point guards earn the lowest mean salary with $4,328,389.87 whereas small forwards earn $5,207,309.41 on average and also have the highest median salary with $3,700,000. In contrast, Centers have the lowest median salary with $2,628,439. Figure 5.6 shows the boxplot for salaries across positions, which does not reveal significant differences at first sight. The error bar plot in figure 5.7 suggests the same, because all 95 percent confidence intervals overlap with each other.
The Kolmogorov-Smirnov test in table 5.12 rejects normality within groups. Thus, the non-parametric Kruskal-Wallis test, which does not require normality, is performed; results are given in table 5.13. The hypothesis of equal salary levels across positions cannot be rejected because the reported p-value of 0.627 is larger than the significance level of 0.05. Consequently, there is no evidence that the position of a player influences his salary in 2009-10.
|Tab. 5.12 Kolmogorov-Smirnov Test Position Salary0910||Tab. 5.13 Kruskal-Wallis Test Position Salary0910|
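As an illustration of the Kruskal-Wallis test used above, the following sketch computes the tie-corrected H statistic from pooled ranks. It is a simplified textbook version, not the SPSS implementation. The p-value helper uses a closed form of the chi-square upper tail that is valid for an even number of degrees of freedom only (for example df = 4 with five positions); all names and data are hypothetical.

```python
from math import exp, factorial

def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * h - 3 * (n + 1)
    ties = {}
    for x in pooled:
        ties[x] = ties.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    return h / correction

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df % 2:
        raise ValueError("closed form requires an even number of degrees of freedom")
    return exp(-x / 2) * sum((x / 2) ** k / factorial(k) for k in range(df // 2))

groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # hypothetical values for three groups
h = kruskal_wallis_h(groups)                # approximately 7.2
p_value = chi2_sf_even_df(h, df=len(groups) - 1)
```

A p-value above the chosen significance level, as with the 0.627 reported in table 5.13, means the hypothesis of equal locations is not rejected.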
Analysis of Association
Association of Salary0910 and other Metric Variables
Figures 6.1-6.7 show scatter plots of the variable Salary0910 against the other metric variables analyzed in section three. Every scatter plot includes different kinds of regression lines. Green lines are always linear whereas orange lines vary. Figure 6.1 shows a clear linear relationship between the variables Salary0910 and Salary0809. The orange line is the non-parametric Loess regression line using a Gaussian kernel with 50 percent of the data points as fitting basis (other percentages provide a similar picture). The line looks almost linear except for low values (the result for a quadratic regression line is similar). Consequently, there seems to be a strong linear relationship. Figure 6.2 shows the scatter plot of Salary0910 and Height. It reveals a discrete behavior of the variable even though it should be continuous. The reason is that the cm values are converted from the U.S. customary system, which is not a decimal measurement system. Nevertheless, the variable Height is treated as a continuous one since enough distinct values are observed. The plot suggests a slightly non-linear relationship between both variables. Still, a linear approximation seems reasonable since the regression lines differ only in the tails (outliers might play a role).
The scatter plots in figures 6.3-6.7 suggest non-linear relationships between Salary0910 and the rest of the variables.
Table 6.1 gives the Bravais-Pearson bivariate correlation matrix. Salary0910 and Salary0809 are highly correlated with 0.892, which is a hint that rigid wage institutions might play a role in salary determination. Field Goals Made is also correlated with Salary0910, which is a hint that performance affects salaries. Surprisingly, the variables Height and Weight are not highly correlated with Salary0910. Among the correlations between the variables other than Salary0910 there are high, moderate and low values. Height and Weight are highly correlated, which is not surprising since tall basketball players tend to weigh more than smaller ones. The negative correlations between Height and both Assists and Steals are interesting. The same is true for Weight and the latter two variables. A possible explanation is that small, light players are faster than tall, heavy players, and speed is required for high assist and steal statistics. The results of the Linear-by-Linear Association test are reported in table 6.2. Because most p-values are 0.00 or otherwise below 0.05, many of the variables are not independent of each other; this holds in particular for the salary variables and the remaining variables.
|Tab. 6.1 Bivariate Correlations|
|Tab. 6.2 Linear-by-Linear Association Test|
Association of Position and U.S. College
This passage analyzes the association between the two nominal variables Position and U.S. College. Table 6.3 gives the contingency table of both variables. It reports the observed counts and expected counts. The differences between both already indicate that they might be dependent.
|Tab. 6.3 Contingency Table Position U.S. College|
The Chi-Square test in table 6.4 confirms this result. The test assumptions are fulfilled since no cells have expected counts less than 5 as stated in the footnote. The p-value of the Chi-square test is 0.022 and thus less than a significance level of 0.05. The null hypothesis of independence is rejected. The Likelihood Ratio test confirms this result.
|Tab. 6.4 Tests of Independence Position U.S. College|
Table 6.5 gives the symmetric measures of association. The Phi Coefficient should not be considered since it is only valid for 2×2 contingency tables, which is not the case here (although it has the same value as Cramer's V). The Contingency Coefficient and Cramer's V have values of 0.159 and 0.161 respectively. Both values indicate a minor association between the two variables.
|Tab. 6.5 Symmetric Dependence Measures Position U.S. College|
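The symmetric measures above all derive from the Pearson chi-square statistic. The sketch below uses the standard formulas and a hypothetical 3×2 table (not the counts of table 6.3). It also shows why Phi and Cramer's V coincide whenever the smaller table dimension is two, which would be the case here if U.S. College is a binary variable (an assumption, since the table itself is not reproduced).

```python
from math import sqrt

def association_measures(table):
    """Return chi-square, phi, Cramer's V and the contingency coefficient."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    smaller_dim = min(len(table), len(table[0]))
    return {
        "chi2": chi2,
        "phi": sqrt(chi2 / n),
        "cramers_v": sqrt(chi2 / (n * (smaller_dim - 1))),
        "contingency_c": sqrt(chi2 / (chi2 + n)),
    }

# Hypothetical counts: rows are positions, columns are college yes/no.
counts = [[10, 20], [20, 10], [15, 15]]
result = association_measures(counts)
```

Because the contingency coefficient divides by chi-square plus n rather than n alone, it is always slightly smaller than Cramer's V in a table with two columns, consistent with the reported 0.159 versus 0.161.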
Table 6.6 gives the directional measures of association. All three methods report low values of proportional reduction in error (the highest is 4.6 percent). Hence, the association between both variables seems to be low as already seen above. Although all tests and measures report a statistically significant relationship between Position and U.S. College at the five percent significance level, the relationship does not seem to be important because all measures are low in magnitude. Thus, the relationship might be neglected.
|Tab. 6.6 Directional Measures Position U.S. College|
Linear Regression Models
To analyze the influence of various variables on players' salaries at the same time, a multiple linear regression model is constructed. The variable Salary 2009-10 is used as the dependent variable. Independent variables are Salary 2008-09 as a proxy for (rigid) wage institutions, Height in cm, Weight in kg as a proxy for athleticism, Experience measured in years, U.S. college attendance, Field Goals Made p.G. as a proxy for offensive game ability, Assists p.G. as a proxy for team play ability and Steals p.G. as a proxy for defensive game ability.
Before estimating the model, assumptions must be analyzed. First, linearity between the dependent and independent variables is checked graphically. Second, the assumption of homoscedasticity is analyzed with the help of the White test for heteroscedasticity. Third, multicollinearity between regressors is examined. Fourth, normality of errors is checked for each model after their estimation separately.
Judging by the scatter plots in figures 6.1-6.6 of the previous section, the linearity assumption does not seem to be fulfilled except for the relation between Salary0910 and Salary0809. Linear, quadratic and non-parametric regression lines seem to overlap for most of the value range, but not for the tails. Quadratic regression lines have a higher R², which might indicate a non-linear relationship or a problem of outliers. Since values that might be outliers could not be tested in section three, no values are excluded from the analysis in this section. To further assess the assumption of linearity, a linear regression was performed with the above mentioned variables, and the resulting standardized residuals are plotted against the standardized predicted values in figure 7.1. Non-linear behavior is not apparent since the drawn quadratic regression line almost perfectly overlaps with the linear one. What can already be seen is a violation of the homoscedasticity assumption: the plotted values look like a cone spreading from left to right. Studentized deleted residuals are plotted against the dependent variable Salary0910 in figure 7.2. The quadratic and linear regression lines are almost the same, which indicates that a linear relationship might be reasonable.
|Fig. 7.1 Standardized Residual Standardized Predicted Value Plot||Fig. 7.2 Studentized Deleted Residual Salary 2009-10 Plot|
This result is not confirmed for all variables by the partial plots in figures 7.3-7.9. The linearity assumption seems to be fulfilled for the variables Experience, Height, Salary0809 and Weight. It does not seem to be fulfilled for the variables Field Goals Made, Assists and Steals. Thus, linearity can be assumed for some, but not all variables. Therefore, different model specifications including quadratic terms will be estimated to check for non-linear relationships.
Heteroscedasticity is another assumption to be checked. The White test for heteroscedasticity is applied with the help of EViews 5.0. Table 7.1 gives the result of the test, which tests the null hypothesis of an equal variance over the whole sample range. First, the test runs the regression under the assumption of homoscedasticity. Second, an auxiliary regression of the squared residuals on the used regressors, their quadratic terms and distinct interaction terms is performed. Third, it computes the LM statistic (n·R² of the auxiliary regression). The LM test statistic is asymptotically chi-square distributed with q degrees of freedom, where q is the number of auxiliary regressors excluding the constant. The p-value in table 7.1 is below the significance level of 0.05. Thus, the null hypothesis of homoscedasticity is rejected. The presence of heteroscedasticity is supported by plotting Cook's distances against the dependent variable Salary0910 in figure 7.10. Deleting certain single observations from the regression changes the remaining residuals considerably, and Cook's distances increase for high salaries (outliers might be influential).
|Tab. 7.1 White Heteroscedasticity Test||Fig. 7.10 OLS Cook Distances|
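The three steps of the White test can be sketched for the simplest case of a single regressor, where the auxiliary regression contains only the regressor and its square (no interaction terms, so q = 2). This is an illustrative standard-library implementation on constructed data and not the EViews routine; the helper names `solve`, `ols` and `white_lm` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Return OLS coefficients, residuals and R^2; rows must include a constant."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    y_bar = sum(y) / len(y)
    r2 = 1 - sum(e * e for e in resid) / sum((yi - y_bar) ** 2 for yi in y)
    return beta, resid, r2

def white_lm(x, y):
    """White LM statistic n * R^2 for a model with a single regressor x."""
    _, resid, _ = ols([[1.0, xi] for xi in x], y)                     # step 1
    squared = [e * e for e in resid]
    _, _, r2_aux = ols([[1.0, xi, xi * xi] for xi in x], squared)     # step 2
    return len(x) * r2_aux                                            # step 3

# Constructed data whose error magnitude grows with x (heteroscedastic by design).
x = [float(i) for i in range(1, 41)]
y = [2 + 3 * xi + xi * (-1) ** int(xi) for xi in x]
lm = white_lm(x, y)
reject = lm > 5.991  # chi-square 5% critical value for q = 2
```

With several regressors the auxiliary regression would additionally include the cross products, and q would grow accordingly.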
Multicollinearity is analyzed with the help of the variance inflation factor (VIF) and the conditioning index. The analysis is based on the simplest OLS model, ignoring non-linear relationships and heteroscedasticity. Table 7.2 shows the included regressors, their significance and VIF values. The third model, which excludes insignificant variables, shows that all VIF values are below 2.9 and none of them is considerably larger than the others, which hints at mild multicollinearity. Table 7.3 shows the conditioning index and variance proportions for model specification three. The conditioning index is considerably higher for dimension 7 than for the rest, which indicates multicollinearity. Furthermore, some high variance proportions are marked in yellow; these indicate dependencies between the corresponding variables. Still, multicollinearity seems to be mild since the VIF values are all below a value of 10 (see Belsley et al. 1980) and high variance proportions are not observed more than once for most of the variables.
|Tab. 7.3 Conditioning Index and Variance Proportions|
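For illustration, the VIF of regressor j equals 1 / (1 − R_j²), where R_j² comes from regressing that regressor on all others. Equivalently, the VIFs are the diagonal elements of the inverse of the regressors' correlation matrix, which is what the following sketch computes. The data and function names are hypothetical; the thesis values were produced by SPSS.

```python
from math import sqrt
from statistics import mean

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    centered = [[v - mean(col) for v in col] for col in columns]
    norms = [sqrt(sum(v * v for v in col)) for col in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n:] for row in m]

def vifs(columns):
    """VIFs read off the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Two hypothetical regressors with a correlation of 0.8.
first = [1, 2, 3, 4, 5]
second = [2, 1, 4, 3, 5]
vif_values = vifs([first, second])  # both equal 1 / (1 - 0.8**2)
```

With only two regressors the VIF reduces to 1 / (1 − r²), so a correlation of 0.8 already gives a VIF of about 2.8, close to the largest value reported for the third model.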
Since the assumption of homoscedasticity does not hold and there are hints that some relationships are not linear, three different model specifications are estimated as follows:
- 1. OLS estimation without quadratic terms (residuals have already been used above to check model assumptions)
- 2. OLS estimation with quadratic terms
- 3. PMLE estimation with quadratic terms and White heteroscedasticity-robust covariance matrix.
The first model specification serves as reference model to explain basic results and methods whereas for the other models only final results with significant regressors are reported.
First, model one is estimated using the backward method. Stepping method criteria are set to an entry probability of 0.05 and a removal probability of 0.10. Table 7.4 presents the estimates, standard errors and p-values. The backward method excluded the variables Height and Experience since they turned out to be insignificant according to the chosen removal criterion. Model three, the final step of the backward selection, reveals high standard errors which might be due to heteroscedasticity. Table 7.5 shows that the model explains about 84 percent of the variance (the R² and the adjusted R² are nearly the same). Regarding the interpretation of the results, the constant does not have a meaningful interpretation in this setting since a negative reservation salary is not meaningful. Other signs are as expected except for the variables U.S. college and Steals. The coefficient of the variable Salary0809 suggests that, other things equal, every $100,000 earned in 2008-09 is associated with about $71,000 of salary in the next season. The coefficient of Weight as a proxy for athleticism indicates that one more kilogram raises a player's salary by about $25,000. The influence of the variable Field Goals Made is especially large: a player who scores one additional field goal per game is likely to have a salary that is higher by about $561,000. A player who gives one more assist per game earns on average about $225,000 more, but this result is not reliable since its standard error is high. At first sight, the result that having attended a U.S. college lowers the salary by about $570,000 is counterintuitive. The result might be explained by two factors. First, young players with relatively low salaries and a U.S. college history might be overrepresented in the NBA. Second, players from outside the U.S. must perform very well compared to their U.S. colleagues to get the chance to play in the NBA. The result for the variable Steals is odd and cannot be interpreted meaningfully.
There is no obvious reason why a high number of steals should decrease a player's salary. Multicollinearity might be the cause: there may be a trade-off between the variable Steals and other performance variables such as Field Goals Made, as indicated by table 7.3. Surprisingly, the height of a player does not have an influence on his salary. This could be due to the fact that most NBA players are exceptionally tall and that players of different height are allocated to different positions. Table 7.6 checks the model assumption of normally distributed error terms. The residuals are tested by the Kolmogorov-Smirnov test, which rejects normality with a p-value of 0.00. Since the error term does not seem to be normally distributed, the estimator is not exactly normally distributed either, so standard errors and significance tests might be incorrect, which is a serious issue. Inference might still be asymptotically valid by the Central Limit Theorem. Nevertheless, asymptotic normality of the estimator is questionable with a sample size of 445.
|Tab. 7.5 OLS Regression Model Fit||Tab. 7.6 OLS Regression Residual Normality Test|
Second, the OLS model is estimated with quadratic terms for every variable to capture possible non-linearities. Table 7.7 gives the coefficient estimates. Table 7.8 shows that about 85 percent of the variance is explained by the model. The main difference compared to the previous model is that Experience and its quadratic term as well as the quadratic terms of Steals and Assists are significant. Furthermore, Field Goals Made and Weight enter the model quadratically. Signs of the coefficients are as expected except, again, for Steals. Interestingly, one year of experience increases the salary by about $283,000, but this bonus declines with every additional year as the negative sign of the respective quadratic term indicates. The quadratic terms of Weight, Field Goals Made and Assists indicate that gains in Salary0910 increase as the corresponding measures rise (an outstanding performance is rewarded disproportionately). Checking the normality assumption, table 7.9 shows that the residuals are not normally distributed. Again, the assumption of normality is violated and the same reasoning of asymptotic normality might apply as before.
|Tab. 7.7 OLS Regression Quadratic Terms Estimates|
|Tab. 7.8 OLS Regression Quadratic Terms Model Fit||Tab. 7.9 OLS Regression Quadratic Terms Residual Normality Test|
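The declining experience bonus can be made explicit with a short derivation. The symbols β₁ and β₂ for the coefficients of Experience and its square are introduced here for illustration only; treating the reported $283,000 as β₁ is an assumption, and the value of β₂ is taken from table 7.7 rather than stated in the text.

```latex
\begin{align*}
\text{Salary0910} &= \dots + \beta_1\,\text{Exp} + \beta_2\,\text{Exp}^2 + \varepsilon \\
\frac{\partial\,\text{Salary0910}}{\partial\,\text{Exp}} &= \beta_1 + 2\beta_2\,\text{Exp} \\
\text{Exp}^{*} &= -\frac{\beta_1}{2\beta_2} \qquad (\beta_1 > 0,\ \beta_2 < 0)
\end{align*}
```

With a positive β₁ and a negative β₂, the marginal effect of an additional year shrinks linearly with experience and turns negative beyond Exp*.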
Third, the model with quadratic terms is estimated by pseudo-maximum-likelihood (PMLE) with the help of the heteroscedasticity-robust White covariance matrix. PMLE is based on a possibly misspecified distribution (in this case a normal distribution) but remains consistent, at the cost of efficiency. Results of the final specification are reported in table 7.10. Insignificant variables are excluded step by step by eliminating the variable with the highest p-value at each step until all p-values are below 0.05. The final model specification provides parameter estimates similar to the second estimated model above except that U.S. college and the quadratic terms of Steals and Assists turn out to be insignificant. Again, table 7.11 rejects the assumption of normality because the p-value of the Kolmogorov-Smirnov test is 0.00. In this case, inference remains asymptotically valid although the errors are not normally distributed. Consistency is achieved by pseudo-maximum-likelihood estimation. Standard errors and significance tests of the parameter estimates are valid because the heteroscedasticity-robust White covariance estimator is applied. However, the estimator is no longer efficient, as indicated by the high standard errors (see Hayashi 2000 and Heij 2004).
|Tab. 7.10 PMLE White Regression Quadratic Terms|
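For reference, the heteroscedasticity-robust covariance matrix mentioned above is commonly written in the following "sandwich" form. This is the standard textbook expression; whether EViews applies an additional degrees-of-freedom correction is not stated in the text and is left open here. The notation (X for the regressor matrix, x_i for the regressor vector of observation i, ê_i for the OLS residual) is introduced for illustration.

```latex
\[
\widehat{\operatorname{Var}}(\hat{\beta})
  = (X'X)^{-1}\left(\sum_{i=1}^{n} \hat{e}_i^{\,2}\, x_i x_i'\right)(X'X)^{-1}
\]
```

Under homoscedasticity the middle term is approximately σ²X'X and the expression collapses to the usual OLS covariance σ²(X'X)⁻¹.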
The previously used performance measures might be too simplistic to capture what performance actually is. More variables are of interest, as suggested by the number of performance measures in table 2.1. All performance measures listed in the table are used for the analysis except Field Goals Made since it is a linear combination of the variables Two Points Made and Three Points Made. This section aims at finding a small number of latent factors of performance to reduce the number of variables. To get an idea of a reasonable number of factors, a principal component analysis is undertaken. Principal components are retained if their corresponding eigenvalue is larger than one. The analysis is based on the correlation matrix to ensure scale independence. Table 8.1 shows the eigenvalues and the total variance explained by the corresponding components. As can be seen, two components are retained which explain about 77.91 percent of the total variation. The scree plot in figure 8.1 plots the components against the corresponding eigenvalues. Two eigenvalues are above a value of one and thus two components are extracted. Table 8.2 gives the communalities after the component extraction. Whereas only 62.8 percent of the variation of Three Points Made is explained by the extracted components, 86.9 percent of the variation of Two Points Made is explained. The component matrix in table 8.3 shows the component loadings of the original variables. Component one might be interpreted as the ability of players near the basket inside the offense zone (indicated as a semicircle in figure 5.3) because variables known to be important in this zone load positively on it. Component two, however, cannot be interpreted in a reasonable fashion. Furthermore, too many variables load highly on component one, which also leads to interpretation difficulties. A rotation might overcome that problem and is undertaken below.
|Tab. 8.1 Principal Component Total Variance Explained||Fig. 8.1 Principal Component Scree Plot|
|Tab. 8.2 Principal Component Communalities||Tab. 8.3 Principal Component Factor Loadings|
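The eigenvalue-greater-than-one rule and the communalities can be illustrated with a small sketch. The eigenvalues and loadings below are hypothetical placeholders, not the values of tables 8.1-8.3; the functions only restate the definitions (share of variance as the sum of retained eigenvalues over the sum of all eigenvalues, communality as the sum of squared loadings of a variable).

```python
def kaiser_components(eigenvalues):
    """Return retained eigenvalues (> 1) and their cumulative share of variance."""
    retained = [ev for ev in eigenvalues if ev > 1]
    share = sum(retained) / sum(eigenvalues)
    return retained, share

def communalities(loadings):
    """Sum of squared loadings per variable across the retained components."""
    return [sum(value * value for value in row) for row in loadings]

# Hypothetical eigenvalues of a 4x4 correlation matrix (they sum to 4).
eigenvalues = [2.4, 1.2, 0.3, 0.1]
retained, share = kaiser_components(eigenvalues)

# Hypothetical loadings of two variables on the two retained components.
loadings = [[0.9, 0.1], [0.6, 0.5]]
h2 = communalities(loadings)
```

For a correlation matrix the eigenvalues sum to the number of variables, so the share of each component is simply its eigenvalue divided by that number.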
Since two components are identified with the help of principal component analysis, two factors are assumed for the factor analysis. To assess the sampling adequacy, the Kaiser-Meyer-Olkin measure is reported in table 8.4 with a value of 0.774. Thus, a factor analysis can be performed. Bartlett’s test of sphericity in the same table reports a p-value of 0.00. The null hypothesis of an identity correlation matrix is rejected (this test result is questionable since the test assumes an approximately multivariate normal distribution, which presumably is not fulfilled and is not tested here).
|Tab. 8.4 Factor Analysis KMO and Bartlett's Test|
Principal components and principal axis factoring are used as estimation methods (maximum likelihood estimation is left out, because most of the data are not normally distributed as shown in previous sections). The varimax rotation is applied and estimations are based on the correlation matrix. Most of the variables are well explained by the factors except Three Points Made as can be seen for both extraction methods by the communalities in table 8.5. Principal components as extraction method seems to generate factors that explain more variation of the variables than the principal axis factoring, because all communalities are larger for principal components. Table 8.6 shows the factor loadings of the rotated component matrices. Both methods approximately provide the same results, but factor loadings are not as strong and distinct for principal axis factoring as for principal components. Subsequently, only results of principal components are considered. Factor loadings of the variables change compared to principal component analysis without rotation. Figure 8.2 shows the component plot before rotation and figure 8.3 after rotation. The interpretation of the rotated factors is more distinct and intuitive. High loadings are highlighted with yellow. The first factor has high loadings for Three Points Made, Free Throws Made, Assists and Steals – variables that reflect game statistics of positions outside the offense zone (point and shooting guard). Thus, factor one might be interpreted as "game ability outside the offense zone". Factor two is highly loaded with variables that are associated with game positions inside the offense zone (center, small and power forward). Thus, factor two might be interpreted as "game ability inside the offense zone". Both factors seem to be reliable. 
Cronbach's Alpha for factor one is 0.717 and for factor two 0.725 which is above the required minimum of 0.7 (all variables are loaded positively on the respective factor such that calculating Cronbach's Alpha makes sense).
|Fig. 8.2 Component Plot before Rotation||Fig. 8.3 Component Plot after Rotation|
|Tab. 8.5 Factor Analysis Communalities|
|Tab. 8.6 Factor Analysis Factor Loadings|
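The reliability measure used above can be sketched with the standard formula for Cronbach's Alpha, α = k/(k−1) · (1 − Σ item variances / variance of the sum score). The data are hypothetical; whether the thesis computed Alpha on raw or standardized variables is not stated, so the sketch simply uses the values as given.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns observed on the same players."""
    k = len(items)
    totals = [sum(values) for values in zip(*items)]
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))

# Two hypothetical items measured on four players.
alpha_mixed = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]])      # 0.75
alpha_identical = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])  # 1.0
```

Identical items yield an Alpha of one, while weaker agreement between items lowers the value towards the commonly used threshold of 0.7.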
The paper analyzes how salaries of players in the National Basketball Association are determined with the help of different statistical tools. Salaries of the season 2009-10 are analyzed and their relationship with salaries of the previous season 2008-09 as well as physical and performance measures is examined. The data are analyzed for outliers in section three and missing values in section four. Section five shows that U.S. college attendance is negatively associated with a player's salary. This odd result might be interpreted such that U.S. college basketball provides the NBA with most of its players, whereas players from abroad or without college attendance must outperform their colleagues, which leads to exceptionally high salaries. Furthermore, it is shown that weight as a proxy for a player's athleticism is important for the position he plays. Heavy players are likely to be assigned to positions near the basket and light players are usually assigned to positions outside the offense zone. Moreover, it is shown that game positions do not have an influence on mean salaries. Section six analyzes dependencies between variables. It is shown that salaries of the season 2009-10 are highly correlated with salaries of the season 2008-09 - a hint that rigid wage institutions have a large influence on salary determination. Performance measures are moderately correlated with salaries. Scored field goals seem to be the most influential performance measure. Different linear regression model specifications are estimated in section seven. Overall, salaries of the previous season 2008-09 account for most of the variation in salaries of 2009-10. Additionally, field goals made and athleticism are important. Experience also seems to have a positive influence on salaries.
Nevertheless, the large influence of the previous season's salaries reflects inefficient, rigid salary institutions since performance does not seem to be the most important factor (market failures such as asymmetric information and uncertainty might be important). Section eight simplifies the question of how performance can be measured with the help of factor analysis. The extracted factors indicate that game ability inside the offense zone (two-point shots, blocks and rebounds) and game ability outside the offense zone (three-point shots, assists and steals) are important to assess a player's performance.
My friend's answer to the question of what determines a basketball player's salary: "No one knows" should rather be: "How good a player's past performance was and - more importantly - his current contract is".
- Kahn, L. & Sherer, P. 1988. Racial Differences in Professional Basketball Players' Compensation, Journal of Labor Economics, 6(1)
- Leonard, J. & Prinzinger, J. 1999. A determination of professional basketball salaries based on performance and race, Atlantic Economic Journal, 27(2), pp. 238
- Yu, K. et al. 2008. An Exploratory Study of Long-Term Performance Evaluation for Elite Basketball Players, International Journal of Sports Science and Engineering, 2(4), pp. 195-203
- Belsley, D., Kuh, E. & Welsch, R. 1980. Regression diagnostics: Identifying influential data and sources of collinearity, New York, Wiley
- Bleymüller, J., Gehlert, G. & Gülicher, H. 2004. Statistik für Wirtschaftswissenschaftler, München, Verlag Vahlen
- Choros, B. & Klinke, S. 2009. Course material of the course "Multivariate Statistical Analysis II" at Humboldt-Universität zu Berlin
- David, H., Hartley, H. & Pearson, E. 1954. The distribution of the ratio, in a single normal sample, of range to standard deviation, Biometrika, vol. 41, p. 491
- Eckstein, P. 2008. Angewandte Statistik mit SPSS. Praktische Einführung für Wirtschaftswissenschaftler, Wiesbaden, Gabler
- GraphPad. http://www.graphpad.com/quickcalcs/Grubbs1.cfm (access: 13.01.2010)
- Grubbs, F. & Beck, G. 1972. Critical values for six Dixon tests for outliers in normal samples up to sizes 100, and applications in science and engineering, Technometrics, Vol. 14 (4), p. 848.
- Grüner, H. http://gruener.userpage.fu-berlin.de/spss-tutorials.htm#deskrip1 (access: 25.09.2009)
- Härdle, W. & Simar, L. 2003. Applied multivariate statistical analysis, Berlin, Springer
- Hayashi, F. 2000. Econometrics, Princeton University Press, pp. 53-54
- Heij, C. et al. 2004. Econometric methods with applications in business and economics, Oxford University Press, pp. 259-260
- Rönz, B. 2000. Script of the course: "Computergestützte Statistik" at Humboldt-Universität zu Berlin
- SPSS. 2007. SPSS missing value analysis TM 16.0, Chicago, SPSS Inc.
- Stock, J. & Watson, M. 2007. Introduction to econometrics (Brief Edition), Boston, Pearson