OLYMPIC DATA ANALYSIS



Abstract

In this thesis we analyse women's track results from the 1984 Olympic Games, using descriptive statistics, outlier tests, and distribution estimation and testing, in order to find the relationship between a country's performance in short-distance and long-distance running. We apply statistical methods such as regression, implemented in XploRe, and the results of the analysis support our assumption. The detailed method and data analysis are presented below.

Key Words

running performance, outlier test, density test, correlation, regression, variable transformation

Background and Data Set

We retrieved our data set from the 1984 Olympic records collected by Johnson & Wichern [1]. The data show the results of women's track competitions for athletes from 55 countries in the 100m, 200m, 400m, 800m, 1500m, 3000m and marathon. Originally the data set contains 55 observations and 8 variables.

The data are given in the following table. Times for the 100m, 200m and 400m are in seconds; times for the 800m, 1500m, 3000m and marathon are in minutes.

National Track Records for Women
OBS   COUNTRY     100m    200m    400m   800m   1500m   3000m   Marathon
  1   argentin   11.61   22.94   54.50   2.15    4.43    9.79    178.52
  2   australi   11.20   22.35   51.08   1.98    4.13    9.08    152.37
  3   austria    11.43   23.09   50.62   1.99    4.22    9.34    159.37
  4   belgium    11.41   23.04   52.00   2.00    4.14    8.88    157.85
  5   bermuda    11.46   23.05   53.30   2.16    4.58    9.81    169.98
  6   brazil     11.31   23.17   52.80   2.10    4.49    9.77    168.75
  7   burma      12.14   24.47   55.00   2.18    4.45    9.51    191.02
  8   canada     11.00   22.25   50.06   2.00    4.06    8.81    149.45
  9   chile      12.00   24.52   54.90   2.05    4.23    9.37    171.38
 10   china      11.95   24.41   54.97   2.08    4.33    9.31    168.48 
 11   columbia   11.60   24.00   53.26   2.11    4.35    9.46    165.42
 12   cookis     12.90   27.10   60.40   2.30    4.84   11.10    233.22
 13   costa      11.96   24.60   58.25   2.21    4.68   10.43    171.80
 14   czech      11.09   21.97   47.99   1.89    4.14    8.92    158.85
 ...  .....       ...     ...     ...     ...     ...    ...      ...
 36   mauritiu   11.76   25.08   58.10   2.27    4.79   10.90    261.13
 ...  .....       ...     ...     ...     ...     ...    ...      ...
 55   wsamoa     12.74   25.85   58.73   2.33    5.81   13.04    306.00
 ...  .....       ...     ...     ...     ...     ...    ...      ...

Analysis Target

Our aim is to find the relationship between a country's performance in short-distance running and its performance in long-distance running. The result should be given in the form of a regression formula.

Analytical Method

Data modification

For further comparison, we convert every result into the time needed to run 100 meters in that event, in seconds, so that all events are on the same scale. The calculation formula is given by

\text{Time per 100m} = \frac{\text{time in seconds}}{\text{distance in meters} / 100}

where times recorded in minutes (800m and longer) are first multiplied by 60. We now have the data in a uniform unit, ready for the next step's calculation.
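The conversion can be sketched in Python (not the XploRe used in this thesis). The function and dictionary names are hypothetical; the marathon distance is rounded to 42,200 m to match the divisor 422 in the XploRe code further below, and the example row is Argentina from the table above.

```python
# Distances in meters; the marathon is rounded to 42,200 m, matching the
# divisor 422 used in the thesis code (an assumption carried over from it).
DISTANCE_M = {"100m": 100, "200m": 200, "400m": 400, "800m": 800,
              "1500m": 1500, "3000m": 3000, "marathon": 42200}
# Events whose records are given in minutes rather than seconds.
IN_MINUTES = {"800m", "1500m", "3000m", "marathon"}


def seconds_per_100m(event, time):
    """Return the time needed per 100 m, in seconds."""
    seconds = time * 60 if event in IN_MINUTES else time
    return seconds / (DISTANCE_M[event] / 100)


# Argentina (observation 1 in the table)
argentina = {"100m": 11.61, "200m": 22.94, "400m": 54.50, "800m": 2.15,
             "1500m": 4.43, "3000m": 9.79, "marathon": 178.52}
per_100m = {event: seconds_per_100m(event, t) for event, t in argentina.items()}
for event, value in per_100m.items():
    print(f"{event:>8}: {value:.3f}")
```

On this scale the 100m value is unchanged, while the marathon value is the record in seconds divided by 422.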

Another modification is made when the explanatory and dependent variables are defined. The 100m, 200m and 400m are defined as explanatory variables (X_1, X_2, X_3), while the marathon, the most representative long-distance event, is the dependent variable (Y). Because including the 800m would make the definition of long-distance running ambiguous, these data are excluded from our analysis.

Descriptive Graphing

Univariate Graphing

In order to check the distributions and the existence of outliers, we generate box plots [2] of our transformed data, shown below. (Note that the x-axis of a box plot has no meaning.) The plots show that a few outliers exist. They all appear on the upper side, pulling the mean above the median in every variable except the 800m. Although the 800m has no outliers, its distribution is still skewed. Before deciding whether to remove outliers, we need to find out whether special reasons lie behind them. We also need to check in the multivariate setting whether observations that are outliers in one dimension are still outliers in the other dimensions. If they are not, they may carry important information, and wrongly excluding them could lead to unsatisfactory results in later analysis.

1-1 Boxplot of running data



Multivariate Graphing

Star Diagram
1-2 Star diagram of running data




A star diagram shows multidimensional data in the plane. Each axis of a star represents one variable in the dataset, and the dataset is standardized to a common interval. Multidimensional outliers are then easy to identify, because their stars tend to be rounder or to form a larger heptagon.

In the diagram we can identify several possible outliers: the 12th, 13th, 36th and 55th countries. The 12th country is the Cook Islands in the South Pacific Ocean and the 13th is Costa Rica in Central America, while the 36th and 55th countries are Mauritius, an island nation in the Indian Ocean, and Western Samoa in the South Pacific.

At the same time, the stars of some countries have contracted almost to a point. Referring back to the dataset, these are the 19th (German Democratic Republic), 53rd (USA) and 54th (USSR) countries. The 20th country, the Federal Republic of Germany, lags a little behind in track performance. So far it is still unclear whether subgroups exist in our data; this requires future work.






PCP
1-3 PCP of running data


Plotting the dataset in parallel coordinates yields a similar result. In this diagram we partition the countries according to their marathon performance: countries whose marathon value is larger than the median are colored blue, the others red. The two curves that clearly drift away from the rest represent the 12th and 55th countries.

Because the PCP is drawn so that the largest value of each variable is set to 1 and every other value is shown as a proportion of it, we can see that the two slowest countries differ less from the other countries in short-distance running than they do in long-distance running.

A strong correlation can also be observed between the 4th and 5th variables. However, we did not see anything that clearly indicates subgroups in our dataset.
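The scaling described above, where each variable is divided by its largest value, can be sketched in Python (not the XploRe used in this thesis). The function name is hypothetical, and the maximum is taken only over the three example rows from the data table (Canada, the Cook Islands and Western Samoa), not over the full dataset.

```python
def scale_by_column_max(rows):
    """Divide each column by its maximum so the largest value becomes 1."""
    maxima = [max(column) for column in zip(*rows)]
    return [[value / m for value, m in zip(row, maxima)] for row in rows]


# Columns: 100m (seconds), marathon (minutes).
# Rows: Canada, Cook Islands, Western Samoa (taken from the data table).
rows = [[11.00, 149.45], [12.90, 233.22], [12.74, 306.00]]
scaled = scale_by_column_max(rows)
for name, values in zip(["canada", "cookis", "wsamoa"], scaled):
    print(name, [round(v, 3) for v in values])
```

Within this small subset the two slow countries are almost indistinguishable on the 100m axis but clearly separated on the marathon axis, which is the pattern the paragraph describes.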










Andrews curves
1-4 Andrews curves of running data


The Andrews curves are partitioned in the same fashion as the PCP. Again a few curves drift away from the others, most obviously those of the 12th, 13th, 36th and 55th countries. The diagram also shows that the blue and red curves are separated quite well, apart from a few that remain entangled. Hence we conclude that using the 7th variable to partition the dataset is reasonable.
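For reference, an Andrews curve maps an observation (x_1, ..., x_p) to the function f(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + ... on -\pi \le t \le \pi. The Python sketch below evaluates this standard definition; it is not the XploRe routine used for the figure, and the function name is hypothetical.

```python
import math


def andrews_curve(x, t):
    """Evaluate the Andrews curve of observation x at angle t (radians)."""
    value = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # frequency: 1, 1, 2, 2, 3, 3, ...
        # odd positions use sine, even positions use cosine
        value += xi * (math.sin(k * t) if i % 2 == 1 else math.cos(k * t))
    return value


# Evaluate one curve on a coarse grid of angles
grid = [-math.pi + j * math.pi / 4 for j in range(9)]
curve = [andrews_curve([1.0, 2.0, 3.0], t) for t in grid]
```

Observations that are close in all variables give curves that stay close for every t, so outlying countries appear as curves that leave the main bundle.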
















Face plot

1-5 Flury faces plot of running data


100m = mouth size, 200m = pupil size, 400m = eye slant, 800m = upper hairline, 1500m = lower hairline, 3000m = face line, marathon = darkness of hair.

Flury faces help to visualize the dataset more vividly. Because we assign the most distinctive features, face line and hair darkness, to the 3000m and the marathon, a "squared" face and dark hair suggest poor performance in both events. The 55th country again stands out. The 12th country, however, is less unusual, as several faces look similar: the 23rd (Guatemala), the 36th (Mauritius) and the 41st (Papua New Guinea).



Scatter Plot

In order to illustrate the general relation between each explanatory variable and the dependent variable, a series of scatter plots is drawn. From this point on we begin to select variables. As the 800m is ambiguous between short distance and long distance, we exclude it from the scatter plots. The plot of Y against X_3 suggests a quadratic relationship, so we add an X_3^2 term to our model (the XploRe results also show that, after adding this term, our model explains more of the variation in the original data).

1-6 Scatter plot of running data

Outliers Detection and Test

For simplicity and better focus, in the following we restrict ourselves to the 4 variables related to our model: X_1 (100m), X_2 (200m) and X_3 (400m) as explanatory variables and the marathon (Y) as the dependent variable.

Although we identified some potential outliers graphically, they have not yet been formally shown to be outliers, and more careful work is required. In statistics, an outlier is defined as a single observation "far away" from the rest of the data. We can detect outliers using several techniques. Once potential outliers are located, we should not simply delete them. If an outlier stems from a recording error, it is reasonable to delete it. Otherwise, it may simply indicate a different type of response. [3]

Outliers can be located through linear model fitting. We use forward-selection linear regression to approximate marathon performance (Y) by short-distance performance X = (X_1, X_2, X_3) and then analyse the residuals. We first plot the standardized residuals against the fitted values of Y, as follows.

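; read the raw data (seven event columns)
rundata=read("running")
; convert every event to seconds per 100m
rundatao=rundata[,1]~rundata[,2]./2~rundata[,3]./4~rundata[,4]*60./8~rundata[,5]*60./15 ~ rundata[,6]*60./30 ~ rundata[,7]*60./422
x1=rundatao[,1]
x2=rundatao[,2]
x3=rundatao[,3]
; explanatory variables (100m, 200m, 400m) and dependent variable (marathon)
X=x1~x2~x3
Y=rundatao[,7]
; forward-selection regression, then residual analysis
{xfs,bfs}=linregfs(X,Y, 0.05)
{res,out}=linregres(xfs, Y, xfs*bfs)
; standardized residuals and their squares
std=res[,3]
stdsqr=res[,3]^2
stdsqr
; plot the standardized residuals against the fitted values
d= createdisplay(1,1)
show(d, 1, 1, xfs*bfs~std)
setgopt(d, 1, 1, "xlabel","Fitted Value","ylabel","Standard Residuals")
; critical value of the outlier test
sqrt(51*0.1545)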


1-7 Data Fitting


At first glance, the plot shows some influential observations (outliers). However, a more detailed analysis and a formal outlier test are needed before any final decision is made. Our outlier test is based on standardized residuals. We first assume normally distributed errors, e_t \sim N(0, \sigma^2) i.i.d. Then, for a potential outlier, the t-th observation, we have two hypotheses:

Hypothesis H : E(y) = X\beta

Alternative \bar{H}_t : E(y) = X\beta + \delta_t , where \delta_t is a T-vector of zeros except for the t-th component.

We test the hypothesis using the statistic d_t := \frac{\tilde{e}_t^2}{T-K} \sim beta\left(\frac{1}{2}, \frac{T-K-1}{2}\right), where \tilde{e}_t is the t-th standardized residual.

This \alpha-level test rejects H if d_t is larger than the (1 - \alpha)-quantile of the corresponding beta distribution. According to Barnett & Lewis and Cran et al., the critical value is \sqrt{(T-K) d_\alpha^{u}}, where d_\alpha^{u} is the \left(1 - \frac{\alpha}{T}\right)-quantile of the beta\left(\frac{1}{2}, \frac{T-K-1}{2}\right) distribution. Here we set \alpha = 0.2 (in this test a larger \alpha than usual is normally selected). As XploRe has no generator for critical values of the beta distribution, we use MATLAB, which gives 0.1545 as the corresponding quantile. We then reject the null hypothesis if our test statistic is higher than the critical value \sqrt{51 \cdot 0.1545} = 2.807. The outcomes are shown in the following table.

Test Outcome
[ 1,] 0.026947
[ 2,] 0.011173
[ 3,] 0.15936
..... ........
[ 13,] 3.7855
..... ........
[ 36,] 5.8698
..... ........
[ 55,] 19.196

table 2 Outlier Test Outcome


After checking, we find that observations 13, 36 and 55 have test statistics larger than our critical value, which leads us to reject the null hypothesis for them. For observations 36 and 55, the high test statistic mainly results from their abnormal marathon performance, where they take almost twice as long as the best country. This can be regarded as highly abnormal, and we delete them from further analysis. For the 13th country, Costa Rica, however, the high value arises because it takes less time in almost every event, which is desirable in the Olympic Games, so we do not delete it. Although the 12th country, the Cook Islands, behaves like a potential outlier in the previous graphs, its abnormality comes from variables other than those we are interested in here. Therefore, after the outlier test, we delete only the 36th and 55th observations; the remaining ones are used for further analysis.
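The decision rule can be reproduced with a few lines of Python (not the XploRe or MATLAB used in this thesis). The sketch takes the beta quantile 0.1545 and T - K = 51 from the text rather than recomputing them, and compares the tabulated outcomes directly with the critical value, as the text does; the variable names are hypothetical.

```python
import math

T_MINUS_K = 51          # value of T - K used in the text
BETA_QUANTILE = 0.1545  # beta quantile reported in the text (from MATLAB)

critical_value = math.sqrt(T_MINUS_K * BETA_QUANTILE)

# Test outcomes listed in Table 2 (observation number -> statistic)
outcomes = {1: 0.026947, 2: 0.011173, 3: 0.15936,
            13: 3.7855, 36: 5.8698, 55: 19.196}

# Observations whose statistic exceeds the critical value
flagged = sorted(obs for obs, stat in outcomes.items() if stat > critical_value)
print(round(critical_value, 3), flagged)
```

Only the rows printed in Table 2 are included here; the elided rows of the table are not reconstructed.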

Distribution Test

In this part we use averaged shifted histograms to get a general impression of, and to estimate, the distributions of our explanatory variables, which are needed for further analysis.

First, we use the default bandwidth to plot the averaged shifted histograms (first column) for the x1 and x2 variables. At first sight they look roughly normal, although each plot shows three peaks. Theoretical research shows, however, that the optimal bandwidth is of order n^{-1/3}, which minimizes the AMISE of the density estimator. If the true density is normal (in practice we cannot know this, so the resulting bandwidth is only a rule of thumb) [4], the optimal bandwidth is 3.5 N^{-1/3}, which is 0.9203 in our model. With this bandwidth (right-hand column), the histograms are much smoother and look normal. However, a formal test should be carried out carefully before any conclusion is drawn.
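The rule-of-thumb bandwidth quoted above is a one-line calculation, sketched here in Python (not XploRe). The function name and the sigma parameter are assumptions: this rule is usually stated with the sample standard deviation as an extra factor, and sigma = 1 reproduces the value 0.9203 given in the text for N = 55.

```python
def rule_of_thumb_bandwidth(n, sigma=1.0):
    """Rule-of-thumb histogram bandwidth 3.5 * sigma * n**(-1/3)."""
    return 3.5 * sigma * n ** (-1 / 3)


h = rule_of_thumb_bandwidth(55)
print(round(h, 4))
```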


rundata=read("running1")
rundatao=rundata[,1]~rundata[,2]./2~rundata[,3]./4~rundata[,4]*60./8~rundata[,5]*60./15 ~ rundata[,6]*60./30 ~ rundata[,7]*60./422
t1  =(rundatao[,1])
t2  =(rundatao[,2])
bp11 = grash(t1)
bp12 = grash(t1,0.9203)
bp21 = grash(t2)
bp22 = grash(t2,0.9203)
d= createdisplay(2,2)
show(d, 1, 1, bp11)
show(d, 1, 2, bp12)
show(d, 2, 1, bp21)
show(d, 2, 2, bp22)
setgopt(d, 1, 1, "xlabel","100M'(Default Bandwidth)")
setgopt(d, 1, 2, "xlabel","100M'(Optimal Bandwidth)")
setgopt(d, 2, 1, "xlabel","200M'(Default Bandwidth)")
setgopt(d, 2, 2, "xlabel","200M'(Optimal Bandwidth)")


1-8 Density Estimation of running data


Similarly, we can plot x3 and y to check their distributions. Note that y appears to be skewed to the left, which suggests it is not normally distributed. The following test leads to the same conclusion about the distribution of y.


As our sample size is more than 50, our test is based on skewness and kurtosis rather than on the Shapiro-Wilk test (which applies where n is less than or equal to 50).

Let \mu_p = E(y - E(y))^{p} denote the p-th central moment of a random variable y. Then \gamma_1 = \frac{\mu_3}{\sqrt{\mu_2^3}} and \gamma_2 = \frac{\mu_4}{\mu_2^2} are called skewness and kurtosis, respectively.

For a sample y_1, ..., y_T, the empirical central moments are defined by \hat{\mu}_p = \frac{1}{T} \sum_{t=1}^{T} (y_t - \bar{y})^p , (p = 2, 3, ...)

Now \hat{\gamma}_1 = \frac{\hat{\mu}_3}{\sqrt{\hat{\mu}_2^3}} (sample skewness)

\hat{\gamma}_2 = \frac{\hat{\mu}_4}{\hat{\mu}_2^2} (sample kurtosis)

For a normal distribution, both the skewness and the excess (i.e. kurtosis - 3) are zero. The Jarque-Bera test compares the skewness and kurtosis of the data with those of a normal distribution.

The test statistic is given by \frac{T}{6}\left(\hat{\gamma}_1^2 + \frac{(\hat{\gamma}_2 - 3)^2}{4}\right) \sim \chi^2_2 (asymptotically). Here our hypotheses are H : \gamma_1 = 0 and \gamma_2 = 3 , \bar{H} : \gamma_1 \ne 0 or \gamma_2 \ne 3 .

Running our XploRe program gives test statistics of 1.25, 4.29 and 0.26 for X_1, X_2 and X_3, all smaller than the critical value 5.9915 of the \chi^2 distribution with 2 degrees of freedom at the 5% significance level, while Y has a statistic of 19.427, much higher than 5.9915. This confirms our previous assumption: we cannot reject that X_1, X_2 and X_3 are normally distributed, while we should reject the assumption that Y is normally distributed.
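The test statistic and its critical value can be sketched in Python using only the standard library (this is not the XploRe program used in this thesis, and the function names are hypothetical). The critical value uses the fact that the \chi^2 distribution with 2 degrees of freedom has distribution function 1 - e^{-x/2}, so its quantile has a closed form.

```python
import math


def jarque_bera(sample):
    """Jarque-Bera statistic T/6 * (skewness^2 + (kurtosis - 3)^2 / 4)."""
    T = len(sample)
    mean = sum(sample) / T

    def central_moment(p):
        # empirical p-th central moment with divisor T
        return sum((y - mean) ** p for y in sample) / T

    skewness = central_moment(3) / math.sqrt(central_moment(2) ** 3)
    kurtosis = central_moment(4) / central_moment(2) ** 2
    return T / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


def chi2_2df_quantile(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    return -2 * math.log(1 - p)


critical = chi2_2df_quantile(0.95)

# Statistics reported in the text for X_1, X_2, X_3 and Y
reported = {"X_1": 1.25, "X_2": 4.29, "X_3": 0.26, "Y": 19.427}
rejected = [name for name, stat in reported.items() if stat > critical]
print(round(critical, 4), rejected)
```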

Correlation

The correlation coefficients, as output by XploRe, are:


[1,]        1   0.95942   0.83548   0.70676   0.71094   0.73618   0.71441 
[2,]  0.95942         1   0.83954   0.68462   0.68498   0.69678   0.68757 
[3,]  0.83548   0.83954         1   0.88193   0.80358   0.78087   0.68912 
[4,]  0.70676   0.68462   0.88193         1   0.93746   0.87528   0.7693 
[5,]  0.71094   0.68498   0.80358   0.93746         1   0.94919   0.81305 
[6,]  0.73618   0.69678   0.78087   0.87528   0.94919         1   0.83641 
[7,]  0.71441   0.68757   0.68912   0.7693    0.81305   0.83641         1


In accordance with the scatter plots, a large correlation coefficient is observed between X_1 and X_2. If we included both of them directly as regressors, this would cause a serious multicollinearity problem and make our parameter estimates very unstable. We therefore first apply a transformation to eliminate this effect.
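As an illustration of how such a coefficient is obtained, the Python sketch below computes the Pearson correlation between the 100m and 200m times of the first 14 countries listed in the data table. It is not the XploRe call that produced the matrix above, the function name is hypothetical, and because it uses only 14 of the observations its result will differ from the 0.95942 reported for the full dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# 100m and 200m records (seconds) of observations 1 to 14 in the data table
m100 = [11.61, 11.20, 11.43, 11.41, 11.46, 11.31, 12.14,
        11.00, 12.00, 11.95, 11.60, 12.90, 11.96, 11.09]
m200 = [22.94, 22.35, 23.09, 23.04, 23.05, 23.17, 24.47,
        22.25, 24.52, 24.41, 24.00, 27.10, 24.60, 21.97]
r = pearson(m100, m200)
print(round(r, 3))
```

The coefficient is unchanged by the per-100m conversion, since dividing a variable by a positive constant does not affect the correlation.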

Variable Transform

By regressing X_1 (100m) on X_2 (200m), we obtained the formula: \hat{X_1} = 0.9824 X_2

The explanatory variables are transformed using the coefficient 0.9824. After the X_3^2 term is added, the adjusted R^2 increases from 0.51158 to 0.56684. The new explanatory variables are given in the following form.

X_1' = X_1 - 0.9824 X_2 , X_1' is the residual from regressing X_1 on X_2

X_2' = X_1 + 0.9824 X_2 , X_2' is the original X_1 plus the X_1 estimated from X_2

X_3  (There is no change of this variable)

X_4 = X_3^2
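The transformation can be written out as a short Python sketch (not the XploRe used in this thesis). The function name is hypothetical, the coefficient 0.9824 is taken from the text, and the example input is Argentina's per-100m values derived from the data table.

```python
COEF = 0.9824  # slope from regressing X_1 on X_2, as reported in the text


def transform(x1, x2, x3):
    """Return (X_1', X_2', X_3, X_4) from per-100m times X_1, X_2, X_3."""
    x1_prime = x1 - COEF * x2   # part of X_1 not explained by X_2
    x2_prime = x1 + COEF * x2   # X_1 plus its estimate from X_2
    return x1_prime, x2_prime, x3, x3 ** 2


# Argentina: 100m = 11.61 s, 200m = 22.94 / 2 s, 400m = 54.50 / 4 s per 100 m
x1p, x2p, x3, x4 = transform(11.61, 22.94 / 2, 54.50 / 4)
print(round(x1p, 4), round(x2p, 4), x3, round(x4, 4))
```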

XploRe Modelling

Once the data are transformed, further analysis can be carried out.

Regression of Y on X_1', X_2', X_3, X_4

Our model explains about 56% of the variation in the original marathon data. At the 5% significance level, the P-values of X_2', X_3 and X_4 are below 5%, which shows that they are significant, while X_1' is not significant because its P-value is larger than 5%. This leads us to remove X_1' from the model, which is done next.

 

[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                   288.012     4    72.003      18.012   0.0000"
[ 5,] "Residuals                    191.882    48     3.998"
[ 6,] "Total Variation              479.894    52     9.229"
[ 7,] ""
[ 8,] "Multiple R      = 0.77470"
[ 9,] "R^2             = 0.60016"
[10,] "Adjusted R^2    = 0.56684"
[11,] "Standard Error  = 1.99938"
[12,] ""
[13,] ""
[14,] "PARAMETERS         Beta         SE         StandB        t-test   P-value"
[15,] "________________________________________________________________________"
[16,] "b[ 0,]=        202.8759      88.4055       0.0000         2.295   0.0262"
[17,] "b[ 1,]=          2.8905       2.0601       0.1567         1.403   0.1670"
[18,] "b[ 2,]=          1.5008       0.5861       0.4687         2.561   0.0136"
[19,] "b[ 3,]=        -33.8212      13.1570      -7.0654        -2.571   0.0133"
[20,] "b[ 4,]=          1.3322       0.4947       7.4381         2.693   0.0097"
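The summary figures in this output follow from the ANOVA table by simple arithmetic. The Python sketch below (not XploRe; variable names are hypothetical) recomputes R^2, the adjusted R^2 and the F statistic from the sums of squares and degrees of freedom printed above.

```python
# Sums of squares and degrees of freedom from the ANOVA table above
ss_reg, ss_res, ss_tot = 288.012, 191.882, 479.894
df_reg, df_res, df_tot = 4, 48, 52

r2 = 1 - ss_res / ss_tot
# adjusted R^2 compares mean squares instead of sums of squares
adj_r2 = 1 - (ss_res / df_res) / (ss_tot / df_tot)
f_stat = (ss_reg / df_reg) / (ss_res / df_res)

print(round(r2, 5), round(adj_r2, 5), round(f_stat, 3))
```

The adjusted R^2 is the figure behind the statement that the model explains about 56% of the variation.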

Regression of Y on X_2', X_3, X_4

Next we remove X_1' from our model. X_2', X_3 and X_4 = X_3^2 are then all significant at the 5% significance level.

[ 1,] ""
[ 2,] "A  N  O  V  A                   SS      df     MSS       F-test   P-value"
[ 3,] "_________________________________________________________________________"
[ 4,] "Regression                   280.143     3    93.381      22.907   0.0000"
[ 5,] "Residuals                    199.751    49     4.077"
[ 6,] "Total Variation              479.894    52     9.229"
[ 7,] ""
[ 8,] "Multiple R      = 0.76404"
[ 9,] "R^2             = 0.58376"
[10,] "Adjusted R^2    = 0.55828"
[11,] "Standard Error  = 2.01905"
[12,] ""
[13,] ""
[14,] "PARAMETERS         Beta         SE         StandB        t-test   P-value"
[15,] "________________________________________________________________________"
[16,] "b[ 0,]=        197.5929      89.1939       0.0000         2.215   0.0314"
[17,] "b[ 1,]=          1.2263       0.5579       0.3830         2.198   0.0327"
[18,] "b[ 2,]=        -32.0543      13.2254      -6.6963        -2.424   0.0191"
[19,] "b[ 3,]=          1.2654       0.4973       7.0653         2.545   0.0141"

Final Model and prediction

Finally we have our model, which describes the relation between the explanatory variables and the dependent variable.

Y = 197.59 + 1.23 X_2' - 32.05 X_3 + 1.27 X_3^2 = 197.59 + 1.23 X_1 + 1.21 X_2 - 32.05 X_3 + 1.27 X_3^2

To test this model, we also generate predictions from it and compare them with the true data. The graphs are shown below. We find that our model fits the data well except in some extreme situations.
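As a rough check of the final formula, the Python sketch below (not XploRe; the function name is hypothetical) evaluates the model with the rounded coefficients given above for Argentina, and also computes the 400m value at which the quadratic term makes the predicted marathon time smallest. All inputs and outputs are in seconds per 100 m.

```python
def predict_marathon(x1, x2, x3):
    """Predicted marathon time per 100 m from per-100m times X_1, X_2, X_3."""
    x2_prime = x1 + 0.9824 * x2
    return 197.59 + 1.23 * x2_prime - 32.05 * x3 + 1.27 * x3 ** 2


# Argentina: per-100m values derived from the data table
prediction = predict_marathon(11.61, 22.94 / 2, 54.50 / 4)
actual = 178.52 * 60 / 422

# Vertex of the parabola in X_3: the predicted marathon time is smallest here
x3_turning = 32.05 / (2 * 1.27)

print(round(prediction, 2), round(actual, 2), round(x3_turning, 2))
```

Because the coefficients are rounded to two decimals, the result is only an approximation of what the fitted model would give.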



1-9 Prediction of running data

Conclusion and Explanation

Based on the exploration and the construction of the final linear regression model above, we can now draw our conclusion. A quadratic polynomial relation exists between performance in the 100m, 200m and 400m and performance in the marathon. If X_1 or X_2 increases by one unit, Y increases by about 1.23 or 1.21 units, respectively. Marathon speed increases at first as 400m speed goes up, but after reaching its maximum, marathon speed decreases as 400m speed continues to increase. This may result from the physical condition of women athletes in different countries, or from differences in government support for long-distance and short-distance running.

References

1. Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis. Prentice-Hall International, USA.

2. Härdle, W., Klinke, S. and Müller, M. XploRe Learning Guide.

3. Dr. Droge. Selected topics in economics. Humboldt-Universität zu Berlin.

4. Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York, Chichester.