Analysis of a Financial Dataset (DAX companies)
Please do not cite work from this wiki, since these are mainly students theses which may contain errors!
- 1 Introduction
- 2 Exploring the Data
- 3 Statistical Inference
- 3.1 PP Plots and QQ Plots for Daily Data
- 3.2 Jarque-Bera Test for Daily Data
- 3.3 Kolmogorov-Smirnov Test for Daily Data
- 3.4 Constructing a New Dataset
- 3.5 PP Plots and QQ Plots for Monthly Data
- 3.6 Jarque-Bera Test for Monthly Data
- 3.7 Kolmogorov-Smirnov Test for Monthly Data
- 3.8 T-Test for Monthly Data
- 3.9 F-Test for Monthly Data
- 4 Regression Analysis
- 5 A Risk Management Application
- 6 Conclusions
- 7 References
- 8 Comments
The purpose of this paper is to analyze a financial dataset using various statistical methods. It builds on knowledge from previous courses and projects and is intended to implement new elements from the Applied Quantitative Methods class. It is also an exercise in view of the fact that some techniques can be applied to datasets with particular features, so it is interesting to see to which extent these techniques can be used to analyze financial data.
The Initial Dataset
In the course of the study, the data will be transformed or additional datasets will be created for the purpose of the analysis. When mentioning "the dataset", a reference is made to the initial dataset, and all subsequent datasets are derived from this one. In order to avoid misrepresentations, the derived datasets will be referred to in an explicit manner.
- Period: 1. January 1999 - 31. December 2007
- Daily observations of 9 DAX companies
- 2347 trading days, 21123 observations
- Data downloaded from Datastream
- Returns are calculated using
where is the level of the stock price on day t.
The 9 companies are: Allianz, BMW, Continental, Daimler, Lufthansa, Münchener Rück, Siemens, TUI and Volkswagen. By grouping them on fields of activity, the following categories can be formed:
- Automobile: BMW, Continental, Daimler and Volkswagen
- Insurance: Allianz and Münchener Rück
- Transportation & Logistics: Lufthansa, TUI
- Industrial: Siemens.
|MUV2||Münchener Rück AG||3.60%|
The weights of the companies in the DAX index change frequently (daily basis), the ones from the table are taken from
Exploring the Data
In this section, the descriptive statistics are presented, namely Mean, Variance, Standard Deviation, Minimum, Maximum, Skewness and Kurtosis. To calculate them, XploRe was used.
The plot represents the time series of log-returns for the 9 companies. The axes have the same scale, to insure comparability. One can get an idea about the volatility of the stocks. Generally, the daily log-returns are well within the [-15%,+15%] variation interval. However, there are observations that exceed these margins:
- decreases that exceed -15%: Allianz, Lufthansa, Münchener Rück, TUI
- increases that exceed +15%: Lufthansa, Münchener Rück.
The correlation matrix is a simple and useful tool that enables one to see the relationships between variables. The correlation coefficient between two variables ranges between [-1,1], but what is considered low, medium and high correlation respectively, is subject to a convention.
In the present case:
- all correlations are positive
- considered low corelation to be between [0,0.35], medium between (0.35,0.60), high between [0.60,1]
- most correlations are medium
- there are only 2 low correlations and 2 high correlations.
Considering the high correlations:
- a correlation coefficient of 0.78 indicates a strong direct relationship between Allianz and Münchener Rück, which can be explained by the fact that both companies are from the insurance field
- there is also a high positive correlation between Daimler and Volkswagen, both from the automobile industry, with a coefficient of 0.64.
Considering the low correlations:
- Continental has low correlations with Siemens and TUI
- all companies except the ones from the automobile sector seem to have relatively lower correlation coefficients with Continental.
Another helpful exploratory tool is the boxplot. Different software packages have different ways of drawing boxplots. In this case Matlab was used; by default, the red line represents the median, the edges of the box are at the lower and upper quartiles, respectively. The whiskers have the maximum length 1.5*Inter-quartile range. The points outside the whiskers represent the outliers, depicted by "+".
It can be inferred from the plot that:
- the median is close to zero in all cases, the same as the mean, which was included in the Descriptive Statistics table
- Daimler and Volkswagen have the smallest spread of log-return values
- the insurance companies have a large number of outliers and seem to be more volatile.
In the practice of statistics, there exists a variety of parametric and non-parametric tests that can be performed in order to test hypotheses about the data.
- Parametric tests: the distribution of the population is fixed, usually the normal distribution is considered
- Non-parametric tests: the distribution of the population is free, usually it is based on ranks
PP Plots and QQ Plots for Daily Data
It is relevant to see whether the log-returns of the companies are normally distributed or not. In this respect, the Probability-Probability Plot and the Quantile-Quantile Plot are used.
From both pictures it appears that the empirical distributions (blue) of log-returns do not fit to the normal distribution (red). Further analysis can be conducted by means of a Jarque-Bera test.
Jarque-Bera Test for Daily Data
The Jarque-Bera Test is a non-parametric test of normality in the data.
The hypotheses of the test are:
- H0: data are normally distributed
- H1: data are not normally distributed
The value of the test is computed according to the formula:
where n is the number of observations, S represents the Skewness and K the Kurtosis of the data.
The critical value for the 5% significance level stands at 5.99. A rejection of H1 does not automatically imply normality.
The results of the test for the daily data are summarized in the table below.
The values of the Jarque-Bera test are larger than 5.99 in all cases, and it can be concluded that none of the stocks have normally distributed daily log returns. Particularly, Allianz, Lufthansa, Münchener Rück and TUI have very large values for the test, which indicates that they are the furthest away from the normal distribution.
Kolmogorov-Smirnov Test for Daily Data
Another way to test whether the data are normal or not is the Kolmogorov-Smirnov test. One way to perform this test is by using SPSS. "The One-Sample Kolmogorov-Smirnov procedure is used to test the null hypothesis that a sample comes from a particular distribution. It does this by finding the largest difference (in absolute value) between two cumulative distribution functions (CDFs)- one computed directly from the data, the other, from mathematical theory." (SPSS Documentation)
The hypotheses of the test are:
- H0: data comes from the normal distribution
- H1: data doesn't come from the normal distribution
The probability of the Z statistic is in all cases below 0.05, meaning that the normal distribution with the respective mean (different for each case, calculated from the empirical observations) is not a good fit for the daily log-returns of the companies. The interpretation of the result is somehow different than what is usual in statistics, and a significant result is actually a rejection of H0.
Not having normal data impedes the analysis through parametric tests. Therefore, a new dataset is constructed, with log-returns considered on monthly basis.
Constructing a New Dataset
The new dataset is derived from the initial one and has the following features:
- Period: 24. January 1999 - 21. December 2007
- Monthly observations of 9 DAX companies
- 108 trading periods, 972 observations
- Log-returns are used.
Inference is made to see if monthly observations can be modeled as being normally distributed.
PP Plots and QQ Plots for Monthly Data
Once again the PP plots and the QQ plots are used and this time it can be seen that the deviations from the normal distribution are much smaller than in the case of the initial dataset.
The fact that the deviations from the normal distribution are smaller in the case of monthly observations than in the case of daily observations is quite common for financial data. Usually, the smaller the interval of observation, the closer the log-returns are around the mean and also the higher the kurtosis. There is a better chance of getting data that is normally distributed when one considers relatively longer time periods.
Still, it is difficult to say which monthly log-returns are normally distributed by just looking at the pictures. A selection can be made by using the Jarque-Bera test for the new dataset.
Jarque-Bera Test for Monthly Data
Under the same hypotheses, significance level and critical value as the Jarque-Bera test for daily data, results are obtained that are summarized in the table below.
The results are clearly better than in the daily data case, which was expected from the PP- and QQ-Plots. However, only 2 companies have values below 5.99, BMW and Volkswagen. From the point of view of the Jarque-Bera test, the H0 hypothesis (data normally distributed) is rejected for the other stocks' monthly log-returns.
It is interesting to see whether the Kolmogorov-Smirnov test earns the same results.
Kolmogorov-Smirnov Test for Monthly Data
The results for the Kolmogorov-Smirnov test for monthly data are summarized in the SPSS output.
"When assessing goodness of fit,[...] a statistically significant Z statistic means that the chosen distribution does not fit the data well" (SPSS Documentation). It seems that there is a reasonably good fit for all the companies, which is contradicting the previous result of the Jarque-Bera test. The methodologies of the Jarque-Bera and Kolmogrov-Smirnov tests are different. In the case of BMW and Volkswagen, the first 4 moments (mean, variance, skewness and kurtosis) are tested and confirmed against the normal distribution. Confronting the Jarque-Bera and the Kolmogorov-Smirnov tests, we consider that BMW and Volkswagen have normally distributed monthly log-returns. This will be used to perform further parametric tests.
T-Test for Monthly Data
Having the monthly data for BMW and Volkswagen, the question raised is which one is the better investment. To determine this, one needs to look at the profitability and at the risk (volatility). For the profitability, it can be seen that in both cases, the average monthly log-return is close to zero. The T-Test is a parametric test that can be used to compare means. Monthly data for BMW and Volkswagen is considered and one checks to see whether the respective means are significantly different from zero in both cases, at the 5% level.
A One-Sample T-Test is performed for each company's log-returns, using SPSS.
The first table provides summary statistics for the two variables. From the second table, one can observe that the means of both variables are not different from zero at the 5% significance level.
According to SPSS documentation, the column labeled Sig. (2-tailed) displays a probability from the t distribution with 107 degrees of freedom. The value listed is the probability of obtaining an absolute value greater than or equal to the observed t statistic, if the difference between the sample mean and the test value is purely random.
So, at the 5% significance level, there is no difference between buying the BMW or the Volkswagen stock, each one has a mean of the monthly log-returns that equals zero.
F-Test for Monthly Data
Knowing that from the point of view of the profitability the two stocks are similar at the 5% significance level, one can try to evaluate the risk of these stocks in order to determine which one is the better investment for the respective period.
Variance can be considered as a measure of volatility, and therefore risk. The F-Test, two-tailed in our case, is a parametric test with the following hypotheses:
- H0: variances are equal
- H1: variances are different.
The decision to accept H0 is taken if:
In the present case, the significance level and the number of observations of the two time series is .
The F-Test can be implemented in Matlab, one way to do it is the following:
x = load('DAX_log_returns_monthly.dat'); x_BMW=x(:,3); x_VOW=x(:,10); ratio=var(x_BMW)/var(x_VOW) p=finv(0.975,107,107)
One compares against a critical value of 1.4637, and the variance ratio is 0.6624. Therefore, at the 5% significance level, the variances of the BMW and Volkswagen monthly data samples are equal!
This means that also from the point of view of risk, the two stocks are similar. Statistically there is no significant difference in favour of one of the two investments. One is expected to earn the same under the same risk conditions by holding either the BMW or the Volkswagen stock.
A linear regression can be used in order to check whether the prices of the stocks depend on time. In order to achieve this, one can proceed in two steps:
- use a dataset with the stock prices on a daily basis. The time is indexed from the first observation up to the last one - 2347
- use a dataset with the stock prices on a monthly basis. The time index is between 1 and 108.
By making these regressions, one checks whether the level of the price depends on time and if the intervals of observation are relevant. It is interesting to determine whether the behaviour of the prices with respect to time is the same for daily and monthly data or not.
The regressions are done in EViews.
Regressions for Daily Data
A simple regression model is constructed:
, where Time represents the time index and the error term. 9 regressions are performed, and the results are sumarized in the next table.
|Dependent Variable: Stock Price|
|Number of observations||2347||2347||2347||2347||2347||2347||2347||2347||2347|
- According to the results, all the coefficients are statistically significant (p-value of 0.0000 in all cases).
- There are 4 companies for which the stock price is positively influenced by the time index, namely BMW, Continental, Siemens and Volkswagen.
- In 5 cases, there exists a negative influence of the independent variable on the dependent variable: Allianz, Daimler, Lufthansa, Münchener Rück and TUI.
- The Adjusted R-Squared ranges from 0.01 to 0.76 which means
- in the cases of Siemens, Lufthansa, Daimler and Volkswagen, only a small or a relatively small part of the variance in stock price is explained by the time index;
- for BMW, Münchener Rück, Allianz, TUI and Continental, a relatively large or a large part of the variance in the stock price is explained by the time index.
Regression Analysis for Monthly Data
The model from the previous subsection is implemented now, in the case of monthly data. The results are summarized in the following table:
|Dependent Variable: Stock Price|
|Number of observations||108||108||108||108||108||108||108||108||108|
- Almost all coefficients are statistically significant (p-value smaller than 0.05). The exceptions are the coefficient of the constant for Continental (0.1008) and the coefficients of the independent variable for Siemens (0.2172).
- The behaviour of the relationships between the dependent and independent variable is similar to the daily data case, with the exception of the variables whose coefficients are not statistically significant.
- For 3 companies, the stock price is positively influenced by the time index, namely BMW, Continental and Volkswagen.
- In the cases of Allianz, Daimler, Lufthansa, Münchener Rück and TUI, there exists a negative influence of the independent variable on the dependent variable.
- The Adjusted R-Squared ranges from 0.01 to 0.76 reflecting that
- for Lufthansa, Daimler and Volkswagen, only a small or a relatively small part of the variance in stock price is explained by the time index;
- again for BMW, Münchener Rück, Allianz, TUI and Continental, a relatively large or a large part of the variance in the stock price is explained by the time index.
Comparison Daily Data vs. Monthly Data
Even though the number of observations was dramatically decreased, from 2347 in the daily data case to 108 in the monthly data case, the regressions reveal that the relationship between the stock price and the time index respects the same pattern. There is an exception for Siemens, where even if the type of relationship is direct in both cases, the coefficient for the monthly data is not statistically significant.
To conclude, there is not much change in the relationship between the dependent and independent variable when we switch from a daily-based dataset to a monthly-based one.
A Risk Management Application
In this section, a risk management application will be illustrated by making use of the bootstrap procedure. As building blocks for this application, it is necessary to first introduce the concepts of Value at Risk and Expected Shortfall.
Value at Risk (VaR)
The following explanation and example regarding Value at Risk are taken from the book "Modern Portfolio Theory and Investment Analysis", by Elton, Gruber, Brown and Goetzmann (EGBG).
- Institutions such as banks and insurance companies are concerned with the likelihood of bad outcomes. One way to express the willingess to tolerate bad outcomes is Value at Risk (VaR). VaR looks at the size of bad outcomes that can occur with a specified probability in a specific time interval. For example, the institution might calculate that there is a 5% probability of a loss of 295.000 or more occuring in the next week. If management were interested in the 5% probability level, then 295.000 would be the Value at Risk. Let's discuss how this value for VaR could have occured.
- Assume a portfolio is 100 million in value. Assume the expected return over the next week is 0.2% with a standard deviation of 0.3%. Also assume normal distributions. Then we know that the lowest 5% of possible returns are returns that occur more than 1.65 standard deviations away from the mean. Thus 5% of the time we can expect returns below R-1.65 or 0.2-1.65*0.3. Simplifying this results in a return of -0.295% or less. If this investor has 100 millions in assets, this is a loss of 295.000 or more. This number, 295.000 (expressed in currency), is called Value at Risk. VaR is the best outcome that can occur if returns are in the worst part of the possible outcomes.
- Many institutions hold assets that do not have normal distributions of returns. These institutions usually use simulations to compute Value at Risk.
Expected Shortfall (ES)
Value at Risk is widely used, but as a disadvantage, it penalizes diversification. The Expected Shortfall (ES) is a coherent risk measure that is an alternative to VaR. Generally the Expected Shortfall is computed as the average of the losses that are higher than VaR, and therefore it is a stricter measure than VaR (more cautious).
Computing VaR and ES using the Bootstrap
The daily data for the log-returns are considered and one would like to calculate the VaR and ES for each stock.
For describing the procedure, the book "Measuring Market Risk" by Kevin Dowd is used.
- The VaR and ES can be estimated by various parametric and non-parametric methods. In our case, a bootstrapped historical simulation is used. A bootstrap procedure involves resampling from our existing data set with replacement. A bootstrapped estimate will often be more precise than a "raw" sample estimate. To apply the bootstrap, 1000 new samples of n observations each are created from the original dataset by drawing at random and replacing the observation after it has been drawn. Each new "resampled" sample gives us a new VaR estimate, and the best estimate is the mean of these resample-based estimates. The same approach can also be used to produce resample-based ES estimates - each one of these is the average of the losses in each resample exceeding the resample VaR - and the best ES estimate is the mean of these ES estimates.
Results for Daily Data
In the case of the initial dataset, the number of observations is n=2347.
The Matlab solution for the daily case is the following:
x1 = load('DAX_log_returns.dat'); x=x1(:,2:10); n=2347; p=0.95; VaRind=floor((1-p)*n)+1; for i=1:1000 xs1=x(ceil(n*unifrnd(0,1,n,1)),:); xs1=sort(xs1); VaR(i,:)=xs1(VaRind,:); ES(i,:)=mean(xs1(1:VaRind-1,:)); end bootVaR=mean(VaR); bootVaR bootES=mean(ES); bootES
Due to the random sampling process, results are different each time the program is executed. However, the values are very close and the bootstrap performs very well. A series of results is summarized in the following table.
|Value at Risk||0.0334||0.0317||0.0305||0.0307||0.0325||0.0348||0.0372||0.0345||0.0307|
Note: in Matlab, the results have the "-" sign. Since losses are discussed, conventionally in the table, the sign is changed. VaR and ES are expressed in percent.
- The Value at Risk ranges from 0.0305 or 3.05% to 0.0348 or 3.48% possible loss in one day.
- The Expected Shortfall illustrates more severe possible losses, ranging from 0.0430 or 4.30% to 0.0550 or 5.50% per day.
Results for Monthly Data
For the monthly dataset, the number of observations is n=108.
A series of results is presented in the following table.
|Value at Risk||0.1880||0.1448||0.1254||0.1468||0.1754||0.1739||0.1971||0.1759||0.1621|
- The Value at Risk ranges from 0.1254 or 12.54% to 0.1971 or 19.71% possible loss in one month.
- The Expected Shortfall is between 0.1755 or 17.55% and 0.3053 or 30.53%.
- It is normal that the values are larger than those of the previous case, since the period of variation is extended from 1 day to 1 month.
- An analysis of a financial dataset (9 DAX Companies) was performed.
- Various statistical methods were used, and the implementation was done in Matlab, SPSS, EViews, XploRe and Excel.
- For observations on daily basis, none of the stocks have normally distributed log-returns. When considering a transformed dataset with monthly observations, BMW and Volkswagen have normally distributed log-returns.
- For the monthly dataset, the difference is not statistically significant when choosing between the BMW and the Volkswagen stocks, since they both have means equal to zero and bear the same risk.
- A regression analysis was performed in order to illustrate a relationship between the dependent variable "Stock Price" and the independent variable "Time". 18 regressions were estimated, 9 for the daily dataset and 9 for the monthly dataset. The relationship between the stock price and the time index respects the same pattern regardless of using daily or monthly data.
- A risk management application is done by using the bootstrap procedure to calculate the Value at Risk and the Expected Shortfall for the 9 companies, based on the daily and monthly datasets, respectively.
- Statistical methods can be successfully applied to find answers to practical aspects, such as: choosing the stock to invest in, knowing the behaviour of the stocks with respect to time, estimating the downside risk for a stock.
- Klinke, S.: "Applied Quantitative Methods - Lecture Notes", Institut für Statistik und Ökonometrie, Humboldt-Universität zu Berlin
- Härdle, W., Simar, L.: "Applied Multivariate Statistical Analysis", Springer, 2003
- Elton, E., Gruber, M., Brown, S., Goetzmann, W.: "Modern Portfolio Theory and Investment Analysis", John Wiley & Sons, 2007
- Dowd, K.: "Measuring Market Risk", John Wiley & Sons, 2005
- Deutsche Börse http://deutsche-boerse.com/dbag/dispatch/de/kir/gdb_navigation/home
- SPSS documentation, Wikipedia