What drives Market Value: Analysis of the Forbes 500 US companies

From Teachwiki

Introduction

This paper analyzes the data of 79 companies included in the Forbes 500 list of the largest U.S. companies. We focus mainly on the following questions:

  • What are potential criteria for being considered “the largest”? Which factors did Forbes analysts consider most important when selecting the largest U.S. companies from a particular sector? (PCA)
  • How “good” are the existing Forbes 500 criteria? To put it differently, how homogeneous is our sample? (Cluster analysis)
  • What drives the market value of the Forbes 500 companies? (Regression analysis)

We intentionally use different statistical tools and methods to show different approaches to answering these questions.

Data description

This dataset holds several facts about 79 companies selected from the Forbes 500 list for 1986. It is a 1/10 systematic sample from the alphabetical list of companies. The Forbes 500 includes all companies in the top 500 on any of the criteria. Forbes used 6 criteria to define the biggest companies; the sector to which each company belongs is also identified.

Therefore the dataset includes 79 observations and 8 variables - 6 numerical, 2 nominal (text):

Variable        Description                                         Type
------------------------------------------------------------------------------
Company         Company name                                        Nominal
Assets          Amount of assets (in millions)                      Numerical
Sales           Amount of sales (in millions)                       Numerical
Market Value    Market value of the company (in millions)           Numerical
Profits         Profits (in millions)                               Numerical
Cash flow       Cash flow (in millions)                             Numerical
Employees       Number of employees (in thousands)                  Numerical
Sector          Type of market the company is associated with       Nominal

Descriptive Statistics

To obtain a detailed statistical summary of each variable, we apply the "descriptive" quantlet. The output for “Market Value” is shown as an example.

[ 62,] =========================================================
[ 63,]  Variable Market Value
[ 64,] =========================================================
[ 65,]  
[ 66,]  Mean              3269.75
[ 67,]  Std.Error         11303.5     Variance       1.2777e+08
[ 68,]  
[ 69,]  Minimum                53     Maximum             95697
[ 70,]  Range               95644
[ 71,]  
[ 72,]  Lowest cases                  Highest cases 
[ 73,]          34:            53              58:         8190
[ 74,]          18:            90              39:         9462
[ 75,]          51:           101               1:        10636
[ 76,]          11:           181              38:        33172
[ 77,]          25:           181              40:        95697
[ 78,]  
[ 79,]  Median                944
[ 80,]  25% Quantile          483     75% Quantile         2002
[ 81,]  
[ 82,]  Skewness          7.15365     Kurtosis           57.213
[ 83,]                                Excess             54.213
[ 84,]  
[ 85,]  Observations                     79
[ 86,]  Distinct observations            77
[ 87,]  
[ 88,]  Total number of {-Inf,Inf,NaN}    0
[ 89,]  
[ 90,] =========================================================
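For readers without XploRe, the main figures in such a summary can be approximated in a few lines of Python. This is a minimal sketch, not the "descriptive" quantlet itself: the function name `describe` is hypothetical, and the ten values are only the lowest and highest cases and quantiles printed above, not the full sample of 79 companies. It assumes that the figure labelled "Std.Error" is the sample standard deviation (11303.5 squared is about 1.2777e+08, the reported variance) and that kurtosis is the plain fourth-moment ratio, so that Excess = Kurtosis - 3 as in the output.

```python
import math
import statistics

def describe(values):
    """Return basic summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.variance(values)      # sample variance (divides by n - 1)
    # Central moments (divided by n) used for skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    kurtosis = m4 / m2 ** 2
    return {
        "mean": mean,
        "std": math.sqrt(variance),
        "variance": variance,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": kurtosis,
        "excess": kurtosis - 3.0,
    }

# Illustrative values only: a handful of Market Value figures from the output
# above, NOT the complete data set.
sample = [53, 90, 101, 181, 483, 944, 2002, 8190, 9462, 95697]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name:10s} {value:14.4f}")
```

A single very large value (here 95697) is enough to pull the mean far above the median and to produce a large positive skewness, which is the pattern visible in the XploRe output.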
 

Nine sectors are represented in the data sample: Communication, Energy, Finance, HiTech, Manufacturing, Medical, Retail, Transportation and Other. To learn more about how the sample is distributed across sectors, we use the XploRe “frequency” command to obtain a summary of sector frequencies. Later we will try to find distinguishing features of the sectors and dependencies typical of them. The table below summarizes the six numerical variables for the full sample of 79 companies.

            
[ 1,]               
[ 2,]                   Minimum     Maximum        Mean      Median   Std.Error
[ 3,]              -------------------------------------------------------------
[ 4,] Assets                223       52634      5940.5        2788      9156.8
[ 5,] Sales                 176       50056      4178.3        1754      7011.6
[ 6,] Market Value           53       95697      3269.7         944       11304
[ 7,] Profits            -771.5        6555      209.84        70.5      796.98
[ 8,] Cash Flow          -651.9        9874      400.93       133.3      1205.5
[ 9,] Employees             0.6       400.2      37.597        15.4      64.504
[10,]  
 

Principal Components Analysis

In line with our objectives, we try to determine which variables have the strongest impact on the variability of the data. The statistical tool for reducing the dimensionality of data and finding the most informative projections is principal components analysis (PCA). The weights of the PCs tell us in which directions, expressed in the original coordinates, the largest share of variance is explained. It is advisable to apply PCA to data of approximately the same scale. Since our variables are measured in different units (e.g. Employees and Assets) and on different scales, a transformation is necessary. The original variables were standardized by subtracting the mean and dividing by the standard deviation.
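The standardization and the PCA step can be sketched as follows. This is an illustration under stated assumptions rather than the XploRe code used for the paper: the function names `standardize` and `pca` are hypothetical, and the data are synthetic random numbers standing in for the six numerical variables.

```python
import numpy as np

def standardize(X):
    """Subtract the column mean and divide by the column standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(X):
    """PCA via eigendecomposition of the covariance matrix of standardized data."""
    Z = standardize(X)
    corr = np.cov(Z, rowvar=False)            # equals the correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder so the largest comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                      # coordinates of each observation on the PCs
    return eigvals, eigvecs, scores

# Synthetic stand-in for the 6 numerical variables (NOT the Forbes data).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=5.0, sigma=1.0, size=(79, 6))
eigvals, eigvecs, scores = pca(X)
print(eigvals)
```

Because every standardized variable has unit variance, the eigenvalues add up to the number of variables (6 here), which is a quick sanity check on any PCA of standardized data.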

A rough look at the plot of the 1st vs 2nd PC suggests that the data set contains obvious outliers. Since the first principal component usually has the highest explanatory power, it is reasonable to search along the 1st PC axis first. We therefore begin by excluding the two farthest points along the 1st PC. These points correspond to two companies, General Electric and IBM, which have the highest values of Sales, Market Value, Profits, Cash Flow and Employees. IBM also has the highest value of Assets, which makes it the largest company in the sample on every characteristic.

1st vs 2nd PC with outliers

We then repeat the procedure, rescaling the remaining data sample. The second graph shows that there are again points along the first PC that lie quite far from the rest of the cloud, so this time we exclude Bell Atlantic (the company with the highest market value in the remaining sample).

After several iterations it might seem that there will always be some outlying points that do not fit our point cloud perfectly. We decided to restrict ourselves to 3 iterations. After excluding 5 outliers in total (IBM, General Electric, Bell Atlantic, Cigna, LTV), we are left with 74 observations.

The next two tables compare the characteristics of the remaining data with those of the outliers. In almost all variables the outliers have values considerably exceeding the mean values of the remaining sample.

Contents of summ

[ 1,]               
[ 2,]                   Minimum     Maximum        Mean      Median   Std.Error
[ 3,]              -------------------------------------------------------------
[ 4,] Assets                223       33406      4316.3        2548      5296.2
[ 5,] Sales                 176       17124      2949.5        1679      3417.1
[ 6,] Market Value           53        9462      1534.5         910      1824.5
[ 7,] Profits              -279         485      109.42        69.2      137.73
[ 8,] Cash Flow          -108.1        1462      227.53       131.4       253.8
[ 9,] Employees             0.6       184.8      28.116        12.6      39.375
[10,]  


Contents of outliers

[1,]              IBM     GE      BA        Cigna     LTV
[2,] Assets       52634   26432   19788     44736     6307
[3,] Sales        50056   28285   9084      16197     8199
[4,] Market Value 95697   33172   10636     4653      598
[5,] Profits      6555    2336    1092.9   -732.5    -771.5
[6,] Cash Flow    9874    3562    2576.8   -651.9    -524.3
[7,] Employees    400.2   304     79.4      48.5      57.5
 

The decomposition of the covariance matrix of the standardized remaining dataset (equivalently, the correlation matrix of the original variables) gives us the following eigenvalues and corresponding eigenvectors:

Contents of e

[1,]   3.5494 
[2,]   1.0292 
[3,]   0.8052 
[4,]   0.4034 
[5,]   0.1743 
[6,]   0.0333 


Contents of v

[1,]  0.2482  0.37051 -0.88983  0.01705  0.08687  0.03802 
[2,]  0.4551 -0.34555 -0.07801 -0.52778 -0.27807 -0.55815 
[3,]  0.4429 -0.12488  0.05357  0.82123 -0.24854 -0.22176 
[4,]  0.3910  0.52284  0.37051 -0.05734  0.58812 -0.29541 
[5,]  0.4653  0.33255  0.24151 -0.20732 -0.48687  0.57838 
[6,]  0.4069 -0.58632 -0.06036 -0.02181  0.52002  0.46482
 

The percentage of variation explained by each principal component:

[1,] Eigenvalue  Prop of Var  Cum Prop
[2,] 3.5494      0.5921       0.5921
[3,] 1.0292      0.1717       0.7637
[4,] 0.8052      0.1343       0.8981
[5,] 0.4034      0.0673       0.9654
[6,] 0.1743      0.0291       0.9944
[7,] 0.0333      0.0055       1
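The proportions in this table follow directly from the eigenvalues: each eigenvalue is divided by their sum. A short check in Python, using only the eigenvalues reported above:

```python
# Eigenvalues as reported in "Contents of e".
eigenvalues = [3.5494, 1.0292, 0.8052, 0.4034, 0.1743, 0.0333]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running total of the proportions.
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for ev, p, c in zip(eigenvalues, proportions, cumulative):
    print(f"{ev:8.4f}  {p:6.4f}  {c:6.4f}")
```

The eigenvalues sum to roughly 6, the number of standardized variables, and the first two components together account for about 76% of the variation.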
 

The plot below shows which original variables are more strongly correlated with the first and the second PC. The correlation coefficients are shown in the table.

The correlation of variables with the PCs






[1,]              1st PC   2nd PC
[2,] Assets       0.46796  0.37603
[3,] Sales        0.85777 -0.3507
[4,] Market_value 0.8348  -0.12674
[5,] Profits      0.73703  0.53064
[6,] Cash flow    0.87705  0.33751
[7,] Employees    0.76707 -0.59506
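For standardized variables, the correlation between a variable and a principal component equals the corresponding eigenvector entry multiplied by the square root of the eigenvalue. The sketch below applies this relation to the first two columns of the eigenvector matrix v and the first two eigenvalues reported earlier; it reproduces the table to about three decimals (the small differences come from rounding in the printed inputs).

```python
import math

eigenvalues = [3.5494, 1.0292]   # first two eigenvalues from "Contents of e"

# First two columns of "Contents of v", one row per variable.
eigenvectors = {
    "Assets":       (0.2482,  0.37051),
    "Sales":        (0.4551, -0.34555),
    "Market value": (0.4429, -0.12488),
    "Profits":      (0.3910,  0.52284),
    "Cash flow":    (0.4653,  0.33255),
    "Employees":    (0.4069, -0.58632),
}

# correlation(variable, PC_j) = eigenvector entry * sqrt(eigenvalue_j)
correlations = {
    name: tuple(w * math.sqrt(ev) for w, ev in zip(weights, eigenvalues))
    for name, weights in eigenvectors.items()
}

for name, (c1, c2) in correlations.items():
    print(f"{name:13s} {c1:8.4f} {c2:8.4f}")
```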







The main conclusion is that the variable of main interest, “Market value”, is explained rather well by the first PC (the correlation coefficient equals 0.83). This supports the further analysis. In the regression section we will look for a more precise dependence of MV on the PCs as well as on the individual variables.


Cluster Analysis

The next step is to find out whether there is any similarity between individual companies in the data sample or, in other words, whether the data can be divided into groups (clusters) with certain characteristics. As noted above, each company is assigned to a sector. Altogether 9 sectors are represented in the data (Communication, Energy, Finance, HiTech, Manufacturing, Medical, Retail, Transportation and Other).

Our task is to figure out if the fact that a given company is a member of a sector endows it with some particular characteristics – sector–specific characteristics.

To draw conclusions we use another statistical tool: cluster analysis. We divide the data into clusters using an agglomerative hierarchical algorithm with distances computed by the Ward method. The dendrogram below shows that the sample is clearly divided into two clusters.

The dendrogram for the 75 US companies (without outliers)
1st vs 2nd PC without outliers, divided into two clusters
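The idea behind Ward's method can be sketched in a few lines: at every step, merge the two clusters whose union increases the within-cluster sum of squares the least. The code below is a deliberately naive illustration on synthetic data; the function name `ward_clusters` is hypothetical, it is not the XploRe routine used for the dendrogram, and in practice a library routine (for example SciPy's `linkage` with `method="ward"`) would normally be used instead.

```python
import numpy as np

def ward_clusters(X, n_clusters):
    """Naive agglomerative clustering with Ward's merge criterion."""
    clusters = [[i] for i in range(len(X))]        # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = X[clusters[a]], X[clusters[b]]
                na, nb = len(ca), len(cb)
                gap = ca.mean(axis=0) - cb.mean(axis=0)
                # Increase in within-cluster sum of squares if a and b are merged.
                cost = na * nb / (na + nb) * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # merge b into a
        del clusters[b]                            # b > a, so index a stays valid
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Synthetic standardized data with two well-separated groups (NOT the Forbes data).
rng = np.random.default_rng(1)
small = rng.normal(loc=0.0, scale=0.5, size=(58, 6))
large = rng.normal(loc=4.0, scale=0.5, size=(16, 6))
X = np.vstack([small, large])
labels = ward_clusters(X, n_clusters=2)
print(np.bincount(labels))
```

Cutting the hierarchy at two clusters corresponds to stopping the merging when two groups remain, which is what reading "two clusters" off a dendrogram amounts to.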

The sample is divided into two groups: “blue”, the bigger one (58 companies), and “red”, the smaller one (16 companies). Our guess that each sector possesses particular qualities was only partially confirmed: all financial companies as well as the majority of energy and transportation companies are gathered in the “blue” cluster, but in general both groups draw on companies from across the sectors. Nevertheless, we can characterize the clusters by descriptive statistics. Below we compare the statistical characteristics of both:

Contents of summ (blue)
[ 1,]               
[ 2,]                   Minimum     Maximum        Mean      Median   Std.Error
[ 3,]              -------------------------------------------------------------
[ 4,] Assets                223       33406      3533.9        1679      5387.5
[ 5,] Sales                 176        6615        1615        1302      1302.4
[ 6,] Market Value           53        2306      805.07         717      503.49
[ 7,] Profits            -208.4       310.7      67.269        60.6      82.541
[ 8,] Cash Flow          -108.1       578.3      131.91       106.8      112.35
[ 9,] Employees             0.6          65       12.35         6.4       13.52
[10,] 


Contents of summ (red)
[ 1,]               
[ 2,]                   Minimum     Maximum        Mean      Median   Std.Error
[ 3,]              -------------------------------------------------------------
[ 4,] Assets               2535       14045      7152.4        5769      3907.6
[ 5,] Sales                4152       17124      7787.1        5958      4304.9
[ 6,] Market Value         1915        9462      4178.8        3023      2391.9
[ 7,] Profits              -279         485      262.24       283.7      186.15
[ 8,] Cash Flow              83        1462      574.17       521.7      319.35
[ 9,] Employees            23.4       184.8      85.269        66.2      49.043
[10,]

The descriptive statistics show that “red” companies have higher mean values for all variables and, except for Assets, higher dispersion as well. The “blue” cluster, however, contains at least one company with very high assets (its maximum is 33406). In fact it has more high-asset companies, which are offset by low-asset firms, so its mean is still smaller than that of the “red” cluster. The high-asset contribution comes from financial companies, which are all concentrated in the "blue" cluster.

How can this analysis be helpful? Below we run the regression within the revealed clusters to find out whether there are dependencies among the variables in the two groups and whether they differ from the whole sample and from each other.

Distribution Analysis

In order to analyze the distribution of each characteristic in our sample, we employ two statistical methods, namely histograms and empirical kernel probability density functions (bandwidth = 0.15). Because both methods give roughly the same picture of the distribution, we present only the kernel pdf here:

Kernel pdf

The graph shows that all our variables have strongly skewed distributions. We address this by taking logs, which makes the distributions more symmetric. The graph below presents the kernel pdf of the logged variables, which resembles a normal density:

Log Kernel pdf
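A kernel density estimate of this kind, and the effect of the log transformation on skewness, can be sketched as follows. The helper names `kde_gaussian` and `skewness` are hypothetical, the Gaussian kernel is an assumption (the text does not state which kernel XploRe used), and the data are synthetic; only the bandwidth of 0.15 is taken from the text.

```python
import numpy as np

def kde_gaussian(sample, grid, bandwidth):
    """Kernel density estimate with a Gaussian kernel, evaluated on grid."""
    u = (grid[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(sample) * bandwidth)

def skewness(x):
    """Moment-based skewness: third central moment over variance to the 3/2."""
    d = x - x.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

# Synthetic right-skewed sample (NOT the Forbes data).
rng = np.random.default_rng(2)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=79)
logged = np.log(raw)

# Evaluate the density of the logged values on a grid that covers the tails.
grid = np.linspace(logged.min() - 2.0, logged.max() + 2.0, 400)
density = kde_gaussian(logged, grid, bandwidth=0.15)

print("skewness before log:", skewness(raw))
print("skewness after log: ", skewness(logged))
```

A bandwidth this small gives a fairly wiggly curve for 79 points; larger bandwidths smooth the estimate at the cost of blurring detail.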

Regression Analysis

The aim of the regression analysis in this section is to determine what drives the market value of the biggest American companies, i.e. to find the link (if any exists) between market value (MV) and the other characteristics in our sample.

We begin with simple regressions of MV on the other numerical variables, namely assets, sales, profits, cash flow, and number of employees. Note that from now on we use logs of the variables, except for profits and cash flow, which take negative values. The sample used also excludes the outliers detected by PCA. Using the graph below we can make some preliminary remarks:

  • The linear regression of log(MV) on log(Assets) doesn’t fit the data well; perhaps the true link function is non-linear.
  • Scatterplots for the variables “Sales” and “Employees” look very similar. The same holds for the pair “Profits” and “Cash Flow”.
  • The best fit is provided by the regression of log(MV) on log(Profits). To obtain this regression, we eliminated from the sample the 7 observations with negative profits and then applied the log-transformation to this variable.

Simple regression of MV on other numerical variables

Guess #1 is supported by the insignificant coefficient of this regressor. To check guess #2 we computed the correlation matrix and obtained a correlation of nearly 0.89 for both pairs of variables: “Sales” & “Employees” and “Profits” & “CF”. The coefficient is even higher (0.97 and 0.99 respectively) in the sample with 4 outliers included. High correlation between regressors leads to a multicollinearity problem in multiple regression.

Finally, we replace “Profits” with log(“Profits”) and run the multiple regression of “Market Value” on “Assets”, “Sales” and “Profits” (all variables in logs). The results are presented in the table below.

The F-test shows that the regression is significant. R-squared equals 0.74, which means that the regression explains 74% of the variation in “Market Value”. Only the first regressor, “Assets”, has a coefficient that is insignificant at the 5% level.

[ 1,] 
[ 2,] A  N  O  V  A                   SS      df     MSS       F-test   P-value
[ 3,] _________________________________________________________________________
[ 4,] Regression                    46.229     3    15.410      62.167   0.0000
[ 5,] Residuals                     15.864    64     0.248
[ 6,] Total Variation               62.093    67     0.927
[ 7,] 
[ 8,] Multiple R      = 0.86285
[ 9,] R^2             = 0.74451
[10,] Adjusted R^2    = 0.73254
[11,] Standard Error  = 0.49787
[12,] 
[13,] 
[14,] PARAMETERS         Beta         SE         StandB        t-test   P-value
[15,] ________________________________________________________________________
[16,] b[ 0,]=          2.4105       0.4971       0.0000         4.849   0.0000
[17,] b[ 1,]=         -0.1185       0.0645      -0.1364        -1.838   0.0706
[18,] b[ 2,]=          0.2547       0.0710       0.2884         3.588   0.0006
[19,] b[ 3,]=          0.7765       0.0872       0.7266         8.904   0.0000
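The summary figures in this output can be recomputed from the sums of squares and degrees of freedom in the ANOVA table, which is a useful consistency check. The snippet below uses only numbers printed in the table:

```python
# Figures copied from the ANOVA table above.
ss_regression, df_regression = 46.229, 3
ss_residual, df_residual = 15.864, 64
ss_total = ss_regression + ss_residual        # 62.093
df_total = df_regression + df_residual        # 67, i.e. 68 observations

r_squared = ss_regression / ss_total
adjusted_r_squared = 1.0 - (ss_residual / df_residual) / (ss_total / df_total)
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)
standard_error = (ss_residual / df_residual) ** 0.5

print(f"R^2            = {r_squared:.5f}")
print(f"Adjusted R^2   = {adjusted_r_squared:.5f}")
print(f"F-test         = {f_statistic:.3f}")
print(f"Standard Error = {standard_error:.5f}")
```

The adjusted R-squared penalizes the fit for the number of regressors, which is why it is slightly lower than the plain R-squared.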
 

Another way to analyze market value with the help of regression analysis is to use principal components (PCs) as regressors. This avoids the problems we encountered above:

  • there is no multicollinearity in the model because PCs are uncorrelated by construction;
  • there is no need for a log-transformation because the PC-based approach makes no distributional assumption;
  • using PCs reduces the dimensionality of the data, which is helpful because we have relatively few observations for regression analysis.

The only disadvantage of this approach is that it doesn’t give an explicit relationship between the variables, because principal components have no direct economic interpretation. This doesn’t mean, however, that the model yields no results. Principal components are simply linear combinations of the initial variables (with specific properties), so the true “link function” can be inferred from the regression equation. We can also treat principal components as indices constructed by weighting the initial variables. The value of the index is in turn related to market value through the fitted linear regression function. See the next section for details.
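A regression on principal components can be sketched as below. This is an illustration on synthetic data, not the XploRe code behind the reported output; the function name `pc_regression` is hypothetical, and the sketch assumes that the PCs are built from the standardized regressors only, which the text does not state explicitly.

```python
import numpy as np

def pc_regression(X, y, n_components):
    """Regress y on the first n_components principal-component scores of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]       # largest eigenvalues first
    scores = Z @ eigvecs[:, order]
    design = np.column_stack([np.ones(len(y)), scores])    # intercept + PC scores
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    r_squared = 1.0 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return beta, scores, r_squared

# Synthetic data (NOT the Forbes data): five strongly correlated regressors.
rng = np.random.default_rng(3)
common = rng.normal(size=(79, 1))
X = common + 0.3 * rng.normal(size=(79, 5))
y = 100.0 + 20.0 * X.sum(axis=1) + rng.normal(scale=5.0, size=79)

beta, scores, r_squared = pc_regression(X, y, n_components=2)
print("intercept and slopes:", beta)
print("R^2:", r_squared)
```

Because the PC scores are centered and mutually uncorrelated, the fitted intercept equals the sample mean of the response. The full-sample output in this section shows the same property: the intercept of 3269.7468 matches the mean Market Value of 3269.75 from the descriptive statistics.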

Note also that this time we ran the regression on the first two PCs without deleting the observations considered to be outliers at the beginning. We obtain a high R-squared, which means that the regression fits well. Nevertheless, let us also try the regression on the sample without the 5 outliers, namely General Electric, IBM, Cigna, LTV and Bell Atlantic.


[ 1,] 
[ 2,] A  N  O  V  A                   SS      df     MSS       F-test   P-value
[ 3,] _________________________________________________________________________
[ 4,] Regression                9564352143.736     2   4782176071.868     904.727   0.0000
[ 5,] Residuals                  401718405.200    76      5285768.489
[ 6,] Total Variation           9966070548.937    78    127770135.243
[ 7,] 
[ 8,] Multiple R      = 0.97964
[ 9,] R^2             = 0.95969
[10,] Adjusted R^2    = 0.95863
[11,] Standard Error  = 2299.07992
[12,] 
[13,] 
[14,] PARAMETERS         Beta         SE         StandB        t-test   P-value
[15,] ________________________________________________________________________
[16,] b[ 0,]=       3269.7468     258.6667       0.0000        12.641   0.0000
[17,] b[ 1,]=         -0.4261       0.0664      -0.3764        -6.417   0.0000
[18,] b[ 2,]=          6.7955       0.3033       1.3147        22.409   0.0000
 

Though our first two PCs have significant coefficients, R-squared fell dramatically from 0.96 to 0.71. Apparently the observations considered to be outliers improve the fit of our regression, so they can no longer be regarded as outliers. Why could this happen? We assume that although on the plot these observations seemed to lie too far from the other entities, this is not enough to call them outliers.

The first few principal components have large variances and explain the largest cumulative proportion of the total sample variance. These components are usually strongly related to the variables with relatively large variances and covariances. Consequently, the observations that are outliers with respect to the first few components usually correspond to outliers on one or more of the original variables. On the other hand, the last few principal components represent linear functions of the original variables with minimal variance. These components are sensitive to observations that are inconsistent with the correlation structure of the data but are not outliers with respect to the original variables. Therefore, large values of observations on the minor components reflect multivariate outliers that are not detectable using a criterion based on large values of the original variables. Our 5 observations have large values on the features with high variance, so possibly they are not outliers in a multivariate sense.
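The distinction between the two kinds of outliers can be made concrete with a small synthetic example. The sketch below scores every observation by the sum of its squared PC scores, each divided by the corresponding eigenvalue, once over the major and once over the minor components. It is a simplified illustration of the idea described above, not a reproduction of any published procedure; the function name `component_scores` and the data are hypothetical.

```python
import numpy as np

def component_scores(X, n_major, n_minor):
    """Sum of squared standardized PC scores over major and minor components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    standardized = (Z @ eigvecs) ** 2 / eigvals      # squared score / eigenvalue
    major = standardized[:, :n_major].sum(axis=1)
    minor = standardized[:, -n_minor:].sum(axis=1)
    return major, minor

# Synthetic example (NOT the Forbes data): two strongly correlated variables.
rng = np.random.default_rng(4)
x1 = rng.normal(size=60)
x2 = x1 + 0.1 * rng.normal(size=60)
X = np.column_stack([x1, x2])
X = np.vstack([X,
               [6.0, 6.0],     # extreme on both variables, follows the correlation
               [1.0, -1.0]])   # unremarkable on each variable, breaks the correlation

major, minor = component_scores(X, n_major=1, n_minor=1)
print("largest major-component score at row", int(np.argmax(major)))
print("largest minor-component score at row", int(np.argmax(minor)))
```

The point (6, 6) stands out on the major component only, much like a very large company that is large on every variable, whereas the point (1, -1) is flagged by the minor component although neither of its coordinates is unusual.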

Returning to the cluster analysis, we would like to see whether there is any dependence within the two groups of companies defined above. We are interested in whether the nature of the dependence is similar to the one found in the whole data set or whether, on the contrary, there are group-specific features. Again we run the regression of log Market Value on log Assets, log Sales and log Profits. The results are shown below:

Contents of out (blue)
 
[ 1,] 
[ 2,] A  N  O  V  A                   SS      df     MSS       F-test   P-value
[ 3,] _________________________________________________________________________
[ 4,] Regression                    17.009     3     5.670      28.123   0.0000
[ 5,] Residuals                      9.879    49     0.202
[ 6,] Total Variation               26.888    52     0.517
[ 7,] 
[ 8,] Multiple R      = 0.79536
[ 9,] R^2             = 0.63260
[10,] Adjusted R^2    = 0.61010
[11,] Standard Error  = 0.44900
[12,] 
[13,] 
[14,] PARAMETERS         Beta         SE         StandB        t-test   P-value
[15,] ________________________________________________________________________
[16,] b[ 0,]=          3.3250       0.6262       0.0000         5.310   0.0000
[17,] b[ 1,]=         -0.1497       0.0599      -0.2290        -2.501   0.0158
[18,] b[ 2,]=          0.1445       0.0769       0.1719         1.879   0.0662
[19,] b[ 3,]=          0.7849       0.0947       0.7712         8.291   0.0000
 
Contents of out (red)
 
[ 1,] 
[ 2,] A  N  O  V  A                   SS      df     MSS       F-test   P-value
[ 3,] _________________________________________________________________________
[ 4,] Regression                     0.582     3     0.194       0.785   0.5270
[ 5,] Residuals                      2.718    11     0.247
[ 6,] Total Variation                3.300    14     0.236
[ 7,] 
[ 8,] Multiple R      = 0.41987
[ 9,] R^2             = 0.17629
[10,] Adjusted R^2    = -0.04835
[11,] Standard Error  = 0.49707
[12,] 
[13,] 
[14,] PARAMETERS         Beta         SE         StandB        t-test   P-value
[15,] ________________________________________________________________________
[16,] b[ 0,]=          6.4187       2.8994       0.0000         2.214   0.0489
[17,] b[ 1,]=          0.0984       0.2917       0.1042         0.337   0.7422
[18,] b[ 2,]=         -0.0692       0.2891      -0.0689        -0.239   0.8152
[19,] b[ 3,]=          0.2678       0.2109       0.3771         1.270   0.2303

The first output corresponds to the “blue” group of companies, the second one to the “red” group.

In the second group we do not observe any strong dependence of log Market Value on log Assets, log Sales and log Profits. All the slope coefficients are insignificant at the 5% level, and R-squared is very low (0.18). For the first, “blue”, group of companies, however, we observe coefficients very similar to those obtained for the whole data sample (shown in parentheses): 3.32 (2.41), -0.15 (-0.12), 0.14 (0.25), 0.78 (0.78). R-squared declined to 0.63 (0.74). The main difference is that the coefficient of “Sales” is now insignificant at the 5% level, whereas the “Assets” coefficient has become significant (but only at the 5% level).

As already mentioned, the “blue” group, the bigger one, mainly includes the companies from the financial sector as well as the majority of energy and transportation companies. So we can conclude that in these particular industries market value is driven by Profits and Assets rather than Sales.

Sector Analysis

We would like to use the information about industry division in our analysis. Because the respective variable is not numeric, we would need to include 8 dummies for the 9 sectors in our regression. However, the number of observations is too small to estimate 14 coefficients with 68 observations (after excluding outliers and negative profits). Therefore, we use the color options of XploRe to analyze the possible implications of each sector.
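The dummy coding and the coefficient count can be illustrated as follows. The choice of "Other" as the reference category and the helper name `sector_dummies` are assumptions made for this sketch, and the breakdown of the 14 coefficients (intercept, five numerical regressors, eight dummies) is one reading of the text rather than something it states.

```python
sectors = ["Communication", "Energy", "Finance", "HiTech", "Manufacturing",
           "Medical", "Retail", "Transportation", "Other"]

def sector_dummies(sector, reference="Other"):
    """0/1 indicators for every sector except the reference category."""
    return [1 if sector == s else 0 for s in sectors if s != reference]

n_dummies = len(sector_dummies("Finance"))      # 9 sectors need 8 dummies
n_numeric = 5                                   # assets, sales, profits, cash flow, employees
n_coefficients = 1 + n_numeric + n_dummies      # intercept + slopes + dummies
observations_per_coefficient = 68 / n_coefficients

print(sector_dummies("Finance"))
print(n_coefficients, round(observations_per_coefficient, 1))
```

Dropping one category avoids perfect collinearity with the intercept; a company in the reference sector is represented by all dummies being zero. Fewer than five observations per coefficient leaves little information for each estimate.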

Each sector was assigned a distinctive color. In order to keep as many observations as possible, we used the initial sample of 79 companies here.

Sector Analysis

It is apparent that, with regard to profits, the sector doesn’t matter much: companies from different industries “lie” on a common MV-profit curve. Meanwhile, the cloudy scatterplot “MV vs Assets”, which led to the insignificant coefficient of the asset variable, now presents a clearer picture. We can observe that companies from different sectors exhibit different relationships between their assets and market value.

Conclusion

PC analysis. Since the 1st PC explains quite a high percentage of the variation in the data sample, we tried to construct an index based on it. The index is a linear combination of the original (standardized) variables with weights equal to the entries of the eigenvector that corresponds to the largest eigenvalue of the correlation matrix. This index could be an answer to the questions of how to define the largest US companies and which parameters contribute most to the index or, in other words, which figures are the most important for a company wishing to find itself in the Forbes “list of the best”. We obtained the following expression for the index that could be used to decide which companies should be in the Forbes 500 list:

$$\mathrm{IF500} = 0.25 \times \mathrm{Assets} + 0.45 \times \mathrm{Sales} + 0.44 \times \mathrm{MV} + 0.39 \times \mathrm{PR} + 0.46 \times \mathrm{CF} + 0.41 \times \mathrm{Empl}$$

Regression analysis. We also tried to answer the question “what drives market value”. After a log transformation of the original variables we obtained the following dependence of Market Value on Profits and Sales:

$$\log(\mathrm{MV}) = 2.41 + 0.25 \times \log(\mathrm{Sales}) + 0.78 \times \log(\mathrm{Profits})$$
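In a log-log equation of this form the coefficients act as elasticities: scaling Profits by a factor k scales the fitted Market Value by k raised to the power 0.78, whatever the base of the logarithm. A small worked example, using only the two slope coefficients from the equation (the function name `mv_multiplier` is hypothetical):

```python
# Slope coefficients from the log-log equation above.
b_sales, b_profits = 0.25, 0.78

def mv_multiplier(sales_factor=1.0, profits_factor=1.0):
    """Factor by which fitted MV changes when Sales and Profits are scaled."""
    return sales_factor ** b_sales * profits_factor ** b_profits

double_profits = mv_multiplier(profits_factor=2.0)   # about 1.72
double_sales = mv_multiplier(sales_factor=2.0)       # about 1.19

print(f"doubling Profits multiplies fitted MV by {double_profits:.2f}")
print(f"doubling Sales multiplies fitted MV by {double_sales:.2f}")
```

Under this fitted equation, a doubling of profits is therefore associated with a much larger change in market value than a doubling of sales.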

Sector analysis. We did not manage to identify inherent characteristics of the given sectors; on the contrary, the analysis allows us to conclude that in the data sample each sector of the economy has companies with different values of assets, sales, market value and other parameters. On the other hand, we found similarities in the data: two clusters ("blue" and "red") contain companies with different characteristics. The regression analysis also showed a different dependence of log Market Value on the other variables in each cluster.

Nevertheless, we concluded that the sector also matters when we examine dependencies among the variables. The scatter plot “MV vs Assets” clearly shows that companies from different sectors exhibit different relationships between their assets and market value.


References

  • Hardle, W., Simar, L. (2003) Applied Multivariate Statistical Analysis. Springer Verlag, Heidelberg. ISBN 3-540-03079-4 (486 p)
  • Hardle, W., Hlavka, Z., Klinke, S. (2000) XploRe Application Guide. Springer Verlag, Heidelberg. ISBN 3-540-67545-0 (525 p)
  • Hardle, W., Klinke, S., Muller, M. (2000) XploRe Learning Guide. Springer Verlag, Heidelberg. ISBN 3-540-66207-3 (526 p)
  • Tsay, R. S. (2005) Analysis of Financial Time Series, 2nd edition (Wiley Series in Probability and Statistics). Wiley-Interscience. ISBN 0-471-69074-0 (640 p)
  • Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., Chang, L. (2003) A Novel Anomaly Detection Scheme Based on Principal Component Classifier. Conference paper, University of Miami, Coral Gables, FL, USA

Comments

  • Variable units: in millions of what?
  • What did you learn from the descriptive analysis?
  • Where are the results of the frequency command?
  • Programs are missing
  • Typos
  • Analysis of the 1st PC: could we not see this with a boxplot?
  • The exclusion of outliers suggests a more complex transformation of the data, e.g. log
  • Rather than showing the table of means, medians, etc. for each group, a graphic would be much better
  • Why not show a contingency table of sector vs. cluster group?
  • Why do some univariate analysis AFTER doing the multivariate analysis?
  • Where are the histograms?
  • If you use log(vars), do you think your outliers are still outliers?
  • "no need to do log-transformation because PC-based approach has no distributional assumption;", but outliers influence it heavily
  • The R^2=0.96 is too high; you are estimating the "outliers" well, but I guess that you do not estimate all other observations well
  • Why not use log(Profits) vs. Market Value? The graphic is unreadable. Why do you not add 9 regression lines to the graphic?