What drives Market Value: Analysis of the Forbes 500 US companies
Introduction[edit]
This paper analyzes the data of 79 companies included in the Forbes 500 list of the largest U.S. companies. We focus mainly on the following questions:
 What are potential criteria for being considered "the largest"? Which factors did the Forbes analysts consider most important when selecting the largest U.S. companies from a particular sector? (PCA).
 How "good" are the existing Forbes 500 criteria? To put it differently, how homogeneous is our sample? (Cluster analysis).
 What drives the market value of the Forbes 500 companies? (Regression analysis).
We intentionally use different statistical tools and methods to show different approaches to answer the questions indicated above.
Data description[edit]
 Datafile name: Companies
 Data source: The Data And Story Library (DASL)
This dataset holds several facts about 79 companies selected from the Forbes 500 list for 1986. It is a 1-in-10 systematic sample from the alphabetical list of companies. The Forbes 500 includes every company that ranks in the top 500 on any of the criteria. Forbes used 6 criteria to define the biggest companies; the sector each company belongs to is also identified.
Therefore the dataset includes 79 observations and 8 variables: 6 numerical and 2 nominal (text).
| Variable | Description | Type |
|---|---|---|
| Company | Company name | Nominal |
| Assets | Amount of assets (in millions) | Numerical |
| Sales | Amount of sales (in millions) | Numerical |
| Market Value | Market value of the company (in millions) | Numerical |
| Profits | Profits (in millions) | Numerical |
| Cash Flow | Cash flow (in millions) | Numerical |
| Employees | Number of employees (in thousands) | Numerical |
| Sector | Type of market the company is associated with | Nominal |
Descriptive Statistics[edit]
To obtain a detailed statistical summary for each variable, we apply the XploRe descriptive quantlet. See the example for "Market Value".

9 sectors are represented in the data sample: Communication, Energy, Finance, HiTech, Manufacturing, Medical, Retail, Transportation and Other. To learn more about the sectoral composition of our sample we use the XploRe frequency command and obtain a summary of sector frequencies. Later we will try to find distinguishing features of the sectors and dependencies typical for them.

Principal Components Analysis[edit]
Following our objectives, we will try to define which variables have the strongest impact on the variability of the data. The statistical tool for reducing the dimensionality of data and finding the most informative projections is principal components analysis. The weightings of the PCs tell us in which directions, expressed in the original coordinates, the best variance explanation is obtained. It is advisable to apply PC analysis to data of approximately the same scale. Since we deal with variables measured in different units (e.g. Employees and Assets) and on different scales, a transformation is necessary. The original variables were standardized by subtracting the mean and dividing by the standard deviation.
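The original computation was done in XploRe; as an illustration, the standardization step can be sketched in NumPy (the numbers below are toy values, not the Forbes data):

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract the column mean, divide by the column standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy numbers standing in for two of the Forbes variables (e.g. Assets, Employees).
X = np.array([[1000.0, 5.0],
              [2000.0, 7.0],
              [4000.0, 9.0]])
Z = standardize(X)
```

After this transformation every column has mean 0 and standard deviation 1, so no variable dominates the PCs merely because of its units.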
A first look at the plot of the 1st vs the 2nd PC suggests that we have obvious outliers in the data set. Since the first principal component usually has the highest explanatory power, it is reasonable to search along the 1st PC axis first. Therefore we first exclude the two farthest points along the 1st PC. These points correspond to two companies, General Electric and IBM, both with the highest values of Sales, Market Value, Profits, Cash Flow and Employees; furthermore, IBM has the highest value of Assets, which makes it the best in all characteristics among the presented companies.
Then we carry out the same procedure, rescaling the remaining sample. The second graph indicates that we again have points along the first PC that are quite distant from the rest of the cloud, so this time we exclude Bell Atlantic (the company with the highest market value among the remaining sample).
After several iterations it may seem that we will always find some outlying points that do not fit the point cloud perfectly. We decided to restrict ourselves to 3 iterations. After excluding 5 outliers in total (IBM, General Electric, Bell Atlantic, Cigna, LTV) we are left with 74 observations.
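The screening step, flagging the observations farthest out along the first PC of the standardized data, can be sketched as follows (a NumPy stand-in for the XploRe code; the sample below is synthetic, with row 5 made extreme on purpose):

```python
import numpy as np

def pc_scores(X):
    """Standardize the columns of X and project onto the eigenvectors of the
    correlation matrix, ordered by decreasing eigenvalue."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]
    return Z @ eigvec[:, order]

def farthest_along_pc1(X, k=2):
    """Indices of the k observations lying farthest out along the first PC."""
    scores = pc_scores(X)
    return set(np.argsort(np.abs(scores[:, 0]))[::-1][:k].tolist())

# Hypothetical sample: correlated "size" variables with one extreme company (row 5).
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 1))
X = base + 0.2 * rng.normal(size=(20, 3))
X[5] = 15.0
```

Iterating this (exclude, re-standardize, recompute the PCs) reproduces the procedure described above.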
The next table presents the characteristics of the outliers in comparison with the remaining data. As can be seen, in almost all variables the outliers have values considerably exceeding the mean values of the remaining sample.

The decomposition of the covariance matrix of the remaining dataset gives us the following eigenvalues and corresponding eigenvectors:

The percentage of variation explained by each principal component:
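These shares are simply each eigenvalue divided by the total. A minimal sketch (the eigenvalues below are hypothetical, chosen only so that they sum to 6 like those of a 6x6 correlation matrix):

```python
import numpy as np

def explained_variance_pct(eigenvalues):
    """Percentage of total variance captured by each principal component."""
    ev = np.asarray(eigenvalues, dtype=float)
    return 100.0 * ev / ev.sum()

# Hypothetical eigenvalues of a 6x6 correlation matrix (they sum to 6).
share = explained_variance_pct([4.2, 1.0, 0.4, 0.2, 0.15, 0.05])
```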

The plot below shows which original variables are more strongly correlated with the first and the second PC. The correlation coefficients are shown in the table.

The main conclusion is that the variable of our main interest, "Market Value", is rather well explained by the first PC (the correlation coefficient equals 0.83). This supports the further analysis. In the regression section we will try to find a more precise dependence of MV on the PCs as well as on the individual variables.
Cluster Analysis[edit]
The next step is to figure out whether there is any similarity between individual companies in the data sample, or, in other words, whether it is possible to divide the data into groups (clusters) with certain characteristics. As we are already aware, each company is assigned to a certain sector. Altogether 9 sectors are present in the data (Communication, Energy, Finance, HiTech, Manufacturing, Medical, Retail, Transportation and Other).
Our task is to figure out whether membership in a sector endows a company with some particular, sector-specific characteristics.
To draw conclusions we use another statistical tool, cluster analysis. We divide our data into clusters using an agglomerative hierarchical algorithm with distances computed by the Ward method. On the dendrogram below it is easy to see that the sample clearly divides into two clusters.
The sample is divided into two groups: "blue", the bigger one (58 companies), and "red", the smaller one (16 companies). Our guess that each sector possesses particular qualities was only partially confirmed: all financial companies as well as the majority of energy and transportation companies are gathered in the "blue" cluster, but in general both groups contain representatives of all sectors. Nevertheless we can characterize the obtained clusters by descriptive statistics. Below we compare the statistical characteristics of both:

The descriptive statistics show that the "red" companies have higher means and variances (for all variables). However, the "blue" cluster contains at least one company with very high assets. In fact it has more high-assets companies, which are compensated by low-assets firms; as a result its mean is smaller than that of the "reds". This high-assets contribution comes from the financial companies, which are all concentrated in the "blue" cluster.
How can this analysis be helpful? Below we will run the regression inside the revealed clusters and figure out whether there are any dependencies among the variables in the two groups, and whether they differ from the whole sample and from each other.
Distribution Analysis[edit]
In order to analyze the distribution of each characteristic in our sample, we employ two statistical methods, namely histograms and empirical kernel probability density estimates (bandwidth = 0.15). Because both methods convey roughly the same picture of the distributions, we present here only the kernel density estimates:
The graph shows that all our variables have strongly skewed distributions. We address this by taking logs, which makes the distributions more symmetric. The graph below presents the kernel density of the log-transformed data, which resembles a Normal density:
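The effect of the log transformation on skewness can be checked numerically; a sketch on lognormal toy data (right-skewed like the raw Forbes variables, not the actual sample):

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# Lognormal toy data: strongly right-skewed, like the raw Forbes variables.
rng = np.random.default_rng(42)
x = np.exp(rng.normal(size=1000))
```

Taking `np.log(x)` drives the skewness toward zero, which is what the kernel density plots show graphically.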
Regression Analysis[edit]
The aim of the regression analysis carried out in this section is to determine what drives the market value of the biggest American companies, i.e. to find the link (if any exists) between market value (MV) and the other characteristics in our sample.
We begin with simple regressions of MV on the other numerical variables, namely: assets, sales, profits, cash flow, and number of employees. Note that from now on we use logs of the variables, except for Profits and Cash Flow, which contain negative values. The sample also excludes the outliers detected by PCA. Using the graph below we can make some preliminary remarks:
 The linear regression log(MV) vs log(Assets) does not fit the data well; perhaps the true link function is nonlinear.
 Scatterplots for variables “Sales” and “Employees” look very similar. The same observation pertains to the pair “Profits” and “Cash Flow”.
 The best fit is provided by the regression log(MV) vs log(Profits). To obtain this regression, we eliminated from the sample 7 observations with negative profits and then applied the log transformation to this variable.
Guess #1 is confirmed by the insignificant coefficient of this regressor. To verify guess #2 we computed the correlation matrix and obtained correlations of nearly 0.89 for both pairs of variables: "Sales"/"Employees" and "Profits"/"Cash Flow". These coefficients are even higher (0.97 and 0.99 respectively) in the sample with 4 outliers included. High correlation between regressors leads to a multicollinearity problem in multiple regression.
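The multicollinearity diagnosis boils down to inspecting the correlation matrix of the regressors. A sketch on synthetic data mimicking the reported pattern (employees roughly proportional to sales; the figures are invented for illustration):

```python
import numpy as np

# Hypothetical regressors: Employees nearly proportional to Sales,
# so the two are almost perfectly correlated.
rng = np.random.default_rng(7)
sales = rng.uniform(100.0, 10000.0, size=200)
employees = 0.05 * sales + rng.normal(0.0, 10.0, size=200)

# Pairwise Pearson correlations between the columns.
R = np.corrcoef(np.column_stack([sales, employees]), rowvar=False)
```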
Finally, we replace “Profits” with log(“Profits”) and run the multiple regression of “Market Value” on “Assets”, “Sales” and “Profits” (all variables are log). The results are presented in the table below.
The F-test shows that the regression is significant. R-squared equals 0.74, which means the regression explains 74% of the dispersion of "Market Value". Only the first regressor, "Assets", has a coefficient that is insignificant at the 5% critical level.
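The multiple regression itself is standard OLS; a self-contained sketch on synthetic stand-ins for the log variables (the coefficients below are invented, not the fitted Forbes values):

```python
import numpy as np

def ols(y, X):
    """OLS with intercept via least squares; returns (coefficients, R-squared)."""
    D = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ beta
    r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return beta, r2

# Synthetic stand-ins for log(Assets), log(Sales), log(Profits) and log(MV).
rng = np.random.default_rng(3)
la, ls, lp = rng.normal(size=(3, 100))
lmv = 2.0 + 0.1 * la + 0.2 * ls + 0.8 * lp + rng.normal(0.0, 0.1, size=100)
beta, r2 = ols(lmv, np.column_stack([la, ls, lp]))
```

With the real data one would also compute standard errors to reproduce the significance tests reported in the table.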

Another way to analyze market value with the help of regression analysis is to use principal components (PC) as regressors. Thus we can avoid problems we encountered above, namely:
 no multicollinearity in the model, because PCs are uncorrelated by construction;
 no need to do a log transformation, because the PC-based approach makes no distributional assumptions;
 the use of PCs reduces data dimensionality, which is welcome given that we have relatively few observations for regression analysis.
The only disadvantage of this approach is that it does not give an explicit relationship between variables, because principal components have no direct economic interpretation. This does not mean, however, that the model yields no results. Principal components are simply linear combinations of the initial variables (with specific properties), so the true "link function" can be inferred from the regression equation. We can also treat principal components as indices constructed by weighting the initial variables; the value of the index is then related to market value through the fitted linear regression function. See the next section for details.
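Principal-components regression can be sketched as follows (NumPy stand-in for the XploRe computation; the collinear regressors below are synthetic, driven by one common "size" factor):

```python
import numpy as np

def pc_regression(y, X, k=2):
    """Regress y on the first k principal components of the standardized X."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]
    scores = Z @ eigvec[:, order[:k]]            # uncorrelated by construction
    D = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    fitted = D @ beta
    r2 = 1.0 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return beta, r2

# Strongly collinear hypothetical regressors driven by one common size factor.
rng = np.random.default_rng(5)
f = rng.normal(size=150)
X = np.column_stack([f + 0.1 * rng.normal(size=150) for _ in range(4)])
y = f + 0.05 * rng.normal(size=150)
beta, r2 = pc_regression(y, X, k=2)
```

Because the PC scores are mutually uncorrelated, the multicollinearity problem of the previous section disappears by construction.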
We should also note that this time we ran the regression on the first two PCs without deleting the observations considered to be outliers at the beginning. We obtain a high R-squared, which means the regression fits well. Nevertheless, let us also run the regression on the sample without the 5 outliers, namely General Electric, IBM, Cigna, LTV and Bell Atlantic.

Though our first two PCs have significant coefficients, R-squared fell dramatically from 0.96 to 0.71. Apparently the observations considered to be outliers improve the fit of our regression, so they can no longer be treated as outliers. Why could this happen? We assume that although on the plot these observations seemed to lie far from the other entities, this is not enough to call them outliers. The first few principal components have large variances and explain the largest cumulative proportion of the total sample variance. These components are usually strongly related to the variables with relatively large variances and covariances. Consequently, observations that are outliers with respect to the first few components usually correspond to outliers on one or more of the original variables. The last few principal components, on the other hand, represent linear functions of the original variables with minimal variance. These components are sensitive to observations that are inconsistent with the correlation structure of the data but are not outliers with respect to the original variables. Therefore, large values of observations on the minor components reflect multivariate outliers that are not detectable using a criterion based on large values of the original variables. Our 5 observations have large values on the features with high variance, so possibly they are not outliers in a multivariate sense.
Turning back to the cluster analysis, we would like to see whether there is any dependence inside the two groups of companies defined above. We are interested in whether the nature of the dependence is similar to the one already found in the whole data set, or whether, on the contrary, there are group peculiarities. Again we run the regression of log Market Value on log Assets, log Sales and log Profits. The results are in the codeblock below:

The first output corresponds to the "blue" group of companies, the second to the "red" group.
In the second group we do not observe any strong dependence of log Market Value on log Assets, log Sales and log Profits. All the coefficients are insignificant at the 5% level, and R-squared is very low (0.18). But for the first, "blue", group of companies we observe coefficients very similar to those obtained for the whole data sample, namely 3.32 (2.41), 0.15 (0.12), 0.14 (0.25), 0.78 (0.78), with the whole-sample values in parentheses. R-squared declined to 0.63 (0.74). The main difference is that the coefficient of "Sales" is now insignificant at the 5% level, while, on the contrary, the "Assets" coefficient became significant (though only at the 5% level).
As already mentioned, the "blue" group, the biggest one, includes mainly the companies from the financial sector as well as the majority of energy and transportation companies. So we can conclude that in these particular industries the market value is driven by Profits and Assets rather than Sales.
Sector Analysis[edit]
We would like to use the information about industry division in our analysis. Because the respective variable is not numeric, we would need to include 8 dummies for the 9 sectors in our regression. However, the limited number of observations precludes estimating 14 coefficients with 68 observations (after excluding outliers and negative profits). Therefore, we use the color options of XploRe to analyze the possible implications of each sector.
Each sector was assigned a distinctive color. In order to keep as many observations as possible, we used the initial sample of 79 companies here.
It is obvious that with regard to profits the sector does not matter much: companies from different industries lie on a common MV-profit curve. Meanwhile, the cloudy scatterplot "MV vs Assets", which led to the insignificant coefficient on the assets variable, now presents a clearer picture. We can observe that companies from different sectors exhibit different relationships between their assets and market value.
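One way to quantify this visual impression is to fit the MV-assets regression separately per sector and compare the slopes; a sketch on hypothetical data (two invented sectors with deliberately different slopes, not the Forbes sample):

```python
import numpy as np

def sector_slopes(log_assets, log_mv, sectors):
    """Fit log(MV) = a + b * log(Assets) separately within each sector;
    return the slope b per sector."""
    log_assets = np.asarray(log_assets, dtype=float)
    log_mv = np.asarray(log_mv, dtype=float)
    sectors = np.asarray(sectors)
    slopes = {}
    for s in np.unique(sectors):
        mask = sectors == s
        b, a = np.polyfit(log_assets[mask], log_mv[mask], 1)
        slopes[s] = b
    return slopes

# Hypothetical example: two sectors with clearly different asset elasticities.
x = np.linspace(1.0, 5.0, 10)
log_assets = np.concatenate([x, x])
log_mv = np.concatenate([0.5 * x + 1.0, 1.5 * x])
sectors = ["Finance"] * 10 + ["Energy"] * 10
slopes = sector_slopes(log_assets, log_mv, sectors)
```

Markedly different per-sector slopes would confirm that pooling all sectors blurs the assets coefficient, as the scatterplot suggests.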
Conclusion[edit]
PC analysis. Since the 1st PC explains a rather high percentage of the variation in the data sample, we tried to construct an index based on the 1st PC. The index we obtained is a linear combination of the original variables with weights equal to the entries of the eigenvector corresponding to the largest eigenvalue of the correlation matrix. This index could answer the questions of how to define the largest US companies and which parameters contribute most strongly to the index, in other words, which figures matter most for a company willing to find itself in Forbes' "list of the best". We obtained the following expression for the index that could be used to define which companies should be in the Forbes 500 list:
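The construction of such an index can be sketched as follows (a NumPy stand-in; the positively correlated "size" variables below are synthetic, with company 0 dominating all of them):

```python
import numpy as np

def size_index(X):
    """Weight the standardized variables by the first eigenvector of the
    correlation matrix; the sign is fixed so that larger companies score higher."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    w = eigvec[:, np.argmax(eigval)]
    if w.sum() < 0:
        w = -w
    return Z @ w, w

# Hypothetical positively correlated size variables; company 0 dominates on all of them.
rng = np.random.default_rng(9)
f = rng.uniform(1.0, 10.0, size=50)
f[0] = 50.0
X = np.column_stack([f * (1.0 + 0.05 * rng.normal(size=50)) for _ in range(4)])
index, w = size_index(X)
```

Ranking companies by this index reproduces the "largest companies" ordering that the 1st PC encodes.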
Regression analysis. We also tried to answer the question "what drives market value". After the log transformation of the original variables we obtained the following dependence of Market Value on Profits and Sales:
Sector analysis. We did not manage to define inherent characteristics of the given sectors; on the contrary, the analysis allows us to conclude that in the data sample each sector of the economy has companies with different values of assets, sales, market value and other parameters. On the other hand, we have found similarities in the data: the two clusters ("blue" and "red") contain companies with different characteristics, and the regression analysis showed a different dependence of log Market Value on the other variables within each cluster.
Nevertheless we concluded that the sector does matter when we examine dependencies among variables. The scatterplot "MV vs Assets" clearly shows that companies from different sectors exhibit different relationships between their assets and market value.
References[edit]
 Härdle, W., Simar, L. (2003) Applied Multivariate Statistical Analysis. Springer Verlag, Heidelberg. ISBN 3-540-03079-4 (486 p.)
 Härdle, W., Hlavka, Z. and Klinke, S. (2000) XploRe Application Guide. Springer Verlag, Heidelberg. ISBN 3-540-67545-0 (525 p.)
 Härdle, W., Klinke, S., Müller, M. (2000) XploRe Learning Guide. Springer Verlag, Heidelberg. ISBN 3-540-66207-3 (526 p.)
 Tsay, Ruey S. (2005) Analysis of Financial Time Series (Wiley Series in Probability and Statistics), 2nd edition. Wiley-Interscience. ISBN 0-471-69074-0 (640 p.)
 Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., Chang, L. (2003) "A Novel Anomaly Detection Scheme Based on Principal Component Classifier", conference paper, University of Miami, Coral Gables, FL, USA.
Comments[edit]
 Variable units: in millions what?
 What did you learn from the descriptive analysis?
 Where are the results of frequency?
 Programs are missing
 Typos
 Analysis of 1.PC: could we not see this by a boxplot?
 The exclusion of outliers suggests a more complex transformation of the data, e.g. log
 Rather than showing the table of means, medians, etc for each group a graphic would be much better
 Why not showing a contingency table sector vs. cluster group?
 Why doing some univariate analysis AFTER doing multivariate analysis?
 Where are the histograms?
 If you use log(vars), do you think your outliers are still outliers?
 "no need to do logtransformation because PCbased approach has no distributional assumption;", but outlier influence it heavily
 The R^2=0.96 is too high, you are estimating the "outliers" well, but I guess that you do not estimate all other observations well
 Why not use log(Profits) vs. Market Value? The graphic is unreadable. Why do you not add 9 regression lines to the graphic?