Innovation data analysis using XploRe
- 1 Introduction
- 2 Data exploration
- 3 Univariate analysis
- 4 Multivariate analysis
- 5 Conclusion
- 6 References
- 7 Comments
Innovation is one of the driving forces in today’s fast-developing world. Different definitions of the term can be found in various sources. Wikipedia defines innovation as the “introduction of new ideas, goods, services, and practices which are intended to be useful, though a number of unsuccessful innovations can be found throughout history” (http://en.wikipedia.org/wiki/Innovation). Princeton University defines innovation as “creation (a new device or process) resulting from study and experimentation as well as the act of starting something for the first time” (http://wordnet.princeton.edu). The main reasons why innovation matters for individuals and companies are the following:
- improvement of living standards: innovation increases life quality, changes lifestyles;
- increase of productivity: using innovative technologies, more output can be produced with the same amount of limited resources;
- commercial success: innovation increases competitiveness, helps to find new solutions;
- creation of employment;
- contribution to economic growth.
The great importance of innovation motivates a more detailed analysis of this process. The goal of the analysis is thus to determine the factors leading to innovation world-wide.
MDTech XploRe software is used as the tool for data evaluation. The program code is provided in the thesis (it appears by clicking the corresponding figure).
Description of the data set
The data set contains 146 observations corresponding to almost all countries in the world. It is taken from the environmental performance measurement project of Yale University, Columbia University and the World Economic Forum (http://www.yale.edu/esi). Eight variables were chosen for the analysis: the innovation index as the dependent variable and the rest as possible explanatory variables. For the purpose of regional analysis we divided the countries into 4 groups according to two criteria: first, the region, and second, our subjective evaluation of the country’s development level. This grouping yields the 9th variable, “Country group”. Table 1 provides a detailed overview of each variable, its type and measurement units.
|No.||Variable||Type||Measurement||Description|
|1||Country group||Nominal||1 – 5||1 = Europe, USA, Canada, Japan, Australia, New Zealand, Russia (most developed countries, MDC); 2 = Rest of Asia; 3 = Africa; 4 = South America|
|2||Innovation index||Metric, continuous||Standardized score between 1 (lowest) and 7 (highest)||Measures national innovation capacity of countries through monetary investment in research and development and the number of new patents|
|3||Civil and Political Liberties||Ordinal, discrete||1 (high levels of liberties) to 7 (low levels of liberties)||Average of political and civil liberties indices measuring freedom of expression, rights to organize, rule of law, economic rights|
|4||Carbon emissions||Metric, continuous||Tons per capita||Measures pollution|
|5||Control of corruption||Metric, continuous||Standardized scale (z-score), high scores correspond to effective control of corruption||Measures social and economic costs of corruption, quality of public service delivery, business, environmental, and public sector vulnerability|
|6||Number of researchers||Metric, continuous||Number of scientific researchers per million inhabitants||Indicates scientific capacity of the country|
|7||Tertiary enrollment||Metric, continuous||Percentage of pupils (both sexes) of relevant age enrolled at tertiary level of schooling||Measures the level of education within a population|
|8||Gasoline price||Metric, continuous||Ratio of gasoline price to world average price||Indicates the costs of production factors|
|9||Digital Access Index||Metric, continuous||Score between 0 and 1, higher scores correspond to better access||Composite index: the equally weighted average of Infrastructure, Affordability, Quality and Usage of telecommunications such as internet and cellular phones. Indicates access to information|
Missing and extreme values
Our original data set contains missing values. They were treated differently depending on the step of our analysis. The univariate analysis was performed by simply ignoring the missing values, so as not to distort the graphs with artificially replaced values. It should be emphasised that we did not exclude whole observations with a few missing values, because we expected that some countries (e.g. the poor ones, which usually lack some estimates) might show interesting results in the available explanatory variables. However, we replaced the missing values for the multivariate part of the analysis in order to have enough observations (only about 20 observations had full sets of estimates). For this purpose the average of each region was calculated for each variable and used as the replacement, under the assumption that similarities in region and development level between countries might lead to similar results in our 8 variables.
The data set also contains several extreme values. These observations were not excluded in any of the analysis steps because they belong to countries like the USA, Japan or South Korea, which are important for the evaluation.
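The regional replacement of missing values can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `impute_by_group_mean`, the field names and the toy numbers are hypothetical and do not come from the XploRe code of the thesis.

```python
from statistics import mean

def impute_by_group_mean(rows, group_key, value_keys):
    """Return copies of rows in which None is replaced by the group mean of that variable."""
    # Collect the observed (non-missing) values per (group, variable).
    observed = {}
    for row in rows:
        for key in value_keys:
            if row[key] is not None:
                observed.setdefault((row[group_key], key), []).append(row[key])
    group_means = {gk: mean(vals) for gk, vals in observed.items()}

    completed = []
    for row in rows:
        filled = dict(row)  # leave the original row untouched
        for key in value_keys:
            if filled[key] is None:
                # A KeyError here means the group has no observed value for this variable.
                filled[key] = group_means[(row[group_key], key)]
        completed.append(filled)
    return completed

# Hypothetical toy data: group codes follow Table 1, the numbers are invented.
countries = [
    {"country": "A", "group": 3, "researchers": 100.0},
    {"country": "B", "group": 3, "researchers": 300.0},
    {"country": "C", "group": 3, "researchers": None},
    {"country": "D", "group": 1, "researchers": 4000.0},
]
completed = impute_by_group_mean(countries, "group", ["researchers"])
print(completed[2]["researchers"])  # 200.0
```

Because the original rows are copied rather than modified, the same data can still be used with the missing values left out for the univariate graphs.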
To take a first closer look at the data, summary statistics are calculated for each variable (see Table 2). The fact that the median and the mean do not coincide but lie far apart indicates shifted, i.e. asymmetric, distributions for all variables. Skewness measures the degree of asymmetry of a distribution; it is calculated according to Formula 1 and equals 0 for a normal distribution. According to this measure, variables 2 – 8 might also not be normally distributed.
Formula 1. Skewness:
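Assuming the usual moment-based definition (the thesis may have used a bias-corrected sample variant), the skewness of observations $x_1, \dots, x_n$ with mean $\bar{x}$ can be written as:

```latex
S = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{3}}
         {\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right)^{3/2}}
```

A positive value indicates a longer right tail, a negative value a longer left tail.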
Kurtosis measures the fatness of the tails of a probability distribution (see Formula 2). It should not be confused with skewness, which captures the fatness of one tail relative to the other (www.equanto.com/glossary/k.html).
Formula 2. Kurtosis:
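Again assuming the moment-based definition without excess correction (an assumption on our side), the kurtosis of observations $x_1, \dots, x_n$ with mean $\bar{x}$ is the fourth central moment divided by the squared variance:

```latex
K = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{4}}
         {\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right)^{2}}
```

With this definition a normal distribution has $K = 3$; larger values point to fatter tails.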
The kurtosis of a normal distribution equals 3. The kurtosis values of our variables differ from this benchmark, which confirms the impression that none of them seems to follow a normal distribution.
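As a rough illustration of how such normality checks can be computed, the following Python sketch evaluates moment-based skewness and kurtosis. The function names and the sample values are our own and purely illustrative; the summary statistics of the thesis were produced in XploRe.

```python
def central_moment(xs, k):
    """k-th central moment with divisor n (no bias correction)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    # Third central moment scaled by the standard deviation cubed.
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    # Fourth central moment scaled by the squared variance (normal distribution: 3).
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

# Invented sample with one large value, mimicking a single extreme country.
sample = [1.0, 2.0, 2.0, 3.0, 10.0]
print(skewness(sample) > 0)       # True: the long right tail gives positive skewness
print(kurtosis([1, 2, 3, 4, 5]))  # about 1.7, flatter than a normal distribution
```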
Next, the univariate graphical analysis is performed.
First we look at our dependent variable using boxplots. Figure 1 confirms our expectations about the innovation level in the world: on average the MDC countries innovate much more than the other regions (their 25% quantile is higher than the 75% quantile of every other region), with Asia in second place, followed by South America and Africa with the lowest innovation level. The USA is the only extreme value among the MDC and a clear world leader in innovation, which we attribute to the country’s well-known research funding, high development standards and economic dominance.
It is notable that the Asian extreme values, especially Taiwan, score higher than 75% of the MDC countries. We explain this by the clear economic dominance of these countries in the Asian region: Taiwan and South Korea are two of the Asian "Four Tigers", along with Singapore and Hong Kong. Taiwan has one of the world's highest standards of living and, together with South Korea, is well known for its high-tech exports. Israel is a world leader in software development, and the main industry of its strong economy is the innovation-intensive high-tech sector.
The positions of the regions according to the DAI variable are almost the same as for innovation (see figure 2), suggesting a positive relationship between these two variables. The only difference is that South America is on average more digitally advanced than Asia. The recurring extreme values of South Korea and Taiwan confirm their status as high-tech innovative countries, which is uncommon in the emerging markets of Asia.
We assumed that gasoline prices might be a good proxy for the cost of production factors in the economy. At first glance, figure 3 suggests that high factor costs go along with innovation. The plot shows interesting extreme values which can be explained economically: South Korea has one of the highest oil prices in Asia, probably due to its high demand for gasoline and the cost of shipping oil to the country. Iraq and Turkmenistan are able to keep local oil prices low as they are rich in oil resources and among the world’s biggest gasoline producers. The high oil prices in Cuba are probably a result of the United States trade embargo, which was tightened in 1992.
According to carbon emissions, the industrially active MDC and Asian countries are the world leaders (see figure 4). However, it is noticeable that the Asian extreme values are even higher than those of the MDC, which might be explained by Asia’s low environmental standards. The regional emission results (the same order of boxplots as for the innovation variable) imply that most innovations are created in industrially well-developed countries.
Figure 5 shows the interesting fact that Asia, Africa and South America have a very similar, low average level of corruption control. It is noticeable that the spread among the MDC countries is high, mainly because corruption still thrives in, e.g., East and Central European countries and Russia, whereas it is strongly controlled in Western Europe and the USA.
All in all, the boxplots indicate non-normal distributions for most variables; only DAI and gasoline prices seem to be nearly normally distributed. The boxplots also suggest that there is a relationship between each of the explanatory variables and innovation. However, a deeper analysis is necessary to estimate it.
The following QQ plots (see Figure 6) confirm the results of the summary statistics and the boxplot analysis: the variables are not normally distributed. The gasoline price variable also shows a small deviation from the 45-degree line, implying that this variable is not normally distributed either.
To estimate the density of the variables, 3 methods are used: histogram, averaged shifted histogram (red line) and kernel density estimate (blue line). We estimate the density function on the whole-world data to have a sufficient number of observations. The 3 estimators extend the insights of the boxplots since they show possible multimodality. The averaged shifted histogram avoids the problem of choosing the histogram origin; we use 10 shifts to eliminate the influence of the origin. But as such a histogram still loses information, we also plot smooth kernel density estimates.
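The two smoother estimators can be sketched in Python as follows. This is a minimal illustration, not the XploRe implementation: the function names, the Gaussian kernel, the bandwidth `h` and the toy data are all our own assumptions (the bandwidth used in the thesis is not stated).

```python
import math

def ash_density(data, x, h, m=10, origin=0.0):
    """Averaged shifted histogram at x: the mean of m histograms with bin
    width h whose origins are shifted by h/m against each other."""
    n = len(data)
    delta = h / m                         # width of the fine grid cells
    k = math.floor((x - origin) / delta)  # fine cell that contains x
    total = 0.0
    for xi in data:
        i = math.floor((xi - origin) / delta) - k  # cell offset of xi relative to x
        if abs(i) < m:
            total += 1.0 - abs(i) / m     # triangular weight from averaging the shifts
    return total / (n * h)

def gaussian_kde(data, x, h):
    """Kernel density estimate at x with a Gaussian kernel and bandwidth h."""
    n = len(data)
    kernel_sum = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return kernel_sum / (n * h * math.sqrt(2 * math.pi))

# Invented scores on a 0-1 scale, loosely resembling an index such as DAI.
scores = [0.12, 0.15, 0.18, 0.35, 0.41, 0.44, 0.72, 0.75, 0.79]
for x in (0.15, 0.40, 0.75):
    print(x, round(ash_density(scores, x, h=0.2), 3), round(gaussian_kde(scores, x, h=0.05), 3))
```

With `m=1` the first function reduces to an ordinary histogram, which makes the role of the 10 shifts easy to check.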
Density estimates (see figure 7) imply that all of our variables except gasoline prices are shifted, skewed and asymmetrically distributed. The graphs indicate two groups of very similarly distributed variables: the first consists of the Innovation Index and the Control of Corruption, the second of Carbon Emissions, Tertiary Enrollment and Number of Researchers. The fact that we could identify almost the same countries in the peaks of each group suggests that there are at least two clusters of similarly distributed observations in the data set. More precisely, we found that African and South American observations tend to lie on the left of each distribution, Asia rather in the middle and the MDC countries on the right. This insight is clearly confirmed by the dotplots shown below. A more rigorous cluster analysis is presented in the multivariate analysis part.
The density estimates of Gasoline Price show that most countries have average petrol prices. However, the fat tails of the distribution point to the groups of oil-producing countries (left tail) and of high-oil-demand or remote countries (right tail). The Digital Access variable, whose density has 3 peaks, shows clear multimodality, possibly identifying 3 stages of digital development in the world (e.g. telephone and TV connection, mobile connection, internet).
Civil and Political Liberties is a discrete variable, so instead of histograms the more suitable bar charts were used to display its distribution. The variable is measured from 1, meaning a high level, to 7, a low level. Figure 8 shows that the distributions of Liberties differ a lot among regions. For instance, for the MDC the frequency decreases as the level of liberties falls, whereas for Africa it increases. The outliers in the MDC part are Russia, Ukraine and Belarus.
The dotplots presented in figure 9 show the spread of all observations in each variable, taking the regions into account (see the legend for the relevant colours). The graphs confirm the insight from the density estimates that the data could be divided into at least two big clusters. They can be identified from the intuitive lines showing the region borders in the dotplot of Tertiary Education: the observations of the MDC countries are spread over the right half of the plot, whereas Africa, Asia and South America all lie in the left half. Such behaviour is also typical for the rest of the variables except Gasoline Price, which is spread almost equally over all regions.
In order to have enough observations for application of multivariate statistical methods the missing values were replaced with the averages of the regions they belong to.
To check the cluster insights of the univariate analysis, we proceed with the clustering methods of multivariate analysis. The following clustering methods available in XploRe were applied to the data set:
- Complete linkage
- Simple average linkage
- Average linkage
- Ward method
However, all methods except the Ward method produced very unbalanced clusters even when dividing the data into only two groups (e.g. the first cluster contained only 3-4 observations), which makes the application of a regression problematic. Therefore the Ward method, which split the data into 2 groups of 30 well-developed countries and 116 less developed (hereafter: developing) countries, was chosen for our analysis. The Ward method uses an analysis-of-variance approach to evaluate the distances between clusters: at each step it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares. The results of the Ward method confirm the insights of the univariate graphs, especially the dotplots, where the division between the MDC countries and the rest of the world is clear in almost all variables. The analysis is carried on using these clusters (developed countries are marked red, developing ones black).
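The merging rule of the Ward method can be illustrated with a small, deliberately naive Python sketch. It is not the XploRe routine: the function name `ward_clusters` and the toy points are hypothetical, and in practice the variables would presumably be standardized first because they are measured in very different units.

```python
def ward_clusters(points, k):
    """Agglomerative clustering with Ward's criterion; returns k lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per observation
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in the total within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > k:
        pairs = [(merge_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)                      # cheapest merge in this step
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                           # j > i, so index i stays valid
    return clusters

# Invented two-dimensional toy data with two visibly separated groups.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (0.9, 1.1)]
print(ward_clusters(points, 2))
```

The loop recomputes all pairwise costs at every step, which is slow but keeps the criterion easy to read; library implementations use update formulas instead.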
Andrews curves and parallel coordinate plot
Andrews curves and the parallel coordinate plot (PCP) were plotted using 40 observations of the data set in order to avoid overplotting. Figure 10 shows that observations belonging to the same cluster behave similarly, thus supporting the results of Ward clustering. No outliers can be seen in the graph. The PCP also supports the clusters, especially in variables no. 2, 3, 5 and 6. The plot also shows that the variables are closely related to each other: to name the most prominent relations, a negative correlation can be inferred between variables 2 and 3 as well as 3 and 5, and a positive relation between variables 2 and 5.
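For readers unfamiliar with Andrews curves: each observation is mapped to a finite Fourier series whose coefficients are the variable values, so that similar observations produce similar curves. The following Python sketch assumes the standard form of the curve on the interval from -π to π; the function name and the example values are ours.

```python
import math

def andrews_curve(x, t):
    """Value at t of the Andrews curve of observation x:
    x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..."""
    value = x[0] / math.sqrt(2)
    for j, xj in enumerate(x[1:], start=1):
        freq = (j + 1) // 2  # frequencies 1, 1, 2, 2, 3, 3, ...
        value += xj * (math.sin(freq * t) if j % 2 == 1 else math.cos(freq * t))
    return value

# One invented (already standardized) observation with four variables.
observation = [0.5, -1.2, 0.3, 0.8]
grid = [-math.pi + i * 2 * math.pi / 100 for i in range(101)]
curve = [andrews_curve(observation, t) for t in grid]
print(len(curve))  # 101
```

Since the first variables enter with the lowest frequencies, the order of the variables affects the visual impression of the plot.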
Starplot and scatterplot matrix
Because Andrews curves and the PCP can display only a limited number of observations, we use a star plot to get a good overview of all variables.
Figure 13 shows that on average the stars of developed countries are bigger than those of developing ones. This indicates that a high level of innovation (in developed countries) is accompanied by high values in all variables except political liberties, which can be explained by its measurement scale from 1 (high level) to 7 (low level). The opposite behaviour can be observed in the stars of developing countries. In the black stars only the variables Liberties and Gasoline Price are exceptions to this behaviour. This again can be explained by the measurement of Liberties and the density function of Gasoline Price shown in figure 7. To sum up, the star plot confirms relationships between the explanatory and dependent variables.
In order to identify these relations better, we also present all variables in a scatterplot matrix (see figure 14). All variables were plotted against each other in order to also determine relations among the explanatory variables. First of all, the scatterplot matrix shows a positive relationship between the Innovation Index and the explanatory variables, except Liberties (negative relationship).
A strong positive correlation can also be inferred between the Digital Access Index and Carbon Emissions, Control of Corruption, Number of Researchers as well as Tertiary Enrollment. Such a relation is to be expected, as the Digital Access Index is closely related to a country’s industrial development and its educational and academic levels.
However, no clear relationship can be identified between the explanatory variables and Liberties or Gasoline Price, although these two variables seem to be correlated with Innovation. These insights provide a first approximate suggestion that the Digital Access Index, Liberties and Gasoline Price could be the most suitable candidates for explanatory variables of a regression. As they do not seem to be correlated with each other, the multicollinearity problem could be avoided.
Correlation matrices help to determine the exact correlation between the variables. The matrix is computed for each cluster in order to check whether the relationships between the variables differ depending on the country group (see figure 15). In each group we used red frames to mark the most highly correlated explanatory variables (absolute correlation above 65%) and green frames to identify the suitable (least correlated) candidates for a regression. It can be clearly seen that Innovation has a fairly high correlation with all explanatory variables in each cluster. However, the variables highly correlated with the Digital Access Index differ slightly among the clusters, as do the candidates for a regression.
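The screening of the correlation matrix can be sketched as follows in Python. The helper names, the 0.65 threshold taken from the text above and the toy columns are illustrative assumptions; the matrices in figure 15 were computed in XploRe.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def highly_correlated_pairs(columns, threshold=0.65):
    """List variable pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Invented values for three explanatory variables in one cluster.
cluster = {
    "dai": [0.1, 0.3, 0.5, 0.7],
    "researchers": [100.0, 300.0, 500.0, 900.0],
    "gasoline": [1.0, 0.5, 1.2, 0.6],
}
print(highly_correlated_pairs(cluster))  # only the dai/researchers pair is flagged
```

Pairs that are flagged would be poor joint candidates for a regression, since including both invites multicollinearity.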
Before we run the multiple regression, bivariate regression is applied to the data set in order to examine the linear relationships between the variables and to determine multiple regression candidates using alternatives to correlation such as the Adjusted R-Squared and the P-Value. Regressions were run for each cluster, plotting the Innovation values on the vertical axis and the respective explanatory variable on the horizontal one.
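A bivariate least-squares fit together with its R-Squared and Adjusted R-Squared can be computed as in the Python sketch below. The function name and the numbers are illustrative only; the P-value is left out because it additionally requires the t-distribution.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x with (adjusted) R-squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))  # residual sum of squares
    tss = sum((b - my) ** 2 for b in y)                                  # total sum of squares
    r2 = 1 - rss / tss
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # one regressor plus intercept
    return {"intercept": intercept, "slope": slope, "r2": r2, "adj_r2": adj_r2}

# Invented data: an explanatory variable on the x-axis, innovation on the y-axis.
fit = simple_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 2.0, 4.0])
print(fit)
```

The adjustment penalizes the number of regressors, which is why the adjusted value is the one compared across models of different size.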
The numerical results of the bivariate regressions are compared in table 3. One of the most important insights of the table is that, due to its very high P-values (11% and 67%), Gasoline Price is not a suitable explanatory variable for Innovation (especially in the developing world, as can also be seen in figure 16) and will thus be excluded from the multiple regression. Both P-values are well above the 5% significance level typical for hypothesis testing, which means that the coefficient of Gasoline Price is not significantly different from zero in either cluster. It is also fairly striking that only a comparably small proportion of the total variation in Innovation (9.1%) in developing countries can be explained by Civil and Political Liberties, which we would expect to provide room for self-expression and creativity and thus also innovation power. This relation seems to hold in the more advanced countries.
It is also important for our analysis that the main factors related to innovation differ between the clusters: in the developed world the three highest R-Squared values are provided by Digital Access, Control of Corruption and Carbon Emissions, whereas in the developing group most of the variation of the Innovation variable is explained by Enrollment in Tertiary Education, Digital Access and Number of Researchers. This confirms our previous insight that Innovation seems to depend more on industrial and political/legal framework variables in high-tech countries and on academic potential and knowledge resources in the other cluster.
|Number of Researchers||0.31053||0.44108||0.0008||0|
Even though the above way of choosing explanatory variables seems convenient, it does not consider interrelationships between the candidates and thus provides an isolated view of their relationship to Innovation. To eliminate this disadvantage and to test our hypotheses from the bivariate regressions, we proceed with multiple regression methods. As could already be seen from table 3, some of the variables are not suitable to explain Innovation and should rather be excluded from the regression. At this stage, however, we face the problem of how to eliminate these variables and which ones exactly need to be eliminated. Therefore, instead of running a simple multiple regression on all variables, we directly apply stepwise model selection methods. They check whether the elimination or inclusion of a variable improves the result.
Three methods of multiple regression are presented and compared below in order to provide a final explanatory model for the level of innovation world-wide. First of all, the XploRe quantlet “linregfs” was applied to perform the forward selection method. The main idea of this method is to start from one “good” variable and calculate the simple linear regression. Then it is decided stepwise for each of the remaining variables whether its inclusion improves the fit of the model.
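The idea of forward selection can be illustrated with the following Python sketch. It is not a reimplementation of “linregfs”: as a simplifying assumption it uses the Adjusted R-Squared as the improvement criterion, whereas the quantlet may rely on a different rule such as an F-test, and all names and numbers are hypothetical.

```python
def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y written as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * v for b, v in zip(beta, X[i]))) ** 2 for i in range(n))

def adjusted_r2(columns, y):
    n, p = len(y), len(columns)
    my = sum(y) / n
    tss = sum((v - my) ** 2 for v in y)
    return 1 - (ols_rss(columns, y) / (n - p - 1)) / (tss / (n - 1))

def forward_selection(candidates, y):
    """Add, one at a time, the variable that raises the adjusted R-squared most."""
    selected, best = [], float("-inf")
    while len(selected) < len(candidates):
        scores = {name: adjusted_r2([candidates[s] for s in selected + [name]], y)
                  for name in candidates if name not in selected}
        name = max(scores, key=scores.get)
        if scores[name] <= best:  # no remaining variable improves the fit
            break
        selected.append(name)
        best = scores[name]
    return selected, best

# Invented data: y depends on x1 only, "noise" is unrelated.
candidates = {
    "noise": [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 4.0, 8.0],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
}
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]
selected, best = forward_selection(candidates, y)
print(selected[0])  # x1
```

Backward elimination would run the same comparison in the opposite direction, starting from the full model and dropping variables.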
The second regression method, implemented by means of the quantlet “linregbs”, is backward elimination. It starts with the linear regression for the full model and then eliminates variables without influence step by step.
Thirdly, the quantlet “linregstep” was applied for stepwise selection regression. It is a kind of compromise between forward selection and backward elimination. Beginning with one variable, just like in forward selection, one of four alternatives has to be chosen at each step:
- Add a variable.
- Remove a variable.
- Exchange two variables.
- Stop the selection.
Despite their differences, the methods provide exactly the same results for our explanatory models. The consistency of the results of the different algorithms implies that the selected variables are appropriate to explain the variation of the innovation level in each cluster. The regression results are presented in ANOVA output table format in figures 17 and 18.
Below the regression results are summarized in equation form. In general, in regression with multiple independent variables, each equation coefficient tells how much the dependent variable is expected to increase (decrease) when that independent variable increases (decreases) by one unit, holding all the other independent variables constant.
According to the applied algorithms innovation level in developed countries can be most appropriately linearly explained by 3 variables: Carbon Emissions, Number of Researchers and Tertiary Enrollment. All of them are positively related to innovation.
INNOVATION = 0.98 + 0.6865*Nr. of researchers + 0.0003*Carbon emiss. + 4.4219*Enrollment
This model provides the Adjusted R-squared of approximately 63%.
We would also like to analyse which of the explanatory variables in the regression model has the strongest effect on the innovation level. However, the unstandardized regression coefficients cannot be used to determine the relative importance of the predictors because our independent variables are measured in different units. The way to interpret these effects regardless of their measurement units is to use standardized regression coefficients. They are computed by standardizing all variables (dependent and independent), i.e. converting them to z-scores. Standardized regression coefficients show by how many standard deviations the dependent variable changes when one of the independent variables changes by one standard deviation. The standardised regression coefficients, denoted as "StandB" in the above output tables, are used to evaluate the importance of each variable in our model. Here, Enrollment has the greatest impact on our dependent variable. The standardised regression coefficients are:
- 0.7525 for Enrollment
- 0.3738 for Carbon emiss
- 0.3132 for Nr. of researchers
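For a least-squares regression with an intercept, the standardized coefficient of a variable can also be obtained directly from its unstandardized coefficient. With $b_j$ the unstandardized coefficient, $s_{x_j}$ the sample standard deviation of the explanatory variable and $s_y$ that of the dependent variable (our notation, not that of the XploRe output), the relation is:

```latex
\beta_j^{\ast} = b_j \, \frac{s_{x_j}}{s_y}
```

This makes clear why a large unstandardized coefficient does not by itself signal a strong effect: a variable with little spread contributes little however large its $b_j$ is.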
According to the applied algorithms, there are also 3 most appropriate factors which explain the innovation level in developing countries; all of them have a positive relationship to the dependent variable:
INNOVATION = 1.35 + 0.174*Corrupt. control + 0.0003* Carbon emiss. + 0.9315*Enrollment
The standardised regression coefficients for developing country model can be written as follows:
- 0.4227 for Corrupt. control
- 0.2632 for Carbon emiss.
- 0.2504 for Enrollment
This model provides an Adjusted R-Squared of approximately 68%. In comparison to the Adjusted R-Squared results (maximum 59% for the developed cluster and 66% for the developing group) which we obtained by running multiple regressions on the explanatory candidates indicated by the correlation matrices and bivariate regressions, these outcomes are better and thus suitable for the final model of our analysis. For reasons of space, the output tables of these less efficient regressions are not published here.
In general, the results of the stepwise model selection methods are only partly consistent with our previous hypothesis that Innovation depends more on industrial and political/legal framework variables in high-tech countries and on academic potential and knowledge resources in the other cluster. They show that in each cluster both industrial power, indicated by Carbon Emissions, and academic/educational potential, indicated by Enrollment and Number of Researchers, are important to explain Innovation. Contrary to our previous insight, the legal/political framework, indicated by Control of Corruption, explains the Innovation level in developing countries rather than in developed ones (a unit change in Control of Corruption is associated with a change of 0.17 units in innovation). This result is more intuitive because one could expect very heterogeneous levels of corruption control among the 116 less advanced countries. It could be inferred that in the other cluster the legal system has already been developed to a very similar, high level and thus does not have a big influence on the innovation level.
Other differences between the effects of the independent variables on innovation in each cluster are also noticeable. First of all, tertiary enrollment is the main explanatory variable of innovation power in developed countries (the innovation level would increase by 4.42 units if Enrollment grew by one unit; a change of one standard deviation in Enrollment leads to a change of 0.7525 standard deviations in Innovation). However, in developing countries such a change in Enrollment would change the level of innovation by only 0.25 standard deviations. This can possibly be explained by the striking differences in the quality of education between the clusters. What is more, differences in industry's relation to universities play a role too: in developing countries firms do not cooperate with universities and do not fund them as much as in the developed world, so it might be that many innovative ideas from academia are not encouraged enough or are never implemented due to the lack of funds.
It can be seen that Carbon Emissions also have a higher influence on Innovation in advanced countries (StandB equal to 0.3738 and 0.2632 respectively). As we take Carbon Emissions as a proxy for the amount of production and other industrial activities, it can be inferred that the growth of industry in developed countries is related rather to innovative, high-tech products, whereas in developing countries it relates mostly to increasing mass production of basic (non-innovative) products (nowadays mostly due to outsourcing) and very low environmental standards.
To sum up, the final models of both clusters suggest that cooperation between industry and the academic environment is crucial for the innovative potential of a country; this can be seen as the main conclusion of our analysis.
What is more, developing countries should first of all invest in improving their legal environment in order to create suitable conditions to foster their innovative power, as in this cluster Control of Corruption is the variable with the highest impact on innovation.
It should be emphasised that our analysis focused on exploring linear relations. It would therefore be advisable to continue and possibly improve it by applying non-linear regression models as well as generalised linear regression models.
- Härdle, W., Klinke, S., Müller, M. (2000). XploRe – Learning Guide. Springer-Verlag Berlin Heidelberg.
- Härdle, W., Hlavka, Z., Klinke, S. (2000). XploRe Application Guide. Springer-Verlag Berlin Heidelberg.
- Härdle, W., Simar, L. (2003). Applied Multivariate Statistical Analysis. Springer-Verlag Berlin Heidelberg.
- Project of Yale University and World Economic Forum http://www.yale.edu/esi/
- Wikipedia http://www.wikipedia.org
- missing values replacement is better than simple mean replacement; there are better strategies, but they are not available in XploRe :(
- however, I would prefer to replace the missing values first and then do the analysis on the completed dataset, rather than mixing different replacement strategies across the analyses
- what are shifted distributions?
- Is gasoline price really a proxy nowadays?
- Only partially programs integrated
- Kernel density: bandwidth choice?
- Civil and Political Liberties is ordinal: boxplots are for metric variables, bar charts for ordinal ones
- Figure 12 is nice
- You could compare the regression on both clusters with a regression on the whole world