BPMN-Selftest - Analyzing the Data of an Online Survey
- 1 Introduction
- 2 Data Set and Variable Description
- 3 Explorative Analysis
- 4 Extreme Values and Outlier Analysis
- 5 Transformation
- 6 Testing Equality of Means
- 7 Testing for Correlations
- 8 Summary
- 9 References
Introduction
The idea of this paper is to analyze the data gathered by an online survey supervised by the department of Information Systems at the Humboldt-Universität zu Berlin. I am a member of the research team. The online survey can be found at www.bpmn-selftest.org. The survey addresses current research on process model comprehension. It tries to identify the factors which influence the understanding of process models. All data was collected and stored in a database while participants took part in the survey.
The goal of this paper is first to investigate the data set for extreme values and outliers. Outliers are removed from the data set, and the cleaned data set is then used for further analyses. Each section states the purpose of its statistical analysis at the beginning. This paper focuses on the application of statistical techniques to a given data set rather than on a complete statistical analysis of the data. The analyses applied were taught in the course “Computergestützte Statistik” at Humboldt-Universität zu Berlin. The statistical software used in this paper is SPSS 17.0.
Data Set and Variable Description
The data were first extracted from a database where they were stored in different tables. The extract and transform process was done within the database, where all data were consolidated. In the transform step some further constraints were added to the data set, e.g. the number of correct answers must be at least 16. Participants who did not reach this threshold were skipped. These constraints are not discussed in this paper as they were decided by the research team. However, where they are relevant for an analysis, they are stated and explained in the corresponding section. The data were stored in an Excel file which could easily be loaded into SPSS. After the import into SPSS the variable labels were edited in order to add semantic information to the variables and to make them more transparent.
The main objects in the data set are the participants who took part in the survey. First, the participants were asked for general information. Then 30 questions on process models were displayed, each of which the participant had to answer. Therefore, a number of data elements are stored for each participant. The elements can be arranged in groups covering specific aspects of the participants, e.g. general data or expertise data. Not all variables provided in the data set are used for the analyses in this paper. The current data set is aggregated to the participant level.
The following list shows the most important variables investigated in this paper:
- Participant-ID (email), the unique IDs of the participants, nominal discrete
- Model understanding time (mu_time), the time the participant needed to answer all 30 questions, metric continuous
- Number of correct answers (correct_answers_absolute), the number of questions the participant answered correctly out of 30 questions, metric discrete
- Efficiency (efficiency), the efficiency of a participant, stated as the number of correct answers he or she gives per hour, metric continuous
- Gender (E1b), the gender of the participant, nominal discrete
The figures below show the variable view and the data view in SPSS with some example entries.
For the data analysis it is necessary to understand how the individual variables are computed. The variable “mu_time” is computed by summing up the individual durations of all 30 questions in the online survey. The sum was already computed in the database. This matters because a sum of many individual durations tends towards a normal distribution according to the classical central limit theorem. The variable “correct_answers_absolute” indicates the number of correct answers, again over all 30 questions which were presented to each participant. Missing values cannot occur in the data set, because the online survey was constructed specifically to support a later data analysis. Therefore, an analysis of missing values can be skipped for the whole data set. Figure 2.3 confirms for several variables that no values are missing.
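The derivation of the aggregated variables can be sketched in a few lines of Python. This is only an illustration and not the code used in the survey database: the function names are hypothetical, and the sketch assumes that the question durations and “mu_time” are recorded in seconds, which the variable description does not state explicitly.

```python
def model_understanding_time(durations):
    """Sum the per-question durations (assumed to be in seconds)."""
    return sum(durations)


def efficiency(correct_answers, mu_time_seconds):
    """Correct answers per hour, assuming mu_time is given in seconds."""
    if mu_time_seconds <= 0:
        raise ValueError("mu_time must be positive")
    return correct_answers * 3600.0 / mu_time_seconds


# Made-up example: 30 questions answered in 40 seconds each.
durations = [40.0] * 30
mu_time = model_understanding_time(durations)
print(mu_time, efficiency(24, mu_time))
```

Under this assumption a participant with a typical “mu_time” between 1000 and 2000 and around 25 correct answers ends up with an efficiency in the range of the values reported in the next section.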
Explorative Analysis
In this section the variables are investigated for the first time. An explorative approach is used to visualize the data gathered, and statistical techniques support this initial investigation of the variables. For the visualization histograms, boxplots, and stem-and-leaf plots are used. In addition, measures such as the mean, median, variance, standard deviation, and the M-estimators are shown. These parameters give some information about how the values are distributed. For each variable the diagrams and parameters are shown below. Both are interpreted before further analyses are started.
Model Understanding Time (Variable “mu_time”)
Figure 3.1 depicts statistical parameters of the variable “mu_time”. The Descriptives show that the values are skewed to the right, which is also visible in the histogram. Some values lie far above both the median and the mean. However, the median and the mean do not differ much; the difference is only 82.86. This might indicate that only a few outliers distort the mean. Comparing the M-estimators with the mean shows that even these robust measures are far below the mean: two estimators are about 70 below it and the other two about 105. This supports the assumption of extreme values or outliers. The standard deviation is high, which indicates that some values are far away from the mean. The histogram also shows that the normal distribution might be violated by the values in the right tail of the distribution. The stem-and-leaf plot marks these values, which are greater than or equal to 2304, as extreme values. This is discussed further in the next section. The plot also shows that the distribution is skewed to the right. Most values lie in the ranges [0, 999] or [1000, 1999], but 16 values in total lie above these ranges. Judging from the boxplot, one can argue that the values might be normally distributed, since the median lies approximately in the middle of the interquartile range. The distance between the upper whisker (75% quartile plus 1.5 times the interquartile range) and the 75% quartile is larger than the distance between the 25% quartile and the lower whisker. This reflects the skewness to the right but does not necessarily contradict the assumption of a normal distribution. The boxplot further shows some extreme values which lie above the upper whisker. These data values are marked as dots.
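The whisker rule used in the boxplot (75% quartile plus 1.5 times the interquartile range) can be reproduced outside SPSS. The following sketch uses Python's standard library. The function names and the sample values are made up for illustration, and the quartile convention of `statistics.quantiles` may differ slightly from the one SPSS uses, so the resulting fence is not guaranteed to match the SPSS output exactly.

```python
import statistics


def upper_fence(values, k=1.5):
    """Return (q1, q3, fence) with fence = q3 + k * interquartile range."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 + k * (q3 - q1)


def values_above_fence(values, k=1.5):
    """List the values that lie above the upper fence."""
    _q1, _q3, fence = upper_fence(values, k)
    return sorted(v for v in values if v > fence)


# Made-up times, not taken from the survey data.
sample = [600, 800, 900, 1000, 1100, 1200, 1300, 1400, 2700]
print(upper_fence(sample))
print(values_above_fence(sample))
```

Values returned by `values_above_fence` are only candidates for extreme values; whether they are outliers has to be decided separately.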
Number of Correct Answers (Variable “correct_answers_absolute”)
For the variable “correct_answers_absolute” the difference between the mean and the median is very small, only 0.30. This might indicate that few or no outliers or extreme values distort the mean. The histogram is consistent with this assumption, as it shows no obvious outliers or extreme values. Even the M-estimators come quite close to the mean. However, it must be noted that the data retrieved from the database were restricted to a lower bound of 16 correct answers per participant. Therefore, no values below 16 occur, and the highest possible value is 30, as there are 30 questions. The normal distribution curve in the histogram supports the assumption that the values are normally distributed, and so does the boxplot. No extreme values are shown within the plot. Although the median line is shifted slightly from the middle of the box, the values might still be normally distributed. The stem-and-leaf plot does not show any extreme values either.
Efficiency (Variable “efficiency”)
The variable “efficiency” was computed in the online survey from the variables “correct_answers” and “mu_time”. It shows the overall performance of each participant and can be interpreted as how well he or she understood the process models and was able to answer the corresponding questions. For this variable, too, the difference between mean and median is quite small, only 2.51. The M-estimators yield values of about 70, which are fairly close to the mean of 71.51. This suggests that only a few data points distort the mean. In the Descriptives the maximum value is stated to be 167. This value clearly shows up on the right side of the histogram and is very likely an extreme value. The skewness parameter of 0.631 is low and signals a small skewness to the right, which might be caused by this extreme value alone. This is discussed again in a later section when dealing with outliers and extreme values. The histogram shows that a normal distribution fits the underlying values fairly well, which is supported by the boxplot. Both the stem-and-leaf plot and the boxplot show an extreme value.
Extreme Values and Outlier Analysis
Extreme Value Analysis
As discussed in the previous section, some extreme values already showed up in the histograms, stem-and-leaf plots, and boxplots. Extreme values were found for the variables “mu_time” and “efficiency”, while for the variable “correct_answers_absolute” no extreme values seem to exist. This first investigation is now supported by Figures 4.1 - 4.3. Each figure lists the five highest and the five lowest values of a variable. It must be noted that the extreme values shown are not necessarily outliers. The values can be interpreted as follows:
- Variable “mu_time”: The stem-and-leaf plot as well as the boxplot showed that some extreme values exist. In the boxplot four values were marked with stars. The stem-and-leaf plot also marked values of 2304 and above as extreme values. These four values show up in Figure 4.1, too. The lowest values are not extreme values, because the data were limited to a lower time value of 580 when they were extracted from the database. Therefore, extreme values or outliers do not exist at the lower boundary. At the upper boundary the data entry of 2700 is very likely an extreme value and possibly an outlier, which has to be tested.
- Variable “correct_answers_absolute”: In the previous section no extreme values could be detected from the plots. The values in the figure below are simply regular values, as the lower boundary for correct answers is 16 and the highest possible number is 30. Consequently, no extreme values can exist.
- Variable “efficiency”: In the preceding section one extreme value has been detected in the boxplot and stem-and-leaf plot. In the figure the highest value of 167 is far away from the second highest. Therefore, this value is an extreme value and also possibly an outlier.
The outlier tests in this section require the variables to be normally distributed. Once this requirement is met, several tests for outliers shall be computed: the Grubbs-test, Dixon’s r-test, and the David-Hartley-Pearson-Test. Once outliers are detected, they shall be removed from the data set, because outliers might distort other statistical analyses. However, one must keep in mind during later analyses that some values have been removed. It is therefore sometimes important to compare the original data set with the data set without outliers. In this paper this comparison is skipped, as the focus lies on studying the methods. The statistical methods applied in this paper can, of course, be applied to both data sets if needed.
Testing for Normal Distribution
The most important requirement for outlier tests is the normal distribution of the variables. Here, all variables are initially checked by applying the Kolmogorov-Smirnov-Test (K-S-test). The test does not reject a normal distribution for any of the three variables, as the asymptotic significance lies well above the significance level of 0.05. Accordingly, the null hypothesis H0 of a normal distribution is retained. After studying the boxplot and histogram in the previous section it was doubtful whether the variable “mu_time” is normally distributed. This doubt is reflected in the K-S-test, where “mu_time” has the lowest asymptotic significance. It must be noted that the K-S-test computes the mean and the standard deviation from the given data set, which is also stated by SPSS as shown in Figure 4.4. For the moment, we therefore assume a normal distribution for all variables.
There are still other tests for normal distribution: the Kolmogorov-Smirnov-Test with Lilliefors significance correction and the Shapiro-Wilk-test. Unlike the plain K-S-test, these two tests do not treat the mean and standard deviation estimated from the data as if they were known in advance. Thus, they are stricter tests for normal distribution. The outcome of the first test, the K-S-test with correction, shows that only the variable “efficiency” is normally distributed. According to the Shapiro-Wilk-test no variable is normally distributed. Summing up all results of the normality tests, one must assume that no variable is normally distributed at the given significance level of 0.05.
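To make the K-S idea concrete, the following sketch computes only the test statistic, i.e. the largest distance between the empirical distribution function and a normal distribution whose mean and standard deviation are estimated from the sample. It is a simplified illustration with a hypothetical function name and made-up data. It does not compute a significance value; for that, the Lilliefors tables or a statistics package would be needed.

```python
import statistics
from statistics import NormalDist


def ks_statistic_normal(values):
    """Largest distance between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(statistics.mean(xs), statistics.stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so both sides of the jump have to be compared with the fit.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Made-up samples: the second one contains one very large value.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme_value = [1, 2, 3, 4, 5, 6, 7, 8, 90]
print(ks_statistic_normal(symmetric))
print(ks_statistic_normal(with_extreme_value))
```

A single extreme value inflates the estimated standard deviation and moves the fitted curve away from the bulk of the data, which is why the statistic is much larger for the second sample.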
In addition to the statistical tests, one can also check graphically for normal distribution with the help of Q-Q-plots (quantile-quantile-plots) and detrended Q-Q-plots. These two plots are shown for all variables below. For the number of correct answers the values follow the line in the Q-Q-plot fairly well. From the plot alone one cannot decide whether this variable is normally distributed or not. For the second variable, “mu_time”, the Q-Q-plot clearly shows that the data values deviate from the normal distribution at both tails as well as in the body, although the deviation in the body is small. The detrended Q-Q-plot shows the deviation from the normal distribution more directly; its size can be read off the y-axis. The third variable, “efficiency”, also deviates at both tails, whereas the body follows a normal distribution quite well. At the left tail several values deviate from the normal distribution, while at the right tail one value clearly deviates.
Another possibility to test for normal distribution is the Jarque-Bera-Test. This test uses the skewness and kurtosis to compute the test statistic. As this test is not implemented in SPSS, one can take the values of skewness and kurtosis from SPSS and compute the test statistic by hand or with Microsoft Excel. Figure 4.12 shows the values of the test statistic. As the test statistic asymptotically follows a Chi-Squared-distribution with two degrees of freedom, one can look up the critical value, which is in this case 0.10. All values lie well above this value and, thus, no variable is normally distributed. Summing up, tests for outliers cannot be performed, as a normal distribution cannot be assumed for any of the three variables.
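The manual computation of the Jarque-Bera statistic can be sketched as follows. The sketch assumes that the kurtosis is given as excess kurtosis (zero for a normal distribution), which is how SPSS reports it, and it ignores the small-sample adjustments SPSS applies to skewness and kurtosis. The function names and the example values for the sample size and kurtosis are hypothetical; only the skewness of 0.631 is taken from the text above.

```python
import math


def jarque_bera(n, skewness, excess_kurtosis):
    """JB statistic from sample size, skewness and excess kurtosis."""
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)


def chi2_df2_critical_value(alpha=0.05):
    """Upper-tail critical value of the chi-squared distribution with 2 df.

    With two degrees of freedom the survival function is exp(-x / 2),
    so the critical value has the closed form -2 * ln(alpha),
    which is about 5.99 for alpha = 0.05.
    """
    return -2.0 * math.log(alpha)


def chi2_df2_p_value(jb):
    """Asymptotic p-value of a JB statistic (chi-squared with 2 df)."""
    return math.exp(-jb / 2.0)


# Hypothetical sample size and kurtosis; skewness 0.631 as reported for "efficiency".
jb = jarque_bera(n=100, skewness=0.631, excess_kurtosis=0.5)
print(jb, chi2_df2_critical_value(0.05), chi2_df2_p_value(jb))
```

Normality is rejected at level alpha when the statistic exceeds the upper-tail critical value, or equivalently when the p-value falls below alpha.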
Testing for Outliers #1
In the previous section the requirement for outlier tests was not met, so the statistical tests cannot be applied. However, one can still check for outliers semantically. This procedure does not rely on statistical tests, nor does it require a normal distribution. The check can always be applied by people who are familiar with the data set, with what the variables express, and with how they have been computed or derived. In the following this procedure is applied to the variable “efficiency” to illustrate what such a semantic evaluation can look like.
From Figure 4.10 one can assume that there is an outlier in the data. The boxplot and the stem-and-leaf diagram clearly show this value, which is now checked semantically. The value of 167 is the highest value in the data, while the second highest value is 116. From 116 on, the following values decrease only slowly. Given the gap of 51, the assumption of an outlier is plausible. In addition, the research team could derive an important detail from the online survey: only experts and professionals who took part in the survey could achieve values close to 167. No student came close to such a value; all students except the one under suspicion are far below it. Knowing this, one can mark the participant with an efficiency of 167 as an outlier and remove the participant from the data set.
Testing for Outliers #2
After the identified outlier has been removed, the tests for normal distribution must be repeated. As shown in the figure below, the variable “efficiency” is now normally distributed. The requirement for outlier tests is thus met and they can be applied.
With the help of the Grubbs-test two values can be tested for being outliers: the lowest and the highest value in the data set. Although the test is not implemented in SPSS, it can easily be applied to the data set. The test statistic is defined as $T_{\min} = (\bar{x} - x_{\min})/s$ for the lowest entry and $T_{\max} = (x_{\max} - \bar{x})/s$ for the highest entry, where $s$ is the standard deviation and $\bar{x}$ is the mean value. In order to compute the test statistic one can use the Descriptives computed in SPSS. These are shown below. The critical value for the test can be found at http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Tabelle-Grubbs/basis-statistik-tabelle-grubbs.html and is 3.171. The values of the test statistic can be found in Figure 4.16. The null hypothesis is that the lowest value (or the highest value, respectively) is not an outlier. As both values are below the critical value, the null hypothesis cannot be rejected. Neither value is therefore considered an outlier.
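Since the Grubbs-test has to be computed by hand, a small sketch may help. It computes the distance of the lowest and the highest value from the mean in units of the sample standard deviation and compares both with a critical value that has to be taken from a table for the given sample size and significance level. The function names and the sample are hypothetical.

```python
import statistics


def grubbs_statistics(values):
    """Return (T_min, T_max) for the smallest and the largest value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return (mean - min(values)) / s, (max(values) - mean) / s


def exceeds_critical_value(values, critical_value):
    """Flag (lowest, highest) as outlier candidates for a tabulated critical value."""
    t_min, t_max = grubbs_statistics(values)
    return t_min > critical_value, t_max > critical_value


# Made-up sample; the critical value must match the sample size and level used.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 90]
print(grubbs_statistics(sample))
print(exceeds_critical_value(sample, critical_value=2.0))
```

For the analysis in this paper the critical value of 3.171 quoted above would be passed as `critical_value`.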
For Dixon’s r-test no critical values are available for large samples. The only critical values found for a comparable sample size belong to the variant in which two values on each side are tested for being outliers. However, this assumption is unlikely to hold and is therefore not tested. A more plausible assumption would be that the two highest efficiency values and the lowest value are outliers. For such a test no critical values could be found, so the test is not performed.
The David-Hartley-Pearson-Test uses the same kind of hypothesis as the Grubbs-test: if the null hypothesis is true, the value which is farthest from the mean is not an outlier. The test statistic is defined as $T = R/s$, where $R$ is the range and $s$ the standard deviation of the variable. Both values can easily be found in the Descriptives in SPSS. The test statistic for the variable “efficiency” is then $T = 90/22.652 = 3.973$. Now it must be checked whether this value lies between the lower and the upper percentage point. These critical values can be found at http://mars.wiwi.hu-berlin.de/mediawiki/statwiki/index.php/David-Hartley-Pearson-Test. The lower percentage point is 4.24 and the upper percentage point is 13.34. The value of the test statistic does not lie between both percentage points, and consequently the null hypothesis is rejected. In the next step one has to determine whether the highest or the lowest value is further away from the mean. In this case the value 116 is identified as an outlier. The question now is whether this value should really be removed from the data set. Here the value is not removed, as some other values (115, 113, 111, 110) are extremely close to it, see Figure 4.15.
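The arithmetic of the David-Hartley-Pearson-Test can be checked with a few lines of Python. The range, the standard deviation, and the two percentage points are the values quoted in the text; the function names are hypothetical.

```python
def dhp_statistic(value_range, std_dev):
    """Range divided by standard deviation."""
    return value_range / std_dev


def within_percentage_points(t, lower, upper):
    """True if the statistic lies between the tabulated percentage points."""
    return lower <= t <= upper


t = dhp_statistic(90, 22.652)
print(round(t, 3))                              # 3.973
print(within_percentage_points(t, 4.24, 13.34))  # False
```

A statistic outside the interval leads to rejection of the null hypothesis, as described above.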
In this section the whole data set was first investigated for extreme values. This question had already been partly answered by the diagrams of the previous section. In the first half the five highest and five lowest values of each variable were investigated. Once such values are identified, they must be tested for being outliers. Most tests for outliers require a variable to be normally distributed, so this has to be checked with tests for normal distribution first. In the given data set no variable was normally distributed and therefore no test for outliers could be performed. With the help of a semantic check one value of the variable “efficiency” could be removed. A repeated check then confirmed the normal distribution, and two tests for outliers could be performed. A third test could not be performed due to missing critical values. One outlier was detected. As this value was very close to other values, it was not removed from the data set.
Transformation
With the help of a transformation a variable can be converted into another one by using a function which maps each value of the initial variable to a new value. Transformations are especially relevant when a variable is not distributed as required by some functions or tests, e.g. the outlier tests introduced in the previous section. If the required distribution is violated, some tests simply cannot be performed. Here, a transformation can help to obtain the desired distribution for the new variable. An example shall show what such a transformation process can look like.
As shown in Figure 4.13 the variable “mu_time” is not normally distributed. If outlier tests are to be performed, the requirements for them are not met. Figure 3.2 shows that the data are skewed to the right. Knowing this, an appropriate transformation function can be chosen. In this case a logarithmic function is suitable to achieve a normal distribution after the transformation. The function chosen here is simply $y = \log(x)$, where $y$ is the new variable and $x$ is the old variable. However, it must be checked that the old variable does not contain zero or negative values, for which the logarithm is not defined. For the given data this cannot happen, as “mu_time” is always a positive number. In other cases, a positive offset must first be added. In the figures below a histogram and a boxplot of the variables “mu_time” and “mu_timeNEW” are shown.
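The effect of a logarithmic transformation on right-skewed data can be illustrated with a short sketch. It assumes the natural logarithm (the base does not change the shape of the distribution) and uses the simple moment-based skewness, which differs slightly from the adjusted skewness SPSS reports. The function names and the sample values are made up.

```python
import math
import statistics


def log_transform(values, offset=0.0):
    """Apply y = log(x + offset); every shifted value must be positive."""
    shifted = [v + offset for v in values]
    if min(shifted) <= 0:
        raise ValueError("logarithm requires strictly positive values")
    return [math.log(v) for v in shifted]


def skewness(values):
    """Moment-based sample skewness (positive means skewed to the right)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up right-skewed times, not taken from the survey data.
times = [600, 700, 800, 900, 1000, 1200, 1500, 2700]
print(skewness(times))
print(skewness(log_transform(times)))
```

For this sample the skewness of the transformed values is clearly smaller than that of the original values, which is the effect the transformation is meant to achieve.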
The results of the transformation can easily be seen in the test of normality. According to the Shapiro-Wilk-test the variable “mu_timeNEW” is normally distributed, whereas “mu_time” is not. The transformation achieved the desired distribution.
However, the transformation of “mu_time” raises another question: the interpretation of the new variable “mu_timeNEW” is difficult. This is especially true if further tests are performed with this variable. The user must keep this in mind before using transformed variables. If statements about the transformed variable can easily be traced back to the original variable, then a transformation can be really helpful.
Testing Equality of Means
In this section the research question is whether two different groups (male and female participants) behave similarly with respect to their efficiency. This means that we want to answer the questions whether the variance of efficiency is equal in both groups and whether the mean of efficiency is equal in both groups. These questions can be answered graphically and/or with statistical tests. For the subsequent diagrams and tests it is assumed that the groups are independent of each other. This assumption holds for our groups, as one participant can only be male or female.
The graphical solution builds on exploring the data with e.g. the Descriptives, histograms, or boxplots. All figures can be found below. The Descriptives show that the mean and the variance in both groups are close to each other. However, it is not possible to conclude from this that the variance and the mean are equal in both groups. This must be tested with the appropriate statistical test, but one can assume that they do not differ too much. Another graphical approach is the error bar diagram shown in Figure 6.6. This diagram shows the 95 percent confidence intervals for the mean values in both groups. From the diagram one concludes that the mean values might be equal for the following reason. The female group has a broader confidence interval which seems to contain the complete confidence interval of the male group. In other words, the confidence interval of the male group lies fully within the female one. This might indicate that the means in both groups are equal. The subsequent statistical tests must confirm this assumption.
The statistical tests are divided into two blocks: the test for equal variances and the test for equal means. Variance homogeneity should be tested before testing for equal means.
Proof of Variance Homogeneity
With the help of the Levene-Test it is first tested whether the variances in both groups are equal. Before performing the Levene-Test, one should check its requirements. One needs two groups, here male and female. The Levene-Test does not necessarily require a normally distributed variable. However, the variable “efficiency” is actually normally distributed for the given groups as shown in Figure 6.6. Further, both groups are independent of each other. From Figure 6.7 one can derive that the variances in both groups are homogeneous, with a significance of 0.794 based on the mean. This is much greater than the assumed significance level of 5 percent, so the null hypothesis of equal variances is not rejected. The Spread vs. Level plot shows the median values of both groups on the x-axis and the corresponding interquartile ranges on the y-axis. Both points in the diagram can also be derived from the Descriptives in Figure 6.1.
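The idea behind the Levene-Test “based on mean” can be sketched as follows: each observation is replaced by its absolute deviation from its own group mean, and these deviations are then compared between the groups like in an analysis of variance. The sketch only computes the test statistic W; the significance would be obtained from an F-distribution with k - 1 and N - k degrees of freedom, which is not part of Python's standard library. The function name and the sample data are hypothetical.

```python
import statistics


def levene_statistic(*groups):
    """Levene's W based on absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group mean.
    z = [[abs(x - statistics.fmean(g)) for x in g] for g in groups]
    z_means = [statistics.fmean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Made-up groups: same spread in the first pair, different spread in the second.
print(levene_statistic([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
print(levene_statistic([1, 2, 3, 4, 5], [-10, 0, 3, 6, 16]))
```

A value of W close to zero means that the groups spread similarly around their means; the larger W is, the stronger the evidence against variance homogeneity.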
Proof of Equality of Means
Next, the test for equality of means is performed using the one-way analysis of variance (ANOVA). The requirements for this test are checked first. A normal distribution of the variable “efficiency” is required in both groups. This was checked before and is shown in Figure 6.6. Variance homogeneity is also needed, which was already shown above. All other requirements, e.g. independent groups, are also met in this example. The ANOVA answers the question whether there is a significant difference between the means of the two groups. The null hypothesis is that the means are equal in both groups. Figure 6.9 shows the result of the ANOVA. With a significance of 0.444 the null hypothesis cannot be rejected, i.e. the test gives no indication that the means in the two groups differ. The means plot is similar to the spread vs. level plot; the means are simply connected by a straight line.
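The F statistic of a one-way ANOVA compares the variation between the group means with the variation within the groups. The following sketch computes it with Python's standard library for any number of groups. It does not compute the significance, which requires the F-distribution. The function name and the data are hypothetical.

```python
import statistics


def one_way_anova_f(*groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Made-up efficiency values for two groups.
group_a = [1, 2, 3]
group_b = [2, 3, 4]
print(one_way_anova_f(group_a, group_b))
```

With only two groups, as in the comparison of male and female participants, the F statistic equals the square of the t statistic of the two-sample t-test with pooled variance, so both tests lead to the same decision.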
Testing Equality of Distribution
Both the variances and the means in the groups are equal. Another interesting question is whether both distributions are equal or not. The histograms in Figures 6.2 and 6.3 seem to show different distributions. The Kolmogorov-Smirnov-Test can make this clearer. The test uses the maximum absolute difference between the two empirical distribution functions to compare them. The null hypothesis states that both distributions are equal. With a significance of 0.423 the null hypothesis is not rejected, i.e. the test gives no indication that the distributions differ.
With the help of the error bar chart a first graphical investigation of the equality of means was possible. However, the visualization could not answer the question whether the means in both groups are really equal or not. Therefore, statistical tests were applied. First, a test for variance homogeneity was done, which is a prerequisite for the subsequent ANOVA. The outcome of the ANOVA was that both groups have equal means, which was the question to be tested. In addition, a test for equal distributions of the variable “efficiency” was performed, which also confirmed that both distributions are equal.
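The test statistic of the two-sample Kolmogorov-Smirnov-Test can be computed directly from its definition. The sketch below evaluates both empirical distribution functions at every observed value and returns the largest absolute difference. It is a deliberately simple, quadratic-time illustration with a hypothetical function name and made-up data, and it does not compute the significance.

```python
def ks_two_sample_statistic(sample_a, sample_b):
    """Maximum absolute difference between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        # Empirical CDF value: share of observations less than or equal to x.
        cdf_a = sum(1 for v in a if v <= x) / len(a)
        cdf_b = sum(1 for v in b if v <= x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Made-up samples with partial overlap.
print(ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

The statistic is 0 for identical samples and 1 for samples that do not overlap at all.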
Testing for Correlations
For the analysis of correlations between variables several statistical methods exist. One can visualize the relationship in a diagram, e.g. a scatterplot, or one can compute parameters which indicate whether a relationship exists and how strong it is. For both approaches the user must pay attention to the level of measurement (e.g. nominal) of the individual variables. Several parameters only make sense if they are used with variables of the appropriate level of measurement.
In the following a graphical approach is used to identify correlations between variables. With the help of a scatter matrix three variables are plotted against each other. Knowing the semantic interpretation of the variables, one might think that the variables “E4a” and “E4b” positively influence the variable “efficiency”, e.g. users with a large number of work days of formal training should have a higher efficiency score. However, the graphical representation does not indicate this. Taken at face value, the result would mean that there is a large number of people with little formal training who often perform better than people with formal training. The same might be true for the variable “E4b” versus “efficiency”, as shown in Figure 7.1. One possible reason for this phenomenon is that only students took part in the survey, so that people with a significant amount of formal training are missing.
Another way of checking the correlation between variables is to use correlation matrices as shown below. Several metric variables are set against each other. For each pair the Pearson correlation coefficient and its significance are computed and shown. From the figure one can see that the variables “efficiency” and “mu_time” are negatively correlated with a coefficient of -0.868 and a significance of 0.000. This is not surprising, as the efficiency is computed with the help of the model understanding time (variable “mu_time”). A more interesting effect occurs for the variables “E4a” and “E4b”. From the scatterplot above one might see that these variables are positively correlated, but not to which extent. The figure below states a correlation coefficient of 0.758, which is significant. The semantic interpretation is that participants who received formal training most often also did some kind of self-education.
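The Pearson correlation coefficient reported in the correlation matrix can be computed from its definition as follows. The sketch uses made-up values in which the efficiency is derived from the time, assuming seconds as the unit, in order to show why a strongly negative coefficient between these two variables is to be expected. The function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: 25 correct answers each, efficiency in answers per hour.
times = [900, 1200, 1500, 1800]
efficiencies = [25 * 3600 / t for t in times]
print(pearson_r(times, efficiencies))
```

The coefficient only measures linear association, so a perfectly inverse but non-linear relationship such as the one above yields a value close to, but not exactly, -1.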
Regression
In the given data set not all variables from the online survey are listed as the data set is aggregated for the participant-level. That means only values for each participant are listed. In the online survey 30 models were shown together with 30 questions. Once the online survey is done it might be interesting to do a regression analysis with the parameters of the different models which are not available now. This might help to better understand the individual factors influencing the overall efficiency. However, as the given data set lacks some parameters especially for the models such a regression analysis cannot be done yet.
Summary
A graphical test for the correlation of variables is often useful once it is expected that two variables correlate. However, some graphical representations do not clearly show the results. Here a statistical parameter should be computed and used to really specify the correlation. Before these parameters are used it must be evaluated which statistical parameter to use. This is dependent on the measure of the variable.
Summary
In this paper the data gathered by an online survey are analyzed with the help of statistical approaches. Before turning to the statistical tests, the data set and the variables it contains are explained briefly. Afterwards an explorative analysis investigates three important variables with diagrams and some statistical parameters. One goal of this paper is to identify outliers in the variables. Prior to this, extreme values in the data are analyzed. The tests for normal distribution then reject normality, so that the outlier tests cannot be applied at first. With the help of a semantic check one outlier is removed, after which one variable is normally distributed. Then the outlier tests are applied, and again one outlier is identified. It is important to identify outliers as they might distort statistical approaches to a large extent. For one variable which does not follow a normal distribution a transformation is done as an example. This brings the data into the desired distribution, but at the cost of the interpretability of the values. The research question whether male and female participants behave in the same way is answered by applying statistical tests for variance homogeneity, equality of means, and equality of distributions. The outcome is that both groups have equal variances, equal means, and equal distributions. At the end an example shows how the correlation between variables can be checked.
Applying statistical methods to a data set is not a straightforward process. One must be careful in choosing the correct methods and in checking the requirements for these methods. Simply knowing the methods and approaches is not enough; a lot of knowledge and experience helps when performing such statistical tests. Studying how to apply these statistical methods was very interesting, although not always easy. In the end some of the applied methods and findings will support the current research project from which the data set originates.
References
- http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Tabelle-Grubbs/basis-statistik-tabelle-grubbs.html (access: 2010/03/25)
- http://mars.wiwi.hu-berlin.de/mediawiki/statwiki/index.php/David-Hartley-Pearson-Test (access: 2010/03/26)
- A. Bühl (2008), SPSS Version 16: Einführung in die moderne Datenanalyse, Pearson Studium
- J. Bortz (2006), Statistik für Human- und Sozialwissenschaftler, Springer Verlag
- Skript Computergestützte Statistik I & II, Prof. Dr. Bernd Rönz, 2000