Performance of Classification Methods - A Monte Carlo Study


Abstract[edit]

In this thesis I test some classification methods we got to know in the lecture Data Mining and Statistical Learning held by Dr. Sigbert Klinke and Dipl.-Kfm. Uwe Ziegenhagen, M.Sc. I also test some additional Weka classification methods on the famous Iris dataset to see how they perform. We have learned a lot about these methods, but in the real world we often have to find out which results are stable and, from a practical point of view, interpretable. I have performed a Monte Carlo study with five hundred runs and visualized some results for the interested reader. All analyses were done in R 2.4.1.

A Short Biography[edit]

R.A. Fisher

Sir Ronald Aylmer Fisher (born 17 February 1890 in London, England; died 29 July 1962 in Adelaide, Australia) was a famous British statistician, evolutionary biologist, and geneticist. He is the father of many statistical methods we use today. Fisher had very poor eyesight, but he was a very good student, winning a mathematical competition at the age of 16. He was tutored without pen and paper and was therefore able to visualize geometric problems without using algebra; he could produce mathematical results without stating the intermediate steps. In the early 1920s he pioneered the principles of the design of experiments and developed the well-known technique of analysis of variance. He began a systematic approach to the analysis of real-world data to support the development of modern statistical methods. In addition to analysis of variance, Fisher invented the technique of maximum likelihood and the F-distribution, and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminant analysis and Fisher information. He also began research in the field of non-parametrics, even though he did not believe it was necessary to move away from parametric statistics. He died in Adelaide, Australia in 1962. (Source mainly from Wikipedia.)

Dataset[edit]

Figure 1: Iris Virginica
Figure 2: Iris Virginica
Figure 3: Iris Versicolor
Figure 4: Iris Versicolor
Figure 5: Iris Setosa
Figure 6: Iris Setosa

The Iris Flower Dataset is a popular multivariate dataset available in nearly every statistical software package. Iris (German "Schwertlilie") is the Greek word for rainbow. It was introduced by R.A. Fisher in 1936 as an example for his famous linear discriminant analysis, which he developed in the same year.

The dataset covers three different iris species: Iris versicolor, Iris virginica and Iris setosa. The iris, with more than 300 subspecies all over the world and about 30 species on the North American continent, is spread widely depending on the climate. There are six subgenera (Iris, Limniris, Xiphium, Nepalensis, Scorpiris, Hermodactyloides), of which five are restricted to the Old World and one (Limniris) to the New World.

Their range extends from cold regions to the grassy slopes, meadow lands, stream banks and deserts of Europe, the Middle East and northern Africa, Asia, and across North America.



The dataset measures four variables: sepal (German "Kelchblatt") width and length, and petal (German "Blütenblatt") width and length. The surfaces of the outer petals form a landing stage for flying insects, which pollinate the flower and collect the nectar. There are 150 observations altogether, 50 for each species. In the descriptive part you will get to know a bit more about the differences between the species.

Iris virginica can be found along a straight line from Quebec down to Texas, a region with hot summers and mild winters in the south and mild summers and cold winters in the north. This region is also characterized by the Great Lakes and the great North American rivers Mississippi, Missouri, Ohio and Arkansas.

Iris versicolor can be found in the north-east of the United States, the Great Lakes region and the Canadian provinces of Quebec, Ontario, Newfoundland, Manitoba and Saskatchewan. From the overlap of their ranges we might expect some similarities between Iris versicolor and Iris virginica.

Iris setosa, in contrast, is principally spread in the north-western region of the Rocky Mountains, mainly in the Canadian provinces of British Columbia and the Yukon Territory and in Alaska (US). This region is characterised by very cold winters, short mild summers and a large amount of rainfall per year caused by rain clouds from the Pacific.


A sample of the data is given below.


Obs    Sepal.Length   Sepal.Width   Petal.Length   Petal.Width   Species
1           5.1            3.5           1.4            0.2       setosa
2           4.9            3.0           1.4            0.2       setosa
3           4.7            3.2           1.3            0.2       setosa
4           4.6            3.1           1.5            0.2       setosa
5           5.0            3.6           1.4            0.2       setosa
6           5.4            3.9           1.7            0.4       setosa
7           4.6            3.4           1.4            0.3       setosa
8           5.0            3.4           1.5            0.2       setosa
9           4.4            2.9           1.4            0.2       setosa
10          4.9            3.1           1.5            0.1       setosa
...         ...            ...           ...            ...       ...
51          7.0            3.2           4.7            1.4       versicolor
52          6.4            3.2           4.5            1.5       versicolor
53          6.9            3.1           4.9            1.5       versicolor
54          5.5            2.3           4.0            1.3       versicolor
55          6.5            2.8           4.6            1.5       versicolor
56          5.7            2.8           4.5            1.3       versicolor
57          6.3            3.3           4.7            1.6       versicolor
58          4.9            2.4           3.3            1.0       versicolor
59          6.6            2.9           4.6            1.3       versicolor
60          5.2            2.7           3.9            1.4       versicolor
...         ...            ...           ...            ...       ...
101         6.3            3.3           6.0            2.5       virginica
102         5.8            2.7           5.1            1.9       virginica
103         7.1            3.0           5.9            2.1       virginica
104         6.3            2.9           5.6            1.8       virginica
105         6.5            3.0           5.8            2.2       virginica
106         7.6            3.0           6.6            2.1       virginica
107         4.9            2.5           4.5            1.7       virginica
108         7.3            2.9           6.3            1.8       virginica
109         6.7            2.5           5.8            1.8       virginica
110         7.2            3.6           6.1            2.5       virginica
...         ...            ...           ...            ...       ...

Figure 7: Iris Data

Descriptives and some tests[edit]

Scatterplots[edit]

To look for relationships in the data I want to start with some scatterplots. A scatterplot provides a graphical display of the relationship between two variables. An upward-sloping scatterplot indicates that as the variable on the horizontal axis increases, the variable on the vertical axis also increases. We can discover some structure in the data and the kind of relationship between the different variables (linear, quadratic, etc.). When a scatterplot shows a relationship between two variables, there is not necessarily a cause-and-effect relationship: both variables could be related to a third variable (or several) that explains their variation, or there could be some other cause. Later on, I want to explain the more advanced concept of partial correlation, which allows us to come close to experimental conditions. A scatterplot matrix that shows all pairwise combinations at once can be produced as sketched below.
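As a minimal sketch (my own illustration, not the code behind Picture 1), base R's pairs() draws all pairwise scatterplots at once, coloured by species:

# Scatterplot matrix of the four measurements, coloured by species
data(iris)
pairs(iris[, 1:4],
      col = c("blue", "green", "red")[as.numeric(iris$Species)],   # one colour per species
      pch = 19,
      main = "Iris data: pairwise scatterplots")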





Picture 1: Scatterplots of Iris Variables


From the scatterplots we can get a first impression of how well the examined variables discriminate between the species. There are combinations which allow a very good discrimination, like Petal Width vs. Petal Length (plot 1) and Sepal Length vs. Petal Length (plot 5). The combinations Sepal Length vs. Sepal Width (plot 2) and Petal Width vs. Sepal Length (plot 6) show many overlaps between versicolor and virginica, which complicates differentiation. Altogether we can see that all four variables are useful for describing the dataset. For the later visualisation of the discriminant analysis I have to choose two variables. Another way would be to work with principal components.

Boxplots[edit]

Boxplots are graphical techniques which allow us to display the distribution of a variable. They help us to see location, skewness, spread, tail length and outlying points. Boxplots are graphical representations of the five-number summary, which is also given below.


library('lattice')
trellisSK(rpdf, width=5, height=5)
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[1:50, 1:4], main = "setosa")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "setosa", ylim = c(0, 8))



library('lattice')
trellisSK(rpdf, width=5, height=5)
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[51:100, 1:4], main = "versicolor")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "versicolor", ylim = c(0, 8))



library('lattice')
trellisSK(rpdf, width=5, height=5)
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[101:150, 1:4], main = "virginica")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "virginica", ylim = c(0, 8))

Picture 2: Boxplots


I scaled the y axis of all the plots on the right from 0 to 8 cm. From the boxplots we can see that the distributions of the four attributes are quite similar for versicolor and virginica, while setosa has a rather different distribution of its attributes. Without testing formally, outliers do not seem to be a problem. All in all, we can expect some difficulties in the discriminant analysis because of the similar distributions of versicolor and virginica.


Figure 8: Summary statistics

Kernel Density Estimation[edit]

Density estimation is a nonparametric tool which allows us to estimate a probability density function and so see how a random variable is distributed. The simplest method is the histogram; a more advanced method is kernel density estimation. For this we need a bandwidth h and a so-called kernel (weighting) function which assigns weight to observations x_i whose distance from x is not bigger than h. By playing with the bandwidth and the kernel weights we can determine the smoothness of the density estimate. A lot of research has been done on calculating the optimal bandwidth; if you are interested in the topic you can have a look here (Härdle, Müller, Sperlich 2004). I have used the Gaussian kernel and Silverman's rule of thumb, which is one way of determining the optimal bandwidth h_{opt}. A minimal sketch of this approach in R is given below.
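As a sketch (the exact bandwidths of the original figures are not reproduced here), R's density() uses a Gaussian kernel by default and bw.nrd0() implements Silverman's rule of thumb:

# Kernel density estimate for Petal.Length with a Gaussian kernel
# and Silverman's rule-of-thumb bandwidth
x <- iris$Petal.Length
h <- bw.nrd0(x)                                   # Silverman's rule of thumb
d <- density(x, bw = h, kernel = "gaussian")
plot(d, main = sprintf("Petal.Length, Gaussian kernel, h = %.3f", h))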


Picture 3: Density Estimation all Species



We can see that Petal Length and Petal Width have a bimodal distribution; the first mode should come from the much smaller setosa. Sepal Length and Sepal Width look quite normal, but we have to take into account that the Gaussian kernel tends to oversmooth. In the next pictures you see the results of the density estimation for each class separately.


Kernel Density Estimation setosa



Picture 4: Density Estimation Setosa




Kernel Density Estimation versicolor


Picture 5: Density Estimation Versicolor






Kernel Density Estimation virginica



Picture 6: Density Estimation Virginica



In the next section I perform the Shapiro-Wilk and Kolmogorov-Smirnov tests for normality. Normal distribution of the attributes within the classes is often an assumption of classifiers like linear discriminant analysis; more flexible classifiers like neural networks or support vector machines, however, can overcome this problem.

Test for Normality[edit]

Some methods require normally distributed data. A more rigorous way of evaluating whether a random variable follows a normal distribution is to perform a test.


The Shapiro-Wilk test tests the null hypothesis that a statistical sample came from a normal distribution.

The test statistic is

W = \frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i-\overline{x})^2}

where x_{(i)} denotes the i-th order statistic (the i-th smallest sample value) and the a_i are tabulated constants. For p-values > 0.05 the null hypothesis cannot be rejected.


Kolmogorov-Smirnov-Test

The Kolmogorov-Smirnov test tests the null hypothesis that a statistical sample came from a given distribution, here a normal distribution (with estimated mean and standard deviation). Since it is a nonparametric test it is very robust, but it is less powerful and less sensitive than the Shapiro-Wilk test. The main idea is to compare the empirical distribution function S(x_i) with the normal distribution function F_0.


For every i the absolute differences

 d_{oi} = |S(x_i)-F_0(x_i)|

and

 d_{ui} = |S(x_{i-1})-F_0(x_i)|


are computed. The largest of all these differences, d_{max}, is the test statistic. If d_{max} exceeds a critical value d_{\alpha}, the null hypothesis is rejected. The tables below give the p-values of the tests. The R code of the tests can be found in the appendix. To get the R code for the pictures, please click on them.


Test for normality with all species


Test (p-values)          Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Shapiro-Wilk test             0             0.01           0             0.1
Kolmogorov-Smirnov test       0             0.18           0.003         0.07

Table 1: Test for normality with all species


Nearly all tests rejected the null hypothesis of a normal distribution. Only Sepal Width passed both the Kolmogorov-Smirnov and the Shapiro-Wilk test.


Test for normality for setosa


Test (p-values)          Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Shapiro-Wilk test            0.05           0.46           0             0.27
Kolmogorov-Smirnov test      0.19           0.52           0             0.64

Table 2: Test for normality for species setosa


For the species setosa, both tests reject the null hypothesis of a normal distribution for Petal Width. Petal Length only just passed the Shapiro-Wilk test.


Test for normality for versicolor


Test (p-values)          Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Shapiro-Wilk test            0.15           0.46           0.02          0.33
Kolmogorov-Smirnov test      0.50           0.74           0.23          0.46

Table 3: Test for normality for species versicolor


For the species versicolor, nearly all tests could not reject the null hypothesis of a normal distribution. Only Petal Width did not pass the Shapiro-Wilk test.


Test for normality for virginica


Test (p-values)          Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Shapiro-Wilk test            0.11           0.25           0.09          0.39
Kolmogorov-Smirnov test      0.53           0.52           0.46          0.18

Table 4: Test for normality for species virginica


For the species virginica, none of the tests could reject the null hypothesis of a normal distribution.

Hypothesis Testing[edit]

In this section I briefly perform, pairwise between the species, the F-test for equal variances, the t-test for equal means and the nonparametric Wilcoxon rank-sum test, which is equivalent to the Mann-Whitney U test. The F-test and t-test should be known from MVA. The Wilcoxon rank-sum test is a distribution-free test for comparing the locations (medians) of two distributions; intuitively it is an alternative to the t-test in which the data are replaced by their ranks. The null hypothesis is that there are no differences between the distributions; the alternative hypothesis says that there is a difference. For p-values smaller than 0.05 we can reject the null hypothesis. The assumptions are independent samples, a continuous distribution and at least ordinal data. The most important point is that the samples come from more or less the same form of distribution; the samples from setosa, versicolor and virginica come more or less from normal distributions.

Test results (p-values) setosa vs. versicolor
setosa vs. versicolor   Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
F-test                       0              0             0.03          0.18
t-test                       0              0             0             0
Wilcoxon test                0              0             0             0

Table 5: Test results 1


Test results (p-values) setosa vs. virginica
setosa vs. virginica    Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
F-test                       0              0             0.01          0.19
t-test                       0              0             0             0
Wilcoxon test                0              0             0             0

Table 6: Test results 2


Test results (p-values) versicolor vs. virginica


versicolor vs. virginica   Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
F-test                          0              0             0.03          0.84
t-test                          0              0             0             0.002
Wilcoxon test                   0              0             0             0.004

Table 7: Test results 3

From the results we can conclude that the means and variances of the distributions of the iris classes differ. From the boxplots we could at least suspect some similarities between versicolor and virginica. Nevertheless, this is a good result for discriminant analysis. Only for Sepal Width do some of the classes have roughly equal variances.

Correlation Analysis[edit]

To look for causal relationships we can use the concept of partial correlation. The main idea is to ask which relationship still exists between two variables when we remove the influence of all other "disturbing" variables; this brings us close to experimental conditions. Partial correlation analysis is also able to discover spurious correlation. In the first table below the Pearson correlation coefficients are given; in the second table the partial correlations are given. An example is the correlation between height and body weight when we eliminate age. Partial correlation still requires all the usual assumptions of Pearson correlation, such as linearity of the relationship and homoscedasticity. If we want to examine the relationship between variables X and Y while controlling for Z, we regress X on Z and obtain a residual e_X, and likewise regress Y on Z to obtain a residual e_Y. These residuals are uncorrelated with Z, so the correlation between e_X and e_Y measures the relationship between X and Y that is independent of Z. A sketch of this construction in R is given below.
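The residual construction can be sketched directly in R; the helper below is my own illustration (not the code used for the tables) and computes the partial correlation of x and y given the control variables in z:

# Partial correlation of x and y, controlling for the variables in z:
# regress x on z and y on z, then correlate the two residual vectors
partial_cor <- function(x, y, z) {
  rx <- resid(lm(x ~ ., data = as.data.frame(z)))
  ry <- resid(lm(y ~ ., data = as.data.frame(z)))
  cor(rx, ry)
}

# Example: Sepal.Length vs. Sepal.Width given the two petal measurements
partial_cor(iris$Sepal.Length, iris$Sepal.Width,
            iris[, c("Petal.Length", "Petal.Width")])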


We can distinguish between three cases:


1. Partial correlation < Pearson correlation


The Pearson correlation is overestimated because of the influence of one or more other variables.


2. Partial correlation > Pearson correlation


The Pearson correlation is underestimated because of the influence of one or more other variables; the partial correlation describes the relationship better.


3. Partial correlation = 0


If the partial correlation approaches 0, the original correlation is spurious (the famous stork-and-baby correlation).


Correlation analysis for the whole dataset


                Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Petal.Length         1             0.87           0.96         -0.43
Sepal.Length                       1              0.82         -0.12
Petal.Width                                       1            -0.37
Sepal.Width                                                     1

Table 8: Pearson Correlation


                Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Petal.Length         1             0.71           0.87         -0.62
Sepal.Length                       1             -0.33          0.63
Petal.Width                                       1             0.35
Sepal.Width                                                     1

Table 9: Partial Correlation


We have not detected spurious correlation. Most of the Pearson correlations are overestimated. The partial correlations make more sense because of the now positive correlation between sepal length and sepal width; a positive correlation between petal length and petal width is also more likely in nature. The correlation between Sepal Length and Petal Width was much overestimated.

Correlation analysis for Iris setosa


                Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Petal.Length         1             0.27           0.33          0.18
Sepal.Length                       1              0.28          0.74
Petal.Width                                       1             0.23
Sepal.Width                                                     1

Table 10: Pearson Correlation


                Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Petal.Length         1             0.17           0.28         -0.04
Sepal.Length                       1              0.11          0.72
Petal.Width                                       1             0.05
Sepal.Width                                                     1

Table 11: Partial Correlation


We have detected some spurious correlations (values approaching zero). Most of the Pearson correlations are overestimated.


Correlation analysis for Iris versicolor


                Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Petal.Length         1             0.75           0.79          0.56
Sepal.Length                       1              0.55          0.74
Petal.Width                                       1             0.66
Sepal.Width                                                     1

Table 12: Pearson Correlation


                Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Petal.Length         1             0.63           0.65         -0.11
Sepal.Length                       1              0.22          0.27
Petal.Width                                       1             0.47
Sepal.Width                                                     1

Table 13: Partial Correlation


We have not detected spurious correlation. Most of the Pearson correlations are much overestimated, so the partial correlations make more sense. The correlation between Petal Length and Sepal Width was much overestimated.

Correlation analysis for Iris virginica


                Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Petal.Length         1             0.86           0.32          0.40
Sepal.Length                       1              0.28          0.46
Petal.Width                                       1            -0.37
Sepal.Width                                                     1

Table 14: Pearson Correlation


                Petal.Length   Sepal.Length   Petal.Width   Sepal.Width
Petal.Length         1             0.84           0.18         -0.08
Sepal.Length                       1             -0.12          0.26
Petal.Width                                       1             0.48
Sepal.Width                                                     1

Table 15: Partial Correlation


We have not detected spurious correlation. Most of the Pearson correlations seem overestimated. A positive correlation between petal width and sepal width is more likely in nature, and an only slight correlation between Petal Length and Sepal Width should now be more reasonable.

Cluster Analysis[edit]

Cluster analysis is an exploratory tool for solving classification problems. Given the Iris dataset, we want to see whether we can also recover the different species as homogeneous clusters. Observations which are similar according to some appropriate criterion are put into one cluster; the clusters should be as homogeneous as possible. Discriminant analysis, which is the subject of this paper, addresses the other classification issue, where the groups are known a priori and we want to classify new observations. If you want more information you can have a look here (Härdle, Simar 2003).


Cluster analysis can be divided into the following steps.


1.) Select a distance measure

e.g. squared Euclidean distance, Manhattan distance, Chebyshev distance


2.) Select a clustering procedure

e.g. hierarchical clustering like Ward clustering or centroid clustering, and linkage methods like single linkage, complete linkage and average linkage


3.) Decide on the number of clusters

e.g. from the dendrogram


For my analysis I have used Euclidean distance and Ward clustering. To visualise the results I randomly selected 10 observations from each species and performed the cluster analysis. I repeated this about 10 times and the results were very similar every time. Below we can see from the dendrograms that the clusters stay quite homogeneous, which is a very promising result for the later discriminant analysis. In the next chapter we will see whether discriminant analysis is also able to produce stable results. A minimal sketch of the clustering code is given below.
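A sketch of the procedure (sample size and seed are chosen for illustration; recent R versions call Ward's method "ward.D"):

# Draw 10 observations per species, then Ward clustering on Euclidean distances
set.seed(1)                                        # reproducible subsample
idx <- unlist(lapply(split(1:150, iris$Species), sample, size = 10))
sub <- iris[idx, ]
d   <- dist(sub[, 1:4], method = "euclidean")      # Euclidean distance matrix
hc  <- hclust(d, method = "ward.D")                # Ward clustering
plot(hc, labels = as.character(sub$Species), cex = 0.7,
     main = "Ward clustering of an Iris subsample")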



Picture 8: Cluster Analysis

Starsplot[edit]

The star plot (Chambers 1983) is a dimension-reduction technique for visualizing a high-dimensional multivariate data set. Each star represents a single observation. We can look at these plots to spot differences between observations by eye, or use them to identify clusters of iris flowers with similar features; we can also look for dominant observations or for outliers. From the star plot we can see that the first 38 observations are quite similar. Observations 39 to 89 also look very similar, and the remaining observations build the last cluster. The results from the star plot are not as good as those of the cluster analysis, but they help to get an impression of the data and confirm that this data set is very suitable for classification. A minimal call is sketched below. In the next section I will give a short overview of the methods I use before presenting the results of the Monte Carlo study.
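A minimal call (my own sketch, not the code behind Picture 9):

# Star plot of the four measurements; each star is one observation
stars(iris[, 1:4], main = "Star plot of the Iris data")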


Picture 9: Stars Plot

Classification Methods[edit]

In this section I give a short idea of the strengths and weaknesses of some classification methods. For my further analysis I also used some Weka classifiers which I do not describe here; if you are interested in the theory behind them you can check here: weka

Linear Discriminant Analysis[edit]

LDA was invented by Fisher in 1936 and was developed further by Beaver (1966) and Altman (1968). The idea of LDA is to classify a new observation into one of the known groups such as setosa, virginica or versicolor. The assumptions of LDA are normally distributed classes and equal class covariances. LDA only works well when we deal with continuous variables. For a two-class problem the maximum likelihood rule allocates x to \Pi_1 if


 (\vec x- \vec \mu_1)^T \Sigma^{-1} ( \vec x- \vec \mu_1) \leq (\vec x- \vec \mu_2)^T \Sigma^{-1} ( \vec x- \vec \mu_2) .


The Z-score model proposed by Altman (1968) is a linear discriminant function of several measures which are objectively weighted and summed up to an overall score, which then becomes the basis for the classification of new observations. The linear discriminant function should separate the observations as well as possible. A version of the Z-score is used by the German Schufa, which tries to classify private and business customers: you obtain a certain score from attributes like age, income, place of residence, etc. A minimal LDA fit in R is sketched after the score function below.

Z_i=a_1x_{i1}+a_2x_{i2}+...+a_dx_{id}=a^\top x_i
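A sketch of an LDA fit with MASS::lda, using an illustrative 2/3 training split (seed and split size are my own choices, not the study's code):

library(MASS)                                      # provides lda()
set.seed(42)
train <- sample(1:nrow(iris), size = 100)          # roughly 2/3 for training
fit   <- lda(Species ~ ., data = iris[train, ])
pred  <- predict(fit, iris[-train, ])$class
mean(pred == iris$Species[-train])                 # out-of-sample accuracy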

Quadratic Discriminant Analysis[edit]

Quadratic discriminant analysis is similar to LDA, but unlike LDA it does not assume that the covariance matrices of the classes are identical. In the F-tests we found that the class variances of most attributes differ significantly. QDA is therefore more flexible here, but we also lose some power for interpreting the results. A corresponding sketch is given below.
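Under the same illustrative split as the LDA sketch above, QDA is obtained by swapping lda() for qda():

fit_q  <- qda(Species ~ ., data = iris[train, ])   # qda() is also in MASS
pred_q <- predict(fit_q, iris[-train, ])$class
mean(pred_q == iris$Species[-train])               # out-of-sample accuracy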

Cart Model[edit]

The CART model was developed by Leo Breiman and co-authors in 1984. CART builds classification and regression trees for predicting continuous dependent variables (the regression case) and categorical dependent variables (the classification case). The main idea is that only binary decision trees are used to find an optimal separation. The choice of variables is done via maximisation of the informational content: the variables with the highest information gain (measured e.g. in entropy) are used early in the decision tree. The main technique for reducing the complexity of the model is to "prune" the decision tree by cutting the nodes with the smallest information. A minimal sketch with a standard CART implementation is given after the figure. For more information you can look over here: CART.


Picture 10: Picture of a Cart Model
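A sketch with rpart, a standard CART implementation shipped with R (the study's own plots use the tree package instead; the pruning threshold cp = 0.1 is purely illustrative):

library(rpart)                                     # CART implementation shipped with R
fit_t <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit_t)                                     # complexity table used to choose a pruning level
pruned <- prune(fit_t, cp = 0.1)                   # cut back the tree (cp chosen for illustration)
plot(pruned); text(pruned)                         # draw the pruned tree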

Multinomial Logit Model[edit]

The multinomial logit model (MLM) is a statistical technique for multi-class classification based on multinomial logit analysis. The MLM allows for linear separation. The main idea is to analyse the dependence of a nominal dependent variable on independent continuous or dummy-coded variables. The coefficients of the logistic regression are estimated via maximum likelihood on the logarithmised odds ratios.

The model is given here.

\Pr(y_{i}=j)=\frac{\exp(X_{i}\beta_{j})}{\sum_{k=1}^{J}\exp(X_{i}\beta_{k})}

where y_{i} is the observed outcome, e.g. our classes setosa, versicolor and virginica.

The score function is the same as for LDA: Z_i=a_1x_{i1}+a_2x_{i2}+...+a_dx_{id}=a^\top x_i. A minimal fit with multinom() from the nnet package is sketched below.
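A sketch of such a fit (this is how I would fit the model; it is not necessarily the code behind the simulation):

library(nnet)                                      # provides multinom()
fit_m <- multinom(Species ~ ., data = iris, trace = FALSE)
coef(fit_m)                                        # one coefficient vector per non-reference class
head(predict(fit_m, iris))                         # predicted classes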

Neural Networks[edit]

Neural networks in mathematics are inspired by biological neurons. They are very sophisticated modelling techniques able to model extremely complex functions; in particular, neural networks are nonlinear and their structure is not fixed. Neural networks learn from experience and are able to predict very complex processes. Learning is done via new connections, weight changes, changes of critical values, or adding or removing neurons. A minimal single-hidden-layer fit is sketched after the figure. For more information in detail have a look over here: Neural Networks 1, Neural Networks 2.


Picture 11: Picture of a Neural Network Model
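A sketch with nnet (a single-hidden-layer network); the number of hidden units and the weight decay are illustrative, not tuned:

library(nnet)
set.seed(7)
fit_n <- nnet(Species ~ ., data = iris,
              size = 5,                            # 5 hidden units (illustrative)
              decay = 5e-4,                        # weight decay to dampen overfitting
              maxit = 200, trace = FALSE)
table(predict(fit_n, iris, type = "class"), iris$Species)   # in-sample confusion matrix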

Support Vector Machines[edit]

SVM is a classification method that performs classification by constructing hyperplanes in a multidimensional space that separate cases of different classes. The main idea is that objects are divided into classes in such a way that the border between the classes is chosen so that the distance between it and the objects is maximized. The vector w points perpendicular to the separating hyperplane. The distance between the two margin hyperplanes is 2/|w|, so we want to minimize |w|.

Picture 12: Picture of a SVM Model


We minimize (1/2)||\mathbf{w}||^2 subject to c_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1,\quad 1 \le i \le n. For nonlinear separation, kernels are used to find maximum-margin hyperplanes, which makes SVM very flexible. A minimal radial-kernel fit is sketched below. For more information have a look over here: SVM.
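A minimal fit with svm() from the e1071 package, using the radial kernel as in the study's own plotting code (cost and gamma are left at their defaults):

library(e1071)                                     # provides svm()
fit_s <- svm(Species ~ ., data = iris, kernel = "radial")
table(predict(fit_s, iris), iris$Species)          # in-sample confusion matrix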

K-Nearest-Neighbor Estimation[edit]

The k-nearest-neighbour algorithm (k-NN) is a method for classifying objects based on a distance function for pairs of observations, such as the Euclidean distance. The training examples are mapped into a multidimensional space, which is partitioned into regions by the classes of the training examples. A point in the space is assigned to class j if j is the most frequent class label among its k nearest training examples, so the decision is based on a small neighbourhood of similar objects. Even if the target class is multi-modal, this can still lead to good accuracy. When using only a small subset of the variables (poor similarity structure), k-NN produces more classification errors than other techniques; since the algorithm is very fast, it is advisable to use all variables. k-NN is also a very fast technique for estimating missing values.


Picture 13: Picture of a k-NN Model


The figure shows a classification problem for a new observation: based on the distance to its 9 nearest neighbours it is classified red or blue. A minimal k-NN classification with k = 9 is sketched below.
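A sketch with knn() from the class package, using k = 9 as in the figure (split and seed are my own choices):

library(class)                                     # provides knn()
set.seed(3)
train  <- sample(1:nrow(iris), size = 100)
pred_k <- knn(train = iris[train, 1:4],
              test  = iris[-train, 1:4],
              cl    = iris$Species[train],
              k     = 9)                           # 9 nearest neighbours, as in the figure
mean(pred_k == iris$Species[-train])               # out-of-sample accuracy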


Simulation Study[edit]

For the simulation I randomly chose 2/3 of the Iris data set as training data to estimate each model, and then used the complementary third to validate it. I used all available variables and ran the classification methods with their standard settings, without optimising their parameters. I repeated this 500 times. In the table and the figure below you can see the accuracy of the predictions and the standard deviation of the accuracy; a good estimator should show high accuracy with minimum variance. In the second part I visualise the results for some variable combinations, so we can see the degree of complexity of each classifier and which regions are more likely to be a certain species. Since the Weka classifiers did not perform well, and due to the limited space in this seminar work, I only put some Weka results in the appendix. The resampling scheme is sketched below.
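The resampling loop can be sketched as follows, shown here for LDA only (the study runs the same loop for every classifier at its default settings; the seed is my own choice):

library(MASS)
set.seed(123)
runs <- 500
acc  <- numeric(runs)
for (r in 1:runs) {
  train  <- sample(1:nrow(iris), size = round(2/3 * nrow(iris)))   # 2/3 training split
  fit    <- lda(Species ~ ., data = iris[train, ])
  pred   <- predict(fit, iris[-train, ])$class
  acc[r] <- mean(pred == iris$Species[-train])                     # out-of-sample accuracy
}
c(mean = mean(acc), sd = sd(acc))        # accuracy and its standard deviation over the runs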

Results[edit]

Table 16: Simulation Results


From the table we can see that the two simplest models perform best. SVM also performs well. CART performs poorly, since the classification tree has to be pruned for a good out-of-sample fit, and the question arises which tree size is optimal. The Weka classifiers do not perform so well on this benchmark dataset.

Picture 14: Simulation Results


From this picture we can see, for example, that neural nets and CART are quite unstable in the out-of-sample fit. This may come from in-sample overfitting.

Linear Discriminant Analysis[edit]

(On the original wiki page an interactive form allowed the reader to choose the two variables to plot: 1 = Sepal Length, 2 = Sepal Width, 3 = Petal Length, 4 = Petal Width; the defaults are var1 = 2 (Sepal Width) and var2 = 1 (Sepal Length).)

We can see that linear discriminant analysis produces very clear results. The results are very easy to interpret and can easily be presented to people not involved in the topic, e.g. customers.


Quadratic Discriminant Analysis[edit]


We can see that quadratic discriminant analysis does not produce such clear results. The results lack interpretability and cannot easily be presented to people not involved in the topic. Some regions far away from the relevant observations are classified into one of the groups.


Neural Net[edit]

# weights:  51
initial  value 143.591233
iter  10 value 64.408153
iter  20 value 43.510291
iter  30 value 37.291547
iter  40 value 34.110092
iter  50 value 32.196161
iter  60 value 30.947388
iter  70 value 30.914738
iter  80 value 30.913355
iter  90 value 30.913273
final  value 30.913271
converged
# weights:  51
initial  value 111.599926
iter  10 value 51.812763
iter  20 value 44.837518
iter  30 value 40.055208
iter  40 value 38.671731
iter  50 value 37.084159
iter  60 value 35.257029
iter  70 value 35.022653
iter  80 value 35.014108
iter  90 value 34.985853
iter 100 value 34.921521
final  value 34.921521
stopped after 100 iterations
# weights:  51
initial  value 113.613285
iter  10 value 50.381753
iter  20 value 34.261599
iter  30 value 32.831956
iter  40 value 32.013909
iter  50 value 30.645378
iter  60 value 29.774723
iter  70 value 29.707265
iter  80 value 29.215775
iter  90 value 27.415515
iter 100 value 26.944971
final  value 26.944971
stopped after 100 iterations
# weights:  51
initial  value 130.127089
iter  10 value 48.989510
iter  20 value 40.464201
iter  30 value 37.947832
iter  40 value 37.486323
iter  50 value 36.306217
iter  60 value 32.852259
iter  70 value 31.572990
iter  80 value 31.178452
iter  90 value 30.379354
iter 100 value 29.703551
final  value 29.703551
stopped after 100 iterations

We can see that the neural nets produce quite good results. I would have expected more complicated structures, but we already know that the outcome is not very stable.

Support Vector machines[edit]


library('lattice')
library('e1071')
trellisSK(rpdf, width=7, height=7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1
par(mfrow = c(2, 2))
xseq   <- seq(0, 10, length = 100)
xvech  <- rep(xseq, 100)
yvech  <- rep(xseq, each = 100)
newdes <- cbind(0.8 * xvech, 0.8 * yvech)
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]

val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
learn <- iris[-val, c(var1, var2, 5)]
valid <- iris[val, c(var1, var2, 5)]
colnames(learn)[1] <- c("var1")
colnames(learn)[2] <- c("var2")
lda.des <- svm(Species ~ var1 + var2, data = learn, kernel = "radial")
newdes  <- as.data.frame(newdes)
names(newdes) <- c("var1", "var2")
newcol <- basiscolors[as.factor(predict(lda.des, newdata = newdes, type = "class"))]
plot(newdes[,1], newdes[,2], col = newcol, pch = '°',
     xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
points(learn$var1[learn$Species==c("setosa")],     learn$var2[learn$Species==c("setosa")],     col="white",  pch='S')
points(learn$var1[learn$Species==c("versicolor")], learn$var2[learn$Species==c("versicolor")], col="yellow", pch='V')
points(learn$var1[learn$Species==c("virginica")],  learn$var2[learn$Species==c("virginica")],  col="pink",   pch='V')
title("svm")

# The block from "val <- sample(...)" to title("svm") is repeated three more
# times in the original code to fill the remaining panels of the 2 x 2 plot.


If the e1071 library (which provides svm) is not installed, please look in the appendix for the picture. We can see that SVM produces very complicated decision functions, but we also see that the results are quite stable, so they can be used for interpretation.

Multinomial Logit[edit]

# weights:  12 (6 variable)
initial  value 109.861229
iter  10 value 41.872538
iter  20 value 40.390267
iter  30 value 36.842683
iter  40 value 36.663815
iter  50 value 36.563396
iter  60 value 36.519359
iter  70 value 36.473781
iter  80 value 36.408262
iter  90 value 36.389289
iter 100 value 36.378849
final  value 36.378849
stopped after 100 iterations
# weights:  12 (6 variable)
initial  value 109.861229
iter  10 value 37.779543
iter  20 value 35.551232
iter  30 value 35.109549
iter  40 value 35.101683
iter  50 value 35.096107
iter  60 value 35.071997
iter  70 value 35.070601
iter  80 value 35.066898
iter  90 value 35.064249
iter 100 value 35.063871
final  value 35.063871
stopped after 100 iterations
# weights:  12 (6 variable)
initial  value 109.861229
iter  10 value 46.846306
iter  20 value 45.650663
iter  30 value 42.253107
iter  40 value 42.114046
iter  50 value 41.973118
iter  60 value 41.947883
iter  70 value 41.881955
final  value 41.785206
converged
# weights:  12 (6 variable)
initial  value 109.861229
iter  10 value 42.568737
iter  20 value 39.726483
iter  30 value 38.349406
iter  40 value 37.781959
iter  50 value 37.391824
iter  60 value 37.228920
iter  70 value 37.056443
iter  80 value 36.972660
iter  90 value 36.956299
iter 100 value 36.931599
final  value 36.931599
stopped after 100 iterations

We can see that the multinomial logit results look very similar to those of the linear discriminant analysis. They can be used well for interpretation.

CART[edit]


library('lattice')
library('tree')
trellisSK(rpdf, width=7, height=7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1
par(mfrow = c(2, 2))
xseq   <- seq(0, 10, length = 100)
xvech  <- rep(xseq, 100)
yvech  <- rep(xseq, each = 100)
newdes <- cbind(0.8 * xvech, 0.8 * yvech)
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]

val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
learn <- iris[-val, c(var1, var2, 5)]
valid <- iris[val, c(var1, var2, 5)]
colnames(learn)[1] <- c("var1")
colnames(learn)[2] <- c("var2")
lda.des <- tree(Species ~ var1 + var2, data = learn, split = "gini")
newdes  <- as.data.frame(newdes)
names(newdes) <- c("var1", "var2")
newcol <- basiscolors[predict(lda.des, newdes, type = "class")]
plot(newdes[,1], newdes[,2], col = newcol, pch = '°',
     xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
points(learn$var1[learn$Species==c("setosa")],     learn$var2[learn$Species==c("setosa")],     col="white",  pch='S')
points(learn$var1[learn$Species==c("versicolor")], learn$var2[learn$Species==c("versicolor")], col="yellow", pch='V')
points(learn$var1[learn$Species==c("virginica")],  learn$var2[learn$Species==c("virginica")],  col="pink",   pch='V')
title("CART")

# The block from "val <- sample(...)" to title("CART") is repeated three more
# times in the original code to fill the remaining panels of the 2 x 2 plot.

We can see that the CART model produces very interpretable results. Unfortunately the out-of-sample fit is not good if we do not prune the classification tree. If the tree library is not installed, please look in the appendix for the picture.

Stability[edit]

In this section I want to assess the stability of the predicted regions. I coloured the regions according to the frequencies of the predicted classes; due to the computational time I repeated this about 150 times. Regions whose colours are clear are predicted the same way every time; the worst result would be a mixture of colours in all regions. A sketch of this procedure is given below. Click on the pictures to get the R source.
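A sketch of the idea for LDA on two variables (grid, seed and sample size are illustrative, not the study's code): classify a fixed grid repeatedly on resampled training sets and record how often each grid point receives its most frequent class.

library(MASS)
set.seed(9)
grid <- expand.grid(Sepal.Width  = seq(2, 4.5, length = 60),
                    Sepal.Length = seq(4, 8,   length = 60))
runs  <- 150
votes <- matrix(0, nrow = nrow(grid), ncol = 3,
                dimnames = list(NULL, levels(iris$Species)))
for (r in 1:runs) {
  train <- sample(1:nrow(iris), size = 100)
  fit   <- lda(Species ~ Sepal.Width + Sepal.Length, data = iris[train, ])
  cls   <- as.integer(predict(fit, grid)$class)     # 1, 2, 3 = setosa, versicolor, virginica
  votes[cbind(1:nrow(grid), cls)] <- votes[cbind(1:nrow(grid), cls)] + 1
}
stability <- apply(votes, 1, max) / runs            # share of runs agreeing with the modal class
summary(stability)                                  # 1 means the grid point is always predicted the same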

Stability LDA[edit]

Figure: Stability of LDA

This is a very good result. Only the borders are a bit unstable, but in general linear discriminant analysis is excellent in accuracy and stability.


Stability QDA[edit]

Figure: Stability of QDA
The QDA is stable for a few combinations of variables, but for some combinations it is very unstable due to its flexibility. I would not present this result to a customer.

Stability Nnet[edit]

Figure: Stability of the neural net

This is the worst result. The regions are quite unstable, and not only at the borders. An interesting observation is that, on average, the predicted regions now look quite linear.

Stability CART[edit]

Figure: Stability of CART

This is also not a good result. The regions change often and there is a clear mixture of colours.

Stability Multinomial Model[edit]

Figure: Stability of the multinomial logit

The result of the multinomial logit looks similar to the LDA. There is a bit more uncertainty at the borders, but it is a very good result.

Stability SVM[edit]

Figure: Stability of SVM

Surprisingly, SVM is also very stable. It may take a bit of imagination to present it to a customer, but the results are very good.

Conclusions[edit]

From the simulation we can conclude that the simplest models are superior to the complicated ones. SVM performs very well with its standard settings. Results after 500 runs can be seen as stable. In the RWeka library you can implement your own rules and classifiers. You can also use e.g. cross-validation to optimise the parameters of the neural net and the support vector machines, but do you really think you can do better that way? If we have no further information about the data and do not know anything about its structure, we should not use complicated models with many degrees of freedom like nnet or the Weka classifiers.

References[edit]

  • Härdle, W., Klinke, S. and Müller, M. (2000). XploRe – Learning Guide. Springer-Verlag Berlin Heidelberg.
  • Härdle, W. and Simar, L. (2003). Applied Multivariate Statistical Analysis. Springer-Verlag Berlin Heidelberg.
  • W. N. Venables and B. D. Ripley (2004). Modern Applied Statistics with S. Fourth Edition. Springer-Verlag Berlin Heidelberg.

Appendix[edit]

In the appendix you find the R code which I have not used directly in this thesis or have not integrated into the program, e.g. the code for the tables of the test results.

#pie chart
pie.sales <- c(1/3,1/3,1/3) 
names(pie.sales) =c("setosa" ,  "versicolor", "virginica")
pie(pie.sales,col=c("blue","green","red"),main="Piechart Classes Iris Dataset")


#Tests for Normality
normaltests <-function(x) {

   myMittelwert <- mean(x)
   myStandardabweichung <- sd(x)
   t=ks.test(x,pnorm,sd=myStandardabweichung,mean=myMittelwert) 
   print(t)
   t=shapiro.test(x)
   print(t)
}

s=c("setosa")

normaltests(iris$Petal.Width[iris$Species==c(s)])
normaltests(iris$Petal.Length[iris$Species==c(s)])
normaltests(iris$Sepal.Width[iris$Species==c(s)])
normaltests(iris$Sepal.Length[iris$Species==c(s)])

v1=c("versicolor")
normaltests(iris$Petal.Width[iris$Species==c(v1)])
normaltests(iris$Petal.Length[iris$Species==c(v1)])
normaltests(iris$Sepal.Width[iris$Species==c(v1)])
normaltests(iris$Sepal.Length[iris$Species==c(v1)])

v2=c("virginica")
normaltests(iris$Petal.Width[iris$Species==c(v2)])
normaltests(iris$Petal.Length[iris$Species==c(v2)])
normaltests(iris$Sepal.Width[iris$Species==c(v2)])
normaltests(iris$Sepal.Length[iris$Species==c(v2)])

#Hypothesis Testing

standardtests<-function(n,m)
{
	n=n
	m=m
        for (s in 1:4)
        {
	z<-as.vector(iris[,s][iris$Species==c(n)])
        z1<-as.vector(iris[,s][iris$Species==c(m)])
	ftest <- var.test(z,z1)
	ttest <- t.test(z,z1)
	wtest <- wilcox.test(z,z1)
	#write.csv2(f,"C:/f-test.csv",append=TRUE)
	print(data.frame(Statistic=c(ftest$statistic,ttest$statistic,wtest$statistic),
        P=c(ftest$p.value,ttest$p.value,wtest$p.value),row.names=
	c("Equal Variances", "Equal Means","Nonparametric"))) 
      }
}

#setosa vs. versicolor

n=c("setosa")
m=c("versicolor")

standardtests(n,m)

#setosa vs. virginica

n=c("setosa")
m=c("virginica")

standardtests(n,m)

#versicolor vs. virginica

n=c("versicolor")
m=c("virginica")

standardtests(n,m)

Appendix figures: additional classification and stability plots.

Comments[edit]

  • The report should have been decomposed in several parts such that it does not take several minutes to load the page!
  • Why not a scatterplot matrix?
  • Why is the text after Picture 3 repeated by Picture 4?
  • Shapiro-Wilk: what is x_{(i)}?
  • What do I see in the tables? p-values?
  • If I do a test for all data and later restrict on some subgroups then I will increase the acceptance of the H_0. This means most probably I will accept the null hypothesis for the subgroups.
  • Not all models for classification are explained
  • What is the importance of the third, fourth, ... digit after the comma in the standard deviation?