# Performance of Classification Methods - A Monte Carlo Study

## Contents

- 1 Abstract
- 2 Some Bibliography
- 3 Dataset
- 4 Descriptives and some tests
- 5 Classification Methods
- 6 Simulation Study
- 7 References
- 8 Appendix
- 9 Comments

## Abstract

In this thesis I test some classification methods we got to know in the lecture Data Mining and Statistical Learning held by Dr. Sigbert Klinke and Dipl.-Kfm. Uwe Ziegenhagen, M.Sc. I also test some further Weka classification methods on the famous Iris dataset to see how they perform. We have learned a lot about these methods, but in the real world we often have to find out which results are stable and, from a practical point of view, interpretable. I have performed a Monte Carlo study with five hundred runs and visualized some results for the interested reader. All analysis was done in R 2.4.1.

## Some Bibliography

Sir Ronald Aylmer Fisher (born 17 February 1890 in London, England; died 29 July 1962 in Adelaide, Australia) was a famous British statistician, evolutionary biologist, and geneticist. He is the father of many statistical methods we are using today. Fisher had very poor eyesight, but he was a very good student, winning a mathematical competition at the age of 16. He was tutored without pen and paper and was therefore able to visualize geometric problems without using algebra; he could produce mathematical results without stating the intermediate steps. In the early 1920s he pioneered the principles of the design of experiments and developed the well-known technique of "analysis of variance". He began a systematic approach to the analysis of real-world data to support the development of modern statistical methods. In addition to analysis of variance, Fisher invented the technique of maximum likelihood and the F-distribution, and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminant analysis and Fisher information. He also began research in the field of non-parametrics, even though he didn't believe it was necessary to move away from parametric statistics. He died in Adelaide, Australia in 1962. (source mainly from Wikipedia)

## Dataset

The Iris flower dataset is a popular multivariate dataset available in nearly every statistical software package. Iris (German "Schwertlilie") is the Greek word for rainbow. The dataset was introduced by R. A. Fisher in 1936 as an example for his famous linear discriminant analysis, which he developed in the same year.

The dataset covers three different iris species: Iris versicolor, Iris virginica and Iris setosa. The iris flower, with more than 300 subspecies all over the world and about 30 species on the North American continent, is spread widely depending on the climate. There are six subgenera (Iris, Limniris, Xiphium, Nepalensis, Scorpiris, Hermodactyloides), of which five are restricted to the Old World and one (Limniris) to the New World.

They range from cold regions to the grassy slopes, meadow lands, stream banks and deserts of Europe, the Middle East and northern Africa, Asia, and across North America.

The dataset measures four variables: sepal (German "Kelchblatt") width and length, and petal (German "Blütenblatt") width and length. The surfaces of the outer petals form a landing stage for flying insects, which pollinate the flower and collect the nectar. There are 150 observations altogether, 50 per species. In the descriptive part you will get to know a bit more about the differences between the species.

Iris virginica can be found on a straight line from Quebec down to Texas, a region with hot summers and mild winters in the south and mild summers and cold winters in the north. This region is also characterized by the Great Lakes and the great North American rivers Mississippi, Missouri, Ohio and Arkansas.

Iris versicolor can be found in the northeast of the United States, the Great Lakes region and the Canadian provinces of Quebec, Ontario, Newfoundland, Manitoba and Saskatchewan. From the overlap of the locations we can expect some similarities between Iris versicolor and Iris virginica.

Iris setosa, in contrast, is principally spread in the northwestern region of the Rocky Mountains, mainly in the Canadian provinces of British Columbia and Yukon Territory and in Alaska (US). This region is characterised by very cold winters, short mild summers and a large amount of rainfall per year, caused by rain clouds from the Pacific.

Obs | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species
---|---|---|---|---|---
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa
2 | 4.9 | 3 | 1.4 | 0.2 | setosa
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa
5 | 5 | 3.6 | 1.4 | 0.2 | setosa
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa
8 | 5 | 3.4 | 1.5 | 0.2 | setosa
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa
... | ... | ... | ... | ... | ...
51 | 7 | 3.2 | 4.7 | 1.4 | versicolor
52 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor
53 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor
54 | 5.5 | 2.3 | 4 | 1.3 | versicolor
55 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor
56 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor
57 | 6.3 | 3.3 | 4.7 | 1.6 | versicolor
58 | 4.9 | 2.4 | 3.3 | 1 | versicolor
59 | 6.6 | 2.9 | 4.6 | 1.3 | versicolor
60 | 5.2 | 2.7 | 3.9 | 1.4 | versicolor
... | ... | ... | ... | ... | ...
101 | 6.3 | 3.3 | 6 | 2.5 | virginica
102 | 5.8 | 2.7 | 5.1 | 1.9 | virginica
103 | 7.1 | 3 | 5.9 | 2.1 | virginica
104 | 6.3 | 2.9 | 5.6 | 1.8 | virginica
105 | 6.5 | 3 | 5.8 | 2.2 | virginica
106 | 7.6 | 3 | 6.6 | 2.1 | virginica
107 | 4.9 | 2.5 | 4.5 | 1.7 | virginica
108 | 7.3 | 2.9 | 6.3 | 1.8 | virginica
109 | 6.7 | 2.5 | 5.8 | 1.8 | virginica
110 | 7.2 | 3.6 | 6.1 | 2.5 | virginica
... | ... | ... | ... | ... | ...

Figure 7: Iris Data

## Descriptives and some tests

### Scatterplots

To look for relationships in the data I want to start with some scatterplots. A scatterplot provides a graphical display of the relationship between two variables. An upward-sloping scatterplot indicates that as the variable on the horizontal axis increases, the variable on the vertical axis increases as well. We can discover structure in the data and the form of the relationship between the variables (linear, quadratic, etc.). When a scatterplot shows a relationship between two variables, there is NOT necessarily a cause-and-effect relationship. Both variables could be related to one or more third variables that explain their variation, or there could be some other cause. Later on, I want to explain the concept of partial correlation, which will allow us to have nearly experimental conditions.

Picture 1: Scatterplots of Iris Variables

From the scatterplots we can get a sense of the ability of the examined variables to discriminate the different species. There are combinations which allow a very good discrimination, like Petal Width vs. Petal Length (plot 1) and Sepal Length vs. Petal Length (plot 5). The combinations Sepal Length vs. Sepal Width (plot 2) and Petal Width vs. Sepal Length (plot 6) have many overlaps between the species versicolor and virginica, which complicates differentiation. Altogether we can see that all four variables are useful for describing the dataset. For the later purpose of discriminant analysis I have to choose two variables to visualise results. Another way would be to work with principal components.
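A scatterplot matrix like the one above can be reproduced with a few lines of base R (a minimal sketch; the colours and panel layout are my own illustrative choices and will differ from the rendered picture):

```r
# Scatterplot matrix of the four iris measurements, coloured by species
data(iris)
pairs(iris[, 1:4],
      col = c("blue", "green", "red")[as.numeric(iris$Species)],
      pch = 19, main = "Iris scatterplot matrix")
```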

### Boxplots

Boxplots are graphical techniques which allow us to display the distribution of a variable. They help us to see location, skewness, spread, tail length and outlying points. Boxplots are graphical representations of the five-number summary, which is also given below.


```r
library('lattice')
trellisSK(rpdf, width = 5, height = 5)  # wiki-specific graphics device
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[1:50, 1:4], main = "setosa")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "setosa", ylim = c(0, 8))
```


```r
library('lattice')
trellisSK(rpdf, width = 5, height = 5)  # wiki-specific graphics device
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[51:100, 1:4], main = "versicolor")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "versicolor", ylim = c(0, 8))
```


```r
library('lattice')
trellisSK(rpdf, width = 5, height = 5)  # wiki-specific graphics device
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[101:150, 1:4], main = "virginica")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "virginica", ylim = c(0, 8))
```

Picture 2: Boxplots

I scaled the y-axis for all the plots on the right from 0 to 8 cm. From the boxplots we can see that the distributions of the four attributes are quite similar for versicolor and virginica. The species setosa has a quite different distribution in its attributes. Without testing, outliers do not seem to be a problem. All in all, we can expect some difficulties in discriminant analysis because of the similar distributions of versicolor and virginica.

Figure 8: Summary statistics

### Kernel Density Estimation

Density estimation is a nonparametric tool which allows us to estimate a probability density function to see how a random variable is distributed. The simplest method is the histogram. A more advanced method is kernel density estimation. For it we need a bandwidth h and a so-called kernel (weighting) function which assigns weight to observations whose distance from x is not bigger than h. By playing with the bandwidth and the kernel weights we can determine the smoothness of the density. A lot of research has been done on calculating the optimal bandwidth. If you are interested in the topic you can have a look here (Härdle, Müller, Sperlich 2004). I have used the Gaussian kernel and Silverman's rule of thumb, which is one way of determining the optimal bandwidth.
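A kernel density estimate with the Gaussian kernel and Silverman's rule-of-thumb bandwidth can be sketched in base R as follows (the choice of Sepal.Length as the example variable is my own):

```r
# Gaussian kernel density estimate; bw.nrd0 implements Silverman's rule of thumb
data(iris)
x <- iris$Sepal.Length
h <- bw.nrd0(x)                         # rule-of-thumb bandwidth
d <- density(x, bw = h, kernel = "gaussian")
plot(d, main = "Density of Sepal.Length (Gaussian kernel)")
```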

Picture 3: Density Estimation all Species

We can see that Petal length and Petal width have somewhat bimodal distributions. The first mode should come from the much smaller setosa. Sepal length and Sepal width look quite normal, but we have to take into account that the Gaussian kernel tends to oversmooth. In the next pictures you see the results of density estimation for each class.

**Kernel Density Estimation setosa**

Picture 4: Density Estimation Setosa


**Kernel Density Estimation versicolor**

Picture 5: Density Estimation Versicolor


**Kernel Density Estimation virginica**

Picture 6: Density Estimation Virginica

In the next section I have performed the Shapiro-Wilk and Kolmogorov-Smirnov tests for normality. Normal distribution of the attributes inside the classes is often an assumption for classifiers like linear discriminant analysis. However, more flexible classifiers like neural networks or support vector machines can overcome this problem.

### Test for Normality

Some methods require normally distributed data. A more sophisticated way of evaluating whether a random variable is normally distributed is to perform a test.

The **Shapiro-Wilk test** tests the null hypothesis that a statistical sample came from a normal distribution.

The test statistic is

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$

where $x_{(i)}$ denotes the $i$-th order statistic and the coefficients $a_i$ are derived from the expected order statistics of a standard normal sample. For p-values > 0.05 the null hypothesis cannot be rejected.

**Kolmogorov-Smirnov-Test**

The Kolmogorov-Smirnov test tests the null hypothesis that a statistical sample came from a normal distribution. Since it is a nonparametric test it is very robust, but not very exact; it is less sensitive than the Shapiro-Wilk test. The main idea is to compare the empirical distribution function $S(x_i)$ with the hypothesized (standard normal) distribution function $F_0(x)$.

For every $i$ the absolute differences

$$d_i = |S(x_i) - F_0(x_i)| \quad \text{and} \quad d_i' = |S(x_{i-1}) - F_0(x_i)|$$

are computed. The biggest difference $d_{max}$ is taken over all of them. If $d_{max}$ exceeds a critical value $d_\alpha$, the null hypothesis is rejected. The table below gives the p-values of the tests. You can find the R code for the tests in the appendix. To get the R code for the pictures, please click on them.
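Both tests are available in base R; a minimal sketch for one variable follows. Note that standardising with the sample mean and standard deviation before `ks.test` makes that p-value only approximate (strictly, the Lilliefors correction would apply), and the choice of Sepal.Width as the example is my own:

```r
# Shapiro-Wilk and Kolmogorov-Smirnov tests for normality of Sepal.Width
data(iris)
x <- iris$Sepal.Width
p.sw <- shapiro.test(x)$p.value                        # Shapiro-Wilk p-value
p.ks <- ks.test(as.numeric(scale(x)), "pnorm")$p.value # KS against N(0,1) after standardising
round(c(shapiro = p.sw, kolmogorov = p.ks), 2)
```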

**Test for normality with all species**

Test Results | Petal.length | Sepal.length | Petal.width | Sepal.width
---|---|---|---|---
Shapiro Test | 0 | 0.01 | 0 | 0.1
Kolmogorov Test | 0 | 0.18 | 0.003 | 0.07

Table 1: Test for normality with all species

**Test for normality for setosa**

Test Results | Petal.length | Sepal.length | Petal.width | Sepal.width
---|---|---|---|---
Shapiro Test | 0.05 | 0.46 | 0 | 0.27
Kolmogorov Test | 0.19 | 0.52 | 0 | 0.64

Table 2: Test for normality for species setosa

**Test for normality for versicolor**

Test Results | Petal.length | Sepal.length | Petal.width | Sepal.width
---|---|---|---|---
Shapiro Test | 0.15 | 0.46 | 0.02 | 0.33
Kolmogorov Test | 0.50 | 0.74 | 0.23 | 0.46

Table 3: Test for normality for species versicolor

**Test for normality for virginica**

Test Results | Petal.length | Sepal.length | Petal.width | Sepal.width
---|---|---|---|---
Shapiro Test | 0.11 | 0.25 | 0.09 | 0.39
Kolmogorov Test | 0.53 | 0.52 | 0.46 | 0.18

Table 4: Test for normality for species virginica

### Hypothesis Testing

In this section I briefly perform, pairwise, the F-test for equal variances, the t-test for equal means, and the nonparametric Wilcoxon rank sum test, which is equivalent to the Mann-Whitney U test. The F-test and t-test should be known from MVA. The Wilcoxon rank sum test is a nonparametric test for comparing the locations of two distributions. Intuitively, it is an alternative to the t-test where the data is replaced by its ranks. The null hypothesis is that there are no differences between the distributions; the alternative hypothesis says that there is a difference. For p-values smaller than 0.05 we can reject the null hypothesis. The assumptions are independent samples, a continuous distribution and at least ordinal data. The most important point is that the samples come more or less from the same *form* of distribution. The samples from setosa, versicolor and virginica come more or less from normal distributions.
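One such pairwise comparison can be sketched in base R as follows (the pairing setosa vs. versicolor on Petal.Length is just one of the combinations tabulated below):

```r
# Pairwise two-sample tests: setosa vs. versicolor on Petal.Length
data(iris)
a <- iris$Petal.Length[iris$Species == "setosa"]
b <- iris$Petal.Length[iris$Species == "versicolor"]
p.f <- var.test(a, b)$p.value     # F-test for equal variances
p.t <- t.test(a, b)$p.value       # t-test for equal means (Welch)
p.w <- wilcox.test(a, b)$p.value  # Wilcoxon rank sum / Mann-Whitney U
round(c(F = p.f, t = p.t, w = p.w), 4)
```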

**Test results (p-values) setosa vs. versicolor**

setosa vs. versicolor | Petal.length | Sepal.Length | Petal.width | Sepal.width
---|---|---|---|---
F-test | 0 | 0 | 0.03 | 0.18
t-test | 0 | 0 | 0 | 0
w-test | 0 | 0 | 0 | 0

Table 5: Testresults 1

**Test results (p-values) setosa vs. virginica**

setosa vs. virginica | Petal.length | Sepal.Length | Petal.width | Sepal.width
---|---|---|---|---
F-test | 0 | 0 | 0.01 | 0.19
t-test | 0 | 0 | 0 | 0
w-test | 0 | 0 | 0 | 0

Table 6: Testresults 2

**Test results (p-values) versicolor vs. virginica**

versicolor vs. virginica | Petal.length | Sepal.Length | Petal.width | Sepal.width
---|---|---|---|---
F-test | 0 | 0 | 0.03 | 0.84
t-test | 0 | 0 | 0 | 0.002
w-test | 0 | 0 | 0 | 0.004

Table 7: Testresults 3

From the results we can conclude that the means and variances of the iris class distributions are different. From the boxplots we could at least suspect some commonalities between versicolor and virginica. Nevertheless, that is a good result for discriminant analysis. Only for Sepal width do some of the iris classes have roughly equal variances.

### Correlation Analysis

To look for causal relationships we can use the concept of partial correlation. The main question is which relationship still exists between two variables when we partial out the influence of all other "disturbing" variables. This brings us close to experimental conditions. Partial correlation analysis is also able to discover spurious correlation. In the first table below the Pearson correlation coefficients are given; in the second table the partial correlations. An example would be the correlation between height and body weight after eliminating age. Partial correlation still requires meeting all the usual assumptions of Pearson correlation, such as linearity of the relationship and homoscedasticity. If we want to examine the relationship between variables X and Y while controlling for Z, we regress X on Z and take the residual e_X, and likewise regress Y on Z to obtain e_Y. Both residuals are uncorrelated with Z, so their correlation measures the association of X and Y that is independent of Z.

**We can distinguish between three cases**

1. Partial correlation **<** Pearson correlation

Pearson correlation is overestimated because of the influence of another or more variables.

2. Partial correlation **>** Pearson correlation

Pearson correlation is underestimated because of the influence of another or more variables. Partial Correlation describes the relationship better.

3. Partial correlation **=** 0

If the partial correlation approaches 0, the original correlation is spurious. *(famous stork and baby correlation)*
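The residual construction described above can be sketched directly in base R (the choice of Sepal.Length and Sepal.Width with the two petal measurements as controls mirrors one entry of the tables below):

```r
# Partial correlation of Sepal.Length and Sepal.Width,
# controlling for Petal.Length and Petal.Width via residuals
data(iris)
e1 <- resid(lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris))
e2 <- resid(lm(Sepal.Width  ~ Petal.Length + Petal.Width, data = iris))
r.partial <- cor(e1, e2)   # correlate the two residual series
r.partial
```

Note how the sign flips: the plain Pearson correlation of the two sepal measurements is slightly negative, while the partial correlation is clearly positive.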

**Correlation Analyis for whole dataset**

| | Petal.length | Sepal.Length | Petal.width | Sepal.width |
|---|---|---|---|---|
| Petal.length | 1 | 0.87 | 0.96 | -0.43 |
| Sepal.Length | | 1 | 0.82 | -0.12 |
| Petal.width | | | 1 | -0.37 |
| Sepal.width | | | | 1 |

Table 8: Pearson Correlation

| | Petal.length | Sepal.Length | Petal.width | Sepal.width |
|---|---|---|---|---|
| Petal.length | 1 | 0.71 | 0.87 | -0.62 |
| Sepal.Length | | 1 | -0.33 | 0.63 |
| Petal.width | | | 1 | 0.35 |
| Sepal.width | | | | 1 |

Table 9: Partial Correlation

We have not detected spurious correlation. Most of the Pearson correlations are overestimated. The partial correlations make more sense because of the now positive correlation of sepal length and width. A positive correlation of petal length and width is also more likely in nature. The correlation of Sepal length and Petal width was much overestimated.

**Correlation Analyis for Iris setosa**

| | Petal.length | Sepal.Length | Petal.width | Sepal.width |
|---|---|---|---|---|
| Petal.length | 1 | 0.27 | 0.33 | 0.18 |
| Sepal.Length | | 1 | 0.28 | 0.74 |
| Petal.width | | | 1 | 0.23 |
| Sepal.width | | | | 1 |

Table 10: Pearson Correlation

| | Petal.length | Sepal.Length | Petal.width | Sepal.width |
|---|---|---|---|---|
| Petal.length | 1 | 0.17 | 0.28 | -0.04 |
| Sepal.Length | | 1 | 0.11 | 0.72 |
| Petal.width | | | 1 | 0.05 |
| Sepal.width | | | | 1 |

Table 11: Partial Correlation

**Correlation Analyis for Iris versicolor**

| | Petal.length | Sepal.Length | Petal.width | Sepal.width |
|---|---|---|---|---|
| Petal.length | 1 | 0.75 | 0.79 | 0.56 |
| Sepal.Length | | 1 | 0.55 | 0.74 |
| Petal.width | | | 1 | 0.66 |
| Sepal.width | | | | 1 |

Table 12: Pearson Correlation

| | Petal.length | Sepal.Length | Petal.width | Sepal.width |
|---|---|---|---|---|
| Petal.length | 1 | 0.63 | 0.65 | -0.11 |
| Sepal.Length | | 1 | 0.22 | 0.27 |
| Petal.width | | | 1 | 0.47 |
| Sepal.width | | | | 1 |

Table 13: Partial Correlation

We have not detected spurious correlation. Most of the Pearson correlations are much overestimated; the partial correlations therefore make more sense. The correlation of Petal length and Sepal width was much overestimated.

**Correlation Analyis for Iris virginica**

| | Petal.length | Sepal.Length | Petal.width | Sepal.width |
|---|---|---|---|---|
| Petal.length | 1 | 0.86 | 0.32 | 0.40 |
| Sepal.Length | | 1 | 0.28 | 0.46 |
| Petal.width | | | 1 | -0.37 |
| Sepal.width | | | | 1 |

Table 14: Pearson Correlation

| | Petal.length | Sepal.Length | Petal.width | Sepal.width |
|---|---|---|---|---|
| Petal.length | 1 | 0.84 | 0.18 | -0.08 |
| Sepal.Length | | 1 | -0.12 | 0.26 |
| Petal.width | | | 1 | 0.48 |
| Sepal.width | | | | 1 |

Table 15: Partial Correlation

We have not detected spurious correlation. Most of the Pearson correlations seem overestimated. A positive correlation of petal width and sepal width is more likely in nature. An only slight correlation of Petal length and Sepal width should now be more reasonable.

### Cluster Analysis

Cluster analysis is an exploratory tool for solving classification problems. Given the Iris dataset, we want to see if we can recover the different species as homogeneous clusters. Observations which are similar according to some appropriate criterion are put into one cluster, and the clusters should be as homogeneous as possible. Discriminant analysis, which is the objective of this paper, addresses the other classification issue, where the groups are known a priori and we want to classify new observations. If you want more information you can have a look here (Härdle, Simar 2003).

Cluster analysis can be divided into the following steps.

1.) **Select a distance measure**

e.g. squared Euclidean distance, Manhattan distance, Chebyshev distance

2.) **Select a clustering procedure**

e.g. hierarchical methods like Ward or centroid clustering, and linkage methods like single, complete or average linkage

3.) **Decide on the number of clusters**

e.g. from the dendrogram

For my analysis I have used Euclidean distance and Ward clustering. To visualise the results I have randomly selected 10 observations from each species and performed a cluster analysis. I have repeated this about 10 times, and the results were very similar every time. Below we can see from the dendrograms that the clusters stay quite homogeneous, which is a very good result for the later discriminant analysis. In the next chapter we have to see if discriminant analysis is also able to produce stable results.

The "ward" method has been renamed to "ward.D"; note new "ward.D2"

The "ward" method has been renamed to "ward.D"; note new "ward.D2"

Picture 8: Cluster Analysis

### Starsplot

The star plot (Chambers 1983) is a dimension reduction technique for visualizing a high-dimensional multivariate data set. Each star represents a single observation. We can look at these plots to compare observations by eye, or use them to identify clusters of iris flowers with similar features. We can also look for dominant observations or outliers. From the star plot we can see that the first 38 observations are quite similar. Observations 39 to 89 also look very similar. The remaining observations build the last cluster. The results from the star plot are not as good as those from cluster analysis, but it helps to get an impression of the data and confirms the impression that this data set is very suitable for classification. In the next section I will give a short overview of the methods I will use, before presenting the results of the Monte Carlo study.

Picture 9: Stars Plot
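A star plot of this kind is built into base R; a minimal sketch (the key position is an illustrative choice):

```r
# One star per observation; ray lengths are the scaled four measurements
data(iris)
stars(iris[, 1:4], key.loc = c(20, 0.5), main = "Iris star plot")
```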

## Classification Methods

In this section I will give a short idea of the strengths and weaknesses of some classification methods. For my further analysis I also used some Weka classifiers, which I do not describe here. If you are interested in the theory behind them you can check here: weka

### Linear Discriminant Analysis

LDA was introduced by Fisher in 1936 and has been developed further by Beaver (1966) and Altman (1968). The idea of LDA is to classify a new observation into a known group such as setosa, virginica or versicolor. The assumptions of LDA are normally distributed classes and equal class covariances. LDA only works well when we deal with continuous variables. For a two-class problem the maximum likelihood rule allocates $x$ to group $\Pi_1$ if

$$\alpha^\top \left(x - \tfrac{1}{2}(\mu_1 + \mu_2)\right) > 0, \qquad \alpha = \Sigma^{-1}(\mu_1 - \mu_2).$$

The **Z-Score** model proposed by Altman (1968) is a linear discriminant function of several measures which are objectively weighted and summed up to an overall score, which then becomes the basis for classifying new observations. The linear discriminant function should separate the observations as well as possible. A version of the Z-Score is used by the German Schufa, which tries to classify private and business customers: you obtain a certain score from attributes like age, income, place of residence etc.
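In R, LDA is available as `lda` in the MASS package. A minimal sketch with a random 2/3 training split (seed and split size are illustrative choices mirroring the simulation setup later on):

```r
# Linear discriminant analysis on a random 2/3 training split
library(MASS)
data(iris)
set.seed(42)                                  # assumption: seed only fixes the split
train <- sample(1:150, 100)
fit   <- lda(Species ~ ., data = iris, subset = train)
pred  <- predict(fit, iris[-train, ])$class
acc   <- mean(pred == iris$Species[-train])   # out-of-sample accuracy
acc
```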

### Quadratic Discriminant Analysis

Quadratic discriminant analysis is the analogue of LDA, but unlike LDA, QDA does not assume that the covariance matrices of the classes are identical. In the F-test we found that the class variances differ significantly for most attributes. QDA is therefore more flexible, but we also lose some power in interpreting the results.

### CART Model

The CART model was developed by Leo Breiman in 1984. CART builds classification and regression trees for predicting either continuous dependent variables (regression) or categorical dependent variables (classification). The main idea is that only binary decision trees are used to find an optimal separation. The choice of variables is done via maximisation of the informational content: the variables with the largest information gain (measured e.g. in entropy) are used early in the decision tree. The main technique for reducing model complexity is to "prune" the decision tree by cutting the nodes with the smallest information. For more information you can have a look here: CART.
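CART is implemented in R by the rpart package; growing and pruning a tree can be sketched as follows (the `cp` value for pruning is an illustrative choice, not an optimum):

```r
# Classification tree with rpart (CART); cp controls pruning complexity
library(rpart)
data(iris)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)                              # complexity table used for pruning
pruned <- prune(fit, cp = 0.05)           # assumption: illustrative cp value
acc <- mean(predict(pruned, iris, type = "class") == iris$Species)
acc                                       # in-sample accuracy of the pruned tree
```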

### Multinomial Logit Model

The multinomial logit model is a statistical technique for multi-class classification using multinomial logit analysis. The MLM allows for linear separation. The main idea is to analyse the dependence of a nominal dependent variable on continuous or dummy-coded independent variables. To estimate the coefficients with logistic regression, the logarithmized odds ratio is estimated via maximum likelihood.

The model is given by

$$P(Y = j \mid x) = \frac{\exp(x^\top \beta_j)}{1 + \sum_{k=1}^{J-1} \exp(x^\top \beta_k)}, \qquad j = 1, \ldots, J-1,$$

where $Y$ is the observed outcome, e.g. our classes setosa, versicolor, virginica.

The score function is the same as for LDA.
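In R the multinomial logit can be fitted with `multinom` from the nnet package; a minimal sketch:

```r
# Multinomial logit via nnet::multinom; coefficients fitted by maximum likelihood
library(nnet)
data(iris)
fit <- multinom(Species ~ ., data = iris, trace = FALSE)
head(fitted(fit))                         # class probabilities per observation
acc <- mean(predict(fit) == iris$Species) # in-sample accuracy
acc
```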

### Neural Networks

Neural networks in mathematics are inspired by biological neurons. They are very sophisticated modeling techniques capable of modeling extremely complex functions. In particular, neural networks are nonlinear and their structure is not fixed. Neural networks learn from experience and are able to predict very complex processes. Learning is done via new connections, weight changes, changes of critical values, or adding or removing neurons. For more details have a look over here: Neural Networks 1, Neural Networks 2.

### Support Vector Machines

SVM is a classification method that performs classification by constructing hyperplanes in a multidimensional space that separate cases of different classes. The main idea is that objects are divided into classes in such a way that the border between the classes is chosen so that its distance to the objects is maximized. The vector $w$ points perpendicular to the separating hyperplane. The distance between the margin hyperplanes is $2/\|w\|$, so we want to minimize $\|w\|$:

$$\min_{w,\,b} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i - b) \ge 1 \;\; \text{for all } i.$$

For nonlinear separation, kernels are used to find maximum-margin hyperplanes. SVM is therefore very flexible. For more information have a look over here: SVM.

### K-Nearest-Neighbor Estimation

The k-nearest-neighbor algorithm (k-NN) classifies objects based on a distance function for pairs of observations, such as the Euclidean distance. The training examples are mapped into a multidimensional space, which is partitioned into regions by the classes of the training examples. A point in the space is assigned to class j if j is the most frequent class label among its k nearest training examples. The decision is based on a small neighborhood of similar objects, so even if the target class is multi-modal the method can still reach good accuracy. When using only a small subset of the variables (poor similarity structure), k-NN produces more classification errors than other techniques. Since the algorithm is very fast, it is reasonable to use all variables. k-NN is also a very fast technique for estimating missing values.

The figure shows a classification problem for a new observation. Based on the distance to its 9 nearest neighbors it is classified red or blue.
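k-NN classification is available in R as `knn` in the class package; a minimal sketch with a random hold-out set (seed, split size and k = 9 are illustrative choices, k = 9 echoing the figure):

```r
# k-nearest-neighbour classification, k = 9, Euclidean distance
library(class)
data(iris)
set.seed(7)                               # assumption: seed only fixes the split
val  <- sample(1:150, 50)
pred <- knn(train = iris[-val, 1:4], test = iris[val, 1:4],
            cl = iris$Species[-val], k = 9)
acc <- mean(pred == iris$Species[val])    # hold-out accuracy
acc
```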

## Simulation Study

For the simulation I randomly chose 2/3 of the Iris data set as training data to estimate a model. Then I took the complementary third to validate the model. I used all available variables for this, and I used the classification methods in their **standard settings**, without optimizing their parameters. I repeated this 500 times. In the table and the figure below you can see the accuracy of prediction and the standard deviation of the accuracy. A good estimator should show high accuracy and minimal variance. In the second part I visualize results for some variable combinations, so we can see the degree of complexity of each classifier and which regions are more likely to be a certain species. Since the Weka classifiers didn't perform well, and due to limited space in this seminar work, I only put some Weka results in the appendix.
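The study design can be sketched for one classifier as follows (a minimal sketch using `lda` from MASS as the example method; the other classifiers are plugged in the same way):

```r
# Monte Carlo evaluation: 500 random 2/3-1/3 splits, LDA in default settings
library(MASS)
data(iris)
acc <- replicate(500, {
  val <- sample(1:150, 50)                          # 1/3 held out for validation
  fit <- lda(Species ~ ., data = iris[-val, ])      # estimate on the 2/3 training part
  mean(predict(fit, iris[val, ])$class == iris$Species[val])
})
c(mean = mean(acc), sd = sd(acc))   # accuracy and its standard deviation
```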

### Results

From the table we can see that the two simplest models perform best. SVM also performs well. CART performs poorly, since you have to prune the classification tree for out-of-sample fit, and the question arises which tree size is optimal. The Weka classifiers do not perform so well on this benchmark dataset.

From this picture we can see, e.g., that neural nets and CART are quite unstable in out-of-sample fit. This may come from in-sample overfitting.

### Linear Discriminant Analysis

We can see that linear discriminant analysis produces very clear results. The results are easy to interpret and can easily be presented to people not involved in the topic, e.g. customers.

### Quadratic Discriminant Analysis


We can see that quadratic discriminant analysis does not produce such clear results. The results lack interpretability and can hardly be presented to people not involved in the topic. Some regions far away from the relevant observations are classified into a group.

### Neural Net


(Training output of the four `nnet` fits, abridged: each network has 51 weights; the first run converged at value 30.91, while the other three stopped after 100 iterations at values 34.92, 26.94 and 29.70.)

We can see that neural nets produce quite good results. I would have expected more complicated structures. But we already know that the outcome is not very stable.

### Support Vector Machines


```r
library('lattice')
library('e1071')
trellisSK(rpdf, width = 7, height = 7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1
par(mfrow = c(2, 2))
xseq  <- seq(0, 10, length = 100)
xvech <- rep(xseq, 100)
yvech <- rep(xseq, each = 100)
newdes <- cbind(0.8 * xvech, 0.8 * yvech)
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]

# one panel per random training/validation split
for (panel in 1:4) {
  val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
  learn <- iris[-val, c(var1, var2, 5)]
  valid <- iris[ val, c(var1, var2, 5)]
  colnames(learn)[1:2] <- c("var1", "var2")
  svm.des <- svm(Species ~ var1 + var2, data = learn, kernel = "radial")
  newdes  <- as.data.frame(newdes)
  names(newdes) <- c("var1", "var2")
  newcol <- basiscolors[as.factor(predict(svm.des, newdata = newdes, type = "class"))]
  plot(newdes[, 1], newdes[, 2], col = newcol, pch = '°',
       xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
  points(learn$var1[learn$Species == "setosa"],
         learn$var2[learn$Species == "setosa"], col = "white",  pch = 'S')
  points(learn$var1[learn$Species == "versicolor"],
         learn$var2[learn$Species == "versicolor"], col = "yellow", pch = 'V')
  points(learn$var1[learn$Species == "virginica"],
         learn$var2[learn$Species == "virginica"], col = "pink",   pch = 'V')
  title("svm")
}
```

If the e1071 library (which provides svm) is not installed, please look in the appendix for the picture. We can see that the SVM produces very complicated decision functions. But we also see that the results are quite stable. It can be used for interpretation.
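For the interested reader, a hedged sketch of how the out-of-sample accuracy of such an SVM could be estimated (it assumes the `e1071` package and uses all four measurements with a random one-third validation split, as elsewhere in this thesis):

```r
# Sketch: validation accuracy of a radial-kernel SVM (assumes package 'e1071')
library(e1071)
data(iris)

set.seed(42)
m   <- nrow(iris)
val <- sample(1:m, size = round(m/3))          # one third held out
fit <- svm(Species ~ ., data = iris[-val, ], kernel = "radial")

pred <- predict(fit, newdata = iris[val, ])
mean(pred == iris$Species[val])                # share correctly classified
```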

### Multinomial Logit[edit]


```
# weights:  12 (6 variable)
initial  value 109.861229
iter  10 value 41.872538
iter  20 value 40.390267
iter  30 value 36.842683
iter  40 value 36.663815
iter  50 value 36.563396
iter  60 value 36.519359
iter  70 value 36.473781
iter  80 value 36.408262
iter  90 value 36.389289
iter 100 value 36.378849
final  value 36.378849
stopped after 100 iterations
# weights:  12 (6 variable)
initial  value 109.861229
iter  10 value 37.779543
iter  20 value 35.551232
iter  30 value 35.109549
iter  40 value 35.101683
iter  50 value 35.096107
iter  60 value 35.071997
iter  70 value 35.070601
iter  80 value 35.066898
iter  90 value 35.064249
iter 100 value 35.063871
final  value 35.063871
stopped after 100 iterations
# weights:  12 (6 variable)
initial  value 109.861229
iter  10 value 46.846306
iter  20 value 45.650663
iter  30 value 42.253107
iter  40 value 42.114046
iter  50 value 41.973118
iter  60 value 41.947883
iter  70 value 41.881955
final  value 41.785206
converged
# weights:  12 (6 variable)
initial  value 109.861229
iter  10 value 42.568737
iter  20 value 39.726483
iter  30 value 38.349406
iter  40 value 37.781959
iter  50 value 37.391824
iter  60 value 37.228920
iter  70 value 37.056443
iter  80 value 36.972660
iter  90 value 36.956299
iter 100 value 36.931599
final  value 36.931599
stopped after 100 iterations
```

We can see that the multinomial logit looks very similar to the linear discriminant analysis. It can be used well for interpreting results.
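The iteration logs above come from `multinom()` in the `nnet` package. A minimal sketch of such a fit (the choice of the two petal variables is only illustrative):

```r
# Sketch: multinomial logit on Iris (multinom() is provided by package 'nnet')
library(nnet)
data(iris)

fit <- multinom(Species ~ Petal.Length + Petal.Width, data = iris,
                maxit = 100, trace = FALSE)

# Confusion table of fitted classes against the observed species
table(predicted = predict(fit), observed = iris$Species)
```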

### CART[edit]


```r
library('lattice')
library('tree')
trellisSK(rpdf, width = 7, height = 7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1
par(mfrow = c(2, 2))
xseq  <- seq(0, 10, length = 100)
xvech <- rep(xseq, 100)
yvech <- rep(xseq, each = 100)
newdes <- cbind(0.8 * xvech, 0.8 * yvech)
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]

# one panel per random training/validation split
for (panel in 1:4) {
  val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
  learn <- iris[-val, c(var1, var2, 5)]
  valid <- iris[ val, c(var1, var2, 5)]
  colnames(learn)[1:2] <- c("var1", "var2")
  tree.des <- tree(Species ~ var1 + var2, data = learn, split = "gini")
  newdes <- as.data.frame(newdes)
  names(newdes) <- c("var1", "var2")
  newcol <- basiscolors[predict(tree.des, newdes, type = "class")]
  plot(newdes[, 1], newdes[, 2], col = newcol, pch = '°',
       xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
  points(learn$var1[learn$Species == "setosa"],
         learn$var2[learn$Species == "setosa"], col = "white",  pch = 'S')
  points(learn$var1[learn$Species == "versicolor"],
         learn$var2[learn$Species == "versicolor"], col = "yellow", pch = 'V')
  points(learn$var1[learn$Species == "virginica"],
         learn$var2[learn$Species == "virginica"], col = "pink",   pch = 'V')
  title("CART")
}
```

We can see that the CART model produces very interpretable results. Unfortunately, the out-of-sample fit is not good if we do not prune the classification tree. If the tree library is not installed, please look in the appendix for the picture.
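The pruning step can be sketched as follows (assuming the `tree` package; `cv.tree()` with `prune.misclass` selects the tree size by cross-validated misclassification, which should improve the out-of-sample fit):

```r
# Sketch: cost-complexity pruning of a classification tree (package 'tree')
library(tree)
data(iris)

set.seed(1)
fit <- tree(Species ~ ., data = iris, split = "gini")

cv   <- cv.tree(fit, FUN = prune.misclass)   # CV misclassification per size
best <- cv$size[which.min(cv$dev)]           # size with the smallest error
pruned <- prune.misclass(fit, best = best)

summary(pruned)                              # pruned tree, misclass. rate
```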

### Stability[edit]

In this section I want to find out about the stability of the predicted regions. I coloured the regions according to the frequencies of the predicted classes. Due to computational time I repeated this about 150 times. Regions whose colours are clear are predicted the same every time; the worst result would be a mixture of colours in all regions. Click on the pictures to get the R source.
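The procedure can be sketched as follows (a simplified version assuming the `MASS` package and LDA as the classifier; the grid range and the two petal variables are illustrative). Over 150 random splits we tally which class each grid point is assigned to; a clear colour then corresponds to a grid point whose most frequent class wins in (almost) every run.

```r
# Sketch: Monte Carlo stability of predicted regions (assumes package 'MASS')
library(MASS)
data(iris)
set.seed(1)

grid <- expand.grid(var1 = seq(0, 8, length = 50),
                    var2 = seq(0, 8, length = 50))
tally <- matrix(0, nrow(grid), 3,
                dimnames = list(NULL, levels(iris$Species)))
m <- nrow(iris)

for (run in 1:150) {
  val   <- sample(1:m, size = round(m/3))
  learn <- data.frame(var1    = iris$Petal.Length[-val],
                      var2    = iris$Petal.Width[-val],
                      Species = iris$Species[-val])
  fit  <- lda(Species ~ var1 + var2, data = learn)
  pred <- predict(fit, newdata = grid)$class
  idx  <- cbind(1:nrow(grid), as.integer(pred))
  tally[idx] <- tally[idx] + 1               # count votes per grid point
}

# Share of runs won by the most frequent class at each grid point
summary(apply(tally, 1, max) / 150)
```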

#### Stability LDA[edit]

This is a very good result. Only the borders are a bit unstable, but in general linear discriminant analysis is excellent in accuracy and stability.

#### Stability QDA[edit]

The QDA is stable for a few combinations of variables, but for some combinations it is very unstable due to its flexibility. I would not present this result to a customer.

#### Stability Nnet[edit]

The worst result. The regions are quite unstable, not only at the borders. An interesting result is that the predicted regions on average now look quite linear.

#### Stability CART[edit]

This is also not a good result. The regions often change and there is a clear mixture of colours.

#### Stability Multinomial Model[edit]

The result of the multinomial logit looks similar to the LDA. There is a bit more uncertainty at the borders, but it is a very good result.

#### Stability SVM[edit]

Surprisingly, SVMs are also very stable. It may take a bit of imagination to present them to a customer, but the results are very good.

### Conclusions[edit]

From the simulation we can conclude that the simplest models are superior to the complicated ones. SVM performs very well with its standard settings. Results after 500 runs can be seen as stable. In the RWeka library you can implement your own rules and classifiers. You can also use, e.g., cross-validation to optimize the parameters of neural nets and support vector machines. But do you think you can really do better that way? If we have no further information about the data and don't know anything about its structure, we should not use complicated models with many degrees of freedom like nnet or the Weka classifiers.
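The cross-validation tuning mentioned above can be sketched with `tune()` from the `e1071` package (the parameter grid below is illustrative, not a recommendation):

```r
# Sketch: cross-validated grid search over SVM parameters (package 'e1071')
library(e1071)
data(iris)
set.seed(1)

tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(gamma = c(0.1, 0.5, 1), cost = c(1, 10)),
              tunecontrol = tune.control(cross = 5))

tuned$best.parameters   # gamma/cost combination with lowest CV error
```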

## References[edit]

- Härdle, W., Klinke, S. and Müller, M. (2000). XploRe – Learning Guide. Springer-Verlag Berlin Heidelberg.

- Härdle, W. and Simar, L. (2003). Applied Multivariate Statistical Analysis. Springer-Verlag Berlin Heidelberg.

- Wikipedia, The Free Encyclopedia. http://en.wikipedia.org

- W. N. Venables and B. D. Ripley (2004). Modern Applied Statistics with S. Fourth Edition. Springer-Verlag Berlin Heidelberg.

## Appendix[edit]

In the appendix you find the R code which I either haven't used for this thesis or haven't integrated into the program, e.g. the tables of the test results.

```r
# pie chart
pie.sales <- c(1/3, 1/3, 1/3)
names(pie.sales) <- c("setosa", "versicolor", "virginica")
pie(pie.sales, col = c("blue", "green", "red"),
    main = "Piechart Classes Iris Dataset")

# tests for normality
normaltests <- function(x) {
  myMittelwert         <- mean(x)   # mean
  myStandardabweichung <- sd(x)     # standard deviation
  print(ks.test(x, pnorm, mean = myMittelwert, sd = myStandardabweichung))
  print(shapiro.test(x))
}
for (s in c("setosa", "versicolor", "virginica")) {
  normaltests(iris$Petal.Width[iris$Species == s])
  normaltests(iris$Petal.Length[iris$Species == s])
  normaltests(iris$Sepal.Width[iris$Species == s])
  normaltests(iris$Sepal.Length[iris$Species == s])
}

# hypothesis testing: compare two species on all four variables
standardtests <- function(n, m) {
  for (s in 1:4) {
    z  <- as.vector(iris[, s][iris$Species == n])
    z1 <- as.vector(iris[, s][iris$Species == m])
    ftest <- var.test(z, z1)
    ttest <- t.test(z, z1)
    wtest <- wilcox.test(z, z1)
    # write.csv2(f, "C:/f-test.csv", append = TRUE)
    print(data.frame(
      Statistic = c(ftest$statistic, ttest$statistic, wtest$statistic),
      P         = c(ftest$p.value,  ttest$p.value,  wtest$p.value),
      row.names = c("Equal Variances", "Equal Means", "Nonparametric")))
  }
}
# setosa vs. versicolor
standardtests("setosa", "versicolor")
# setosa vs. virginica
standardtests("setosa", "virginica")
# versicolor vs. virginica
standardtests("versicolor", "virginica")
```

## Comments[edit]

- The report should have been decomposed in several parts such that it does not take several minutes to load the page!
- Why not a scatterplot matrix?
- Why is the text after Picture 3 repeated by Picture 4?
- Shapiro-Wilk: what is ?
- What do I see in the tables? p-values?
- If I do a test for all data and later restrict on some subgroups then I will increase the acceptance of the . This means most probably I will accept the null hypothesis for the subgroups.
- Not all models for classification are explained
- What is the importance of the third, fourth, ... digit after the comma in the standard deviation?