Performance of Classification Methods - A Monte Carlo Study


Abstract

In this thesis I test some of the classification methods we got to know in the lecture Data Mining and Statistical Learning held by Dr. Sigbert Klinke and Dipl.-Kfm. Uwe Ziegenhagen, M.Sc. I also test some further Weka classification methods on the famous Iris dataset to see how they perform. We have learned a lot about these methods, but in the real world we often have to find out which results are stable and which are interpretable from a practical point of view. I have performed a Monte Carlo study with five hundred runs and visualized some results for the interested reader. All analyses were done in R 2.4.1.

A Short Biography

R. A. Fisher

Sir Ronald Aylmer Fisher (born 17 February 1890 in London, England; died 29 July 1962 in Adelaide, Australia) was a famous British statistician, evolutionary biologist, and geneticist. He is the father of many statistical methods we are using today. Fisher had very poor eyesight, but he was a very good student, winning a mathematical competition at the age of 16. He was tutored without pen and paper and was therefore able to visualize geometric problems without using algebra; he could produce mathematical results without stating the intermediate steps. In the early 1920s he pioneered the principles of the design of experiments and developed the well-known technique of analysis of variance. He began a systematic approach to the analysis of real-world data to support the development of modern statistical methods. In addition to analysis of variance, Fisher invented the technique of maximum likelihood and the F-distribution, and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminant analysis and Fisher information. He also began research in the field of non-parametrics, even though he did not believe it was necessary to move away from parametric statistics. He died in Adelaide, Australia in 1962. (source mainly from Wikipedia)

Dataset

Figure 1: Iris Virginica
Figure 2: Iris Virginica
Figure 3: Iris Versicolor
Figure 4: Iris Versicolor
Figure 5: Iris Setosa
Figure 6: Iris Setosa

The Iris flower dataset is a popular multivariate dataset available in nearly every statistical software package. Iris (German "Schwertlilie") is the Greek word for rainbow. It was introduced by R.A. Fisher in 1936 as an example for his famous linear discriminant analysis, which he developed in the same year.

The dataset contains three different iris species: Iris versicolor, Iris virginica and Iris setosa. The iris flower, with more than 300 subspecies all over the world and about 30 species on the North American continent, is spread widely depending on the climate. There are six subgenera (Iris, Limniris, Xiphium, Nepalensis, Scorpiris, Hermodactyloides), of which five are restricted to the Old World and one (Limniris) to the New World.

Their range extends from cold regions to the grassy slopes, meadow lands, stream banks and deserts of Europe, the Middle East and northern Africa, Asia and across North America.



The dataset measures four variables: sepal (German "Kelchblatt") width and length, and petal (German "Blütenblatt") width and length. The surfaces of the outer petals form a landing stage for flying insects, which pollinate the flower and collect the nectar. There are 150 observations altogether, 50 per species. In the descriptive part you will get to know a bit more about the differences between the species.

The Iris virginica can be found along a straight line from Quebec down to Texas, a region with hot summers and mild winters in the south and mild summers and cold winters in the north. This region is also characterized by the Great Lakes and the great North American rivers Mississippi, Missouri, Ohio and Arkansas.

The Iris versicolor can be found in the north-east of the United States, the Great Lakes region and the Canadian provinces of Quebec, Ontario, Newfoundland, Manitoba and Saskatchewan. From the overlap of the locations we can expect some similarities between Iris versicolor and Iris virginica.

The Iris setosa, in contrast, is principally spread in the north-western region of the Rocky Mountains, mainly in the Canadian provinces of British Columbia and the Yukon Territory and in Alaska (US). This region is characterized by very cold winters, short mild summers and a large amount of rainfall per year caused by rain clouds from the Pacific.


A sample of the data is given below.


Obs    Sepal.Length   Sepal.Width   Petal.Length   Petal.Width   Species
1      5.1            3.5           1.4            0.2           setosa
2      4.9            3.0           1.4            0.2           setosa
3      4.7            3.2           1.3            0.2           setosa
4      4.6            3.1           1.5            0.2           setosa
5      5.0            3.6           1.4            0.2           setosa
6      5.4            3.9           1.7            0.4           setosa
7      4.6            3.4           1.4            0.3           setosa
8      5.0            3.4           1.5            0.2           setosa
9      4.4            2.9           1.4            0.2           setosa
10     4.9            3.1           1.5            0.1           setosa
...    ...            ...           ...            ...           ...
51     7.0            3.2           4.7            1.4           versicolor
52     6.4            3.2           4.5            1.5           versicolor
53     6.9            3.1           4.9            1.5           versicolor
54     5.5            2.3           4.0            1.3           versicolor
55     6.5            2.8           4.6            1.5           versicolor
56     5.7            2.8           4.5            1.3           versicolor
57     6.3            3.3           4.7            1.6           versicolor
58     4.9            2.4           3.3            1.0           versicolor
59     6.6            2.9           4.6            1.3           versicolor
60     5.2            2.7           3.9            1.4           versicolor
...    ...            ...           ...            ...           ...
101    6.3            3.3           6.0            2.5           virginica
102    5.8            2.7           5.1            1.9           virginica
103    7.1            3.0           5.9            2.1           virginica
104    6.3            2.9           5.6            1.8           virginica
105    6.5            3.0           5.8            2.2           virginica
106    7.6            3.0           6.6            2.1           virginica
107    4.9            2.5           4.5            1.7           virginica
108    7.3            2.9           6.3            1.8           virginica
109    6.7            2.5           5.8            1.8           virginica
110    7.2            3.6           6.1            2.5           virginica
...    ...            ...           ...            ...           ...

Figure 7: Iris Data

Descriptives and some tests

Scatterplots

To look for relationships in the data I want to start with some scatterplots. A scatterplot provides a graphical display of the relationship between two variables. An upward-sloping scatterplot indicates that as the variable on the horizontal axis increases, the variable on the vertical axis increases as well. We can discover some structure in the data and the kind of relationship between the different variables (linear, quadratic, etc.). When a scatterplot shows a relationship between two variables, there is NOT necessarily a cause-and-effect relationship. Both variables could be related to one or more third variables that explain their variation, or there could be some other cause. Later on, I want to explain the more advanced concept of partial correlation, which will allow us to come close to experimental conditions.



<R output="display">
library('lattice')
trellisSK(rpdf, width=10.6, height=5.3)
leg.txt <- c("setosa", "versicolor", "virginica")
par(mfrow = c(1, 2), bg = "white")
x <- c(0, 0)
# empty plot frame, then one colour per species
plot(x, type = "n", ylim = c(0, 9), xlim = c(0, 8), xlab = "Petal Width", ylab = "Petal Length")
points(iris$Petal.Width[1:50],    iris$Petal.Length[1:50],    col = "blue")
points(iris$Petal.Width[51:100],  iris$Petal.Length[51:100],  col = "green")
points(iris$Petal.Width[101:150], iris$Petal.Length[101:150], col = "red")
legend(5.5, 2.5, leg.txt, fill = c("blue", "green", "red"), bg = "white")
title("Fisher's Iris Data")

plot(x, type = "n", ylim = c(0, 9), xlim = c(0, 8), xlab = "Sepal Width", ylab = "Sepal Length")
points(iris$Sepal.Width[1:50],    iris$Sepal.Length[1:50],    col = "blue")
points(iris$Sepal.Width[51:100],  iris$Sepal.Length[51:100],  col = "green")
points(iris$Sepal.Width[101:150], iris$Sepal.Length[101:150], col = "red")
legend(5.5, 2.5, leg.txt, fill = c("blue", "green", "red"), bg = "white")
title("Fisher's Iris Data")
</R>

<R output="display">
library('lattice')
trellisSK(rpdf, width=10.6, height=5.3)
leg.txt <- c("setosa", "versicolor", "virginica")
par(mfrow = c(1, 2), bg = "white")
x <- c(0, 0)
plot(x, type = "n", ylim = c(0, 9), xlim = c(0, 8), xlab = "Sepal Width", ylab = "Petal Width")
points(iris$Sepal.Width[1:50],    iris$Petal.Width[1:50],    col = "blue")
points(iris$Sepal.Width[51:100],  iris$Petal.Width[51:100],  col = "green")
points(iris$Sepal.Width[101:150], iris$Petal.Width[101:150], col = "red")
legend(0, 9, leg.txt, fill = c("blue", "green", "red"), bg = "white")
title("Fisher's Iris Data")

plot(x, type = "n", ylim = c(0, 9), xlim = c(0, 8), xlab = "Sepal Length", ylab = "Petal Length")
points(iris$Sepal.Length[1:50],    iris$Petal.Length[1:50],    col = "blue")
points(iris$Sepal.Length[51:100],  iris$Petal.Length[51:100],  col = "green")
points(iris$Sepal.Length[101:150], iris$Petal.Length[101:150], col = "red")
legend(0, 9, leg.txt, fill = c("blue", "green", "red"), bg = "white")
title("Fisher's Iris Data")
</R>


<R output="display">
library('lattice')
trellisSK(rpdf, width=10.6, height=5.3)
leg.txt <- c("setosa", "versicolor", "virginica")
par(mfrow = c(1, 2), bg = "white")
x <- c(0, 0)
plot(x, type = "n", ylim = c(0, 9), xlim = c(0, 8), xlab = "Sepal Width", ylab = "Petal Length")
points(iris$Sepal.Width[1:50],    iris$Petal.Length[1:50],    col = "blue")
points(iris$Sepal.Width[51:100],  iris$Petal.Length[51:100],  col = "green")
points(iris$Sepal.Width[101:150], iris$Petal.Length[101:150], col = "red")
legend(0, 2.5, leg.txt, fill = c("blue", "green", "red"), bg = "white")
title("Fisher's Iris Data")

plot(x, type = "n", ylim = c(0, 9), xlim = c(0, 8), xlab = "Sepal Length", ylab = "Petal Width")
points(iris$Sepal.Length[1:50],    iris$Petal.Width[1:50],    col = "blue")
points(iris$Sepal.Length[51:100],  iris$Petal.Width[51:100],  col = "green")
points(iris$Sepal.Length[101:150], iris$Petal.Width[101:150], col = "red")
legend(0, 2.5, leg.txt, fill = c("blue", "green", "red"), bg = "white")
title("Fisher's Iris Data")
</R>


Picture 1: Scatterplots of Iris Variables


From the scatterplots we can get a feeling for the ability of the examined variables to discriminate the different species. There are combinations which allow a very good discrimination, like Petal Width vs. Petal Length (plot 1) and Sepal Length vs. Petal Length (plot 5). The combinations Sepal Length vs. Sepal Width (plot 2) and Petal Width vs. Sepal Length (plot 6) have many overlaps between the species versicolor and virginica, which complicates differentiation. Altogether we can see that all four variables are useful to describe the dataset. For the later purpose of discriminant analysis I have to choose two variables to visualise results. Another way would be to work with principal components.

Boxplots

Boxplots are graphical techniques which allow us to display the distribution of a variable. They help us to see location, skewness, spread, tail length and outlying points. Boxplots are graphical representations of the five-number summary, which is also given below.

<R output="display" name="boxplot">
library('lattice')
trellisSK(rpdf, width=5, height=5)
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[1:50, 1:4], main = "setosa")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "setosa", ylim = c(0, 8))
</R>


<R output="display" name="boxplot">
library('lattice')
trellisSK(rpdf, width=5, height=5)
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[51:100, 1:4], main = "versicolor")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "versicolor", ylim = c(0, 8))
</R>


<R output="display" name="boxplot">
library('lattice')
trellisSK(rpdf, width=5, height=5)
par(mfrow = c(1, 1))
bx.p <- boxplot(iris[101:150, 1:4], main = "virginica")
bxp(bx.p, notch = TRUE, axes = TRUE, pch = 4, boxfill = 1:4, main = "virginica", ylim = c(0, 8))
</R>

Picture 2: Boxplots


I scaled the y-axis of the right-hand plots from 0 to 8 cm. From the boxplots we can see that the distributions of the four attributes are quite similar for versicolor and virginica. The species setosa has a quite different distribution in its attributes. Without testing, outliers do not seem to be a problem. All in all, we can expect some difficulties in discriminant analysis because of the similar distributions of versicolor and virginica.


Figure 8: Summary statistics
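
The per-species summary statistics shown in Figure 8 can be reproduced directly in R; a minimal sketch, assuming the built-in iris data frame:

summary(iris[iris$Species == "setosa", 1:4])       # five-number summary plus mean, setosa
summary(iris[iris$Species == "versicolor", 1:4])   # ... versicolor
summary(iris[iris$Species == "virginica", 1:4])    # ... virginica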

Kernel Density Estimation

Density estimation is a nonparametric tool which allows us to estimate a probability density function in order to see how a random variable is distributed. The simplest method is the histogram. A more advanced method is kernel density estimation. We need a bandwidth h and a so-called kernel (weighting) function to assign weight to observations x_i whose distance from x is not bigger than h. By playing with the bandwidth and the kernel weights we can determine the smoothness of the density. A lot of research has been done on calculating the optimal bandwidth. If you are interested in the topic you can have a look here (Härdle, Müller, Sperlich 2004). I have used the Gaussian kernel and Silverman's rule of thumb, which is one way of determining the optimal bandwidth h_{opt}.
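
For reference, the kernel density estimator with kernel K and bandwidth h is

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x-x_i}{h}\right)

and Silverman's rule of thumb, which is the default bandwidth (bw.nrd0) of R's density() function, chooses approximately

h_{opt} = 0.9 \, \min\left(\hat{\sigma}, \frac{IQR}{1.34}\right) n^{-1/5}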

<R output="display" name="lars">
library('lattice')
trellisSK(rpdf, width=7, height=7)
par(mfrow = c(2, 2))
x <- c(0, 0)
plot(x, xlim = c(3.5, 8), ylim = c(0, 1.2), ylab = "Density", xlab = "Sepal.Length", main = "Density Plot")
polygon(density(iris$Sepal.Length), col = "grey")
plot(x, xlim = c(1.5, 5.5), ylim = c(0, 1.2), ylab = "Density", xlab = "Sepal.Width", main = "Density Plot")
polygon(density(iris$Sepal.Width), col = "wheat")
plot(x, xlim = c(0, 8), ylim = c(0, 1.2), ylab = "Density", xlab = "Petal.Length", main = "Density Plot")
polygon(density(iris$Petal.Length), col = "wheat")
plot(x, xlim = c(0, 3.5), ylim = c(0, 1.2), ylab = "Density", xlab = "Petal.Width", main = "Density Plot")
polygon(density(iris$Petal.Width), col = "grey")
</R>


Picture 3: Density Estimation all Species



We can see that petal length and petal width have a somewhat bimodal distribution. The first mode should come from the much smaller setosa. Sepal length and sepal width look quite normal, but we have to take into account that the Gaussian kernel tends to oversmooth. In the next pictures you see the results of the density estimation for every class.


Kernel Density Estimation setosa


<R output="display" name="lars">
library('lattice')
trellisSK(rpdf, width=7, height=7)
par(mfrow = c(2, 2))
x <- c(0, 0)
plot(x, xlim = c(3.5, 8), ylim = c(0, 1.4), ylab = "Density", xlab = "Sepal.Length", main = "Density Plot setosa")
polygon(density(iris$Sepal.Length[1:50]), col = "grey")
plot(x, xlim = c(1.5, 5.5), ylim = c(0, 1.4), ylab = "Density", xlab = "Sepal.Width", main = "Density Plot setosa")
polygon(density(iris$Sepal.Width[1:50]), col = "wheat")
plot(x, xlim = c(0.5, 2.5), ylim = c(0, 2.6), ylab = "Density", xlab = "Petal.Length", main = "Density Plot setosa")
polygon(density(iris$Petal.Length[1:50]), col = "wheat")
plot(x, xlim = c(0, 0.9), ylim = c(0, 9), ylab = "Density", xlab = "Petal.Width", main = "Density Plot setosa")
polygon(density(iris$Petal.Width[1:50]), col = "grey")
</R>


Picture 4: Density Estimation Setosa




Kernel Density Estimation versicolor

<R output="display" name="lars">
library('lattice')
trellisSK(rpdf, width=7, height=7)
par(mfrow = c(2, 2))
x <- c(0, 0)
plot(x, xlim = c(3.5, 8), ylim = c(0, 1.4), ylab = "Density", xlab = "Sepal.Length", main = "Density Plot versicolor")
polygon(density(iris$Sepal.Length[51:100]), col = "grey")
plot(x, xlim = c(1.5, 4), ylim = c(0, 1.4), ylab = "Density", xlab = "Sepal.Width", main = "Density Plot versicolor")
polygon(density(iris$Sepal.Width[51:100]), col = "wheat")
plot(x, xlim = c(2.5, 6), ylim = c(0, 1), ylab = "Density", xlab = "Petal.Length", main = "Density Plot versicolor")
polygon(density(iris$Petal.Length[51:100]), col = "wheat")
plot(x, xlim = c(0.8, 2.2), ylim = c(0, 2.6), ylab = "Density", xlab = "Petal.Width", main = "Density Plot versicolor")
polygon(density(iris$Petal.Width[51:100]), col = "grey")
</R>


Picture 5: Density Estimation Versicolor






Kernel Density Estimation virginica


<R output="display" name="lars">
library('lattice')
trellisSK(rpdf, width=7, height=7)
par(mfrow = c(2, 2))
x <- c(0, 0)
plot(x, xlim = c(3.5, 8), ylim = c(0, 1.4), ylab = "Density", xlab = "Sepal.Length", main = "Density Plot virginica")
polygon(density(iris$Sepal.Length[101:150]), col = "grey")
plot(x, xlim = c(1.5, 4), ylim = c(0, 1.4), ylab = "Density", xlab = "Sepal.Width", main = "Density Plot virginica")
polygon(density(iris$Sepal.Width[101:150]), col = "wheat")
plot(x, xlim = c(3.5, 7.5), ylim = c(0, 1), ylab = "Density", xlab = "Petal.Length", main = "Density Plot virginica")
polygon(density(iris$Petal.Length[101:150]), col = "wheat")
plot(x, xlim = c(1, 3), ylim = c(0, 2.6), ylab = "Density", xlab = "Petal.Width", main = "Density Plot virginica")
polygon(density(iris$Petal.Width[101:150]), col = "grey")
</R>


Picture 6: Density Estimation Virginica



In the next section I perform the Shapiro-Wilk and the Kolmogorov-Smirnov test for normality. Normal distribution of the attributes within the classes is often an assumption for classifiers like linear discriminant analysis. More flexible classifiers like neural networks or support vector machines can, however, overcome this problem.

Test for Normality

Some methods require normally distributed data. A more sophisticated way of evaluating whether a random variable follows a normal distribution than just looking at density plots is to perform a test.


The Shapiro-Wilk test tests the null hypothesis that a statistical sample came from a normal distribution.

The test statistic is

W = \frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i-\overline{x})^2}

For p-values > 0.05 the null hypothesis cannot be rejected.


Kolmogorov-Smirnov-Test

The Kolmogorov-Smirnov test tests the null hypothesis that a statistical sample came from a normal distribution. Since it is a nonparametric test it is very robust, but not very exact and generally less sensitive than the Shapiro-Wilk test. The main idea is to compare the frequencies of the empirical distribution function S(x_i) with the frequencies of the standard normal distribution function.


For every i the absolute differences

 d_{oi} = |S(x_i)-F_0(x_i)|~

and

 d_{ui} = |S(x_{i-1})-F_0(x_i)|~

are computed. The biggest difference d_{max} is taken from all these differences. If d_{max} exceeds a critical value d_{\alpha}, the null hypothesis is rejected. The tables below give the p-values of the tests. The R code of the tests can be found in the appendix. To get the R code for the pictures please click on them.
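
As a minimal illustration (the full test code is in the appendix), p-values like those in the tables below can be computed with R's built-in tests; note that ks.test() with the mean and standard deviation estimated from the same sample is only an approximation:

x <- iris$Sepal.Width[iris$Species == "setosa"]
shapiro.test(x)$p.value                                    # Shapiro-Wilk test
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$p.value    # Kolmogorov-Smirnov test against a fitted normal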


Test for normality with all species


Test Results      Petal.length   Sepal.length   Petal.width   Sepal.width
Shapiro Test      0              0.01           0             0.1
Kolmogorov Test   0              0.18           0.003         0.07

Table 1: Test for normality with all species


Nearly all tests rejected the null hypothesis of normal distribution. Only sepal width passed both the Kolmogorov-Smirnov and the Shapiro-Wilk test.


Test for normality for setosa


Test Results      Petal.length   Sepal.length   Petal.width   Sepal.width
Shapiro Test      0.05           0.46           0             0.27
Kolmogorov Test   0.19           0.52           0             0.64

Table 2: Test for normality for species setosa


For the setosa species both tests reject the null hypothesis of normal distribution for petal width. Petal length only just passed the Shapiro-Wilk test.


Test for normality for versicolor


Test Results      Petal.length   Sepal.length   Petal.width   Sepal.width
Shapiro Test      0.15           0.46           0.02          0.33
Kolmogorov Test   0.50           0.74           0.23          0.46

Table 3: Test for normality for species versicolor


For the versicolor species nearly all tests could not reject the null hypothesis of normal distribution. Only petal width did not pass the Shapiro-Wilk test.


Test for normality for virginica


Test Results      Petal.length   Sepal.length   Petal.width   Sepal.width
Shapiro Test      0.11           0.25           0.09          0.39
Kolmogorov Test   0.53           0.52           0.46          0.18

Table 4: Test for normality for species virginica


For the virginica species none of the tests could reject the null hypothesis of normal distribution.

Hypothesis Testing

In this section I want to briefly perform, pairwise between the species, the F-test for equal variances, the t-test for equal means and the nonparametric Wilcoxon rank sum test, which is equivalent to the Mann-Whitney U test. The F-test and the t-test should be known from MVA. The Wilcoxon rank sum test is a parameter-free test for the comparison of the locations of different distributions; intuitively it is an alternative to the t-test in which the data are replaced by their ranks. The null hypothesis is that there are no differences between the distributions; the alternative hypothesis says that there is a difference. For p-values smaller than 0.05 we can reject the null hypothesis. The assumptions are that we have independent samples, a continuous distribution and at least ordinal data. The most important point is that the samples come from more or less the same form of distribution; the samples from setosa, versicolor and virginica come more or less from normal distributions.
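
A minimal sketch of one such pairwise comparison (setosa vs. versicolor for Petal.Length, assuming the built-in iris data); the tables below report p-values of this kind:

x <- iris$Petal.Length[iris$Species == "setosa"]
y <- iris$Petal.Length[iris$Species == "versicolor"]
var.test(x, y)$p.value      # F-test for equal variances
t.test(x, y)$p.value        # t-test for equal means
wilcox.test(x, y)$p.value   # Wilcoxon rank sum (Mann-Whitney U) test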

Test results (p-values) setosa vs. versicolor
setosa vs. versicolor   Petal.length   Sepal.Length   Petal.width   Sepal.width
F-test                  0              0              0.03          0.18
t-test                  0              0              0             0
w-test                  0              0              0             0

Table 5: Testresults 1


Test results (p-values) setosa vs. virginica
setosa vs. virginica    Petal.length   Sepal.Length   Petal.width   Sepal.width
F-test                  0              0              0.01          0.19
t-test                  0              0              0             0
w-test                  0              0              0             0

Table 6: Testresults 2


Test results (p-values) versicolor vs. virginica


versicolor vs. virginica   Petal.length   Sepal.Length   Petal.width   Sepal.width
F-test                     0              0              0.03          0.84
t-test                     0              0              0             0.002
w-test                     0              0              0             0.004

Table 7: Testresults 3

From the results we can conclude that the means and variances of the distributions of the iris classes are different. From the boxplots we could at least have suspected some similarities between versicolor and virginica. Nevertheless, that is a good result for discriminant analysis. Only in sepal width do some iris classes have equal variances.

Correlation Analysis

To look for causal relationships we can use the concept of partial correlation. The main idea is to ask which relationship still exists between two variables when we remove the influence of all other "disturbing" variables; in this way we can come close to experimental conditions. Partial correlation analysis is also able to discover spurious correlations. In the first table below the Pearson correlation coefficients are given; in the second table the partial correlations are given. An example is the correlation between height and body weight when we eliminate age. Partial correlation still requires meeting all the usual assumptions of Pearson correlation, such as linearity of the relationship and homoscedasticity. If we want to examine the relationship between variables X and Y while controlling for Z, we regress X on Z and Y on Z and correlate the resulting residuals. These residuals are uncorrelated with Z, so we obtain the correlation of X with Y which is independent of Z.
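
A minimal sketch of this residual approach in R (assuming the built-in iris data): the partial correlation of sepal length and sepal width given the two petal measurements is the correlation of the two residual vectors.

e.x <- resid(lm(Sepal.Length ~ Petal.Length + Petal.Width, data = iris))
e.y <- resid(lm(Sepal.Width  ~ Petal.Length + Petal.Width, data = iris))
cor(e.x, e.y)                               # partial correlation, compare Table 9
cor(iris$Sepal.Length, iris$Sepal.Width)    # plain Pearson correlation, compare Table 8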


We can distinguish between three cases


1. Partial correlation < Pearson correlation


The Pearson correlation is overestimated because of the influence of one or more other variables.


2. Partial correlation > Pearson correlation


The Pearson correlation is underestimated because of the influence of one or more other variables. The partial correlation describes the relationship better.


3. Partial correlation = 0


If the partial correlation approaches 0, the original correlation is spurious (the famous stork and baby correlation).


Correlation Analysis for the whole dataset


m / n          Petal.length   Sepal.Length   Petal.width   Sepal.width
Petal.length   1              0.87           0.96          -0.43
Sepal.Length                  1              0.82          -0.12
Petal.width                                  1             -0.37
Sepal.width                                                 1

Table 8: Pearson Correlation


m / n          Petal.length   Sepal.Length   Petal.width   Sepal.width
Petal.length   1              0.71           0.87          -0.62
Sepal.Length                  1              -0.33         0.63
Petal.width                                  1             0.35
Sepal.width                                                 1

Table 9: Partial Correlation


We have not detected spurious correlations. Most of the Pearson correlations are overestimated. The partial correlations make more sense because of the now positive correlation of sepal length and sepal width. A positive correlation of petal length and petal width is also more likely in nature. The correlation of sepal length and petal width was much overestimated.

Correlation Analysis for Iris setosa


m / n          Petal.length   Sepal.Length   Petal.width   Sepal.width
Petal.length   1              0.27           0.33          0.18
Sepal.Length                  1              0.28          0.74
Petal.width                                  1             0.23
Sepal.width                                                1

Table 10: Pearson Correlation


m / n          Petal.length   Sepal.Length   Petal.width   Sepal.width
Petal.length   1              0.17           0.28          -0.04
Sepal.Length                  1              0.11          0.72
Petal.width                                  1             0.05
Sepal.width                                                 1

Table 11: Partial Correlation


We have detected some spurious correlations (approaching zero). Most of the Pearson correlations are overestimated.


Correlation Analysis for Iris versicolor


m / n          Petal.length   Sepal.Length   Petal.width   Sepal.width
Petal.length   1              0.75           0.79          0.56
Sepal.Length                  1              0.55          0.74
Petal.width                                  1             0.66
Sepal.width                                                1

Table 12: Pearson Correlation


m / n          Petal.length   Sepal.Length   Petal.width   Sepal.width
Petal.length   1              0.63           0.65          -0.11
Sepal.Length                  1              0.22          0.27
Petal.width                                  1             0.47
Sepal.width                                                 1

Table 13: Partial Correlation


We have not detected spurious correlations. Most of the Pearson correlations are much overestimated; the partial correlations therefore make more sense. The correlation of petal length and sepal width was much overestimated.

Correlation Analysis for Iris virginica


m / n          Petal.length   Sepal.Length   Petal.width   Sepal.width
Petal.length   1              0.86           0.32          0.40
Sepal.Length                  1              0.28          0.46
Petal.width                                  1             -0.37
Sepal.width                                                 1

Table 14: Pearson Correlation


m / n          Petal.length   Sepal.Length   Petal.width   Sepal.width
Petal.length   1              0.84           0.18          -0.08
Sepal.Length                  1              -0.12         0.26
Petal.width                                  1             0.48
Sepal.width                                                 1

Table 15: Partial Correlation


We have not detected spurious correlations. Most of the Pearson correlations seem overestimated. A positive correlation of petal width and sepal width is more likely in nature, and an only slight correlation of petal length and sepal width now seems more reasonable.

Cluster Analysis

Cluster analysis is an exploratory tool for solving classification problems. Given the Iris dataset we want to see if we can recover our different species as homogeneous clusters. Observations which are similar according to some appropriate criterion are put into one cluster; the clusters should be as homogeneous as possible. Discriminant analysis, which is the objective of this paper, addresses the other issue of classification, where the groups are known a priori and we want to classify new observations. If you want to get more information you can have a look here (Härdle, Simar 2003).


Cluster analysis can be divided into the following steps.


1.) Select a distance measure

e.g. squared Euclidean distance, Manhattan distance, Chebychev distance


2.) Select a clustering procedure

e.g. hierarchical clustering like Ward clustering or centroid clustering, or linkage methods like single linkage, complete linkage and average linkage


3.) Decide on the number of clusters

e.g. from the dendrogram


For my analysis I have used the Euclidean distance and Ward clustering. To visualise the results I have randomly selected 10 observations from each species and performed a cluster analysis. I have repeated this about 10 times and the results were very similar every time. Below we can see from the dendrograms that the clusters stay quite homogeneous, which is a very good sign for the later discriminant analysis. In the next chapter we have to see if discriminant analysis is also able to produce stable results.

<R output="display">
library('lattice')
trellisSK(rpdf, width=7, height=7)
par(mfrow = c(1, 1))
iris2 <- data.frame(cbind(iris$Species, iris[, 1:4]))
# draw 10 random observations from each species
z1 <- sample(1:50, 10, replace = FALSE)
z2 <- sample(51:100, 10, replace = FALSE)
z3 <- sample(101:150, 10, replace = FALSE)
z  <- rbind(z1, z2, z3)
x  <- as.matrix(iris2[z, c(-1)])
d1 <- dist(x, method = "euclidean", p = 1)
h1 <- hclust(d1, method = "ward")
plot(h1, labels = as.matrix(iris2[z, c(1)]), xlab = "Ward + Euclidean", axes = FALSE)
</R>

<R output="display">
library('lattice')
trellisSK(rpdf, width=7, height=7)
par(mfrow = c(1, 1))
iris2 <- data.frame(cbind(iris$Species, iris[, 1:4]))
z1 <- sample(1:50, 10, replace = FALSE)
z2 <- sample(51:100, 10, replace = FALSE)
z3 <- sample(101:150, 10, replace = FALSE)
z  <- rbind(z1, z2, z3)
x  <- as.matrix(iris2[z, c(-1)])
d1 <- dist(x, method = "euclidean", p = 1)
h1 <- hclust(d1, method = "ward")
plot(h1, labels = as.matrix(iris2[z, c(1)]), xlab = "Ward + Euclidean", axes = FALSE)
</R>


Picture 8: Cluster Analysis

Starsplot

The star plot (Chambers 1983) is a dimension reduction technique for visualizing a high-dimensional multivariate data set. Each star represents a single observation. We can look at these plots to see the differences between observations by eye, or we can use them to identify clusters of iris flowers with similar features. We can also look for dominant observations or for outliers. We can see from the star plot that the first 38 observations are quite similar. Observations 39 to 89 also look very similar. The remaining observations build the last cluster. The results from the star plot are not as good as those of the cluster analysis, but it helps to get an impression of the data and confirms the impression that this data set is very suitable for classification. In the next section I will give you a short overview of the methods I will use before I give the results of the Monte Carlo study.


<R output="display">

library('lattice') trellisSK(rpdf, width=7, height=7) stars(iris[, 1:4], key.loc = c(24, 1),

     main = "Iris Data - Star Plot",flip.labels=FALSE)

</R>

Picture 9: Stars Plot

Classification Methods

In this section I will give a short idea about the strengths and weaknesses of some classification methods. For my further analysis I also used some Weka classifiers which I do not describe here. If you are interested in the theory behind them you can check here: weka

Linear Discriminant Analysis

LDA was invented by Fisher in 1936 and has been developed further by Beaver (1966) and Altman (1968). The idea of LDA is to classify a new observation into a known group such as setosa, virginica or versicolor. The assumptions of LDA are normally distributed classes and equal class covariances. LDA only works well when we deal with continuous variables. For a two-class problem the maximum likelihood rule allocates x\, to \Pi_1\, if


 (\vec x- \vec \mu_1)^T \Sigma^{-1} ( \vec x- \vec \mu_1) \leq (\vec x- \vec \mu_2)^T \Sigma^{-1} ( \vec x- \vec \mu_2) .


The Z-score model proposed by Altman (1968) is a linear discriminant function of some measures which are objectively weighted and summed up to an overall score, which then becomes the basis for classifying new observations. The linear discriminant function should separate the observations as well as possible. A version of the Z-score is used by the German Schufa, which classifies private and business customers: you obtain a certain score from attributes like age, income, place of residence etc.

Z_i=a_1x_{i1}+a_2x_{i2}+...+a_dx_{id}=a^\top x_i

Quadratic Discriminant Analysis

Quadratic discriminant analysis is similar to the linear case, but unlike LDA, QDA does not assume that the covariance matrices of the classes are identical. In the F-tests we found that the variances of the classes differ significantly for most attributes. QDA is therefore more flexible, but we also lose some power for interpreting the results.
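
For reference, the corresponding maximum likelihood rule with class-specific covariance matrices (a standard result, stated here for completeness) allocates \vec x\, to \Pi_1\, if

 (\vec x- \vec \mu_1)^T \Sigma_1^{-1} ( \vec x- \vec \mu_1) + \ln|\Sigma_1| \leq (\vec x- \vec \mu_2)^T \Sigma_2^{-1} ( \vec x- \vec \mu_2) + \ln|\Sigma_2| ,

which is quadratic rather than linear in \vec x\,.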

CART Model

The CART model was developed by Leo Breiman in 1984. CART builds classification and regression trees for predicting continuous dependent variables (the regression case) and categorical dependent variables (the classification case). The main idea is that only binary decision trees are used to find an optimal separation. The choice of variables is done via maximization of the information content: the variables with the largest information gain (measured e.g. in entropy) are used early in the decision tree. The main technique to reduce complexity in the model is to "prune" the decision tree by cutting the nodes with the smallest information. For more information you can look over here: CART.
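
As a small illustration (not the simulation code used later, which calls tree() with split="gini"), a classification tree for the full iris data can be grown and inspected like this:

library(tree)
fit <- tree(Species ~ ., data = iris)   # binary splits chosen by information content
summary(fit)                            # variables actually used and misclassification rate
plot(fit); text(fit)                    # draw the tree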


Picture 10: Picture of a CART Model

Multinomial Logit Model

The multinomial logit model (MLM) is a statistical technique for multi-class classification using multinomial logit analysis. The MLM allows for linear separation. The main idea is to analyse the dependence of a nominal dependent variable on independent continuous or dummy-coded variables. To estimate the coefficients with logistic regression, the logarithmized odds ratio is estimated via maximum likelihood.

The model is given here.

\Pr(y_{i}=j)=\frac{\exp(X_{i}\beta_{j})}{\sum_{k=1}^{J}\exp(X_{i}\beta_{k})}

where y_{i}\, is the observed outcome, e.g. our classes setosa, versicolor, virginica.

The score function is the same as for LDA. Z_i=a_1x_{i1}+a_2x_{i2}+...+a_dx_{id}=a^\top x_i

Neural Networks

Neural networks in mathematics are based on biological neurons. They are very sophisticated modeling techniques capable of modeling extremely complex functions; in particular, neural networks are nonlinear and their structure is not fixed. Neural networks learn from experience and have the ability to predict very complex processes. Learning is done via new connections, weight changes, changes of critical values or the addition or removal of neurons. For more information in detail have a look over here: Neural Networks 1, Neural Networks 2.


Picture 11: Picture of a Neural Network Model

Support Vector Machines

SVM is a classification method that performs classification by constructing hyperplanes in a multidimensional space that separate cases of different classes. The main idea is that the objects are divided into classes in such a way that the border between the classes is chosen so that the distance between it and the objects is maximized. The vector w points perpendicular to the separating hyperplane. The distance between the margin hyperplanes is 2/|w|, so we want to minimize |w|.

Picture 12: Picture of a SVM Model


We minimize (1/2)||\mathbf{w}||^2 subject to c_i(\mathbf{w}\cdot\mathbf{x_i} - b) \ge 1,\quad 1 \le i \le n. For nonlinear separation, kernels are used to find maximum-margin hyperplanes. SVM is therefore very flexible. For more information have a look over here: SVM.

K-Nearest-Neighbor Estimation

The k-nearest-neighbor algorithm (k-NN) is a method for classifying objects based on a distance function for pairs of observations, such as the Euclidean distance. The training examples are mapped into a multidimensional space, which is partitioned into regions according to the classes of the training examples. A point in the space is assigned to class j if j is the most frequent class label among the k nearest training examples. The decision is based on a small neighborhood of similar objects, so even if the target class is multi-modal k-NN can still lead to good accuracy. When using only a small subset of the variables (poor similarity structure) k-NN produces more classification errors than other techniques; since the algorithm is very fast it can be suggested to use all variables. k-NN is also a very fast technique for estimating missing values. A small sketch is given below.
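
A minimal k-NN sketch with k = 9 on the iris data, using the class package (whether this implementation was used in the study is not stated):

library(class)
train.idx <- sample(1:150, 100)                 # 2/3 of the observations for training
pred <- knn(train = iris[train.idx, 1:4],
            test  = iris[-train.idx, 1:4],
            cl    = iris$Species[train.idx], k = 9)
mean(pred == iris$Species[-train.idx])          # out-of-sample accuracy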


Picture 13: Picture of a k-NN Model


The figure shows a classification problem for a new observation. Based on the distance to its 9 nearest neighbors it is classified red or blue.


Simulation Study

For the simulation I randomly chose 2/3 of the Iris dataset as training data to estimate each model. Then I used the complementary third of the data to validate the model. I used all available variables for this, and I used the classification methods with their standard settings, without optimizing their parameters. I repeated this 500 times. In the table and the figure below you can see the prediction accuracy and the standard deviation of the accuracy; a good classifier should show high accuracy and minimal variance. In the second part I visualize the results for some variable combinations, where we can see the degree of complexity of each classifier and which regions are more likely to belong to a certain species. Since the Weka classifiers did not perform well, and due to the limited space in this seminar work, I just put some Weka results in the appendix. A minimal sketch of one simulation loop is given below.
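
This is not the original simulation code, only an illustration of the procedure described above for a single method (LDA from the MASS package); the other classifiers were handled analogously:

library(MASS)
acc <- numeric(500)
for (r in 1:500) {
  train  <- sample(1:150, 100)                  # 2/3 of the data to estimate the model
  fit    <- lda(Species ~ ., data = iris[train, ])
  pred   <- predict(fit, iris[-train, ])$class  # validate on the remaining 1/3
  acc[r] <- mean(pred == iris$Species[-train])
}
mean(acc)   # accuracy
sd(acc)     # standard deviation of the accuracy, as reported in Table 16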

Results

Table 16: Simulation Results


From the table we can see that the two simplest models perform best. SVM also performs well. CART performs poorly, since you have to prune the classification tree for out-of-sample fit, and the question arises which tree size is optimal. The Weka classifiers do not perform so well on this benchmark dataset.

Picture 14: Simulation Results


From this picture we can see, for example, that neural nets and CART are quite unstable in out-of-sample fit. This may come from in-sample overfitting.

Linear Discriminant Analysis

<Rform name="lars">

var1: <input name="var1" type="text" size="1" value="2"> var2: <input name="var2" type="text" size="1" value="1"> <input type="submit" value=" Submit "> (1=Sepal Length 2=Sepal Width 3=Petal Length 4=Petal Width). </Rform>

<R output="display" name="lars">
library('lattice')
library('nnet')
library(MASS)
trellisSK(rpdf, width=7, height=7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1

# grid covering the variable range; every grid point gets the colour of its predicted class
xseq   <- seq(0, 10, length = 100)
newdes <- data.frame(var1 = 0.8 * rep(xseq, 100), var2 = 0.8 * rep(xseq, each = 100))
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]
par(mfrow = c(2, 2))

# four panels, each estimated on a fresh random learning sample of 2/3 of the data
for (panel in 1:4) {
  val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
  learn <- iris[-val, c(var1, var2, 5)]
  valid <- iris[val, c(var1, var2, 5)]
  colnames(learn)[1:2] <- c("var1", "var2")
  lda.des <- lda(Species ~ var1 + var2, data = learn)
  newcol  <- basiscolors[predict(lda.des, newdes)$class]
  plot(newdes[, 1], newdes[, 2], col = newcol, pch = '°',
       xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
  points(learn$var1[learn$Species == "setosa"],     learn$var2[learn$Species == "setosa"],     col = "white",  pch = 'S')
  points(learn$var1[learn$Species == "versicolor"], learn$var2[learn$Species == "versicolor"], col = "yellow", pch = 'V')
  points(learn$var1[learn$Species == "virginica"],  learn$var2[learn$Species == "virginica"],  col = "pink",   pch = 'V')
  title("LDA")
}
</R>

We can see that linear discriminant analysis produces very clear results which are easy to interpret. The results can easily be presented to people who are not involved in the topic, e.g. customers.


Quadratic Discriminant Analysis

<Rform name="lars">

var1: <input name="var1" type="text" size="1" value="2"> var2: <input name="var2" type="text" size="1" value="1"> <input type="submit" value=" Submit "> (1=Sepal Length 2=Sepal Width 3=Petal Length 4=Petal Width). </Rform>

<R output="display" name="lars">
library('lattice')
library('nnet')
library(MASS)
trellisSK(rpdf, width=7, height=7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1

# grid covering the variable range; every grid point gets the colour of its predicted class
xseq   <- seq(0, 10, length = 100)
newdes <- data.frame(var1 = 0.8 * rep(xseq, 100), var2 = 0.8 * rep(xseq, each = 100))
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]
par(mfrow = c(2, 2))

# four panels, each estimated on a fresh random learning sample of 2/3 of the data
for (panel in 1:4) {
  val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
  learn <- iris[-val, c(var1, var2, 5)]
  valid <- iris[val, c(var1, var2, 5)]
  colnames(learn)[1:2] <- c("var1", "var2")
  lda.des <- qda(Species ~ var1 + var2, data = learn)
  newcol  <- basiscolors[predict(lda.des, newdes)$class]
  plot(newdes[, 1], newdes[, 2], col = newcol, pch = '°',
       xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
  points(learn$var1[learn$Species == "setosa"],     learn$var2[learn$Species == "setosa"],     col = "white",  pch = 'S')
  points(learn$var1[learn$Species == "versicolor"], learn$var2[learn$Species == "versicolor"], col = "yellow", pch = 'V')
  points(learn$var1[learn$Species == "virginica"],  learn$var2[learn$Species == "virginica"],  col = "pink",   pch = 'V')
  title("QDA")
}
</R>

We can see that quadratic discriminant analysis does not produce such clear results. They are harder to interpret and cannot be presented as easily to people not involved in the topic. Some regions far away from the relevant observations are still classified into one of the groups.


Neural Net

<Rform name="lars">

var1: <input name="var1" type="text" size="1" value="2"> var2: <input name="var2" type="text" size="1" value="1"> <input type="submit" value=" Submit "> (1=Sepal Length 2=Sepal Width 3=Petal Length 4=Petal Width). </Rform>

<R output="display" name="lars">
library('lattice')
library('nnet')
trellisSK(rpdf, width=7, height=7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1

# grid covering the variable range; every grid point gets the colour of its predicted class
xseq   <- seq(0, 10, length = 100)
newdes <- data.frame(var1 = 0.8 * rep(xseq, 100), var2 = 0.8 * rep(xseq, each = 100))
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]
par(mfrow = c(2, 2))

# four panels, each estimated on a fresh random learning sample of 2/3 of the data
for (panel in 1:4) {
  val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
  learn <- iris[-val, c(var1, var2, 5)]
  valid <- iris[val, c(var1, var2, 5)]
  colnames(learn)[1:2] <- c("var1", "var2")
  lda.des <- nnet(Species ~ var1 + var2, data = learn, size = 8, maxit = 500)  # single hidden layer with 8 units
  newcol  <- basiscolors[as.factor(predict(lda.des, newdes, type = "class"))]
  plot(newdes[, 1], newdes[, 2], col = newcol, pch = '°',
       xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
  points(learn$var1[learn$Species == "setosa"],     learn$var2[learn$Species == "setosa"],     col = "white",  pch = 'S')
  points(learn$var1[learn$Species == "versicolor"], learn$var2[learn$Species == "versicolor"], col = "yellow", pch = 'V')
  points(learn$var1[learn$Species == "virginica"],  learn$var2[learn$Species == "virginica"],  col = "pink",   pch = 'V')
  title("nnet")
}
</R>

We can see that the neural net produces quite good results. I would have expected more complicated structures, but we already know that the outcome is not very stable.

Support Vector Machines

<Rform name="lars">

var1: <input name="var1" type="text" size="1" value="2"> var2: <input name="var2" type="text" size="1" value="1"> <input type="submit" value=" Submit "> (1=Sepal Length 2=Sepal Width 3=Petal Length 4=Petal Width). </Rform>

<R output="display" name="lars">
library('lattice')
library('e1071')
trellisSK(rpdf, width=7, height=7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1

# grid covering the variable range; every grid point gets the colour of its predicted class
xseq   <- seq(0, 10, length = 100)
newdes <- data.frame(var1 = 0.8 * rep(xseq, 100), var2 = 0.8 * rep(xseq, each = 100))
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]
par(mfrow = c(2, 2))

# four panels, each estimated on a fresh random learning sample of 2/3 of the data
for (panel in 1:4) {
  val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
  learn <- iris[-val, c(var1, var2, 5)]
  valid <- iris[val, c(var1, var2, 5)]
  colnames(learn)[1:2] <- c("var1", "var2")
  lda.des <- svm(Species ~ var1 + var2, data = learn, kernel = "radial")
  newcol  <- basiscolors[as.factor(predict(lda.des, newdata = newdes))]
  plot(newdes[, 1], newdes[, 2], col = newcol, pch = '°',
       xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
  points(learn$var1[learn$Species == "setosa"],     learn$var2[learn$Species == "setosa"],     col = "white",  pch = 'S')
  points(learn$var1[learn$Species == "versicolor"], learn$var2[learn$Species == "versicolor"], col = "yellow", pch = 'V')
  points(learn$var1[learn$Species == "virginica"],  learn$var2[learn$Species == "virginica"],  col = "pink",   pch = 'V')
  title("svm")
}
</R>


If the e1071 library (which provides svm) is not installed, please look in the appendix for the picture. We can see that SVM produces very complicated decision boundaries, but we also see that the results are quite stable. It can be used for interpretation.

Multinomial Logit

<Rform name="lars">

var1: <input name="var1" type="text" size="1" value="2"> var2: <input name="var2" type="text" size="1" value="1"> <input type="submit" value=" Submit "> (1=Sepal Length 2=Sepal Width 3=Petal Length 4=Petal Width). </Rform>

<R output="display" name="lars">
library('lattice')
library('nnet')
trellisSK(rpdf, width=7, height=7)
if (exists("var1")) var1 <- as.numeric(var1) else var1 <- 2
if (exists("var2")) var2 <- as.numeric(var2) else var2 <- 1

# grid covering the variable range; every grid point gets the colour of its predicted class
xseq   <- seq(0, 10, length = 100)
newdes <- data.frame(var1 = 0.8 * rep(xseq, 100), var2 = 0.8 * rep(xseq, each = 100))
basiscolors <- c("blue", "green", "red")
leg.txt <- c("Setosa", "Versicolor", "Virginica")
m <- dim(iris)[1]
par(mfrow = c(2, 2))

# four panels, each estimated on a fresh random learning sample of 2/3 of the data
for (panel in 1:4) {
  val   <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
  learn <- iris[-val, c(var1, var2, 5)]
  valid <- iris[val, c(var1, var2, 5)]
  colnames(learn)[1:2] <- c("var1", "var2")
  lda.des <- multinom(Species ~ var1 + var2, data = learn)
  newcol  <- basiscolors[predict(lda.des, newdes, type = "class")]
  plot(newdes[, 1], newdes[, 2], col = newcol, pch = '°',
       xlab = colnames(iris)[var1], ylab = colnames(iris)[var2])
  points(learn$var1[learn$Species == "setosa"],     learn$var2[learn$Species == "setosa"],     col = "white",  pch = 'S')
  points(learn$var1[learn$Species == "versicolor"], learn$var2[learn$Species == "versicolor"], col = "yellow", pch = 'V')
  points(learn$var1[learn$Species == "virginica"],  learn$var2[learn$Species == "virginica"],  col = "pink",   pch = 'V')
  title("multinom")
}
</R>

We can see that the multinomial logit looks very similar to the linear discriminant analysis. It can be used well for interpreting the results.
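The similarity to LDA is no coincidence: the multinomial logit is linear in the two features, so its class boundaries are straight lines in the (var1, var2) plane, just as for LDA. A minimal sketch for inspecting the fitted coefficients (assuming nnet; the variable pair is the default of the form above):

# minimal sketch: the multinomial logit is linear in the features (assumes nnet)
library(nnet)
fit <- multinom(Species ~ Sepal.Width + Sepal.Length, data = iris, trace = FALSE)
coef(fit)   # one row of log-odds coefficients per non-reference class;
            # a class boundary lies where two linear predictors are equal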

CART

<Rform name="lars">

var1: <input name="var1" type="text" size="1" value="2"> var2: <input name="var2" type="text" size="1" value="1"> <input type="submit" value=" Submit "> (1=Sepal Length 2=Sepal Width 3=Petal Length 4=Petal Width). </Rform>

<R output="display" name="lars"> library('lattice') library('tree') trellisSK(rpdf, width=7, height=7) if (exists("var1")) var1<-as.numeric(var1) else var1<- 2 if (exists("var2")) var2<-as.numeric(var2) else var2<- 1 par(mfrow = c(2, 2)) xseq<-seq(0,10,length=100); xvech<-rep(xseq,100); yvech<-rep(xseq,each=100); newdes<-cbind(0.8*xvech,0.8*yvech); basiscolors=c("blue","green","red") leg.txt <- c("Setosa","Versicolor","Virginica") m <- dim(iris)[1]

val <- sample(1:m, size = round(m/3), replace = FALSE,prob = rep(1/m, m)) learn <- iris[-val,c(var1,var2,5)] valid <- iris[val,c(var1,var2,5)] colnames(learn)[1]<-c("var1") colnames(learn)[2]<-c("var2") lda.des <- tree(Species ~ var1 + var2 , data=learn,split="gini") newdes<-as.data.frame(newdes); names(newdes)<-c("var1","var2"); newcol<-basiscolors[predict(lda.des,newdes, type = "class")]; par(mfrow = c(2, 2)) plot(newdes[,1],newdes[,2],col=newcol,pch='°',xlab=colnames(iris)[var1], ylab=colnames(iris)[var2]); points(learn$var1[learn$Species==c("setosa")],learn$var2[learn$Species==c("setosa")],col="white",pch='S') points(learn$var1[learn$Species==c("versicolor")],learn$var2[learn$Species==c("versicolor")],col="yellow",pch='V') points(learn$var1[learn$Species==c("virginica")],learn$var2[learn$Species==c("virginica")],col="pink",pch='V') title("CART")


val <- sample(1:m, size = round(m/3), replace = FALSE,prob = rep(1/m, m)) learn <- iris[-val,c(var1,var2,5)] valid <- iris[val,c(var1,var2,5)] colnames(learn)[1]<-c("var1") colnames(learn)[2]<-c("var2") lda.des <- tree(Species ~ var1 + var2 , data=learn,split="gini") newdes<-as.data.frame(newdes); names(newdes)<-c("var1","var2"); newcol<-basiscolors[predict(lda.des,newdes, type = "class")]; plot(newdes[,1],newdes[,2],col=newcol,pch='°',xlab=colnames(iris)[var1], ylab=colnames(iris)[var2]); points(learn$var1[learn$Species==c("setosa")],learn$var2[learn$Species==c("setosa")],col="white",pch='S') points(learn$var1[learn$Species==c("versicolor")],learn$var2[learn$Species==c("versicolor")],col="yellow",pch='V') points(learn$var1[learn$Species==c("virginica")],learn$var2[learn$Species==c("virginica")],col="pink",pch='V') title("CART")


val <- sample(1:m, size = round(m/3), replace = FALSE,prob = rep(1/m, m)) learn <- iris[-val,c(var1,var2,5)] valid <- iris[val,c(var1,var2,5)] colnames(learn)[1]<-c("var1") colnames(learn)[2]<-c("var2") lda.des <- tree(Species ~ var1 + var2 , data=learn,split="gini") newdes<-as.data.frame(newdes); names(newdes)<-c("var1","var2"); newcol<-basiscolors[predict(lda.des,newdes, type = "class")]; plot(newdes[,1],newdes[,2],col=newcol,pch='°',xlab=colnames(iris)[var1], ylab=colnames(iris)[var2]); points(learn$var1[learn$Species==c("setosa")],learn$var2[learn$Species==c("setosa")],col="white",pch='S') points(learn$var1[learn$Species==c("versicolor")],learn$var2[learn$Species==c("versicolor")],col="yellow",pch='V') points(learn$var1[learn$Species==c("virginica")],learn$var2[learn$Species==c("virginica")],col="pink",pch='V') title("multinom")

val <- sample(1:m, size = round(m/3), replace = FALSE,prob = rep(1/m, m)) learn <- iris[-val,c(var1,var2,5)] valid <- iris[val,c(var1,var2,5)] colnames(learn)[1]<-c("var1") colnames(learn)[2]<-c("var2") lda.des <- tree(Species ~ var1 + var2 , data=learn,split="gini") newdes<-as.data.frame(newdes); names(newdes)<-c("var1","var2"); newcol<-basiscolors[predict(lda.des,newdes, type = "class")]; plot(newdes[,1],newdes[,2],col=newcol,pch='°',xlab=colnames(iris)[var1], ylab=colnames(iris)[var2]); points(learn$var1[learn$Species==c("setosa")],learn$var2[learn$Species==c("setosa")],col="white",pch='S') points(learn$var1[learn$Species==c("versicolor")],learn$var2[learn$Species==c("versicolor")],col="yellow",pch='V') points(learn$var1[learn$Species==c("virginica")],learn$var2[learn$Species==c("virginica")],col="pink",pch='V') title("CART")

</R>

We can see that the CART model produces very interpretable results. Unfortunately, the out-of-sample fit is not good if we do not prune the classification tree. If the library tree is not installed, please look in the appendix for the picture.
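Pruning can be done with the cost-complexity machinery of the tree package: cv.tree estimates the misclassification rate for every subtree size by cross-validation, and prune.misclass cuts the tree back to the chosen size. The following is a minimal sketch; the seed and the variable pair are illustrative choices, not part of the original study.

# minimal sketch: cost-complexity pruning of a classification tree (assumes the tree package)
library(tree)
set.seed(1)
m      <- nrow(iris)
val    <- sample(1:m, size = round(m/3))
learn  <- iris[-val, c("Sepal.Width", "Sepal.Length", "Species")]
fit    <- tree(Species ~ ., data = learn, split = "gini")
cv     <- cv.tree(fit, FUN = prune.misclass)   # cross-validated misclassifications per tree size
best   <- cv$size[which.min(cv$dev)]           # tree size with the smallest CV error
pruned <- prune.misclass(fit, best = best)
plot(pruned); text(pruned)                     # the pruned tree is easier to interpret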

Stability

In this section I want to find out about the stability of the predicted regions. I coloured the regions according to the frequencies of the predicted classes. Due to the computation time I repeated this about 150 times. Regions whose colours are clear are predicted the same every time. The worst result would be a mixture of colours in all regions. Click on the pictures to get the R source.
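The colouring can be reproduced in spirit as follows: in every run, fit the classifier on a fresh random learning sample, predict the class of every grid point, count how often each class wins, and mix the three basic colours according to these relative frequencies. Below is a minimal, self-contained sketch for LDA; the number of runs, the grid, the variable pair and the exact colour mixing are illustrative assumptions, not the code behind the pictures.

# minimal sketch of the stability colouring for LDA (assumes MASS);
# 150 runs, a 100 x 100 grid and the default variable pair (Sepal.Width, Sepal.Length)
library(MASS)
set.seed(1)
runs <- 150
m    <- nrow(iris)
xseq <- seq(0, 8, length = 100)
grid <- expand.grid(var1 = xseq, var2 = xseq)
freq <- matrix(0, nrow = nrow(grid), ncol = 3,
               dimnames = list(NULL, levels(iris$Species)))   # win counts per grid point

for (r in 1:runs) {
  val   <- sample(1:m, size = round(m/3))
  learn <- iris[-val, c(2, 1, 5)]                              # Sepal.Width, Sepal.Length, Species
  colnames(learn)[1:2] <- c("var1", "var2")
  fit   <- lda(Species ~ var1 + var2, data = learn)
  cls   <- predict(fit, newdata = grid)$class
  idx   <- cbind(1:nrow(grid), as.integer(cls))
  freq[idx] <- freq[idx] + 1
}

# mix the basic colours (setosa = blue, versicolor = green, virginica = red)
# according to the relative frequencies of the predicted classes
mixed <- rgb(freq[, "virginica"]/runs, freq[, "versicolor"]/runs, freq[, "setosa"]/runs)
plot(grid$var1, grid$var2, col = mixed, pch = 15, cex = 0.6,
     xlab = "Sepal.Width", ylab = "Sepal.Length", main = "Stability of the LDA regions")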

Stability LDA

[Figure: stability of the LDA regions]

This is a very good result. Just the borders are a bit unstable, but in general the linear discriminant analysis is excellent in accuracy and stability.


Stability QDA

[Figure: stability of the QDA regions]
The QDA is stable for a few combinations of variables, but for some combinations it is very unstable due to its flexibility. I would not present this result to a customer.
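The flexibility comes from the fact that QDA estimates a separate covariance matrix for every class, which yields quadratic boundaries with more parameters to estimate from the small learning sample. A minimal sketch of such a fit (presumably qda from MASS was used; the variable pair is illustrative):

# minimal sketch: QDA estimates class-specific covariance structures (assumes MASS)
library(MASS)
fit <- qda(Species ~ Sepal.Width + Sepal.Length, data = iris)
dim(fit$scaling)   # 2 x 2 x 3: one scaling (covariance structure) per class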

Stability Nnet

[Figures: stability of the nnet regions]

This is the worst result. The regions are quite unstable, not only at the borders. An interesting result is that the predicted regions on average now look quite linear.

Stability CART

[Figure: stability of the CART regions]

This is also not a good result. The regions change often and there is a clear mixture of colours.

Stability Multinomial Model

[Figure: stability of the multinomial logit regions]

The result of the multinomial logit looks similar to the LDA. There is a bit more uncertainty at the borders, but it is a very good result.

Stability SVM

[Figure: stability of the SVM regions]

Surprisingly, the SVM is also very stable. It may take a bit of imagination to present it to a customer, but the results are very good.

Conclusions

From the simulation we can conclude that the simplest models are superior to the complicated ones. The SVM performs very well with its standard settings. Results after 500 runs can be considered stable. In the RWeka library you can implement your own rules and classifiers. You can also use, e.g., cross-validation to optimize the parameters of the neural net and the support vector machine. But do you really think you can perform better that way? If we have no further information about the data and do not know anything about its structure, we should not use complicated models with many degrees of freedom like nnet or the weka classifiers.
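For the cross-validation mentioned above, one possibility is the tune.svm function from e1071, which searches a parameter grid by 10-fold cross-validation. A minimal sketch (the grids for gamma and cost are illustrative choices):

# minimal sketch: choose gamma and cost of the SVM by 10-fold cross-validation (assumes e1071)
library(e1071)
set.seed(1)
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 10^(-2:1), cost = 10^(0:2))
tuned$best.parameters    # gamma/cost combination with the smallest CV error
tuned$best.performance   # the corresponding cross-validated error rate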


Appendix

In the appendix you find the R code which I have not used for this thesis or have not integrated into the program, e.g. the tables of the test results.

#pie chart
pie.sales <- c(1/3,1/3,1/3) 
names(pie.sales) =c("setosa" ,  "versicolor", "virginica")
pie(pie.sales,col=c("blue","green","red"),main="Piechart Classes Iris Dataset")


#Tests for Normality
normaltests <- function(x) {

   myMittelwert <- mean(x)              # sample mean
   myStandardabweichung <- sd(x)        # sample standard deviation
   # Kolmogorov-Smirnov test against a normal distribution with the estimated parameters
   t <- ks.test(x, pnorm, sd = myStandardabweichung, mean = myMittelwert)
   print(t)
   # Shapiro-Wilk test of normality
   t <- shapiro.test(x)
   print(t)
}

s=c("setosa")

normaltests(iris$Petal.Width[iris$Species==c(s)])
normaltests(iris$Petal.Length[iris$Species==c(s)])
normaltests(iris$Sepal.Width[iris$Species==c(s)])
normaltests(iris$Sepal.Length[iris$Species==c(s)])

v1=c("versicolor")
normaltests(iris$Petal.Width[iris$Species==c(v1)])
normaltests(iris$Petal.Length[iris$Species==c(v1)])
normaltests(iris$Sepal.Width[iris$Species==c(v1)])
normaltests(iris$Sepal.Length[iris$Species==c(v1)])

v2=c("virginica")
normaltests(iris$Petal.Width[iris$Species==c(v2)])
normaltests(iris$Petal.Length[iris$Species==c(v2)])
normaltests(iris$Sepal.Width[iris$Species==c(v2)])
normaltests(iris$Sepal.Length[iris$Species==c(v2)])

#Hypothesis Testing

standardtests <- function(n, m)
{
        # compare species n and m on all four measurements
        for (s in 1:4)
        {
                z  <- as.vector(iris[, s][iris$Species == n])
                z1 <- as.vector(iris[, s][iris$Species == m])
                ftest <- var.test(z, z1)      # F-test for equal variances
                ttest <- t.test(z, z1)        # t-test for equal means
                wtest <- wilcox.test(z, z1)   # Wilcoxon rank-sum test (nonparametric)
                #write.csv2(f,"C:/f-test.csv",append=TRUE)
                print(data.frame(Statistic = c(ftest$statistic, ttest$statistic, wtest$statistic),
                                 P = c(ftest$p.value, ttest$p.value, wtest$p.value),
                                 row.names = c("Equal Variances", "Equal Means", "Nonparametric")))
        }
}

#setosa vs. versicolor

n=c("setosa")
m=c("versicolor")

standardtests(n,m)

#setosa vs. virginica

n=c("setosa")
m=c("virginica")

standardtests(n,m)

#versicolor vs. virginica

n=c("setosa")
m=c("virginica")

standardtests(n,m)

[Appendix figures: classification plots referenced in the text (thumbnails not available)]

Comments

  • The report should have been decomposed into several parts so that the page does not take several minutes to load!
  • Why not a scatterplot matrix?
  • Why is the text after Picture 3 repeated by Picture 4?
  • Shapiro-Wilk: what is x_{(i)}?
  • What do I see in the tables? p-values?
  • If I do a test on all data and later restrict it to some subgroups, then I will increase the acceptance of the H_0. This means I will most probably accept the null hypothesis for the subgroups.
  • Not all models for classification are explained
  • What is the importance of the third, fourth, ... digit after the comma in the standard deviation?