Sunday, February 12, 2012

Optimal decision tree based multi-class support vector machine

In this paper, a decision tree SVM architecture is constructed to solve multi-class problems. To maintain high generalization ability, the optimal structure of the decision tree is determined using statistical measures of class separability. The proposed optimal decision tree SVM (ODT-SVM) takes advantage of both the efficient computation of the decision tree architecture and the high classification accuracy of SVM. A robust non-parametric test is carried out for statistical comparison of the proposed ODT-SVM with other classifiers over multiple data sets. Performance is evaluated in terms of classification accuracy and computation time. The statistical analysis on UCI repository data sets indicates that the ten-fold cross-validation accuracy of the proposed framework is significantly better than that of widely used multi-class classifiers. Experimental results and statistical tests show that the proposed ODT-SVM is significantly better than conventional OvO and OAA in terms of both training and testing time.

Povzetek (Summary): The decision tree method with SVM achieves significantly better results than the original SVM.

Keywords: support vector machine, decision tree, class separability, information gain, Gini index, Chi-squared, interclass scatter, intraclass scatter.

1 Introduction

Support Vector Machine (SVM) has proven to be a successful learning machine in the literature, especially for classification. SVM is based on the statistical learning theory developed by Vapnik [6, 25]. Since it was originally designed for binary classification [3], it is not easy to extend the binary SVM to multi-class problems. Constructing k-class SVMs (k > 2) is an ongoing research issue [1, 4]. Two approaches are suggested in the literature to solve multi-class SVM. One considers all the data in a single optimization formulation [7]. The other decomposes the multi-class problem into a series of binary SVMs, such as "One-Against-All" (OAA) [25] and "One-versus-One" (OvO) [16]; a sketch of these two decomposition schemes is given below.
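As an illustration of the two decomposition schemes, the following is a minimal sketch using scikit-learn's OneVsOneClassifier and OneVsRestClassifier; the library, kernel, and data set are illustrative choices rather than part of the paper, whose ODT-SVM instead arranges its binary SVMs in a tree.

# Minimal sketch of OvO and OAA decomposition (illustrative; not the paper's ODT-SVM).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # k = 3 classes

# OvO trains k(k-1)/2 binary SVMs, one per pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
# OAA trains k binary SVMs, each separating one class from the rest.
oaa = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

print(ovo.score(X, y), oaa.score(X, y))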
$IG(C \mid E) = H(C) - H(C \mid E)$ (4)

where $IG(C \mid E)$ is the information gain of the class label $C$ for a given attribute $E$, $H(C)$ is the system's entropy, and $H(C \mid E)$ is the system's relative entropy when the value of the attribute $E$ is known.

The system's entropy indicates its degree of disorder and is given by the following formula

$H(C) = -\sum_{i} P(C_i) \log_2 P(C_i)$ (5)

where $P(C_i)$ is the probability of class $C_i$. The relative entropy is calculated as

$H(C \mid E) = -\sum_{j} P(E_j) \sum_{i} P(C_i \mid E_j) \log_2 P(C_i \mid E_j)$ (6)

where $P(E_j)$ is the probability of value $j$ of attribute $E$, and $P(C_i \mid E_j)$ is the probability of class $C_i$ given that attribute $E$ takes value $j$.
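As a concrete illustration of Eqs. (4)-(6), the following is a minimal sketch for a discrete attribute; the function names and plain-Python style are assumptions for illustration, not taken from the paper.

import math
from collections import Counter

def entropy(labels):
    # H(C) = -sum_i P(C_i) log2 P(C_i), Eq. (5)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    # IG(C | E) = H(C) - H(C | E), Eqs. (4) and (6)
    n = len(labels)
    h_c_given_e = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        h_c_given_e += (len(subset) / n) * entropy(subset)  # P(E_j) * H(C | E_j)
    return entropy(labels) - h_c_given_e

For example, info_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1]) returns 1.0 bit, since the attribute determines the class completely.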

The optimal binary SVM model is selected on the basis of the maximum value of the information gain $IG$, which signifies greater separability between patterns belonging to two different classes. For a given independent binary SVM containing $n_i$ elements of $C_i$ and $n_j$ elements of $C_j$, $IG$ can be calculated as

$IG = H(C) - H(C \mid SVM_{ij})$ (7)

$H(C \mid SVM_{ij}) = -\frac{TP + FP}{n_i + n_j}\left[\frac{TP}{TP + FP}\log_2\frac{TP}{TP + FP} + \frac{FP}{TP + FP}\log_2\frac{FP}{TP + FP}\right] - \frac{FN + TN}{n_i + n_j}\left[\frac{FN}{FN + TN}\log_2\frac{FN}{FN + TN} + \frac{TN}{FN + TN}\log_2\frac{TN}{FN + TN}\right]$ (8)

$P(C_i) = \frac{n_i}{n_i + n_j} \quad \text{and} \quad P(C_j) = \frac{n_j}{n_i + n_j}$ (9)

where $TP$, $FP$, $TN$ and $FN$ denote the number of true positive, false positive, true negative and false negative data points, respectively, and $H(C)$ is obtained from Eq. (5) with the class priors in Eq. (9).

A higher value of $IG$ signifies less overlap, or more distance, between the two classes of data points. Hence, $IG$ is a natural measure for determining the class separability of different classes of data points.
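The following is a minimal sketch of Eqs. (7)-(9), computing $IG$ directly from the confusion counts of a trained binary SVM; the helper names are illustrative assumptions.

import math

def _h(probs):
    # Shannon entropy in bits; zero-probability terms contribute 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain_svm(tp, fp, tn, fn):
    # IG = H(C) - H(C | SVM), Eq. (7), with class priors from Eq. (9)
    n = tp + fp + tn + fn
    n_i, n_j = tp + fn, fp + tn          # true sizes of C_i and C_j
    h_c = _h([n_i / n, n_j / n])         # prior entropy H(C)
    pos, neg = tp + fp, tn + fn          # points predicted as C_i / C_j
    h_c_svm = 0.0                        # conditional entropy, Eq. (8)
    if pos:
        h_c_svm += (pos / n) * _h([tp / pos, fp / pos])
    if neg:
        h_c_svm += (neg / n) * _h([tn / neg, fn / neg])
    return h_c - h_c_svm

A perfect separator (fp = fn = 0) attains the maximum value H(C), while a classifier whose output is independent of the class yields IG near zero.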

Similarly, for every independent binary OAA SVM, assume there are two classes, $C_i$ and $\bar{C}_i$ (all remaining classes combined), and that the training set $D$ of size $n$ contains $n_i$ elements of $C_i$ and $n - n_i$ elements of $\bar{C}_i$. $IG_i$ for a given OAA SVM model $i$ can be calculated as

$IG_i = H(C) - H(C \mid SVM_i)$ (19)

where $H(C)$ and $H(C \mid SVM_i)$ are given as

$H(C) = -\frac{n_i}{n}\log_2\frac{n_i}{n} - \frac{n - n_i}{n}\log_2\frac{n - n_i}{n}$ (20)

$H(C \mid SVM_i) = -\frac{TP + FP}{n}\left[\frac{TP}{TP + FP}\log_2\frac{TP}{TP + FP} + \frac{FP}{TP + FP}\log_2\frac{FP}{TP + FP}\right] - \frac{FN + TN}{n}\left[\frac{FN}{FN + TN}\log_2\frac{FN}{FN + TN} + \frac{TN}{FN + TN}\log_2\frac{TN}{FN + TN}\right]$ (21)

with the confusion counts taken with $C_i$ as the positive class and $\bar{C}_i$ as the negative class; a sketch of using this score for model selection follows.
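Under the same assumptions, a hypothetical helper (reusing info_gain_svm from the sketch above; the name and dict layout are illustrative) would rank the k OAA models by this score:

def select_best_oaa(confusions):
    # confusions: dict mapping class i -> (tp, fp, tn, fn) for OAA SVM i
    return max(confusions, key=lambda i: info_gain_svm(*confusions[i]))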

Using the kernel trick [7], data points from the two classes under consideration are implicitly mapped from $R^d$ to a high-dimensional feature space $H$. Let $\phi(x): R^d \rightarrow H$ denote the mapping and $K(x_u, x_v; \theta) = \langle \phi(x_u), \phi(x_v) \rangle$ denote the kernel function, where $\theta$ is the set of kernel parameters and $\langle \cdot, \cdot \rangle$ is the inner product. $K$ denotes the kernel matrix, with entry $\{K\}_{u,v}$ defined as $K(x_u, x_v; \theta)$. Let $K_{A,B}$ be the kernel matrix computed with the data points from $A$ and $B$, which denote two subsets of the training sample set $D$. Let $S_b^{\phi}$ and $S_w^{\phi}$ denote the between-class scatter matrix and the within-class scatter matrix in $H$, respectively, defined as follows

$S_b^{\phi} = \sum_{c \in \{i, j\}} n_c \,(m_c^{\phi} - m^{\phi})(m_c^{\phi} - m^{\phi})^T$ (22)

$S_w^{\phi} = \sum_{c \in \{i, j\}} \sum_{x \in C_c} (\phi(x) - m_c^{\phi})(\phi(x) - m_c^{\phi})^T$ (23)

where $m_c^{\phi}$ is the mean of the mapped points of class $C_c$ in $H$ and $m^{\phi}$ is the mean of all mapped training points.
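Both traces can be evaluated without forming $\phi(x)$ explicitly, using the standard kernel identities $\operatorname{tr}(S_w^{\phi}) = \operatorname{tr}(K) - \sum_c \frac{1}{n_c}\mathbf{1}^T K_{C_c,C_c}\mathbf{1}$ and $\operatorname{tr}(S_b^{\phi}) = \sum_c \frac{1}{n_c}\mathbf{1}^T K_{C_c,C_c}\mathbf{1} - \frac{1}{n}\mathbf{1}^T K \mathbf{1}$. The following is a minimal sketch of the resulting separability score; the RBF kernel, the ratio tr(S_b)/tr(S_w), and the function names are illustrative assumptions, and the paper may use a different kernel or weighting.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[u, v] = exp(-gamma * ||a_u - b_v||^2)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_separability(X, y, gamma=1.0):
    # Class separability tr(S_b) / tr(S_w), computed implicitly in H
    K = rbf_kernel(X, X, gamma)
    n = len(y)
    tr_sb = -K.sum() / n                 # minus the (1/n) 1^T K 1 term
    tr_sw = np.trace(K)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        block = K[np.ix_(idx, idx)].sum() / len(idx)  # (1/n_c) 1^T K_{c,c} 1
        tr_sb += block
        tr_sw -= block
    return tr_sb / tr_sw

A larger ratio indicates classes that are more spread apart relative to their internal scatter in $H$, which is the sense in which Eqs. (22)-(23) measure class separability.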
