CLUSTERING-BASED DECISION TREE CLASSIFIER CONSTRUCTION

This article studies data structure investigation possibilities using cluster analysis. Density structures within classes are explored to implement class decomposition in order to enhance performance of decision tree classifiers. Classes are decomposed using cluster analysis and cluster merge evaluation using decision tree classifiers. Then the impact of class decomposition is shown on C4.5 and CART classifiers. The main focus is on experiments carried out with real-valued data sets. The experiments are described in a step-by-step manner to illustrate the patterns discovered which affect previously proposed patterns in class decomposition methodology.


Introduction
The task described in this article is a classification task: it deals with allocating objects to predefined classes using a classification model. Classifiers can be either linear (e.g., the Naïve Bayes classifier) or non-linear (e.g., decision trees). A classification model is considered linear if it divides the whole attribute space into classes using hyperplanes (each class is separated by a hyperplane). This causes a problem if classes cannot be separated using hyperplanes (see Fig. 1 for an example), which increases the error and complexity of the classification model. In these cases more sophisticated non-linear models are used, but they are more resource-consuming; it is therefore important to broaden the possibilities of using linear classifiers by improving their efficiency on non-linear cases. Decision trees divide the attribute space iteratively using hyperplanes that are orthogonal to one of the axes, thereby approximating non-linear dividing lines. Taking into account the similarity between decision tree construction and linear methods, we can assume that methods that improve the performance of linear classifiers can also be applied to decision tree construction. This work discusses the decision tree algorithms C4.5 and CART as well as cluster analysis for class decomposition in order to improve the performance of the classifiers.
Decision trees were proposed by J. Ross Quinlan in (Quinlan 1986), which describes the ID3 algorithm; ID3 was used as a basis for other decision tree classifiers that were created by changing evaluation functions and construction parameters. Algorithm C4.5 was proposed in (Quinlan 1993) and the CART algorithm was presented in (Breiman et al. 1984). Both algorithms divide the attribute space in a similar manner but they differ in tree structure, split criteria and pruning method. To improve these decision tree classifiers, we study the initial data structure of a training data set in order to gain information for the construction of more efficient decision trees. This is done by class decomposition (Vilalta et al. 2003), i.e., dividing the initial classes into smaller areas according to some similarity.
To explore the structure of a data set, we use hierarchical agglomerative clustering, which gives information about separate areas of high record density that are represented by clusters. The Ward method is preferred in these experiments because it produces compact clusters, which is significant for the class decomposition method. The distance between clusters shows how well these groups can be divided. Two groups that have a large distance between them can be separated using hyperplanes created by decision tree classifiers, and easily separable clusters have a smaller error.
Information about data structure (the character of clusters) is used in class decomposition by replacing the initial classes with cluster labels that can be mapped to the original class after classification.
However, a larger number of classes often means an increased error. For example, the point labeled 'x' in Figure 2 can be assigned label A1 or A2 (class A) or B2 (class B), the last of which would be wrong. If classes A1 and A2 are merged, it becomes obvious that this point belongs to class A. Due to these differences, the impact of different merges should be studied, but this is a very resource-consuming process because the number of possible combinations grows exponentially with each extra cluster. Moreover, clusters that are easily separable within one class are not necessarily the best choice for class decomposition because the performance differs depending on the character of overlapping areas. Vilalta et al. (2003) use a heuristic method based on the assumption that combinations with higher cardinality merges perform better in class decomposition, but this also depends on the structure of classes.

Problem statement
While working with decision trees, and algorithm C4.5 in particular, we came across the problem of significantly varying performance on various data sets. In the decision tree studies the C4.5 algorithm was applied to data sets with different data types and structures and produced noticeably different results even when data and tasks seemed similar. To find the cause of these differences, more should be learned about the behavior of decision trees and its dependency on the distribution of classes in the attribute space. Then performance of the algorithm could be improved by designing decision trees using meta-data acquired while investigating the attribute space.
Investigation of data structure in this case includes clustering each class of data to obtain information about the structure of areas with high data density and to find an optimal number of clusters within each class. These clusters are then merged to achieve a better division of the attribute space into classes by decision tree classifiers. This step is time- and resource-consuming; therefore we need to find a trade-off between the number of clusters and the achieved improvement in classifier performance.

Related work
The task is to analyze and describe the structure of input data sets in order to improve the efficiency of classifier training. The idea of a transition from global analysis of the input data structure to local analysis of data is outlined in (Fulton et al. 1996). The authors describe the so-called local approach to decision tree design. According to this approach, first a subset of the available data is formed by selecting examples that are in the locality of a given test example that is being classified. Then a classifier is built using the obtained subset. The input data set is thereby fragmented to some extent.
The previously described idea is similar to that behind the k-nearest neighbour method. Because every test example requires re-defining the subset of relevant examples used for classification, this allows for a kind of interactive approach to classifier building.
While implementing this method, the question of an appropriate similarity measure arises because it has to describe the specific character of the problem domain as well as the unique character of the new test example. Implementation of the algorithm involves a large amount of computation, although this may be acceptable in cases where accuracy of classification is crucial. Jiang et al. (2007) proposed a statistical approach to data structure analysis. The authors used supervised density estimation techniques that consider not only the location of a data point but also a variable of interest associated with the data point to model the distribution of the data points. They measure density as the product of an influence function with the variable of interest and use it for clustering with the Supervised Clustering Using Density Estimation algorithm, but it can also be used in classification algorithms that rely on decision boundaries extracted from the supervised density functions. Brazdil et al. (2003) presented a meta-learning method to support the selection of a classification algorithm based on specific characteristics of the structure of a data set. Data structure is explored using the k-nearest neighbour algorithm, which compares the characteristics of a given data set with those of standard data sets that have been previously studied. The features of previously studied data sets are related to a ranked set of classification algorithms. The choice of a certain classification algorithm is made taking various parameters into consideration.
There is also an approach that is based on analysis of geometrical features of the training set using shadows of fuzzy sets (Ozols and Borisov 2001). This approach allows evaluating the dependencies among attributes and constructing new efficient attributes that incorporate combinations of initial attributes based on these dependencies. In this approach the connection between the structure of a training set and the parameters of a classifier is explicit. Vilalta (2006) proposed an approach oriented towards preliminary identification of local high density areas in a training set using cluster analysis. The clusters discovered in cluster analysis are viewed as separate classes that are classified using a simple linear classifier. In this way the problem of variance of examples within a single class is overcome; however, another problem appears: the number of linear discriminants increases, which increases the complexity (number of clusters) of the training set.
A similar clustering approach was also used in Michalopoulos et al. (1999) and Looney (2002). Michalopoulos et al. used a combination of fuzzy c-means clustering and top-down induction of decision trees to produce accurate classification rules. They used clustering to learn the structure of data and used the obtained information in classification. Looney used an algorithm based on the fuzzy c-means algorithm to study the structure of data. He modified the algorithm to form fewer yet more uniform clusters than the k-means algorithm, using prototypes that are placed in dense areas and merging similar clusters.

Proposed approach
In (Vilalta 2006) the k-means algorithm is proposed for cluster analysis, but this algorithm is inconsistent in its performance because the initial centroids are chosen randomly when the location of clusters is not known in advance. This choice has a great effect on the result of clustering and the quality of its performance. In our approach we propose hierarchical agglomerative clustering for data structure analysis and algorithms C4.5 and CART as tools for decision tree classifier design.
In this work, we compare the effectiveness of a global classifier, i.e. a classifier that classifies data without further analysis of the initial data set, with that of a classifier whose classification is based on analysis of the geometrical features of the initial attribute space.
This paper is based on experimental research of classification tasks since, in the authors' opinion, the experimental research carried out previously is insufficient for defining relationships between the qualities of a training set and the effectiveness of classifiers. Only additional well-planned experiments can show the diversity of the links between characteristics of the original data set and the effectiveness of classifiers. So far it has been impossible to derive this kind of dependency theoretically; therefore we rely on experiments.

Methods used
In the presented experiments we used cluster analysis and decision tree classifiers to explore structures of data. To test features found in data exploration, decision tree classifiers were used to classify data with and without class decomposition.

Decision trees
For classification we use decision trees (Quinlan 1993); they are constructed from data by dividing a training set into subsets until the subsets can be assigned a class. Division is carried out by choosing an attribute and one of its values as a threshold. The choice of the attribute can be carried out using different criteria. The criterion used in algorithm C4.5 is usually information gain or gain ratio. Information gain is the change in entropy of information if the state of information is changed. Let $C$ be the class attribute with values $\{c_1, c_2, \ldots, c_n\}$ and $A$ an attribute with values $\{a_1, a_2, \ldots, a_k\}$, let $H(C)$ be the entropy of the attribute $C$, and let $H(C \mid A)$ be the conditional entropy that shows the entropy of $C$ if the state of attribute $A$ is known; information gain is:

$$I(C, A) = H(C) - H(C \mid A). \tag{1}$$

The entropy of attribute $C$ is:

$$H(C) = -\sum_{i=1}^{n} P(C = c_i) \log_2 P(C = c_i), \tag{2}$$

where $P(C = c_i)$ is the relative frequency of class value $c_i$. The conditional entropy is:

$$H(C \mid A) = -\sum_{j=1}^{k} P(A = a_j) \sum_{i=1}^{n} P(C = c_i \mid A = a_j) \log_2 P(C = c_i \mid A = a_j). \tag{3}$$

Information gain favors attributes with a higher number of values. To avoid that, gain ratio can be used. This criterion penalizes attributes with many values by dividing the information gain by the entropy of the attribute:

$$GR(C, A) = \frac{I(C, A)}{H(A)}, \tag{4}$$

where the entropy of attribute $A$ is calculated as follows:

$$H(A) = -\sum_{j=1}^{k} P(A = a_j) \log_2 P(A = a_j). \tag{5}$$

CART, in contrast, usually uses the Gini index as its splitting criterion. The Gini index is calculated as:

$$Gini(C) = 1 - \sum_{i=1}^{n} P(C = c_i)^2. \tag{6}$$

CART and C4.5 also have other differences, such as the pruning method and the handling of missing values (Kohavi, Quinlan 2002). Pruning examines subtrees of the whole tree and, where necessary, substitutes them with a leaf or a branch of the subtree. Both methods use tree pruning to avoid overfitting, because trees that classify the training set perfectly (every record is assigned to its correct class) tend to perform worse on unknown data, but they differ in their approach to pruning. C4.5 uses reduced error pruning, which analyzes whether replacing a subtree with a leaf leads to less error. This technique requires a separate data set for pruning, which can be a drawback, but it examines every subtree once and is much faster than other techniques (Quinlan 1987). CART uses the minimal cost complexity pruning technique, which assigns costs to subtrees based on the error from pruning and the size of the subtree (Kohavi, Quinlan 2002). This technique does not require a separate data set for pruning.
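The split criteria described above can be illustrated with a minimal Python sketch. It assumes a discrete attribute and uses standard textbook definitions; the function names and the toy data are hypothetical and are not taken from the C4.5 or CART implementations used in the experiments (for real-valued attributes both algorithms first turn a threshold into a binary attribute).

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(classes, attribute):
    """H(C|A): class entropy inside each attribute value, weighted by P(A = a)."""
    n = len(classes)
    groups = {}
    for c, a in zip(classes, attribute):
        groups.setdefault(a, []).append(c)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(classes, attribute):
    """I(C, A) = H(C) - H(C|A)."""
    return entropy(classes) - conditional_entropy(classes, attribute)


def gain_ratio(classes, attribute):
    """Information gain divided by the entropy of the attribute itself."""
    h_a = entropy(attribute)
    return information_gain(classes, attribute) / h_a if h_a > 0 else 0.0


def gini(classes):
    """Gini index: 1 minus the sum of squared class frequencies."""
    n = len(classes)
    return 1.0 - sum((c / n) ** 2 for c in Counter(classes).values())


# Hypothetical toy data: four records, two classes.
classes = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]   # attribute that separates the classes
record_id = [1, 2, 3, 4]         # unique value per record

print(information_gain(classes, perfect), gain_ratio(classes, perfect))
print(information_gain(classes, record_id), gain_ratio(classes, record_id))
print(gini(classes))
```

On this toy data both attributes reach the same information gain, but the gain ratio of the unique-valued attribute is halved, which is the penalty for many-valued attributes that the text refers to.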
To improve the error estimate, which is directly affected by the availability of test data, both methods use cross-validation. Without cross-validation, error is usually estimated using the holdout method, which partitions data into two mutually exclusive sets: a training set for model building and a test set for error estimation by applying the model to unknown data (Kohavi 1995). In k-fold cross-validation the initial data set is split into k subsets and the model is trained and tested k times. Each run is carried out using k-1 subsets as training data and the remaining subset, which is different for each run, for testing. Then the error is estimated as the average of the errors of all runs.
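The k-fold procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names, the majority-class stand-in classifier and the toy labels are assumptions, and the folds are built without shuffling or stratification, which a practical implementation would normally add.

```python
from collections import Counter


def k_fold_indices(n_records, k):
    """Split indices 0..n_records-1 into k disjoint, nearly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_records):
        folds[i % k].append(i)
    return folds


def cross_validation_error(records, labels, k, train, predict):
    """Average test error over k runs; each fold is the test set exactly once."""
    errors = []
    for test_idx in k_fold_indices(len(records), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        model = train([records[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        wrong = sum(predict(model, records[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k


# Hypothetical stand-in classifier: always predicts the majority training class.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


records = list(range(10))
labels = ["A"] * 7 + ["B"] * 3
cv_error = cross_validation_error(records, labels, 5,
                                  train_majority, predict_majority)
print(cv_error)
```

Any classifier can be plugged in through the `train` and `predict` arguments; the majority-class model is used here only to keep the sketch self-contained.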

Cluster analysis
Clustering finds groups of similar data either by partitioning data into k subsets (partitioning methods), by creating a hierarchical decomposition of the data (hierarchical methods), or by building groups of data based on density, grids or models (Han, Kamber 2006). There are many similarity measures (Deza, E.; Deza, M. 2009) that can be used for that purpose, for example, the Euclidean distance that is used in our approach. The Euclidean distance between two points $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_m)$ is:

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}.$$

Hierarchical clustering creates a hierarchy of data records and subsets either by dividing the whole data set until subsets reach a given size (divisive approach) or by agglomerating records into subsets until all records belong to a given number of sets or to one set (agglomerative approach). Division and merging are based on the distance between groups, called linkage, which can be calculated as the distance between the closest points or the center points of groups as well as the distance among all points of different groups (He 1999). In the presented experiments Ward's criterion (Ward 1963), which shows the increase in variance if groups are merged, is used for linkage.
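A naive sketch of agglomerative clustering with Ward's criterion is given below. It assumes the common formulation in which the cost of merging two clusters is the increase in the within-cluster sum of squared Euclidean distances; the function names and the toy points are hypothetical, and the brute-force search is meant for illustration rather than as a stand-in for the Orange implementation used in the experiments.

```python
from math import sqrt


def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * euclidean(centroid(a), centroid(b)) ** 2


def ward_clustering(points, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    merge_costs = []
    while len(clusters) > n_clusters:
        cost, i, j = min(
            (ward_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merge_costs.append(cost)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_costs


# Hypothetical toy data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merge_costs = ward_clustering(points, 3)
print(clusters)
print(merge_costs)
```

The list of merge costs plays the role of the link heights in a dendrogram: a large jump between consecutive costs indicates well separated groups.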

Experiments
In these experiments we used three real-world data sets. We carried out the same procedure with each data set to illustrate how it works on different data.

Data sets with real-valued attributes
The following data sets were used for experiments: 'Breast Cancer Wisconsin (Diagnostic)', 'Iris' and 'Parkinson's' from the University of California at Irvine Repository (Asuncion, Newman 2007).

Methodology
For all data sets the same steps for class decomposition are implemented and cross-validation is used for classification. To implement clustering, we use Orange 2.0, and for classification we use the WEKA collection of machine-learning algorithms.

Performance comparison on the Iris problem
The first data set discussed here is 'Iris' with five attributes: four real number attributes and one class attribute with three classes. This is a very popular data set whose classes are easily separable using two of the attributes (see Figs 3 and 4). Since the structure of the data is mainly known, it is challenging to see if the performance of decision tree classifiers can be improved by using class decomposition and if it leads to using more than two attributes in the tree structure.
Figures 3 and 4 show the difference between algorithms C4.5 and CART in practice. Although both algorithms use the same strategy for building the model (splitting the data into groups using the most appropriate attribute and its value), they use different techniques to choose attributes and threshold values for splitting nodes. Therefore they divide the attribute space into different sub-spaces belonging to classes.
The structure of separate classes can be analyzed using hierarchical clustering. Hierarchical clustering does not need any prior knowledge about the number of clusters, which is important for the primary precondition of the research (a task with unknown data), so it can be used to derive such information from data with a previously unknown structure. A positive feature of hierarchical clustering is the visualization of the cluster hierarchy in a dendrogram. It also makes the choice of the number of clusters easier because distances between clusters are shown as links between clusters and the point of their merge. Besides, clusters with the longest distance between them are easier to divide using hyperplanes that are orthogonal to attribute axes, as is done by most decision tree classifiers.
The optimal number of clusters for manual calculations is four (with 14 combinations of interest) because the number of combinations grows exponentially with each additional cluster. To show the process of class decomposition, each class is divided into four clusters using hierarchical clustering. This allows showing the differences between merges with different cardinalities while still keeping the amount of needed calculations fairly low. Then these four clusters are merged into different combinations for each class (see Table 1).
The best results are not always obtained for the highest cardinality merges, but the overall tendency confirms that higher cardinality gives better results: the average error (percentage of incorrectly classified records) for single merges with cardinality of two is 14.67 for class Setosa (minimal error for this cardinality is 10), 13.33 for Versicolor (minimal 14) and 11.67 for Virginica (minimal 6), whereas single merges with cardinality of three have the following average errors: 9.5 for class Setosa (minimal error for this cardinality is 2), 9 for Versicolor (minimal 6) and 6 for Virginica (minimal 0). Combinations with more than one merge with cardinality of two have a different error pattern than those with one merge of the same cardinality: the errors are sometimes lower than for combinations with one merge of higher cardinality (e.g., for class Virginica the minimal error is 0 with a combination with two merges of cardinality of four).
Using CART to evaluate separation of the merged clusters resulted in similar numbers, with slight changes that could influence the choice of combinations for classification. The best results for class Setosa were exactly the same as with the C4.5 classifier, results for class Versicolor were similar with the best combination being $[\{C_1, C_2, C_3\}, \{C_4\}]$, and there were few differences for class Virginica, where the best combination was $[\{C_1, C_2, C_3\}, \{C_4\}]$.
The misclassified examples outside the original classes (see Table 3) are concentrated in class Versicolor, where they are classified as Virginica. Therefore this result could be improved by choosing a different cluster combination, e.g. $[\{C_1, C_2, C_3\}, \{C_4\}]$ has the same result as $[\{C_1, C_4\}, \{C_2, C_3\}]$ using C4.5 and a higher one using CART. The results of using this combination are shown in the confusion matrix (Table 4). Changing the cluster combination for class Versicolor, which had the most misclassified examples, led to a slight improvement even though both combinations seemed equal.
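Reading the number of clusters off a dendrogram can be approximated programmatically by cutting the hierarchy at the largest jump between consecutive merge distances. The sketch below is one possible heuristic under that assumption, not the procedure used in the experiments, where the choice was made from the dendrograms; the function name and the example merge heights are hypothetical, and the heights are assumed to be sorted in ascending order.

```python
def clusters_by_largest_gap(merge_heights, max_clusters=None):
    """Pick the cluster count that cuts the dendrogram at its largest gap.

    merge_heights: linkage distances of successive merges, ascending;
    n points produce n - 1 merges.
    """
    n_points = len(merge_heights) + 1
    best_k, best_gap = 1, 0.0
    for t in range(len(merge_heights) - 1):
        # After merges 0..t have been applied, this many clusters remain.
        k = n_points - (t + 1)
        gap = merge_heights[t + 1] - merge_heights[t]
        if (max_clusters is None or k <= max_clusters) and gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Hypothetical merge heights for six records.
heights = [0.5, 0.5, 0.7, 6.0, 6.4]
print(clusters_by_largest_gap(heights))
print(clusters_by_largest_gap(heights, max_clusters=2))
```

The optional `max_clusters` bound reflects the practical limit discussed in the text: more clusters per class mean many more merge combinations to evaluate.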
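The growth in the number of cluster combinations can be checked by enumerating all ways of grouping the clusters of one class. The following sketch lists every set partition; the function name is hypothetical. Four clusters give 15 partitions in total, so the 14 combinations of interest mentioned above presumably exclude one trivial grouping, which is an assumption here since the text does not say which one.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new group of its own.
        yield [[first]] + partition


four = list(set_partitions(["C1", "C2", "C3", "C4"]))
print(len(four))
for k in range(1, 8):
    labels = ["C%d" % i for i in range(1, k + 1)]
    print(k, sum(1 for _ in set_partitions(labels)))
```

The counts are the Bell numbers (1, 2, 5, 15, 52, 203, 877 for one to seven clusters), which shows why exhaustive evaluation quickly becomes impractical beyond four clusters per class.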

Performance comparison on the Breast Cancer (Diagnostic) problem
Another data set is 'Breast Cancer Wisconsin (Diagnostic)' with 32 attributes: 1 label attribute, 30 real number attributes and 1 class attribute with the following classes: M (malignant) and B (benign).
According to the dendrograms (Figures 5 and 6) obtained by hierarchical clustering, class B can be divided into four easily distinguishable clusters and class M into three clusters. To explore the changes in error depending on the number of clusters, the same data (class M) was also divided into four clusters. These clusters are assigned new labels and then they are merged in different combinations. Exhaustive search was used to illustrate contradictions between a small error when dividing one class and the error in the final classification. Cluster combinations and the corresponding results of attribute space division (percentage of incorrectly classified records) for each class are given in Table 5. Four-cluster and corresponding three-cluster combinations are shown for class M.
In this case, the lowest classification errors are obtained for combinations with higher cardinality of merged clusters for both classifiers, C4.5 and CART. The best combinations using C4.5 and using CART also coincide. Results for CART are less scattered but have a higher classification error. The combinations with the lowest classification error (for class M there is one combination of three clusters and one of four clusters) are then chosen for the next step: class names are replaced with cluster labels (representing the original class name and cluster combination; for example, for class B and combination $[\{C_1, C_2, C_3\}, \{C_4\}]$ the cluster labels B_123 and B_4 are used), which are then treated as separate classes in further classification.
Since there are two combinations chosen for each class, combinations of these options are used to form the output data set. The results of classification are shown in Table 6. The combinations are fitted to get the best result (see the last line of Table 6 for CART). From Table 5 and Table 6 one can see that the best scores for combinations in the classification of subclasses of one class do not correspond to the best scores in the final classification, although the difference is not significant for classifier C4.5. Although the dendrograms of both classes (Figures 6 and 5) show that there are easily separable clusters (three for class 'Malignant' and four for class 'Benign'), dividing data into subclasses does not lead to improved performance of the C4.5 classifier and improves the performance of CART only slightly. The best result for CART is not better than C4.5 without class decomposition. Apparently clusters of different classes overlap and are easier to distinguish for the approach used in CART. This leads to even worse results for C4.5 than on the initial data. A possible solution is choosing a larger number of clusters, but this process also consumes much more time and resources and the obtained improvement might be too small for the invested work.
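The relabeling step and the mapping back to the original classes can be sketched as follows. The label format (class name, underscore, concatenated cluster numbers) follows the B_123 and B_4 example in the text; the function names and the toy predictions are hypothetical, and the format is only unambiguous for single-digit cluster numbers.

```python
def merged_label(class_name, group):
    """Label for a merged group of clusters, e.g. ('B', [1, 2, 3]) -> 'B_123'."""
    return class_name + "_" + "".join(str(c) for c in sorted(group))


def decompose(class_name, cluster_ids, combination):
    """Replace each record's cluster number by the label of its merged group."""
    lookup = {c: merged_label(class_name, group)
              for group in combination for c in group}
    return [lookup[c] for c in cluster_ids]


def original_class(label):
    """Map a subclass label such as 'B_123' back to its original class 'B'."""
    return label.rsplit("_", 1)[0]


def error_after_mapping(true_labels, predicted_labels):
    """Error rate counted on original classes; confusing two subclasses
    of the same class is not an error."""
    wrong = sum(original_class(t) != original_class(p)
                for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# Class B records with cluster numbers 1..4, combination [{C1, C2, C3}, {C4}].
new_labels = decompose("B", [1, 4, 2, 3], [[1, 2, 3], [4]])
print(new_labels)

# Hypothetical predictions of a classifier trained on the subclass labels.
truth = ["B_123", "B_4", "M_12"]
predicted = ["B_4", "B_4", "B_123"]
print(error_after_mapping(truth, predicted))
```

Counting errors only after mapping back is what separates the final classification error from the exploratory errors measured between subclasses of a single class.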

Performance comparison on the Parkinson's problem
The third data set is 'Parkinson's' with 24 attributes: a label attribute unique for each record, 22 real-valued attributes and a class attribute with two values: 0 (healthy) and 1 (Parkinson's disease).
This data set is also split into subsets with records belonging to the same class. These subsets are clustered and cluster labels are assigned to records, allowing us to perform classification to explore the structures of the classes. Classification results are shown in Table 7.
This case shows a similar pattern to the previous data sets: the lowest classification errors are obtained for combinations with cardinality of three and for combinations with two merges with cardinality of two. Although the best scores were achieved using C4.5, the minimal errors within both classes are achieved using CART: 0% error for combination … The best results using CART are for the same combinations that had the best result with C4.5 but the second best results differ: $[\{C_1, C_2\}, \{C_3, C_4\}]$ for class 1 but the best results for class 0 are for combinations … Classification of the data using CART and C4.5 without class decomposition favors CART as the better classifier, with 14.4% incorrectly classified records compared with 17.9% for C4.5 (see the second line of Table 8).
When classification is performed using data with decomposed classes, C4.5 clearly performs better than it does with the original classes. C4.5 with decomposed classes even outperforms CART when solving this problem. The classification error of the C4.5 classifier is reduced by up to 40%. Although classes decomposed according to combinations that scored best in explorative classification do not show reduced performance in C4.5 classification, they can certainly worsen the performance of the CART classifier (only one case of class decomposition shows a slightly better result).

Conclusions
It is hard to define a mathematical approach to attribute space analysis and description for classification model building because the efficiency of methods and their parameters depends on data structure. There are different approaches to class decomposition that lead to different results and fit certain data structures. Hierarchical clustering is a suitable approach to discovering high density areas within classes because of its flexibility in choosing cluster defining parameters (similarity measures and linkage options); it also does not need any prior information about the number of clusters and their positions (centroids). This allows a more objective class structure analysis.
The choice of classification model is also relevant because the best model for initial data is not always the best model for data with decomposed classes.This also depends on class structure and the character of overlapping areas.
There cannot be a universal approach to cluster combination selection without taking into account the final classification.Although the observed trend complies with previously proposed heuristics, the best results do not always comply with the trend.
This paper puts forward a cluster analysis based technique for analyzing classifier training set structure.The methodology of effective classifier design is implemented by synthesizing class descriptions that employ a combination of well separable clusters.
Information about class structure can be used to improve the performance of the classifiers. Experiments show a significant improvement: classification error is reduced by up to 40%.
In future research, more attention will be paid to studying cluster structure sensitivity that reflects real location of classes as well as to studying clustering quality.

Fig. 1. Groups of objects that cannot be separated using straight lines (hyperplanes in a two-dimensional space) that represent a linear classifier

Fig. 2. In the case of a complex class structure, combinations of clusters should be considered because a change in class structure changes the classification outcome

1. … an n-dimensional vector describing an object in the training set. Let $\{y_1, y_2, \ldots, y_k\}$ be the set of classes. The input set $T$ is divided into subsets $T_1, T_2, \ldots, T_k$ where each $T_i = \{(x, y_i)\}$ holds all of the examples of the same class $i$ from the initial set $T$. Hierarchical clustering is performed with each subset $T_i$.
2. Choosing the best number of clusters (by largest distance) and assigning the cluster label to each example $x$.
3. Merging clusters to find a combination that leads to the least classification error on clustered …
4. Evaluating combinations by assigning labels of merged clusters $y_i'$, mapping the subset $T_i$ to a set $T_i' = \{(x, y_i')\}$ and classifying the complete training data set $T$. Every label $y_i'$ contains information about the existing class, allowing $y_i'$ to be mapped to the original class $y_i$.
5. Finding the best combination by analyzing the results of classification performed on the … -labeling classified examples to match the existing classes $\{y_1, y_2, \ldots, y_k\}$.

Fig. 3. Three classes of the Iris data set classified using C4.5: Iris-Setosa is fully separated at the lower part and the other two classes overlap a little, causing error

Fig. 4. Three classes of the Iris data set classified using CART: Iris-Setosa is fully separated at the left-hand part and the other two classes still overlap, causing error

… class (Setosa) is decomposed into two subclasses: set_1 (class Setosa cluster $C_1$; labeled 'a' in the confusion matrix) and set_234 (b); class Versicolor into vers_14 (c) and vers_23 (d); and class Virginica into subclasses virg_12 (e) and virg_34 (f). The resulting confusion matrix is shown in Table 3.

Table 1. Cluster combination classification errors for each class using C4.5. Best result lines are shaded

Table 3. Confusion matrix for decomposed classes

Table 4. Confusion matrix for decomposed classes. Different combination for class Versicolor

Table 5. Classification errors for each class using C4.5. Best combination lines are shaded

Table 6. Classification errors of decomposed classes transformed for existing classes

Table 7. Cluster combination classification errors for each class using C4.5. Best combination lines are shaded

Table 8. Classification errors of decomposed classes transformed for existing classes