A decision tree is a graph-like structure in which internal node represents a "test" on an attribute, each branch represents the outcome of the test and each leaf node represents a class label. The paths from root to leaf represent classification rules.

Some common algorithms for decision tree are ID3 and C4.5.

ID3 algorithm (Iterative Dichotomiser 3)

Step 1: Take the original set S as the root node.

Step 2: Repeat steps 3, 4 and 5 until all the attributes in the set S has taken.

Step 3: For every unused attribute of the set S and calculates the entropy H (S) (or information gain IG (A)) of that attribute.

Step 4: Then selects the attribute which has the smallest entropy (or largest information gain) value.

Step 5: The set S is then split by the selected attribute (e.g. age < 50, 50 <= age < 100, age >= 100) to produce subsets of the data.

This algorithm stops in the following cases:

Case 1: Every element in the subset belongs to the same class, and then the node is turned into a leaf and labelled with the class of the examples.

Case 2: There are no more attributes to be selected, but the examples still do not belong to the same class, then the node is turned into a leaf and labelled with the most common class of the examples in the subset.

Case 3: There are no examples in the subset, this happens when no example in the parent set was found to be matching a specific value of the selected attribute, for example if there was no example with age > 100. Then a leaf node is created and labelled with the most common class of the examples in the parent set.

Usage

The ID3 algorithm is used by training on a dataset S to produce a decision tree which is stored in memory. At runtime, this decision tree is used to classify new unseen test cases by working down the decision tree using the values of this test case to arrive at a terminal node that tells you what class this test case belongs to.

Limitations

1. ID3 does not guarantee an optimal solution; it can get stuck in local optimums. It uses a greedy approach by selecting the best attribute to split the dataset on each iteration. One improvement that can be made on the algorithm can be to use backtracking during the search for the optimal decision tree.
2. ID3 can over fit to the training data, to avoid over fitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible tree.
3. ID3 is harder to use on continuous data. If the value of any given attribute is continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time consuming.

C4.5 Algorithm

C4.5 is an extension to ID3. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s1, s2, s3, of already classified samples. Each sample si consists of a p-dimensional vector (x1,i, x2,i,...,xp, i) where the xj represent attributes or features of the sample, as well as the class in which si falls.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sub lists.

This algorithm has a few base cases:

1. All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
2. None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
3. Instance of previously-unseen class encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

Algorithm

The general algorithm for building decision trees is:

1. Check for base cases
2. For each attribute a .
1. Find the normalized information gain ratio from splitting on a
3. Let a_best be the attribute with the highest normalized information gain
4. Create a decision node that splits on a_best
5. Recurse on the sub lists obtained by splitting on a_best, and add those nodes as children of node

Advantages of C4.5 over ID3

1. Handling both continuous and discrete attributes – In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.
2. Handling training data with missing attribute values. Missing attribute values are simply not used in gain and entropy calculations.
3. Pruning trees after creation – C4.5 goes back through the tree once it's been created and attempts to remove branches that do not help by replacing them with leaf nodes.

References:

1. S.B. Kotsiantis, Supervised Machine Learning: A Review of Classification Techniques, Informatica 31(2007) 249-268, 2007.
2. J. R. Quinlan. Improved use of continuous attributes in c4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.