Decision trees are one of the fundamental building blocks of data-driven learning, so let's look a bit more closely at the learning algorithms. One important feature to remember is that learning is based on splitting the input domain (see the figures above). The size of the training set is therefore of critical importance, as it dictates the complexity of the tree graphs. More precisely, the algorithms build trees subject to a set of constraints, such as the number of leaves allowed in the model. The tree architecture is thus controlled indirectly by the training dataset together with these constraints, which we should keep in mind throughout the discussion of the learning procedure for DTs.

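As a minimal sketch of this coupling between the data and the tree complexity, the snippet below uses scikit-learn's `DecisionTreeClassifier`; the Iris dataset and the specific `max_leaf_nodes` / `max_depth` values are arbitrary choices for illustration:

```python
# A minimal sketch: the tree architecture is capped by hyperparameters
# (here max_leaf_nodes and max_depth), while the actual splits are
# learned from the training data. Dataset and values are arbitrary.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cap the number of leaves: the learner greedily fits the best splits
# it can within this budget.
small_tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)
# No explicit cap: the tree grows as far as the training data allows.
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

print(small_tree.get_n_leaves(), small_tree.get_depth())  # bounded by the cap
print(deep_tree.get_n_leaves(), deep_tree.get_depth())    # dictated by the data
```
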
The first important aspect of learning a DT from a given example set is that it is a very difficult optimization problem: finding the optimal tree is NP-hard, so exact solutions are out of reach for complex problems. Instead, we rely on heuristic methods, either top-down (ID3, C4.5, CART) or bottom-up, where the former is the preferred approach in practice (this is also the case in scikit-learn, which implements CART; note that CART uses a growing-and-pruning strategy). The top-down method consists of the following steps:

```
createTree(exampleSet, Features, labels, splitEvaluator, stopCriteria):

    i.   Create a new tree "Tree"

    ii.  If stopCriteria is True for the instances in exampleSet:

         a. Make the node a "Leaf" with the most common label in exampleSet

    iii. Else:

         a. Apply splitEvaluator(f, exampleSet) to each feature f to find the
            best feature f* for exampleSet

         b. Label the tree node as f*

         c. For each outcome v* of f*:

            1. Create subTree via createTree(exampleSet_v*, Features, labels,
               splitEvaluator, stopCriteria), where exampleSet_v* is the subset
               of exampleSet with outcome v* for f*

            2. Connect subTree to Tree with "edge" information v*

    iv.  Prune the tree

    v.   Return the Tree
```
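To make the recursion concrete, here is a compact Python sketch of the top-down procedure for categorical features, in the spirit of ID3. The function names mirror the pseudocode above, but the Gini-impurity split evaluator, the stopping rule (pure node, exhausted features, or a depth limit), and the toy data are illustrative assumptions; pruning (step iv) is omitted for brevity.

```python
# A compact sketch of top-down tree induction for categorical features,
# mirroring the pseudocode above. The split evaluator (Gini impurity) and
# the stopping rule are illustrative choices, not the exact CART method.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_evaluator(f, examples, labels):
    """Weighted Gini impurity after splitting on feature index f (lower is better)."""
    n = len(labels)
    score = 0.0
    for v in set(x[f] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[f] == v]
        score += len(subset) / n * gini(subset)
    return score

def create_tree(examples, features, labels, depth=0, max_depth=3):
    # Stop criterion: pure node, no features left, or depth limit reached.
    if gini(labels) == 0.0 or not features or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf: most common label
    # Find the best feature f* according to the split evaluator.
    f_star = min(features, key=lambda f: split_evaluator(f, examples, labels))
    tree = {"feature": f_star, "children": {}}
    # For each outcome v* of f*, recurse on the matching subset and
    # connect the subtree to the tree with edge information v*.
    for v_star in set(x[f_star] for x in examples):
        idx = [i for i, x in enumerate(examples) if x[f_star] == v_star]
        sub_x = [examples[i] for i in idx]
        sub_y = [labels[i] for i in idx]
        rest = [f for f in features if f != f_star]
        tree["children"][v_star] = create_tree(sub_x, rest, sub_y, depth + 1, max_depth)
    return tree

# Tiny usage example on toy categorical data (features: outlook, windy).
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
y = ["play", "play", "play", "stay"]
print(create_tree(X, [0, 1], y))
```
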
The recursive algorithm stops when any of the following is true:

• All examples in the current example set belong to a single class / value y,
• The maximum tree depth is reached,