One popular building block here is the decision tree (DT). The model can be interpreted as a graph consisting of sequential decisions performed at each node. These decisions are how the data space is divided into sub-regions. To illustrate, let's imagine a 2D data space. In the first step, the data space is divided into two based on a model parameter $`γ_1`$. Then, each sub-domain is further split based on the model parameters $`γ_2`$ and $`γ_3`$. Once we have a tree, we can pass a new input $`x'`$ through it and find out which region that new instance falls into. At the leaf nodes, constants or class indices are returned as the output. Decision trees are transparent models: we can trace the reasoning behind an output and judge the relative influence of the input features on that decision. This property is what makes DT-based models a popular choice in the medical field.
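
To make the traversal concrete, below is a minimal sketch of such a depth-2 tree over a 2D input $`(x_1, x_2)`$. The threshold values standing in for $`γ_1`$, $`γ_2`$, $`γ_3`$ and the leaf labels are illustrative assumptions, not values from the text.

```python
# A minimal sketch of a depth-2 decision tree over a 2D input (x1, x2).
# The thresholds and the leaf labels are made up for illustration.

GAMMA_1 = 0.5   # root split on x1
GAMMA_2 = 0.3   # split on x2 in the left sub-domain
GAMMA_3 = 0.7   # split on x2 in the right sub-domain

def predict(x):
    """Route a new instance x = (x1, x2) to a leaf and return its label."""
    x1, x2 = x
    if x1 <= GAMMA_1:                            # root decision
        return "A" if x2 <= GAMMA_2 else "B"     # left sub-domain split
    else:
        return "C" if x2 <= GAMMA_3 else "D"     # right sub-domain split

print(predict((0.2, 0.9)))  # the new instance falls into the region labelled "B"
```

Reading the path taken through the `if` statements is exactly the traceable reasoning mentioned above: the prediction "B" is explained by $`x_1 \le γ_1`$ and $`x_2 > γ_2`$.
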
During training, we learn the split criteria and the values for the split conditions, as well as the leaf values (used for classification or regression outputs). How is this managed? How can we create useful tree graphs? One option is to try all combinations, but such an exhaustive search becomes practically impossible, even for a fixed number of nodes in a decision tree. Therefore, greedy optimization is usually deployed: we start with a single root node, splitting the space into two. For a simple illustration, consider a classification task in 2D (a sketch follows below). In 2D, we have two candidate first nodes: a vertical line or a horizontal line as the decision boundary. We test both and keep whichever is better. Then we go into the second iteration, where the number of candidate splits increases because each region can now be split further. In general, the optimization procedure involves choosing (i) which region is divided further, (ii) which input variable is used for the split and (iii) the value of the threshold. In short, we try to split the sub-samples even further at every iteration. We can continue until each sub-space contains only one label (for classification) or cannot be split further, which means the model makes no classification error on the training data. However, such an effort would give complex trees with poor generalization capabilities. In practice, simple trees are preferred over complex structures, which is achieved by utilizing different tree-inducer algorithms.
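
The sketch below illustrates a single greedy split step, assuming the Gini impurity as the split criterion and an exhaustive scan over the observed feature values as candidate thresholds; the function names and the toy data are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedily pick the (feature, threshold) pair with the lowest
    weighted impurity of the two resulting sub-samples."""
    n, d = X.shape
    best = (None, None, np.inf)                  # (feature index, threshold, score)
    for j in range(d):                           # candidate input variable
        for t in np.unique(X[:, j]):             # candidate threshold
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue                         # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy 2D classification data: the best first split separates the two classes.
X = np.array([[0.1, 0.2], [0.2, 0.8], [0.8, 0.3], [0.9, 0.9]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # e.g. feature 0 with a threshold around 0.2
```

A full tree inducer would apply `best_split` recursively to each resulting sub-sample until a stopping criterion (purity, a minimum sample count or a maximum depth) is met; limiting the depth or pruning afterwards is what keeps the learned tree simple.
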