|
|
|
|
|
|
|
In the previous week, we discussed one of the predictive learning tasks, [regression](DDE-1/Regression), where the goal is to estimate the target value from the input array X using alternative mathematical models. The objective in classification is similar: we use the input feature matrix X to predict a class, i.e., one of a set of discrete values. Here, the mathematical model divides the feature space (X) into regions separated by decision boundaries, and the learning procedure is therefore focused on identifying these hyperplanes.
|
|
|
|
|
|
|
|
|
|
|
|
## Linear Classifiers: Logistic regression
|
|
|
|
|
|
|
|
The simplest case is when the input data can be separated by "linear decision planes", i.e., when the data are linearly separable. Here, linear means that the decision surfaces are linear functions of the input array X.
|
|
|
|
|
|
|
|
Let's consider a simple model: a linear function of the input vector with a bias term:
|
|
|
|
|
|
|
|
```math
y_p(x, w) = w_0 + w_1 x_1 + \dots + w_i x_i
```
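As a quick numerical check (a minimal sketch assuming NumPy; the weights, bias, and input values below are made up for illustration), the model is just a dot product between the weight vector and the input, plus the bias:

```python
import numpy as np

# Illustrative (made-up) weights and bias for a three-feature input
w = np.array([0.5, -1.2, 0.3])   # w_1, w_2, w_3
w0 = 0.1                         # bias term w_0

x = np.array([2.0, 1.0, -0.5])   # a single input vector x

# y_p(x, w) = w_0 + w_1*x_1 + ... + w_i*x_i
y_p = w0 + np.dot(w, x)
print(y_p)  # -0.25
```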
|
|
|
|
|
|
|
|
To convert the regression model into a classifier, imagine a binary classification scenario and assign Class 1 if $`y_p \geq 0`$ and Class 2 otherwise. By enforcing such a rule, we can find the weights that work for our training dataset by minimizing an error function (e.g., least squares). This toy example is good for demonstration but not very useful in practice: (i) the least-squares approach is not robust and fails drastically in the presence of outliers, and (ii) more importantly, under the hood we assume a Gaussian conditional distribution for the labels, which is usually not the case.
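The sketch below (assuming NumPy; the two point clouds, the +1/-1 target encoding, and all variable names are illustrative choices, not part of the text above) shows this toy recipe end to end: fit the weights by least squares and classify by the sign of $`y_p`$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up, linearly separable point clouds in 2D
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))    # Class 1
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))  # Class 2
X = np.vstack([X1, X2])
t = np.hstack([np.ones(50), -np.ones(50)])  # encode Class 1 as +1, Class 2 as -1

# Prepend a column of ones so that w[0] plays the role of the bias w_0
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares fit of y_p(x, w) = w_0 + w_1*x_1 + w_2*x_2
w, *_ = np.linalg.lstsq(X_aug, t, rcond=None)

# Decision rule: Class 1 if y_p >= 0, otherwise Class 2
y_p = X_aug @ w
predicted_class = np.where(y_p >= 0, 1, 2)
```

A single extreme outlier added to either cloud would pull this least-squares boundary toward it, which is exactly weakness (i) above.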
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Logistic regression addresses these issues by passing the linear model through the logistic sigmoid, $`\sigma(y_p) = 1/(1 + e^{-y_p})`$, and interpreting the output as the class probability. Also note that logistic regression suffers from over-fitting if the training dataset is linearly separable: once trained, the fitted sigmoid becomes very steep, like [a Heaviside step function](https://en.wikipedia.org/wiki/Heaviside_step_function), because the weights can grow without bound. Therefore, you should add regularization to the error function (penalize w for taking very large values).
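As a hedged illustration of that advice (assuming scikit-learn is available; the dataset below is made up), `LogisticRegression` applies an L2 penalty by default, and its `C` parameter is the inverse regularization strength, so a smaller `C` penalizes large weights more strongly and keeps the fitted sigmoid from collapsing into a step function:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Made-up linearly separable data: without regularization the weights would keep growing
X = np.vstack([rng.normal([1.5, 1.5], 0.3, (40, 2)),
               rng.normal([-1.5, -1.5], 0.3, (40, 2))])
y = np.hstack([np.ones(40), np.zeros(40)])

strong_reg = LogisticRegression(penalty="l2", C=0.1).fit(X, y)   # heavier penalty on w
weak_reg = LogisticRegression(penalty="l2", C=1e6).fit(X, y)     # almost no penalty

print("strongly regularized |w|:", np.linalg.norm(strong_reg.coef_))
print("weakly regularized   |w|:", np.linalg.norm(weak_reg.coef_))
# The weakly regularized weights come out much larger, i.e. a much steeper sigmoid.
```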
|
|
|
|
|
|
|
|
|
|