# Two approaches for logistic regression

Finally, a post in this blog that actually gets a little bit technical …

This post discusses two approaches to understanding logistic regression: the empirical risk minimization view versus the probabilistic view.

## Empirical risk minimization

Empirical risk minimization frames a problem in terms of the following components:

• Input space $X = R^d$. Corresponds to observations, features, predictors, etc.
• Outcome space $Y$ (some set $\Omega$). Corresponds to target variables.
• Action space $A = R$. A mapping from inputs to actions is called a decision function, predictor, or hypothesis.
• Within the space of all such mappings we fix a hypothesis space, e.g. all linear functions.
• Loss function $l(\hat{y}, y)$: a loss defined on a predicted value and the corresponding observed value.

The goal is to select a function $f$ from the hypothesis space $F$ that minimizes the total loss on the sample. This is achieved by choosing the parameters of $f$ so that the empirical loss on the training set is minimized. We also do hyperparameter tuning on the validation set in order to prevent overfitting.
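
Writing $\hat{R}_n(f)$ for the empirical risk of a decision function $f$ on the training sample $\{(x_i, y_i)\}_{i=1}^n$ (the notation is mine, but the definition is standard), the objective is $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n l(f(x_i), y_i)$, and the learned predictor is $\hat{f} = \text{argmin}_{f \in F} \hat{R}_n(f)$.

Specialized to binary logistic regression, the components become: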

• Input space $X = R^d$.
• Outcome space $Y = \{-1, 1\}$. Binary target values.
• Action space $A = R$. An action is a real-valued score. The hypothesis space is all linear score functions:
• $F_{\text{score}} = \{x \rightarrow x^T w \mid w \in R^d\}$
• Loss function: $l(\hat{y}, y) = l_{\text{logistic}}(m) = \log(1 + e^{-m})$
• This is a kind of margin-based loss, hence the $m$ here.
• The margin is defined as $m = \hat{y} y$, which has a natural interpretation in a binary classification task. Consider:
• If $m = \hat{y} y > 0$, our prediction and the true value have the same sign, so in binary classification we already get the correct result. Thus, for $m > 0$ we would ideally have zero loss.
• If $m = \hat{y} y < 0$, our prediction and the true value have different signs, so in binary classification we are wrong. Here the loss function needs to take a positive value.
• In SVM, we use the hinge loss $l(m) = \max(0, 1-m)$, which is a “maximum-margin” loss (more on this in the next post, which will cover the derivation of SVM and kernel methods). Basically, for this loss there is no loss when $m \geq 1$ and a positive loss when $m < 1$. We can interpret $m$ as the “confidence” of our prediction: when $m < 1$ the confidence is low, so we still penalize!
• With this intuition, how do we understand the logistic loss? We know:
• The loss is always positive (strictly greater than 0), even for correct predictions.
• When $m$ is negative (i.e. a wrong prediction), the loss is large.
• When $m$ is positive (i.e. a correct prediction), the loss is small but never exactly zero.
• Note also that for the same increase in $m$, the amount by which we “reward” a correct prediction (by decreasing the loss) is smaller than the amount by which we penalize a wrong one; see the sketch after this list.
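
To make this concrete, here is a minimal sketch (assuming NumPy, labels in $\{-1, +1\}$, and plain gradient descent; the function names are illustrative, not from any particular library) that compares the hinge and logistic losses and minimizes the empirical logistic risk over linear score functions:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_loss(m):
    # Margin-based logistic loss: log(1 + exp(-m)), computed stably.
    return np.logaddexp(0.0, -m)


def hinge_loss(m):
    # SVM hinge loss: max(0, 1 - m).
    return np.maximum(0.0, 1.0 - m)


def fit_logistic_erm(X, y, lr=0.1, n_steps=2000):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) by gradient descent.

    X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        m = y * (X @ w)  # margins for every training example
        # d/dm of log(1 + e^{-m}) is -sigmoid(-m); chain rule through m = y * x^T w
        grad = -(X * (y * sigmoid(-m))[:, None]).mean(axis=0)
        w -= lr * grad
    return w


# The asymmetry noted above: moving the margin from -2 to -1 reduces the
# logistic loss by much more than moving it from +1 to +2 does.
for m in (-2.0, -1.0, 1.0, 2.0):
    print(f"m={m:+.0f}  logistic={logistic_loss(m):.3f}  hinge={hinge_loss(m):.3f}")
```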

## Bernoulli regression with a logistic transfer function

• Input space $X = R^d$
• Outcome space $Y = \{0, 1\}$
• Action space $A = [0, 1]$. An action is the probability that the outcome is 1.

Define the standard logistic function as $\phi (\eta) = 1 / (1 + e^{-\eta})$

• Hypothesis space: $F = \{x \rightarrow \phi(w^T x) \mid w \in R^d\}$
• A sigmoid function is any function with an “S” shape; the simple logistic function is one example! Sigmoids are used in neural networks as activation functions (also called transfer functions), whose purpose is to add non-linearity to the network.

Now we need to relabel the $y_i$ in the dataset.

• For every $y_i = 1$, we define $y_i' = 1$
• For every $y_i = -1$, we define $y_i' = 0$

Can we do this? Doesn’t it change the values of the $y$’s? The answer is that in binary classification (or any classification), the particular label values do not matter. This relabeling just makes the equivalence between the two approaches much easier to show.

Then the negative log-likelihood objective function, given this $F$ and a dataset $D$, is as follows (a short code sketch comes after it):

• $NLL(w) = \sum_{i=1}^n \left[ -y_i' \log \phi(w^T x_i) + (y_i' - 1) \log(1 - \phi(w^T x_i)) \right]$
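
As a quick sanity check, the objective can be written out directly. A minimal sketch, assuming NumPy; the helper names `phi` and `nll` are illustrative:

```python
import numpy as np


def phi(eta):
    # Standard logistic function phi(eta) = 1 / (1 + exp(-eta)).
    return 1.0 / (1.0 + np.exp(-eta))


def nll(w, X, y01):
    """Negative log-likelihood of w, with relabeled targets y01 = y' in {0, 1}.

    X: (n, d) array of inputs; y01: (n,) array of 0/1 labels.
    """
    p = phi(X @ w)  # predicted probability that y' = 1, for each example
    return np.sum(-y01 * np.log(p) + (y01 - 1.0) * np.log(1.0 - p))
```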

How to understand this approach? Think about a neural network…

• Input $x$
• First linear layer: transform $x$ into $w^Tx$
• Next non-linear activation function. $\phi (\eta) = 1 / (1 + e^{-\eta})$.
• The output is interpreted as the probability of the positive class.
• For multi-class problems, the second layer is a softmax instead, and we get a vector of class probabilities (see the sketch below).
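
That two-layer view, written out as a minimal sketch (NumPy assumed; the function names are illustrative):

```python
import numpy as np


def predict_proba(w, x):
    # Linear layer: compute the score eta = w^T x.
    eta = w @ x
    # Logistic activation: map the score to a probability in [0, 1].
    return 1.0 / (1.0 + np.exp(-eta))


def predict_proba_multiclass(W, x):
    # Multi-class analogue: a linear layer followed by a softmax,
    # giving a vector of class probabilities that sums to 1.
    scores = W @ x
    exps = np.exp(scores - scores.max())  # shift scores for numerical stability
    return exps / exps.sum()
```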

With a small calculation, we can show that this NLL is exactly the sum of the empirical logistic losses from the ERM view.
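
A sketch of that calculation, using $1 - \phi(\eta) = 1 / (1 + e^{\eta})$ and the relabeling above:

• For $y_i = 1$ (so $y_i' = 1$), the $i$-th term of the NLL is $-\log \phi(w^T x_i) = \log(1 + e^{-w^T x_i}) = l_{\text{logistic}}(y_i \cdot w^T x_i)$.
• For $y_i = -1$ (so $y_i' = 0$), the $i$-th term is $-\log(1 - \phi(w^T x_i)) = \log(1 + e^{w^T x_i}) = l_{\text{logistic}}(y_i \cdot w^T x_i)$.

So $NLL(w) = \sum_{i=1}^n \log(1 + e^{-y_i w^T x_i})$, which is exactly the sum of margin-based logistic losses over linear score functions from the ERM view. Minimizing the NLL and minimizing the empirical logistic risk therefore yield the same $w$.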