Finally, a post in this blog that actually gets a little bit technical …

This post discusses two approaches to understanding logistic regression: the **empirical risk minimization** view vs. the **probabilistic** view.

Empirical Risk Minimizer

Empirical risk minimization frames a learning problem in terms of the following components:

- Input space $\mathcal{X}$. Corresponds to observations, features, predictors, etc.
- Outcome space $\mathcal{Y}$. Corresponds to target variables.
- Action space $\mathcal{A}$. A function $f: \mathcal{X} \to \mathcal{A}$ producing actions is also called a decision function, predictor, or hypothesis.
- A restriction of the set of all such functions is the hypothesis space $\mathcal{F}$, e.g. all linear transformations of the input.

- Loss function $\ell(a, y)$: a loss defined on the predicted value (the action) and the observed value.

The goal of the whole problem is to select a function $f$ in the hypothesis space that minimizes the total loss on the sample. This is achieved by choosing the parameters of $f$ so as to minimize the empirical loss on the training set. We also do hyperparameter tuning, which is done on the validation set in order to prevent overfitting.
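The recipe above can be sketched in a few lines of code. This is a minimal, hypothetical example, not anything from the post itself: a one-parameter linear hypothesis space $f(x) = wx$, squared loss, and a crude grid search standing in for a real optimizer.

```python
# Toy training set (made-up numbers): y is roughly 2 * x.
train = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def empirical_risk(w, data):
    """Average squared loss of the linear hypothesis f(x) = w * x on data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# ERM: pick the w in a (coarse, grid-based) hypothesis space that
# minimizes the empirical risk on the training sample.
candidates = [k / 100 for k in range(-500, 501)]
w_star = min(candidates, key=lambda w: empirical_risk(w, train))
```

In practice the grid search would be replaced by gradient descent, but the structure is the same: fix a hypothesis space and a loss, then search for the parameters with the smallest empirical risk.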

In a binary classification task:

- Input space $\mathcal{X} = \mathbb{R}^d$.
- Outcome space $\mathcal{Y} = \{-1, 1\}$. Binary target values.
- Action space $\mathcal{A} = \mathbb{R}$. The hypothesis space: all linear score functions $\mathcal{F} = \{f(x) = w^\top x \mid w \in \mathbb{R}^d\}$.
- Loss function: the logistic loss $\ell(m) = \log(1 + e^{-m})$.
- This is a kind of margin-based loss, thus the $m$ here.
- The margin is defined as $m = y f(x)$, which has a natural interpretation in a binary classification task. Consider:
- If $m = y f(x) > 0$, we know our prediction and the true value are of the same sign. Thus, in binary classification, we already get the correct result, so for $m > 0$ we could set loss = 0.
- If $m = y f(x) < 0$, we know our prediction and the true value are of different signs. Thus, in binary classification, we are wrong. We need to assign a positive value to the loss.
- In SVM, we define the hinge loss $\ell(m) = \max(0, 1 - m)$, which is a "maximum-margin" loss (more on this in the next post, which will cover the derivation of SVM and kernel methods). Basically, for this loss we have no loss when $m \ge 1$ and a positive loss when $m < 1$. We can interpret $m$ as the "confidence" of our prediction: when $0 < m < 1$, the prediction is correct but the confidence is low, so we still penalize!

- With this intuition, how do we understand the logistic loss $\ell(m) = \log(1 + e^{-m})$? We know:
- This loss is always > 0 (it approaches but never reaches zero).
- When $m$ is negative (i.e. a wrong prediction), we have a greater loss!
- When $m$ is positive (i.e. a correct prediction), we have a smaller loss…
- Note also that for the same amount of change in $m$, the scale at which we "reward" correct predictions is smaller than the scale at which we penalize wrong predictions.
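To make the comparison concrete, here is a small sketch (the sample margins are made up for illustration) evaluating the hinge and logistic losses side by side; it also illustrates the asymmetry noted above.

```python
import math

def hinge_loss(m):
    """SVM hinge loss: zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - m)

def logistic_loss(m):
    """Logistic loss log(1 + e^{-m}): always positive, never exactly zero."""
    return math.log(1.0 + math.exp(-m))

# Wrong prediction (m = -2): both losses are large.
# Low-confidence correct prediction (m = 0.5): both still penalize.
# Confident correct prediction (m = 2): hinge is exactly 0,
# logistic is small but still positive.
for m in (-2.0, 0.5, 2.0):
    print(f"m={m:+.1f}  hinge={hinge_loss(m):.3f}  logistic={logistic_loss(m):.3f}")
```

The last assertion-worthy property is the asymmetry: moving the margin from 0 to -1 raises the logistic loss by more than moving it from 0 to +1 lowers it.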

Bernoulli regression with logistic transfer function

- Input space $\mathcal{X} = \mathbb{R}^d$
- Outcome space $\mathcal{Y} = \{0, 1\}$
- Action space $\mathcal{A} = [0, 1]$. An action is the probability that an outcome is 1.

Define the **standard logistic function** as $\phi(\eta) = \frac{1}{1 + e^{-\eta}}$.

- Hypothesis space: $\mathcal{F} = \{x \mapsto \phi(w^\top x) \mid w \in \mathbb{R}^d\}$
- A sigmoid function is any function that has an "S" shape; the logistic function is one simple example. Sigmoids are used in neural networks as activation functions / transfer functions, where their purpose is to add non-linearity to the network.
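As a practical aside, a direct translation of $\phi$ into code overflows for large negative inputs, so implementations typically branch on the sign of the argument. A minimal sketch:

```python
import math

def sigmoid(eta):
    """Standard logistic function phi(eta) = 1 / (1 + e^{-eta}).

    Branch on the sign so that math.exp is only ever called on a
    non-positive argument and can never overflow.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)
```

Both branches compute the same function; the symmetry $\phi(-\eta) = 1 - \phi(\eta)$ also comes in handy in the likelihood calculation below.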

Now we need to relabel the $y_i$ in the dataset.

- For every $y_i = 1$, we define $y'_i = 1$
- For every $y_i = 0$, we define $y'_i = -1$

Can we do this? Doesn't this change the values of the $y_i$? The answer is no: in binary classification (or in any classification), the particular label values do not matter. This trick just makes the equivalence much easier to show…

Then, the negative log-likelihood objective function, given this hypothesis space and a dataset $D = \{(x_i, y'_i)\}_{i=1}^n$, is:

$$\mathrm{NLL}(w) = \sum_{i=1}^{n} \log\left(1 + e^{-y'_i \, w^\top x_i}\right)$$
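A minimal sketch of this objective, assuming the relabeled $y'_i \in \{-1, +1\}$ and a linear score $w^\top x$ (the tiny dataset here is made up for illustration):

```python
import math

def nll(w, data):
    """Negative log-likelihood sum_i log(1 + exp(-y_i * <w, x_i>)),
    assuming labels have already been relabeled to y in {-1, +1}."""
    total = 0.0
    for x, y in data:
        score = sum(wj * xj for wj, xj in zip(w, x))
        total += math.log(1.0 + math.exp(-y * score))
    return total

# Tiny relabeled dataset: originally {0, 1} labels, now {-1, +1}.
data = [([1.0, 2.0], 1), ([2.0, 1.0], 1), ([-1.0, -1.0], -1)]

# With w = 0 every point contributes log(2), regardless of its label;
# a weight vector that separates the data drives the NLL down.
```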

How to understand this approach? Think about a neural network…

- Input $x \in \mathbb{R}^d$.
- First, a linear layer transforms $x$ into a score $\eta = w^\top x$.
- Next, a non-linear activation function: $\phi(\eta) = \frac{1}{1 + e^{-\eta}}$.
- The output $\phi(\eta)$ is interpreted as the probability of the positive class.
- Think about multi-class problems: there, the second layer is a softmax, and we get a vector of probabilities!
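For the multi-class case, a minimal softmax sketch (the score values below are hypothetical). Note that for two classes with scores $(s, 0)$, the first softmax output reduces exactly to the logistic function $\phi(s)$:

```python
import math

def softmax(scores):
    """Map a vector of class scores to a probability vector.

    Subtracting the max score first leaves the result unchanged
    (it cancels in the ratio) but keeps math.exp from overflowing.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

So the sigmoid-output network above is just the two-class special case of a linear layer followed by softmax.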

With some calculation, we can show that this NLL is equivalent to the sum of the empirical logistic losses.
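The calculation is short. Using $1 - \phi(\eta) = \phi(-\eta)$, the probability of a relabeled outcome $y'_i \in \{-1, 1\}$ can be written in one expression as $p(y'_i \mid x_i) = \phi(y'_i \, w^\top x_i)$, and so:

$$
\mathrm{NLL}(w) = -\sum_{i=1}^{n} \log p(y'_i \mid x_i)
= -\sum_{i=1}^{n} \log \phi\!\left(y'_i \, w^\top x_i\right)
= \sum_{i=1}^{n} \log\!\left(1 + e^{-y'_i \, w^\top x_i}\right),
$$

which is exactly the sum of logistic losses $\ell(m_i) = \log(1 + e^{-m_i})$ with margin $m_i = y'_i \, w^\top x_i$. The two views agree.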