Generative Learning:

Supervised, discriminative learning – input is labeled (x, y) pairs; the model learns p(y|x) directly, so given a new x it outputs the probability of each label.

generative learning – basically a Bayesian idea: model the joint distribution p(x, y) instead. **forgot about sample and population ..**

p(y|x) = p(x, y) / p(x) = p(x|y) p(y) / Σ_y' p(x|y') p(y') ∝ p(x|y) p(y)

p(y) is the prior distribution of y.

p(x|y) is the probability of x given a label y

So if we assume a prior for y and a form for p(x|y) — i.e., assume the distribution of labels and the distribution of inputs given a label — we can compute the probability of the current observation under each label, then pick the label that maximizes p(x|y) p(y). **(This relates to MLE: the parameters of p(y) and p(x|y) are themselves fit by maximizing the joint likelihood of the training data.)**
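
As a toy sketch of the rule above (all prior and likelihood values below are made-up numbers, not from any dataset):

```python
# Toy sketch of generative classification via Bayes' rule.
# The priors and likelihoods are made-up numbers for illustration.
priors = {0: 0.6, 1: 0.4}            # p(y)
likelihoods = {0: 0.2, 1: 0.7}       # p(x|y) for one observed x

# Unnormalized posterior: p(y|x) ∝ p(x|y) p(y)
scores = {y: likelihoods[y] * priors[y] for y in priors}
evidence = sum(scores.values())      # p(x) = Σ_y p(x|y) p(y)
posterior = {y: s / evidence for y, s in scores.items()}

prediction = max(posterior, key=posterior.get)   # arg max_y p(y|x)
```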

**Gaussian Discriminant Analysis Model**

x is a continuous, real-valued vector

The prior for y is Bernoulli(phi)

x|y=0 ~ multivariate Gaussian with mean mu0 and cov sigma

x|y=1 ~ multivariate Gaussian with mean mu1 and cov sigma — same sigma!!
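
A minimal sketch of fitting this model by maximum likelihood and predicting via arg max_y p(x|y) p(y); the function names and array conventions (X of shape (n, d), y a 0/1 array) are my own, not from the notes:

```python
import numpy as np

def fit_gda(X, y):
    # MLE estimates for the GDA parameters (binary y, shared covariance)
    phi = y.mean()                                  # Bernoulli prior p(y=1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 0, mu0, mu1)
    sigma = centered.T @ centered / len(y)          # shared covariance
    return phi, mu0, mu1, sigma

def predict_gda(X, phi, mu0, mu1, sigma):
    inv = np.linalg.inv(sigma)
    def log_gauss(X, mu):
        # log N(x; mu, sigma) up to a constant shared by both classes
        d = X - mu
        return -0.5 * np.einsum('ij,jk,ik->i', d, inv, d)
    score1 = log_gauss(X, mu1) + np.log(phi)        # log p(x|y=1) + log p(y=1)
    score0 = log_gauss(X, mu0) + np.log(1 - phi)
    return (score1 > score0).astype(int)
```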

GDA and logistic regression :

if p(x|y) is a multivariate Gaussian (with a covariance shared across classes), then p(y|x) follows a logistic function.

Q: GDA is a generative algo, while logistic regression is a discriminative algo. When to use one over the other?

- GDA makes stronger assumptions about p(x|y). i.e. if p(x|y) ~ Gaussian and y ~ Bernoulli, then p(y|x) ~ logistic regression. But the converse is not true: if p(y|x) ~ logistic, it's possible that p(x|y) ~ Poisson instead. Thus the assumption behind logistic regression is weaker.
- When p(x|y) is indeed Gaussian, GDA is asymptotically efficient. i.e. if the assumption is met, GDA is better than logistic regression.
- Logistic regression is therefore more robust and less sensitive to incorrect modeling assumptions.
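
A quick numeric check of the first bullet — with a shared covariance the quadratic terms cancel, and the Bayes-rule posterior p(y=1|x) equals a sigmoid of a linear function of x. All parameter values here are made up:

```python
import numpy as np

# Made-up GDA parameters (shared covariance between the two classes)
phi = 0.3
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
inv = np.linalg.inv(sigma)

def gauss(x, mu):
    # Multivariate normal density N(x; mu, sigma) in 2 dimensions
    d = x - mu
    return np.exp(-0.5 * d @ inv @ d) / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(sigma))

x = np.array([0.5, -1.0])

# Posterior p(y=1|x) via Bayes' rule
num = gauss(x, mu1) * phi
posterior = num / (num + gauss(x, mu0) * (1 - phi))

# The same posterior as a sigmoid of a linear function of x
theta = inv @ (mu1 - mu0)
theta0 = 0.5 * (mu0 @ inv @ mu0 - mu1 @ inv @ mu1) + np.log(phi / (1 - phi))
sigmoid = 1 / (1 + np.exp(-(theta @ x + theta0)))
```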

**Naive Bayes**

x is a vector of discrete features; Naive Bayes further assumes the features are conditionally independent given the label, so p(x|y) = Π_j p(x_j|y).
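
A minimal Bernoulli Naive Bayes sketch under those assumptions, with Laplace smoothing; the function names and the binary-feature convention are my choices:

```python
import numpy as np

def fit_nb(X, y):
    # X: binary (n, d) array; y: 0/1 labels
    phi = {c: (y == c).mean() for c in (0, 1)}                 # prior p(y=c)
    # p(x_j = 1 | y = c), with +1/+2 Laplace smoothing
    theta = {c: (X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
             for c in (0, 1)}
    return phi, theta

def predict_nb(X, phi, theta):
    scores = []
    for c in (0, 1):
        # log p(y=c) + Σ_j log p(x_j|y=c), using the naive independence assumption
        log_lik = X @ np.log(theta[c]) + (1 - X) @ np.log(1 - theta[c])
        scores.append(np.log(phi[c]) + log_lik)
    return (scores[1] > scores[0]).astype(int)
```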

Sep 13, 2017

A list of models to review:

- Neural networks
- GAN
- Energy-based models
- CNN
- LSTM

- Support Vector Machine
- LASSO
- PCA
- KNN

Oct 11, 2017

- today learnt KNN — find the k nearest neighbors of the input and classify it as the most prevalent class among those neighbors
- Supervised model : given input predict output — from The Elements of Statistical Learning
- k means — ??
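
The KNN rule above can be sketched as follows (the Euclidean metric is an assumption on my part):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distance from the query point to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Classify as the most prevalent class among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
```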

Oct 12, 2017

Today learnt:

- decision trees
- CART — classification and regression tree – each step chooses only one feature and a threshold to split on. (machine learning book with tensorflow)
- feature importance — calculated using Gini impurity
- Gini
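
A sketch of Gini impurity, and the weighted split impurity CART minimizes when choosing a feature/threshold; the function names are mine:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - Σ_c p_c^2 over class proportions p_c at a node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def split_impurity(labels, feature, threshold):
    # Weighted impurity of the two children after splitting on one
    # feature at one threshold — the quantity CART greedily minimizes
    left, right = labels[feature <= threshold], labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```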

- MLE vs KL divergence vs cross entropy
- KL divergence measures the dissimilarity between an empirical distribution and a theoretical distribution
- MLE tunes the parameters s.t. the likelihood of the seen data is maximized
- cross entropy ?
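
A numeric sketch of how these three connect, via the identity H(p, q) = H(p) + KL(p ‖ q); the distributions p and q below are made up:

```python
import numpy as np

# p plays the role of the empirical distribution, q the model's distribution
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

entropy = -(p * np.log(p)).sum()            # H(p)
cross_entropy = -(p * np.log(q)).sum()      # H(p, q)
kl = (p * np.log(p / q)).sum()              # KL(p || q)

# H(p) is fixed by the data, so minimizing cross entropy over q minimizes
# the KL divergence — and over an empirical p this is exactly MLE.
```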

- PCA – eigenvalue decomposition of covariance matrix ??
- RNN — used when inputs have different lengths but the same model must be trained on all of them!
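
A minimal sketch of the PCA bullet above — eigendecomposition of the sample covariance matrix; the function name and return convention are mine:

```python
import numpy as np

def pca(X, k):
    # Center the data, then eigendecompose the sample covariance
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]             # sort by descending variance
    components = eigvecs[:, order[:k]]            # top-k principal directions
    return Xc @ components, components            # projections and directions
```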