# Machine Learning Re-cap – keep updating

Generative Learning:

Discriminative learning (the usual supervised setup) – given labeled pairs (x, y), model the conditional p(y|x) directly; then, given a new x, compute p(y|x).

Generative learning – essentially a Bayes'-rule idea: model p(x|y) and p(y) instead of modeling p(y|x) directly. (I forgot the sample vs. population distinction – need to review.)

p(y|x) = p(x, y) / p(x) = p(x|y) p(y) / Σ_y' p(x|y') p(y') ∝ p(x|y) p(y)

p(y) is the prior distribution of y.

p(x|y) is the class-conditional probability of x given label y.

So if we assume a prior on y and a form for p(x|y) – i.e. a distribution over labels and a distribution of the input given each label – we can compute the (unnormalized) probability of the current observation under each label, and predict the label that maximizes it. (This does relate to MLE: the parameters of p(y) and p(x|y) are themselves fit by maximizing the joint likelihood of the training data.)
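A minimal sketch of this predict-by-Bayes-rule idea, with an assumed toy model (one feature, two classes, hand-picked Gaussian class-conditionals and priors – all numbers here are made up for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Assumed model: prior p(y) and class-conditional p(x|y) for two classes.
priors = {0: 0.7, 1: 0.3}                              # p(y)
likelihood = {0: lambda x: gaussian_pdf(x, 0.0, 1.0),  # p(x|y=0)
              1: lambda x: gaussian_pdf(x, 3.0, 1.0)}  # p(x|y=1)

def predict(x):
    # argmax_y p(x|y) p(y); the denominator p(x) is the same for every y,
    # so it can be dropped -- that's the "proportional to" in the equation above.
    return max(priors, key=lambda y: likelihood[y](x) * priors[y])
```

Note the prediction only needs the joint p(x, y) up to a constant, not the normalized posterior.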

Gaussian Discriminant Analysis Model

x is continuous real valued vectors

The prior for y is Bernoulli(phi)

x|y=0 ~ multivariate Gaussian with mean mu0 and cov sigma

x|y=1 ~ multivariate Gaussian with mean mu1 and cov sigma — same sigma!!
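A sketch of GDA's closed-form parameter estimates (class means, shared covariance, Bernoulli prior) on toy data I generate from the assumed model itself:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two classes drawn with the same (identity) covariance.
X0 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=100)
X1 = rng.multivariate_normal([2.0, 2.0], np.eye(2), size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Closed-form MLEs for the GDA parameters.
phi = y.mean()                       # prior p(y=1)
mu0 = X[y == 0].mean(axis=0)
mu1 = X[y == 1].mean(axis=0)
centered = X - np.where(y[:, None] == 0, mu0, mu1)
Sigma = centered.T @ centered / len(X)   # one Sigma pooled across both classes

def predict(x):
    """argmax_y p(x|y) p(y), comparing the two Gaussian log-joints."""
    inv = np.linalg.inv(Sigma)
    def log_joint(mu, prior):
        d = x - mu
        return -0.5 * d @ inv @ d + np.log(prior)
    return int(log_joint(mu1, phi) > log_joint(mu0, 1 - phi))
```

The log-determinant term of the Gaussian cancels in the comparison precisely because Sigma is shared.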

GDA and logistic regression :

If p(x|y) is multivariate Gaussian (with shared covariance), then p(y|x) follows a logistic function.

Q: GDA is a generative algorithm, while logistic regression is a discriminative algorithm. When to use one over the other?

• GDA makes stronger assumptions about p(x|y): if p(x|y) is Gaussian and y is Bernoulli, then p(y|x) is logistic. But the converse is not true – p(y|x) being logistic is also implied by, e.g., p(x|y) being Poisson. So the assumption behind logistic regression is strictly weaker.
• When p(x|y) really is Gaussian, GDA is asymptotically efficient, i.e. if its assumptions are met, GDA does better than logistic regression.
• Logistic regression is more robust – less sensitive to incorrect modeling assumptions.
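Sketch of why the Gaussian assumption yields a logistic posterior: with a shared Σ, the quadratic term xᵀΣ⁻¹x cancels between the two classes, leaving a linear function of x inside a sigmoid (θ and θ₀ below are just shorthand for the resulting coefficients):

```latex
p(y{=}1 \mid x)
  = \frac{p(x \mid y{=}1)\,\phi}{p(x \mid y{=}1)\,\phi + p(x \mid y{=}0)\,(1-\phi)}
  = \frac{1}{1 + \exp\!\left(-(\theta^\top x + \theta_0)\right)},
```

```latex
\theta = \Sigma^{-1}(\mu_1 - \mu_0), \qquad
\theta_0 = \tfrac{1}{2}\left(\mu_0^\top \Sigma^{-1} \mu_0 - \mu_1^\top \Sigma^{-1} \mu_1\right)
         + \log\frac{\phi}{1-\phi}.
```

With class-specific covariances the quadratic term would not cancel, and the boundary would be quadratic instead of linear.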

Naive Bayes

x is a discrete (e.g. binary) vector
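A minimal Bernoulli naive Bayes sketch for binary feature vectors (toy word-presence data assumed; the "naive" part is the feature-independence assumption given the label):

```python
import numpy as np

# Toy binary feature matrix (e.g. word-presence vectors) and labels.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

def fit(X, y, alpha=1.0):
    """MLE with Laplace smoothing: p(x_j=1 | y) per class, plus p(y)."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        theta = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        params[c] = (theta, len(Xc) / len(X))
    return params

def predict(params, x):
    # Naive assumption: features are independent given the label,
    # so the log-likelihood is a sum over coordinates.
    scores = {}
    for c, (theta, prior) in params.items():
        scores[c] = np.log(prior) + np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
    return max(scores, key=scores.get)
```

The Laplace smoothing (`alpha`) keeps p(x_j=1|y) away from exact 0 or 1, so unseen feature values don't zero out the whole product.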

Sep 13, 2017

A list of models to review:

• Neural networks
• GAN
• Energy-based models
• CNN
• LSTM
• Support Vector Machine
• LASSO
• PCA
• KNN

Oct 11, 2017

• today learnt KNN — select the k nearest neighbors and classify a point as the most prevalent class among those neighbors
• Supervised model: given input, predict output — from Elements of Statistical Learning
• k-means — ?? (unsupervised clustering — need to review)
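The KNN rule from the first bullet as a short sketch (Euclidean distance and a tiny made-up training set assumed):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x as the most prevalent label among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy training set: two clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
```

There is no training step at all – the "model" is just the stored data, which is why k is the main knob.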

Oct 12, 2017

Today learnt:

• decision trees
• CART — classification and regression tree – each step chooses one feature and one threshold (machine learning book with tensorflow)
• feature importance — calculated using Gini
• Gini
• MLE vs KL divergence vs cross entropy
  • KL divergence measures the dissimilarity between the empirical distribution and a theoretical distribution
  • MLE tunes the parameters so that the likelihood of the seen data is maximized
  • cross entropy — equals the entropy of the empirical distribution plus the KL divergence to the model, so minimizing it is equivalent to minimizing KL (and to MLE)
• PCA – eigenvalue decomposition of the covariance matrix (equivalently, SVD of the centered data)
• RNN — used when inputs have different lengths but we still want to train one model with shared weights!