Supervised learning – train on labeled pairs (x, y); then, given a new x, compute the probability p(y|x)
generative learning – basically a Bayesian idea: model p(x|y) and the prior p(y), then invert with Bayes' rule. (Need to review the sample vs. population distinction.)
p(y|x) = p(x, y) / p(x) = p(x|y) * p(y) / Σ_y' p(x|y') * p(y') ∝ p(x|y) * p(y), since p(x) does not depend on y and can be dropped when maximizing over y
p(y) is the prior distribution of y.
p(x|y) is the class-conditional probability of x given the label y
So if we assume a prior on y and a form for p(x|y) (i.e. assume the distribution of labels and the distribution of inputs given a label), we can compute the probability of the current observation under each label, then predict the label that maximizes it. (Yes, this relates to MLE: the parameters of p(y) and p(x|y) are themselves fit by maximizing the joint likelihood of the training data.)
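A toy numeric illustration of this maximization (all numbers are made up; assumes numpy):

```python
import numpy as np

# Toy generative classification: assume we already know p(y) and p(x|y)
# for a binary label y and a single discrete feature x in {0, 1, 2}.
p_y = np.array([0.6, 0.4])                # prior: p(y=0), p(y=1)
p_x_given_y = np.array([[0.7, 0.2, 0.1],  # p(x|y=0)
                        [0.1, 0.3, 0.6]]) # p(x|y=1)

x = 2  # observed input
joint = p_x_given_y[:, x] * p_y           # p(x|y) * p(y) for each y
posterior = joint / joint.sum()           # Bayes' rule: p(y|x)
print(posterior, posterior.argmax())      # [0.2, 0.8] -> predict y=1
```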
Gaussian Discriminant Analysis Model
x is a continuous, real-valued vector
The prior for y is Bernoulli(phi)
x|y=0 ~ multivariate Gaussian with mean mu0 and covariance Sigma
x|y=1 ~ multivariate Gaussian with mean mu1 and covariance Sigma — same Sigma!!
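A minimal sketch of the closed-form MLE fit and argmax prediction for this model, assuming numpy and binary labels (the helper names are mine, not from any particular source):

```python
import numpy as np

def gda_fit(X, y):
    """Closed-form MLE for GDA with a shared covariance matrix."""
    phi = y.mean()                             # Bernoulli prior parameter
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / len(y)     # shared covariance
    return phi, mu0, mu1, Sigma

def gda_predict(X, phi, mu0, mu1, Sigma):
    """Pick argmax_y log p(x|y) + log p(y); the Gaussian normalizing
    constant is identical for both classes, so it cancels."""
    Sinv = np.linalg.inv(Sigma)
    def log_score(mu, prior):
        d = X - mu
        return -0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d) + np.log(prior)
    return (log_score(mu1, phi) > log_score(mu0, 1 - phi)).astype(int)
```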
GDA and logistic regression:
If p(x|y) is multivariate Gaussian (with shared Sigma), then p(y|x) follows a logistic function of x.
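A quick numeric check of this claim, with made-up parameters: the Bayes-rule posterior under the two Gaussians matches a sigmoid of a linear function of x.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1 = np.array([0., 0.]), np.array([1., 2.])
Sigma = np.array([[2., 0.5], [0.5, 1.]])
phi = 0.3
Sinv = np.linalg.inv(Sigma)

# Logistic-regression-style parameters implied by the two Gaussians:
w = Sinv @ (mu1 - mu0)
b = 0.5 * (mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1) + np.log(phi / (1 - phi))

x = rng.normal(size=2)

def unnorm_density(mu):
    d = x - mu
    return np.exp(-0.5 * d @ Sinv @ d)  # shared normalizing constant cancels

posterior = (unnorm_density(mu1) * phi /
             (unnorm_density(mu1) * phi + unnorm_density(mu0) * (1 - phi)))
print(np.isclose(posterior, 1 / (1 + np.exp(-(w @ x + b)))))  # True
```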
Q: GDA is a generative algo, while logistic regression is a discriminative algo. When to use one over the other?
- GDA makes stronger assumptions about p(x|y): if p(x|y) is Gaussian and y is Bernoulli, then p(y|x) is logistic, but the converse is not true. If p(y|x) is logistic, p(x|y) could be Gaussian, but it could also be, e.g., Poisson. The assumptions behind logistic regression are therefore weaker.
- When p(x|y) really is Gaussian, GDA is asymptotically efficient, i.e. if the assumption is met, GDA does better than logistic regression (it needs less data to reach the same accuracy).
- Logistic regression is therefore more robust and less sensitive to incorrect modeling assumptions.
x is a discrete vector (the Naive Bayes setting)
Sep 13, 2017
A list of models to review:
- Neural networks
- Energy-based models
- Support Vector Machine
Oct 11, 2017
- today learnt KNN — find the k nearest neighbors of a query point and classify it as the most prevalent class among those neighbors (sketch after this list)
- supervised model: given an input, predict the output — from The Elements of Statistical Learning
- k means — unsupervised clustering: alternately assign each point to its nearest centroid and recompute each centroid as the mean of its assigned points (sketch after this list)
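The KNN sketch mentioned above, assuming numpy, Euclidean distance, and arbitrary tie-breaking:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]                # most prevalent class
```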
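And the k-means sketch (Lloyd's algorithm; the random init and empty-cluster handling are my own simplifications):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate: assign points to the nearest centroid, then move
    each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # distance of every point to every centroid: shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new = np.array([X[assign == j].mean(axis=0) if (assign == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break                                 # converged
        centroids = new
    return centroids, assign
```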
Oct 12, 2017
- decision trees
    - CART — classification and regression tree: at each split, choose a single feature and a threshold (from the machine learning book with TensorFlow); split-search sketch at the end of these notes
    - feature importance — calculated from how much each feature's splits reduce Gini impurity
- MLE vs KL divergence vs cross entropy
    - KL divergence measures the dissimilarity between two distributions, e.g. the empirical distribution of the data and a theoretical (model) distribution
    - MLE tunes the parameters so that the likelihood of the observed data is maximized
    - cross entropy H(p, q) = H(p) + KL(p || q); minimizing it over the model q is equivalent to minimizing KL(p || q), and when p is the empirical distribution this is exactly MLE (numeric check at the end of these notes)
- PCA – eigendecomposition of the covariance matrix of the centered data; projecting onto the top eigenvectors keeps the directions of largest variance (sketch at the end of these notes)
- RNN — used when inputs have different lengths but the same model must handle all of them: the weights are shared across time steps! (sketch at the end of these notes)
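The CART split-search sketch referenced above: one greedy step that scans every (feature, threshold) pair for the lowest weighted Gini impurity (the exhaustive scan is a simplification; real implementations are smarter):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy CART step: pick the (feature, threshold) pair that
    minimizes the weighted Gini impurity of the two children."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue                      # skip degenerate splits
            score = (left.mean() * gini(y[left])
                     + (1 - left.mean()) * gini(y[~left]))
            if score < best[2]:
                best = (j, t, score)
    return best  # (feature index, threshold, weighted impurity)
```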
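The numeric check of the cross entropy / KL / MLE relationship, on made-up distributions:

```python
import numpy as np

# Check the identity H(p, q) = H(p) + KL(p || q).
p = np.array([0.5, 0.3, 0.2])   # "true" / empirical distribution
q = np.array([0.4, 0.4, 0.2])   # model distribution

H_p = -np.sum(p * np.log(p))    # entropy of p
kl = np.sum(p * np.log(p / q))  # KL(p || q)
cross = -np.sum(p * np.log(q))  # cross entropy H(p, q)
print(np.isclose(cross, H_p + kl))  # True

# H(p) doesn't depend on q, so minimizing cross entropy over q is the
# same as minimizing KL(p || q); with p the empirical distribution of
# the data, that is exactly maximizing the log-likelihood (MLE).
```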
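The PCA sketch: eigendecomposition of the covariance of the centered data (the (n-1) normalization and the n_components argument are my choices):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                     # center the data
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    top = eigvecs[:, ::-1][:, :n_components]    # top principal components
    return Xc @ top                             # projected data
```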
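And a bare-bones vanilla-RNN forward pass, just to make the weight sharing across time steps concrete (shapes and names are assumptions, not any library's API):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b, h0=None):
    """Vanilla RNN: the same weights (Wx, Wh, b) are reused at every
    time step, which is why variable-length inputs fit one model."""
    h = np.zeros(Wh.shape[0]) if h0 is None else h0
    for x in xs:                       # xs: sequence of input vectors
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h                           # final hidden state
```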