Machine Learning Re-cap – keep updating

Generative Learning:

Supervised learning – the training input is labeled pairs (x, y). Then, given a new x, calculate the probability p(y|x).

Generative learning – basically the Bayesian idea: model how x is generated under each label, then invert with Bayes' rule. Forgot about sample and population ..

p(y|x) = p(x, y) / p(x) = p(x|y) * p(y) / sum_y' p(x, y') ∝ p(x|y) * p(y)

The denominator p(x) is the same for every label y, so it can be dropped when comparing labels.

p(y) is the prior distribution of y.

p(x|y) is the likelihood: the probability (or density) of x given the label y.

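So if we assume a prior p(y) and a likelihood p(x|y), i.e. the distribution of the labels and the distribution of the input given a label, we can compute p(x|y) * p(y) for the current observation under each label, then pick the label that maximizes it. (Does this relate to MLE as well? It does for fitting: the parameters of p(y) and p(x|y) are usually estimated by maximizing the joint likelihood of the training data.)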
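A minimal sketch of the rule above, for a two-label problem. The prior and likelihood numbers are made up for illustration, and the function name `posterior` is my own, not from any library.

```python
import numpy as np

def posterior(prior, likelihood):
    """Return p(y|x) for every label y.

    prior[y]      = p(y)
    likelihood[y] = p(x|y), evaluated at the single observed x
    """
    joint = likelihood * prior      # p(x|y) * p(y) = p(x, y)
    return joint / joint.sum()      # divide by p(x) = sum over y of p(x, y)

# Hypothetical numbers, for illustration only.
prior = np.array([0.7, 0.3])        # p(y=0), p(y=1)
likelihood = np.array([0.1, 0.6])   # p(x|y=0), p(x|y=1)

post = posterior(prior, likelihood)         # [0.28, 0.72]
predicted_label = int(np.argmax(post))      # 1
```

Dividing by p(x) only rescales the scores, so taking the argmax of `likelihood * prior` directly gives the same label.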

Gaussian Discriminant Analysis Model

x is a continuous real-valued vector

The prior for y is Bernoulli(phi)

x|y=0 ~ multivariate Gaussian with mean mu0 and cov sigma

x|y=1 ~ multivariate Gaussian with mean mu1 and cov sigma (same sigma for both classes!!)
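Under these assumptions the maximum-likelihood estimates have a closed form: phi is the fraction of y=1 labels, mu0 and mu1 are the per-class means, and sigma is the covariance pooled over both classes. Below is a sketch with NumPy; `fit_gda` is a name I made up, and the data is synthetic.

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with one covariance shared by both classes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(int)
    n = X.shape[0]
    phi = y.mean()                      # estimate of p(y=1)
    mu0 = X[y == 0].mean(axis=0)        # mean of class 0
    mu1 = X[y == 1].mean(axis=0)        # mean of class 1
    # Center every sample by the mean of its own class, then pool.
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centered.T @ centered / n   # shared covariance
    return phi, mu0, mu1, sigma

# Synthetic data (assumed for illustration): two Gaussians, same covariance.
rng = np.random.default_rng(0)
true_sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], true_sigma, size=500)
X1 = rng.multivariate_normal([2.0, 2.0], true_sigma, size=500)
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

phi, mu0, mu1, sigma = fit_gda(X, y)
```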

GDA and logistic regression:

If p(x|y) is a multivariate Gaussian with a shared covariance (as in GDA) and y is Bernoulli, then p(y=1|x) is a logistic (sigmoid) function of a linear function of x.
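A quick numerical check of this claim: compute p(y=1|x) once with Bayes' rule and once as sigmoid(theta·x + theta0), where theta = sigma^-1 (mu1 - mu0) and theta0 = -1/2 mu1' sigma^-1 mu1 + 1/2 mu0' sigma^-1 mu0 + log(phi / (1 - phi)). The parameter values and the function names are assumptions for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a multivariate Gaussian N(mu, sigma) at x."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

def posterior_bayes(x, phi, mu0, mu1, sigma):
    """p(y=1|x) from Bayes' rule with the GDA model."""
    j1 = phi * gaussian_pdf(x, mu1, sigma)
    j0 = (1 - phi) * gaussian_pdf(x, mu0, sigma)
    return j1 / (j0 + j1)

def posterior_logistic(x, phi, mu0, mu1, sigma):
    """p(y=1|x) written as a sigmoid of a linear function of x."""
    theta = np.linalg.solve(sigma, mu1 - mu0)
    theta0 = (-0.5 * mu1 @ np.linalg.solve(sigma, mu1)
              + 0.5 * mu0 @ np.linalg.solve(sigma, mu0)
              + np.log(phi / (1 - phi)))
    return 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

# Hypothetical parameters and query point.
phi = 0.3
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
x = np.array([1.0, -0.5])

p_bayes = posterior_bayes(x, phi, mu0, mu1, sigma)
p_logit = posterior_logistic(x, phi, mu0, mu1, sigma)
same = np.isclose(p_bayes, p_logit)
```

The quadratic term in x cancels only because both classes share the same sigma; with different covariances the log-odds would be quadratic in x.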

Q: GDA is a generative algo, while logistic regression is a discriminative algo. When to use one over the other?

  • GDA makes stronger assumptions about p(x|y), i.e. if p(x|y) ~ Gaussian and y ~ Bernoulli, then p(y|x) has the logistic regression form. But the converse is not true: if p(y|x) is logistic, it's still possible that p(x|y) ~ Poisson. Thus the assumption of logistic regression is weaker.
  • When p(x|y) is indeed Gaussian, GDA is asymptotically efficient, i.e. if the assumption is met, GDA is better than logistic regression (it needs less data to fit well).
  • Otherwise, logistic regression is more robust and less sensitive to incorrect modeling assumptions.

Naive Bayes

x is a discrete-valued vector



Sep 13, 2017

A list of models to review:

  • Neural networks
    • GAN
    • Energy-based models
    • CNN
    • LSTM
  • Support Vector Machine
  • PCA
  • KNN

Oct 11, 2017

  • today learnt KNN: select the k nearest neighbors of a point and classify it as the most prevalent class among these neighbors
  • Supervised model: given an input, predict the output (from The Elements of Statistical Learning)
  • k means — ??
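The KNN rule from the first bullet above, as a minimal sketch. It assumes Euclidean distance and a plain majority vote; `knn_predict` and the toy data are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query as the most common label among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest points
    votes = Counter(y_train[int(i)] for i in nearest)
    # On a tie, Counter keeps insertion order, so the label seen first
    # (i.e. from the closest neighbor among the tied labels) wins.
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["a", "a", "a", "b", "b", "b"]

label_near_origin = knn_predict(X_train, y_train, [0.5, 0.5], k=3)   # "a"
label_far = knn_predict(X_train, y_train, [5.5, 5.5], k=3)           # "b"
```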

Oct 12, 2017

Today learnt:

  • decision trees
    • CART (Classification and Regression Tree): each split chooses only one feature and one threshold. (machine learning book with tensorflow)
    • feature importance: calculated using Gini, i.e. how much the splits on a feature reduce Gini impurity
    • Gini
  • MLE vs KL divergence vs cross entropy
    • KL divergence measures the dissimilarity between the empirical distribution and the theoretical (model) distribution
    • MLE tunes the parameters s.t. the likelihood of the seen data is maximized
    • cross entropy ?
  • PCA – eigenvalue decomposition of covariance matrix ??
  • RNN: used when the input sequences have different lengths but the same model has to be trained on all of them!
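On the MLE vs KL divergence vs cross entropy bullets above: for a discrete distribution, cross entropy H(p, q) = H(p) + KL(p || q), and the average negative log-likelihood of the data under a model q equals the cross entropy between the empirical distribution p and q. Since H(p) does not depend on the model, minimizing cross entropy, minimizing KL and maximizing likelihood all pick the same parameters. A small sketch on made-up coin-flip counts (all names are my own):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

# Hypothetical data: 3 tails and 7 heads out of 10 flips.
counts = np.array([3, 7])
n = counts.sum()
p_emp = counts / n                       # empirical distribution [0.3, 0.7]

def model(theta):
    """Bernoulli model: [p(tails), p(heads)]."""
    return np.array([1 - theta, theta])

def avg_neg_log_likelihood(theta):
    return -np.sum(counts * np.log(model(theta))) / n

# Grid search over theta = 0.01, 0.02, ..., 0.99.
thetas = np.linspace(0.01, 0.99, 99)
nll = np.array([avg_neg_log_likelihood(t) for t in thetas])
kl = np.array([kl_divergence(p_emp, model(t)) for t in thetas])

theta_mle = thetas[np.argmin(nll)]       # maximizes the likelihood
theta_kl = thetas[np.argmin(kl)]         # minimizes KL(p_emp || model)
```

Both searches should land on the empirical frequency of heads. The sketch assumes every probability is strictly positive, otherwise the logs blow up.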