Monthly Archives: September 2017

Small Circles and Getting Up Early

Something that has gone well this week: I joined a 6:30 am morning exercise group on Meetup. For two years I have been claiming I would get up at 6:30 and rarely managed it; since joining the group I have done it three days in a row, and tomorrow should work too. A personality test once told me my working style is that of an obliger, i.e. I need external accountability.

To get up early, I signed up for a bunch of morning workout events. To write, I joined several writing groups. New York has everything.

The same pattern showed up in grad school. My master's program had, first, no structure and, second, no community. Since consciousness is a social property, a person in isolation becomes less and less accountable, self-esteem drops, and a negative cycle sets in.

These past few days I have actually started revising my thesis. The literature review got another major overhaul, and the earlier effort has once again become invisible groundwork. At a talk today I realized that the point of a literature review is to look for alternatives, and it suddenly became clear why I had been stuck: I had not found any alternatives. Looked at closely, this comes down to the precision of the question. Andreas Glaeser wrote a book!! Not a paper. So there is no narrowly defined research question, or rather the research question is extremely broad. Of course I could not find literature when searching at that level: building on the framework of his book, my own question was also posed very vaguely. Once I broke the question into three explanations, each one could be conceptualized. For example, lack of interest in politics = political apathy, civic engagement; lack of interest in protest = mobilization process; lack of interest in democracy I have not figured out yet, probably movement claims.

Then there is writing in Chinese. The hard part is that this task cannot be structured and has no external accountability. I have wanted to write fiction for a long time and have written many fragments, but I have never pulled them together into finished pieces. The difference between the blog and my computer is that the blog is semi-public; I know there is an audience, which makes writing more motivating, because writing is ultimately a form of communication. Imagining an audience also makes the writing flow better. Writing on the computer, entirely for myself, I lose interest easily and cannot manage my time well. On my WeChat public account, on the other hand, the audience cannot entirely be trusted, I have to think about image management, the pressure is higher, the task becomes harder, and it gets even harder to get it done.

This has to do with a writer's ego. I am not fully confident; otherwise I could write on and on.

Another lesson: accept your own weaknesses and then prescribe the right remedy, and things work better. I used to keep hoping I would become a very self-disciplined person, but I simply could not, and it was miserable. If I admit that I just need more external accountability and a more structured life, then joining groups and posting updates on this blog are two methods that seem to work so far. So that is what I will do.

For one week, I will try writing a 500-character fragment on the blog every day. Not a very hard task. I also need to find more writing groups, or organize one myself!

First Try of AWS

First try of Amazon AWS, for the course Natural Language Processing. I should have reviewed this tutorial a week ago, with all the time I spent on Facebook and Douban and talking to people…

Step-by-step guidance from the course tutor follows; I note down my own understanding and relevant materials along the way.

AWS

    • Sign in to AWS
    • Under the Services tab in the upper left corner, click EC2 under the Compute section
    • In the upper right corner, switch your region to Oregon
    • Under the IMAGES section click AMIs
    • In the dropdown box in the search bar, change from “Owned by me” to “Public images”
    • To launch a CPU EC2 Instance with PyTorch environment
      • Search for NYU-DSGA1011-PyTorch-CPU-0
      • Right click the AMI and click Launch
      • Select your instance type (t2.micro for instance)
      • Click Review and Launch
      • In the pop-up window, select Choose an existing key pair and select your key pair below
      • I need to create my own key pair before this step. What is a key pair? http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair
      • So it seems to be just a credential, much like a username/password. My key pair for this is “nlp2017” and the file is “nlp2017.pem.txt”
      • Click View Instance
      • The instance needs to be in a running state before it can be connected to. How to connect: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstances.html?icmpid=docs_ec2_console
      • Open a terminal window and type “ssh -i [my-key-pair.pem] ec2-user@[dns]”. Replace [my-key-pair.pem] with the path to your key pair and [dns] with the string under the Public DNS (IPv4) column.
        • chmod 600 nlp2017.pem.txt (this fixes the “permissions are too open” error; see https://stackoverflow.com/questions/9270734/ssh-permissions-are-too-open-error)
        • ssh -i nlp2017.pem.txt ec2-user@ec2-52-37-234-187.us-west-2.compute.amazonaws.com
      • source ~/.bashrc
      • cd pytorch_test/src/
      • python pytorch_test_lr_cpu.py
      • You can confirm it’s working by observing messages like “Epoch: [1/5], Step: [100/600], Loss: 2.2161”
      • Example client code from https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/01-basics/logistic_regression/main.py
      • In the future, you can replace the example client code with your own .py file. (A rough sketch of what the example does appears at the end of this list.)
    • To launch a GPU EC2 Instance with PyTorch environment
      • You will be charged for any EC2 Instance with GPU.
      • Remember to stop the instance when you are done! http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html
      • Search for nyu-nlp-pytorch-gpu-0
      • Right click the AMI and click Launch
      • Select g2.2xlarge as your instance type
      • Click Review and Launch
      • In the pop-up window, select Choose an existing key pair and select your key pair below
      • Click View Instance
      • Open a terminal window and type “ssh -i [my-key-pair.pem] ec2-user@[dns]”. Replace [my-key-pair.pem] with the path to your key pair and [dns] with the string under the Public DNS (IPv4) column.
      • source ~/.bashrc
      • cd nlp_client_code/src/
      • python pytorch_cnn_tutorial_gpu.py
      • You can confirm it’s working by observing messages like “Epoch [1/5], Iter [100/600] Loss: 0.2209”
      • Example client code from https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/02-intermediate/convolutional_neural_network/main-gpu.py
      • In the future, you can replace the example client code with your .py file.

 

  • Please don’t forget to stop (or terminate) your instance. Otherwise, you may be charged for AWS usage.
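
For reference, here is a rough, stripped-down sketch of the kind of training loop the example client code runs (the linked tutorial trains logistic regression on MNIST). This version feeds random tensors instead of the real dataset, so the shapes, hyperparameters, and printed numbers are only illustrative; it assumes a reasonably recent PyTorch.

    import torch
    import torch.nn as nn

    # Fake data standing in for MNIST: 600 "images" of 28*28 pixels, 10 classes.
    inputs = torch.randn(600, 28 * 28)
    targets = torch.randint(0, 10, (600,))

    model = nn.Linear(28 * 28, 10)              # logistic regression = one linear layer
    criterion = nn.CrossEntropyLoss()           # softmax + negative log-likelihood
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(5):
        for step in range(0, 600, 100):         # mini-batches of 100
            x, y = inputs[step:step + 100], targets[step:step + 100]
            optimizer.zero_grad()
            loss = criterion(model(x), y)       # the number reported in the "Loss: ..." messages
            loss.backward()
            optimizer.step()
        print('Epoch: [%d/5], Loss: %.4f' % (epoch + 1, loss.item()))

Once your own .py file is copied onto the instance, it is launched the same way as above: ssh in, then run it with python.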

 

 

Keep away from attention drainers

Last night I set off back to the apartment at around 10:30 pm. This morning, when I set off for school, it was around 10:30 am. So I had 12 hours between daily commutes without getting anything done.

How did I spend the time? Mostly on my phone. I did not even turn on my computer.

I remember going to bed at 12:50 am and being woken up by the alarm clock at 7:30 am, so quality sleep was less than six hours. I do not feel like working now because of the lack of rest.

So roughly six hours were wasted. That is a huge hit to productivity, and it cannot continue.

Activities:

  • Posted a WeChat Moment and kept getting drawn back to it
  • Browsed through other people’s WeChat Moments
  • Weibo, which probably took 30 minutes or so
  • Searched Yelp for where to buy a bike in NYC
  • Internet surfing last night, the contents of which I don’t quite remember now

These activities need to be eliminated. I also missed a morning class. Since human attention and time are limited resources, companies build apps and make money off them (e.g., by selling ads)! Every minute I spend on Weibo, the company makes money without really taking my interests into consideration. I am being exploited by a company.

However, it is impossible not to use the iPhone at all, because the tradeoff is losing connection with friends. There is also the convenience of apps, despite the drain on energy.

Steps and plans:

At night, after a day’s work, willpower is weak, so I need concrete steps to avoid being drawn into distractions.

  • When leaving school, open the SelfControl app.
    • Limit daily social media use to 8:00-8:10 am, 12:00-12:10 pm, 6:00-6:10 pm, and 9:50-10:00 pm. That is already 40 minutes in total!
    • This includes: Douban, Weibo, Facebook, WeChat.
  • Between 10 pm and 8 am, do not use the phone.
  • What to do with the extra time?
    • Actually focus on my tasks.
    • If I feel bored, do not reach for the phone or social media; do a five-minute meditation instead.
    • Walk around or go outside.
    • Do not listen to songs, as they lead to unstable emotions and weaker self-control. Do not take in new information (a story, a podcast, etc.). Take a real rest.

I will log my daily progress in a Google Sheet. The goal is to form the habit between Sep 14 and Oct 31. I might slip, but the rule is to never miss two days in a row.

 

Machine Learning Recap – keep updating

Generative Learning:

Discriminative (ordinary supervised) learning – given labeled x and y, model p(y|x) directly: given a new x, calculate the probability of each y.

Generative learning – basically the Bayes-rule idea: model p(x|y) and p(y), then invert. (I have forgotten the sample vs. population details…)

p(y|x) = p(y, x) / p(x) = p(x|y) * p(y) / sum_y' [ p(x|y') * p(y') ] ~ p(x|y) * p(y), since the denominator does not depend on y

p(y) is the prior distribution of y.

p(x|y) is the probability of x given a y

So if we assume a prior for y and a model for p(x|y), i.e. the distribution of labels and the distribution of inputs given a label, then for each label we can compute how probable the current observation is, and pick the label that maximizes p(x|y) * p(y). (This does relate to MLE: the parameters of p(y) and p(x|y) are themselves typically fit by maximizing the likelihood of the training data.)

Gaussian Discriminant Analysis Model

x is a continuous real-valued vector

The prior for y is Bernoulli(phi)

x|y=0 ~ multivariate Gaussian with mean mu0 and cov sigma

x|y=1 ~ multivariate Gaussian with mean mu1 and cov sigma — same sigma!!

GDA and logistic regression :

If p(x|y) is multivariate Gaussian with a shared covariance, then p(y|x) follows a logistic function.

Q: GDA is a generative algorithm, while logistic regression is a discriminative algorithm. When should one be used over the other?

  • GDA makes stronger assumptions about p(x|y): if p(x|y) is Gaussian and y is Bernoulli, then p(y|x) is logistic. But the converse is not true: if p(y|x) is logistic, p(x|y) could instead be, e.g., Poisson. So the assumption behind logistic regression is weaker.
  • When p(x|y) is indeed Gaussian, GDA is asymptotically efficient, i.e. if the assumption holds, GDA does better than logistic regression.
  • Otherwise, logistic regression is more robust and less sensitive to incorrect modeling assumptions.
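
As a concrete illustration of the generative recipe above (estimate the Bernoulli prior phi, the per-class means, and the shared covariance, then pick the label maximizing p(x|y) * p(y)), here is a minimal NumPy sketch for binary labels; the function and variable names are my own, not from any particular library.

    import numpy as np

    def fit_gda(X, y):
        """Estimate GDA parameters: Bernoulli prior phi, class means mu0/mu1, shared covariance sigma."""
        phi = y.mean()                              # p(y = 1)
        mu0 = X[y == 0].mean(axis=0)                # mean of x given y = 0
        mu1 = X[y == 1].mean(axis=0)                # mean of x given y = 1
        centered = X - np.where(y[:, None] == 0, mu0, mu1)
        sigma = centered.T @ centered / len(y)      # single covariance shared by both classes
        return phi, mu0, mu1, sigma

    def predict_gda(X, phi, mu0, mu1, sigma):
        """Pick the label maximizing log p(x|y) + log p(y)."""
        inv = np.linalg.inv(sigma)
        def log_joint(mu, prior):
            d = X - mu
            return -0.5 * np.einsum('ij,jk,ik->i', d, inv, d) + np.log(prior)
        return (log_joint(mu1, phi) > log_joint(mu0, 1 - phi)).astype(int)

Because the covariance is shared, the log-determinant and constant terms of the two Gaussians cancel in the comparison, which is why they are dropped; this is also exactly the setting in which p(y|x) comes out logistic.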

Naive Bayes

x is a vector of discrete features
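
As a sketch of the same generative recipe with discrete inputs, here is a Bernoulli Naive Bayes in NumPy for binary features and binary labels; the Laplace smoothing and all the names here are my own illustrative choices, not from the course.

    import numpy as np

    def fit_bernoulli_nb(X, y, alpha=1.0):
        """Estimate p(y) and, per class, p(x_j = 1 | y), assuming features are
        conditionally independent given the label (the 'naive' assumption); alpha is Laplace smoothing."""
        phi = y.mean()                                                    # p(y = 1)
        theta1 = (X[y == 1].sum(axis=0) + alpha) / ((y == 1).sum() + 2 * alpha)
        theta0 = (X[y == 0].sum(axis=0) + alpha) / ((y == 0).sum() + 2 * alpha)
        return phi, theta0, theta1

    def predict_bernoulli_nb(X, phi, theta0, theta1):
        """Pick the label maximizing log p(x|y) + log p(y), as in the generative recipe above."""
        log1 = (X * np.log(theta1) + (1 - X) * np.log(1 - theta1)).sum(axis=1) + np.log(phi)
        log0 = (X * np.log(theta0) + (1 - X) * np.log(1 - theta0)).sum(axis=1) + np.log(1 - phi)
        return (log1 > log0).astype(int)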

 

 

Sep 13, 2017

A list of models to review:

  • Neural networks
    • GAN
    • Energy-based models
    • CNN
    • LSTM
  • Support Vector Machine
  • LASSO
  • PCA
  • KNN

Oct 11, 2017

  • Today learnt KNN — find the k nearest neighbors of a point and classify it as the most prevalent class among those neighbors (see the sketch after this list)
  • Supervised model: given input, predict output — from The Elements of Statistical Learning
  • k-means — ??
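
A tiny NumPy sketch of the KNN rule from the first bullet (measure the distance to every training point, take the k closest, majority vote); the function name and the choice of Euclidean distance are just illustrative.

    import numpy as np

    def knn_predict(X_train, y_train, x, k=5):
        """Classify x as the most prevalent class among its k nearest training neighbors."""
        dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distance to every training point
        nearest = np.argsort(dists)[:k]                    # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]                   # majority vote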

Oct 12, 2017

Today learnt:

  • decision trees
    • CART — classification and regression tree: at each step choose a single feature and a threshold to split on (from the machine learning book with TensorFlow)
    • feature importance — calculated using Gini
    • Gini impurity (a sketch of the Gini split criterion follows at the end of this list)
  • MLE vs KL divergence vs cross entropy
    • KL divergence measures the dissimilarity between an empirical distribution and a theoretical distribution
    • MLE tunes the parameters so that the likelihood of the observed data is maximized
    • cross entropy H(p, q) = -sum p log q; minimizing it over q is the same as minimizing KL(p||q), since the entropy of p is a constant, and with p the empirical distribution it is the same as MLE
  • PCA – project onto the top eigenvectors of the covariance matrix, i.e. do an eigenvalue decomposition of the covariance matrix and keep the directions of maximum variance
  • RNN — used when inputs have different lengths but the same model (shared weights across time steps) has to be trained!
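
To make the CART bullet concrete, here is a minimal NumPy sketch of the Gini impurity and the weighted impurity of a split, which is the quantity CART greedily minimizes when it picks one feature and one threshold; the function names are just illustrative.

    import numpy as np

    def gini(labels):
        """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def split_impurity(feature_values, labels, threshold):
        """Weighted Gini impurity after splitting one feature at one threshold."""
        mask = feature_values <= threshold
        left, right = labels[mask], labels[~mask]
        if len(left) == 0 or len(right) == 0:
            return gini(labels)                            # degenerate split: nothing gained
        n = len(labels)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

Feature importance in tree libraries is then typically computed from how much each feature's splits reduce this impurity, summed over the tree.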