Category Archives: Random_old

Some fiction I wrote to get away from the things at hand

Rereading: Hierarchical Attention Networks for Document Classification

The key idea of this paper is that adding information about the hierarchical structure of documents to the model helps produce better text representations.

By document structure, the point is mainly that a document is made of sentences, i.e. it is hierarchical. Earlier approaches concatenated all the sentences and fed them into a single RNN, which loses hierarchical structure such as paragraphs.

Correspondingly, the alternative is to first feed the words of each sentence, step by step, into one RNN to produce sentence representations; then feed all the sentence representations into another RNN to produce a document representation; and finally use that document representation for classification. Because the two networks do not share parameters and the data goes through an RNN twice, this approach can capture the hierarchical structure.

Processing text hierarchically also allows a more flexible use of attention. Implementing attention separately at the word level and the sentence level lets the representation draw more information from particular "important" elements.

The concrete network implementation has a few notable features:

Implementation of the attention mechanism (a minimal sketch follows this list):

  • Feed the word vectors of each sentence into a GRU (see the last post for an explanation of what a GRU is) and collect the hidden state output at every step (so instead of calling PyTorch's nn.GRU directly, the idea is to write a small for-loop and store the per-step results)
  • Feed all the hidden states through an MLP to produce a representation of each word
  • Randomly initialize a context vector, whose meaning is roughly "what the sentence is about". Take the dot product between it and each word representation as the attention score. A higher score means the two vectors are more similar, i.e. the word carries more salient meaning within this sentence, so its attention weight should be higher.
  • Pass the attention scores through a softmax to obtain the weights, then take the weighted sum of the original hidden-state sequence with these weights to get the representation of the whole sentence.
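Below is a minimal PyTorch sketch of the word-level attention just described. It is my own illustration rather than the paper's code; the class name, the dimensions, and the bidirectional GRU are assumptions, and padding is not masked for simplicity. Note that nn.GRU's first return value already contains the hidden state of every time step, so an explicit for-loop is only needed if you want to intervene step by step.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttentionEncoder(nn.Module):
    """Word-level GRU encoder with attention (illustrative dimensions, sketch only)."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.mlp = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        # randomly initialized context vector, standing in for "what the sentence is about"
        self.context = nn.Parameter(torch.randn(2 * hidden_dim))

    def forward(self, word_ids):                          # (batch, n_words)
        x = self.embedding(word_ids)                      # (batch, n_words, embed_dim)
        h, _ = self.gru(x)                                # hidden state at every step
        u = torch.tanh(self.mlp(h))                       # per-word representation
        scores = u @ self.context                         # dot product -> attention scores
        weights = F.softmax(scores, dim=1)                # attention weights
        return (weights.unsqueeze(-1) * h).sum(dim=1)     # weighted sum = sentence vector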

Implementation of the hierarchical structure (a minimal sketch follows this list):

  • Only two sets of GRU units are trained in total: one summarizes sentences, the other summarizes documents. All sentences must therefore be the same length, and all documents must contain the same number of such sentences. During preprocessing, sentences that are too long are truncated and sentences that are too short are padded; the specific lengths can be chosen from the length distribution of the training data, e.g. by picking a quantile.
  • Once the data is padded into a regular shape, first obtain the representation of each sentence with the method described above.
  • Next, for each document, feed all of its sentence representations into the second GRU to obtain the document representation.
  • Finally, use this document representation for classification.
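Continuing the sketch above, a minimal illustration of the hierarchical part: documents are assumed to be padded to a fixed number of sentences and words, the shared word-level encoder produces sentence vectors, a second GRU summarizes them, and a linear layer classifies. For brevity the sentence-level attention is replaced by mean pooling here; the actual model applies the same attention mechanism a second time.

class DocumentClassifier(nn.Module):
    """Sentence-level GRU over sentence vectors, then a classifier (sketch only)."""

    def __init__(self, vocab_size, num_classes, sent_dim=100, hidden_dim=50):
        super().__init__()
        # one shared word-level encoder; its output size is 2 * (sent_dim // 2) = sent_dim
        self.word_encoder = WordAttentionEncoder(vocab_size, hidden_dim=sent_dim // 2)
        self.sent_gru = nn.GRU(sent_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, doc_ids):                            # (batch, n_sents, n_words)
        batch, n_sents, n_words = doc_ids.shape
        sent_vecs = self.word_encoder(doc_ids.view(batch * n_sents, n_words))
        sent_vecs = sent_vecs.view(batch, n_sents, -1)     # (batch, n_sents, sent_dim)
        h, _ = self.sent_gru(sent_vecs)                    # sentence-level hidden states
        doc_vec = h.mean(dim=1)                            # mean pooling instead of attention
        return self.fc(doc_vec)                            # class logits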

Paper: https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf

Another literature-review rant

My master's thesis still needs final revisions, so I have been reading the literature on and off. Starting last Friday, it took me two full days, two evenings, and two early mornings to more or less understand Philip Converse's famous 1964 piece, The Nature of Belief Systems in Mass Publics. I had cited it many times, but only today did I truly understand it. I assumed he wrote badly; when I asked an American, I was told the article is written so elegantly that it is hard to understand.

The first few times I read it, I didn't realize I didn't understand it, but my behavior gave it away: slacking off, browsing the web, playing with my phone, chatting. In short, my body was honestly avoiding it. Around Saturday night I realized it might simply be that the article is hard. On Sunday I started breaking it into sections and honestly working through them one by one. Yesterday I made a little progress. Today I finally figured out roughly what it is doing.

Having worked through this foundational piece, the later papers immediately fall into place; I can see where things come from and where they lead. Humanities material is hard in its own way. This is work that should have been finished in 2015/2016, dragged out until now. The root cause is probably insufficient English reading ability plus insufficient patience. Whatever excuses I make for not being able to write the thesis, the root cause is that I haven't accumulated enough; and the reason I haven't is that whenever I got stuck on a hard point I went around it, wandering in circles and never getting deep into any discipline.

Why is this article hard? First, the man's English winds and turns: a single sentence carries three commas and two subordinate clauses, so the meaning keeps doubling back and the main point is hard to pick out. Of course this also shows how rigorous his thinking is. Second, the structure of the article is not very clear. Contemporary social science authors have learned to be considerate of the reader's patience: the first sentence states the research question, sentences two through four lay out the alternative hypotheses, the fifth gives the conclusion, and the sixth brags. This article reads more like an essay; no such blunt structure is visible. The third difficulty is that I am basically clueless about politics: I only know "liberal" versus "conservative", which in the author's own classification puts me among the ignorant masses at the bottom of society. The fourth difficulty is vocabulary.

Of course, looking on the bright side, finishing a hard article should bring considerable intellectual gains. I have barely read books these past two years, my speech has grown more and more uncultured, and yet I thought I was quite something.

Two more things for this post. One is that training in the humanities probably does make people complicated and conflicted, because studying social facts means taking things that could have been kept simple and chewing them over again and again: people's psychology, expectations about where society is heading. For most of this past year there have been almost no social events in my life, so presumably my emotional intelligence has stalled again.

The second is that the paper mentions at the end that political elites take part in shaping the history of ideas, and their decisions in turn influence the masses. When I studied Marx in college I seem to have encountered a similar idea (class consciousness?), and it seems I formed some ideals because of it.

The production of knowledge is politics, is power. The struggle and pain of the past two years may also have been resistance against an unfamiliar ideology trying to take control of my life. I was unwilling to leave sociology, and I once brainwashed myself into believing I had to transcend the limits of my class. My parents' generation hauled bricks; I refused to haul bricks! But I am much happier now, having chosen a cognitively easier way of life. Hauling bricks has its own happiness; interpersonal struggle drains you and creates no value, and human intelligence should be put toward things of greater aesthetic or practical value.

Python *args and **kwargs

(A new post, in the spirit of always be jabbing, always be firing, always be shipping. )

This post deals with Python *args and **kwargs. Here args and kwargs are just naming conventions; the actual syntax is the * and ** prefixes.

*args

Defines a variable-length set of positional parameters for a function. In plain English, the number and type of the arguments are not known beforehand.

Example:


def test_var_args(f_arg, *argv):
    print("first normal arg: ", f_arg)
    for arg in argv:
        print("another arg through *argv :", arg)

test_var_args('Yasoob', 'python', 'eggs', 'test')

The output is


first normal arg: Yasoob
another arg through *argv : python
another arg through *argv : eggs
another arg through *argv : test


**kwargs

Similar to *args in that it allows a variable number of inputs, but different in that the inputs are named (keyword arguments).

Example:


def table_things(**kwargs):
    for name, value in kwargs.items():
        print('{0} = {1}'.format(name, value))

table_things(apple='fruit', cabbage='vegetable')

The output is


apple = fruit
cabbage = vegetable
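As a further, hypothetical illustration of my own (not from the referenced posts): the same * and ** syntax also works at the call site, unpacking a sequence into positional arguments and a dict into keyword arguments.

def describe(item, *tags, **attributes):
    print("item:", item)
    print("tags:", tags)                # tuple of extra positional args
    print("attributes:", attributes)    # dict of extra keyword args

tags = ['fresh', 'organic']
attributes = {'color': 'red', 'price': 1.2}

describe('apple', *tags, **attributes)
# item: apple
# tags: ('fresh', 'organic')
# attributes: {'color': 'red', 'price': 1.2}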

References:

https://stackoverflow.com/questions/3394835/args-and-kwargs

https://pythontips.com/author/yasoob008/


On learning fast

This semester I'm taking three courses, and two of them surprisingly take up much more time than I expected; I wondered why I spent so much time on them. The courses are NLP and big data with machine learning systems.

Here are the reasons:

  • Too many sources of study material. For NLP, I consulted Andrew Ng's deep learning course material, the Stanford NLP course notes, the NYU NLP course notes, and the book by Goldberg; for deep learning I also read The Elements of Statistical Learning and Deep Learning. Most of them cover the same concepts, but from different perspectives and with different notations. As a result I read many things repeatedly without following a coherent logical flow.
  • Unclear learning objectives. The NLP course notes are not well written, in my opinion: the learning objectives of each section are not clear enough for the reader to grasp what she is doing and why. When learning word embeddings, it was not until I came across this post (http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) that I understood the classification task is not the goal itself but a "fake task" used to learn the word embeddings. This took me days and multiple consultations with the TA!
  • Lack of background knowledge. My fault for being too ambitious. The biggest problem is that I'm not familiar with programming, especially Python OOP. So even though I understand the algorithms in theory, I do not understand them in the code. Worse, being so unfamiliar, I didn't even realize that I didn't understand the code, e.g. the PyTorch class nn.Embedding: I didn't understand what exactly the object is, why it can be called, and what it returns (see the sketch after this list). This hindered my understanding of what exactly the model is building. I also didn't follow the course diligently and missed a few lectures, which made it hard to catch up later.
  • Low productivity. 1) When stuck, the best strategy is to step away for a while to fill in the missing background, or to ask people in order to figure out what to learn. But sometimes I stay idle and waste time, because however long I think about it, I won't understand if pieces of information are missing. 2) No clear objectives for a given time period. I solved this by buying a stopwatch and timing myself on tasks.
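For the nn.Embedding confusion above, a minimal sketch of my own (not from the course): the object is a module wrapping a lookup table, calling it runs its forward pass, and what it returns are the embedding vectors for the given indices.

import torch
import torch.nn as nn

# nn.Embedding holds a (num_embeddings x embedding_dim) weight matrix
embed = nn.Embedding(num_embeddings=10, embedding_dim=4)

# calling the module with integer word ids returns the corresponding rows
word_ids = torch.tensor([1, 5, 5, 0])
vectors = embed(word_ids)
print(vectors.shape)        # torch.Size([4, 4]), one 4-d vector per word id

# the weights are ordinary trainable parameters, which is how the "fake"
# classification task ends up learning the word embeddings
print(embed.weight.shape)   # torch.Size([10, 4])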


Oct 26, 2017 update

Today was not very productive. The problem is that some textbooks are so detailed that going through them takes a lot of time (e.g. the Deep Learning book).

Again it’s a problem of priority.

  1. If I don’t use something, I forget.
  2. If I don’t have a complete understanding of something, I spent more time later because missing crucial information.
  3. But if I start too early, I don’t understand.

Today I spent time on representation learning, but I don't remember most of it now because I lack the background.

Thus, if I go too deep on the first pass through the Deep Learning book without a research topic:

  1. I will spend lots of time and possibly not understand it all.
  2. I cannot memorize every theorem I read in the book; this is not high school.
  3. Since I don't use it, I will forget it.

The approach is learning by doing.

  1. Set a deliverable!!! Don't just read books. Always read a book with a goal!!!

Nov 8 2017 update

Perfectionism 

On blog posts: I wanted to write one, but after two months I still hadn't done it.

In fact, if I limit each post to 1.5 hours, I can get one done each morning. The trick is to do them in pieces, as drafts.

Each draft takes 10 minutes or so, so I can use idle time to edit. Just write whatever I have on the blog.

After accumulating enough drafts I can then publish.

Small groups and getting up early

One thing that has gone well this week is joining a 6:30 am morning exercise group on Meetup. For two years I have claimed I would get up at 6:30 and rarely managed it; since joining the group I have done it three days in a row, and tomorrow should work too. A test I once took said my working style is "obliger", i.e. I need external accountability.

To get up early, I signed up for lots of morning exercise activities. To write, I joined several writing groups. New York has everything.

This pattern showed up during graduate school. The master's program had, first, no structure and, second, no community; since consciousness is a social attribute, when a person is isolated she becomes less and less responsible, self-esteem drops, and a negative cycle sets in.

These past few days I have actually started revising the thesis; the literature review got another major overhaul, and the earlier effort has once again become invisible groundwork. At a talk today I realized that the point of a literature review is to look for alternatives, and I suddenly understood the source of my earlier confusion: I hadn't found any alternatives. Looked at closely, this comes down to the precision of the question. Andreas Glaeser wrote a book!! Not a paper. So there is no narrow research question, or the question is very broad. Searching for literature at that level, of course I found nothing, because with his book as the framework my own question was also posed very vaguely. Once I decompose the question into three explanations, each explanation can be conceptualized. For example, lack of interest in politics = political apathy, civic engagement; lack of interest in protest = mobilization process; lack of interest in democracy I haven't worked out yet, perhaps movement claims.

Then there is writing in Chinese. One difficulty is that this cannot be structured and has no external accountability. I have wanted to write fiction for a long time and have written many fragments, but I have never pulled them together. The difference between the blog and my computer is that the blog is semi-public; I know there is an audience, so I have more motivation when writing, because writing is ultimately communication. Imagining an audience also makes the writing flow better. Writing on my computer, entirely for myself, I lose interest quickly and manage time poorly. On the WeChat public account, by contrast, the audience cannot be fully trusted and I have to think about image management, so the pressure is higher, the task becomes harder, and it is even harder to get it done.

This has to do with the writer's ego. I am not fully confident; otherwise I could go on and on.

Another lesson is that accepting my weaknesses and then prescribing the right remedy works better. I used to hope I would become a very self-disciplined person, but I simply couldn't, and it was painful. If I admit that I just need more external accountability and a more structured life, then joining groups and updating the blog are two methods that seem to work so far. So let's do it that way.

Try it for a week: write a 500-character fragment on the blog every day. Not a very hard task. I also need to find more writing groups, or organize one!

Natural Language Processing Notes – Keep Updating

References –

Structure of this note: I plan to note down key concepts, mathematical constructs and explanations.

Basic tasks in NLP

  • language modeling,
  • POS tagging,
  • named entity recognition,
  • sentiment analysis
  • and paraphrase detection

From a StackExchange user’s posts-

NLP is very vast and varied. Here are a few basic tools in NLP:

  1. Sentence splitting: Identifying sentence boundaries in text
  2. Tokenization: Splitting a sentence into individual words
  3. Lemmatization: Converting a word to its root form. E.g. says, said, saying will all map to root form – say
  4. Stemmer: Similar to a lemmatizer, but it strips a word down to a stem rather than getting to the root form, e.g. laughed, laughing will stem to laugh. However, said, saying will map to sa, which is not particularly enlightening in terms of what "sa" means
  5. POS tagger: Tags a word with the Part of Speech – what is a noun, verb, preposition etc.
  6. Parser: Links words with POS tags to other words with POS tags. E.g. John ate an apple. Here John and apple are nouns linked by the verb – eat. John is the subject of the verb, and apple is the object of the verb.

If you are looking for the state of the art for these tools, check out StanfordCoreNLP, which has most of these tools and a trained model to identify the above from a document. There is also an online demo to check out stanfordCoreNLP before downloading and using it with your application.
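As a rough, hypothetical illustration of tools 1-5 (using NLTK here rather than StanfordCoreNLP; the data-package names can differ across NLTK versions, and the parser from item 6 is not shown):

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('punkt')                           # sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')      # POS tagger model
nltk.download('wordnet')                         # lemmatizer dictionary

text = "John ate an apple. He says he was laughing."

sentences = nltk.sent_tokenize(text)             # 1. sentence splitting
tokens = nltk.word_tokenize(sentences[1])        # 2. tokenization
print(WordNetLemmatizer().lemmatize("says", pos="v"))   # 3. lemmatization -> say
print(PorterStemmer().stem("laughing"))                 # 4. stemming -> laugh
print(nltk.pos_tag(tokens))                             # 5. POS tagging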

NLP has several subfields. Here are a few of them:

  1. Machine Translation: Automatic Translation from one language to another
  2. Information Retrieval: Something like a search engine, that retrieves relevant information from a large set of documents based on a search query
  3. Information Extraction: Extract concepts and keywords – such as names of people, locations, times, synonyms etc.
  4. Deep Learning: has lately become a new approach within NLP, where a system tries to understand a document the way a human does.

2017 Mar 8 Women in Data Science HKU

A project on text classification based on news articles

  • N-grams language model
  • Chinese text segmentation – a tool, Jieba, on GitHub (see the sketch after this list)
  • Hidden Markov Model (HMM)
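A minimal sketch of the Jieba segmentation mentioned above (the sample sentence and expected output follow Jieba's own README example; the HMM flag lets it guess words missing from its dictionary):

import jieba

sentence = "我来到北京清华大学"            # "I came to Tsinghua University in Beijing"
words = jieba.lcut(sentence, HMM=True)     # default precise mode
print(words)                               # ['我', '来到', '北京', '清华大学']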

From the web – general review articles – pop articles 

State of the art – "AI hard" as a concept – strong AI – Ali iDST http://mp.weixin.qq.com/s/hYrNi17HiyW9kxokgLR_Ig – as of 2017 Apr 17

Stage I: 2014

Motivation 

  • They started on human-machine conversation in 2014. Commercial motivation = two trends. First, smartphones are popular, so there are many usage scenarios that mouse and touch cannot serve. Second, Internet services are expanding into every scenario of everyday life! Hence the opening for other interaction methods.

Facts 

  • In early 2014 human-machine conversation was at a primitive stage.
  • On China's Taobao, few people actually used speech to search.
  • In early 2014 many companies' speech-search products died.

Stage II: 2015 – 2016

Facts and current work

  • Alibaba iDST worked with YUNOS on a speech-enabled interactive search assistant (something like Siri?). But then the problem = competing for users' attention with other apps across all activities, so it had to handle what every application does (buy air tickets, buy food, etc.) = a huge workload. Thus in 2015 they transitioned to a platform that offers an API to every app. (no details here)
  • Main tasks now: 1) use machines to understand human language (intention + key information, e.g. booking a train ticket); 2) manage the interaction (keep asking until the key information is obtained); 3) open-ended conversations (Wikipedia…); 4) casual chat (not so valuable in a commercial sense)

Challenges 

  • Engineering: scalability… scaling to a particular domain + scaling to a particular device! This was also discussed in that NLG paper where the authors built a commercialised natural language generation system for a hospital; from an engineering perspective, the challenges include 1) accuracy and 2) a scalable system.
  • Science: general challenges in NLP include the variety of user language (what exactly do you want when you say "I want to go to the US"?) + understanding context + robustness + the shared common sense that makes a conversation flow smoothly

Stage III: 2016 – now (2017 Apr)

Work 

  • The NLP engine moved from traditional machine learning methods to deep learning!! Intention = CNN, slot filling = Bi-LSTM-CRF, context (they didn't specify which model!), robustness = data augmentation
  • Developed a task-flow language to solve the "understand context" problem
  • For domain-specific APIs, developed an OpenDialogue with different specific tasks

Current applications 

  • On smart appliances… something like Amazon Alexa 
  • Some scenarios where human-machine conversation can be useful = the Internet plus cars: talking while you drive.

Challenges again

  • Open-domain NLP — dealing with the scalability of specific domains
  • Current human-machine interactions are mostly single-turn conversations, or can only take limited context into account, so modelling with respect to the specific context is important.
  • Data-driven learning instead of user-defined modules in conversations.

Deep Learning 

  • Caffe – a Berkeley package http://caffe.berkeleyvision.org


Conversation with Yuchen Zhang

Semantic parsing

Natural language – logical form – knowledge base

  • Question answering: what academia cares about is answering relatively complex questions, which requires logical reasoning
    • Argmax …
    • Question — operations on the data source
  • How far this can go is an open problem; any bit of progress can be transferred directly to industry
  • Research community — cannot get the data — abstract the problem — Wikipedia question answering dataset


Personality — papers on sentiment analysis

A simplification of the problem. Why? At first it was not because sentiment analysis was so useful; for Amazon reviews, for example, you can just look at the star rating — no need to read the text.

Chatbot analysis is not very accurate either.

But it is a simplification of the larger problem of semantic understanding: not just the literal information, but deeper information.

Word-extraction-based analysis is rather poor.

In reality it does not transfer particularly well; a community has formed around it, and sentiment analysis no longer cares much about semantic analysis.


Data coding –

Information extraction – given a paragraph, extract the relations between A & B

  • Open source — open IE
  • UWashington

Summarization – an open question, still active research

  • Research question
  • Long paragraphs — short paragraphs
  • Neural models


NLP – it cannot help with understanding, but it can provide basic tools: stock price prediction from news (is Microsoft stock going up?), reviews — user experience as the output.

  • Open source cannot do this
  • What can it do? Extract features
    • Sentiment analysis — emotions, positive or negative
    • Relations
    • Use the feature vector for prediction

It cannot help you understand an article, but it can:

  • Keywords – what a topic model is: first train, unsupervised, on ten thousand news articles to find the ones discussing the same thing (see the sketch after this list)
  • Sentiment analysis — Amazon product reviews
  • End-to-end tasks; summarization is an active research area
    • Baseline: the first line
    • Language independent — Chinese corpora
  • Question answering
  • Reading comprehension
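A minimal sketch of the unsupervised topic-model idea from the first bullet, using scikit-learn's LDA; the four toy documents are my own placeholder, standing in for the ten thousand news articles.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks fell as the market reacted to interest rates",
    "the team won the match after a late goal",
    "investors worry about inflation and the stock market",
    "the coach praised the players after the game",
]

# bag-of-words counts, then LDA discovers latent topics without any labels
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# documents that talk about the same thing get similar topic distributions
print(lda.transform(counts).round(2))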

Corpora:

  • Depends on the problem: word embeddings — news corpora, literary corpora

Linguistics

Percy Liang

A good number of professors in linguistics are linguists who use computational methods

Cho moved onto datasets

Cho is famous for neural machine translation

  • Postdoc
  • Deep learning class paper
  • Attention model
  • GRU
  • Autonomous driving
  • Vision

Chris Manning

  • Deep learning + NLP

NLP is quite hot in industry

  • Semantic parsing — search engines — rule-based is more conservative — Google in search
  • Question answering — dialogue systems — hot — AI assistants
  • Fundamental modules — word embeddings, parsing, POS tagging, dependencies, used as features in applications

On spending Chinese New Year alone in America

Of the people I care about, nine out of ten are in America. This New Year, eight of those nine did not go home. Xiao Hong is taking classes on the East Coast; Xiao Gao is writing code on the West Coast. All the star students are in Chicago, straining to scale the peaks of knowledge. They are scattered across the North American continent, living out of breath, with two things in common: first, on New Year's Eve they all went to a classmate's place and wrapped dumplings haphazardly around nondescript tableware; second, they are all about to become, or already are, code monkeys.

On New Year's Eve I watched the Spring Festival Gala holding a bowl of sticky-rice pork ribs hot enough to scald my mouth, while Mom and Dad sat beside me, heads down, grabbing red envelopes on their phones. Thinking of them having to spend the New Year alone in America made my heart ache. America may be great, but it is not Chinese turf. The New Year is the grand summit production of Chinese tradition: how could you ever gather enough material in America to pull it off?

Eating is the first problem. The Chinese supermarket is far over hill and dale; to get there you have to latch onto someone with a car and chug along for forty minutes. Inside, all you see are dumb-looking frozen fish and frozen shrimp. Our best greasyback shrimp should be thrown, sizzling, into the pot while the live shrimp are still jumping and not paying attention; only then are they sweet on the tongue. To make dumplings you need to mix the pork filling: two pounds of pork chopped every which way into a paste, which ought to be mixed with fresh yellow chives, but the Asian supermarket only has scallions, and limp ones at that. They say pigs in America are electrocuted without being bled, so the meat always has a faint off taste that can only be covered up with cooking wine.

As for more refined pleasures, forget it. Braised chicken legs and braised eggs, mildly spicy; roast spring chicken with crackling skin dripping oil; squirrel mandarin fish dipped in tomato sauce. And meatballs, surely? Soft, melting fish balls, mung-bean balls, tofu balls, straw-cloak balls? And pounded ciba rice cakes? In the morning, ciba pan-fried with brown-sugar threads; in the evening, ciba simmered in pork-rib and lotus-root soup. You pick it up with your chopsticks, it stretches into one long strand, and you stuff it into your mouth whole. Eating, I kept thinking: where are they going to find any of this?

Never mind the fun of fireworks and firecrackers: back home, ten yuan of pop-its is enough to bully the snot-nosed kid next door to tears. Never mind the awkwardness and warmth of visiting relatives: my cousin's child has somehow grown this tall and starts primary school in the autumn, while I have neither graduated nor married and cannot hand out red envelopes, barely worthy of being called auntie.

Never mind the idleness of the holiday. Never mind the wobbling signal of video calls with parents across the ocean. Never mind having to scroll through jokes online just to keep up with the Gala commentary in the WeChat Moments feed. Never mind trying to find a festive New Year mood in the faces of white passers-by. It is just another ordinary day in a small foreign town. Thinking of this, my heart aches a little.

I have been to America twice, both times as an exchange student. Neither stay was long; a bit more than half a year combined. Both times it was winter, and both times I was studying the humanities. In a Midwestern town at minus thirty degrees I would sometimes space out: my hometown never had this much snow, nor this many unfamiliar languages and smiles. When the sun was good I would put myself into the crowd, like releasing a fish into the sea.

On exchange I spent one Chinese New Year and half an American one. On New Year's Eve two years ago, a senior classmate took me out to dinner and invited her friend along. The three of us were from Shanghai, Nanjing and Wuhan, and our favorite foods were Nanxiang soup dumplings, duck-blood vermicelli soup and beef doupi. That day we sat on the second floor of Asian Kitchen, looking out at the rows of streetlights on State Street, the busiest street in town. The hot pot arrived hissing with white steam, a few chili peppers floating in the broth. We couldn't wait to drop in the potato slices; once the soup boiled we could swish the meat, scraping shrimp paste into the pot with small spoons. Over dinner we talked about course selection, courting professors and applications, and how such-and-such professor was a weirdo. The senior was torn between a PhD and a job, over when to go back to China, and over whether to be with the boy in that ambiguous not-quite-relationship that seemed to have no future. After dinner we went our separate ways and I called home; it was afternoon in China. Grandpa asked whether I missed him, and I said yes. I asked whether he missed me, and he laughed and said, "Not that much." My second aunt has a pair of lively boy-girl twins who are slowly growing up and have finally stripped me of my privilege of being the child.

Last November I spent Thanksgiving at an elderly American lady's home. We had never met; it was an activity the school hastily organized to look after unattached international students. Besides taking me in, the household also took in a Chinese boy. They say some American families deliberately invite a few strangers for Thanksgiving, so that, for appearance's sake, everyone is on their best behavior. The old lady sent a two-page email in advance with the menu and the day's schedule. I figured I couldn't show up empty-handed, so the evening before the holiday I rushed out to buy ingredients. The town has few shops to begin with, and with the holiday the streets were pitch dark by six, not a ghost by the roadside, utterly desolate. For the honor of the nation, I finally splurged on a bottle of sweet wine and marched proudly through the door of her immaculate apartment. She really did things properly: the place cards set out for us were fine white porcelain, each name written in cursive. Her friends included a gay couple, a book editor, and a speech therapist with her dog. Wearing the calm, contented smiles of middle-class families in American TV shows, they taught us to carve the turkey, to make gravy for the mashed potatoes, to put soda, lime slices and ice into gin. We talked about China, Trump, and the caribou in the national parks, about friends and family who would never cross paths, laughing in easy rapport. Nothing about it felt like a chance meeting of strangers.

Earlier I had joined a PhD students' group chat by mistake and was too embarrassed to leave, which let me spy on the grand spectacle of their New Year. The theme was a professor inviting students to his home for the holiday, and everyone discussing what dish to bring. Studying is hard and there is little to eat, so every overseas student has cultivated a full set of cooking skills. By my observation, chicken appeared most often. Those who can't cook usually bring a fruit platter, wine, or tangyuan. If a meal includes dumplings, they will presumably appear in the Moments feed in a group photo with the assembled guests. That is when I use a "like" to express my concern for, and blessing of, their well-lived lives.

On a winter night in a foreign land, Dan Bao, a writer I like, described the relationship between festivals and home this way:

Kinship is not a given; it must be continually drawn close, in everyday and in ritual moments, just as lovers need a constant infusion of love's nourishment, just as New Year goods are the resolve to spend the New Year together; beyond that, there is nothing. And home is a choice, made by gathering, by distinguishing, by drawing near, by drifting apart, by the act of hurrying back, for the New Year, to the home one has decided is one's own.

Since starting college I have rarely called the places I live "home". The first year I lived in a dorm; the second and third years in another dorm. The fourth year I rented a place off campus with friends. The fifth year I moved dorms for the third time. This year I have rented yet another room. I often cannot find the right word for these places: "rental" is too bookish, "the place I live" too wordy, "home" too intimate. It was only when I saw a PhD senior refer on his blog to where he lived in America as his "apartment" that I felt the word captured that lukewarm sense of detachment. The difference between an apartment and a home, in Dan Bao's terms, is that missing act of "deciding".

And the New Year is exactly that kind of unmistakable act, full of ritual. Picking up the car in heavy snow, waiting for friends, driving forty minutes to the 99 Ranch Market is one act of deciding. Chopping the filling, rolling out dumpling wrappers, laboriously washing the flour from between your fingers is another. Working out the time difference, hunting everywhere for signal to video-call your parents, watching the Gala online is another. And my counting the days to buy a ticket, dragging a suitcase onto the train, and getting home at eleven at night to a bowl of noodles my parents cook for me is an act of deciding all the same.

I know that for those in America, beyond these rituals and acts of deciding, life holds more important things: LeetCode, interviews, OPT, H1B, the visa lottery, and after that the green card and the good-school-district house. Their American dream runs ahead of mine. I am often ashamed of my own weakness; when I felt lonely in that small town, I knew I would not stay long. But inside their bodies is a train, built for moving forward, not for stopping. Engineer Zhao, Lawyer Qian, Professor Sun, Boss Li: may that land repay your youth and talent with justice, and may your American dreams all come true. And may you have already eaten a steaming plate of dumplings, to get you through this business of spending the New Year alone abroad.

Fifty years left

These past two days I had meals with several teachers and friends, and just now I stumbled on some 9/11 documentaries, which stirred things up again.

Sunday it was the studious Nie and junior classmate Xie: we browsed the Kubrick bookstore, went to a concert, and wandered around Tsim Sha Tsui listening to a local rock band. I griped about academia and got griped back at.

Monday evening I saw Professor Tian and said I wanted to go into business. He first asked: what can your parents give you? He also mentioned supervising students at the Southern University of Science and Technology, saying that if he could build it into another HKUST, "this life would not have been lived in vain." Professor Tian is a good man, very responsible about his students' career development.

Monday evening I also met a junior classmate who wants to do a PhD in psychology. We talked and griped about academia; I'll spare the details.

At noon today I met Hugo, a friend from the internet, who pointed out that my view of academia is biased, probably limited by my current horizon. In fact, plenty of professors win both fame and fortune; academia just accumulates slowly before it pays off. Second, those bosses "running chain stores" are here today and gone tomorrow; the risk is high. Third, my temperament probably couldn't beat them anyway; too mild. She also mentioned her own ambitions in life.

This evening I saw a senior classmate in finance whose work is extremely busy; she still writes essays on the side. She talked about two of her friends who are depressed.

Later in the evening I saw a PhD senior who defends tomorrow. She works on labor movements; I asked how one finds a job with that, and she said, keep it as vague as possible. I mentioned the hardships of a PhD, and she agreed it really isn't easy.

These days the applications are truly nearing the end (??), just wrapping up the personal statement. I really did not apply for a PhD; I am not taking the academic path. But giving it up leaves me unreconciled, and not giving it up leaves me unreconciled too. What is going on?

Just now, watching the 9/11 footage, one clip was about the victims on one of the planes. Ordinary lives, gone just like that. Life is impermanent. Even if I can work in good health until 74, I have only 50 years left. What should I spend these fifty years doing, so that when life ends I will feel it was not lived in vain?

Lately I rarely look at things this way; my sights are fixed on short-term material returns, buying a house, buying a car, and other practical concerns. But in the long run those things will come, and in the end they will all go. Is this one life of mine really just for comfort, enjoyment, and fun?

I still want to do something. It is as if I am back in the surge of feeling, and the sense of social responsibility, I had two years ago watching Occupy Central. I truly had certain social ideals then, but the ambition wore away in daily trivialities and slowly faded. But how do I walk this road? What do I want to do? And how much of it can I actually accomplish?

Thinking of it this way: it is one life, and at worst it is only one life. A plain life is a life; a life of rises and falls is a life. What am I willing to devote these fifty years to? Opening a little shop and being a small-time businessman? I really need to think it over. All those books read for nothing.

The Peking University school of online novelists

While researching how other WeChat public accounts are run, I somehow dug up a novel titled "金融街没有爱情" ("No Love on Financial Street"). The more I read, the more gripping it got; the pacing is good and the language fashionable. Some digging revealed that the author, Liu Yue, graduated from Peking University's Chinese department and is doing a PhD at Berkeley, which startled me. There really are a lot of skilled people on Douban; I just don't understand why one would write this way. Am I too rigid, or is the author simply at ease?

It occurs to me that Tong Hua and Chu Xiangyun are also Peking University alumnae. I read those two while working as my advisor's RA; I suspect my advisor also hung around Jinjiang Literature during her PhD? A humanities PhD life is monotonous: this afternoon I felt I was getting depressed, and in the evening I came back and idled the hours away. So reading novels is good after all.

I think of another one: Jiang Nan, author of "此间的少年". So Peking University has produced all these online novelists while Tsinghua has none. It must be that the STEM course load is heavy: debugging eats a whole evening, a proof you can't crack racks your brain, and there is no time for this sort of thing. Having transferred departments myself, I know it well.

LOG – kdb database / algo trading

This is a log book of my study of the kdb+ database, which will hopefully evolve into implementations of some common trading strategies (see the sketch at the end of this log).

Sep 12 2016

  • Downloaded kdb+ 32 bit
  • Learnt how to invoke kdb+ / the q language
  • Found some resources
    • The cookbook http://code.kx.com/wiki/Cookbook
    • Working with R http://code.kx.com/wiki/Cookbook/IntegratingWithR

Oct 9 2016

  • New direction to explore: Machine learning for trading?
  • Key problems:
    • Access to data
    • Strategies
    • Execute order

Oct 10 2016

  • Where to download free data: QuantQuote, Google, Yahoo
  • Paid data: CSI, which provides data for Yahoo
    • Compiled by CalTech Finance Group – http://quant.caltech.edu/historical-stock-data.html
    • Questions on Stack Overflow, answered by a professional – http://stackoverflow.com/questions/754593/source-of-historical-stock-data
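Since this log is meant to evolve into common trading strategies, here is a minimal Python (not kdb+/q) sketch of one of the simplest, a moving-average crossover. It assumes daily prices have already been downloaded from one of the sources above into a hypothetical CSV with Date and Close columns; the filename and window lengths are illustrative.

import pandas as pd

# hypothetical file downloaded from one of the sources above
prices = pd.read_csv("AAPL_daily.csv", parse_dates=["Date"], index_col="Date")

prices["sma_fast"] = prices["Close"].rolling(20).mean()   # 20-day moving average
prices["sma_slow"] = prices["Close"].rolling(50).mean()   # 50-day moving average

# long (1) when the fast average is above the slow one, flat (0) otherwise;
# shift by one day so the signal only uses information available before the trade
prices["signal"] = (prices["sma_fast"] > prices["sma_slow"]).astype(int).shift(1)

daily_return = prices["Close"].pct_change()
strategy_return = (daily_return * prices["signal"]).fillna(0)
print((1 + strategy_return).prod() - 1)                    # cumulative strategy return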