Category Archives: Random_old

Some fiction I wrote to get away from the stuff at hand

Python *args and **kwargs

(A new post, in the spirit of always be jabbing, always be firing, always be shipping.)

This post deals with Python *args and **kwargs. Here "args" and "kwargs" are just naming conventions; the actual syntax is the * and ** prefixes.


*args lets a function accept a variable-length parameter list. In plain English, the number of positional arguments is not known beforehand.


def test_var_args(f_arg, *argv):
    print("first normal arg:", f_arg)
    for arg in argv:
        print("another arg through *argv :", arg)

test_var_args('Yasoob', 'python', 'eggs', 'test')

The output is

first normal arg: Yasoob
another arg through *argv : python
another arg through *argv : eggs
another arg through *argv : test


**kwargs is similar to *args in that it enables variable-length inputs, but differs in that the inputs are named (keyword arguments).


def table_things(**kwargs):
    for name, value in kwargs.items():
        print('{0} = {1}'.format(name, value))

table_things(apple = 'fruit', cabbage = 'vegetable')

The output is

apple = fruit
cabbage = vegetable
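The * and ** syntax also works in the other direction, at the call site: * unpacks a sequence into positional arguments and ** unpacks a dict into keyword arguments. A minimal sketch (the function and values here are invented for illustration):

```python
def describe(item, kind, origin):
    # An ordinary three-parameter function.
    return '{0} is a {1} from {2}'.format(item, kind, origin)

args = ('apple', 'fruit')              # unpacked into item, kind
kwargs = {'origin': 'the orchard'}     # unpacked into origin=...

# Equivalent to describe('apple', 'fruit', origin='the orchard')
result = describe(*args, **kwargs)
print(result)  # apple is a fruit from the orchard
```

So the same two symbols cover both collecting arguments in a definition and spreading them out in a call.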



On learning fast

This semester I'm taking three courses, and two of them surprisingly take up much more time than I expected; I wondered why I spent so much time on them. The courses are NLP and big data with machine learning systems.

Here are the reasons:

  • Too many sources of study material. For NLP, I consulted Andrew Ng's deep learning course material, the Stanford NLP course notes, the NYU NLP course notes, and the book by Goldberg; for deep learning I also read The Elements of Statistical Learning and Deep Learning. Much of this covers the same concepts, but from different perspectives and with different notation. Thus I read many things repeatedly, without following a coherent logical flow.
  • Unclear learning objectives. The NLP course notes are not well written, in my opinion: the learning objectives for each section are not stated clearly enough for the reader to grasp what s/he is doing and why. In learning word embeddings, it was not until I came across a blog post that I understood the classification task is not the purpose here but a "fake task" used to learn the embeddings. This took me days and multiple consultations with the TA!
  • Lack of background information. My fault for being too ambitious. The biggest problem is that I'm not familiar with programming, especially Python OOP. Thus, even though I understand the algorithms theoretically, I do not understand them in code. Worse, being so unfamiliar, I didn't realize that I didn't understand the code. E.g. the PyTorch class nn.Embedding: I didn't understand what exactly the object is, why it can be called, and what is returned. This hindered my ability to understand what exactly the model building is. I also didn't follow the course industriously, and the lectures I missed made it hard to catch up later.
  • Low productivity. 1) When stuck, the best strategy is to step away for a while to fill in the background information, or to ask people in order to figure out what to learn. But sometimes I stay idle and waste time, because no matter how long I think about something, I won't understand it while pieces of information are missing. 2) No clear objectives for a given time period. I solved this by buying a stopwatch and timing myself on tasks.
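On the nn.Embedding confusion above: conceptually, an embedding layer is just a lookup table that maps integer word indices to rows of a weight matrix. Below is a minimal pure-Python sketch of that idea (no PyTorch; the table values are invented for illustration, and the real nn.Embedding additionally makes the table a trainable parameter):

```python
# A toy "embedding layer": one vector per word index.
# nn.Embedding(num_embeddings, embedding_dim) wraps the same idea,
# except its table is a trainable parameter matrix.
embedding_table = [
    [0.1, 0.2, 0.3],   # index 0, e.g. "the"
    [0.4, 0.5, 0.6],   # index 1, e.g. "cat"
    [0.7, 0.8, 0.9],   # index 2, e.g. "sat"
]

def embed(indices):
    """Look up one vector per index -- this is what calling the layer returns."""
    return [embedding_table[i] for i in indices]

print(embed([1, 2]))  # [[0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
```

Calling the layer on a batch of indices simply returns the corresponding rows; training only changes the numbers in the table, not the lookup mechanics.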


Oct 26, 2017 update

Today was not very productive. The problem is that some textbooks are so detailed that going through them requires lots of time (e.g. the deep learning book).

Again it’s a problem of priority.

  1. If I don’t use something, I forget.
  2. If I don’t have a complete understanding of something, I spend more time later because crucial information is missing.
  3. But if I start too early, I don’t understand.

Today I spent time on representation learning, but I remember little of it now because I lack the background information.

Thus, if I go too deep the first time through the deep learning book without a research topic:

  1. I will spend lots of time and possibly cannot understand it all.
  2. I cannot memorize every theorem I read in the book; this is not high school.
  3. Since I don’t use it, I will forget it.

The approach is learning by doing.

  1. Set a deliverable!!! Don't just read books; always read a book with a goal!!!

Nov 8 2017 update


As for blog posts, I wanted to write one, but after two months I still haven't.

In fact, if I limit each post to 1.5 hours, I can already get one done each morning. The trick is to do them in pieces, as drafts.

Each draft takes ten minutes or so, so I can use idle time to edit. Just write whatever I have on the blog.

After accumulating enough drafts I can then publish.


What has gone well so far this week: I joined a 6:30 a.m. morning-exercise group on Meetup. For two years I have claimed I would get up at 6:30 and rarely managed it; since joining the group I have done it three days in a row, and tomorrow should be fine too. I once took a test that said my working style is that of an "obliger", that is, someone who needs external accountability.


The same pattern showed up during my graduate studies. A master's program has, first, no structure and, second, no community. Since conscientiousness is a social attribute, a person in isolation grows less and less responsible, self-esteem drops, and a negative cycle sets in.

These past few days I have actually started revising my thesis; the literature review went through another big overhaul, and the earlier effort has again turned into invisible groundwork. At a talk today I realized that the problem with a literature review is finding alternatives, and I suddenly understood why I had been confused: I had not found any alternatives. Looked at closely, this comes down to the precision of the question. Andreas Glaeser wrote a book!! Not a paper. So it has no finely specified research question, or rather its research question is very broad. Searching for literature at that level, of course I found nothing: following the framework of his book, my own question was also posed very loosely. Once I decompose the question into three explanations, each explanation can be conceptualized. For example, lack of interest in politics = political apathy, civic engagement; lack of interest in protest = the mobilization process; lack of interest in democracy: I have not worked this one out yet; it should be something like movement claims.

Then there is writing in Chinese. One difficulty is that this cannot be structured and has no external accountability. I have wanted to write fiction for a long time and have written many fragments, but I have never pulled them together. The difference between a blog and a private file on my computer is that the blog is semi-public: I know there is an audience. That makes writing more motivating, since writing is ultimately a form of communication, and imagining the audience also makes the writing flow better. Writing only for myself on my computer, I quickly lose interest and lose track of time. On a public WeChat account, by contrast, the audience cannot entirely be trusted; I have to manage my image, so the pressure is higher, the task becomes harder, and it is even harder to get it done.

This has to do with the writer's ego. I am not fully confident; otherwise I could go on without stopping.



Natural Language Processing Notes – Keep Updating

References –

Structure of this note: I plan to note down key concepts, mathematical constructs and explanations.

Basic tasks in NLP

  • language modeling,
  • POS tagging,
  • named entity recognition,
  • sentiment analysis
  • and paraphrase detection

From a StackExchange user’s posts-

NLP is very vast and varied. Here are a few basic tools in NLP:

  1. Sentence splitting: Identifying sentence boundaries in text
  2. Tokenization: Splitting a sentence into individual words
  3. Lemmatization: Converting a word to its root form. E.g. says, said, saying will all map to root form – say
  4. Stemmer: It is similar to a lemmatizer, but it stems a word rather than get to the root form. e.g. laughed, laughing will stem to laugh. However, said, saying will map to sa – which is not particularly enlightening in terms of what “sa” means
  5. POS tagger: Tags a word with the Part of Speech – what is a noun, verb, preposition etc.
  6. Parser: Links words with POS tags to other words with POS tags. E.g. John ate an apple. Here John and apple are nouns linked by the verb – eat. John is the subject of the verb, and apple is the object of the verb.

If you are looking for the state of the art in these tools, check out StanfordCoreNLP, which includes most of them along with trained models to identify the above in a document. There is also an online demo for trying StanfordCoreNLP before downloading it and using it with your application.
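For intuition, a few of the basic tools above can be crudely approximated in a handful of lines of plain Python. This sketch is nothing like what StanfordCoreNLP actually does (which uses trained models), but it shows what each step consumes and produces; the suffix list in the toy stemmer is invented:

```python
import re

def split_sentences(text):
    """Very crude sentence splitting on ., ! and ? boundaries."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def stem(word):
    """A toy stemmer: chop common suffixes, as in laughed/laughing -> laugh."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

text = "John ate an apple. Mary was laughing."
sentences = split_sentences(text)
print(sentences)                                   # ['John ate an apple.', 'Mary was laughing.']
print([stem(t) for t in tokenize(sentences[1])])   # ['mary', 'was', 'laugh']
```

Real tools replace each of these functions with a statistical model, but the pipeline shape (text to sentences to tokens to normalized tokens) is the same.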

NLP has several subfields. Here are a few of them:

  1. Machine Translation: Automatic Translation from one language to another
  2. Information Retrieval: Something like a search engine, that retrieves relevant information from a large set of documents based on a search query
  3. Information Extraction: Extract concepts and keywords – such as names of people, locations, times, synonyms etc.
  4. Deep learning has lately become a new approach in NLP, where a system tries to understand a document the way a human understands it.

2017 Mar 8 Women in Data Science HKU

A project of text classification based on news articles

  • N-grams language model
  • Chinese text segmentation – a tool Jieba on github
  • Hidden Markov Model (HMM)
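Of the three items above, the n-gram language model is the most self-contained: estimate the probability of the next word from counts of short word sequences. A minimal bigram sketch on a made-up corpus:

```python
from collections import defaultdict

# Toy corpus; a real model would be trained on far more text.
corpus = "the cat sat on the mat the cat ate".split()

# Count bigram and unigram (as-context) occurrences.
bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[(prev, word)] += 1
    unigram_counts[prev] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob('the', 'cat'))  # 2/3: "the" is followed by "cat" twice, "mat" once
```

For Chinese text, a segmenter such as Jieba would produce the word list first, since there are no spaces to split on.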

From the web – general review articles – pop articles 

State of the art – AI-hard (a concept) – strong AI – Ali iDST – as of 2017 Apr 17

Stage I: 2014


  • They started on human-machine conversation in 2014. The commercial motivation came from two trends. First, smartphones were popular, creating usage scenarios that mouse and touch cannot satisfy. Second, Internet services were expanding into every scenario of everyday life, opening possibilities for other interaction methods.


  • In early 2014, human-machine conversation was at a primitive stage.
  • On Taobao in China, few people actually used speech to search.
  • In early 2014, many companies' speech-search products died.

Stage II: 2015 – 2016

Facts and current work

  • Alibaba iDST worked with YunOS on a speech-enabled interactive search assistant (something like Siri?). But then a problem arose: competing for users' attention with other apps across all activities means keeping track of what every application does (buying air tickets, ordering food, etc.), a huge workload. Thus in 2015 they transitioned to building a platform that provides an API to every app (no details here).
  • Main tasks now: 1) use machines to understand human language (intention + key information, e.g. booking a train ticket); 2) manage the interaction (keep asking until the key information is obtained); 3) open-ended conversations (Wikipedia…); 4) casual chat (not so valuable in the commercial sense).


  • Engineering: scalability, both across particular domains and across particular devices. This was also discussed in that NLG paper whose authors built a commercialized natural language generation system for a hospital. From an engineering perspective, the challenges include 1) accuracy and 2) a scalable system.
  • Science: general challenges in NLP include the variety of users' language (what exactly do you want when you say "you want to go to the US"?), understanding context, robustness, and the shared common sense that makes conversation flow smoothly.

Stage III: 2016 – now (2017 Apr)


  • The NLP engine moved from traditional machine learning methods to deep learning: intention detection = CNN, slot filling = Bi-LSTM-CRF, context (model not specified), robustness = data augmentation.
  • Developed a task-flow language to address the "understand context" problem.
  • For the per-domain API, developed an OpenDialogue with different specific tasks.

Current applications 

  • On smart appliances… something like Amazon Alexa.
  • Some scenarios where human-machine conversation can be useful: the Internet and cars, talking while you drive.

Challenges again

  • Open-domain NLP: dealing with scalability across specific domains.
  • Current human-machine interactions are mostly single-round conversations, or can take only limited context into account, so modelling with respect to specific context is important.
  • Data-driven learning instead of user-defined modules in conversations.

Deep Learning 

  • caffe – a Berkeley package



Conversation with Yuchen Zhang

Semantic parsing

Natural language – logical form – knowledge base

  • Question answering: what academia cares about is answering relatively complex questions, which requires logical reasoning
    • Argmax …
    • Question – operations over the data source
  • How far this can go is an open problem; any progress transfers directly to industry
  • The research community cannot get the (industrial) data, so it abstracts the problem, e.g. the Wikipedia question-answering dataset


Personality – papers on sentiment analysis

Why was the problem simplified? At first it was not because sentiment analysis was all that useful: for Amazon reviews, for example, you can just look at the star rating; there is no need to read the text.

Chatbot analysis is not very accurate either.

Word-extraction analysis is fairly poor.

In reality it does not transfer especially well, but a community has formed around it; sentiment analysis no longer cares much about semantic analysis.


data coding –

information extraction – from a paragraph, extract relations between A and B

  • Open source – Open IE
  • UWashington

summarization – an open question, still active research

  • Research question
  • Long paragraphs to short paragraphs
  • Neural models


NLP cannot help with understanding, but it can provide basic tools: stock-price prediction from news (is Microsoft stock going up?), reviews for user experience, outputs

  • Open source cannot do this
  • What can it do? Extract features
    • Sentiment analysis – positive or negative emotions
    • Relations
    • Feature vectors for prediction
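To make the "extract features" point concrete: each review can be mapped to a small feature vector, e.g. counts of words from positive and negative lexicons, which any downstream predictor can consume. A toy sketch (the two lexicons are invented; real systems use far larger ones and learned weights):

```python
# Tiny hand-picked lexicons -- a real system would use much larger ones.
POSITIVE = {'good', 'great', 'love'}
NEGATIVE = {'bad', 'poor', 'hate'}

def review_features(text):
    """Map a review to a 2-dim feature vector: (positive count, negative count)."""
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos, neg)

print(review_features("great phone, love it, not bad at all"))  # (2, 1)
```

The vectors from many reviews can then be fed to any classifier or regression model for prediction.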


  • Keywords – what a topic model is: first do unsupervised training on, say, ten thousand news articles to find articles discussing the same thing
  • Sentiment analysis – Amazon product reviews
  • End-to-end tasks; summarization is an active research area
    • Baseline: the first line (of the document)
    • Language independent – Chinese corpora
  • question answering
  • Reading comprehension


  • Sub-problems: word embeddings – news vs. literary corpora


Percy Liang

A fair number of professors in linguistics are linguists applying computational methods.

Cho turned to datasets.

Cho became famous for neural machine translation.

  • Postdoc
  • Deep learning class paper
  • Attention model
  • GRU
  • Autonomous driving
  • Vision

Chris Manning

  • Deep learning + NLP

NLP is quite hot in industry

  • semantic parsing – search engines – rule-based approaches are more conservative – Google uses it in search
  • Question answering – dialog systems – hot – AI assistants
  • Fundamental modules – word embedding, parsing, POS tags, dependencies – used as features in applications









During my exchange I spent one Chinese New Year and half an American one abroad. On New Year's Eve two years ago, a senior schoolmate took me out to dinner and invited a friend of hers along. The three of us were from Shanghai, Nanjing, and Wuhan, and our favorite foods were, respectively, Nanxiang soup dumplings, duck-blood vermicelli soup, and beef doupi. That day we sat on the second floor of Asian Kitchen, watching the rows of streetlights on State Street outside the window, the busiest street in town. The hot pot arrived hissing with white steam, a few chili peppers floating in the broth. We could not wait to drop in the potato slices; once the soup boiled we could swish the meat, and we used small spoons to scoop shrimp paste into the pot. Over dinner we talked about course selection, contacting professors, and applications, and about how such-and-such professor was a weirdo. The senior student was torn between a PhD and a job, over when to return to China, and over whether to be with a boy, an ambiguous relationship that seemed to have no future. After dinner we went our separate ways, and I called home; it was afternoon in China. My grandfather asked whether I missed him, and I said yes. I asked whether he missed me; he laughed and said, "Not that much." My second aunt has a pair of lively boy-girl twins who, as they grow up, have finally stripped me of my privilege of being the little one.





Since starting university I have rarely called the place I live "home." The first year I lived in a dorm, and the second and third years in another dorm. The fourth year I rented an apartment with friends. The fifth year I moved dorms for the third time. This year I am renting yet another room. I can never find the right word for these residences: "rental unit" is too formal, "the place I live" too wordy, "home" too intimate. Then I saw a PhD student refer on his blog to his residence in America as his "apartment," and the word seemed to capture exactly that lukewarm sense of detachment. The difference between an apartment and a home, in the writer Dan Bao's phrase, is the missing act of "commitment."








At noon today I met Hugo, a friend from the Internet, who pointed out that my view of academia is biased, probably limited by my current horizons. In fact, quite a few professors win both fame and fortune; academia simply rewards long accumulation before the payoff. Second, the bosses "running chain stores" are here today and gone tomorrow; the risk is high. Third, my temperament would not necessarily beat theirs; it is too mild. She also talked about her own life ambitions.












LOG – kdb database / algo trading

This is a logbook of my study of the kdb+ database, which I hope will evolve into implementations of common trading strategies.

Sep 12 2016

  • Downloaded kdb+ 32 bit
  • Learnt how to invoke kdb+ and the q language
  • Found some resources
    • The cookbook
    • Working with R

Oct 9 2016

  • New direction to explore: Machine learning for trading?
  • Key problems:
    • Access to data
    • Strategies
    • Execute order
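On the "strategies" bullet: a classic first strategy to implement is the moving-average crossover, which goes long when a short-window average crosses above a long-window one. The sketch below is plain Python rather than q, with made-up prices, just to pin down the logic:

```python
def sma(prices, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out

def crossover_signals(prices, short=2, long=3):
    """+1 where the short SMA crosses above the long SMA, -1 where it crosses below."""
    s, l = sma(prices, short), sma(prices, long)
    signals = []
    for i in range(1, len(prices)):
        if None in (s[i], l[i], s[i - 1], l[i - 1]):
            signals.append(0)
        elif s[i - 1] <= l[i - 1] and s[i] > l[i]:
            signals.append(+1)   # bullish crossover
        elif s[i - 1] >= l[i - 1] and s[i] < l[i]:
            signals.append(-1)   # bearish crossover
        else:
            signals.append(0)
    return signals

prices = [10, 9, 8, 9, 11, 12, 11, 9, 8]
print(crossover_signals(prices))  # [0, 0, 0, 1, 0, 0, -1, 0]
```

In q, the rolling averages would be one-liners over a price column; this is only the logic, not an endorsement of the strategy.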

Oct 10 2016

  • Where to download free data: QuantQuote, Google, Yahoo
  • Paid data: CSI, which provides the data behind Yahoo
    • A list compiled by the CalTech Finance Group –
    • Questions on StackExchange, answered by a professional –




----In the morning I first met Wendy Griswold, an old friend of my advisor and the person who introduced me to Northwestern. Her hair is all in small curls and her hairline has receded slightly; she must be in her sixties, but she looks very energetic. In conversation she cuts straight to the point; her methodology is clearly rock solid. We first chatted about Joshua Wong in Hong Kong. I have found that Hong Kong experience makes excellent small talk here; foreigners all seem curious about it.

I gave Wendy a rough account of my data and the patterns in it. Wendy said, incisively: I think your situation is that you have some conjectures the data cannot prove, so you need to go back to the field and keep asking until the evidence is convincing.


Wendy added: 1) you should ask your advisor, who is an ethnographer; 2) we are not doing hard science, and not everything needs 0.05 significance. You have a claim; if you can produce evidence for it, that is enough.

With that the conversation ended; Wendy rushed off to a meeting. I checked the recording: 17 minutes. This professor is remarkably efficient. When I was on exchange at UW-Madison, conversations with professors often rambled for ages without getting to the point; back then I understood nothing yet felt I understood everything. Youth is funny.

----In the afternoon I attended a seminar, started zoning out halfway through, and looked up a professor Wendy had recommended (Angela Wu @ CUHK). I found that her research had been covered by CNPolitics (政见), translated by the well-known Zhao Mengyang (pen name Xi'an). So I started browsing CNPolitics, where I also found Prof. Zhu Jiangnan at HKU and an article she co-wrote with "the late Duke professor Shi Tianjian." Then I looked up Shi Tianjian and learned he was Liu Yu's senior schoolmate. After a round of academic gossip, the seminar conveniently ended.

I went to see Professor Michael. He pointed out that my research question is not sorted out. Do I want to know

  1. Why: why do some mainland students hold different views, i.e. how is political cognition formed, or
  2. What: how do mainland students view Occupy Central, and how do they view democratic government and social movements.

The first question is explanatory and could be put in dialogue with Andreas Glaeser's book, but it is very hard. I have some conjectures, such as family, major, Hong Kong friends, and so on. But if I ask them about these directly, the answers will lack explanatory power; it is more convincing if they draw a causal link without prompting. Large-sample data would also help.

The second question is descriptive, but it can still engage theory; the key is whom you speak to. Describe how a group that is (on some dimensions) highly homogeneous diverges on the plane of political understanding. (One could still take the different questions as axes and draw a distribution.)

For example, some say social movements are useless; what do they mean by "useless"?

This is really the stuff of cultural sociology: how they understand things and make sense of their lives. One insight of cultural sociology is that meaning is relational. So looking at how mainland students view Hong Kong's social movements, how they view democracy, and how they view the Party, as one whole set of understandings, is also very interesting.

As for next steps, both professors mentioned continuing the interviews. Michael suggested going back to people I have already interviewed with follow-ups: "find the meat." Given the two questions above, there are then two ways to ask.

Why – ask for more personal detail, a deeper life history, without prompting. The American literature has much on class, gender, and region, but watch for the unexpected things they say, e.g. media? certain forms of Chinese literature? youth groups? These are all cultural programs.

What – ask for more about their understanding. How do they make sense of democratic movements? Perspectives on the CCP? Perspectives on other democracies? The bigger picture of the whole political culture, and how their understanding of this particular movement is located in that picture; how that shaped their understanding of it. How do they imagine democracy? "Meaning is relational." Operationally, this means asking them to elaborate their points: where do you get that idea from?

Following on from that, the questions to ask include: what did they say, and what did they not say?

On the literature, the first question belongs mainly to the sociology of knowledge. Besides Political Epistemics, I can also engage classic works such as Karl Mannheim. The most fundamental question in the sociology of knowledge is the social determinants of thought: how do you explain the social thought of particular groups? I joked that I fall asleep every time I read PE; he said the book is very impressive, first because it draws very fine-grained material out of just a few informants, and second because it weaves those interviews together with other historical materials to produce highly convincing conclusions. He said that making a master's thesis reach that level would be very difficult.

For the second question, relevant literature includes Nina Eliasoph on "empathy" and civic imagination: how people perceive politics.

About China – sociological work on civil society. You can probably find a gap there; if that's the case I would be interested to know, and it would be an obvious contribution. For now, most of the work on civil society focuses on activists who are in the movements, or on state officials. If you find that nobody has studied young people who observe movements but are not in them, that makes your case interesting.

Also look into cultural sociology, phenomenology, and symbolic interactionism. Look for the texture of people's understanding, the texture of meaning.

[My thoughts: on the plane spanned by different dimensions of political understanding, if I incorporate the dimension of time, I can argue about the continuity and change that are happening; cf. the work on crime and life trajectories, growing out of crime. Why?]

Courses to take at NU: Celeste Hayes's interview course; the cultural workshop (mostly students, a good resource to engage with).

----In the evening I had coffee with Fang Jun, a senior student; we chatted for a while and it was eye-opening.

First, Fang's life trajectory: bachelor's and master's (?) at Beijing Normal University. He kept writing diary posts on Renren, then got a chance to teach at the College of William & Mary in the US (Confucius Institute; William & Mary lacked teachers; Tsinghua could not send anyone; a friend recommended him for the opening, so he went to teach). While teaching, he began writing for the New York Times Chinese edition and Bloomberg Businessweek. ("Both America and China look hard at your credentials; as a mere master's student from Beijing Normal or HKU, nobody would listen to you.") While at Tsinghua he met Wendy Griswold and chatted with her, and Wendy invited him to Northwestern for a PhD.








Fourth, my earlier training in the sciences left me very uncomfortable with open-ended questions. This also reflects too narrow an understanding of "science." In fact ethnography, and qualitative research in general, really is an art, closer to fiction; compare anthropology. It is not science: it does not need to be that rigorous, nor can it be. This mindset needs adjusting.

In short, my thinking is not open enough and my life experience is narrow. Working that way, one will not achieve much in anything, because by following mainstream views, the best excellence one can cultivate will, judged by harsher standards, surely be mediocre. Hard work is necessary, but some insight is needed as well.


The same applies in academia. Set aside the difference between Howard Becker and Parsons; they belong to different eras. But Andrew Abbott also says, "I am always eccentric." The messiest places are where results are most likely to be made. As the saying goes, troubled times produce heroes…

High returns require high risk. With Chicago's and Berkeley's placement records so poor, how do you know you will even manage to graduate? "Not everyone can become Liu Yu." Troubled times produce heroes, but countless heroes also perish in them. Still, if one does not even have the nerve to try, one is certainly no hero. Whatever the dream, first finish the thesis properly.







This brings back Prof. Liu Sida's advice. 1. Not everyone can become Liu Yu; scholarship is the main job. 2. But as long as you persist, in ten years you will certainly have some influence. A good article is a good article and a bad article is a bad article; readers and publishers are not completely blind.

So, keep practicing honestly. As for "how good is good enough," that depends on how much talent heaven has granted. Even Wall Street Playboys, a site with outrageous values, says: if you are in the top 10%, go into arts or theatre… Young people should not chase quick success.

Weekly log: Feb 29 – Mar 6

It's been a while since I updated this weekly log. Since Chinese New Year I haven't prepared myself for doing research – that's been almost two weeks. Teaching has taken time away from me this semester. My last presentation, in Thomas Wong's postgraduate seminar, didn't go well, and it took me a while to figure out what I had been doing.

Feb 29

I also include suggestions and thoughts from previous weeks in today’s log.

  1. Research is a social activity, so connect with your reader. That is, the merit of a research paper lies largely in its ability to convince its readers. This requires the researcher to write clearly and to follow the conventions of the research community. It is also the researcher's responsibility to consider counter-arguments to make her argument more convincing. Talking to more people about her research helps too, because she will get a better sense of the questions people might ask and thus deal better with reviewers' critiques. You need to constantly make good arguments to build your ethos, and people will gradually come to trust you.
  2. I learned this lesson after an incompetent presentation of my thesis in the postgraduate seminar. People were unhappy about my proposed use of "auto-ethnography" and the two comparable cases I gave them. In fact I was aware of the methodological issues, possibly more so than some of them, and my data was much richer than what I presented, but I failed to show them that I knew this. I felt they were not taking my argument seriously because I didn't take the presentation seriously myself. Research is not only about reporting what you have found, but also about making people accept your report; the latter is perhaps more important at my current stage of career. In the past few months I have been too pleased with my imagination, but my imagination is nothing if people won't take it seriously. This is why Dr. Tian said the way some researchers in mainland China do their research will impede their later careers: "You should write those things only after you are established." Dr. Wang Liping made a similar point when I said I wanted to work on theory like The Social Construction of Reality.
  3. Write as you go along, everyday. For the following reasons – (These tips are from The Craft of Research)
    • Write to sort your data, your reading so you won’t feel everything is in a hopeless muddle.
    • Write to encourage your best critical thinking. Writing is thinking by itself.
    • Write so your life will be easier when you are drafting later.
    • Write to understand your source better.
  4. Sort your data along your argument. This is connected to my problem in writing literature reviews. I tended to spend too much time reading every single word of a reading, but this is unnecessary: I only need to take the key argument of a paper and sort it along my own argument. This is where Dr. Tian would suggest "project-based reading".
  5. Do not critique a source until you can summarise it. Self-evident.
  6. Significant problems are those that change our way of looking at things. Great researchers are those who can propose great problems.