Use the binary representation of an int to represent subsets of n elements

For the past two months I have spent many hours solving algorithmic problems. It's fun, but very time-consuming. I think the majority of the time goes into finding correct solutions and systematic ways of approaching problems. I also spent a few weeks (!) grasping data structure fundamentals.

Anyway, this post just describes a technique I came across when reading this book (page 73) for using integers to generate all possible subsets of a list of length n. Example Leetcode problem: https://leetcode.com/problems/subsets/description/

The key insights are:

  • Every int is represented in a machine's memory in binary, as a sequence of 0s and 1s.
  • The ints from 0 to 2**n - 1 together cover all possible combinations of 0s and 1s of length n.
  • Think of each bit position's value (0 or 1) as an indicator of whether the element at that position appears or not.
  • So for every int from 0 to 2**n - 1, we only need to extract the positions whose bits are 1 to get the combination of list elements this int represents.
  • With the bitwise AND operator & and the shift operator <<, we can easily check whether the i-th bit of an int is 1 (turned on) or 0 (turned off). Try X & (1 << i).
  • Looping over all positions from 0 to n-1, we collect all the set bits for this int and append the corresponding elements to the answer list for this int.
  • Finally, we collect every answer list into one big list and return it.

A very smart and interesting strategy. Code snippet here: https://codepad.co/snippet/Tm6Z5ZZP
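In case the snippet link goes stale, here is a minimal sketch of the idea in Python (the function name subsets and the sample input are my own, not from the snippet):

def subsets(nums):
    """Enumerate all subsets of nums via the binary representations of 0 .. 2**n - 1."""
    n = len(nums)
    result = []
    for mask in range(2 ** n):       # each mask encodes one subset
        subset = []
        for i in range(n):           # check every bit position
            if mask & (1 << i):      # bit i is on, so include nums[i]
                subset.append(nums[i])
        result.append(subset)
    return result

print(subsets([1, 2, 3]))
# [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]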

This strategy can be used for other exhaustive search problems that require looping through all possible combinations of n elements, with exponential time complexity (2**n). Because bit manipulation is very fast, it often stays within the time limit.

Another gripe about literature review

My master's thesis still needs a final round of revision, so I have been working through the literature. Starting last Friday, it took two full days, two evenings, and two early mornings to more or less understand Philip Converse's famous 1964 piece, The Nature of Belief Systems in Mass Publics. I had cited it many times, but only today did I truly understand it. I assumed the writing was bad; when I asked an American, I was told the prose is so elegant that it becomes hard to follow.

The first few times I read it, I didn't realize I wasn't understanding it, but my behavior said otherwise: slacking off, browsing the web, playing with my phone, chatting. My body was honestly running away from the work. Around Saturday night it dawned on me that the problem might simply be that the article is hard. On Sunday I started splitting it into sections and working through them one by one. Yesterday I made a little progress. Today I finally figured out roughly what it is doing.

Having worked through this foundational piece, the later papers immediately clicked into place; I could see where things came from and where they led. Humanities material has its own kind of difficulty. This is work I should have finished in 2015/2016, dragged out until now. The root cause is probably insufficient English reading ability plus insufficient patience. Whatever excuses I make for not getting the thesis written, the real reason is a lack of accumulated knowledge, and the reason for that is detouring around every hard point I hit, wandering in circles and never getting good at any discipline.

Why is this article hard? First, the author writes winding English: a single sentence carries three commas and two subordinate clauses, so the meaning keeps folding back on itself and the main point is hard to pin down. Of course, this also shows how carefully he thinks. Second, the structure of the article is not very clear. Contemporary social science authors have learned to respect the reader's patience: the first sentence states the research question, sentences two through four lay out the alternative hypotheses, the fifth gives the conclusion, and the sixth brags. This article reads more like an essay; no such blunt structure is visible. The third difficulty is that I am basically clueless about politics. I only know liberal / conservative, which in the author's own classification puts me among the ignorant masses at the bottom of society. The fourth difficulty is vocabulary.

On the positive side, finishing a difficult article ought to bring a real intellectual payoff. I have barely read books these past two years, my speech has grown less and less cultured, and yet I still thought rather highly of myself.

Two more things for this post. One is that training in the humanities and social sciences probably does make a person more complicated and conflicted. Studying social facts means taking things that could be left simple and chewing them over and over: people's psychology, their expectations about where society is heading. In the past half year there have been almost no social events in my life, so presumably my emotional intelligence has stalled again.

The second is that the paper notes, near the end, that political elites take part in shaping the history of ideas, and their decisions in turn influence the masses. When I studied Marx in college I seem to have come across a similar idea (class consciousness?), and I seem to have built some ideals on it.

The production of knowledge is politics; it is power. The struggle and pain of the past two years may also have been a form of resistance against an unfamiliar ideology trying to take control of my life. I didn't want to leave sociology, and I once talked myself into believing I had to transcend the limits of my class. My parents' generation laid bricks; I refused to lay bricks! But I am much happier now, having chosen a cognitively easier way of living. Brick-laying has its own happiness: interpersonal struggles drain you and create no value, and human intelligence is better spent on things with aesthetic or practical worth.

The jianbing guozi chef at Jinmen Grand Restaurant went home today

(Another draft. Putting it on the blog feels a bit like git add: public enough to count as a milestone, but not posted to WeChat, where it would be too embarrassing. Writer's block can probably only be cured by shamelessness; when I have time I'll come back and expand this choppy piece...)

Jinmen Grand Restaurant (津门大饭店) sits by the sea. A senior schoolmate in green asked me, "Which restaurant? Is it next to that Japanese shop?", looking more and more pained as she spoke. Four years in Hong Kong and she had never heard of Jinmen Grand Restaurant; my heart went soft.

I said yes: walk straight out from the McDonald's on Hill Road; march past the royal-blue PARKnSHOP sign on your right; turn your head and smile at the pile of plush elephants at the busy doorway of Japan Home Centre across the street; pivot gracefully and step through the brown doors of Pacific Plaza; clomp clomp clomp, take the escalator two steps at a time! Swish swish swish, strike while the iron is hot and go up one more floor!

Suddenly everything opens up: rosewood screens with white gauze, delicate ink-brush paintings, truly a whole new place! Like the Monkey King discovering the Water Curtain Cave, like Columbus discovering the New World. Just inside the door hangs a vermilion plaque: famous for a hundred years, Jingmen fish head with pancake.

By now a waitress with flattering eyes has already come to greet you: what will you eat, honored guest? Oh, no, I mean: how many, miss? Leading you in, she asks: hot pot or small dishes? Around the screen a dozen or so tables are scattered just so across a red-carpeted hall, and the square table belonging to Monkey (my friend) and me waits quietly by the window!

The auntie waitress squeezes her pale-pink-lipsticked lips into a winning smile and cranes her neck forward in welcome: what tea will you have?

What teas are there?

Pu'er, jasmine, chrysanthemum, xiangpian.

Xiangpian, then.

While we talk, another auntie has already attentively handed over two menus, a dish of pickled cucumbers, and a dish of fried peanuts. Two white porcelain pots: one of boiling water, one of hot tea. The boiling water rinses the tableware; the hot tea rinses the soul. Monkey and I rinse away happily.

I had a college classmate from Tianjin. We weren't close and never found much to talk about: I thought she was aloof, she thought I was inscrutable. Fifteen minutes into a meal she would start looking for her phone and I would start looking for a topic. After two such awkward meals we tacitly never made plans again, and running into each other at the canteen while eating with other people was still awkward. We weren't meant to be friends, but I trusted her, because she was 1.78 meters tall. She said Jinmen Grand Restaurant was decent, so it was decent. That was the first time I heard of the place. Honestly, the location is terrible: a second floor above the street, a big red sign hanging in the window, drab and dingy; most people would never think of going in. I later worried myself sick that it would go under.

From then on I had my mind on Jinmen Grand Restaurant, the way a rich young dandy has his mind on an innocent beauty: leaving my ex, McDonald's, behind, and keeping my new flame, the Grand Restaurant, in my heart.

Jinmen Grand Restaurant has three treasures: roast duck, baozi, and jianbing guozi. Not that they are especially delicious; it's just that in Hong Kong, a godforsaken place that scorns the poor and fawns on the rich, you either can't find these things elsewhere or can't afford them. When there's no tiger on the mountain, the monkey is king. For a palate that has made its peace with reality, jianbing guozi is already more than one could ask for!

The roast duck's formal name is "Jinmen's famous roast stuffed duck." What's so good about it? The waitress will ask whether you want the duck two ways: besides the plate of roast meat, you also get a pot of duck-bone soup, simmered milky white with a little fat floating on top. Winter melon or greens? None. The duck is sliced unevenly, some pieces thick, some thin; perhaps the chef doesn't do it on purpose. A big white plate is divided like flower petals: a dollop of sweet bean sauce in the center, ringed by a stack of cucumber, a stack of shredded radish, a stack of scallions piled up like firewood, with the meat stacked at a slant. The skin is roasted glossy and bright, the fat rendered translucent. It isn't melt-in-your-mouth the way Quanjude is; there's a bit of yellowish fat, slightly sticky. But on the whole, in the phrase Hong Kongers love, nothing to fault, nothing to praise.

There is another story about the jianbing guozi and the baozi. At the end of last year, when everything felt bleak, Monkey got a craving for jianbing guozi and went to Jinmen Grand Restaurant first thing in the morning. The waitress attentively sat her down and asked, baozi? Monkey asked, do you have jianbing guozi? The chef doesn't come until nine, came the answer. Monkey waited until nine, already starving. Monkey asked, jianbing guozi now? Still no jianbing guozi. Monkey starved until ten. In the end she learned that, good grief, the jianbing guozi chef had gone home!

How does a restaurant like this stay in business? How?

Later I finally did try the jianbing guozi, with friends, and my goodness, it left us speechless. That time it was two younger schoolmates, one about to go study in the US, one about to go do fieldwork in Beijing. We met at Jinmen Grand Restaurant for a farewell meal and ordered jianbing guozi so the US-bound one could have a taste. After a long wait it arrived: a yellow egg skin wrapped around a white pancake, with a crisply fried youtiao tucked inside. The egg had been slapped carelessly onto the pancake, white and yolk spread unevenly, so each bite tasted slightly different. A few scallions scattered on the plate and a smear of bean paste made the pancake taste a little richer.

What else does Jinmen Grand Restaurant have? Tofu and glass-noodle soup, of ordinary freshness. It arrives in a big bowl that is always chipped somewhere. A few leaves of greens float on top, the tofu sinks to the bottom, the cabbage is boiled to mush, its stems cooked translucent. In one corner of the bowl sits a small pile of dried shrimp that release a burst of umami when you chew them. Then the fried pork strips. I once read a writer from the Northeast describe her high school canteen: the cook fried the pork strips golden, scooped up a ladleful, gave it a toss to drain the oil, and ate them straight from the ladle, the pork trembling in a heap. Grab a piece while it's hot: soft, warm, crisp breadcrumbs, the pork fragrant and sweet. The fried pork here is probably not that crisp, but it is enough to satisfy a southerner's imagination of the Northeast's big hunks of meat.

The name "Grand Restaurant" is rather ambitious. In Hong Kong, where a living room is the size of my bathroom back home, occupying two floors already makes you a grand restaurant. On the mainland a grand restaurant needs at least five or six stories and a row of greeter girls in red; here there is none of that, and yet the place simply crowns itself a grand restaurant!

After dinner Monkey and I stood by the sea. Monkey said, it's all because of Hong Kong: here, the moment you think about the future, about housing prices, you want to rush off and make money. I thought, even without money it would be fine; we could go to Jinmen Grand Restaurant together and then spend the night sprawled in the McDonald's on Hill Road. A commoner's happiness is really quite easy: call up your friends, eat and drink your fill, and you could probably live happily like that for another eight or ten years.

Yet another lost STEM student in North America, Xiao Wang, prepares to become a code monkey

(The prose is rough; this is really a draft, but I forced myself to meet today's deadline even if I had to crawl across the line. The story is mostly fiction, with a little truth.)

Every era has its stories of shattered ideals, and that is genuinely sad. Yesterday the youth withdrew from their tents; today the youth fold up their umbrellas and go back to school. In the East the youth say, workers of the world, unite; in the West the youth say, we are the 99 percent. The artsy youth stop posting, the entrepreneurial youth can't raise money, the angry youth take up self-cultivation, and the leftist youth quote Churchill: anyone still a liberal after twenty-five has no brain.

Hard to say. Sometimes the ideal is the problem; sometimes reality is. Maybe neither is at fault, but facing the bleak life left after a constructed ideal collapses, one needs to chalk the disappointment up to the misunderstandings bred when two parallel worldviews fail to communicate.

Anyway, for the science and engineering youth of our generation in North America, the collapse of ideals usually comes with the following behaviors:
– complaining frantically about one's original field to family, friends, and internet strangers, and seriously discussing the possibility of switching to CS
– buying programming books of every kind, e.g. Learn Python the Hard Way, Head First Java, Problem Solving With C++
– subscribing to a LeetCode VIP account
– registering a 一亩三分地 (1point3acres) account, checking the offers posted by senior and junior Chinese engineers every day, and grinding through interview-experience threads

I know one such youth, Xiao Wang. We were college classmates; he's a funny guy. We used to study and eat together, and back then he didn't seem any different from the rest of us. But when he was applying to graduate school, he declared he would absolutely not become a code monkey. "A life of petty drudgery is not what I want!" Full of heroic resolve, he dashed off to do a master's in art history, convinced it was the luminous beginning of a life of transcendence.

If you are going the artsy route, there are a few ways to make it. Claw your way to big-V status on Weibo or Zhihu, or grind out WeChat posts and recycle other people's articles; either can still lead somewhere. Xiao Wang seemed to have neither the head nor the appetite for that. Not long after he switched to art history I lost track of him. I heard he was in a relationship. Then I heard they broke up. Then I heard he wasn't doing well, that someone already a bit neurotic had become even more dramatic. In the end, as the rest of us approached graduation, there was still no word of him handing in a thesis.

Then one Saturday afternoon I got a WeChat message from Xiao Wang: he wanted to change careers and asked me to help him scout the road. I said sure and arranged to meet over the weekend. Xiao Wang said sheepishly, I have to switch fields, but I don't know to what. I suggested becoming a code monkey; he said he hadn't made up his mind. I asked what he wanted to do; he said he didn't know. I asked, then why switch at all? Xiao Wang said, my field has no future, and I want to make money.

Honestly, I have seen too many people like Xiao Wang. Chemistry PhDs changing careers, comparative literature majors turning into programmers. I even knew a physics guy, talkative, fond of boasting, rather vain, who once came over with an American dream of doing pure academia and is now stuck on the margins, unable to move forward or back. You can't quite call these people cowards, and you can't quite call them greedy. So what is it? Maybe they matured too late; maybe they missed the wave of their times while carrying a restless heart.

I kept listening. "There's no way around it, housing here is just too expensive. There are a few luxury estates next to our school; the priciest one is called 宝翠园. Security guards in blue uniforms stand at the gate with orange batons, and when you come down the steps they wave you off with them. When a red taxi slips into the estate carrying who knows whom, you have to stand at the roadside and pay your respects, because it is their private road."

Once, in a bad mood and with too much to drink, Xiao Wang told me: damn it, it's just making money; if I really went for it I'd do no worse than you, so what are you so proud of? I figured some crisis had surfaced in his life, or maybe some rich kid had gotten under his skin. He probably didn't realize that in saying this he sounded petty, snobbish, sour, and vulgar, not artistic in the least.

Bourdieu says that taste is classed. Some people's money comes easily, but Xiao Wang has none of that ease. He would have to tumble through the world for twenty years before acquiring that pampered, effortless poise. I don't think Xiao Wang ever escaped his own frame: someone who has been poor, at the decisive moment, still runs on a refined self-interest.

After he sobered up he never mentioned it again; perhaps he forgot, perhaps he was ashamed.

I never brought it up either. As I see it, people like Xiao Wang are best suited to fantasy, and to trades that border on fantasy. The ones who start companies right out of school. The ones who go to Tibet. The ones who divorce. The ones who have babies. On the whole, such people want too much. Their appetite for novelty and stimulation is too strong, and they have no patience for the ordinariness of daily life. They don't understand that a life, too, is lived one day at a time. Xiao Wang, oh Xiao Wang, you are twenty-five this year; you have to learn to stand with both feet planted on the ground.

Xiao Wang, it seems, has learned that lesson. He said: I have no choice, my mom is getting old. When I was an undergrad she came to Hong Kong alone, without a phone, found her way from Lo Wu to our school, and then found me on campus. This time she came it was raining and dark, and she was wearing sandals and stepped straight into a gutter of filthy water. My mom is a nurse; dirty water is what she fears most, and she would still have to scrub down when she got back.

He said: I realized that if I want to be a proper person I can't keep drifting like this, you know? In that state, muttering to myself all day, I couldn't stay. I can't not lay bricks; laying bricks calms me down and makes me feel needed. I'm a lowly soul, born for the grind.

He still hesitated about becoming a code monkey. "A mechanical, alienated life: what meaning does that have?"

I was speechless. Xiao Wang, oh Xiao Wang, rice in Chang'an is dear and living there is not easy. And you are still talking about your soul! Still talking about meaning? You said you would write essays; did you? You said you would contribute to society; did you? Listen to yourself, rattling on like a clattering green-skinned freight train.

That afternoon I finally watched Xiao Wang, of his own accord and bleakly, say goodbye to his ambitions. He nearly shed two lines of tears, feeling that life amounts to no more than this. He had once imagined himself as this, as that, as something else; now those fantasies were gone. With his own hands he had killed off the infinite possibilities. The storms of blood and the grand stratagems: from today on, none of it has anything to do with Xiao Wang.

Not that it ever really did; but today he had to pull himself together and face the facts. From today on, be an honest, unremarkable code monkey. Give every tech company a warm nickname: Google is the "dog house," Facebook is the "F house," Amazon is the "metaphysics factory." Sitting in a bright classroom, he made up his mind.

I advised him to sign up for LeetCode VIP, take his Java course seriously, open the site, and grind problems diligently, just like in high school. He will join the people around him in marveling: wow, so-and-so got into Google, amazing! So-and-so went to Microsoft, what a star. Wow, so-and-so went from a humanities major to a programmer in three months, incredible...

I believe Xiao Wang can pull it off. But by the time he does, once he has become a professional, I'm afraid he will no longer be artsy, no longer the endearing Xiao Wang I once knew.

The meal was over and I had nothing more to say. Anything I said, Xiao Wang would find unkind. Who didn't have ideals when they were young? Here are some Cui Jian lyrics, which I first heard my political science teacher sing, dedicated to Xiao Wang.

Look at you, how did you end up like this
A few years back, when you were poor, you still had ideals
Now barely any time has passed and you've barely made any money
I'd say you're changing faster than the world is, or else your cover has slipped
You're good at cracking jokes, you're good at being liked
The ideals you used to have are now just tools
You say this is life, you say this is what gives it flavor
You might as well say you're content to keep smiling through the loss

Doctor AI: Using deep learning to automatically diagnose diseases

(This post is based on a project a classmate and I did in our first semester at NYU. I'd been meaning to write it up for three months and today finally decided to cross it off my to-do list…)

Everyone has heard about AlphaGo, the hallmark of current artificial intelligence (AI) research, which can beat human Go champions.

But how about using AI to automatically diagnose diseases? Or, at least, to automatically narrow down the range of possible diseases and recommend suitable medical consultants? Imagine how much burden this would take off the perpetually overworked doctors. Imagine how much good this would do for our health care system!

In this post, we present a machine learning project that automates the identification of relevant diseases by using "intake clinical notes" to predict the final diagnosis code. From this code, we can then identify which consultants are relevant.

#### Data Source
As with any current machine learning system (or really, even human learning), an algorithm needs to see enough data before it can learn to make smart decisions.

So we fed our algorithm a publicly available medical dataset called MIMIC ("Medical Information Mart for Intensive Care"). The dataset was compiled by a group of MIT researchers and comprises over 40,000 hospital admissions for more than 30,000 adults and more than 7,000 neonates admitted to critical care units at one hospital.

(It's freely available at https://mimic.physionet.org/ in the form of a SQL database, if you would like to take a look!)

The complete dataset has rich information, including vital signs, medications, laboratory measurements, and observations and notes taken by care providers. For our purpose, however, we used only two variables in the dataset:

– the medical discharge summary for each patient. A discharge summary is a letter written by the doctor caring for a patient in hospital. It contains important information about the patient's hospital visit, including why they came into hospital, the results of any tests they had, the treatment they received, any changes to their medication, etc.

– the ICD-9 diagnoses for each patient. "ICD-9" stands for "International Classification of Diseases, Ninth Revision". Essentially, it is a three-digit code for a disease; for example, 428 stands for "heart failure".

#### Data Transformation / Feature Engineering
One interesting thing about this project is that the model needs to learn from text. To tackle this challenge, we used techniques from natural language processing (NLP) to transform text into machine-readable inputs, i.e. numbers. In machine learning jargon, this is also called "feature engineering".

How do we transform text into numbers? We first applied the following transformations.

* Tokenize the input text. This means we break sentences down into sequences of words. For example, 'She loves dog' becomes a list of words, ('She', 'loves', 'dog'). We used the Python library NLTK ("Natural Language Toolkit") to do this.

* Turn all text into lower case, because we would like to prevent 'She' and 'she' from being recognized as two different words.

* Convert all digits to the generic character \*, so that '2017-03-13' becomes '\*\*\*\*-\*\*-\*\*'. Every discharge summary contains at least one date, but the numbers themselves are not useful for our prediction, so we simply mask them.

After these steps, every discharge summary is turned into a list of words, and the whole corpus becomes a list of lists, each inner list being one discharge summary. Not bad!

Our next step is to turn the lists of words into lists of numbers. For this we build a dictionary-style (one-hot) encoding: we construct a dictionary consisting of the most common words in the **corpus**, and then replace every word by its position in the dictionary. Easy!

For example, for this dataset we did the following:

* Construct a vocabulary consisting of words that appear at least 5 times in the dataset, essentially discarding rare words. After this step, we have a vocabulary of 38,119 unique words.
* Use these 38,119 unique words to index each word in every list of words. For example, if "the" appears in the first position of the vocabulary, it is replaced by the number 1.
* After creating the features, transform each text summary into numbers by assigning each word (token) its count in the document.

For a simple example, consider the text "harry had had enough". A Counter object on the tokenized text would yield {'had': 2, 'harry': 1, 'enough': 1}, so this sentence is transformed into the vector (2, 1, 1).
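To make the preprocessing concrete, here is a minimal sketch of the steps above (using a plain whitespace tokenizer instead of NLTK so it runs without extra downloads; the note text is invented):

import re
from collections import Counter

note = "Patient admitted on 2017-03-13. Patient had had chest pain."

# lower-case, mask every digit with *, then split on whitespace and strip punctuation
masked = re.sub(r"\d", "*", note.lower())
tokens = [t.strip(".,;:") for t in masked.split()]
print(tokens)
# ['patient', 'admitted', 'on', '****-**-**', 'patient', 'had', 'had', 'chest', 'pain']

counts = Counter(tokens)
print(counts)
# Counter({'patient': 2, 'had': 2, 'admitted': 1, 'on': 1, '****-**-**': 1, 'chest': 1, 'pain': 1})

Building the 38,119-word vocabulary is then just a matter of running a Counter over the whole corpus and keeping the tokens that appear at least 5 times.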

We also applied other preprocessing techniques to handle challenges in this dataset, including rare words and out-of-vocabulary words, but these are omitted from this post for simplicity.

#### Model
We used a "deep learning" framework to handle this task. If that sounds pretentious, you are probably right! In fact, "deep learning" is just a fancy name for "neural networks with many layers", while "neural network" is just a fancy way of saying "nested functions". So nothing mysterious here.

We used a Hierarchical Attention GRU (HA-GRU) model for this task (Yang, 2016; Baumel, 2017). The model's main characteristics are that 1) it uses a hierarchical structure to encode text at the sentence level and the document level separately, and 2) it applies two levels of attention, at the word level and the sentence level. All data manipulation and model training was done in Python 3.6.0 with PyTorch 0.2.0.

To understand this model we need to understand two components: the GRU and the attention mechanism. There is some complicated math involved, but the idea of this post is to describe the main ideas without digging into too much detail, and to point to other helpful sources where necessary.

The Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN), which is in turn a type of neural network. Let's start from the simplest construct: the neural network.

A neural network is a fancy way of doing function approximation. For example, suppose we have two variables, $X$ and $Y$, where $X$ is a vector of length 3 and $Y$ is a vector of length 2.

We would like to discover the relationship between $Y$ and $X$ in the form of $Y = f(X)$.

We might first apply a linear transformation to $X$ to get $g(X)$, then pass $g(X)$ through a non-linear function $z(\cdot)$ to get the final result $z(g(X))$.

So in the end, we will have $Y = f(X) = z(g(X))$.

A neural network model is essentially a graphical way of presenting this mathematical transformation. Drawing the previous formulation as a neural network, it looks like the following:

![image](neural_network.png)

Here is the correspondence (a small numeric sketch follows this list):
– the first three nodes are $X$; since its dimension is three, we have three nodes.
– the next two nodes are the transformed hidden vector $h$; its dimension is two, so we have two nodes.
– the final two nodes are the predicted vector $Y$; its dimension is two, so we also have two nodes.
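Here is that sketch, one way of reading the figure in numpy: a 3-dimensional input, a 2-dimensional hidden layer, and a 2-dimensional output (the weights are random placeholders, not trained values):

import numpy as np

def sigmoid(eta):
    # the non-linear function z, applied element-wise
    return 1.0 / (1.0 + np.exp(-eta))

rng = np.random.RandomState(0)
W1 = rng.randn(2, 3)             # linear map g: R^3 -> R^2
W2 = rng.randn(2, 2)             # linear map from the hidden layer to the output

X = np.array([0.5, -1.0, 2.0])   # input vector, dimension 3
h = sigmoid(W1 @ X)              # hidden vector, dimension 2, i.e. z(g(X))
Y = W2 @ h                       # predicted vector, dimension 2

print(h.shape, Y.shape)          # (2,) (2,)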

With this understanding of simple neural networks, a recurrent neural network adds autoregression to the formula by passing the output of each step of the network into the input of the next step. The idea is somewhat similar to an autoregressive time series model. Such an architecture helps the model deal with long-term dependencies.

To understand more about neural networks, the following blog posts are excellent resources:
– http://karpathy.github.io/2015/05/21/rnn-effectiveness/
– http://colah.github.io/posts/2015-08-Understanding-LSTMs/

The attention mechanism, on the other hand, is built on top of recurrent neural networks. The idea is to multiply the outputs at multiple time steps by a weight matrix that may or may not be influenced by those outputs. There are many implementations of the attention mechanism; the following post is also a great introduction to the topic:
– https://distill.pub/2016/augmented-rnns/

Once you understand recurrent neural networks and attention, the specific architecture of our model, HA-GRU, is easy to follow!

Basically, this model is a stack of two recurrent neural networks.
At the first level we encode every sentence in a document into a vector. This is achieved by first embedding words into word vectors, then applying a bi-directional GRU (the word GRU) with an attention mechanism over the embedded words. The output of this step is a sequence of sentence vectors.

At the second level, we repeat the process of a recurrent neural network plus attention, but this time over the sentence vectors.
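For the curious, here is a minimal, simplified sketch of the word-level half of such a model: an embedding, a bi-directional GRU, and a simple attention layer that pools word states into a sentence vector. It is written against the current torch.nn API rather than the PyTorch 0.2.0 code we actually used, and all sizes are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordEncoder(nn.Module):
    """Encode one sentence (a sequence of word indices) into a single vector."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)          # scores each word state

    def forward(self, word_ids):                          # word_ids: (batch, seq_len)
        states, _ = self.gru(self.embed(word_ids))        # (batch, seq_len, 2*hidden_dim)
        weights = F.softmax(self.attn(states), dim=1)     # attention weights over the words
        return (weights * states).sum(dim=1)              # sentence vector: (batch, 2*hidden_dim)

# toy usage: a batch of 4 "sentences", each 12 word indices from a vocabulary of 38,119
encoder = WordEncoder(vocab_size=38119)
sentences = torch.randint(0, 38119, (4, 12))
print(encoder(sentences).shape)                           # torch.Size([4, 100])

The sentence-level encoder repeats the same pattern over the resulting sentence vectors, and a final linear layer maps the document vector to scores over the ICD-9 labels.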

#### Results
We trained our HA-GRU model with different configurations of the embedding layer, the word bi-GRU layer, and the sentence bi-GRU layer, using mini-batch gradient descent for numerical optimization.
After first running the model on our full dataset (over 29,000 summaries and 936 labels), we noticed that, because of the extreme sparsity of the label set, the results were negligible. So we decided to test a smaller label set and finally achieved an accuracy ranging from 0.01 to 0.04. In other words, our model does not seem to make good predictions.

#### Discussion
While our results were not great, the project was still a good exercise in thinking about, and testing, how artificial intelligence models can help with medical practice. It also shows that our Dr. AI obviously has plenty of room for improvement!

Three years after graduation

It will soon be three years since I finished my bachelor's degree. The biggest change in my life over these three years has been entering the social science academic world and then leaving it; correspondingly, I left the typical circle of Chinese STEM students and then returned to it. Last week two young public intellectuals gave a talk in New York (http://nyshalong.com/event/138) on a topic closely related to what I used to work on. Two years ago I would have gone; this time I don't plan to. That is one visible example of how my daily life has changed.

This post is about the experience of moving from STEM into the social sciences and back, and what it felt like, organized around three questions.

  1. Why did I study sociology?

I had always been interested in the humanities. I loved reading as a child, and my grades in Chinese and in foreign languages were good. But because of my family environment and schooling I held prejudices against humanities majors, and since my science grades were decent, I never considered a humanities major before the gaokao. All of my application choices were engineering programs.

My math grades were mediocre. In the second semester of my joint-training program I took a very hard math course, took a beating, and decided I was not cut out for science and engineering. So I did some research and planned a move toward the humanities and social sciences. I talked with teachers in Chinese literature, philosophy, sociology, and anthropology, weighed my options, and settled on sociology.

During undergrad I kept taking sociology courses and spent a summer as a sociology RA. I complained constantly along the way, but kept exploring in that direction. I read some papers, along with Andrew Abbott's University of Chicago address, Welcome to the University of Chicago (http://www.inf.fu-berlin.de/lehre/pmo/eng/Abbott-Education.pdf), and Andreas Glaeser's talk, So, how about becoming a poet? (https://aims.uchicago.edu/page/2005-andreas-glaeser). I was very much drawn to that kind of intellectual beauty.

I was also fairly ambitious when I was younger, interested in understanding how society works and wanting to uncover its regularities.

When it came time to apply after graduation: first, I wanted to go to a famous school, and sociology admissions are less competitive than STEM.

Second, I was genuinely curious about the humanities and social sciences, and often felt there were unresolved puzzles in my life. In my second year I talked with the professor who later became my sociology advisor; I couldn't articulate the feeling, but I sensed that the path she had walked might hold some answers.

Third, for some unknown reasons, I could hardly imagine going to the US to be a programmer or data analyst at the time. Looking back, it was probably fear of a foreseeable (ordinary) life: a taste for the new, the exciting, and the unknown, and perhaps for the abstract and the romantic. As a kid I liked sneaking off to the arcade; in middle school I fantasized about running with the local delinquents and getting into brawls, and I loved The Catcher in the Rye. In high school I was watched too closely to do anything too far out of line. Through most of college my plans for the future were only distant and romantic; careful, down-to-earth cultivation was never my style when I was young.

Fourth, a certain political event happened in Hong Kong as I was graduating, and emotions ran high. I had always been something of an inexplicably angry youth. Sociology has a built-in concern for inequalities of class and gender and a running conversation about justice and values, which attracts young people. Many American sociologists came to the field in the 1960s through progressive movements, e.g. Erik Olin Wright and Michael Burawoy; Chinese examples include Zhao Dingxin, Pun Ngai, and Liu Sida. Ivan Szelenyi makes a similar observation in "The Triple Crisis of Sociology": https://contexts.org/blog/the-triple-crisis-of-sociology/

As the saying goes (supposedly first said by John Adams: http://freakonomics.com/2011/08/25/john-adams-said-it-first/), "If a person is not a liberal when he is twenty, he has no heart; if he is not a conservative when he is forty, he has no head." Having watched myself turn from liberal to conservative over the past two years, I find this line painfully accurate!

Fifth, I happened to have a scholarship. By luck and circumstance I chose to stay in Hong Kong for a master's in sociology. Looking back, I really did let my advisor down.


2. Why did I stop studying sociology?

I have rambled about this question too many times already. Looking around, people who choose to do a social science PhD are the minority anyway, so not doing one is hardly strange. You didn't either, did you?

First, and most important, I discovered that I have no genuine interest in social science research. The first time I met my advisor to discuss topics, I asked: which topic would let me publish fastest? That is not a question a real researcher asks. I had no ambition to invent a new theory of social change or a new state-society framework, and no desire to add bricks to symbolic interactionism. Some sociology and psychology research is genuinely interesting, such as six degrees of separation or Bourdieu's theory of distinction, but appreciating and creating are two different things.

Second, I realized that my earlier longing and confusion mostly came from being idealistic and drawn to things of the spirit. The books I liked were mostly literary, not social science scholarship. It also had to do with being emotionally sensitive when I was younger; in Big Five terms, high neuroticism. A sociology career can satisfy those needs only marginally; a better-matched goal would probably have been to become a writer.

Third, I do not enjoy debating people, while social science research, like all scientific research, is ultimately about persuading others. This is especially true of political science: the people I have met who are drawn to it are either quite ambitious or love to argue, which makes me wonder whether an interest in politics is also an interest in power.

Fourth, the sociologist's way of life wore me down. My master's program was British in style, prizing independence of mind and basically leaving graduate students to roam free. I crave a tightly scheduled life: the high school routine of arriving at six in the morning and leaving at nine at night made me feel safe, because it spared me the responsibility of facing myself. Having all that time in my own hands made me guilty and miserable. Professional life and graduate student life also run at different rhythms, and I prefer a somewhat more competitive environment.

Fifth, I never quite fit into the sociology crowd. My educational background and way of thinking differ from those of my classmates and teachers. With qualitative sociologists and ethnographers I could barely hold a conversation, and on the quantitative side, anyone with fairly hardcore mathematical training may struggle to accept a style of argument that is not especially rigorous or tidy. On STEM-background students moving into social science PhDs, the economics blogger Noahpinion has an interesting discussion: http://noahpinionblog.blogspot.com/2014/03/coming-into-econ-from-physics-and-other.htm

Sixth, learning sociology did not go as smoothly as I had imagined. My foundations were weak, I took plenty of detours, and I felt defeated. More on this in the next post.

Seventh, in terms of concern for injustice, my values never fully matched the field's, and they have been shifting. In Moral Politics, George Lakoff argues that liberal voters think in a nurturant parent model: roughly, the disadvantages of vulnerable groups come mainly from unfair social structures, and the government has a duty to protect them. Conservatives think in a strict father model: the unfortunate are largely responsible for their own situation, and what matters is self-help, self-reliance, and self-discipline. I eventually realized that although I logically placed myself in the first camp, deep down I identified more with the second. So I never really clicked with genuinely "left" sociologists and activists.

Eighth, setting all of the above aside, from a purely utilitarian standpoint sociology does not offer good material rewards.

3. What lessons did I learn?

  • Do what genuinely interests you. Unfortunately, the education I received before college was extremely utilitarian, so I never understood how important interest is, and the pressure of the gaokao left no time or energy to think about it. As I see it now, only genuine interest makes you willing to take risks and able to persist. From my admittedly shallow experience, the outstanding people among my peers share firm goals, the courage to make trade-offs, groundedness, focus, and persistence, and all of these probably presuppose interest.
  • Exploration vs. exploitation is a fundamental trade-off in decision-making, and the central problem in reinforcement learning.
    • If you spend too much time exploring, you cannot reach the optimal solution.
    • If you exploit too early, you also cannot reach the optimal solution. A genuine truth about life!
    • I hastily decided in my first year to switch to sociology, and failed to cut my losses through college and the master's; the price was heavy. That was exploiting too early.
    • When I decided to change careers in the summer of 2016, I then spent too much time exploring; had I committed to one direction, I would probably have saved a year.
  • Worship of famous schools. I wanted a famous school, and I did eventually get the chance to go to a famous American school and work with top scholars, but prestige is ultimately hollow. During my three months at Northwestern I was unhappy, distracted, and complained constantly, wearing out the people around me and squandering an opportunity my advisor had secured through her own network.
    • Where does this worship come from? An example: a high school classmate's gaokao score fell short of regular admission to Peking University, and the teachers did everything they could to push him toward PKU's medical school, but he chose finance at Fudan instead. I couldn't understand it at the time. Now I see that treating Tsinghua/PKU as the only yardstick is a mindset that eventually exacts a heavy price.
  • Push this further and you get into the distribution of educational resources, and even class values.
    • Why did I go to that particular high school?
    • If I had gone to a different one, would I still have gotten into a decent university?
    • In college I found that many classmates came from better-off families than mine, while compared with my own cousins I was extremely lucky. That is also part of why I chose sociology and leaned left. The sociology of law has a "two hemispheres" theory: lawyers from lower-prestige families more often take on public-interest litigation and end up serving individuals and small clients, leaning left, while lawyers from privileged families tend to serve wealthy, powerful corporate clients. http://www.pkulaw.cn/fulltext_form.aspx?Db=qikan&Gid=1510166689&keyword=&EncodingName=&Search_Mode=

4. If you are a sociology student wondering whether to apply to sociology PhD programs in the US: I have studied the "don't go to grad school" genre with some care and collected the following representative pieces:

  • Graduate School in the Humanities: Just Don't Go. https://www.chronicle.com/article/Graduate-School-in-the/44846 The author is an English PhD from a famous school; the writing is fluent and elegant and the points are incisive. Strongly recommended, better than 99% of the discussion on the Chinese-language internet. He also wrote several follow-ups.
  • Specifically for sociology, the author of orgtheory (a Chicago PhD who works on social movements and now, it seems, teaches at Maryland) wrote a book, grad school rulz!, that covers the common questions all the way from application to assistant professorship. On whether to do the PhD at all, his advice is also no.
  • 100 Reasons NOT to Go to Graduate School: aimed at the social sciences; somewhat repetitive, but very thorough overall. http://100rsns.blogspot.com


Two approaches for logistic regression

Finally, a post in this blog that actually gets a little bit technical …

This post discusses two approaches to understanding logistic regression: the empirical risk minimization view vs. the probabilistic view.

Empirical risk minimization

Empirical risk minimization frames a problem in terms of the following components:

  • Input space $X = R^d$: corresponds to observations, features, predictors, etc.
  • Outcome space $Y = \Omega$: corresponds to the target variable.
  • Action space $A = R$: the space the decision function (also called predictor or hypothesis) maps into.
    • Restricting the decision functions under consideration gives a hypothesis space, e.g. all linear transformations.
  • Loss function $l(\hat{y}, y)$: a loss defined on the predicted value and the observed value.

The goal is to select a function $f$ from the hypothesis space that minimizes the total loss on the sample. This is achieved by choosing the parameters of $f$ that minimize the empirical loss on the training set. We also do hyperparameter tuning, on a validation set, to prevent overfitting.

In a binary classification task:

  • Input space $X = R^d$.
  • Outcome space $Y = \{-1, 1\}$: binary target values.
  • Action space $A = R$. The hypothesis space: all linear score functions,
    • $F_{score} = \{x \mapsto x^T w \mid w \in R^d\}$
  • Loss function: $l(\hat{y}, y) = l_{logistic}(m) = \log(1 + e^{-m})$
    • This is a margin-based loss, hence the $m$.
    • The margin is defined as $m = \hat{y} y$, which has a natural interpretation in a binary classification task. Consider:
      • If $m = \hat{y} y > 0$, the prediction and the true value have the same sign, so in binary classification we are already correct; ideally the loss for $m > 0$ would be 0.
      • If $m = \hat{y} y < 0$, the prediction and the true value have different signs, so we are wrong and the loss should be positive.
        • In SVMs we use the hinge loss $l(m) = \max(0, 1 - m)$, a "maximum-margin" loss (more on this in the next post, which will cover the derivation of SVMs and kernel methods). With this loss there is no loss when $m \geq 1$ and a positive loss when $m < 1$. We can interpret $m$ as the "confidence" of our prediction: $m < 1$ means low confidence, so we still penalize.
    • With this intuition, how do we understand the logistic loss? We know the following (see the numeric sketch after this list):
      • This loss is always > 0.
      • When $m$ is negative (i.e. a wrong prediction), the loss is larger.
      • When $m$ is positive (i.e. a correct prediction), the loss is smaller.
      • Note also that for the same change in $m$, the scale at which we "reward" correct predictions is smaller than the scale at which we penalize wrong ones.
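Here is that sketch: a few margin values plugged into the logistic and hinge losses (my own toy numbers), which also makes the reward/penalty asymmetry in the last bullet visible:

import numpy as np

def logistic_loss(m):
    return np.log(1 + np.exp(-m))

def hinge_loss(m):
    return np.maximum(0, 1 - m)

for m in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"m = {m:+.1f}   logistic = {logistic_loss(m):.3f}   hinge = {hinge_loss(m):.3f}")

# m = -2.0   logistic = 2.127   hinge = 3.000
# m = -1.0   logistic = 1.313   hinge = 2.000
# m = +0.0   logistic = 0.693   hinge = 1.000
# m = +1.0   logistic = 0.313   hinge = 0.000
# m = +2.0   logistic = 0.127   hinge = 0.000

Moving from $m = 0$ to $m = 1$ lowers the logistic loss by about 0.38, while moving from $m = 0$ to $m = -1$ raises it by about 0.62, which is exactly the asymmetry described above.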

Bernoulli regression with a logistic transfer function

  • Input space $X = R^d$
  • Outcome space $Y = \{0, 1\}$ (after the relabeling described below)
  • Action space $A = [0, 1]$; an action is the probability that the outcome is 1

Define the standard logistic function as $\phi(\eta) = 1 / (1 + e^{-\eta})$.

  • Hypothesis space: $F = \{x \mapsto \phi(w^T x) \mid w \in R^d\}$
  • A sigmoid function is any function with an "S" shape; the logistic function is the simplest example. Sigmoids are used in neural networks as activation/transfer functions; their purpose is to add non-linearity to the network.

Now we need to relabel the $y_i$ in the dataset.

  • For every $y_i = 1$, we define $y_i' = 1$
  • For every $y_i = -1$, we define $y_i' = 0$

Can we do this? Doesn't it change the values of the $y$s? The answer is that in binary classification (or any classification) the labels themselves do not matter; this trick just makes the equivalence shown below much easier to see.

Then the negative log-likelihood objective, given this $F$ and a dataset $D$, is:

  • $NLL(w) = \sum_{i=1}^n \big[ -y_i' \log \phi(w^T x_i) - (1 - y_i') \log (1 - \phi(w^T x_i)) \big]$

How do we understand this approach? Think of it as a neural network:

  • Input: $x$
  • First, a linear layer transforms $x$ into $w^T x$
  • Next, a non-linear activation function $\phi(\eta) = 1 / (1 + e^{-\eta})$ is applied.
  • The output is interpreted as the probability of the positive class.
    • For multi-class problems, the second layer is a softmax, and we get a vector of probabilities!

With a little calculation, we can show that this NLL is equivalent to the sum of the empirical logistic loss from the first approach.
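Here is the short derivation, using $y \in \{-1, 1\}$, $y' = (y + 1)/2 \in \{0, 1\}$, and the identity $1 - \phi(\eta) = \phi(-\eta)$:

  • For an example with $y = 1$ (so $y' = 1$), the NLL term is $-\log \phi(w^T x) = \log(1 + e^{-w^T x})$.
  • For an example with $y = -1$ (so $y' = 0$), the NLL term is $-\log(1 - \phi(w^T x)) = -\log \phi(-w^T x) = \log(1 + e^{w^T x})$.

In both cases the term equals $\log(1 + e^{-y w^T x}) = l_{logistic}(m)$ with margin $m = \hat{y} y = y w^T x$, so summing over $i$ gives exactly the empirical risk of the first section.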


Encoding categorical features: likelihood, one-hot, and feature selection

This post describes techniques for encoding high-cardinality categorical features in a supervised learning problem.

In particular, since the values cannot be ordered, these features are nominal. Specifically, I am working with the Kaggle competition here. The problem with this dataset is that some features (e.g. the type of cell phone operating system) are categorical and have hundreds of values.

The question is how to feed these features into our model. Nominal features work fine with decision trees (random forests) and Naive Bayes (which uses counts to estimate the pmf), but for other models, e.g. neural networks or logistic regression, the input needs to be numeric.

I will introduce likelihood encoding first, and then go over other methods for handling such situations.

Likelihood encoding 

Likelihood encoding is a way of representing the values of a categorical feature according to their relationship with the target variable. The goal is to find a meaningful numeric encoding for the categorical feature, where "meaningful" means as related to the output/target as possible.

How do we do this? A simple way is to 1) group the training set by the categorical feature and 2) represent each value by the within-group mean of the target (see the sketch after the list below). For example, suppose the categorical feature is gender and the target is height. If the average height for males is 1.70 m and the average height for females is 1.60 m, we replace 'male' with 1.70 and 'female' with 1.60.

We should probably also add some noise to this mean to prevent overfitting to the training data. This can be done in two ways:

  • Add Gaussian noise to the mean (credit to Owen Zhang).
  • Use the idea of "cross-validation": instead of the grand group mean, use an out-of-fold mean. (I'm not very clear on this point at the moment; I need to revisit the idea of cross-validation and will write about it in the next post.) Some people on Kaggle propose using two levels of cross-validation: https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features
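Here is the promised sketch: a minimal pandas version of the group-mean encoding, with optional Gaussian noise (the column names, data, and noise scale are made up for illustration):

import numpy as np
import pandas as pd

train = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "male"],
    "height": [1.72, 1.58, 1.68, 1.62, 1.70],
})

# within-group mean of the target for each category value
means = train.groupby("gender")["height"].mean()
train["gender_enc"] = train["gender"].map(means)

# optional: add a little Gaussian noise to fight overfitting
rng = np.random.RandomState(0)
train["gender_enc_noisy"] = train["gender_enc"] + rng.normal(0, 0.01, len(train))

print(means)   # female 1.60, male 1.70

At test time, unseen category values can fall back to the global target mean; the Kaggle thread above does the same computation inside cross-validation folds.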

One hot vector

This idea is similar to dummy variables in statistics. Each possible value is transformed into its own column, which is 1 if the original feature equals that value and 0 otherwise.

An example is natural language processing models, where the first steps are usually to 1) tokenize the sentence, 2) construct a vocabulary, and 3) map every token to an index (a number, or basically a nominal value). After that, we do 4) one-hot encoding and 5) a linear transformation in the form of a linear layer in a neural network, which turns the high-dimensional one-hot vectors into low-dimensional vectors. In this way we represent every symbol by a low-dimensional vector whose exact form is learned. What is happening here is really dimensionality reduction, so after learning the weight matrix, other methods, like PCA, could potentially work here as well!
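A minimal PyTorch sketch of steps 4) and 5): nn.Embedding is exactly the "one-hot vector times weight matrix" lookup, so the explicit one-hot vector is never materialized (the vocabulary and sizes are arbitrary):

import torch
import torch.nn as nn

vocab = {"ios": 0, "android": 1, "windows": 2}   # toy categorical values mapped to indices
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=2)  # 3 one-hot dims -> 2 dims

ids = torch.tensor([vocab["android"], vocab["ios"]])
vectors = embed(ids)            # each row is a learned 2-dimensional representation
print(vectors.shape)            # torch.Size([2, 2])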

Hashing 

A classmate named Lee Tanenbaum told me about this idea. It is an extension of one-hot encoding. Suppose the feature can take n values. We use two hash functions: the first hashes the values into about \sqrt{n} buckets, so all feature values landing in the same bucket get the same code for variable A. Then a second hash function carefully hashes the values into another bucket variable B. We want the combination of A and B to be able to represent every possible value of the original feature. We then learn a low-dimensional representation for both A and B and concatenate them.

My take on this is that it is still one-hot encoding plus weight learning; we are just forcing a certain structure onto the weight matrix.
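A rough sketch of the two-bucket idea in plain Python (the salts, bucket count, and value names are arbitrary, and a real implementation would need to check that the (A, B) pairs do not collide):

import hashlib
import math

values = [f"device_{i}" for i in range(100)]      # pretend the feature has n = 100 values
n_buckets = math.isqrt(len(values))               # about sqrt(n) buckets per hash

def bucket(value, salt):
    digest = hashlib.md5((salt + value).encode()).hexdigest()
    return int(digest, 16) % n_buckets            # stable bucket id in [0, n_buckets)

encoded = {v: (bucket(v, "A"), bucket(v, "B")) for v in values}
print(encoded["device_7"])   # a pair like (3, 8); (A, B) together stands in for the raw value

Each of A and B then gets its own one-hot encoding (or embedding), and the two representations are concatenated.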

Feature selection 

Still based on one-hot encoding, but instead of compressing everything into a tiny low-dimensional vector, we discard some of the dummy variables based on their importance. In fact, LASSO is used for exactly this: the L1 penalty usually drives the coefficients of some features to zero, due to the diamond shape of the constraint region (a minimal sketch follows the sources below). Sources:

  • On why L1 gives sparsity, see this video: https://www.youtube.com/watch?v=14MKVkhvMus&feature=youtu.be
  • A course note: https://onlinecourses.science.psu.edu/stat857/book/export/html/137 and a Stack Exchange answer: https://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection
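Here is that sketch: L1-penalized logistic regression in scikit-learn used to keep only the informative dummy columns (the data is synthetic, and in practice the strength C would be tuned by cross-validation):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(500, 30)).astype(float)      # 30 one-hot / dummy columns
y = ((X[:, 0] + X[:, 3]) >= 2).astype(int)               # only columns 0 and 3 matter

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(clf.coef_[0])   # dummies whose coefficients were not driven to zero
print(kept)                           # expected to be a small set containing 0 and 3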

Domain-specific methods

These methods exploit the relationships between the symbols in the training set and learn a vector representation of every symbol (e.g. word2vec). This is certainly another way of vectorizing the values. I will probably write more about this after learning more about representation learning!

Python generators

This short post describes what a generator is in Python.

A function with yield in it is still a function; however, when it is called, it returns a generator object. Generators allow you to pause a function and return an intermediate result: the function saves its execution context and can be resumed later if necessary.

def fibonacci():
    a, b = 0, 1
    while True:
        yield b          # pause here and hand back the current Fibonacci number
        a, b = b, a + b  # resume from this line on the next call to next()

g = fibonacci()

[next (g) for i in range(10)]

This will return [1, 1, 2, 3, 5, 8, 13, 21, 34, 55].

When we run the list comprehension again, it returns:

[next (g) for i in range(10)]

[89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765]

Note that the generator is like a machine that produces the values you ask for, and it remembers its last state. This is different from a function that returns a list, which does not remember where it left off.

Python generators can also interact with the calling code. With the next protocol, yield becomes an expression, and a value can be passed in along with a new method called send. Here is an example piece of code:

def psychologist():
    print("Please tell me your problem")
    while True:
        answer = (yield) # note the usage of yield here 
        if answer is not None:
            if answer.endswith("?"): # note: endswith, all lowercase, with an s
                print("Don't ask yourself too many questions.")
            elif "good" in answer:
                print("A that's good, go on. ")
            elif "bad" in answer:
                print("Don't be so negative.")

This defines a function that can return a generator.

free = psychologist()

type(free)

This will return the generator type (<class 'generator'>).

next(free)

This will print "Please tell me your problem" (the yield itself returns None).

free.send("what?")

This will print "Don't ask yourself too many questions."

free.send("I'm feeling bad.")

This will print "Don't be so negative."

free.send("But she is feeling good!")

This will print "A that's good, go on."

send acts like next, but makes yield return the value passed. The function can, therefore, change its behavior depending on the client code.

Should I use an IDE, or should I use Vim?

This question has been bugging me for a while, so I decided to write it out even though it's just a short piece. This post compares tools for Python programming:

  • Jupyter Notebook
  • IDEs like PyCharm
  • Text editors like Vim

Jupyter Notebook:

  • The pro is that it's easy to visualize: when you want a graph, you see the graph immediately. The comments are also beautifully formatted.
  • Another pro is that it can be connected to a Linux server like Dumbo.
  • The con is that it's not a program. A notebook is its own file, and although it can be downloaded as a .py file, that file is usually too long and cluttered with comments such as typesetting parameters.

When to use it?

  • Data exploration, because of its visualization and analysis nature.
  • Big data: because it can be connected to a server, running large amounts of data becomes possible.

PyCharm

  • The pro is that it's built for Python development. I have not learned all of its functionality yet, but search and replace, for example, is easy in PyCharm, and debugging also seems straightforward.
  • The con is that it's usually not available on a server.
  • Another con is the extra finger movement needed when switching between the terminal and PyCharm.

When to use it?

  • Debugging complicated programs, e.g. NLP programs.
  • When there's no need to run on a server.

Vim

  • The pro is that it's everywhere, so writing on your own machine and writing on the server feel the same.
  • Another pro is that it can be used for anything (Python, C++, Markdown, Bash...), so there is no need to switch tools when you ssh into a server.
  • The con is that it's not that easy to use: search and replace is hard, and even adjusting tabs is not immediately doable.
  • Another con is that it's not that easy to debug; you have to print variables manually, which becomes particularly difficult when the program is large.

When to use it?

  • When you need to work on a server, e.g. because of the data size.