
Natural Language Processing Notes – Keep Updating

References –

Structure of this note: I plan to note down key concepts, mathematical constructs and explanations.

Basic tasks in NLP

  • language modeling
  • POS tagging
  • named entity recognition
  • sentiment analysis
  • paraphrase detection

From a StackExchange user’s posts-

NLP is very vast and varied. Here are a few basic tools in NLP:

  1. Sentence splitting: Identifying sentence boundaries in text
  2. Tokenization: Splitting a sentence into individual words
  3. Lemmatization: Converting a word to its root form. E.g. says, said, saying will all map to root form – say
  4. Stemmer: Similar to a lemmatizer, but it stems a word rather than finding the root form. E.g. laughed, laughing will stem to laugh. However, said, saying will map to sa – which is not particularly enlightening in terms of what “sa” means
  5. POS tagger: Tags each word with its part of speech – noun, verb, preposition, etc.
  6. Parser: Links words with POS tags to other words with POS tags. E.g. John ate an apple. Here John and apple are nouns linked by the verb eat. John is the subject of the verb, and apple is the object.

If you are looking for the state of the art for these tools, check out Stanford CoreNLP, which has most of these tools and trained models to identify the above from a document. There is also an online demo to try Stanford CoreNLP before downloading it and using it with your application.
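As a quick illustration of tools 1–5, here is a minimal sketch of my own using NLTK rather than Stanford CoreNLP; it assumes NLTK plus its punkt, tagger and WordNet data packages are installed.

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

text = "John ate an apple. He said he was still hungry."
sentences = nltk.sent_tokenize(text)                   # 1. sentence splitting
tokens = nltk.word_tokenize(sentences[1])              # 2. tokenization
print(nltk.pos_tag(tokens))                            # 5. POS tagging
print(WordNetLemmatizer().lemmatize("said", pos="v"))  # 3. lemmatization -> say
print(PorterStemmer().stem("laughing"))                # 4. stemming -> laugh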

NLP has several subfields. Here are a few of them:

  1. Machine Translation: Automatic Translation from one language to another
  2. Information Retrieval: Something like a search engine, that retrieves relevant information from a large set of documents based on a search query
  3. Information Extraction: Extract concepts and keywords – such as names of people, locations, times, synonyms etc.
  4. Deep Learning has lately become a popular approach within NLP, where a system tries to understand a document the way a human understands it.

2017 Mar 8 Women in Data Science HKU

A project of text classification based on news articles

  • N-grams language model
  • Chinese text segmentation – the Jieba tool on GitHub (see the sketch after this list)
  • Hidden Markov Model (HMM)
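As a quick sketch of the segmentation step (my own toy example, assuming the jieba package is installed) — Jieba combines a dictionary with an HMM for unseen words:

# -*- coding: utf-8 -*-
import jieba

sentence = u"我来到北京清华大学"          # toy input sentence
words = jieba.cut(sentence, HMM=True)     # the HMM handles out-of-vocabulary words
print("/".join(words))                    # e.g. 我/来到/北京/清华大学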

From the web – general review articles – pop articles 

State of the art – “AI-hard” as a concept – strong AI – Ali iDST http://mp.weixin.qq.com/s/hYrNi17HiyW9kxokgLR_Ig – as of 2017 Apr 17

Stage I: 2014

Motivation 

  • They started working on human-machine conversation in 2014. The commercial motivation = two trends. First, smartphones are popular, so there are many use scenarios that mouse and touch input cannot satisfy. Second, Internet services are expanding into every scenario of everyday life, which opens the possibility for other interaction methods.

Facts 

  • In early 2014 human-machine conversation was at a primary stage.
  • On Taobao in China, few people actually use speech to search.
  • In early 2014 many companies’ speech search products died.

Stage II: 2015 – 2016

Facts and current work

  • Alibaba iDST worked with YUNOS on a speech-enabled interactive search assistant (something like Siri?). But a problem arose = it had to compete for users’ attention with other apps across all activities, and thus had to take care of what every application is doing (buying air tickets, ordering food, etc.) = a huge workload. So in 2015 they transitioned to building a platform that offers an API to every app (no details here).
  • Main tasks now: 1) use machines to understand human language (intention + key information, e.g. booking a train ticket); 2) manage the interaction (keep asking until the key information is obtained); 3) open-ended conversations (Wikipedia…); 4) casual chat (not so valuable in a commercial sense).

Challenges 

  • Engineering: scalability – scaling to particular domains + scaling to particular devices. This also came up in that NLG paper where the authors built a commercialised natural language generation system for a hospital; from an engineering perspective, the challenges include 1) accuracy and 2) a scalable system.
  • Science: general challenges in NLP include the variety of users’ language (what exactly do you want when you say “I want to go to the US”?) + understanding context + robustness + the shared common sense that makes a conversation flow smoothly.

Stage III: 2016 – now (2017 Apr)

Work 

  • The NLP engine moved from traditional machine learning methods to deep learning: intention = CNN, slot-filling = Bi-LSTM-CRF, context (they didn’t specify which model), robustness = data augmentation (see the sketch after this list)
  • Developed a task-flow language to solve the “understand context” problem
  • To expose an API for each domain, developed OpenDialogue with different specific tasks
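To make the “intention = CNN” bullet concrete, here is a minimal sketch of my own (not the iDST model; the vocabulary size, sequence length and number of intents are made up) of a CNN intent classifier in Keras:

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, max_len, n_intents = 10000, 30, 5   # made-up sizes

model = Sequential()
model.add(Embedding(vocab_size, 128, input_length=max_len))  # word ids -> dense vectors
model.add(Conv1D(64, 3, activation='relu'))                  # n-gram-like filters
model.add(GlobalMaxPooling1D())                              # strongest response per filter
model.add(Dense(n_intents, activation='softmax'))            # one score per intent class
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(...) would then be called on padded word-id sequences and intent labels.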

Current applications 

  • On smart appliances… something like Amazon Alexa 
  • Some scenarios where human-machine conversation can be useful = the Internet and cars: you can talk while you drive.

Challenges again

  • Open-domain NLP — dealing with scalability across specific domains
  • Current human-machine interactions are mostly single-round conversations, or can only take limited context into account, so modelling with regard to the specific context is important.
  • Data-driven learning instead of user-defined modules in conversations.

Deep Learning 

  • Caffe – a Berkeley package http://caffe.berkeleyvision.org

 

 

Conversation with Yuchen Zhang

Semantic parsing

Natural language – logical form – knowledge base

  • Question answering, which academia cares about: answering relatively complex questions that require logical reasoning
    • Argmax …
    • Question — operations on the data source
  • How far this can be pushed is an open problem; any bit of progress can be transferred directly to industry
  • Research community — cannot get the data — abstract the problem — Wikipedia question answering dataset

 

Personality — papers on sentiment analysis

A simplification of the problem — why? At first it was not because sentiment analysis was so useful; e.g. for Amazon reviews you can just look at the star rating, no need to read the text.

Chatbot analysis is not very accurate either.

But it is a simplification of the big problem of semantic understanding — not just the literal information, but deeper information.

Word-extraction-based analysis performs rather poorly.

In reality it does not transfer particularly well; a community has formed around this, and sentiment analysis no longer cares much about semantic analysis.

 

data coding –

information extraction – paragraph — A & B relations

  • Open source — open IE
  • UWashington

Summarization – an open question, still active research

  • Research question
  • Long paragraph — short paragraphs
  • Neural models

 

NLP – cannot help with real understanding, but it can provide basic tools, e.g. stock price prediction from news (is Microsoft stock going up?), or reviews — user experience as the output.

  • Open-source tools cannot do this
  • What can it do? Extract features
    • Sentiment analysis — emotions, positive or negative
    • Relations
    • Feature vectors for prediction

It cannot help understand an article, but it can do:

  • Keywords – what the topic is, via topic models; first do unsupervised training on, say, ten thousand news articles, grouping articles that talk about the same thing
  • Sentiment analysis — Amazon product reviews
  • End-to-end tasks: summarization is an active research area
    • Baseline — the first line (lead sentence)
    • Language independent — Chinese corpora
  • Question answering
  • Reading comprehension

Corpora:

  • By problem: word embedding — news vs. literary corpora (see the sketch below)
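A minimal word-embedding sketch with gensim’s Word2Vec on a toy corpus (the placeholder sentences stand in for a tokenized news or literary corpus):

from gensim.models import Word2Vec

# Toy tokenized corpus; in practice this would be tokenized news or literary text.
sentences = [["stock", "market", "rises"],
             ["stock", "price", "falls"],
             ["the", "novel", "tells", "a", "story"]]

model = Word2Vec(sentences, size=50, window=2, min_count=1)  # gensim < 4.0 uses size=; newer versions use vector_size=
print(model.wv["stock"])                # the learned 50-dimensional vector
print(model.wv.most_similar("stock"))   # nearest neighbours in the embedding space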

Linguistics

Percy Liang

Quite a few professors in linguistics are linguists using computational methods.

Cho has thrown himself into datasets.

Cho is famous for neural machine translation.

  • Postdoc
  • Deep learning class paper
  • Attention model
  • GRU
  • Autonomous driving
  • Vision

Chris Manning

  • Deep learning + NLP

NLP is quite hot in industry

  • Semantic parsing — search engines — rule-based is more conservative — Google in search
  • Question answering — dialogue systems — hot — AI assistants
  • Fundamental modules — word embeddings, parsing, POS tags, dependencies used as features in applications

Dedicated to XX Pork Chop Rice Noodles

Nobody dislikes the pork chop rice noodles at the corner shop. Maybe the noodles really are that charming, or maybe it is selection bias; either way, “XX Pork Chop Rice Noodles” must be a universally praised institution in Sai Wan. On an island-side afternoon, even when my appetite is as bland as the weather, the thought of those sour, numbing-spicy noodles still makes my mouth water — a Pavlovian kind of magical longing.

Following the romantic tracks of the ding-ding tram, under the generous shade of ten-million-dollar mansions, we stepped daintily from Belcher’s Street into Sai Cheung Street: my friends and I walked off in high spirits to eat pork chop rice noodles. Hong Kong eateries are generally tiny in size and terrible in temper; with bad luck you get forced to share a table with a strange couple, and with worse luck you accidentally absorb a bellyful of their sappy sweet talk and lose your appetite. Fortunately this time the three of us managed to claim a small round table of our own and concentrate, heads down, on attacking the food in our bowls.

Future engineer Bian Chengcheng, never one for appearances, saw the golden, tempting pork chop, let her darker impulses take over, and slurped her way to finishing first. Battle over, she looked up, found us still eating, and gazed around at a loss, sword drawn. Product manager Hu Badao is fair and pretty, but gnaws a pork chop better than either of us. She ate the meat elegantly first, then drank the soup heartily. When the soup was gone, the slippery white noodles were left piled in the bowl like a little hill, and Hu Badao pursed her lips and sucked them up. I eat a lot; every time I wish I could add an extra half portion of pork chop, and I dip the glistening fried peanuts into the coriander hot-and-sour soup.

As you can see, this bowl of pork chop rice noodles has three ingredients: the famous crispy pork chop, rice noodles simmered in stock, and peanuts fried the old-fashioned way. Nothing special about the materials — plain-looking wet-market stuff. But the taste is extraordinary and draws us like moths to a flame, all thanks to the cook’s skill. The food is served in a large sauce-coloured porcelain bowl decorated with a gradient of red and black, plated no less elegantly than a noble bowl of Japanese ramen costing a hundred and eight dollars. The rice noodles are white and tender, plump and round, half soaked in the stock and half coated in chili oil, the red and white delightfully intertwined. Each bowl holds three pieces of pork chop, two big and one small: one on the bone with a good chew, one pure meat for sheer satisfaction, and the last, thoroughly marinated little triangle goes into the mouth in one bite, juices overflowing. The pork chop is soft, the peanuts crisp and fragrant; only when I started cooking for myself did I realise how much work goes into the marinating and the control of the heat.

What you may not know is that the pork chop rice noodles also come in three flavours: “clear soup”, “mala”, and “sour mala”, which adds what is said to be a secret-recipe pickled cabbage. All three of us favour the sour mala. First, because of the layered flavours: with the tips of your chopsticks you carefully stir in the finely ground chili in red oil, pick up a piece of pale green, translucent, juicy pickled cabbage, take a small bite to coat your tongue in sweet and sour, then dunk it back into the soup for seasoning. Second, because compared with the clear soup, the sour mala gives you both extra chili and extra pickles at no extra charge, so eating it carries the thrilling little pleasure of getting away with something. Besides dressing the noodles, the shop also bottles and sells the chili oil, calling it “XX Secret Chili Oil”, ambitious enough to fancy itself the next Lao Gan Ma. If you ask me, this oil beats Lao Gan Ma on fragrance and heat; the latter has a slightly sour aftertaste that always makes me think of leftovers, and of the smell of single people’s refrigerators.

Year after year the same old waiter bustles about the shop. Round tables are scattered everywhere like stars, leaving almost no room to move, so his job consists mainly of 360-degree turns plus irregular short-distance translations, like a robot vacuum that has had the misfortune of encountering complicated terrain. Facing customers in every direction, one lift of the hand serves a dish, another collects the money. In my first year of university he wore a cotton T-shirt and a black apron, hair in a crew cut, face shiny with a bit of sweat, smiling dutifully at every customer. Now I am in the second year of my master’s, and he is still in a T-shirt and black apron, still sweating and still smiling dutifully. That smile is quite distinctive, its angle as standardised as a QQ emoji — it neither makes you feel he is being fake, nor makes you feel he is being familiar.

Amid Hong Kong’s glitter and the fast churn of youth, the old waiter and I have kept up a rare, long-standing devotion to the pork chop rice noodles. The little shop has probably earned him a fair bit of money. Looking back, because this star product is so outstanding, in five years I never once ordered anything else on the menu — at most, out of concern for table manners, a bottle of Coke and a pack of tissues. Whenever I think of this, I cannot help sighing over how many glass noodles, Shanghai noodles and smashed cucumbers I have missed.

Simmel said that once an object is placed there, people will endow it with meaning. Having rubbed shoulders with these noodles for so long, there are inevitably some dangerously intimate moments. For instance, walking back alone late at night to a rented room in an unfamiliar city, weighed down with worries about the future and about people, just like the unemployed youth in a Wang Feng song. The familiar spicy smell drifts out of the brightly lit little shop, the old waiter’s familiar figure boiling noodles. Bian Chengcheng and Hu Badao are not here; this time I have a table to myself. I order a familiar bowl of pork chop rice noodles and slurp it down. Sour and numbing, refreshing and spicy, no real aftertaste — once it is gone, it is gone. The decisiveness with which sensory pleasure vanishes can inspire some metaphysical associations. By the time I finish, I have been bought over once again: the noodles are so delicious, life feels warm again, and from now on I will never again mock certain things Hong Kong people do to protect this city.

Use Python to send personalised mass emails

My first task in Python – sending mass emails to my tutorial students. I have an Excel sheet with the emails of all the students who’ve selected my tutorial timeslots, and I am trying to send the same email to each of them. Should be simple.

Set-up

First I installed Atom as a text editor to write Python scripts. No particular reason, just that the author of Learn Python the Hard Way recommended it.

Modules used: openpyxl and smtplib

Installation: to install openpyxl, run

sudo pip install openpyxl

Note 1 – after sudo, the terminal prompts you to type your password but the input won’t show. Just type the correct password and hit enter.

Note 2 – before openpyxl, you need to install pip (the package manager for Python) first. According to this post, type

sudo easy_install pip

directly in my terminal and it worked.

Note 3 – to exit sudo, type exit or sudo -k or Command + D. See this post.

Note 4 – to run my script, change directory to where the file exists and type this in terminal –

python massemail.py

Note 5 – Python 2 and 3 seem to have different ways of installing modules.

Note 6 – range() in a Python for loop is closed at the beginning and open at the end: for i in range(2, 9) gives i = 2, 3, 4, 5, 6, 7, 8.

Note 7 – starttls() has to be put before ehlo(), and the reason is here

My first attempt to log in through username@connect.hku.hk failed because Google blocked the attempt. I then received an email in my mailbox saying it had blocked the attempt because of security settings. I changed the setting to allow “less secure apps” and it then worked.

Note 8 – Python data structures: lists, tuples and dictionaries. Compare to arrays in C++, and to vectors and lists in R. In R everything is essentially a vector or list that just appears in different ways? In that sense R is simpler and built for statistics, i.e. for dealing with sequences of data. A small sketch follows the list below.

  • lists [] – methods on lists: append, extend, insert, pop…
  • tuples () – elements separated by commas; immutable, so values cannot be changed; also faster than lists; only two methods, index and count
  • dictionaries {}
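A tiny sketch of the three structures with toy values:

grades = [90, 85]                        # list: mutable
grades.append(70)                        # append, extend, insert, pop, ...

point = (1, 2)                           # tuple: immutable, only index() and count()
print(point.count(1))                    # -> 1

emails = {'Mike': 'mike@example.com'}    # dictionary: key -> value
emails['Anna'] = 'anna@example.com'      # add or overwrite a key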

Note 9 – the smtplib documentation for Python 2.7 offers a fancier example (technical documentation) with prompts for the sender and receivers, as well as the message. There are three required arguments for the SMTP.sendmail() method.

Email headers need careful formatting, and this requires detailed knowledge of the arguments of the sendmail method. Excerpt from the documentation above –

Send mail. The required arguments are an RFC 822 from-address string, a list of RFC 822 to-address strings (a bare string will be treated as a list with 1 address), and a message string. The caller may pass a list of ESMTP options (such as 8bitmime) to be used in MAIL FROM commands as mail_options. ESMTP options (such as DSN commands) that should be used with all RCPT commands can be passed as rcpt_options. (If you need to use different ESMTP options to different recipients you have to use the low-level methods such as mail(), rcpt() and data() to send the message.)

A simple solution here
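A minimal sketch of those three arguments, with placeholder addresses and password (same Gmail settings as above): a from-address string, a list of to-address strings, and a message string.

import smtplib

smtpObj = smtplib.SMTP('smtp.gmail.com', 587)
smtpObj.starttls()
smtpObj.ehlo()
smtpObj.login('me@gmail.com', 'psw')   # placeholder credentials
# from-address, list of to-addresses, message string (headers, blank line, body)
smtpObj.sendmail('me@gmail.com',
                 ['friend@example.com'],
                 'Subject: Test\n\nHello from sendmail.')
smtpObj.quit()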

Note 10 – def defines a function.

Personalization 

I used plain text instead of an HTML body, as the latter seems unnecessary at the moment.

Note 11 – this example has a prompt for writing the body of the email. It uses triple quotes to enclose a multi-line string (a docstring is such a triple-quoted string placed as the first statement of a module, function or class to document it).

To do the personalization I have a “list” of names and emails paired together. One example puts them in a dictionary and loops through it with a for loop.
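A small sketch of that dictionary approach, with made-up names and addresses (printing instead of actually sending):

emaillist = {'Mike': 'mike@example.com', 'Anna': 'anna@example.com'}  # hypothetical pairs
for name, email in emaillist.items():
    body = 'Hello %s, here is the email.' % name        # personalised body
    print('Would send to %s <%s>: %s' % (name, email, body))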

Note 12 – use %s to substitute values into strings and format them, as this post illustrates. An example:

"Hello %s, my name is %s." % ('Mike', 'Yuqiong')
"Today is %s %d." % ('Feb', 21)

Note 13 – when testing, if multiple entries use the same name as the key, Python overwrites the earlier value for that key with the later one. In serious scenarios this will create bugs. Use del to delete an unwanted variable.
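A quick sketch of this overwriting behaviour with toy addresses:

emaillist = {}
emaillist['Mike'] = 'mike1@example.com'
emaillist['Mike'] = 'mike2@example.com'   # same key: the earlier value is overwritten
print(emaillist)                          # {'Mike': 'mike2@example.com'}
del emaillist                             # delete the variable when it is no longer wanted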

Final code
I used SyntaxHighlighter Evolved to insert the code below. Just wrap the code with [name-of-language] [/name-of-language] in the text editor, without escaping the HTML environment.

import openpyxl, pprint, smtplib
# smtplib and pprint are part of the standard library, so no extra installation is needed

# Read tables into a Python object
print ('Opening workbook...')
wb = openpyxl.load_workbook('tutorials.xlsx')
# Get all sheet names
wb.get_sheet_names()
# Get a particular sheet - in this workbook, only one sheet, "Responses"
response = wb.get_sheet_by_name('Responses') # Note here name is case sensitive
response['A1'] # Select a cell as an object
response['A1'].value # show the value of the cell

# Start the real work
emaillist = {} # emaillist is an empty dictionary to be filled with name -> email pairs
print ('Reading rows...')
for row in range(2, 60+1): # for loop is open in the end
    name = response['C'+str(row)].value # Note here how to manipulate strings in Python
    email = response['E'+str(row)].value
    emaillist[name] = email


import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText

smtpObj = smtplib.SMTP('smtp.gmail.com', 587)
smtpObj.starttls() # Upgrade the connection to a secure one using TLS (587)
# If using SSL encryption (465), can skip this step
smtpObj.ehlo() # To start the connection
smtpObj.login('me@gmail.com', 'psw')

# Placeholder names and addresses for testing; in practice, loop over emaillist built above
nlist = 'name1', 'name2'
elist = 'email1@gmail.com', 'email2@gmail.com'

for i in range(0, len(elist)):
    msg = MIMEMultipart()
    msg['From'] = 'Yuqiong Li <me@gmail.com>' # Note the format
    msg['To'] = '%s <%s>' % (nlist[i], elist[i])
    msg['Subject'] = 'Testing email for %s' % nlist[i]
    message = '%s here is the email' % nlist[i]
    msg.attach(MIMEText(message))
    smtpObj.sendmail('me@gmail.com',elist[i],msg.as_string())

smtpObj.quit()

 

On Spending Chinese New Year Alone in America

Nine out of ten of the people I care about are in America. This New Year, eight of those nine did not go home. Xiao Hong is taking classes on the East Coast; Xiao Gao is writing code on the West Coast. All the star students are in Chicago, scrambling up the peaks of knowledge. They are scattered across North America, living out of breath, and they have two things in common. One: on New Year’s Eve they all went to classmates’ places and clumsily wrapped dumplings over nondescript tableware. Two: they are about to become, or already are, code monkeys.

On New Year’s Eve I watched the Spring Festival Gala holding a scalding bowl of sticky rice and pork ribs, while my parents sat beside me, heads down, grabbing red envelopes on their phones. Thinking of my friends having to spend the New Year alone in America made my heart ache. America may be great, but it is not Chinese turf. The New Year is the grand pinnacle of Chinese tradition — how could you ever gather enough material in America to pull it off?

Food is the first problem. The Chinese supermarket is far away over hill and dale; to get there you have to latch onto someone with a car and chug along for forty minutes. Inside, all you see are dumb-looking frozen fish and frozen shrimp. Our best prawns, by contrast, have to be thrown into the pot with a sizzle while the live, jumping shrimp aren’t paying attention — only then do they taste sweet. To make dumplings you need to mix the filling: two pounds of pork chopped into a paste, which ought to be mixed with fresh yellow chives, but the Asian supermarket only has scallions, and limp ones at that. They say pigs in America are electrocuted rather than bled, so the meat always has a gamy smell that can only be papered over with cooking wine.

As for more advanced pleasures, forget it. Braised chicken legs and braised eggs with a hint of spice; crisp-skinned roast spring chicken dripping with oil; squirrel-cut mandarin fish dipped in tomato sauce. And meatballs — soft, melt-in-the-mouth fish balls, mung bean balls, tofu balls, glutinous-rice-coated meatballs. And pounded rice cake: in the morning it is pan-fried with brown sugar pulled into threads, at night it is simmered in pork rib and lotus root soup. You pick it up with your chopsticks, it stretches into a long strand, and you stuff the whole thing into your mouth. Eating, I kept thinking: where are they supposed to find any of this?

Never mind the fun of fireworks and firecrackers — back home you can buy ten yuan’s worth of poppers and bully the snot-nosed kid next door to tears. Never mind the awkwardness and warmth of visiting relatives. My cousin’s child has somehow grown this tall and starts primary school in the autumn, while I have neither graduated nor married, cannot hand out New Year money, and barely deserve to be called “auntie”.

Never mind the idleness of the holiday. Never mind the shaky signal of transoceanic video calls with your parents. Never mind having to scroll through jokes online just to keep up with the Gala snark in your friends’ feeds. Never mind trying to find any festive cheer in the faces of white passers-by. It is just another ordinary day in a foreign small town. Thinking of this, my heart aches a little.

I have been to America twice, both times as an exchange student. I did not stay long — a bit more than half a year in total. Both times it was winter, and both times I was studying humanities. In a Midwestern town at minus thirty degrees, I would sometimes space out: my hometown has neither this much snow nor this many unfamiliar languages and smiles. When the sun was good I would put myself into the crowd, like releasing a fish into the sea.

On exchange I spent one Chinese New Year and half an American one. On New Year’s Eve two years ago, a senior schoolmate took me out to dinner and invited her friend along. The three of us were from Shanghai, Nanjing and Wuhan, and our favourite foods were, respectively, Nanxiang soup dumplings, duck-blood vermicelli and beef doupi. That day we sat on the second floor of Asian Kitchen, looking out at the rows of street lamps on State Street, the busiest street in town. The hot pot arrived sizzling and steaming, a few chilies floating in the broth. We could not wait to drop in potato slices; once it boiled we could swish the meat, and we pressed shrimp paste into the pot with a small spoon. Over dinner we talked about course selection, schmoozing professors and applications, and how such-and-such a professor was a weirdo. The senior was torn between a PhD and a job, when to go back to China, and whether to get together with the ambiguous boy who seemed to have no future. After dinner we went our separate ways and I called home; it was afternoon in China. Grandpa asked whether I missed him, and I said I did. I asked whether he missed me, and he laughed and said, “Not that much.” My second aunt has a pair of lively boy-girl twins who are slowly growing up and have finally stripped me of my privilege of being the child.

Last November I spent Thanksgiving at the home of an elderly American lady. We had never met; it was an activity the school put together for international students with nowhere to go. Besides taking me in, the household also brought along a Chinese boy. Apparently some American families deliberately invite a few strangers for Thanksgiving so that, for the sake of appearances, everyone is on their best behaviour. The old lady sent a two-page email in advance with the menu and the schedule. I figured I could not show up empty-handed, so the evening before the holiday I rushed out in a panic to buy something. The small town has few shops to begin with, and with the holiday, by six o’clock the streets were pitch dark without a soul in sight, quite desolate. For the honour of the nation, I finally spent a small fortune on a bottle of sweet wine and strode proudly through the door of the old lady’s gleaming apartment. She really did things properly: the place cards prepared for us were fine white porcelain with each person’s name in cursive. Her friends included a gay couple, a book editor, and a speech therapist with her dog. Wearing the calm, contented smiles of a middle-class family in an American TV series, they taught us to carve the turkey, to make the gravy for the mashed potatoes, and to mix gin with soda, lime slices and ice. We talked about China, Trump and the reindeer in the national parks, about friends and family who had nothing to do with one another, laughing in easy rapport — none of it felt like a chance meeting of strangers.

A while ago I joined a PhD students’ group chat by mistake and was too embarrassed to leave, so I got to spy on the grand spectacle of their New Year. The theme was that a certain professor had invited his students to his house for the holiday, and everyone was discussing what dish to bring. Studying is hard and there is little to eat, so overseas students all cultivate serious cooking skills. By my observation, chicken appeared most frequently. Those who cannot cook usually bring a fruit platter, wine or tangyuan. If a meal includes dumplings, it will probably appear in the feed alongside a group photo of the guests. At which point I express my concern and blessings for their happy lives with a “like”.

On a winter night in a foreign land, Dan Bao, a writer I like, described the relationship between festivals and home like this:

Family is not something that comes naturally; it must be continually drawn close, in everyday and in ritual moments, just as lovers need a constant infusion of love’s nourishment, just as New Year goods are the resolve to spend the New Year together — beyond that, there is nothing. And home is a choice: made by gathering, by distinguishing, by drawing near, by drifting apart, by the act of hurrying back for the New Year to the home you have claimed as your own.

Since starting university I have rarely called where I live “home”. The first year I lived in a dorm; the second and third years, another dorm. The fourth year I rented a place with friends. The fifth year I moved dorms for the third time. This year I have rented yet another room. I often cannot find the right word for these lodgings: “rental” is too formal, “the place where I live” too wordy, “home” too intimate. Then I saw a PhD senior refer on his blog to where he lives in America as his “apartment”, and felt the word captured that lukewarm sense of detachment. The difference between an apartment and a home, in Dan Bao’s words, is the missing act of “claiming”.

And the New Year is exactly such a definite act, full of ritual. Picking up the car in heavy snow, waiting for friends, driving forty minutes to the 99 Ranch supermarket — that is an act of claiming. Chopping the filling, rolling out dumpling wrappers, laboriously washing the flour from between your fingers — that is an act of claiming. Working out the time difference, hunting everywhere for a signal to video-call your parents, watching the Gala online — that too is an act of claiming. And my counting the days to buy a ticket, dragging my suitcase onto the bus, getting home at eleven at night to my parents cooking me noodles — that is just the same.

I know that for those in America, beyond these rituals and acts of claiming, there are more important things in life: LeetCode, interviews, OPT, H-1B, the lottery, and after that the green card and the school-district house. Their American dreams are running ahead of mine. I am often ashamed of my own weakness; when I felt lonely in that small town, I knew I would not stay long. But inside their bodies is a train, built to move forward, not to stop. Engineer Zhao, Lawyer Qian, Professor Sun, Boss Li — may that land repay your youth and talent with justice, and may all your American dreams come true. And may you have already eaten a steaming bowl of dumplings, to see off this business of spending the New Year alone abroad.