Monthly Archives: February 2017

Natural Language Processing Notes – Keep Updating

References –

Structure of this note: I plan to note down key concepts, mathematical constructs and explanations.

Basic tasks in NLP

  • language modeling
  • POS tagging
  • named entity recognition
  • sentiment analysis
  • paraphrase detection

From a StackExchange user’s post –

NLP is very vast and varied. Here are a few basic tools in NLP:

  1. Sentence splitting: Identifying sentence boundaries in text
  2. Tokenization: Splitting a sentence into individual words
  3. Lemmatization: Converting a word to its root form. E.g. says, said, saying all map to the root form – say
  4. Stemmer: Similar to a lemmatizer, but it stems a word rather than reducing it to a root form. E.g. laughed, laughing will stem to laugh. However, said, saying will map to sa – which is not particularly enlightening in terms of what “sa” means
  5. POS tagger: Tags a word with the Part of Speech – what is a noun, verb, preposition etc.
  6. Parser: Links words with POS tags to other words with POS tags. E.g. John ate an apple. Here John and apple are nouns linked by the verb ate; John is the subject of the verb, and apple is the object.
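The lemmatizer/stemmer distinction above can be seen with a toy suffix-stripping stemmer. This is a deliberately naive sketch, not a real algorithm such as Porter’s, but it reproduces the behaviour described: laughed and laughing come out right, while saying gets mangled into “sa”.

```python
def toy_stem(word):
    # Strip the first matching suffix, as long as a non-trivial stem remains.
    # Real stemmers are more careful, but show the same failure mode.
    for suffix in ('ying', 'ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

print(toy_stem('laughed'))   # laugh
print(toy_stem('laughing'))  # laugh
print(toy_stem('saying'))    # sa  <- not enlightening, as noted above
```

A lemmatizer, by contrast, uses a dictionary and morphological rules, so it can map said and saying back to say instead of blindly chopping suffixes.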

If you are looking for the state of the art in these tools, check out Stanford CoreNLP, which has most of them along with trained models to identify the above in a document. There is also an online demo to try Stanford CoreNLP before downloading and using it with your application.

NLP has several subfields. Here are a few of them:

  1. Machine Translation: Automatic Translation from one language to another
  2. Information Retrieval: Something like a search engine, that retrieves relevant information from a large set of documents based on a search query
  3. Information Extraction: Extract concepts and keywords – such as names of people, locations, times, synonyms etc.
  4. Deep Learning: lately applied widely across NLP, aiming at systems that understand a document the way a human does.

2017 Mar 8 Women in Data Science HKU

A text-classification project based on news articles

  • N-grams language model
  • Chinese text segmentation – Jieba, a tool on GitHub
  • Hidden Markov Model (HMM)
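As a sketch of the “N-grams language model” bullet, here is a minimal bigram model in plain Python. The training sentences are made up for illustration; for Chinese text, the tokens would come from a segmenter such as Jieba rather than whitespace splitting.

```python
from collections import defaultdict

def train_bigram(sentences):
    # Count bigram occurrences, with <s>/</s> marking sentence boundaries.
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ['<s>'] + sent.split() + ['</s>']
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def bigram_prob(counts, prev, cur):
    # Maximum-likelihood estimate P(cur | prev), no smoothing.
    total = sum(counts[prev].values())
    return counts[prev][cur] / float(total) if total else 0.0

counts = train_bigram(['the cat sat', 'the cat ran'])
print(bigram_prob(counts, 'the', 'cat'))  # 1.0
print(bigram_prob(counts, 'cat', 'sat'))  # 0.5
```

A real language model would add smoothing (e.g. add-one or Kneser-Ney) so that unseen bigrams do not get probability zero.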

From the web – general review articles – pop articles 

State of the art – AI-hard (a concept related to strong AI) – Alibaba iDST – as of 2017 Apr 17

Stage I: 2014


  • They started on human–machine conversation in 2014. The commercial motivation came from two trends. First, smartphones became popular, creating use scenarios that mouse and touch cannot satisfy. Second, Internet services were expanding into every scenario of everyday life, opening the possibility for other interaction methods.


  • In early 2014, human–machine conversation was at a primitive stage.
  • On China’s Taobao, few people actually used speech to search.
  • In early 2014, many companies’ speech-search products died.

Stage II: 2015 – 2016

Facts and current work

  • Alibaba iDST worked with YunOS on a speech-enabled interactive search assistant (something like Siri?). But a problem arose: it had to compete for users’ attention with other apps across all activities, and therefore had to keep track of what every application was doing (buying air tickets, buying food, etc.) = a huge workload. So in 2015 they transitioned to building a platform that exposes an API to every app (no details given).
  • Main tasks now: 1) use machines to understand human language (intention + key information, e.g. booking a train ticket); 2) manage the interaction (keep asking until the key information is obtained); 3) open-ended conversation (Wikipedia…); 4) casual chat (not so valuable in a commercial sense).


  • Engineering: scalability – scaling to particular domains and to particular devices. This also came up in that NLG paper whose authors built a commercialised natural language generation system for a hospital; from an engineering perspective the challenges were 1) accuracy and 2) a scalable system.
  • Science: general challenges in NLP include variation in users’ language (what exactly do you want when you say “I want to go to the US”?), understanding context, robustness, and the shared common sense that makes a conversation flow smoothly.

Stage III: 2016 – now (2017 Apr)


  • The NLP engine moved from traditional machine learning methods to deep learning: intention classification = CNN, slot filling = Bi-LSTM-CRF, context (model not specified), robustness = data augmentation.
  • They developed a task-flow language to attack the “understanding context” problem.
  • For the per-domain API, they developed OpenDialogue with different specific tasks.

Current applications 

  • On smart appliances… something like Amazon Alexa.
  • Some scenarios where human–machine conversation can be useful: the Internet and cars – you can talk while you drive.

Challenges again

  • Open-domain NLP – dealing with scalability across specific domains.
  • Current human–machine interaction is mostly single-round conversation, or can only take limited context into account, so modelling with regard to specific context is important.
  • Data-driven learning instead of user-defined modules in conversations.

Deep Learning 

  • caffe – a Berkeley package



Conversation with Yuchen Zhang

Semantic parsing

Natural language – logical form – knowledge base

  • Question answering: what academia cares about – answering fairly complex questions, which requires logical reasoning
    • Argmax …
    • Question – operations on the data source
  • How far this can go is an open problem; any bit of progress transfers directly to industry
  • Research community – cannot get the (industrial) data – so it abstracts the problem – e.g. a Wikipedia question-answering dataset


Personality – papers on sentiment analysis

Why was the problem simplified? At first it was not because sentiment analysis seemed very useful – for Amazon reviews, for example, you can just look at the star rating; no need to read the text.

Chatbot analysis is not very accurate either.

Word extraction analysis is rather poor.

In practice it does not transfer particularly well; a community has formed around it, and sentiment analysis no longer cares much about semantic analysis.


data coding –

information extraction – from a paragraph, extract relations between entities A & B

  • Open source – Open IE
  • University of Washington

summarization – an open question, still active research

  • Research question
  • Long paragraphs — short paragraphs
  • Neural models


NLP – it cannot help with understanding per se, but it provides basic tools: stock-price prediction from news (is Microsoft stock going up?), reviews – user experience as the output.

  • Open-source tools cannot do this end to end
  • What can they do? Extract features
    • Sentiment analysis – positive or negative emotions
    • Relations
    • Feature vectors for prediction


  • Keywords – what is a topic model: first train on, say, ten thousand news articles; unsupervised training groups articles discussing the same thing
  • Sentiment analysis – Amazon product reviews
  • End-to-end tasks: summarization is an active research area
    • Baseline: take the first line
    • Language-independent – Chinese corpora
  • Question answering
  • Reading comprehension


  • Sub-problems: word embeddings – news vs. literary corpora


Percy Liang

Quite a few are professors in linguistics – linguists using computational methods.

Cho moved onto datasets.

Cho is known for neural machine translation.

  • Postdoc
  • Deep learning class paper
  • Attention model
  • GRU
  • Autonomous driving
  • Vision

Chris Manning

  • Deep learning + NLP

NLP is hot in industry

  • Semantic parsing – search engines – rule-based is more conservative – Google in search
  • Question answering – dialogue systems – hot – AI assistants
  • Fundamental modules – word embeddings, parsing, POS tagging, dependency parsing – used as features in applications


No one dislikes the pork-chop rice-noodle shop around the corner. Maybe the noodles really are that charming, or maybe it is selection bias; either way, “XX Pork-Chop Rice Noodles” enjoys a solid reputation in Sai Wan. On an island-side afternoon, even when my appetite is as bland as the weather, the thought of those sour, tingling pork-chop noodles still makes my mouth water: a Pavlovian sort of magical attachment.




What you may not know is that the pork-chop noodles come in three flavours: “clear broth”, “mala”, and a “sour mala” with what is said to be a secret-recipe pickled cabbage. All three of us prefer the sour mala. First, because the flavour is layered: with the tips of your chopsticks you carefully stir in the finely ground chilli oil, pick up a piece of pale-green, translucent, juicy pickled cabbage, take a small bite so your tongue is coated in sweet and sour, then dip it back into the broth for seasoning. Second, because compared with the clear broth, the sour mala adds both chilli and pickled cabbage at no extra charge, which gives eating it the nervous thrill of getting away with a small bargain. Besides dressing the noodles, the shop also bottles and sells its chilli oil as “XX Secret Chilli Oil”, with ambitions of becoming the next Lao Gan Ma. If you ask me, this oil beats Lao Gan Ma on fragrance and heat; the latter has a slightly sour aftertaste that always brings to mind leftovers, and the smell of a single person’s fridge.

The same old fellow has worked the shop for years. Round tables are crammed in everywhere, leaving little room to walk, so his job consists mainly of 360-degree turns plus irregular short-distance translations, like a robot vacuum that has run into complicated terrain. Facing customers in every direction, one lift of the hand serves a dish; another collects the money. In my first year of university he wore a cotton T-shirt and a black apron, crew cut, face shiny with a little sweat, smiling dutifully at every customer. Now I am in the second year of my master’s, and he is still in the T-shirt and black apron, still sweating and still smiling dutifully. That smile is quite distinctive: its angle is as standardised as a QQ emoticon, striking you as neither fake nor familiar.



Use Python to send personalised mass emails

My first task in Python: sending mass emails to my tutorial students. I have an Excel sheet with the emails of all the students who’ve selected my tutorial timeslots, and I am trying to send the same email to each of them. Should be simple.


First I installed Atom as a text editor to write Python scripts. No particular reason, just because the author of Learn Python the Hard Way recommended it.

Modules used: openpyxl and smtplib

Installation: to install openpyxl, run

sudo pip install openpyxl

Note 1 – after sudo, the terminal prompts you to type a password but it won’t show! Just type the correct password and hit enter.

Note 2 – before openpyxl, you need to install pip (the package manager for Python) first. According to this post, type

sudo easy_install pip

directly in the terminal, and it worked.

Note 3 – to exit sudo, type exit or sudo -k, or press Control-D. See this post.

Note 4 – to run my script, change directory to where the file exists and type this in the terminal –


Note 5 – Python 2 and 3 seem to have different ways of installing modules.

Note 6 – a for loop range in Python is closed at the beginning and open at the end: for i in range(2, 9) yields i = 2, 3, 4, 5, 6, 7, 8.
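The half-open behaviour can be checked directly:

```python
# range(2, 9) includes the start (2) but stops before the end (9)
values = list(range(2, 9))
print(values)  # [2, 3, 4, 5, 6, 7, 8]
```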

Note 7 – starttls() has to be put before ehlo(), and the reason is here.

My first attempt to log in failed because Google blocked it. I then received an email in my mailbox explaining that the attempt was blocked because of security settings. I changed the setting to allow “less secure apps” and it then worked.

Note 8 – Python data structures: lists, tuples and dictionaries. Compare to arrays and variables in C++, and to lists in R. In R everything is essentially a list that just appears in different ways? In that sense R is simpler and built for statistics, i.e. for dealing with sequences of data.

  • lists [] – methods on lists: append, extend, insert, pop…
  • tuples () – elements separated by commas; immutable, so values cannot change; also faster than lists; only methods are index and count
  • dictionaries {}
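A quick illustration of the three structures from Note 8 (the names and values here are made up):

```python
fruits = ['apple', 'banana']        # list: mutable
fruits.append('cherry')             # append, extend, insert, pop ...

point = (3, 4)                      # tuple: immutable, only index() and count()
# point[0] = 5 would raise TypeError

ages = {'Mike': 30, 'Yuqiong': 25}  # dictionary: key -> value lookup
print(ages['Mike'])                 # 30
```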

Note 9 – the smtplib documentation for Python 2.7 offers a fancier example (a technical documentation) with prompts for sender, receivers and message. The SMTP.sendmail() method takes three required arguments.

Email headers need careful formatting, and this requires detailed knowledge of the arguments of the sendmail method. Excerpts from the documentation above –

Send mail. The required arguments are an RFC 822 from-address string, a list of RFC 822 to-address strings (a bare string will be treated as a list with 1 address), and a message string. The caller may pass a list of ESMTP options (such as 8bitmime) to be used in MAIL FROM commands as mail_options. ESMTP options (such as DSN commands) that should be used with all RCPT commands can be passed as rcpt_options. (If you need to use different ESMTP options to different recipients you have to use the low-level methods such as mail(), rcpt() and data() to send the message.)

A simple solution here
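A minimal sketch of the message string that sendmail() expects: header lines, a blank line, then the body. The addresses here are placeholders, and build_message is my own helper for illustration, not part of smtplib.

```python
def build_message(from_addr, to_addr, subject, body):
    # RFC 822 format: headers separated by CRLF, then a blank line, then the body.
    headers = ['From: %s' % from_addr,
               'To: %s' % to_addr,
               'Subject: %s' % subject]
    return '\r\n'.join(headers) + '\r\n\r\n' + body

msg = build_message('a@example.com', 'b@example.com', 'Hi', 'Hello there')
# To send it (needs a live, authenticated connection):
# smtpObj.sendmail('a@example.com', ['b@example.com'], msg)
```

Note that the headers the recipient sees come from the message string, while the first two arguments of sendmail() form the SMTP envelope; they can differ, which is why a badly formatted message string still gets delivered but looks wrong in the mail client.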

Note 10- def defines a function


I tried simple text instead of an HTML-type body message, as the latter seemed unnecessary at the moment.

Note 11 – this example has a prompt for writing the body of the email. It uses triple quotes to enclose a multi-line string (the same syntax Python uses for docstrings, i.e. documentation strings placed at the top of a function or module).

To do personalization I have a “list” of names and emails paired together. One example puts them in a dictionary and loops through it with a for loop.

Note 12 – use %s to replace and format strings, as this post illustrates. An example:

"Hello %s, my name is %s." % ('Mike', 'Yuqiong')
"Today is %s %d." % ('Feb', 21)

Note 13 – when testing, inserting multiple entries with the same key into a dictionary makes Python overwrite the earlier values. In serious scenarios this will create bugs. Use del to delete an unwanted variable.

Final codes 
Used SyntaxHighlighter Evolved to insert the code. Just wrap the code with [name-of-language] [/name-of-language] in the text editor, without escaping the HTML environment.

import openpyxl, pprint, smtplib
# smtplib and pprint are part of the standard library, so no installation needed

# Read tables into a Python object
print ('Opening workbook...')
wb = openpyxl.load_workbook('tutorials.xlsx')
# Get all sheet names
# Get a particular sheet - in this workbook, only one sheet "response"
response = wb.get_sheet_by_name('Responses') # Note here name is case sensitive
response['A1'] # Select a cell as an object
response['A1'].value # show the value of the cell

# Start the real work
emaillist = {} # an empty dict to be filled with name -> email pairs
print ('Reading rows...')
for row in range(2, 60+1): # for loop is open in the end
    name = response['C'+str(row)].value # Note here how to manipulate strings in Python
    email = response['E'+str(row)].value
    emaillist[name] = email

from email.MIMEMultipart import MIMEMultipart # Python 2 module paths
from email.MIMEText import MIMEText

smtpObj = smtplib.SMTP('', 587)
smtpObj.starttls() # Upgrade the connection to a secure one using TLS (587)
# If using SSL encryption (465), can skip this step
smtpObj.ehlo() # To start the connection
smtpObj.login('', 'psw')

nlist = 'name1', 'name2'
elist = '', ''

for i in range(0, len(elist)):
    msg = MIMEMultipart()
    msg['From'] = 'Yuqiong Li <>' # Note the format: Display Name <address>
    msg['To'] = '%s <%s>' % (nlist[i], elist[i])
    msg['Subject'] = 'Testing email for %s' % nlist[i]
    message = '%s here is the email' % nlist[i]
    msg.attach(MIMEText(message, 'plain')) # attach the body to the message
    smtpObj.sendmail('', [elist[i]], msg.as_string()) # envelope from-address elided

smtpObj.quit() # close the connection when done











During my exchange I spent one Chinese New Year and half an American one. Two years ago on New Year’s Eve, a senior schoolmate took me to dinner and invited her friend along. The three of us came from Shanghai, Nanjing and Wuhan, and our favourite foods were, respectively, Nanxiang soup dumplings, duck-blood vermicelli and beef tofu skin. That day we sat on the second floor of Asian Kitchen, looking out at the rows of street lamps on State Street, the busiest street in town. The hotpot arrived hissing with white steam, a few chillies floating in the broth. We couldn’t wait to drop in the potato slices; once the soup boiled we could swish the meat, using small spoons to scoop shrimp paste into the pot. Over dinner we talked about course selection, emailing professors and applications, and how such-and-such professor was a weirdo. The senior was torn between a PhD and a job, over when to go back to China, over whether an ambiguous boy with seemingly no future was worth being with. Afterwards we went our separate ways, and I called home; it was afternoon in China. My grandfather asked if I missed him, and I said yes. I asked if he missed me; he laughed and said, “Not that much.” My aunt has a lively pair of boy-girl twins who, growing up, have finally stripped me of my privileges as the child.





Since starting university I have rarely called where I live “home”. The first year I lived in a dorm; the second and third years, another dorm. The fourth year I rented a place with friends. The fifth year I moved dorms for the third time. This year I am renting yet another room. I can never find the right word for these lodgings: “rental” sounds too bookish, “the place I live” too wordy, “home” too intimate. Then I saw a PhD senior refer on his blog to where he lived in America as his “apartment”, and felt the word captured exactly that lukewarm sense of detachment. The difference between an apartment and a home, in 淡豹’s words, is the missing act of claiming it as one’s own.