Natural Language Processing Notes – Keep Updating

References –

Structure of this note: I plan to note down key concepts, mathematical constructs and explanations.

Basic tasks in NLP

  • language modeling,
  • POS tagging,
  • named entity recognition,
  • sentiment analysis
  • and paraphrase detection

From a StackExchange user’s posts-

NLP is very vast and varied. Here are a few basic tools in NLP:

  1. Sentence splitting: Identifying sentence boundaries in text
  2. Tokenization: Splitting a sentence into individual words
  3. Lemmatization: Converting a word to its root form. E.g. says, said, saying will all map to root form – say
  4. Stemmer: It is similar to a lemmatizer, but it stems a word rather than get to the root form. e.g. laughed, laughing will stem to laugh. However, said, saying will map to sa – which is not particularly enlightening in terms of what “sa” means
  5. POS tagger: Tags a word with the Part of Speech – what is a noun, verb, preposition etc.
  6. Parser: Links words with POS tags to other words with POS tags. E.g. John ate an apple. Here John and apple are nouns linked by the verb – eat. John is the subject of the verb, and apple is the object of the verb.

If you are looking for the state of the art for these tools, check out StanfordCoreNLP, which has most of these tools and a trained model to identify the above from a document. There is also an online demo to check out stanfordCoreNLP before downloading and using it with your application.

NLP has several subfields. Here are a few of them:

  1. Machine Translation: Automatic Translation from one language to another
  2. Information Retrieval: Something like a search engine, that retrieves relevant information from a large set of documents based on a search query
  3. Information Extraction: Extract concepts and keywords – such as names of people, locations, times, synonyms etc.
  4. Deep Learning has lately become a new field of NLP, where a system tries to understand a document like a human understands it.

2017 Mar 8 Women in Data Science HKU

A project of text classification based on news articles

  • N-grams language model
  • Chinese text segmentation – a tool Jieba on github
  • Hidden Markov Model (HMM)

From the web – general review articles – pop articles 

State of art of – AI hard – a concept — strong AI – Ali iDST – as of 2017 Apr 17

Stage I: 2014


  • They started human-machine conversation in 2014. Commercial motivation = two trends. First, smart phone are popular, thus various use scenarios that mouse and touch cannot satisfy. Second, Internet service expand to all scenarios of everyday life! Thus possibility for other interaction methods.


  • In early 2014 human machine conversation is at primary stage.
  • In China Taobao, few people actually use speech to search.
  • In early 2014 many companies’ speech search product died.

Stage II: 2015 – 2016

Facts and current work

  • Alibaba iDST work with YUNOS on a speech-enabled interactive search assistant (something like Siri?) But then problems = compete for users’ attention with other apps on all activities. Thus have to take care of what every application is doing (buy air tickets, buy food… etc) = huge workload. Thus in 2015 they transit to work on a platform to build an API to every app. (no details here)
  • Main tasks now: 1) use machine to understand human language (intention + key information. e.g. book train ticket). 2) Manage interaction (keep asking until get key information). 3) Open-ended conversations (wikipedia…) 4) Casual chat (not so valuable in commercial sense)


  • Engineering: scalability … scale on particular domain + scale on particular device! also talked in that NLG paper where the authors built a commercialised natural language generation system for a hospital. From an engineering perspective, challenges include: 1) accuracy and 2) scalable system.
  • Science: general challenges in NLP include various of user’s language (what exactly do you want when you said “you want to go to the US”) + understand context + robustness + sharing of common sense that makes the conversation flow smooth

Stage III: 2016 – now (2017 Apr)


  • The NLP engine from traditional machine learning method to deep learning!! Intention = CNN, slot-filling = Bi-LSTM-CRF, context (didn’t specify which model!) robustness = data augmentation
  • developed a task-flow language to solve the “understand context” problem
  • For API to domain, developed a OpenDialogue with different specific tasks

Current applications 

  • On smart appliances… something like Amazon Alexa 
  • Some scenarios where human-machine conversation can be useful = Internet and cars. talk when you drive..

Challenges again

  • open-domain NLP — deal with scalability of specific domains
  • current human machine interaction are mostly single-round conversation, or can only take into account limited context. So, modelling with regard to specific context important.
  • Data-driven learning instead of user-defined modules in conversations.

Deep Learning 

  • caffe – a Berkerley package



Conversation with Yuchen Zhang

Semantic parsing

自然语言– logical form — -knowledge base

  • 问答,学术界关心,比较复杂的问题的问答,需要逻辑推理
    • Argmax …
    • Question — 数据源的操作
  • 做到什么程度,open problem 任何一点 progress 就可以直接 transfer 到 industry
  • Research community — cannot get the data — abstract the problem — Wikipedia question answering dataset


Personality — 分析情感的 paper

问题的简化,why? 一开始并不是因为分析情感有太大用处,例如 Amazon review — 直接看分就可以了。不需要看文字

Chatbot 分析也不是很准


Word extraction 分析比较差。

现实,并不是特别能  transfer ,形成了这么一个群体,sentiment analysis 其实已经不太关心 semantic analysis


data coding –

information extraction – paragraph — A & B relations

  • Open source — open IE
  • UWashington

summarization – open question still active research 

  • Research question
  • Long paragraph — short pararaphs
  • Neural models


NLP – 不能帮助理解。但是可以提供基本工具,股价预测,新闻,MicroSoft stock going up ? Review — 用户体验,输出

  • Opensource cannot do this
  • What it can do? Extract features
    • Sentiment analysis — emotions positive or negative
    • Relation
    • Feature vector 做预测


  • 关键词 – topic model 是什么,先做 training 一万篇新闻,unsupervised training 谈论同一件事情
  • Sentiment analysis — amazon 产品 review
  • 端到端的任务,summarization active research area
    • Baseline first line
    • Language independent — 中文语料
  • question answering
  • Reading comphehensive


  • 分问题:word embedding — 新闻,文学  corpus


Parcy Liang

相当一部分  professor in linguistics 语言学家,computational methods

Cho 拍到 datasets 上面来

Cho neural machine translation 出名

  • Postdoc
  • Deep learning class paper
  • Attention model
  • GRU
  • Auto- drive
  • Vision

Chris Manning

  • Deep learning + NLP

NLP 工业界比较热门

  • semantic parsing — search engine — rule based more conservative — Google in search
  • Question answering — dialog system — hot — AI assistant
  • Fundamental module — word embedding, parsing, post-text , dependency as features used in applications
Share this post

Leave a Reply

Your email address will not be published. Required fields are marked *