- Speech and Language Processing (2nd Edition, 2007, Prentice-Hall), by Daniel Jurafsky and James Martin
- UCB CS 294-5: Statistical Natural Language Processing, Fall 2005 course notes
- Manning and Schütze, Foundations of Statistical Natural Language Processing
- Deep Learning NLP tutorials by Richard Socher and Christopher Manning
- A Stanford course
Structure of this note: I plan to note down key concepts, mathematical constructs and explanations.
Basic tasks in NLP
- language modeling
- POS tagging
- named entity recognition
- sentiment analysis
- paraphrase detection
NLP is vast and varied. Here are a few basic tools in NLP:
- Sentence splitting: Identifying sentence boundaries in text
- Tokenization: Splitting a sentence into individual words
- Lemmatization: Converting a word to its root form. E.g. says, said, saying will all map to the root form – say
- Stemmer: Similar to a lemmatizer, but it stems a word rather than reducing it to the root form. E.g. laughed, laughing will stem to laugh. However, said, saying will map to sa – which is not particularly enlightening in terms of what "sa" means
- POS tagger: Tags a word with the Part of Speech – what is a noun, verb, preposition etc.
- Parser: Links words with POS tags to other words with POS tags. E.g. in "John ate an apple", John and apple are nouns linked by the verb "ate" (root form "eat"); John is the subject of the verb, and apple is the object.
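The simpler tools above can be sketched with toy implementations. The following is a minimal illustration only – not a production tokenizer or stemmer (real systems use far richer rules, e.g. the Porter stemmer); the suffix list here is a simplification chosen for illustration:

```python
import re

def tokenize(sentence):
    # Toy tokenizer: lowercase the text and keep alphabetic runs only.
    return re.findall(r"[A-Za-z]+", sentence.lower())

def toy_stem(word):
    # Crude stemmer sketch: strip a few common suffixes.
    # Real stemmers use many more rules and, as noted above,
    # can still produce non-words.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(tokenize("John ate an apple."))
print(toy_stem("laughing"), toy_stem("laughed"))
```

With these toy rules, "laughing" and "laughed" both stem to "laugh", matching the behaviour described above.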
If you are looking for the state of the art for these tools, check out Stanford CoreNLP, which includes most of them along with trained models to identify the above in a document. There is also an online demo, so you can try Stanford CoreNLP before downloading it and using it with your application.
NLP has several subfields. Here are a few of them:
- Machine Translation: Automatic Translation from one language to another
- Information Retrieval: Something like a search engine, that retrieves relevant information from a large set of documents based on a search query
- Information Extraction: Extract concepts and keywords – such as names of people, locations, times, synonyms etc.
- Deep learning has lately become a major approach in NLP, aiming at systems that understand a document more like a human does.
2017 Mar 8 Women in Data Science HKU
A text-classification project based on news articles:
- N-grams language model
- Chinese text segmentation – Jieba, a tool on GitHub
- Hidden Markov Model (HMM)
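The n-gram language model mentioned above can be sketched as follows – a minimal bigram model with add-k smoothing; the example sentences are made up purely for illustration:

```python
from collections import Counter

def train_bigram(sentences):
    # Count unigrams and bigrams over tokenized sentences,
    # padding each sentence with <s> / </s> boundary markers.
    uni, bi = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def bigram_prob(uni, bi, prev, word, vocab_size, k=1.0):
    # Add-k smoothed estimate of P(word | prev).
    return (bi[(prev, word)] + k) / (uni[prev] + k * vocab_size)

corpus = [["I", "like", "NLP"], ["I", "like", "ML"]]
uni, bi = train_bigram(corpus)
vocab_size = len(uni)  # includes the boundary markers
p = bigram_prob(uni, bi, "I", "like", vocab_size)
```

For Chinese text, the tokenized sentences would come from a segmenter such as Jieba rather than whitespace splitting.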
From the web – general review articles – pop articles
State of the art – "AI-hard" as a concept – strong AI – Alibaba iDST http://mp.weixin.qq.com/s/hYrNi17HiyW9kxokgLR_Ig – as of 2017 Apr 17
Stage I: 2014
- They started human-machine conversation work in 2014. The commercial motivation came from two trends: first, smartphones were popular, creating various use scenarios that mouse and touch input cannot satisfy; second, Internet services were expanding to all scenarios of everyday life, opening up the possibility of other interaction methods.
- In early 2014, human-machine conversation was at a primary stage.
- In China, on Taobao, few people actually use speech to search.
- In early 2014 many companies' speech-search products died.
Stage II: 2015 – 2016
Facts and current work
- Alibaba iDST worked with YUNOS on a speech-enabled interactive search assistant (something like Siri?). But then the problem: it competes for users' attention with every other app across all activities, so it has to know what each application is doing (buying air tickets, buying food, etc.) – a huge workload. So in 2015 they transitioned to building a platform exposing an API to every app. (No details here.)
- Main tasks now: 1) use machines to understand human language (intention + key information, e.g. booking a train ticket); 2) manage the interaction (keep asking until the key information is obtained); 3) open-ended conversations (Wikipedia, etc.); 4) casual chat (not so valuable in a commercial sense).
- Engineering: scalability – scaling to a particular domain and to a particular device. This was also discussed in that NLG paper where the authors built a commercialised natural language generation system for a hospital; from an engineering perspective, the challenges include 1) accuracy and 2) a scalable system.
- Science: general challenges in NLP include variation in users' language (what exactly do you want when you say "you want to go to the US"?), understanding context, robustness, and the shared common sense that makes conversation flow smoothly.
Stage III: 2016 – now (2017 Apr)
- The NLP engine moved from traditional machine-learning methods to deep learning: intention classification = CNN, slot filling = Bi-LSTM-CRF, context (model not specified), robustness = data augmentation.
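Slot filling with a Bi-LSTM-CRF ends with Viterbi decoding over the model's emission and transition scores. A minimal sketch of that decoding step follows; in a real system the scores come from the trained network, while here they are just numbers supplied by the caller:

```python
def viterbi(emissions, transitions):
    """Best-scoring label sequence for a linear-chain model.

    emissions: list of {label: score} dicts, one per token
    transitions: {(prev_label, label): score}, missing pairs score 0
    """
    labels = list(emissions[0])
    score = dict(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptr = {}, {}
        for lab in labels:
            prev = max(labels,
                       key=lambda p: score[p] + transitions.get((p, lab), 0.0))
            new_score[lab] = (score[prev]
                              + transitions.get((prev, lab), 0.0)
                              + emit[lab])
            ptr[lab] = prev
        backptrs.append(ptr)
        score = new_score
    # Backtrack from the best final label.
    best = max(labels, key=score.get)
    path = [best]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]
```

For "book a ticket to Beijing" with BIO-style labels such as O and B-DEST, high emission scores for B-DEST on "Beijing" would make the decoder tag only that token as the destination slot.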
- developed a task-flow language to solve the “understand context” problem
- For the per-domain API, developed an OpenDialogue with different specific tasks
- On smart appliances: something like Amazon Alexa
- Some scenarios where human-machine conversation can be useful: the Internet and cars – talk while you drive
- Open-domain NLP – dealing with scalability across specific domains
- Current human-machine interactions are mostly single-round conversations, or can take only limited context into account, so modelling with regard to specific context is important.
- Data-driven learning instead of user-defined modules in conversations.
- Caffe – a Berkeley package http://caffe.berkeleyvision.org
Conversation with Yuchen Zhang
Natural language – logical form – knowledge base
- Argmax …
- Question – operations on the data source
- How far can it go? An open problem; any bit of progress can transfer directly to industry
- Research community – cannot get the (industry) data – so abstract the problem – e.g. a Wikipedia question-answering dataset
Personality – papers on sentiment analysis
Simplifying the problem – why? Initially it was not because sentiment analysis was so useful; e.g. for an Amazon review you can just look at the star rating, no need to read the text.
Word-extraction analysis is comparatively poor.
In reality it does not transfer particularly well; a community has formed around this, and sentiment analysis no longer cares much about semantic analysis.
data coding –
Information extraction – from a paragraph, extract relations between entities A and B
- Open source – Open IE
Summarization – still an open question and an active research area
- Research question
- Compressing long paragraphs into short paragraphs
- Neural models
NLP – it cannot provide full understanding, but it can provide basic tools, e.g. stock-price prediction from news (is Microsoft stock going up?), or reviews – user experience as the output
- Open-source tools cannot do this
- What can they do? Extract features
- Sentiment analysis – emotions, positive or negative
- Use the feature vector for prediction
- Keywords – what a topic model is: first train, unsupervised, on e.g. 10,000 news articles, grouping those that discuss the same thing
- Sentiment analysis – Amazon product reviews
- End-to-end tasks; summarization is an active research area
- Baseline: the first line (lead sentence)
- Language independent – Chinese corpora
- Question answering
- Reading comprehension
- Sub-problems: word embeddings – news vs. literature corpora
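Word embeddings are usually compared by cosine similarity. A minimal sketch follows; the 3-dimensional vectors are made up purely for illustration, whereas real learned embeddings (word2vec, GloVe) are typically 100-300 dimensional:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors purely for illustration.
vecs = {
    "news":    [0.9, 0.1, 0.0],
    "article": [0.8, 0.2, 0.1],
    "novel":   [0.1, 0.9, 0.2],
}
```

With vectors like these, "news" comes out closer to "article" than to "novel", which is the kind of corpus-dependent similarity the note alludes to.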
A fair number of professors in linguistics (linguists) use computational methods
Cho turned to working on datasets
Cho is famous for neural machine translation
- Deep learning class paper
- Attention model
- Autonomous driving
- Deep learning + NLP
- Semantic parsing – search engines – rule-based methods are more conservative – Google in search
- Question answering – dialogue systems – a hot area – AI assistants
- Fundamental modules – word embeddings, parsing, POS tagging, dependencies as features used in applications