Monthly Archives: December 2017

I will probably applying for psychology grad programs. And I will write my first chapter of literature like this

/This post described 1) my past few days’ work since Christmas ( not very much) and 2) my change of future plans and reasons and 3) my progress on MPhil thesis literature review chapter.

My last few days work

After Christmas break productivity decreased (from ~ 11 hour per day to ~ 8 then to ~ 5). Hypothesis seems to be validated that a structured environment and external accountability is important.

After coming back to my old office in Hong Kong, my productivity immediately decreased at least 70%. Truly amazing. Possible explanations: 1) jet lag. I woke up at 4am the day before, slept at 4pm, then woke up 10 pm that day. In-between I experienced periods of alternating hallucination (no kidding) and absent-mindedness. I sing along Youtube videos for 2 hours early this morning (3am – 5am). Weird.

Explanation 2) this environment arose certain negative emotions associated with past experience. Whenever I sit at my old desk I found it hard to concentrate, falling back to old patterns. In a new place where I had not worked before I felt much better.

My changed future plans and why 

On another note, to the readers of this blog, I very likely will apply for PhD programs in psychology next year. Continuing working on my master’s thesis (will talk about it shortly) makes me realize my past struggle started from misaligned interest with my supervisor, my department, ethnography, and sociology in general.

The question I was interested in was in fact studied more by psychologists, and I am much satisfied with their approach to the question (belief formation, political attitude formation, etc.)

This is the reason I struggled for so long in my study. I first read about social movement when I was at UW-Madison. Then in my first few months in MPhil in 2015, I found some literature in public opinion research close to what I would like to know, although they are not directly on protests. Protest scholars on the other hand are not that interested in opinions per se – they are interested in opinions (grievances) as an independent variable in explaining emergence of protests and participation.

That’s the first time I was confused. I treated the literature review process as finding answers, but I found that the answer did not exist. Mistake 1: For research, this is supposed to be a good thing (research gap), but I did not realize at that time. Because of the habit of an undergraduate, I just thought if I did not find an answer that meant I didn’t work hard! (Two years later, in 2017 I wrote to Pam Oliver, then a social movement mail list she was in, then  John Josh asking them for literature – and indeed there is none from the angle I would like to know. )

Back to Nov 2015. With supervisor’s pointer I went to methodological literature (Grounded theory, Methods of Discovery, two books by Howard Becker – find them not that helpful). I dislike my ethnography class from week 1. Complained to supervisors but decided to hang on to it (while secretly hate it). Then during Chinese New Year 2016 I read about  necessarity of narratives / contingencies , which lead to some reading in the sociology of science – combined with a previous RA task in tacit knowledge— basically literature justifying why interpretative sociology / narrative methods in historical sociology are useful. I did not like the arguments, becasue I found the concepts vague and the logic shabby. Mistake 2: did not take action immediately due to fear of authority.

Then it’s March 2016 or so. I went to sit in at Xiang Biao’s anthropology class. Read some beautiful anthropology, but they are not relevant to my research. Foucault write about biopolitics, power / knowledge which led to some efforts in vain. Probably around the same time still trying to make sense of ethnography, so researched Alice Goffman’s On the Run and Sida Liu’s doctoral thesis. Not much progress on thesis.

In the mean time, read into cultural sociology (cognitive sociology?) but did not find a fit. Cultural sociology concerns culture / action, but 1) I’m not interested in culture and 2) I’m not interested in action.

In the meantime I read Andreas Glaeser’s book political epistemics recommended by my supervisor. We seemed to agree using this framework. This books reads further lead to phenomenology and social otology. Nobody cites the book back then. In 2017 my supervisor published a paper using this book. Now I think back, I do not really like or believe the theory. I tried to read so many times, and I will use the theory in my thesis for graduateion. But if I’m being uttly honest, I don’t understand why the theory is good and I do not feel comfortable using this theory. I don’t know why.

By this time I’m basically disillusioned and disenchanted with sociology and academia. Around May or Jun I began seriously considering a career change. Mistake 3: In facing a problem, not thinking about solutions but want to run away!

For around 12 months I did not seriously work on my thesis. In Nov and Dec I did a whole round of data coding, but not much writing. Thinking back, this is typical procrastination. In 2017 I started to learn about machine learning and coding. Read briefly into cognitive sociology, ideology… Late April to July, internship. Mistake 4: poor time management skills and no priority.

Reflecting this experience. The root problem is my literature review always felt uncomfortable, so I can not formulate a research question using any of the directions I had searched. I felt uncomfortable because belief formation is a problem in social psychology not sociology, which I realized pretty late.

Thus I will proabably applying for grad schools in psychology next year.

I will write my literature review draft this way

The literature review will have two chapters: why do people have different understandings of the same social movement, and how do they form these understandings.

Why question is tricky. I imagine myself talking to scholars in sociology on protests. Thus I should cite them. But they have not studied the problem directly. The closest I can find is 1) relative deprivation theory and 2) framing and 3) social identity and 4) system justification. The end goal of all these theories is explain participation. But I can use them to say, because some have feelings of relative deprivation while others’ don’t.

Then a natural question is, what leads to perceptions of relative deprivation? Here, sociology stops and is fine with describing the understandings themselves (meaning-making). Psychologists answer this, one theory is frustraction-aggression, the other is cognitive dissonance. These two theories are what I really wanted to know.

I will do the same for framing, social identity and system justification. The part of them that describes psychological process is useful, while the other part on how psychological process leads to participation is not useful. Lesson 1: how to selectively use a theory when it is not a good fit but it’s the only thing I can find. (I might not be doing this right though.)

How do they form these understandings? This part will likely be in political socialization, pursuation, attitude change. Still quite broad. Lesson 2: When reading about a new field for a question, do not be taken away by their arguments. One paper leads to another and this needs to be controled.

I developed a standard workflow that I currently find helpful:

  • Read one highly cited document. Any paper contain the key words. Better be a review paper.
  • Use the Harvard thesis guide literature review template, trace the theory history. Note key sources.
  • Write down 5 ~ 7 key sources biblography information.
  • Download these papers. Do not download anything more than these 5 ~ 7. Keep the most important papers! (Dr. Tian had once taught me to find the citation list. Use these with highest citations.)
  • Take notes and read these 5 ~ 7 papers using the Harvard thesis guide template.
  • Print out the reading notes. Select the top priority papers again.

This is like preventing over-fitting in machine learning. Because both deals with the trade-off between generalziation and spceific problems. The best strategy is thus early stopping – leave some of the specific data points unvisited for better generalization.


Updates 2018 Mar 13 : I gave up with this idea after taking a course in psychology. Reason is that I am not really interested in their debates.

Debugging larger pyspark ML programs

This post describes my learning experience in developing larger programs, especially those :

  • Takes a long time to run – due to big data sets and computationally intensive algorithms
  • Requires developing locally and on HPC. That is cannot be solved in a Python IDE

The take away is:

  • To save time, try writing scripts in one place only.
  • Do not develop interactive and then paste everything to an editor!

The problem

I found myself spending excessive time (~ 4 days) developing a program that should be simple in its logic. Basically, I just need to call Pyspark API functions to do classification on 20Newsgroup dataset. It should be a simple program!

How did I spend my time?

  • First day, I found a script doing a similar task. When I tried to use it I came into the following problems:
    • Read files on HDFS. Spark only works with files on HDFS not local file system. This mistakes took me some time, as I thought the problem was with syntax in reading nested folders.
    • The script does not clean text – remove headers, numbers, punctuations. To do this, I had to understand the .trans function. This also works differently in Python2 and Python3, which took me some time to realize.
    • The script use Pyspark SQL, thus took some time to learn DataFrame as well.
  • By day 2 the data is read into Spark DataFrame.
    • I then had problem calling functions from MLLib, because I didn’t realize ML and MLlib are two libraries and have different data structures. I then came into problem when using an MLlib function to ML data.
    • I also tried to convert the data structure back and forth, from RDD to Labeled Point to ML.
    • To inspect what is in the data, I also spent time calling the wrong functions, or transform everything into an RDD and call map functions.
  • By day 3 I intend to use Spark-submit on HPC. The main task is to learn to use editor.
    • Because someone told me I should be using editor instead of debugging interactively, or else I cannot see code structure, I began to learn vim. That took a morning or so (!).
  • By day 4, I am trying to clean up the code and write functions.
    • This creates another level of complexity. One of the bug is I forget to update the function argument


  • I type every line of code three to four times. First on my workstation, then on the server’s pyspark shell. Finally I copy the code into a script. This not only creates space for mistakes, but also is inefficient.
    • I did this because I am not comfortable writing scripts on HPC yet.
    • Also because I am not comfortable debugging with a script and an editor, without using IDE.
  • I input every line of code at least three to four times.
    • First in interactive mode.
    • Then, copying the line into my editor. Since the project spans across days, every time I start again I need to re-read the files!
    • Also, I then creates a file on a small dataset on HPC and run that.
    • After that, I run the file again on a larger dataset.
  • Since I am debugging at multiple places, I need to do version control as well.
    • I remember scp back and forth sending files along. Whenever I edited something, I remove the older version of the program at the other place.
  • I am not familiar with data structure and functions on Pyspark. This also leads to waste of time.
  • Interactive mode of Spark is slow. I once forgot to cut the dataset smaller, and running one command (e.g. transform) on the whole dataset takes 10 minues! If there are 20 commands like this that would be ~3 hours.


  • The above problems can be summarized as :
    • I need to write script on multiple places: server and local.
      • Solution: learn to use editor in the server. e.g. Vim
    • Submit program interactively vs in batch. I do not know how to debug with a script so I have to use interactive mode to make sure I know what I am doing. But debugging interactively means double the amount of typing because every command needs to go through the terminal and the editor. That’s also running the program twice.
      • Solution: learn to run the program from the script. The con of this is need to laod data multiple times. Also, use a smaller dataset so data loading will not be a pain. 
    • Time spent on start-up. The data file is huge, thus re-loading it takes time. If every time load the dataset is 3 minutes, load it 20 times would be 60 minutes.
      • Solution: use a smaller sample dataset for developing.

Moral: try to write only one set of programs in a single place.

  • For pyspark, I can only use HPC. So I just write on HPC.
  • Pyspark also cannot use pdb.
  • For Python, I should test everything locally first. There are two final programs
    • A script that can run locally and on HPC.
    • A script to be submitted to the cluster.

Debugging philosophy

  • “Bugs” are not bugs but errors. The responsibility lies with the programmer. A program that has errors is simple wrong, because 1) the programmer is not really familiar with the rules and grammars of the library. Used the wrong data structure, used the wrong function call. etc.
  • Debugging is a learning experience.  Why does debugging takes lots of time? Because the programmer is learning something new, thus need time to try and make mistakes. There seems to be a wishful thinking that a perfect program will magically began working by itself, and thus time will be well spent. It wont’, because learning always takes time!

Pyspark ML vs MLLib

A post that summarizes main difference between Pyspakr ML and MLlib. This is based on Spark 2.2.0 and Python 3.

Data Structure

  • pyspark.mllib is the older library for machine learning. It can only use RDD labeled point. But then more features than ML
  • newer library for machine learning. It can only use sql DataFrame. But easier to construct a ML pipeline.
  • It is recommended to use DataFrame APIs (i.e. ML) because it is more stable. Also this is newer.


The two libraries seem to be similar in terms of feature selection APIs.


  • implements Pipeline workflow, which is based on dataFrame and can be used to “quickly assemble and configure practical machine learning pipelines”.

Classification algorithms provided by MLlib:

Classification algorithm by ML:

Seems that ML has multilayerPerceptron which is not in MLLib.


  • pyspark ml implements while pyspark.mllib does not
  • However Pyspark ML’s evaluation function is difficult to use.  The docs are not clear. Thus, changing back to MLlib to use this.

When to use which?

  • Efficiency: MLLib uses Labelpoint RDD, while ML uses structured DataFrame. Thus, “if the data is already structured DataFrames, then ML confer some performance benefits over RDD’s, this is apparently drastic as the complexity of your operations increase.”
  • Resource: DataFrames consume far less memory when caching than RDDs. Thus lower level operations RDD’s are great but high level operations, viewing and typing with other API’s use DataFrames. (Above all quote Stackoverflow user Grr)
  • It is recommended to use ML and DataFrame. However if want to use Gridsearch, seems still need to revert to Labeled point!

Reference :

A stackoverflow question:

MLlib doc:

ML doc:

Comparison of data structures:


Clean Text in Python3

A pain in the ass. This post summarizes “best” approaches to clean text data in Python3.

It will not cover depreciated syntax in Python2. For example string.maketans has a different usage in python2 — it is not discussed here.

Is that a string or a Unicode?

Reference here.

When you see a string in python2, there are two possibilities:

  • ASCII strings: Every character in a string is a byte. Look up the hexadecimal value in the ASCII table.
  • Unicode strings: every character in a string is one or more than one byte. Look up the hexiadecimal value in the Unicode table – there are many of them. The most popular one is UTF-8.
    • Example: 猫 is represened in three bytes in Unicode. when Python2 reads this, it gots it wrong – it thinks there are 5 characters but there is in fact just three… As python2 use ascii to decode.
      • To produce the correct representation, use x.decode(‘utf-8’)

String vs sequence of bytes in Python2

  • String is a sequence of Unicode codepoints… they are abstract concepts and cannot be stored on disk. They are a sequence of characters.
  • bytes are actual numbers… they can be stored on disk.
  • Anything has to be mapped to a byte to be stored on a computer
  • To map a codepoint to a byte, use Unicode encoding
  • To convert a byte to a string, use decoding .

String vs sequences of bytes in Python3

  • In sum, in Python 3, str represents a Unicode string, while the bytes type is a sequence of bytes. This naming scheme makes a lot more intuitive sense.

Encode vs Decode

  • To represent a unicode string as a string of bytes is known as encoding

Remove punctuation

The best answer (I think) from Stackoverflow :

import string

# Thanks to Martijn Pieters for this improved version

# This uses the 3-argument version of str.maketrans
# with arguments (x, y, z) where 'x' and 'y'
# must be equal-length strings and characters in 'x'
# are replaced by characters in 'y'. 'z'
# is a string (string.punctuation here)
# where each character in the string is mapped
# to None
translator = str.maketrans('', '', string.punctuation)

# This is an alternative that creates a dictionary mapping
# of every character from string.punctuation to None (this will
# also work)
#translator = str.maketrans(dict.fromkeys(string.punctuation))

s = 'string with "punctuation" inside of it! Does this work? I hope so.'

# pass the translator to the string's translate method.

The code above removes punctuation and just delete it.

Replace punctuation with a blank

This method uses regular expression… I find it to be better than using a translator!

import re
re.sub(r'[^\w]', ' ', text)

Dealing with all kinds of blanks

Again borrowing from this awesome stackoverflow post…

# Remove leading and ending spaces, use str.strip():
sentence = ' hello  apple'
>>>'hello  apple'

# Remove all spaces, use str.replace():

sentence = ' hello  apple'
sentence.replace(" ", "")
>>> 'helloapple'

# Remove duplicated spaces, use str.split(), then join the words together

sentence = ' hello  apple'
" ".join(sentence.split())
>>>'hello apple'

Summary of workflow:

  1. Decide if that’s a string or a unicode, or a sequence of bytes. This will decide whether need to encode or not. Ultimately we want a string, i.e. “str” type.
  2. Import re module. MHOP this is the most convenient one to remove punctuations.
    1. re.sub can replace all numbers and punctuations with blanks. Thus, will not join two unrelated words together if they are connected by punctuations.
  3. Use s.lower() to change a string to lower case… Note here s is a string! not a unicode object.
  4. Use s.strip() to strip out the excessive blanks!
  5. Use . join to join the list of words together with only one blank between them, if needed.

Machine Learning in Spark

This post describes Spark’s architecture for building and running machine learning algos.

Machine learning algorithms can be found by Spark’s MLlib API.

The data structure used are DataFrame. With regard the the last post, this means we need to import spark SQL context to use as well.

There are five key concepts:

  • DataFrame: the data structure. Each column can be text, labels, features, or numbers…
  • Transformer: an object to transform a DataFrame to another one. A machine learning algo is represented as a transformer that takes an input DataFrame with features and outputs a DataFrame with the predictions. A transformer is called by a transform() method on a DataFrame.
  • Estimator: an algorithm can be fit to a DataFrame and produces a transformer.  e.g. a ML algo can be fit to the training set and produce a model. An estimator is called by a fit() method on a DataFrame.
  • Pipeline: a collection of transformers and estimators that eventually consists of a trained model.
  • Parameters: parameters of the transformer and the estimator.

Now go over the example on Spark Website for understanding of how to use machine learning Pipeline.

Step one: read in data 

First create the spark SQL context, and use that to create a SparkDataFrame. This assumes a spark context is already created as sc.

sqlContext = SQLContext(sc)

training = sqlContext.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

The results of this step:
| id|            text|label|
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|


Step two: preprocess + training = ML pipeline 

Now, preprocess data by constructing a ML pipeline.

  • Tokenizer here tokenize every text into words ….
  • hashingTF: this one is more tricky. As per the document,

Our implementation of term frequency utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table. The default feature dimension is 220=1,048,576

My understanding on hashTF (hash term frequency?) is that :

  • Hash functions map data of arbitrary size to a fixed size. It maps every word to a unique position, hopefully. But there will be collisions, which we ignore…
  • Next we just count the number of items at each position.
  • This saves the step of storing a map of word – index. Instead we just need to store a function.

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model =

Now let’s go over the output of each step. First, check the results of the tokenizer….

p1 = Pipeline(stages = [tokenizer])

| id|            text|label|               words|
|  0| a b c d e spark|  1.0|[a, b, c, d, e, s...|
|  1|             b d|  0.0|              [b, d]|
|  2|     spark f g h|  1.0|    [spark, f, g, h]|
|  3|hadoop mapreduce|  0.0| [hadoop, mapreduce]|

Then, check the results of the HashingTF….

hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
p2 = Pipeline(stages = [tokenizer, hashingTF])

| id|            text|label|               words|            features|
|  0| a b c d e spark|  1.0|[a, b, c, d, e, s...|(262144,[97,98,99...|
|  1|             b d|  0.0|              [b, d]|(262144,[98,100],...|
|  2|     spark f g h|  1.0|    [spark, f, g, h]|(262144,[102,103,...|
|  3|hadoop mapreduce|  0.0| [hadoop, mapreduce]|(262144,[134181,1...|

Third, check the output of the logistic regression model:

lr = LogisticRegression(maxIter=10, regParam=0.001)
p3 = Pipeline(stages=[tokenizer, hashingTF, lr])
['id', 'text', 'label', 'words', 'features', 'rawPrediction', 'probability', 'prediction']

In our original data, we only have id, text and label.

  • ‘words’ are added by the tokenizer,
  • ‘features’ are added by the hashingTF,
  • ‘rawPrediction’ is created by logistic regression. It is the “score” of each class labels. It varies algo by algo. For example, in a neural network it can be interpreted as the last layer passed to the softmax.
  • prediction: the predicted class label. the argmax of raw prediction.
  • probability : the “real” probability given raw prediction… Looks like passing rawprediction to a softmax for multi-class classification problems.

Step three: making predictions

After building the model now we can start to make predictions by building and testing on a toy test set.

First build test set — it is similar to the training set, except without labels

test = sqlContext.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])
| id|              text|
|  4|       spark i j k|
|  5|             l m n|
|  6|spark hadoop spark|
|  7|     apache hadoop|

Make predictions…

prediction = model.transform(test)

DataFrame[id: bigint, text: string, words: array<string>, features: vector, rawPrediction: vector, probability: vector, prediction: double]

Here prediction is of the same format as the transformed train dataset. The reason is, fitting machine learning algorithm is equivalent to transform the test data to add other columns…

Finally, printing out the results we care: prediction

selected ="id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

(4, spark i j k) --> prob=[0.159640773879,0.840359226121], prediction=1.000000  
(5, l m n) --> prob=[0.837832568548,0.162167431452], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.0692663313298,0.93073366867], prediction=1.000000
(7, apache hadoop) --> prob=[0.982157533344,0.0178424666556], prediction=0.000000

Question: what will the output be if some of the classes are not in the training set?


Spark SQL quick intro

This simple post describes what spark SQL is and how to use it. Basic operations concepts are not difficult to grasp.

What is SparkSQL?

  • Spark’s interface to work with structured and semistructured data.

How to use SparkSQL?

  • Use it inside a Spark application. In pySpark this translates to create a SQLContext based on SparkContext. Code is :
    sqlContext = SQLContext(sc)

This is initialization.

What are the objects in Spark SQL?

  • DataFrame (previously SchemaRDD)

What is the difference between a DataFrame and a regular RDD?

  • DataFrame is also a RDD. Thus, can use map and reduce methods.
  • Elements in DataFrame are Rows.
    • A row is simply a fixed length of array, wrapped by some more methods.
    • A row represent a record.
    • A row knows its schema.
    • DataFrame has information of the schema — the structure of the data fields.
    • A DataFrame provides new methods to run SQL queries on it.

How to run query on SparkSQL?

  • First initialize sql context.
  • Then create DataFrame.
  • Then register the DataFrame as a temporary table. Code in Python:
    • input = sqlcontext.jsonFile(inputFile)
      toptweets = sqlcontext.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
  • Register it as table — seems to have new versions in Spark 2.0

Here is a good guide to working with SparkSQL using 20newsgroup data:


Recommendation System: Matrix Factorization

This post explains how to build a recommendation system based on matrix factorization. However in practice, the engineering aspects are more challenging when the dataset is huge – these are not covered here.

Recommendation system:

The goal is to recommend an item to a user based on this user’s previous ratings on the item. Such systems are now used everywhere.

The formulation of this problem is to build a utility matrix, in which:

  • Each row represent a user
  • Each column represent an item
  • The elements in that matrix can be 1) user rating on this item or 2) frequency user visited this item (e.g. in the context of Spotify)

The problem is such matrix is usually sparse. Not every user has rated every item. Thus the goal is to fill in the blanks.

After filling in the blanks, we can recommend the items with high ratings but the user has not yet reviewed, to the user.


Two sets of algorithms:

  • Collaborative filtering — use clustering to find similar rows/columns, then compute mean
  • Matrix Factorization  — as the name indicates.

We only talk about matrix factorization here.

The utility matrix can be decomposed as :

U = \Sigma \cdot V^{T}

If U is a m * n matrix, Now $\Sigma$ is a m * k matrix, and V is a n * k matrix.

We can interpret U as :

  • each column represent a latent factor
  • each row represent a user
  • each element represents the users’ preference to that latent factor

We can interpret V as :

  • each column represent a latent factor
  • each row represent an item
  • each element represents the latent factor’s contribution to that item in rating

This looks complex but in fact is really simple. We basically interpret the rating of user i on item j, as the summarization of user i’s ratings on each latent factor!

An example:

  • I rate the book Little Prince a 3.6/5.
  • I rate, because Little Prince is 20% about youth and 80% about romance.
  • I give the youth factor 2 score.
  • I give the romance factor 4 score.
  • My total score of little prince is then 2 * 0.2 + 4 * 0.8 = 3.6.

Compute matrix factorization

The above is about a theoretical, fully filled utility matrix.

Now our goal is to find such a matrix, by finding corresponding \Sigma and V.

We can view this as a supervised learning problem.

The observed value is the observed user-item utility matrix.

The theoretical value is \Sigma \cdot V.

The cost function is square loss, for every i and j in that matrix.

Our goal is to minimize the empirical cost function – the sum of all square loss. We can of course add regularization terms to this.

How to achieve this? Using iteration.

  • First randomly initialize  \Sigma and V.
  • Then choose one element (parameter) in \Sigma or V. Set this one to be the argmin of the empirical cost function
  • Change to another parameter and iterate the process, until convergence.

Now the following is to be noticed:

  • Proprocessing of matrix: since different users have different rating scale, can normalize the utilitity matrix by row, by column.
  • How to choose the order in choosing parameters?
    • Can randomly choose!
    • Can choose by column!
  • Overfit:
    • Early stop.

Till now I have described this algorithm. It turns out to be pretty simple! Took me 15 minutes to understand and 18 minutes to write this post. Next time when learning techniques, should aim for the same time span.


The Stanford textbook :

(Not really)Solving a memory limit problem

Through a few months’ study I noticed the challenges of big data is less of an intellectual one than an engineering one. There are algorithms designed for actually sampling huge data streams (e.g. this Stanford course), but still in my practice the biggest challenges are usually long running time, out of resource problems, and thus difficulty in debugging, and prolonged development circle.

Frustrations arise from one minor problem, which can hinder days’ of efforts.

The key is to break the process down step by step and identifying the source problem.

This posts describes such a case and my solution to it.


I want to train an hierarchical GRU model with attention on 20000 documents of medical summary. I use the model for classification on the test set.

My implementation of attention is a fixed linear layer. Due to attention mechanism, the number of hidden states in each level of GRU should be the same for the training set and the test set.

Since number of hidden states corresponds to the max number of sentences in all documents, and the max number of tokens in all sentences, an easier way is to truncate all documents and pad them into the same length of sentences and tokens.

To ensure training set and test set have the same shape, I need to read them all in memory to truncate together.

My group mate already break the files down into smaller chunks: Full, Med, and Tiny for running the program and testing.

The problem is since the files are huge, I cannot read them all into memory. The data are stored in pickle files.

Possible solutions

  1. Increase the memory limit on HPC
  2. Read pickle file incrementally. But still, eventually I need them all together in memory.
  3. Change a different storage format.


  • The problem lies in the format of source file. It’s pickle file, thus difficult to read line by line.
  • Also, the problem lie in running python interactively on server. It seems that server allocate a small memory limits for interactive mode. However, for batch submission to the computing nodes, the memory limits are higher and adjustable. See here.

Possible solutions phrase 2:

  • Now I have two sets of memory: on login node, and on cluster (?)
  • The original problem becomes : I cannot read big files on login node interactively
  • However,
    • I can read small files on login node interactively, thus can debug and test
    • I can request for more memory on cluster. Thus I can read big files from cluster, and break them down into smaller chunks
    • I can then read the smaller chunks from login node interactive

Problem is then solved.

But the solution to this problem is specific, in that:

  • I can break the source file down in later programs. Thus, big files are not really necessary for my program
  • I in fact have access to larger memory.

I have not solved the problem when there is a real memory limit and hard file size specification. For that, might need to store in different format (pickles are not meant to be read line by line) or design different algorithms.

pyspark combineByKey explained

This post explains how to use the function “combineByKey”, as the pyspark document does not seem very clear on this.

When to use combineByKey?

  • You have an RDD of (key, value) pairs – a paired RDD
    • In Python, such an RDD is constructed by creating elements of tuples
  • You want to aggregate the values based on Key

Why use combineByKey, instead of aggregateByKey or groupByKey, or reduceByKey?

  • groupByKey is computationally expensive. I once had a document of 50000+ elements, and the sheer computational resource made my program un-runnable.
  • aggregateByKey – the docs are not very clear. I did not find resource on how to use them.
  • reduceByKey — is powered by combineByKey.  so the latter is the more generic one.

How to use combineByKey?

The document goes like this:

# create an RDD
x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
def to_list(a):
return [a]

def append(a, b):
return a

def extend(a, b):
return a

sorted(x.combineByKey(to_list, append, extend).collect())

>>> [('a', [1, 2]), ('b', [1])]

Here is how the document goes for combineByKey:

combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None, partitionFunc=)
Generic function to combine the elements for each key using a custom set of aggregation functions.

Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C.

Users provide three functions:

  • createCombiner, which turns a V into a C (e.g., creates a one-element list)
  • mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
  • mergeCombiners, to combine two C’s into a single one (e.g., merges the lists)

To avoid memory allocation, both mergeValue and mergeCombiners are allowed to modify and return their first argument instead of creating a new C.

In addition, users can control the partitioning of the output RDD.

What does that mean? Basically the combineByKey functions takes three arugments, each is a function – or an operation – on the elements of an RDD.

In this example, the keys are ‘a’ and ‘b’. The values are ints. The task is to combine the ints to a list, for each key.

  1. The first function is createCombiner — which turn a value in the original key-value pair, to a combined type. In this example, the combined type is a list. This function will takes an int and returns a list.
  2. The second function is mergeValue — which adds a new int into the list. This is to combine value with combiners. — thus uses append.
  3. The third function is mergeCombiners — which merges two combiners. Thus in this example is to merge two lists  — thus uses extend.





Machine learning: soft slassifiers and ROC

This post explains the concept of soft classifiers (in its simple form) and offers examples in sklearn.

Soft classifiers

In classification problems, hard classifiers gives the exact predicted class.

But soft classifiers gives a probability estimation over all classes. Prediction can then be made using a threshold. This also gives the possibility of multi-label classifications.

Code in sklearn:

This is a sample program in python using a KNN classier.

import pandas

url = ""
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

#  change to binary classification
new_class = np.random.randint(0, 2, len(dataset), dtype='l')
dataset['class'] = new_class


Now build the classifiers.

from sklearn.neighbors import KNeighborsClassifier

# Make predictions on validation dataset
knn = KNeighborsClassifier(), Y_train)
predictions = knn.predict(X_validation) # hard 
predictions_prob = knn.predict_proba(X_validation) # soft


Results of hard predictor.


array([ 1.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,
        0.,  1.,  1.,  1.])


Results of soft classifier:

array([[ 0.2,  0.8],
       [ 0.2,  0.8],
       [ 0.6,  0.4],
       [ 0.6,  0.4],
       [ 0.4,  0.6],
       [ 0.6,  0.4],
       [ 0.6,  0.4],
       [ 0.4,  0.6],
       [ 0.8,  0.2],
       [ 0.6,  0.4],
       [ 0.8,  0.2],
       [ 0.6,  0.4],
       [ 0.8,  0.2],
       [ 0.4,  0.6],
       [ 0.2,  0.8],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.6,  0.4],
       [ 0.4,  0.6],
       [ 0.6,  0.4],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.6,  0.4],
       [ 0.6,  0.4],
       [ 0.8,  0.2],
       [ 0.4,  0.6],
       [ 0.4,  0.6],
       [ 0.2,  0.8]])


This will have results on the ROC curve produced. For the hard classifier, the ROC is linear. For the soft classifier, the ROC is continuous… Note the input of soft classifier: pred = predictions_prob[:, 1] === this generates the probability of the positive class and input to ROC function.


import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import roc_curve, auc

plt.figure(figsize = (12, 8))

truth = Y_validation
pred = predictions
fpr, tpr, thresholds = roc_curve(truth, pred)
roc_auc = auc(fpr, tpr)
c = (np.random.rand(), np.random.rand(), np.random.rand())
plt.plot(fpr, tpr, color=c, label= 'HARD'+' (AUC = %0.2f)' % roc_auc)

truth = Y_validation
pred = predictions_prob[:, 1]
fpr, tpr, thresholds = roc_curve(truth, pred)
roc_auc = auc(fpr, tpr)
c = (np.random.rand(), np.random.rand(), np.random.rand())
plt.plot(fpr, tpr, color=c, label= 'SOFT'+' (AUC = %0.2f)' % roc_auc)

plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.legend(loc="lower right")


Why is this the case?

This has to do with the input parameters of sklearn. See doc:

sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)


y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).


But still, why is the ROC curve continuous? This has to do with what is ROC curve?

ROC = receiver operating characteristics

For an ROC graph, the x-axis is the false positive rate, the y-axis is the true positive rate.

Each point corresponding to a particular classification strategy.

(0, 0) = classify all instances to negative.

(1, 1) = classifying all instances to positive. ??

A ranking model produces a set of points in ROC space. Each point correspond to the result of a threshold – each threshold produces a different point in ROC. Thus this soft classifier in effect equals to (many) strategies.

Final plot:

Update 20180428:

sklearn.metrics.roc_curve has an argument “drop_intermediate=True”, which will automatically adjust the number of thresholds in the curve. This will result in different number of points in a plot. When comparing results from different classifiers, have this point in mind!