Category Archives: Techy

Doctor AI: Using deep learning to automatically diagnoses diseases

(This post is based on a classmate’s and mine project in our first semester at NYU. I’ve been trying to write it into a blog post for three months and today finally decided to remove the task out of my to-do list…)

Everyone have heard about AlphoGo, the hallmark of current artificial intelligence (AI) research that can beat human Go champions.

But how about using AI to automatically diagnoses diseases? Or, at least, automatically narrows down the range of possible diseases and recommend suitable medical consultants? Imagine how much burden this will take off from the always over-worked doctors. Imagine how much benefits this will do to our health care system!

In this blog, we present a machine learning project that automate the identification of relevant diseases, by using “intake clinical notes” to predict the final diagnosis code. From this code, we can then identity which consultants are relevant.

#### Data Source
As with any other current machine learning systems (or really, even with human learning), an algorithm needs to see enough data before it can learn to make smart decisions.

So for our algorithm, we feed it with a public available medical dataset called MIMIC (‘Medical Information Mart for Intensive Care’). This dataset is composed by a group of MIT researchers, and comprises over 40,000 hospital admission notes for more than 30,000 adults and more than 7,000 neonates who are admitted to critical care unites at one hospital.

(It’s freely available at : in the form of a SQL database if you would like to take a look!)

The complete dataset has rich information, including vital signs, medications, laboratory measurements, observations and notes taken by care providers. However, for our purpose, we only used two variables in the dataset:

– medical discharge summaries for each patients. A discharge summary is a letter written by the
doctor caring for a patient in hospital. It contains important information about the patients
hospital visit, including why they came into hospital, the results of any tests they had, the treatment they received, any changes to their medication, etc.

– the ICD-9 diagnoses for patients. “ICD-9” stands for “International Classification of Diseases, Ninth Revision”. Basically, it is a three digit code for a disease. For example, 428 stands for “heart failure”.

#### Data Transformation / Feature Engineering
One interesting thing about this project is the model needs to learn from text. To tackle this challenge, we used techniques from natural language processing (NLP) to transform text into machine readable inputs – numbers. In the jargons of machine learning, this is also called “feature engineering”.

How to transform text into numbers? We first tried the following methods for data transformation.

* tokenize the input text. This means be break down sentences into sequences of words. For example, ‘She loves dog’ becomes a list of words, (‘She’, ‘loves’, ‘dog’). We used a python library NLTK (“Natural Language Toolkit”) to achieve this.

* turned all text into lower-cases, because we would like to prevent ‘She’ and ‘she’ being recognized as two different words.

* converted all numbers to the generic character \*, so that ‘2017-03-13’ becomes ‘\*\*\*\*-\*\*-\*\*’. This is because evert discharge summary contains at least one date, but numbers are not so useful in our prediction so we just mask the information.

After these steps, we will have every discharge summery turned into a list of words. The total corpus will contain a list of list, with every list being a discharge summary. Not bad!

Our next step would be to turn the lists of words into lists of numbers. For this, the method is called one-hot encoding, in that we construct a dictionary consisting of most common words in the **corpus**, and then replace every word in by its position in the dictionary. Easy!

For example, for this dataset we did the following:

* Construct a vocabulary consisting of words that appeared at least 5 times in the dataset. We basically discard the rare words. After this step, we have a vocabulary of 38,119 unique words.
* Use these 38,119 unique words to index each word in individual list of words. For example, if “the” appeared in the first position, it’s going to be replaced by the number ‘1’.
* After creating the features, we transform each piece of text summary into numbers, by assigning each word (token) with its count in the document.

For a simple example, consider the text: ”harry had had enough”. A Counter object on the tokenized text would yield (’had’: 2, ’harry’: 1, ’enough’: 1). So, this sentence will be transformed into a vector (2, 1, 1).

We also did other preprocess techniques to handle challenges in this dataset, including rate words, out-of-vocabulary words, but these techniques are omitted in this post for simplicity.

#### Model
We used a “deep learning” framework to handle this task. If that sounds pretentious, you are probably right! In fact, “deep learning” is just a fancy name of saying “neural networks with many layers”, while “neural network” is just a fancy way of saying “nested functions”. So nothing mysterious here.

We used a Hierarchical-Attention GRU (HA-GRU) model for this task (Yang, 2016; Baumel 2017). This model’s main characteristics are that 1) it uses a hierarchical model structure to encode at sentence level and document level separately, and 2) it has two levels of attention mechanisms applied at the word and sentence-level. All data manipulation and model training was done in Python 3.6.0, Pytorch 0.2.0.

To understand this model we needs to understand two components: GRU model and attention mechanism. There are some complicated mathematical constructs, but the idea of this post is to describe the main ideas without digging into too much detail, and point to other helpful sources if necessary.

Gated Recurrent Unit (GRU) is a type of Recurrent Neural Network (RNN) models, which is a type of neural networks. Let’s start from the simplest construct: neural networks.

A neural network is a fancy way of saying “function approximation”. For example, suppose we have two variables, $X$ and $Y$. $X$ is a vector of length 3, $Y$ is a vector of length 2.

We would like to discover the relationship between $Y$ and $X$ in the form of $Y = f(X)$.

We might first do a linear transformation on $X$ and get $g(X)$, then throw the $g(X)$ into another non-linear function $z(\cdot)$, and get the final result $z(g(X))$.

So in the end, we will have $Y = f(X) = z(g(X))$.

A neural network model is essentially a graphical way of presenting this mathematical transformation. If we present the previous mathematical formulation into a neural network, it will look like the following:


Here is the coorespondence:
– the first three nodes are our $X$, since dimension is three we have three nodes.
– the second two nodes are our transformed hidden vector$h$. The dimension of hidden vector is two, so we have two nodes.
– the final two nodes are our final predicted vector $Y$. The dimension is two, so we also have two nodes.

With the above understanding of simple neural networks, a recurrent neural networks adds autoregression into the formula by passing the output of every simple neural network to the input of the next neural network. The idea is somewhat similar to an autoregressive time series model. Such architecture helps dealing with long term dependencies.

To understand more about neural networks, the following blog posts are excellent resources:

Attention mechanism, on the other hand, is based on recurrent neural networks. The mechanism is multiplying outputs at multiple time stamp by a weight matrix that might or might not be influenced by the output. There are many implementations of attention mechanism, while the following posts also serve as great introduction to this topic:

After understanding recurrent neural networks and attention, the specific architecture of our model – HA-GRU is easy to follow!

Basically, this model is a stack of two recurrent neural networks.
At the first level we encode every sentence in a document into a vector. This is achieved by first embedding words into word vectors, then applying a bi-directional GRU (word GRU) with neural attention mechanism on the embedded words. The output of this step is sequences of sentence vectors.

At the second level, we repeat the process of a recurrent neural network plus attention mechanism, but this time at sentence level.

### Results
We train our HA-GRU model with different parameter configurations of embedding layer, word bi-GRU layer and sentence bi-GRU layer. We use mini-batch gradient descent for numeric optimization.
After first running the model on our full dataset
(over 29,000 summaries and 936 labels), we noticed that, because of the extreme sparsity of the label set, the results were negligible. So we decided to test a smaller label set and finally achieved an accuracy ranging from 0.01-0.04. In other words, our model does not seem to make good predictions.

### Discussion
While our results were not great, the project was still a good exercise for thinking about, and testing how artificial intelligence models can help with medical practices. It also shows that obviously, our Dr. AI has a large room of improvement!

Two approaches for logistic regression

Finally, a post in this blog that actually gets a little bit technical …

This post discusses two approaches for understanding of logistic regression: Empirical risk minimizer vs probabilistic approaches.

Empirical Risk Minimizer

Empirical risk minimizing frames a problem in terms of the following components:

  • Input space X \in R^d. Corresponds to observations, features, predictors, etc
  • outcome space Y \in \Omega. Corresponds to target variables.
  • Action space A_R  = R Also called decision function, predictor, hypothesis.
    • A sub element in action space could be hypothesis space: all linear transformations
  • Loss function: l(\hat{y}, y) a loss defined on the predicted values and observed values.

The goal of the whole problem is to select a function mapping $F$ in action space that minimizes the total loss on sample. This is achieved by selecting the value of the parameters in $f$ such that it minimizes the empirical loss in the training set. We also do hyperparameter tuning, which is done on the validation set in order to prevent overfitting.

In a binary classification task:

  • Input space X \in R^d.
  • outcome space Y \in {0, 1}. Binary target values.
  • Action space A_R  = R The hypothesis space: all linear score functions
    • F_{score} = {x \rightarrow x^Tw | w \in R^d}
  • Loss function: l(\hat{y}, y) = l_{logistic}(m) = \text{log} (1 + e ^{-m})
    • This is a kind of margin based loss, thus the m here.
    • Margin is defined as \hat{y} y, which has interpretation in binary classification task. Consider:
      • if m = \hat{y} y > 0 , we know we have our prediction and true value are of the same sign. Thus, in binary classification, we could already get the correct result. Thus, for m > 0 we should have loss = 0.
      • if m = \hat{y} y < 0 , we know we have our prediction and true value are of different signs. Thus, in binary classification, we are wrong. We need to define a positive value for loss function.
        • In SVM, we define hinge loss l(m) = \text{max}(0, 1-m), which is a “maximum-margin” based loss (more on this in the next post, which will cover the derivation of SVM, kernel methods) Basically, for this loss, we have when m \geq 1 no loss, $\latex m < 1$ loss. We can interpret m as “confidence” of our prediction. When m < 1 this means a low confidence, thus still penalize!
    • With this intuition, how do we understand logistic loss? We know:
      • This loss always > 1
      • When m negative (i.e. wrong prediction), we have greater loss !
      • When m positive (i.e. correct prediction), we have less loss…
      • Note also for same amount of increase in m, the scale that we “reward” correct prediction is less than the scale we penalize wrong predictions.

Bournoulli regression with logistic transfer function

  • Input space X = R^d
  • Outcome space y \in {0, 1}
  • action space A = [0, 1] An action is the probability that an outcome is 1

Define the standard logistic function as \phi (\eta) = 1 / (1 + e^{-\eta})

  • Hypothesis space as F = {x \leftarrow\phi (w^Tx) | w \in R^d}
  • Sigmoid function is any function that has an “S” shape. One example is the simple case of logistic function! Used in neural networks as activation function / transfer function. Purpose is to add non-linearity to the network.

Now we need to do a re-labeling for y_i in the dataset.

  • For every y_i = 1, we define y'  = 1
  • For every y_i = -1, we define y'  = 0

Can we do this? Doesn’t this change the value of y-s ? The answer is , in binary classification ( or in any classification), the labels do not matter. Instead, this trick just makes the equivalent shown much easier…

Then, the negative log likelihood objective function, given this F and dataset $laex D$, is :

  • NLL(w) = \sum_i^n [-y_i ' \text{log} \phi (w^T x_i)] +(y_i ' -1) \text{log} (1 - \phi (w^T x_i))

How to understand this approach? Think about a neural network…

  • Input x
  • First linear layer: transform x into w^Tx
  • Next non-linear activation function. \phi (\eta) = 1 / (1 + e^{-\eta}).
  • The output is interpreted as a probability of positive classes.
    • Think about multi-class problems, the second layer is a softmax — and we get a vector of probabilities!

With some calculation, we can show NLL is equivalent to the sum of empirical loss.



Encoding categorical features: likelihood, one-hot, and feature selection

This post describes techniques used to encode high cardinality categorical features in a supervised learning problem.

In particular, since these values cannot be ordered, the features are nominal. Specifically, I am working with the Kaggle competition here. The problem with this dataset is that some features (e.g. types of cell phone operating systems) are categorical and has hundreds of values.

The problem occurs in how to fit these features in our model. Nominal features work fine with decision trees (random forests), Naive Bayes (use count to estimate pmf). But for other models, e.g. neural networks, logistic regression, the input needs to be numbers.

Before introducing likelihood encoding, we can go over other methods in handling such situations.

Likelihood encoding 

Likelihood encoding is a way of representing the values according to their relationships with the target variable. The goal is finding a meaningful numeric encoding for a categorical feature. Meaningful in this case means as much related to the output/target as possible.

How do we do this? A simple way is 1) first group the training set by this particular categorical feature and 2) representing each value by within group mean of that value. For example, a categorical feature might be gender. Suppose the target is height. Then, we might have the average height for male is 1.70m, while the average height for female is 1.60m. We then change ‘male’ to 1.70, while ‘female’ to 1.60.

Perhaps we should also add some noise to this mean to prevent overfitting to training data. This can be done by :

  • add Gaussian noise to the mean. Credit to Owen Zhang :
  • use the idea of “cross-validation”. So here, instead of using the grand group mean, we use the cross-validation mean. (Not very clear on this point at the moment. Need to examine the idea of cross-validation. Will write in the next post.) Some people propose on Kaggle about using two levels of cross-validation: 

One hot vector

This idea is similar to the dummy variable in statistics. Basically, each possible value is being transformed into its own columns. Each of these columns will be a 1 if the original feature equals this value, or 0 if the original feature does not equal this value.

An example is for natural language processing models, the first step is usually 1) tokenize the sentence and 2) constructing a vocabulary and 3) map every token to an index (a number, or a nominal value, basically). After that, we do 4) one hot encoding and 5) a linear transformation in the form of a linear layer in a neural network (basically transform high-dim one hot

After that, we do 4) one hot encoding and 5) a linear transformation in the form of a linear layer in a neural network (basically transform high-dim one hot vectors into low dim vectors). In this way, we are basically representing every symbol in a low dimensional vector. The exact form of the vector is learned. What is happening here is actually dimension reduction.  So, after learning the weighting matrix, other methods, like PCA, can potentially work here as well!


A classmate named Lee Tanenbaum told me about this idea. This is an extension on the one-hot encoding idea. Suppose there can be n values in the feature. Basically, we use two hash functions, hash the possible values into two variables. The first hash all values into \sqrt(n) number of baskets,basketh baskent there are \sqrt(n) number of feature values. All feature values in the same busket is going to be the same for variable A. Then, we use a second hash function, that carefully hash the values into another busket variable B. We want to make sure the combination of A and B can fully represent every possible value in the target feature. We then learn a low-dim representation for both A and B, and concantenate them together.

My opinion on this is, this is still one-hot encoding + weight learning. However, we are forcing certain structures onto the weight matrix.

Feature selection 

Still based on one-hot encoding. However, instead of compressing everything into a tiny low-d vector, we discard some dummy variables based on their importance. In fact, LASSO is exactly used for this! L1 usually drives the coefficient of some features to zero, due to the diamond shape of the constraint. Source:

  • On why l1 gives sparsity: video here :
  • Course here : Statoverflow answer here:

Domain specific methods

These models exploit the relationship between these symbols in the training set, and learn a vector representation of every symbol (e.g. word2vec). This is certainly another way of vectorizing the words. Probably I will write more about this after learning more on representation learning!

Python generators

This short post describes what is a generator in Python.

A function with yield in it is a function. However, when called the function, it returns a generator object. Generators allow you to pause a function and return an intermediate result. The function saves its execution context and can be resumed later if necessary.

def fibonacci():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

g = fibonacci()

[next (g) for i in range(10)]

This will return [1, 1, 2, 3, 5, 8, 13, 21, 34, 55].

When call the list comprehension again, it will return:

[next (g) for i in range(10)]

[89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765]

Here, note the function is like a machine that can generate what you want. It will also remember its last state. This is different from a function that returns a list, which will not remember its last state..

Python generators can also interact with the code called with the next method. yield becomes an expression, and a value can be passed along with a new method called sendHere is an example piece of code:

def psychologist():
    print("Please tell me your problem")
    while True:
        answer = (yield) # note the usage of yield here 
        if answer is not None:
            if answer.endswith("?"): # note it's endSwith, the s there
                print("Don't ask yourself too many questions.")
            elif "good" in answer:
                print("A that's good, go on. ")
            elif "bad" in answer:
                print("Don't be so negative.")

This defines a function that can return a generator.

free = psychologist()


This will return “generator”


This will return “Please tell me your problem.”


This will return “Don’t ask yourself too many questions.”

free.send("I'm feeling bad.")

This will return “Don’t be so negative.”

free.send("But she is feeling good!")

This will return, “A that’s good, go on.”

send acts like next, but makes yield return the value passed. The function can, therefore, change its behavior depending on the client code.

Should I use an IDE, or should I use Vim?

This problem has been bugging me for a while so I decide to write it out even though it’s just a short piece.  This post compares tools for Python programming using :

  • Jupyter Notebook
  • IDEs like PyCharm
  • Text editors like Vim

Jupyter Notebook:

  • The pro is it’s easily visualizable. When you want a graph, you can see a graph immediately. The comments are also beautifully formatted.
  • Another pro is it can be connected to a Linux server like Dumbo.
  • The con is, it’s not a program. A notebook is it’s own file and although it can be downloaded as a .py file, the file is usually too long, with lots of comments like typesetting parameters.

When to use it?

  • Data exploration. because of the visualization and analysis nature.
  • Big data. Because it can be connected to a server, that makes running large amount of data possible.


  • The pro is it’s suited for Python development. I have not learnt the functionalities entirely, but e.g. search and replace are easily doable in PyCharm. Debugging also seems to be easy?
  • The con is it’s usually not available on a server.
  • Another con is need extra finger movement when switching from terminal to Pycharm.

When to use it?

  • Debugging complicated programs. e.g. NLP programs.
  • No need to run on a server.


  • The pro is it’s everywhere. Thus, whenever you write on your own machine, or on the server,   it feels the same.
  • Another pro is it can be used for anything. like python, C++, markdown, bash… So there is no need to switch to other places when ssh to the server.
  • The con is it’s not that easy to use. e.g. search and replace… hard to do this. Adjust tab? also not immediately doable.
  • Another con is it’s not that easy to debug. have to manually print out variables… This makes it particularly difficult when the program is large.

When to use it?

  • When need to connect to a server. e.g. big data size.

Chi-square and two-sample t-test

This post explains a basic question I encountered, and the statistical concepts behind it.

The real-life problem

Someone asks me to construct data to prove that a treatment is useful for 1) kindergarten and 2) elementary school kids in preventing winter cold.

Chi-square and student’s t test

First, decide how to formulate the problem using statistical tests. This includes deciding the quantity and statistic to compare.

Basically, I need to compare two groups. Two tests come to mind: Pearson’s chi-square test, and two-sample t-test. This article summarizes main difference between the two tests, in terms of Null Hypothesis, Types of Data, Variations and Conclusions. The following section is largely based on that article.

Null Hypothesis

  • Pearson’s chi-square test: test the relationship between two variables, or whether something has effects on the other thing (?). e.g. men and women are equally likely to vote for Republican, Democrat, Others, or Not at all. Here the two variables are “gender” and “voting choice”. The null is “gender does not affect voting choice”.
  • Two-sample t-test : whether two sample have the same mean. Mathematically, this means \mu_1 = \mu_2 or \mu_1 - \mu_2 = 0 . e.g. boys and girls have the same height.

Types of Data

  • Pearson’s chi-square test: usually requires two variables. Each is categorical and can have many number of levels. e.g. one variable is “gender”, the other is “voting choice”.
  • two sample t-test: requires two variables. One variable has exactly two levels (two-sample), the other is quantitively calculatable. e.g. in the example above, one variable is gender, the other is height.


  • Pearson’s chi-square test: variations can be when the two variables are ordinal instead of categorical.
  • two-sample t-test: variations can be that the two samples are paired instead of independent.

Transform the real-life problem into a statistical problem

Using chi-square test

Variable 1 is “using treatment”. Two levels: use or not.

Variable 2 is “getting winter cold”. Two levels: get cold or not.

For kindergarten kids and for pre-school kids, I thus have two 2 * 2 tables.

(question: can I do a chi-square test on three variables? The third one being “age”.)

Using two-sample t-test

Variable 1 is “using treatment”. Two levels: use or not

Variable 2 is supposed to be a numerical variable —- here use disease rate. But then there is no enough number of samples.

Thus, conclude that Chi-square test should be used here. 


Brief explanation of statistical learning framework

This post explains what is a statistical learning framework, and common results under this framework.


We have a random variable X, another random variable Y. Now we want to determine the relationship between X and Y.

We define the relationship by a prediction function f(x). For each x, this function produces an “action” a in the action space.

Now how do we get the predictive function f? We use a loss function l(a, y), that for each a and y, we produce a “loss”. Note since X is a random variable, f(x) is a transformation, so a is a random variable, too.

Also, l(a, y) is a transformation of (a, y), so l(a, y) is a random variable too. It’s distribution is based on both X and Y.

We then calculate f by minimizing the expectation of the loss, which is called “risk”. Since the distribution of l(a, y) is based both on the distribution of X and Y, to get this expectation, we need to do integration both on X and on Y. In the case of discrete variables, we do summation based on the pmf of (x, y).

The above are about theoretical properties of Y, X, loss function and prediction function. But we usually do not know the distribution of (X, Y). Thus, we choose to minimize empirical risk instead. We calculate empirical risk by summing all the empirical loss together, divided by m. (q: does this resemble Monte Carlo method? is this about computational statistics? Need a review.)


In the case of square-loss, we have the result, a = E(y|x).

In the case of 0-1 loss, we have the result, a = arg max P(y|x)



We want to predict a student’s mid-term grade (Y). We want to know the relationship between predicted value, and whether she is hard-working (X).

We use square-loss for this continuous variable Y. Since we know that to minize square loss, for any random variable we should predict the mean value of the variable (c.f. regression analysis, in OLS scenerio we calculate the MSE — but need further connection to this framework).

Now we just observed unfortunately the student is not hard-working.

We know for a not-hardworking student the expectation of mid-term grade is 40.

We then predict the grade to be 40, as a way to minimize square-loss.


Probability, statistics, frequentist and Bayesian

This post is a review of basic concepts in probability and statistics.

Useful reference:


It’s a tool to mathematically measure uncertainty.

Formal definition involving \sigma-algebra:

A probability space isa triple (\Omega, F, P) consisting of :

  • A sample space \Omega
  • A set of events F – which will be \sigma-algebra
  • A probability measure P that assigns probabbilites to the events in F.

Example: We have a fair coin. Now we toss it 1000 times, what’s the probability of getting 600 heads or more?


The goal of statistics is to 1) draw conclusion from data (e.g. reject Null Hypothesis) and 2) evaluate the uncertainty of this information (e.g. p-value, confidence interval, or posterier distribution).

At the bottem, statistical statement is also about probability. Because it applies probability to draw conclusions from data.

Example: We would like to know whether the probability of raining tomorrow is 0.99. Then tomorrow comes, and it does not rain. Do we conclude that P(rain) = 0.99 is true?

Example 2: We would like to decide if a coin is fair. (Data) Toss the coin 1000 times, and 809 times it’s a head. Do we conclude the coin is fair?

Note : probability is logically self-contained. There are a few rules, and the answers follow from the rules. Statistics can be messy, because it involves draw conclusion from data – much art than science.

Frequentist vs Bayesian

Two schools of statistics. They are different in their interpretation of probability.

Frequentist interpret probability to be the frequencies of events in repeating experiments. E.g. P(head) = 0.6. Then if we toss a coin 1000 times, we will have 600 heads.

Bayesian interprets probability to be a state of knowledge, or a state of belief, about a preposition. E.g. P(head) = 0.6, means we are fairly certain (around 60% certain!) that a coin will be tossed head.

In practice though, Bayesian seldom use a single value to characterize such belief. Rather, it uses a distribution.

Frequentists are used in social science, biology, medicine, public health. We see two sample t-tests, p-values. Bayesian is used in computer science, “big data”.

Core difference between Frequentists and Bayesian

Bayesian considers the results from previous experiments, in the form of a prior.

See this comic for an illustration.

What does it mean?

A frequentist and a Bayesian are making a bet about whether the sun has exploded.

It’s night, so they can not observe.

They ask some expert whether the sun has gone Nova.

They also know that this expert will toss two coins. If both get 6, she will lie. Else, she won’t. (Data generation process)

Now they ask the expert, who tells them yes, the sun has gone Nova.

Frequent conclude that since the probability of getting two 6’s is 1/36 = 0.0027 <0.05 (p < 0.05), it’s very unlikely the expert has lied. Thus, she concludes the expert did not lie. Thus, she concludes that the sun has exploded.

Bayesian, however, has a strong belief that the sun has not exploded (or else they will be dead already). The prior distribution is

  • P(sun has not exploded) = 0.99999999999999999,
  • P(sun has exploded) = 0.00000000000000001.

Now the data generation process is essentially the following distribution:

  • P(expert says sun exploded |Sun not exploded) =  1/36.
  • P(expert says sun exploded |Sun exploded) =  35/36.
  • P(expert says sun not exploded |Sun exploded) =  1/36.
  • P(expert says sun not exploded |Sun not exploded) =  35/36.

The observed data is “expert says sun exploded”. We want to know

  • P( Sun exploded | expert says sun exploded ) = P( expert says sun exploded | Sun exploded) * P( Sun exploded) / P(expert says sun exploded)

Since P(Sun exploded) is extremely small compared to other probabilities, P( Sun exploded | expert says sun exploded ) is also extremely small.

Thus although the expert is unlikely to lie (p = 0.0027), the sun is much more unlikely to have exploded. Thus, the expert most likely lied, and the sun has not exploded.

Debugging larger pyspark ML programs

This post describes my learning experience in developing larger programs, especially those :

  • Takes a long time to run – due to big data sets and computationally intensive algorithms
  • Requires developing locally and on HPC. That is cannot be solved in a Python IDE

The take away is:

  • To save time, try writing scripts in one place only.
  • Do not develop interactive and then paste everything to an editor!

The problem

I found myself spending excessive time (~ 4 days) developing a program that should be simple in its logic. Basically, I just need to call Pyspark API functions to do classification on 20Newsgroup dataset. It should be a simple program!

How did I spend my time?

  • First day, I found a script doing a similar task. When I tried to use it I came into the following problems:
    • Read files on HDFS. Spark only works with files on HDFS not local file system. This mistakes took me some time, as I thought the problem was with syntax in reading nested folders.
    • The script does not clean text – remove headers, numbers, punctuations. To do this, I had to understand the .trans function. This also works differently in Python2 and Python3, which took me some time to realize.
    • The script use Pyspark SQL, thus took some time to learn DataFrame as well.
  • By day 2 the data is read into Spark DataFrame.
    • I then had problem calling functions from MLLib, because I didn’t realize ML and MLlib are two libraries and have different data structures. I then came into problem when using an MLlib function to ML data.
    • I also tried to convert the data structure back and forth, from RDD to Labeled Point to ML.
    • To inspect what is in the data, I also spent time calling the wrong functions, or transform everything into an RDD and call map functions.
  • By day 3 I intend to use Spark-submit on HPC. The main task is to learn to use editor.
    • Because someone told me I should be using editor instead of debugging interactively, or else I cannot see code structure, I began to learn vim. That took a morning or so (!).
  • By day 4, I am trying to clean up the code and write functions.
    • This creates another level of complexity. One of the bug is I forget to update the function argument


  • I type every line of code three to four times. First on my workstation, then on the server’s pyspark shell. Finally I copy the code into a script. This not only creates space for mistakes, but also is inefficient.
    • I did this because I am not comfortable writing scripts on HPC yet.
    • Also because I am not comfortable debugging with a script and an editor, without using IDE.
  • I input every line of code at least three to four times.
    • First in interactive mode.
    • Then, copying the line into my editor. Since the project spans across days, every time I start again I need to re-read the files!
    • Also, I then creates a file on a small dataset on HPC and run that.
    • After that, I run the file again on a larger dataset.
  • Since I am debugging at multiple places, I need to do version control as well.
    • I remember scp back and forth sending files along. Whenever I edited something, I remove the older version of the program at the other place.
  • I am not familiar with data structure and functions on Pyspark. This also leads to waste of time.
  • Interactive mode of Spark is slow. I once forgot to cut the dataset smaller, and running one command (e.g. transform) on the whole dataset takes 10 minues! If there are 20 commands like this that would be ~3 hours.


  • The above problems can be summarized as :
    • I need to write script on multiple places: server and local.
      • Solution: learn to use editor in the server. e.g. Vim
    • Submit program interactively vs in batch. I do not know how to debug with a script so I have to use interactive mode to make sure I know what I am doing. But debugging interactively means double the amount of typing because every command needs to go through the terminal and the editor. That’s also running the program twice.
      • Solution: learn to run the program from the script. The con of this is need to laod data multiple times. Also, use a smaller dataset so data loading will not be a pain. 
    • Time spent on start-up. The data file is huge, thus re-loading it takes time. If every time load the dataset is 3 minutes, load it 20 times would be 60 minutes.
      • Solution: use a smaller sample dataset for developing.

Moral: try to write only one set of programs in a single place.

  • For pyspark, I can only use HPC. So I just write on HPC.
  • Pyspark also cannot use pdb.
  • For Python, I should test everything locally first. There are two final programs
    • A script that can run locally and on HPC.
    • A script to be submitted to the cluster.

Debugging philosophy

  • “Bugs” are not bugs but errors. The responsibility lies with the programmer. A program that has errors is simple wrong, because 1) the programmer is not really familiar with the rules and grammars of the library. Used the wrong data structure, used the wrong function call. etc.
  • Debugging is a learning experience.  Why does debugging takes lots of time? Because the programmer is learning something new, thus need time to try and make mistakes. There seems to be a wishful thinking that a perfect program will magically began working by itself, and thus time will be well spent. It wont’, because learning always takes time!

Pyspark ML vs MLLib

A post that summarizes main difference between Pyspakr ML and MLlib. This is based on Spark 2.2.0 and Python 3.

Data Structure

  • pyspark.mllib is the older library for machine learning. It can only use RDD labeled point. But then more features than ML
  • newer library for machine learning. It can only use sql DataFrame. But easier to construct a ML pipeline.
  • It is recommended to use DataFrame APIs (i.e. ML) because it is more stable. Also this is newer.


The two libraries seem to be similar in terms of feature selection APIs.


  • implements Pipeline workflow, which is based on dataFrame and can be used to “quickly assemble and configure practical machine learning pipelines”.

Classification algorithms provided by MLlib:

Classification algorithm by ML:

Seems that ML has multilayerPerceptron which is not in MLLib.


  • pyspark ml implements while pyspark.mllib does not
  • However Pyspark ML’s evaluation function is difficult to use.  The docs are not clear. Thus, changing back to MLlib to use this.

When to use which?

  • Efficiency: MLLib uses Labelpoint RDD, while ML uses structured DataFrame. Thus, “if the data is already structured DataFrames, then ML confer some performance benefits over RDD’s, this is apparently drastic as the complexity of your operations increase.”
  • Resource: DataFrames consume far less memory when caching than RDDs. Thus lower level operations RDD’s are great but high level operations, viewing and typing with other API’s use DataFrames. (Above all quote Stackoverflow user Grr)
  • It is recommended to use ML and DataFrame. However if want to use Gridsearch, seems still need to revert to Labeled point!

Reference :

A stackoverflow question:

MLlib doc:

ML doc:

Comparison of data structures: