Category Archives: Techy

2D convolution with tiling technique on GPU

This post describes how to write CUDA C code to perform 2D convolution on GPU with tiling technique. It is very brief, only covers basic concepts but with links to code examples. We are not going to use cuDNN, only the bare bones of CUDA. But here is an excellent post if you want to use cuDNN – for this particular task, cuDNN is slower than the code I will present below.

You can find the source code for this post here.

This post will consist of the following sections:

  • Installation CUDA
  • Basic GPU programming model
  • Vanilla convolution on GPU
  • Constant memory in GPU
  • Tiling technique and indexing
  • Link to code example

Install CUDA

First, make sure if you have a NVIDIA GPU on your machine. Here is how. Or just search the model online and ask on reddit 🙂

Next, follow the official NVIDIA guide here to download CUDA Toolkit. Note not every card support every version of CUDA kit.

Note also the download path, as you will need to specify that to the compiler later. If you are using Linux you can type whereis nvcc to find out. nvcc is the NVIDIA C compiler.

It’s easy to get wrong. Just follow carefully every step in the official installation guide. In particular don’t forget to finish the post-installation set-ups! Wrong PATH environment variables can be a real pain.

Basic CUDA Programming model

The most prominent characteristic of CUDA programming is parallization. When you write one line of code in a CUDA kernel (explained later), it will be executed by thousands of threads on different data. This is called the SIMD (same instruction multiple data) parallel paradigm.

With this idea in mind, here is what a typical CUDA C code look like:

  • Setting up the program on the host (CPU) by initializing variables and populate them. Write non-parallel sequential code.
  • Preparing for GPU work by initializing device (GPU) variables that reside in device memory, and copy contents from the corresponding host variables to it.
  • Write a GPU kernel function to deal with the parallel part of computation (e.g. convolution).
  • Set up the parallel configurations by specifying how many blocks and grids are used for the kernel. These two variables will decide how many threads we started for parallel execution.
  • Call the kernel from host code. Store the results in a device variable.
  • Copy contents from device to the host.

Some concepts to clear:

  • How many threads we use in the GPU is controlled by Grid size and Block size. A grid consists of many blocks. Grid and Block are both 3D structures. For example, a grid of dimension <2, 2, 2> will have 8 blocks in total, and a block of dimension <2, 2, 2> will have 8 threads in total. A block cannot have more than 1024 threads.

Here is a basic matrix addition example:


// function declarations 
__global__ void vecAddKernel(float * a, float * b, float * c, unsigned int N);

// main function 
int main()
    int N = 10;    // length of vector 
    float * a, * b, * c;  // a and b are vectors. c is the result
    unsigned int size = N * sizeof(float);  // number of bytes to allocate 
    a = (float *)calloc(N, sizeof(float));
    b = (float *)calloc(N, sizeof(float));

    int i = 0;
    float sum = 0;
    for (i = 0; i < N; i++){
        a[i] = (float)i / 0.23 + 1;
        b[i] = (float)i / 5.89 + 9;
        sum += a[i] + b[i];

    c = (float*) malloc(size);
    // 1. allocate memory on CUDA
    float * d_a, * d_b, * d_c;   // device memory 
    cudaError_t err1 =  cudaMalloc((void **) & d_a, size);
    cudaError_t err2 = cudaMalloc((void **) & d_b, size);
    cudaError_t err3 = cudaMalloc((void **) & d_c, size);
    if (err1 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err1), __FILE__, __LINE__);
    if (err2 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err2), __FILE__, __LINE__);
    if (err3 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err3), __FILE__, __LINE__);
    // copy memory 
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. operate on kernels 
    vecAddKernel<<<ceil(N/256.0), 256>>>(d_a, d_b, d_c, N);

    // 3. copy the results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    cudaError_t error = cudaGetLastError();
        fprintf(stderr,"ERROR: %s\n", cudaGetErrorString(error) );
    float cuda_res = 0;
    for(i = 0; i < N; i++){
        printf("%.2f\t", c[i]);
        cuda_res += c[i];
    printf("Results from host :%.2f\n", sum);
    printf("Results from device:%.2f\n", cuda_res);


    return 0;

void vecAddKernel(float * a, float * b, float * c, unsigned int N){
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i<N)  c[i] = a[i] + b[i];


Vanilla convolution on GPU

Where to parallelize for convolution? We use every thread to correspond to one single element in the output, and all threads do the same convolution operation but with different data chunks.

Constant memory in GPU

Constant memory variables are shared by all threads in all blocks. This memory is small but fast. So we should put the filter in it because everyone needs it, and it’s small.

Tiling technique and indexing

Tiling is making use of the shared memory on GPU. Shared memory is shared by all threads in the same block, it is small but fast. So the idea is for every thread in a block, they will operate on data regions that have some overlaps. And we make use of that to copy the small region of the larger picture that these threads need into the shared memory and dealing with them.

We use all threads to copy these data elements. Here is a further illustration of this tiling idea.

Code example



Here is the tiling organization :

Understanding GRUs

This post follows this post last year about vanilla Recurrent Neural Network structures.

One of the ultimate goal of a recurrent neural network is to summarize a sequence of vectors into a single vector representing all the information.

In the simplest case, imagine for the past month, every day  you and your friend did some push-ups. Let’s denote your number as x_i at day i, and your friend’s number as y_i at day i. Now a month has passed and you two want to compare who is more keen on working out.

So what you do you? Directly comparing two vectors is hard, because there is no obvious “greater-than” relations across dimensions? One way is to calculate the simple average of the two sequences, i.e. on average you did \bar{x} and your friend did \bar{y} per day. Whichever number is larger, you can bet that person is a better athlete!

Now in this example, time order does not really matter, because we are treating numbers from 30 days ago as the same as number yesterday. Another example where time does matter is interests from bank savings. Suppose every year you save b_i amount of Yuan into the bank, and the bank pays out interests at the end of year, once per year. How much will your total savings be at the end of the fifth year? You can image the oldest money will receive the highest interests, because of confounding.

With the same logic, we can apply this idea to summarizing sentences. Suppose every word can be represented as a n-D vector in a semantic space. How do we produce one single vector that represent the meaning of this sentence?

Taking average is a way, but note every word will have some influence on the words after it. e.g. “this burger is delicious.” Note how the “burger” constrains word choices after “is” … And that’s why we need recurrent neural network : at every step of the model, the hidden vector (which contains information about previous words) will concatenate with the current word, and becomes the input into producing the next hidden vector. So every word will have a lasting influence in the final output.

However, simple RNN has a vanishing (explosive) gradient problem. Mathematically, the older a word is, the higher order it will has on the multiplication factor of weight matrix. When taking first-order gradients, the weight matrix will have an order of n-1. Now if one term is larger than 1, as n approaches infinity this will will approach infinity too. If one term is smaller than 1, as n approaches infinity this will will approach zero and thus the model will “stop learning” (i.e. weights will not update).

Let’s formulate the intuition above more formally. For a simple RNN, every update we have

h_t =g (W x_t + U h_{t-1} + b)

Here g can be any non-linear activation function, e.g. a RELU or a sigmoid. Now we consider the gradient of h_t with regard to W.


h_t = g (O_t)


O_t =W x_t + U h_{t-1} + b

Using the chain rule we have:

\frac{\partial h_t}{\partial W} =\frac{\partial h_t}{\partial O_t} \frac{\partial O_t}{\partial W}

We also know:

\frac{\partial O_t}{\partial W} = X_t + U_h \frac{\partial h_{t-1}}{\partial W}

So plug in the second equation above into the first equation, we have:

\frac{\partial h_t}{\partial W} = {\partial g} \cdot (X_t + U_h \frac{\partial h_{t-1}}{\partial W})

We can already see a recurrent relation between \frac{\partial h_t}{\partial W} and \frac{\partial h_{t-1}}{\partial W}. To make things clear, write \frac{\partial h_t}{\partial W} = \alpha_t, and expand the equation a few more steps:

\alpha_t = {\partial g} \cdot (X_t + U_h \alpha_{t-1})

\alpha_t = {\partial g} \cdot (X_t + U_h \cdot \partial g \cdot (X_{t-1} + \alpha_{t-2}) )


\alpha_t = C + (\partial g U_h)^{n-1} \alpha_0

Here $C$ represent the other terms (with lower order of \alpha_t-s) in the formula that we omitted for simplicity. So we can see, as n increases, if norm of \partial g U_h is greater than 1 this term will approach infinity, or if is less than 1 it will approach zero.

The main reason is the same term is multiplied n-1 times. i.e. information always flow at the same rate every step. If you think about a sentence, this does not really make sense as words meaning / importance does not really decrease (increase) exponentially w.r.t. it’s distance to the end of the sentence. The first word might very well be very important (e.g. a female name which will influence later pronoun choices “he” or “her”), but this vanilla RNN structure is too simple to capture that. We need more dynamic structures to allow information flow freely between time stamps.

Gated recurrent unit solves the problem by adapting different terms in every update step. There are two key ideas:

  • Introduces an external “memory” vector to capture long distance dependencies.
  • Allow error messages to flow at different strengths depending on inputs.

It achieves this by first computing two “gates”:

update gate : z_t = \sigma (W_z x_t + U_z h_{t-1} + b_z)

reset gate: r_t = \sigma (W_r x_t + U_r h_{t-1} + b_r)

They are both continuous vectors of the same length as the hidden state, constructed by passing the current word x_t and the last hidden vector h_{t-1} through an MLP. The sigmoid function makes every element between 0 and 1, so when used to perform element-wise multiplications \o, these two gates essentially controls how much information “flows” through.

There is also a candidate vector representing “new memory content”:

\tilde{h_t} = \text{tanh} (W_h X_t + r_t \circ (U_h h_{t-1} + b_h)

The usage of tanh and W_h X_t is pretty standard as in all other unites. But note the meaning of reset gate r_t here. If r_t is close to zero, then we “forget” all the previous information in the h_{t-1} vector, and just use the current word information.

An example where the reset gate is useful is sentiment analysis on a movie review, where the author spend many sentences describing and summarizing the movie plot, but conclude that “But the movie is really boring”. With this sentence, all the previous information become useless. This reset gate can help the model “forget” the previous summaries.

Finally, we update the new hidden vector as :

h_t =  z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h_t}

You can see the update controls how much information is remembered from old state, and how much information is acquired from the new memory content.

A caveat is why do we need the reset gate at all now that we have the update gate? Doesn’t we already have a gate to control how much old information is retained by using an update gate? Doesn’t the update gate also have the ability to eliminate all historical information embedded in h_{t-1}? I did not find satisfactory information about this point online.

Now think about why GRU can solve the vanish gradient problem. The update gate allows us to retain all previous information by setting all elements of z_t to 1, and h_t is exactly the same as h_{t-1}. In this case, we do not have the vanishing gradient problem because no weight matrix is multiplied.

For example, in the same movie review suppose the author says “I love this movie so much” and then goes on to explain the movie plot. Now the first sentence is really important, but with recurrent neural network, we update the gradient at every step and the content will be washed out as time passes. But now the update gate allows the model to retain the first sentence without exponentially decay its content.

Units with short term memories usually have reset gate very activate (i.e. numbers close to 1), and thus forget most about the past…

If reset gate is entirely 1 and update gate is entirely zero, then we just have a standard RNN.

Why tanh and sigmoid?

  • In theory tanh can also be RElu, but in practice it just performs better.
  • Sigmoid will control value between 0 and 1, but again no formal justification.

Finally, here is an illustration that pictures what’s going on in a GRU unit:


pointnet 论文理解

这几天(上周)在看 pointnet 系列的论文,具体研究问题是如何处理点云数据。大的研究领域是机器学习如何处理3D视觉问题,在无人车和机器人领域有应用,因为 depth camera,传感器,激光雷达传回来的图片有些是 point cloud。

点云数据是用点来代表三维图像的一种数据格式,例如一张椅子的三维图。从椅子平面上抽取2400个点,每个点是 (x, y, z) 的一个三维向量,那么把这些点在三维空间里画出来,就会得到飞机的图片。这里有一个可以在线观看的例子。这张图片里的点很多,所以观看效果很逼真。如果点少的话,就会变成这样

不同于2D图片,点云数据是无序的。传统 CNN 通过滑动 kernel 的卷积操作,能提取出 local 像素点之间的依赖关系(例如 max pooling,以及 convolution 是 weighted sum of local points)。但点云数据没有清晰的 local 关系。如何解决呢?1)可以把点云数据复原到三维,把它们分割成一小格(voxel)然后进行3D卷积。2)可以把点云数据投影到二维,代表算法有MV3D 和 AVDO

pointnet 是这系列的重点论文,2016年发表。这篇论文说不需要进行这样的处理,而是直接把点云数据输入到 multi-layer perceptron 中。步骤如下:

  1. 假设 batch size = 32, 每张图片有 1024个点。那么每批次输入的数据形状为 32 * 1024 * 3
  2. 首先把这个数据中每一张图片和一个 3 * 3 的 transform matrix 相乘。整体的形状,就是 32 * 1024 * 3 的 tensor乘上 32 * 3 * 3 的 tensor,使用 tf.matmul 函数,得到的结果是 32 * 1024 * 3。物理意义是把这个点进行坐标变换。这样做的原因是这篇讲  spatial transformer 的论文,还没看。
    • 这个 transform matrix 并不是随机生成,而是一个较复杂的 T-net。它的网络结构和这个大的网络结构一样,也是将输入图片几次升维,非线性变换 + bias,然后降维,得到的矩阵。所以不同的图片虽然分享了网络参数,但是并没有乘以相同系数。
    • 为什么这个网络这么复杂?其实和GRU这类网络有异曲同工之处,都是用当下数据既生成权重,又用这个权重再次处理当下数据。
  3. 图片处理好之后的形状是  32 * 1024 * 3. 接下来是第一次升维。具体操作是,每个点的三个维度,求 weighted sum 获得一个 scalar。然后这样的操作重复64次,得到一个64维的向量。其实也就是长度为3的向量乘以 3 * 64 的矩阵。这样会得到 32 * 1024 * 64 维的数据。将这个数据加上随机的 bias 之后,做 batch normalization,形状不变。
    • 在作者的源代码中,这一步没有用拼接 64 个线性层的方法,而是用 tensorflow 中的 conv2d 实现。通过变换 channel 数来代表64个 fully connected layer。
  4. 接下来又是一次“升维”操作,和上一步一样,但是输出的维度仍然是64.
  5. 现在的数据形状是 32 * 1024 * 64.接下来是 feature transform。通过同样的网络结构,获得一个 32 * 64 * 64 的 tensor。用 tf.matmul 把每一个点再次变换,原理和2)一样。
  6. transform 之后,将图片进行升维。先升到64维,然后是128,最后是 1024.最后我们得到一个 32 * 1024 * 1024 的数据形状。
    • 在作者的源码中,数据形状是32 * 1024 * 1 * 1024,因为要使用 conv2d 操作。
  7. 接下来是关键一步:要概括总结上一步中每个图片中所有的点,得到代表全局信息的向量。这里通过 maxpooling 实现,也就是对每个 channel ,取图片中所有点中的最大值。也就是 element-wise max pooling。
    • 源码中的实现是通过tf.max_pool 函数,输入就是 32 * 1024 * 1 * 1024 的这个数据,kernel size = [1, num_points (1024), 1, 1],输出是 32 * 1 * 1 * 1024。
    • 为什么不用 sum 或者 average?这篇论文针对 max pooling 操作提出了理论性质,我还没看。
  8. 一番折腾后,终于得到了代表每张图片的全局向量。下一步是reshape ,成为形状为 32 * 1024 的一个数据块。把两个多余维度扔掉。


  • classification:只需要全局表征即可。把这个表征通过三层 fully connected layers 生成长度为 40 的向量(因为一共40类),通过 softmax 层得到概率,用 cross entropy 的均值作为 loss function,通过一维逼近得到参数的最优值。
  • segmentation:作者的处理是对每个点做 classification,每个点都贴上这个全局变量后,进行分类。
  • semantic segmentation,不是很懂,略过。

这篇文章的重点在于1)将简单的网络结构直接用于点云数据,2)maxpooling 操作的性质(invariant to input orders)

Writing Python tests: doctest, unittest, and pytest

So I set myself another task the other day: writing tests for an open-source project —- a baby step towards software engineering! This post describes the tools and practices I’ve learnt.

The project I’ve committed to is here. The goal is to write tests for an experiment class, whose objective is to save state of machine learning experiments and quickly retrieve them.

There are two sets of tools to be learnt: 1) conceptually, what is a test and how to  write one? 2) practically, what tools does Python offer for writing and organizing test code, and how to use them? I will answer these two sets of questions below.

What is software testing and how to write it?

Testing is just write code to test if a software behaves as expected. If it does not, then there is a bug, and bugs are bad. Most of the online materials / books on this are kind of fluffy, describing concepts that are hard to define and unnecessarily convoluted. But really I just understand test to be like, this piece of software is supposed to output 1 when I input 3, is that what it is behaving?

There are also edge cases to be tested and taken care of accordingly.

So the real issue seems to be how to write good tests. This is a very broad topic with no definite answer, and I’ll omit discussion in this post. For this project, I’m following this guide.

Testing tools in Python 

  1. doctest
  2. unittest
  3. pytest

doctest  is packaged with Python. The usage is the following:

  • Open a Python interpreter, and interactively test a bunch of results with the function to be tested
  • Copy and paste all output from the interactive session to the docstring of that specific function
  • add a if __name__ == "__main__" line in the script to be tested so as to detect if it is imported or being executed standalone. If it is executed standalone, add one line doctest.testmod() so the doctest will run the test mode..

this module will parse the docstring to get expected results for every input. It will then compare the actual output with the expected input, and return errors if they do not match. If everything runs fine, the module will not output anything.

unittest the biggest difference with doctest is test cases are no longer defined in the original script. Instead, we now create a new script, import the old script and its methods, and test all out test cases there. The good news is now script is separated from tests, the bad news is we need to write more code.

How to use unittest module?

  • First create a new python script, import bothunittest and the original script in it. This script will be run alone, so need to specify the shebang line as well.
  • Create a unittest.TestCase class. syntax is class MyTestCase(unittest.TestCase), i.e. MyTestCase is a subclass of unittest.TestCase.
  • So the real issue is what methods we can use with unittest.TestCase. These methods are used to assert if the expected value is equal to the real reaults. Here is a link to the useful parts of its documentation. Basically, the testing process is a series of assertion processes.
  • The most naive way is just to define a bunch of assertTrue() methods in the class one by one. A more mature way is to use the setUp and tearDown methods to put all test cases into a data structure, e.g. a list, and in the test function body use a for loop to iterate through that data structure. Sample code here :

pytest is a third party package. It is easier to use than the unittest module, because it does not require all the class and method setups.

  • The simplest case, first create a python script with 1) the function to be tested and a function to test that function with assert. e.g. assert func(4) == 5
  • Then, run the scrip with pytest from command line in the folder containing the script. You do not even need to import pytest in!
  • On the other hand, if we do not specify an argument to pytest, it will run test on every file of the form * or test_*.py in the current directory. This is a convention to discover test files. Also see this link.

Finally, adding some good resources on pytest that I just found:

  • A webinar :
  • The accompany git repo:

Some updates on using pytest with PyCharm

  • If you want to run a single test instead of the whole script, see this SO discussion. The pipeline might be broken though.
  • Pycharm professional has a test-runner tab where you could use auto-test and it will re-run a test function after modification. Link here.
  • Add verbose option for pytest. It’s in “edit configurations” — pytest — additional argument. Specify a -v.
  • Put tests in a class so you can run them together. i.e. if they are related and always need to be run together.
  • auto test —- you can have test run continuously when you type

Happy testing!

Using ROS with Conda

For the past two weeks I’ve been playing around with ROS. All was fine until I tried to embed tensorflow and “deep learning” stuff into ROS with conda.

Things went south immediately. Pipelines broke down, what used to work no longer worked anymore. roslaunch showed `not a package` error, import in Python showed package not found even though I just pip installed it…

This post aims to clarify what happens, why my system became a mess, and how to resolve them.

The first problem with error importing modules in python is because ROS conflicts with Conda. This is because ROS sneakly changes the Python path in `~/.bash_rc` file. According to this post by Udacity NanoDegree github:

The problem here is that Python is looking for a package in /opt/ros/kinetic/lib/python2.7/dist-packages/ when it should be looking in the Anaconda installation directories. The reason for this is that with your ROS install, your .bashrc file contains the line source /opt/ros/kinetic/setup.bash, which sets your PYTHONPATH variable to the /opt/ros/kinetic/... location.

The solution is to unset your PYTHONPATH variable in your ~/.bash_rc file, right after sourcing ROS, but before sourcing your Anaconda environment. Use this command unset PYTHONPATH.

The second issue is I used many package installers to install python packages (and possibly python itself). Be careful of online tutorials my dude! I ends up having tens of python paths, scattered around different virtual environments, and things became very convoluted.

Package managers I used include:

  • pip (system wide and within conda; I have five pips …)
  • apt (did not figure out what it did until too late)
  • conda

Where do these package managers install packages?

  • pip:
    • When used without conda environment, will show the base python dir
    • pip when used with virtualenv will generally install packages in the path <virtualenv_name>/lib/<python_ver>/site-packages. For example, I created a test virtualenv named venv_test with Python 2.7, and the django folder is in venv_test/lib/python2.7/site-packages/django.
  • conda will install in the environment’s python site-packages.
    • If there is no Python2 but we want a python2 package, conda seems will install it to the system-wide python site-packages

A detailed comparison about pip and apt:

pip is used to download and install packages directly from PyPI. PyPI is hosted by Python Software Foundation. It is a specialized package manager that only deals with python packages.

apt-get is used to download and install packages from Ubuntu repositories which are hosted by Canonical.

Some of the differences between installing python packages from apt-get and pip are as follows:

  • Canonical only provides packages for selected python modules. Whereas, PyPI hosts a much broader range of python modules. So, there are a lot of python modules which you won’t be able to install using apt-get.
  • Canonical only hosts a single version of any package (generally the latest or the one released in recent past). So, with apt-get we cannot decide the version of python-package that we want. pip helps us in this situation. We can install any version of the package that has previously been uploaded on PyPI. This is extremely helpful in case of conflict in dependencies.
  • apt-get installs python modules in system-wide location. We cannot just install modules in our project virtualenv. pip solves this problem for us. If we are using pip after activating the virtualenv, it is intelligent enough to only install the modules in our project virtualenv. As mentioned in previous point, if there is a version of a particular python package already installed in system-wide location, and one of our project requires an older version of the same python package, in such situations we can use virtualenv and pip to install that older version of python package without any conflicts.
  • As @Radu Rădeanu pointed out in this answer, there would generally be difference in names of packages as well. Canonical usually names Python 2 packages as python-<package_name> and Python 3 packages as python3-<package_name>. Whereas for pip we generally just need to use <package_name> for both Python 2 as well as Python3 packages.

Which one should you use:

Both apt-get and pip are mature package managers which automatically install any other package dependency while installing. You may use anyone as you like. However, if you need to install a particular version of python-package, or install the package in a virtualenv, or install a package which is only hosted on PyPI; only pip would help you solve that issue. Otherwise, if you don’t mind installing the packages in system-wide location it doesn’t really matter whether you use apt-get or pip.

Also, this link that summarizes this problem perfectly:


The third issue is I confused Python packages with ROS packages somewhere in my projects. This is more conceptual but did misguide my actions. Here are some clarifications:


  • ROS is a programming ecosystem/framework that provide middleware for modules (packages) to communicate, especially for the purpose of robotics systems
  • ROS provide distribution in Python and C++
  • ROS code is organized by packages: folders containing src, data, other libraries…
  • ROS has a $ROS_PACKAGE_PATH, which is an environment variable indicating where to find packages for ROS. For example: `echo $ROS_PACKAGE_PATH` will give the output /opt/ros/kinetic/share


  • Python path: environment variables. Where to find import packages… the module search path
  • modules: individual scripts
  • package: a collection of modules
  • Given this structure, if the pkg directory resides in a location where it can be found (in one of the directories contained in sys.path), you can refer to the two modules with dot notation (pkg.mod1, pkg.mod2) and import them with the syntax you are already familiar with
  • The file in a package will import the modules into the package namespace, so that we can directly use imported modules in the main function.


Finally, some useful scripts for inspection and cleaning:

  • Find all Python:  sudo find / -type f -executable -iname 'python*' -exec file -i '{}' \; | awk -F: '/x-executable; charset=binary/ {print $1}' | xargs readlink -f | sort -u | xargs -I % sh -c 'echo -n "%: "; % -V'
  • `whereis pip`   finds all pips

Use binary representation of int to represent subsets of n elements

For the past two months I spent many hours solving algorithmic problems. It’s fun, but very time-consuming. I think the majority of time is spent on trying to find accurate solutions and systematic ways for problem solving. I also spent a few weeks (!) grasping data structure fundamentals.

Anyway, this post just describes a technique I came into when reading this book , page 73, on using integers to find all possible subsets from a list of length n. Example Leetcode problem:

The key insights are:

  • Every int is binary represented in a machine’s memory as a sequence of 0’s and 1’s
  • The range of ints from 0 to 2**n-1, have included all possible combinations of 0’s and 1’s for a list of length n.
  • Think about every position’s value (0 or 1) as an indicator whether element at that position appears or not.
  • So for every int from 0 to 2 **n -1, we only need to extract all the positions that are 1, to get the combination of list elements this int represents
  • With bitwise operators & and shift operator <<, we can easily get whether the i-th element of an int is 1 (turned on) or 0 (turned off). Try X & (1 << i).
  • Looping through all positions from 0 to (n-1) we can get all valid bits for this int. We append the corresponding elements to the answer list corresponding to this int.
  • Finally, we collect every answer list to a big list and return.

Very smart and interesting strategy. Code snippet here:

This strategy can used for other exhaustive search problems that requires looping through all possible combinations of n elements, with exponential time complexity ( 2**n). But the very fast bit manipulation makes it evades the time limit.

Doctor AI: Using deep learning to automatically diagnoses diseases

(This post is based on a classmate’s and mine project in our first semester at NYU. I’ve been trying to write it into a blog post for three months and today finally decided to remove the task out of my to-do list…)

Everyone have heard about AlphoGo, the hallmark of current artificial intelligence (AI) research that can beat human Go champions.

But how about using AI to automatically diagnoses diseases? Or, at least, automatically narrows down the range of possible diseases and recommend suitable medical consultants? Imagine how much burden this will take off from the always over-worked doctors. Imagine how much benefits this will do to our health care system!

In this blog, we present a machine learning project that automate the identification of relevant diseases, by using “intake clinical notes” to predict the final diagnosis code. From this code, we can then identity which consultants are relevant.

#### Data Source
As with any other current machine learning systems (or really, even with human learning), an algorithm needs to see enough data before it can learn to make smart decisions.

So for our algorithm, we feed it with a public available medical dataset called MIMIC (‘Medical Information Mart for Intensive Care’). This dataset is composed by a group of MIT researchers, and comprises over 40,000 hospital admission notes for more than 30,000 adults and more than 7,000 neonates who are admitted to critical care unites at one hospital.

(It’s freely available at : in the form of a SQL database if you would like to take a look!)

The complete dataset has rich information, including vital signs, medications, laboratory measurements, observations and notes taken by care providers. However, for our purpose, we only used two variables in the dataset:

– medical discharge summaries for each patients. A discharge summary is a letter written by the
doctor caring for a patient in hospital. It contains important information about the patients
hospital visit, including why they came into hospital, the results of any tests they had, the treatment they received, any changes to their medication, etc.

– the ICD-9 diagnoses for patients. “ICD-9” stands for “International Classification of Diseases, Ninth Revision”. Basically, it is a three digit code for a disease. For example, 428 stands for “heart failure”.

#### Data Transformation / Feature Engineering
One interesting thing about this project is the model needs to learn from text. To tackle this challenge, we used techniques from natural language processing (NLP) to transform text into machine readable inputs – numbers. In the jargons of machine learning, this is also called “feature engineering”.

How to transform text into numbers? We first tried the following methods for data transformation.

* tokenize the input text. This means be break down sentences into sequences of words. For example, ‘She loves dog’ becomes a list of words, (‘She’, ‘loves’, ‘dog’). We used a python library NLTK (“Natural Language Toolkit”) to achieve this.

* turned all text into lower-cases, because we would like to prevent ‘She’ and ‘she’ being recognized as two different words.

* converted all numbers to the generic character \*, so that ‘2017-03-13’ becomes ‘\*\*\*\*-\*\*-\*\*’. This is because evert discharge summary contains at least one date, but numbers are not so useful in our prediction so we just mask the information.

After these steps, we will have every discharge summery turned into a list of words. The total corpus will contain a list of list, with every list being a discharge summary. Not bad!

Our next step would be to turn the lists of words into lists of numbers. For this, the method is called one-hot encoding, in that we construct a dictionary consisting of most common words in the **corpus**, and then replace every word in by its position in the dictionary. Easy!

For example, for this dataset we did the following:

* Construct a vocabulary consisting of words that appeared at least 5 times in the dataset. We basically discard the rare words. After this step, we have a vocabulary of 38,119 unique words.
* Use these 38,119 unique words to index each word in individual list of words. For example, if “the” appeared in the first position, it’s going to be replaced by the number ‘1’.
* After creating the features, we transform each piece of text summary into numbers, by assigning each word (token) with its count in the document.

For a simple example, consider the text: ”harry had had enough”. A Counter object on the tokenized text would yield (’had’: 2, ’harry’: 1, ’enough’: 1). So, this sentence will be transformed into a vector (2, 1, 1).

We also did other preprocess techniques to handle challenges in this dataset, including rate words, out-of-vocabulary words, but these techniques are omitted in this post for simplicity.

#### Model
We used a “deep learning” framework to handle this task. If that sounds pretentious, you are probably right! In fact, “deep learning” is just a fancy name of saying “neural networks with many layers”, while “neural network” is just a fancy way of saying “nested functions”. So nothing mysterious here.

We used a Hierarchical-Attention GRU (HA-GRU) model for this task (Yang, 2016; Baumel 2017). This model’s main characteristics are that 1) it uses a hierarchical model structure to encode at sentence level and document level separately, and 2) it has two levels of attention mechanisms applied at the word and sentence-level. All data manipulation and model training was done in Python 3.6.0, Pytorch 0.2.0.

To understand this model we needs to understand two components: GRU model and attention mechanism. There are some complicated mathematical constructs, but the idea of this post is to describe the main ideas without digging into too much detail, and point to other helpful sources if necessary.

Gated Recurrent Unit (GRU) is a type of Recurrent Neural Network (RNN) models, which is a type of neural networks. Let’s start from the simplest construct: neural networks.

A neural network is a fancy way of saying “function approximation”. For example, suppose we have two variables, $X$ and $Y$. $X$ is a vector of length 3, $Y$ is a vector of length 2.

We would like to discover the relationship between $Y$ and $X$ in the form of $Y = f(X)$.

We might first do a linear transformation on $X$ and get $g(X)$, then throw the $g(X)$ into another non-linear function $z(\cdot)$, and get the final result $z(g(X))$.

So in the end, we will have $Y = f(X) = z(g(X))$.

A neural network model is essentially a graphical way of presenting this mathematical transformation. If we present the previous mathematical formulation into a neural network, it will look like the following:


Here is the coorespondence:
– the first three nodes are our $X$, since dimension is three we have three nodes.
– the second two nodes are our transformed hidden vector$h$. The dimension of hidden vector is two, so we have two nodes.
– the final two nodes are our final predicted vector $Y$. The dimension is two, so we also have two nodes.

With the above understanding of simple neural networks, a recurrent neural networks adds autoregression into the formula by passing the output of every simple neural network to the input of the next neural network. The idea is somewhat similar to an autoregressive time series model. Such architecture helps dealing with long term dependencies.

To understand more about neural networks, the following blog posts are excellent resources:

Attention mechanism, on the other hand, is based on recurrent neural networks. The mechanism is multiplying outputs at multiple time stamp by a weight matrix that might or might not be influenced by the output. There are many implementations of attention mechanism, while the following posts also serve as great introduction to this topic:

After understanding recurrent neural networks and attention, the specific architecture of our model – HA-GRU is easy to follow!

Basically, this model is a stack of two recurrent neural networks.
At the first level we encode every sentence in a document into a vector. This is achieved by first embedding words into word vectors, then applying a bi-directional GRU (word GRU) with neural attention mechanism on the embedded words. The output of this step is sequences of sentence vectors.

At the second level, we repeat the process of a recurrent neural network plus attention mechanism, but this time at sentence level.

### Results
We train our HA-GRU model with different parameter configurations of embedding layer, word bi-GRU layer and sentence bi-GRU layer. We use mini-batch gradient descent for numeric optimization.
After first running the model on our full dataset
(over 29,000 summaries and 936 labels), we noticed that, because of the extreme sparsity of the label set, the results were negligible. So we decided to test a smaller label set and finally achieved an accuracy ranging from 0.01-0.04. In other words, our model does not seem to make good predictions.

### Discussion
While our results were not great, the project was still a good exercise for thinking about, and testing how artificial intelligence models can help with medical practices. It also shows that obviously, our Dr. AI has a large room of improvement!

Two approaches for logistic regression

Finally, a post in this blog that actually gets a little bit technical …

This post discusses two approaches for understanding of logistic regression: Empirical risk minimizer vs probabilistic approaches.

Empirical Risk Minimizer

Empirical risk minimizing frames a problem in terms of the following components:

  • Input space X \in R^d. Corresponds to observations, features, predictors, etc
  • outcome space Y \in \Omega. Corresponds to target variables.
  • Action space A_R  = R Also called decision function, predictor, hypothesis.
    • A sub element in action space could be hypothesis space: all linear transformations
  • Loss function: l(\hat{y}, y) a loss defined on the predicted values and observed values.

The goal of the whole problem is to select a function mapping $F$ in action space that minimizes the total loss on sample. This is achieved by selecting the value of the parameters in $f$ such that it minimizes the empirical loss in the training set. We also do hyperparameter tuning, which is done on the validation set in order to prevent overfitting.

In a binary classification task:

  • Input space X \in R^d.
  • outcome space Y \in {0, 1}. Binary target values.
  • Action space A_R  = R The hypothesis space: all linear score functions
    • F_{score} = {x \rightarrow x^Tw | w \in R^d}
  • Loss function: l(\hat{y}, y) = l_{logistic}(m) = \text{log} (1 + e ^{-m})
    • This is a kind of margin based loss, thus the m here.
    • Margin is defined as \hat{y} y, which has interpretation in binary classification task. Consider:
      • if m = \hat{y} y > 0 , we know we have our prediction and true value are of the same sign. Thus, in binary classification, we could already get the correct result. Thus, for m > 0 we should have loss = 0.
      • if m = \hat{y} y < 0 , we know we have our prediction and true value are of different signs. Thus, in binary classification, we are wrong. We need to define a positive value for loss function.
        • In SVM, we define hinge loss l(m) = \text{max}(0, 1-m), which is a “maximum-margin” based loss (more on this in the next post, which will cover the derivation of SVM, kernel methods) Basically, for this loss, we have when m \geq 1 no loss, $\latex m < 1$ loss. We can interpret m as “confidence” of our prediction. When m < 1 this means a low confidence, thus still penalize!
    • With this intuition, how do we understand logistic loss? We know:
      • This loss always > 1
      • When m negative (i.e. wrong prediction), we have greater loss !
      • When m positive (i.e. correct prediction), we have less loss…
      • Note also for same amount of increase in m, the scale that we “reward” correct prediction is less than the scale we penalize wrong predictions.

Bournoulli regression with logistic transfer function

  • Input space X = R^d
  • Outcome space y \in {0, 1}
  • action space A = [0, 1] An action is the probability that an outcome is 1

Define the standard logistic function as \phi (\eta) = 1 / (1 + e^{-\eta})

  • Hypothesis space as F = {x \leftarrow\phi (w^Tx) | w \in R^d}
  • Sigmoid function is any function that has an “S” shape. One example is the simple case of logistic function! Used in neural networks as activation function / transfer function. Purpose is to add non-linearity to the network.

Now we need to do a re-labeling for y_i in the dataset.

  • For every y_i = 1, we define y'  = 1
  • For every y_i = -1, we define y'  = 0

Can we do this? Doesn’t this change the value of y-s ? The answer is , in binary classification ( or in any classification), the labels do not matter. Instead, this trick just makes the equivalent shown much easier…

Then, the negative log likelihood objective function, given this F and dataset $laex D$, is :

  • NLL(w) = \sum_i^n [-y_i ' \text{log} \phi (w^T x_i)] +(y_i ' -1) \text{log} (1 - \phi (w^T x_i))

How to understand this approach? Think about a neural network…

  • Input x
  • First linear layer: transform x into w^Tx
  • Next non-linear activation function. \phi (\eta) = 1 / (1 + e^{-\eta}).
  • The output is interpreted as a probability of positive classes.
    • Think about multi-class problems, the second layer is a softmax — and we get a vector of probabilities!

With some calculation, we can show NLL is equivalent to the sum of empirical loss.



Encoding categorical features: likelihood, one-hot, and feature selection

This post describes techniques used to encode high cardinality categorical features in a supervised learning problem.

In particular, since these values cannot be ordered, the features are nominal. Specifically, I am working with the Kaggle competition here. The problem with this dataset is that some features (e.g. types of cell phone operating systems) are categorical and has hundreds of values.

The problem occurs in how to fit these features in our model. Nominal features work fine with decision trees (random forests), Naive Bayes (use count to estimate pmf). But for other models, e.g. neural networks, logistic regression, the input needs to be numbers.

Before introducing likelihood encoding, we can go over other methods in handling such situations.

Likelihood encoding 

Likelihood encoding is a way of representing the values according to their relationships with the target variable. The goal is finding a meaningful numeric encoding for a categorical feature. Meaningful in this case means as much related to the output/target as possible.

How do we do this? A simple way is 1) first group the training set by this particular categorical feature and 2) representing each value by within group mean of that value. For example, a categorical feature might be gender. Suppose the target is height. Then, we might have the average height for male is 1.70m, while the average height for female is 1.60m. We then change ‘male’ to 1.70, while ‘female’ to 1.60.

Perhaps we should also add some noise to this mean to prevent overfitting to training data. This can be done by :

  • add Gaussian noise to the mean. Credit to Owen Zhang :
  • use the idea of “cross-validation”. So here, instead of using the grand group mean, we use the cross-validation mean. (Not very clear on this point at the moment. Need to examine the idea of cross-validation. Will write in the next post.) Some people propose on Kaggle about using two levels of cross-validation: 

One hot vector

This idea is similar to the dummy variable in statistics. Basically, each possible value is being transformed into its own columns. Each of these columns will be a 1 if the original feature equals this value, or 0 if the original feature does not equal this value.

An example is for natural language processing models, the first step is usually 1) tokenize the sentence and 2) constructing a vocabulary and 3) map every token to an index (a number, or a nominal value, basically). After that, we do 4) one hot encoding and 5) a linear transformation in the form of a linear layer in a neural network (basically transform high-dim one hot

After that, we do 4) one hot encoding and 5) a linear transformation in the form of a linear layer in a neural network (basically transform high-dim one hot vectors into low dim vectors). In this way, we are basically representing every symbol in a low dimensional vector. The exact form of the vector is learned. What is happening here is actually dimension reduction.  So, after learning the weighting matrix, other methods, like PCA, can potentially work here as well!


A classmate named Lee Tanenbaum told me about this idea. This is an extension on the one-hot encoding idea. Suppose there can be n values in the feature. Basically, we use two hash functions, hash the possible values into two variables. The first hash all values into \sqrt(n) number of baskets,basketh baskent there are \sqrt(n) number of feature values. All feature values in the same busket is going to be the same for variable A. Then, we use a second hash function, that carefully hash the values into another busket variable B. We want to make sure the combination of A and B can fully represent every possible value in the target feature. We then learn a low-dim representation for both A and B, and concantenate them together.

My opinion on this is, this is still one-hot encoding + weight learning. However, we are forcing certain structures onto the weight matrix.

Feature selection 

Still based on one-hot encoding. However, instead of compressing everything into a tiny low-d vector, we discard some dummy variables based on their importance. In fact, LASSO is exactly used for this! L1 usually drives the coefficient of some features to zero, due to the diamond shape of the constraint. Source:

  • On why l1 gives sparsity: video here :
  • Course here : Statoverflow answer here:

Domain specific methods

These models exploit the relationship between these symbols in the training set, and learn a vector representation of every symbol (e.g. word2vec). This is certainly another way of vectorizing the words. Probably I will write more about this after learning more on representation learning!

Python generators

This short post describes what is a generator in Python.

A function with yield in it is a function. However, when called the function, it returns a generator object. Generators allow you to pause a function and return an intermediate result. The function saves its execution context and can be resumed later if necessary.

def fibonacci():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

g = fibonacci()

[next (g) for i in range(10)]

This will return [1, 1, 2, 3, 5, 8, 13, 21, 34, 55].

When call the list comprehension again, it will return:

[next (g) for i in range(10)]

[89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765]

Here, note the function is like a machine that can generate what you want. It will also remember its last state. This is different from a function that returns a list, which will not remember its last state..

Python generators can also interact with the code called with the next method. yield becomes an expression, and a value can be passed along with a new method called sendHere is an example piece of code:

def psychologist():
    print("Please tell me your problem")
    while True:
        answer = (yield) # note the usage of yield here 
        if answer is not None:
            if answer.endswith("?"): # note it's endSwith, the s there
                print("Don't ask yourself too many questions.")
            elif "good" in answer:
                print("A that's good, go on. ")
            elif "bad" in answer:
                print("Don't be so negative.")

This defines a function that can return a generator.

free = psychologist()


This will return “generator”


This will return “Please tell me your problem.”


This will return “Don’t ask yourself too many questions.”

free.send("I'm feeling bad.")

This will return “Don’t be so negative.”

free.send("But she is feeling good!")

This will return, “A that’s good, go on.”

send acts like next, but makes yield return the value passed. The function can, therefore, change its behavior depending on the client code.