Monthly Archives: December 2018

重读 hierarchical attention network for document classification

这篇文章的 key idea 是,把关于文档结构的层级结构信息加入模型,有助于生成更好的文本表征。

这里的文档结构主要是说文章由句子组成,是层级结构。之前的方法是把所有句子连成一起输入一个 RNN 模型,这样其实丢失了段落这样的层级结构。

相应地,另一种方法是把每个句子里的词先分步输入一个 RNN 模型,生成句子表征;再将所有的句子表征输入另一个RNN模型,生成文本表征;最后再用文本表征进行分类。这样的方法,神经网络之间并不共享参数,而且两次输入RNN,故可以捕捉到层级结构。




  • 将每个句子里的词向量送入 GRU (see last post on explanation on what is a GRU),收集每一步输出的 hidden state (因而不能直接调用 pytorch nn.GRU 函数而需要稍作改变写个 for-loop 并把结果存起来)
  • 把所有的 hidden state 送入MLP,生成对词向量的表征
  • 随机初始化一个 context vector,这个 context vector 的意义是句子的含义。用它和每个向量的表征求点积,代表 attention score。score 越高,说明两个向量越相似,也就说明这个词在这个句子里有更显著的意义。因此给它的 attention weight 也就应该比较高。
  • 将 attention score 送入 softmax 函数求得权重。用这个权重和原始的 hidden states sequence 求 weighted sum 得到整个句子的表征。


  • 我们一共只训练两套 GRU 单元,一个负责总结句子,一个负责总结段落。因此所有的句子必须一样长,所有的段落必须有一样长度的句子。因此在预处理时,过长的句子被剪掉,过短的句子被补足。具体的长度选取可以看训练数据中长度的分布,用 qunatile 选择。
  • 将数据划整齐后,首先用上述方法得到每个句子的表征。
  • 其次,针对每个段落,再把所有的句子表征送入GRU,得到段落表征。
  • 最后就可以用这个表征做分类了。


2D convolution with tiling technique on GPU

This post describes how to write CUDA C code to perform 2D convolution on GPU with tiling technique. It is very brief, only covers basic concepts but with links to code examples. We are not going to use cuDNN, only the bare bones of CUDA. But here is an excellent post if you want to use cuDNN – for this particular task, cuDNN is slower than the code I will present below.

You can find the source code for this post here.

This post will consist of the following sections:

  • Installation CUDA
  • Basic GPU programming model
  • Vanilla convolution on GPU
  • Constant memory in GPU
  • Tiling technique and indexing
  • Link to code example

Install CUDA

First, make sure if you have a NVIDIA GPU on your machine. Here is how. Or just search the model online and ask on reddit 🙂

Next, follow the official NVIDIA guide here to download CUDA Toolkit. Note not every card support every version of CUDA kit.

Note also the download path, as you will need to specify that to the compiler later. If you are using Linux you can type whereis nvcc to find out. nvcc is the NVIDIA C compiler.

It’s easy to get wrong. Just follow carefully every step in the official installation guide. In particular don’t forget to finish the post-installation set-ups! Wrong PATH environment variables can be a real pain.

Basic CUDA Programming model

The most prominent characteristic of CUDA programming is parallization. When you write one line of code in a CUDA kernel (explained later), it will be executed by thousands of threads on different data. This is called the SIMD (same instruction multiple data) parallel paradigm.

With this idea in mind, here is what a typical CUDA C code look like:

  • Setting up the program on the host (CPU) by initializing variables and populate them. Write non-parallel sequential code.
  • Preparing for GPU work by initializing device (GPU) variables that reside in device memory, and copy contents from the corresponding host variables to it.
  • Write a GPU kernel function to deal with the parallel part of computation (e.g. convolution).
  • Set up the parallel configurations by specifying how many blocks and grids are used for the kernel. These two variables will decide how many threads we started for parallel execution.
  • Call the kernel from host code. Store the results in a device variable.
  • Copy contents from device to the host.

Some concepts to clear:

  • How many threads we use in the GPU is controlled by Grid size and Block size. A grid consists of many blocks. Grid and Block are both 3D structures. For example, a grid of dimension <2, 2, 2> will have 8 blocks in total, and a block of dimension <2, 2, 2> will have 8 threads in total. A block cannot have more than 1024 threads.

Here is a basic matrix addition example:


// function declarations 
__global__ void vecAddKernel(float * a, float * b, float * c, unsigned int N);

// main function 
int main()
    int N = 10;    // length of vector 
    float * a, * b, * c;  // a and b are vectors. c is the result
    unsigned int size = N * sizeof(float);  // number of bytes to allocate 
    a = (float *)calloc(N, sizeof(float));
    b = (float *)calloc(N, sizeof(float));

    int i = 0;
    float sum = 0;
    for (i = 0; i < N; i++){
        a[i] = (float)i / 0.23 + 1;
        b[i] = (float)i / 5.89 + 9;
        sum += a[i] + b[i];

    c = (float*) malloc(size);
    // 1. allocate memory on CUDA
    float * d_a, * d_b, * d_c;   // device memory 
    cudaError_t err1 =  cudaMalloc((void **) & d_a, size);
    cudaError_t err2 = cudaMalloc((void **) & d_b, size);
    cudaError_t err3 = cudaMalloc((void **) & d_c, size);
    if (err1 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err1), __FILE__, __LINE__);
    if (err2 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err2), __FILE__, __LINE__);
    if (err3 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err3), __FILE__, __LINE__);
    // copy memory 
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. operate on kernels 
    vecAddKernel<<<ceil(N/256.0), 256>>>(d_a, d_b, d_c, N);

    // 3. copy the results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    cudaError_t error = cudaGetLastError();
        fprintf(stderr,"ERROR: %s\n", cudaGetErrorString(error) );
    float cuda_res = 0;
    for(i = 0; i < N; i++){
        printf("%.2f\t", c[i]);
        cuda_res += c[i];
    printf("Results from host :%.2f\n", sum);
    printf("Results from device:%.2f\n", cuda_res);


    return 0;

void vecAddKernel(float * a, float * b, float * c, unsigned int N){
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i<N)  c[i] = a[i] + b[i];


Vanilla convolution on GPU

Where to parallelize for convolution? We use every thread to correspond to one single element in the output, and all threads do the same convolution operation but with different data chunks.

Constant memory in GPU

Constant memory variables are shared by all threads in all blocks. This memory is small but fast. So we should put the filter in it because everyone needs it, and it’s small.

Tiling technique and indexing

Tiling is making use of the shared memory on GPU. Shared memory is shared by all threads in the same block, it is small but fast. So the idea is for every thread in a block, they will operate on data regions that have some overlaps. And we make use of that to copy the small region of the larger picture that these threads need into the shared memory and dealing with them.

We use all threads to copy these data elements. Here is a further illustration of this tiling idea.

Code example



Here is the tiling organization :

Understanding GRUs

This post follows this post last year about vanilla Recurrent Neural Network structures.

One of the ultimate goal of a recurrent neural network is to summarize a sequence of vectors into a single vector representing all the information.

In the simplest case, imagine for the past month, every day  you and your friend did some push-ups. Let’s denote your number as x_i at day i, and your friend’s number as y_i at day i. Now a month has passed and you two want to compare who is more keen on working out.

So what you do you? Directly comparing two vectors is hard, because there is no obvious “greater-than” relations across dimensions? One way is to calculate the simple average of the two sequences, i.e. on average you did \bar{x} and your friend did \bar{y} per day. Whichever number is larger, you can bet that person is a better athlete!

Now in this example, time order does not really matter, because we are treating numbers from 30 days ago as the same as number yesterday. Another example where time does matter is interests from bank savings. Suppose every year you save b_i amount of Yuan into the bank, and the bank pays out interests at the end of year, once per year. How much will your total savings be at the end of the fifth year? You can image the oldest money will receive the highest interests, because of confounding.

With the same logic, we can apply this idea to summarizing sentences. Suppose every word can be represented as a n-D vector in a semantic space. How do we produce one single vector that represent the meaning of this sentence?

Taking average is a way, but note every word will have some influence on the words after it. e.g. “this burger is delicious.” Note how the “burger” constrains word choices after “is” … And that’s why we need recurrent neural network : at every step of the model, the hidden vector (which contains information about previous words) will concatenate with the current word, and becomes the input into producing the next hidden vector. So every word will have a lasting influence in the final output.

However, simple RNN has a vanishing (explosive) gradient problem. Mathematically, the older a word is, the higher order it will has on the multiplication factor of weight matrix. When taking first-order gradients, the weight matrix will have an order of n-1. Now if one term is larger than 1, as n approaches infinity this will will approach infinity too. If one term is smaller than 1, as n approaches infinity this will will approach zero and thus the model will “stop learning” (i.e. weights will not update).

Let’s formulate the intuition above more formally. For a simple RNN, every update we have

h_t =g (W x_t + U h_{t-1} + b)

Here g can be any non-linear activation function, e.g. a RELU or a sigmoid. Now we consider the gradient of h_t with regard to W.


h_t = g (O_t)


O_t =W x_t + U h_{t-1} + b

Using the chain rule we have:

\frac{\partial h_t}{\partial W} =\frac{\partial h_t}{\partial O_t} \frac{\partial O_t}{\partial W}

We also know:

\frac{\partial O_t}{\partial W} = X_t + U_h \frac{\partial h_{t-1}}{\partial W}

So plug in the second equation above into the first equation, we have:

\frac{\partial h_t}{\partial W} = {\partial g} \cdot (X_t + U_h \frac{\partial h_{t-1}}{\partial W})

We can already see a recurrent relation between \frac{\partial h_t}{\partial W} and \frac{\partial h_{t-1}}{\partial W}. To make things clear, write \frac{\partial h_t}{\partial W} = \alpha_t, and expand the equation a few more steps:

\alpha_t = {\partial g} \cdot (X_t + U_h \alpha_{t-1})

\alpha_t = {\partial g} \cdot (X_t + U_h \cdot \partial g \cdot (X_{t-1} + \alpha_{t-2}) )


\alpha_t = C + (\partial g U_h)^{n-1} \alpha_0

Here $C$ represent the other terms (with lower order of \alpha_t-s) in the formula that we omitted for simplicity. So we can see, as n increases, if norm of \partial g U_h is greater than 1 this term will approach infinity, or if is less than 1 it will approach zero.

The main reason is the same term is multiplied n-1 times. i.e. information always flow at the same rate every step. If you think about a sentence, this does not really make sense as words meaning / importance does not really decrease (increase) exponentially w.r.t. it’s distance to the end of the sentence. The first word might very well be very important (e.g. a female name which will influence later pronoun choices “he” or “her”), but this vanilla RNN structure is too simple to capture that. We need more dynamic structures to allow information flow freely between time stamps.

Gated recurrent unit solves the problem by adapting different terms in every update step. There are two key ideas:

  • Introduces an external “memory” vector to capture long distance dependencies.
  • Allow error messages to flow at different strengths depending on inputs.

It achieves this by first computing two “gates”:

update gate : z_t = \sigma (W_z x_t + U_z h_{t-1} + b_z)

reset gate: r_t = \sigma (W_r x_t + U_r h_{t-1} + b_r)

They are both continuous vectors of the same length as the hidden state, constructed by passing the current word x_t and the last hidden vector h_{t-1} through an MLP. The sigmoid function makes every element between 0 and 1, so when used to perform element-wise multiplications \o, these two gates essentially controls how much information “flows” through.

There is also a candidate vector representing “new memory content”:

\tilde{h_t} = \text{tanh} (W_h X_t + r_t \circ (U_h h_{t-1} + b_h)

The usage of tanh and W_h X_t is pretty standard as in all other unites. But note the meaning of reset gate r_t here. If r_t is close to zero, then we “forget” all the previous information in the h_{t-1} vector, and just use the current word information.

An example where the reset gate is useful is sentiment analysis on a movie review, where the author spend many sentences describing and summarizing the movie plot, but conclude that “But the movie is really boring”. With this sentence, all the previous information become useless. This reset gate can help the model “forget” the previous summaries.

Finally, we update the new hidden vector as :

h_t =  z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h_t}

You can see the update controls how much information is remembered from old state, and how much information is acquired from the new memory content.

A caveat is why do we need the reset gate at all now that we have the update gate? Doesn’t we already have a gate to control how much old information is retained by using an update gate? Doesn’t the update gate also have the ability to eliminate all historical information embedded in h_{t-1}? I did not find satisfactory information about this point online.

Now think about why GRU can solve the vanish gradient problem. The update gate allows us to retain all previous information by setting all elements of z_t to 1, and h_t is exactly the same as h_{t-1}. In this case, we do not have the vanishing gradient problem because no weight matrix is multiplied.

For example, in the same movie review suppose the author says “I love this movie so much” and then goes on to explain the movie plot. Now the first sentence is really important, but with recurrent neural network, we update the gradient at every step and the content will be washed out as time passes. But now the update gate allows the model to retain the first sentence without exponentially decay its content.

Units with short term memories usually have reset gate very activate (i.e. numbers close to 1), and thus forget most about the past…

If reset gate is entirely 1 and update gate is entirely zero, then we just have a standard RNN.

Why tanh and sigmoid?

  • In theory tanh can also be RElu, but in practice it just performs better.
  • Sigmoid will control value between 0 and 1, but again no formal justification.

Finally, here is an illustration that pictures what’s going on in a GRU unit: