Author Archives: Yuqli

Preparing 3D building data from CityGML for machine learning projects


I spent the last week of 2018 and the first week of 2019 preparing CityGML data for a machine learning project. In particular, I need to extract 3D point cloud representation of individual buildings in New York (and Berlin and Zurich) so they can be the training / validation / test data.

During the process, I had to pick up some knowledge on 1) the Linux file system / disk management, 2) PostgreSQL, 3) CityGML (a data format for 3D city maps) and 4) FME (a data warehouse software for ETL) and 5) PCL command tools and 6) shell programming. I acquired these bits of knowledge by asking on Stackoverflow, FME forums, and the issue forum of corresponding open source tools on Github. Most importantly, thanks to the help from people all around the world (!), I am finally able to figure it out. This post documents the necessary steps to finish this task.

Since there are many details involved, this post will mostly point the readers to the tools I used and places to find how to use them. It covers the following sections:

  • CityGML : what is it and database setup
  • FME: the angel that does the heavy lifting for you for free, from CityGML to mesh
  • PCL command tools: a not-so-smart way (i.e. the Engineering way) from mesh to point cloud

1. CityGML Setup

1.1 What is CityGML

CityGML is a markup format to document 3D city maps. It is a text document, basically telling you where a building is, what a building is composed of, where a bridge is, what’s the coordinate of a wall surface, where are the lines of a wall, etc…

CityGML is based on XML and GML (geometric markup language). XML specifies document encoding scheme, e.g. class relations, labels and tags… kind of like HTML. GML extends XML by adding sets of primitives, like topology, features, geometry. Finally, CityGML is based on GML but with more constraints that are specific to cities, e.g.

A 3D city model can have various level of details (LoD). LoD1 means the map is only 2D. LoD2 means 3D objects will be extruded from 2D maps into 3D shapes. LoD3 means a building will have windows, roofs and other openings. LoD4 means a building will have interior furnitures.

For example, part of a sample CityGML file might look like this :

This specifies several surfaces of a building.

A sample visualization of a CityGML LoD4 model might look like this:

You can see furnitures of the building in the main Window.

References to know more about CityGML:

  • Groger and Plumer (2012) CityGML – Interoperable semantic 3D city models. Link. This paper goes over details of CityGML and is a more organized guide than the official CityGML website.
  • CityGML website for downloading data for specific cities. Many of the Germany cities are available in CityGML format.

1.2 3D City Database

3D City Database is a tool to store geographic databases, and it provides good support for CityGML data.

Now why do we need a database to store CityGML data? Because a city map might be very large and structured, e.g. New York LoD2 city model is ~30G. Such a large file cannot be easily manipulated by text processing tools (e.g. gedit) or visualized because of memory constraint. More, the 3D city database can easily parse modules and objects in the CityGML file and store them in structured ways. So if someone asks the question, how many buildings are there in New York? This question is hard to answer with only the CityGML file, but very easy to answer with a SQL query run on the 3D city database.

To use the 3D City Database, we need to go over the following steps:

1. Set up a PostgreSQL database locally or on a remote server. This database tool also supports Oracle, but I did not make that work. PostgreSQL is a free database tool available for Linux. The documentation is good for specific queries, but if you are new to practical database management, this book might be more helpful in offering a general road trip.

Specifically, on Linux you need to first make sure you have a disk that has at least 50 GB of free space, and your current user can own directory on that disk. I spent 4 days trouble shooting this because the hard disk on my Laptop has a file system that is actually Windows NTFS, so my sudo user cannot own directories there. Commands like chmod or chown did not work. To solve that problem, I had to back up everything (compress and upload to Google Drive) and reformat that disk into Linux File system. Useful commands and tools here:

  • parted : a tool to partition a new disk. A disk has to be partitioned before it can be mounted and used.
  • mount : add a disk’s file system to the current operation system file system tree
  • lsblk : list block and check their file systems
  • mkfs: make filesystem. ext4 is a linux format.

Linux wrapper tools useful for PostgreSQL actions:

  • pg_createcluster: create a cluster. Used because the default PostgreSQL directories lie somewhere under /home, and this disk is usually small. If we want a cluster in another location should use this command. Note the path should be absolute path instead of relative path.
  • pg_cltcluster : start / stop / drop a cluster.
  • psql: used to connect to a running server. Note PostgreSQL has a default user “postgres”, and 1) you need to set up the password for it before a database can be connected from other clients (e.g. 3D database importer / exporter) and 2) this “postgres” user is not normally logged into directly, rather use “sudo -u postgres psql -p” to only use it to log into postgres server.

2. Connect the 3D City Database tool to the PostgreSQL database. Download the 3D City DB tools here. After running the install .jar file, you will notice there are basically a few sets of tools, of which we need to use two:

  • The SQL and Shell scripts used for setting up a geographic database
  • The importer / exporter used for importing data into the database

Before diving into details, here are two helpful resources you should check out for specific questions:

  • this documentation is very helpful and one actually needs to read it to proceed… i.e. no other better online documents…
  • This online Q&A on Github is actually active, the developers will answer questions there. I discovered it too late !

After getting the documentation, these sections are helpful:

  • Step 1 is to set up a PostgreSQL database using the scripts included in this 3D DB tool to set up all the schemas, this starts from page 102 / 318 of the version 4.0 documentation. Note you will need to create a postgis extension, details here. On Ubuntu you can get it with “apt install postgis” or something like this.
  • Step 2 is to use the importer / exporter to connect to that database and import data. The importer/exporter located in the “bin” folder, and you just run the .sh file to start it. Details can be found on page 125 / 318 of the version 4.0 documentation.

3. create tiled version of NYC data Some CityGML data is too large to load for other softwares, and the importer / exporter can help create tiles. Details on this are on Documentation Page 142 / 318 and 171 / 318. Essentially, we need to first read everything into a database, then output different tiles of the map. Note we should set up one database for one city.

On a high level, to set up the tiled export a user needs to 1) activated spatial indices under the “database” tab and 2) specified number of bounding boxes in the “preference” -> “Bounding Box” tab and 3) specified the bounding area in the “Export” tab. These would be enough for the program to start tiled export.

The documentation is not very clear on the details. Here are also two Q&As of help I got from other users:


2. FME Workbench: from cityGML to mesh

After the last section, we should already have a bunch of .gml files that are maps of different regions of a city. You can visualize a .gml file using FME software.  FME is a data warehouse tool and has two tools: FME workbench and FME data inspector. Data inspector is just for looking at data, while workbench can be used to edit data. You will need a license to open the software, but a student account can apply for free trial. The application process takes a few hours. Note you need to register your code on the FME licensing assistant and download the license file, or else the activation will void next time you open it.

For example, this is a region of Zurich looks like:

The next step is to extract individual buildings from it, as well as converting them into mesh files like .obj. The basic idea is:

  • First identify and output individual buildings from a single .gml file. This is done by a fanout operation.  You will also need an aggregator because a “building” consists of 1) roof surfaces, 2) wall surfaces, and 3) ground surfaces. Buildings can be identified by a parent id.
  • Second you will need to automate the process with all .gml files. This is done through setting up a file and document reader… i.e. you will need another FME workspace to read in different files in the same directory, and call the sub-workspace from this parent workspace.

FME also has a little bit of learning curve. To understand the above two operations, you might need to know:

Finally, here is a Q&A I posted on FME forum and people helped pointing me to the correct places.

Eventually, individual buildings will look like this:


3. PCL command line tool: from mesh to point cloud

The last step is to create point cloud from mesh, because we want to test methods specifically applicable to point cloud. I found the following tools :

  • This tool that someone gave me: It uses Poisson disk sampling. This tool have trouble processing 2D meshes (some building data seems to be dirty) as the function won’t return. I tried python package “signal” and “multiprocessing” to kill timeout function but neither work. So I gave up with this tool.
  • Cloud Compare: An open source file to edit cloud files. This tool has a command line tool but reports error when I try to save point cloud… So I gave up with this tool too.
  • PCL : Point Cloud Library.
  • pyntcloud: a python library. It seemed bit troublesome to read .obj instead of .ply files. It seems it mainly supports .ply files so I gave up with this tool

I consulted this Stackoverflow Q&A 

I eventually settled with use pcl_mesh_sampling. It can be installed on Ubuntu 16.04 with apt-get pcltools or something like that. The command is very easy to use: the first argument is the input file name the second argument is the output file name. Then you specify the number of points to sample.

A remaining issue is how to automate it. Since it is a command line a natural way is to use bash script, so you need some string manipulation in bash. A bigger problem is every time this tool generates and saves a point cloud, it will visualize it using a window. Until you close that Window, the process won’t finish. So we need to automatically “close” the window with anther shell script that calls xdotool at a fixed time interval to automatically close the specified Window. Note we cannot use the “windowkill ” option for xdotool but need to simulate the key stroke alt+F4 (i do not know why). Full command is

xdotool search “$WINDOWNAME” windowactivate –sync key –window 0 –clearmodifiers alt+F4

This Q&A is helpful (others are less).

Again, any questions please direct to…

Distributed Stochastic Gradient Descent with MPI and PyTorch

This post describes how to implement stochastic gradient in a distributed fashion with MPI. It will cover the following topics in a high-level fashion, as it is challenging to cover every details in a single post. I will point to other resources helpful in understanding key concepts.

  • What is MPI
  • How to do distributed SGD
  • PyTorch modules necessary for writing distributed SGD and how to design the program
  • Engineering caveats for building PyTorch with MPI backend 

What is MPI

I understand MPI as a programming framework that handles communication between computers. 

It’s like the lower-level software underneath MapReduce, another programming framework for distributed computing. In MapReduce you need to specify a map operation and a reduce operation and that’s all. The system take care of the allocation of workers and server and memory, which can of course taken care of by meta-information input into a shell script. The main challenge is designing the parallel algorithms at the algorithmic level, for example how to parse the two matrices when doing matrix multiplication, or how to design the mapper and reducer for min-hash and shingling documents. 

But in MPI, the programmer needs to specify the worker and server by their ID, and actively coordinate the computers, which probably requires another set of algorithms. For example, if many workers need to communicate with the server at the same time but the server cannot serve them all, there needs to be a scheduling algorithm e.g. round-robin. In MPI, the programmer needs to code the scheduling by themselves so that messages from different workers won’t be mixed. We’ll see an example later in the asynchronous SGD. 

MPI has many concrete implementations, for example OpenMPI and MPICH. 

MPI has the following basic operations:

  • point-to-point communications: blocking and non-blocking. This enables a single computer to send message to another, and another computer to receive message. blocking means the process is blocked until the send / receive operation is finished, while block means the process is returned immediately, not wait till the process is finished.
  • collective communications : examples are reduce and gather.

Here is a good place to learn basic operations of MPI. You need to go through all tutorials on that website to be able to understand and write a project that does distributed SGD with PyTorch.

How to do distributed SGD

There are two types of distributed SGD, depending on when to synchronize gradients computed on individual workers.

  •  synchronous SGD:
    • all workers have a copy of the model. they are all the same
    • at every mini-batch, all workers compute their share of gradient, and then compute average
    • the model update on every worker
    • then move on to the next batch
    • if there is a server,
      • the server has a copy of the model
      • the workers also have copies of the model
      • workers are assigned data to calculate forward and backward
      • gradients are sent to the server to take care of the averaging
      • workers receive updated gradients from the server
    • if there is no server, the gradient is calculated in all_reduce. all_reduce can (but not necessarily) be implemented in the ring algorithm: every model send the results to its neighbor ?
  • asynchronous SGD:
    • all workers have a copy of the model, but they are not the same. the model is inconsistent across workers
    • at every mini-batch:
      • workers get parameters from the server
      • workers get data from its own data loader, or randomly selected dataset (is there an epoch concept anymore?)
      • workers calculate forward and gradient
      • once the calculation is done, gradient is sent to the server,
      • the server updates the parameters

Now we need to conceptually understand how this workflow can be implemented using data structure and APIs provided by PyTorch. To do this, let’s first analyze how a single machine SGD is implemented in PyTorch.

In particular, in a single machine SGD implemented with PyTorch, 1) the output of a mini-batch is calculated by calling the forward function of the model, 2) the loss is calculated by sending the output and the target to a loss function, and 3) the gradient is calculated by calling loss.backward, and 4) the update to parameters is done by calling optimizer.step(), and we would pass the model parameters to the optimizer beforehand.

Now for this single-machine workflow, key functions are:

  • We calculate the gradients in the loss.backward() step
  • The gradients are stored in model.parameters()
  • The model parameters are updated when we call optimizer.step()

So, in a multi-machine scenario, the following questions need to be considered:

  1. How many copies of the model should be created?
  2. What should the server store?
  3. What should the slaves store?
  4. What should they communicate with each other?

To see why 1) is important, note the way we deploy MPI is by having the same script sending to every machine, but use MPI_rank to identify servers and slaves, and use if-else condition to decide which block of code should be run on which machine. So, we can theoretically create some objects when we encounter a server, but do not create these objects when we encounter a slave. To answer 1), an obvious answer is everybody has its own copy of a whole model (i.e. the whole computation graph created by model = myModelClass()), but is this necessary?

It turned out it is. Why? Although server only need the parameter value and gradient values to perform updates, and theoretically we only need to put the optimizer on server, and model parameters on slaves, this is not doable in operation because in PyTorch optimizer is tied to a set of model parameters. More, the whole computation graph contains more than data, but also relations. So, we must initialize a model and a optimizer on both server and slaves, and use communication to make sure their values are on the same page.

To answer question 4, here is the logic flow:

  • a slave establishes connection with the server
  • a slave fetches initial parameter values from the server
  • a slave breaks connection with the server
  • a slave calculate the loss on its own batch of data
  • a slave calculates the gradients by calling loss.backward()
  • a slave establishes connection with the server
  • a slave send the gradients and loss to the server
  • a slave breaks connection with the server
  • the server updates its parameter value.

Note here we have multiple steps concerning how to set up connections. In effect, parameter values have many layers corresponding to different layers of a neural network, and we have multiple salves. So if multiple salves trying to send the same set of parameters to the server, data from different sets might be messed up. In other words, MPI needs to know who sends data and programmers need to use a protocol to ensure that during the transmission, only one pair of connection is in effect.

Note MPI does not allow one slave to “block” other slaves, so we need to code this “block” up by using “handshake” technique in TCP. The idea is a slave first send a “hello” message to the server, and when a server is busy this message will wait in a queue until being received. And when the server is idle, and it receives the “hello” message, it will realize someone else is calling it and will wait for this particular person by only receiving message from the same ID for this round. It will also send a message to this ID, telling it “I’m ready for you and no one else is able to mess around”. When a slave receives this confirmation from the server, it can go on to send the important gradient information to the server. After it finishes, it will receive the information send from the server. And when the server finished sending update information to the slave, it will be idle again and able to receive “hello” from another slave. This is a very interesting process to establish connections!

Engineering caveats for building PyTorch with MPI backend 

Some caveats to watch out for:

  • To use PyTorch with MPI, it cannot be installed from Conda. You have to build it yourself. But PyTorch is a large package so check your disk space and act accordingly. It took me 4 days to build it successfully….

Some links:

  • PyTorch tutorial for distributed SGD. Note this guy uses a synchronized SGD so the process is easier – you can just run an all-reduce and you do not have to worry about slaves mess up with each other.
  • My implementation of the distributed SGD process…

Finally, this blog post is not very clearly written. If someone really reads this and has questions, please email me at

重读 hierarchical attention network for document classification

这篇文章的 key idea 是,把关于文档结构的层级结构信息加入模型,有助于生成更好的文本表征。

这里的文档结构主要是说文章由句子组成,是层级结构。之前的方法是把所有句子连成一起输入一个 RNN 模型,这样其实丢失了段落这样的层级结构。

相应地,另一种方法是把每个句子里的词先分步输入一个 RNN 模型,生成句子表征;再将所有的句子表征输入另一个RNN模型,生成文本表征;最后再用文本表征进行分类。这样的方法,神经网络之间并不共享参数,而且两次输入RNN,故可以捕捉到层级结构。




  • 将每个句子里的词向量送入 GRU (see last post on explanation on what is a GRU),收集每一步输出的 hidden state (因而不能直接调用 pytorch nn.GRU 函数而需要稍作改变写个 for-loop 并把结果存起来)
  • 把所有的 hidden state 送入MLP,生成对词向量的表征
  • 随机初始化一个 context vector,这个 context vector 的意义是句子的含义。用它和每个向量的表征求点积,代表 attention score。score 越高,说明两个向量越相似,也就说明这个词在这个句子里有更显著的意义。因此给它的 attention weight 也就应该比较高。
  • 将 attention score 送入 softmax 函数求得权重。用这个权重和原始的 hidden states sequence 求 weighted sum 得到整个句子的表征。


  • 我们一共只训练两套 GRU 单元,一个负责总结句子,一个负责总结段落。因此所有的句子必须一样长,所有的段落必须有一样长度的句子。因此在预处理时,过长的句子被剪掉,过短的句子被补足。具体的长度选取可以看训练数据中长度的分布,用 qunatile 选择。
  • 将数据划整齐后,首先用上述方法得到每个句子的表征。
  • 其次,针对每个段落,再把所有的句子表征送入GRU,得到段落表征。
  • 最后就可以用这个表征做分类了。


2D convolution with tiling technique on GPU

This post describes how to write CUDA C code to perform 2D convolution on GPU with tiling technique. It is very brief, only covers basic concepts but with links to code examples. We are not going to use cuDNN, only the bare bones of CUDA. But here is an excellent post if you want to use cuDNN – for this particular task, cuDNN is slower than the code I will present below.

You can find the source code for this post here.

This post will consist of the following sections:

  • Installation CUDA
  • Basic GPU programming model
  • Vanilla convolution on GPU
  • Constant memory in GPU
  • Tiling technique and indexing
  • Link to code example

Install CUDA

First, make sure if you have a NVIDIA GPU on your machine. Here is how. Or just search the model online and ask on reddit 🙂

Next, follow the official NVIDIA guide here to download CUDA Toolkit. Note not every card support every version of CUDA kit.

Note also the download path, as you will need to specify that to the compiler later. If you are using Linux you can type whereis nvcc to find out. nvcc is the NVIDIA C compiler.

It’s easy to get wrong. Just follow carefully every step in the official installation guide. In particular don’t forget to finish the post-installation set-ups! Wrong PATH environment variables can be a real pain.

Basic CUDA Programming model

The most prominent characteristic of CUDA programming is parallization. When you write one line of code in a CUDA kernel (explained later), it will be executed by thousands of threads on different data. This is called the SIMD (same instruction multiple data) parallel paradigm.

With this idea in mind, here is what a typical CUDA C code look like:

  • Setting up the program on the host (CPU) by initializing variables and populate them. Write non-parallel sequential code.
  • Preparing for GPU work by initializing device (GPU) variables that reside in device memory, and copy contents from the corresponding host variables to it.
  • Write a GPU kernel function to deal with the parallel part of computation (e.g. convolution).
  • Set up the parallel configurations by specifying how many blocks and grids are used for the kernel. These two variables will decide how many threads we started for parallel execution.
  • Call the kernel from host code. Store the results in a device variable.
  • Copy contents from device to the host.

Some concepts to clear:

  • How many threads we use in the GPU is controlled by Grid size and Block size. A grid consists of many blocks. Grid and Block are both 3D structures. For example, a grid of dimension <2, 2, 2> will have 8 blocks in total, and a block of dimension <2, 2, 2> will have 8 threads in total. A block cannot have more than 1024 threads.

Here is a basic matrix addition example:


// function declarations 
__global__ void vecAddKernel(float * a, float * b, float * c, unsigned int N);

// main function 
int main()
    int N = 10;    // length of vector 
    float * a, * b, * c;  // a and b are vectors. c is the result
    unsigned int size = N * sizeof(float);  // number of bytes to allocate 
    a = (float *)calloc(N, sizeof(float));
    b = (float *)calloc(N, sizeof(float));

    int i = 0;
    float sum = 0;
    for (i = 0; i < N; i++){
        a[i] = (float)i / 0.23 + 1;
        b[i] = (float)i / 5.89 + 9;
        sum += a[i] + b[i];

    c = (float*) malloc(size);
    // 1. allocate memory on CUDA
    float * d_a, * d_b, * d_c;   // device memory 
    cudaError_t err1 =  cudaMalloc((void **) & d_a, size);
    cudaError_t err2 = cudaMalloc((void **) & d_b, size);
    cudaError_t err3 = cudaMalloc((void **) & d_c, size);
    if (err1 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err1), __FILE__, __LINE__);
    if (err2 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err2), __FILE__, __LINE__);
    if (err3 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err3), __FILE__, __LINE__);
    // copy memory 
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. operate on kernels 
    vecAddKernel<<<ceil(N/256.0), 256>>>(d_a, d_b, d_c, N);

    // 3. copy the results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    cudaError_t error = cudaGetLastError();
        fprintf(stderr,"ERROR: %s\n", cudaGetErrorString(error) );
    float cuda_res = 0;
    for(i = 0; i < N; i++){
        printf("%.2f\t", c[i]);
        cuda_res += c[i];
    printf("Results from host :%.2f\n", sum);
    printf("Results from device:%.2f\n", cuda_res);


    return 0;

void vecAddKernel(float * a, float * b, float * c, unsigned int N){
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i<N)  c[i] = a[i] + b[i];


Vanilla convolution on GPU

Where to parallelize for convolution? We use every thread to correspond to one single element in the output, and all threads do the same convolution operation but with different data chunks.

Constant memory in GPU

Constant memory variables are shared by all threads in all blocks. This memory is small but fast. So we should put the filter in it because everyone needs it, and it’s small.

Tiling technique and indexing

Tiling is making use of the shared memory on GPU. Shared memory is shared by all threads in the same block, it is small but fast. So the idea is for every thread in a block, they will operate on data regions that have some overlaps. And we make use of that to copy the small region of the larger picture that these threads need into the shared memory and dealing with them.

We use all threads to copy these data elements. Here is a further illustration of this tiling idea.

Code example



Here is the tiling organization :

Understanding GRUs

This post follows this post last year about vanilla Recurrent Neural Network structures.

One of the ultimate goal of a recurrent neural network is to summarize a sequence of vectors into a single vector representing all the information.

In the simplest case, imagine for the past month, every day  you and your friend did some push-ups. Let’s denote your number as x_i at day i, and your friend’s number as y_i at day i. Now a month has passed and you two want to compare who is more keen on working out.

So what you do you? Directly comparing two vectors is hard, because there is no obvious “greater-than” relations across dimensions? One way is to calculate the simple average of the two sequences, i.e. on average you did \bar{x} and your friend did \bar{y} per day. Whichever number is larger, you can bet that person is a better athlete!

Now in this example, time order does not really matter, because we are treating numbers from 30 days ago as the same as number yesterday. Another example where time does matter is interests from bank savings. Suppose every year you save b_i amount of Yuan into the bank, and the bank pays out interests at the end of year, once per year. How much will your total savings be at the end of the fifth year? You can image the oldest money will receive the highest interests, because of confounding.

With the same logic, we can apply this idea to summarizing sentences. Suppose every word can be represented as a n-D vector in a semantic space. How do we produce one single vector that represent the meaning of this sentence?

Taking average is a way, but note every word will have some influence on the words after it. e.g. “this burger is delicious.” Note how the “burger” constrains word choices after “is” … And that’s why we need recurrent neural network : at every step of the model, the hidden vector (which contains information about previous words) will concatenate with the current word, and becomes the input into producing the next hidden vector. So every word will have a lasting influence in the final output.

However, simple RNN has a vanishing (explosive) gradient problem. Mathematically, the older a word is, the higher order it will has on the multiplication factor of weight matrix. When taking first-order gradients, the weight matrix will have an order of n-1. Now if one term is larger than 1, as n approaches infinity this will will approach infinity too. If one term is smaller than 1, as n approaches infinity this will will approach zero and thus the model will “stop learning” (i.e. weights will not update).

Let’s formulate the intuition above more formally. For a simple RNN, every update we have

h_t =g (W x_t + U h_{t-1} + b)

Here g can be any non-linear activation function, e.g. a RELU or a sigmoid. Now we consider the gradient of h_t with regard to W.


h_t = g (O_t)


O_t =W x_t + U h_{t-1} + b

Using the chain rule we have:

\frac{\partial h_t}{\partial W} =\frac{\partial h_t}{\partial O_t} \frac{\partial O_t}{\partial W}

We also know:

\frac{\partial O_t}{\partial W} = X_t + U_h \frac{\partial h_{t-1}}{\partial W}

So plug in the second equation above into the first equation, we have:

\frac{\partial h_t}{\partial W} = {\partial g} \cdot (X_t + U_h \frac{\partial h_{t-1}}{\partial W})

We can already see a recurrent relation between \frac{\partial h_t}{\partial W} and \frac{\partial h_{t-1}}{\partial W}. To make things clear, write \frac{\partial h_t}{\partial W} = \alpha_t, and expand the equation a few more steps:

\alpha_t = {\partial g} \cdot (X_t + U_h \alpha_{t-1})

\alpha_t = {\partial g} \cdot (X_t + U_h \cdot \partial g \cdot (X_{t-1} + \alpha_{t-2}) )


\alpha_t = C + (\partial g U_h)^{n-1} \alpha_0

Here $C$ represent the other terms (with lower order of \alpha_t-s) in the formula that we omitted for simplicity. So we can see, as n increases, if norm of \partial g U_h is greater than 1 this term will approach infinity, or if is less than 1 it will approach zero.

The main reason is the same term is multiplied n-1 times. i.e. information always flow at the same rate every step. If you think about a sentence, this does not really make sense as words meaning / importance does not really decrease (increase) exponentially w.r.t. it’s distance to the end of the sentence. The first word might very well be very important (e.g. a female name which will influence later pronoun choices “he” or “her”), but this vanilla RNN structure is too simple to capture that. We need more dynamic structures to allow information flow freely between time stamps.

Gated recurrent unit solves the problem by adapting different terms in every update step. There are two key ideas:

  • Introduces an external “memory” vector to capture long distance dependencies.
  • Allow error messages to flow at different strengths depending on inputs.

It achieves this by first computing two “gates”:

update gate : z_t = \sigma (W_z x_t + U_z h_{t-1} + b_z)

reset gate: r_t = \sigma (W_r x_t + U_r h_{t-1} + b_r)

They are both continuous vectors of the same length as the hidden state, constructed by passing the current word x_t and the last hidden vector h_{t-1} through an MLP. The sigmoid function makes every element between 0 and 1, so when used to perform element-wise multiplications \o, these two gates essentially controls how much information “flows” through.

There is also a candidate vector representing “new memory content”:

\tilde{h_t} = \text{tanh} (W_h X_t + r_t \circ (U_h h_{t-1} + b_h)

The usage of tanh and W_h X_t is pretty standard as in all other unites. But note the meaning of reset gate r_t here. If r_t is close to zero, then we “forget” all the previous information in the h_{t-1} vector, and just use the current word information.

An example where the reset gate is useful is sentiment analysis on a movie review, where the author spend many sentences describing and summarizing the movie plot, but conclude that “But the movie is really boring”. With this sentence, all the previous information become useless. This reset gate can help the model “forget” the previous summaries.

Finally, we update the new hidden vector as :

h_t =  z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h_t}

You can see the update controls how much information is remembered from old state, and how much information is acquired from the new memory content.

A caveat is why do we need the reset gate at all now that we have the update gate? Doesn’t we already have a gate to control how much old information is retained by using an update gate? Doesn’t the update gate also have the ability to eliminate all historical information embedded in h_{t-1}? I did not find satisfactory information about this point online.

Now think about why GRU can solve the vanish gradient problem. The update gate allows us to retain all previous information by setting all elements of z_t to 1, and h_t is exactly the same as h_{t-1}. In this case, we do not have the vanishing gradient problem because no weight matrix is multiplied.

For example, in the same movie review suppose the author says “I love this movie so much” and then goes on to explain the movie plot. Now the first sentence is really important, but with recurrent neural network, we update the gradient at every step and the content will be washed out as time passes. But now the update gate allows the model to retain the first sentence without exponentially decay its content.

Units with short term memories usually have reset gate very activate (i.e. numbers close to 1), and thus forget most about the past…

If reset gate is entirely 1 and update gate is entirely zero, then we just have a standard RNN.

Why tanh and sigmoid?

  • In theory tanh can also be RElu, but in practice it just performs better.
  • Sigmoid will control value between 0 and 1, but again no formal justification.

Finally, here is an illustration that pictures what’s going on in a GRU unit:


pointnet 论文理解

这几天(上周)在看 pointnet 系列的论文,具体研究问题是如何处理点云数据。大的研究领域是机器学习如何处理3D视觉问题,在无人车和机器人领域有应用,因为 depth camera,传感器,激光雷达传回来的图片有些是 point cloud。

点云数据是用点来代表三维图像的一种数据格式,例如一张椅子的三维图。从椅子平面上抽取2400个点,每个点是 (x, y, z) 的一个三维向量,那么把这些点在三维空间里画出来,就会得到飞机的图片。这里有一个可以在线观看的例子。这张图片里的点很多,所以观看效果很逼真。如果点少的话,就会变成这样

不同于2D图片,点云数据是无序的。传统 CNN 通过滑动 kernel 的卷积操作,能提取出 local 像素点之间的依赖关系(例如 max pooling,以及 convolution 是 weighted sum of local points)。但点云数据没有清晰的 local 关系。如何解决呢?1)可以把点云数据复原到三维,把它们分割成一小格(voxel)然后进行3D卷积。2)可以把点云数据投影到二维,代表算法有MV3D 和 AVDO

pointnet 是这系列的重点论文,2016年发表。这篇论文说不需要进行这样的处理,而是直接把点云数据输入到 multi-layer perceptron 中。步骤如下:

  1. 假设 batch size = 32, 每张图片有 1024个点。那么每批次输入的数据形状为 32 * 1024 * 3
  2. 首先把这个数据中每一张图片和一个 3 * 3 的 transform matrix 相乘。整体的形状,就是 32 * 1024 * 3 的 tensor乘上 32 * 3 * 3 的 tensor,使用 tf.matmul 函数,得到的结果是 32 * 1024 * 3。物理意义是把这个点进行坐标变换。这样做的原因是这篇讲  spatial transformer 的论文,还没看。
    • 这个 transform matrix 并不是随机生成,而是一个较复杂的 T-net。它的网络结构和这个大的网络结构一样,也是将输入图片几次升维,非线性变换 + bias,然后降维,得到的矩阵。所以不同的图片虽然分享了网络参数,但是并没有乘以相同系数。
    • 为什么这个网络这么复杂?其实和GRU这类网络有异曲同工之处,都是用当下数据既生成权重,又用这个权重再次处理当下数据。
  3. 图片处理好之后的形状是  32 * 1024 * 3. 接下来是第一次升维。具体操作是,每个点的三个维度,求 weighted sum 获得一个 scalar。然后这样的操作重复64次,得到一个64维的向量。其实也就是长度为3的向量乘以 3 * 64 的矩阵。这样会得到 32 * 1024 * 64 维的数据。将这个数据加上随机的 bias 之后,做 batch normalization,形状不变。
    • 在作者的源代码中,这一步没有用拼接 64 个线性层的方法,而是用 tensorflow 中的 conv2d 实现。通过变换 channel 数来代表64个 fully connected layer。
  4. 接下来又是一次“升维”操作,和上一步一样,但是输出的维度仍然是64.
  5. 现在的数据形状是 32 * 1024 * 64.接下来是 feature transform。通过同样的网络结构,获得一个 32 * 64 * 64 的 tensor。用 tf.matmul 把每一个点再次变换,原理和2)一样。
  6. transform 之后,将图片进行升维。先升到64维,然后是128,最后是 1024.最后我们得到一个 32 * 1024 * 1024 的数据形状。
    • 在作者的源码中,数据形状是32 * 1024 * 1 * 1024,因为要使用 conv2d 操作。
  7. 接下来是关键一步:要概括总结上一步中每个图片中所有的点,得到代表全局信息的向量。这里通过 maxpooling 实现,也就是对每个 channel ,取图片中所有点中的最大值。也就是 element-wise max pooling。
    • 源码中的实现是通过tf.max_pool 函数,输入就是 32 * 1024 * 1 * 1024 的这个数据,kernel size = [1, num_points (1024), 1, 1],输出是 32 * 1 * 1 * 1024。
    • 为什么不用 sum 或者 average?这篇论文针对 max pooling 操作提出了理论性质,我还没看。
  8. 一番折腾后,终于得到了代表每张图片的全局向量。下一步是reshape ,成为形状为 32 * 1024 的一个数据块。把两个多余维度扔掉。


  • classification:只需要全局表征即可。把这个表征通过三层 fully connected layers 生成长度为 40 的向量(因为一共40类),通过 softmax 层得到概率,用 cross entropy 的均值作为 loss function,通过一维逼近得到参数的最优值。
  • segmentation:作者的处理是对每个点做 classification,每个点都贴上这个全局变量后,进行分类。
  • semantic segmentation,不是很懂,略过。

这篇文章的重点在于1)将简单的网络结构直接用于点云数据,2)maxpooling 操作的性质(invariant to input orders)

Writing Python tests: doctest, unittest, and pytest

So I set myself another task the other day: writing tests for an open-source project —- a baby step towards software engineering! This post describes the tools and practices I’ve learnt.

The project I’ve committed to is here. The goal is to write tests for an experiment class, whose objective is to save state of machine learning experiments and quickly retrieve them.

There are two sets of tools to be learnt: 1) conceptually, what is a test and how to  write one? 2) practically, what tools does Python offer for writing and organizing test code, and how to use them? I will answer these two sets of questions below.

What is software testing and how to write it?

Testing is just write code to test if a software behaves as expected. If it does not, then there is a bug, and bugs are bad. Most of the online materials / books on this are kind of fluffy, describing concepts that are hard to define and unnecessarily convoluted. But really I just understand test to be like, this piece of software is supposed to output 1 when I input 3, is that what it is behaving?

There are also edge cases to be tested and taken care of accordingly.

So the real issue seems to be how to write good tests. This is a very broad topic with no definite answer, and I’ll omit discussion in this post. For this project, I’m following this guide.

Testing tools in Python 

  1. doctest
  2. unittest
  3. pytest

doctest  is packaged with Python. The usage is the following:

  • Open a Python interpreter, and interactively test a bunch of results with the function to be tested
  • Copy and paste all output from the interactive session to the docstring of that specific function
  • add a if __name__ == "__main__" line in the script to be tested so as to detect if it is imported or being executed standalone. If it is executed standalone, add one line doctest.testmod() so the doctest will run the test mode..

this module will parse the docstring to get expected results for every input. It will then compare the actual output with the expected input, and return errors if they do not match. If everything runs fine, the module will not output anything.

unittest the biggest difference with doctest is test cases are no longer defined in the original script. Instead, we now create a new script, import the old script and its methods, and test all out test cases there. The good news is now script is separated from tests, the bad news is we need to write more code.

How to use unittest module?

  • First create a new python script, import bothunittest and the original script in it. This script will be run alone, so need to specify the shebang line as well.
  • Create a unittest.TestCase class. syntax is class MyTestCase(unittest.TestCase), i.e. MyTestCase is a subclass of unittest.TestCase.
  • So the real issue is what methods we can use with unittest.TestCase. These methods are used to assert if the expected value is equal to the real reaults. Here is a link to the useful parts of its documentation. Basically, the testing process is a series of assertion processes.
  • The most naive way is just to define a bunch of assertTrue() methods in the class one by one. A more mature way is to use the setUp and tearDown methods to put all test cases into a data structure, e.g. a list, and in the test function body use a for loop to iterate through that data structure. Sample code here :

pytest is a third party package. It is easier to use than the unittest module, because it does not require all the class and method setups.

  • The simplest case, first create a python script with 1) the function to be tested and a function to test that function with assert. e.g. assert func(4) == 5
  • Then, run the scrip with pytest from command line in the folder containing the script. You do not even need to import pytest in!
  • On the other hand, if we do not specify an argument to pytest, it will run test on every file of the form * or test_*.py in the current directory. This is a convention to discover test files. Also see this link.

Finally, adding some good resources on pytest that I just found:

  • A webinar :
  • The accompany git repo:

Some updates on using pytest with PyCharm

  • If you want to run a single test instead of the whole script, see this SO discussion. The pipeline might be broken though.
  • Pycharm professional has a test-runner tab where you could use auto-test and it will re-run a test function after modification. Link here.
  • Add verbose option for pytest. It’s in “edit configurations” — pytest — additional argument. Specify a -v.
  • Put tests in a class so you can run them together. i.e. if they are related and always need to be run together.
  • auto test —- you can have test run continuously when you type

Happy testing!

Using ROS with Conda

For the past two weeks I’ve been playing around with ROS. All was fine until I tried to embed tensorflow and “deep learning” stuff into ROS with conda.

Things went south immediately. Pipelines broke down, what used to work no longer worked anymore. roslaunch showed `not a package` error, import in Python showed package not found even though I just pip installed it…

This post aims to clarify what happens, why my system became a mess, and how to resolve them.

The first problem with error importing modules in python is because ROS conflicts with Conda. This is because ROS sneakly changes the Python path in `~/.bash_rc` file. According to this post by Udacity NanoDegree github:

The problem here is that Python is looking for a package in /opt/ros/kinetic/lib/python2.7/dist-packages/ when it should be looking in the Anaconda installation directories. The reason for this is that with your ROS install, your .bashrc file contains the line source /opt/ros/kinetic/setup.bash, which sets your PYTHONPATH variable to the /opt/ros/kinetic/... location.

The solution is to unset your PYTHONPATH variable in your ~/.bash_rc file, right after sourcing ROS, but before sourcing your Anaconda environment. Use this command unset PYTHONPATH.

The second issue is I used many package installers to install python packages (and possibly python itself). Be careful of online tutorials my dude! I ends up having tens of python paths, scattered around different virtual environments, and things became very convoluted.

Package managers I used include:

  • pip (system wide and within conda; I have five pips …)
  • apt (did not figure out what it did until too late)
  • conda

Where do these package managers install packages?

  • pip:
    • When used without conda environment, will show the base python dir
    • pip when used with virtualenv will generally install packages in the path <virtualenv_name>/lib/<python_ver>/site-packages. For example, I created a test virtualenv named venv_test with Python 2.7, and the django folder is in venv_test/lib/python2.7/site-packages/django.
  • conda will install in the environment’s python site-packages.
    • If there is no Python2 but we want a python2 package, conda seems will install it to the system-wide python site-packages

A detailed comparison about pip and apt:

pip is used to download and install packages directly from PyPI. PyPI is hosted by Python Software Foundation. It is a specialized package manager that only deals with python packages.

apt-get is used to download and install packages from Ubuntu repositories which are hosted by Canonical.

Some of the differences between installing python packages from apt-get and pip are as follows:

  • Canonical only provides packages for selected python modules. Whereas, PyPI hosts a much broader range of python modules. So, there are a lot of python modules which you won’t be able to install using apt-get.
  • Canonical only hosts a single version of any package (generally the latest or the one released in recent past). So, with apt-get we cannot decide the version of python-package that we want. pip helps us in this situation. We can install any version of the package that has previously been uploaded on PyPI. This is extremely helpful in case of conflict in dependencies.
  • apt-get installs python modules in system-wide location. We cannot just install modules in our project virtualenv. pip solves this problem for us. If we are using pip after activating the virtualenv, it is intelligent enough to only install the modules in our project virtualenv. As mentioned in previous point, if there is a version of a particular python package already installed in system-wide location, and one of our project requires an older version of the same python package, in such situations we can use virtualenv and pip to install that older version of python package without any conflicts.
  • As @Radu Rădeanu pointed out in this answer, there would generally be difference in names of packages as well. Canonical usually names Python 2 packages as python-<package_name> and Python 3 packages as python3-<package_name>. Whereas for pip we generally just need to use <package_name> for both Python 2 as well as Python3 packages.

Which one should you use:

Both apt-get and pip are mature package managers which automatically install any other package dependency while installing. You may use anyone as you like. However, if you need to install a particular version of python-package, or install the package in a virtualenv, or install a package which is only hosted on PyPI; only pip would help you solve that issue. Otherwise, if you don’t mind installing the packages in system-wide location it doesn’t really matter whether you use apt-get or pip.

Also, this link that summarizes this problem perfectly:


The third issue is I confused Python packages with ROS packages somewhere in my projects. This is more conceptual but did misguide my actions. Here are some clarifications:


  • ROS is a programming ecosystem/framework that provide middleware for modules (packages) to communicate, especially for the purpose of robotics systems
  • ROS provide distribution in Python and C++
  • ROS code is organized by packages: folders containing src, data, other libraries…
  • ROS has a $ROS_PACKAGE_PATH, which is an environment variable indicating where to find packages for ROS. For example: `echo $ROS_PACKAGE_PATH` will give the output /opt/ros/kinetic/share


  • Python path: environment variables. Where to find import packages… the module search path
  • modules: individual scripts
  • package: a collection of modules
  • Given this structure, if the pkg directory resides in a location where it can be found (in one of the directories contained in sys.path), you can refer to the two modules with dot notation (pkg.mod1, pkg.mod2) and import them with the syntax you are already familiar with
  • The file in a package will import the modules into the package namespace, so that we can directly use imported modules in the main function.


Finally, some useful scripts for inspection and cleaning:

  • Find all Python:  sudo find / -type f -executable -iname 'python*' -exec file -i '{}' \; | awk -F: '/x-executable; charset=binary/ {print $1}' | xargs readlink -f | sort -u | xargs -I % sh -c 'echo -n "%: "; % -V'
  • `whereis pip`   finds all pips

Use binary representation of int to represent subsets of n elements

For the past two months I spent many hours solving algorithmic problems. It’s fun, but very time-consuming. I think the majority of time is spent on trying to find accurate solutions and systematic ways for problem solving. I also spent a few weeks (!) grasping data structure fundamentals.

Anyway, this post just describes a technique I came into when reading this book , page 73, on using integers to find all possible subsets from a list of length n. Example Leetcode problem:

The key insights are:

  • Every int is binary represented in a machine’s memory as a sequence of 0’s and 1’s
  • The range of ints from 0 to 2**n-1, have included all possible combinations of 0’s and 1’s for a list of length n.
  • Think about every position’s value (0 or 1) as an indicator whether element at that position appears or not.
  • So for every int from 0 to 2 **n -1, we only need to extract all the positions that are 1, to get the combination of list elements this int represents
  • With bitwise operators & and shift operator <<, we can easily get whether the i-th element of an int is 1 (turned on) or 0 (turned off). Try X & (1 << i).
  • Looping through all positions from 0 to (n-1) we can get all valid bits for this int. We append the corresponding elements to the answer list corresponding to this int.
  • Finally, we collect every answer list to a big list and return.

Very smart and interesting strategy. Code snippet here:

This strategy can used for other exhaustive search problems that requires looping through all possible combinations of n elements, with exponential time complexity ( 2**n). But the very fast bit manipulation makes it evades the time limit.


硕士论文还需要最后修改,陆续在看文献。从上周五开始,先后花了两个白天,两个傍晚,两个清晨,才大概看懂 Philip Converse 1964 年的名著 The Nature of Belief Systems in Mass Publics。引用了许多次,今天才真正搞懂。以为是他写得差,拿去问美国人,道是这文章写得太优美,导致理解困难。



为什么这篇文章难呢?首先是这人英文写得百转千回,一个句子要带三个逗号两个从句,因此语意也重重转折,难以分辨主要意思。这当然也说明作者思维缜密。其次是这篇文章结构不是很清晰。当代的社科论文作者已经学乖了,照顾读者耐心,第一句话就提出研究问题。第二三四句话摆出 alternative hypothesis,第五句话说结论,第六句话吹牛。但这篇文章比较像 essay,看不出这么简单粗暴的结构。第三难点是我对政治知识基本抓瞎,只知道个 liberal / conservative,放在作者的分类里就是社会底端的无知大众。第四难点是词汇量。



第二是这篇论文最后提及,政治精英参与 the history of ideas (思想史?)的制定,而他们的决策反过来影响大众。大学学习马克思的时候,好像也看到过类似的思想 (class consciousness?) ,好像也因此树立了一些理想。