Category Archives: Techy

ResNet: annotated version

This post annotate PyTorch’s implementation of ResNet.

ResNet is one of the most widely used network structure for image tasks in the industry.

The motivation for the original idea is deep neural networks, if we simply stack layers together, sometimes perform worse than shallow networks. For example, a network of 52 layers has higher training errors than a network of 36 layers. This is strange because in theory, deep networks should always be no worse than shallow networks: a constructed solution is to make all the extra layers identity matrix, and the two networks should perform the same.

  • The difficulty in optimization is NOT because of the gradient vanishing problem. 1) the authors use batch normalization after every conv layer, so forward propagated signals have non-zero variances. 2) the authors tested norms of backprop gradients and ensured that they exhibit healthy norms. Also the 34-layer plain net still have competitive accuracy so the gradients do not vanish.
  • The training error cannot be reduced simply by adding more iterations.
  • The conjecture is deep neural nets have exponentially low convergence rates.

To solve this problem, this paper’s key idea is to let some of the layers learn the “residuals” of a function instead of the original function. There is no rigorous math proofs. But the intuition is, we already one of the solutions to make sure that deep networks performs no worse than shallow layers is to make all the extra layers identity transformations. So we can pass this information to these layers to reduce its optimization efforts, by manually adding an identity transformation to their results.Thus, these layers only need to learn the “residuals” of the original transformation minus the identity transformation. Mathematically, suppose the original “real” transformation that these layers are trying to learn is H(x), now they only need to learn H(x) – X.

Now practically, should we assume H(x) and X are always the same dimension? (Here “dimension seems to be particularly referring to number of channels, as shown in figure 3 of the original paper) The answer is no. The authors provide two options for handling missing dimensions.

  1. Perform a linear project W_s in the shortcut connection so that we got H(x) = F(x) + W_s X. i.e. X is passed through a linear combination to make sure dimensions stay the same.
  2. Still use the identity mapping but make all extra channels zero paddings.

Theoretically, we can also do this when the dimensions are the same, i.e. transform X before adding it to the residuals. But this is unnecessary as identity matrix alone is sufficient, shown by the experiments. Using projective shortcut in models with all conv layers with channels increase is only marginally better than using identity shortcut in these models, and this is likely due to increase of parameters.

Note also practically F(x), i.e. the residuals, are implemented as 2 ~ 3 linear or conv layers. In the implementation we are going to see below they are conv layers.

Another insight for me is “deep” networks are really deep and complex. I only coded basic building blocks for these networks, e.g. convolution, LSTM, but I in fact have not tried building 100 layers of such building blocks! ResNet proposes one model of 152 layers. A practical question is how do we implement it when there are so many layers? Also, when the network is this deep there are many design decisions to make. e.g. how large should a filter’s kernel size be? What should the stride and padding be? How about dilation? I do not have experience with tuning all these hyper-parameters.

Some of the design principles I read from the ResNet paper, inspired by VGG nets:

  • most filters are 3 by 3. In fact, 3 conv3x3 layers has the same receptive field with one conv7x7 layer, but they have fewer parameters (27 vs 49). So stack many small filters is a more economic solution than using one big filter (with the price of more dependencies between output feature maps).
  • Number of features * feature map size should be roughly constant.
    • For example, for the same output feature map size, the conv layers have the same number of channels.
    • If we halve the output feature map size by using a conv layer, we usually double the number of filters to preserve the time complexity per layer.
  • conv layer can also be understood as a downsampling layer. e.g. a conv layer with stride=2 can halve the feature map size.
  • shortcut networks do not increase the number of parameters or computation complexity (only by a constant addition)

Fortunately, PyTorch offers a ready-made implementation in package. Here is my annotated version of the code.

import torch.nn as nn
import torch.utils.model_zoo as model_zoo

__all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101',

model_urls = {
    'resnet18': '',
    'resnet34': '',
    'resnet50': '',
    'resnet101': '',
    'resnet152': '',

This is just the boilerplate code. Note the `model_urls` list stores pre-trained model weights for some network structures. “resnet18” means “Residual Net with 18 layers”.

def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)

def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)

These are two most basic convolution filters used in ResNet. Notice ResNet does not use filters of other sizes, and all default stride sizes are 1. Filter size 1 is for reshape data channel dimensions. For example, a data might have 64 input channels and 28×28 pixel each. Then a conv1x1 layer, if it is to output 256 channels, will first sum over all channels to get one number, then replicate that number 256 times with different weights. It can also be used to reduce number of channels. Since the the residual operation need the input and output of same dimension, we will need such an operation to adjust data shapes.

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

Here `inplanes` and `planes` are simply input channels and output channels. Note this is a basic module to be stacked in ResNet. This module consists of two conv 3×3 layers, each followed by a batch normalization layer. The first conv layer is also passed through a ReLU for nonlinearility. (Note ReLU will cap all input greater than a certain value into a zero, so if you plan to use that output as a divisor be careful not dividing anything by zero.)

Then there is an option for downsampling or not. If you look at the `_make_layers` function in `ResNet` class, you will notice that `downsample` is an `nn.Sequential` consists of 1) a conv1x1 layer that aims to expand the number of channels of input data, and 2) a batch normalization layer that follows the conv1x1 layer. The downsample option is automatically enabled when the input dimension does not satisfy certain conditions (explained in the `ResNet` class annotation), or the stride is greater than 1.

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = conv1x1(inplanes, planes)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = conv3x3(planes, planes, stride)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = conv1x1(planes, planes * self.expansion)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

Here the Bottleneck class defines a three layer network. Here are the meanings of the parameters:

  • inplanes: input channel to the first conv layer
  • planes: the number of channels for the intermediate conv layer.
  • The final number of output channels will be planes * 4, because there is an expansion factor = 4.

This whole module is first change input channels from “inplanes” to “planes”, then shrinkage the size of feature map, finally expand the output channels to 4 * planes. Why do we need such a structure? It’s mainly for computation efficiencies and reducing number of parameters. Compare two structures 1) two 3×3 conv layers with 256 channels and 2) one 1×1 conv with 64 channels, one 3×3 conv with 64 channels, and one 1×1 conv with 256 channels. The first : parameters = 3*3*256*2, the second 64 + 3*3*64+256.

class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False):
        super(ResNet, self).__init__()
        self.inplanes = 64
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3,
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                nn.BatchNorm2d(planes * block.expansion),

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x

The key to understand this chunk of code is to the `_make_layer` function. It takes the following parameters:

  • block: can be either a `BasicBlock` object or a Bottleneck object. The first consists of two conv layers
  • planes: planes * self.expansion is the number of channels for the intermediate layers. (Q: why do we need the self.expansion parameter at all?)
  • blocks: number of basic blocks (e.g. a basicblock or a bottleneck block)
  • stride=1: stride for blocks
def resnet18(pretrained=False, **kwargs):
    """Constructs a ResNet-18 model.
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    model = ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)
    if pretrained:
    return model

def resnet34(pretrained=False, **kwargs):
    """Constructs a ResNet-34 model.
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    model = ResNet(BasicBlock, [3, 4, 6, 3], **kwargs)
    if pretrained:
    return model

def resnet50(pretrained=False, **kwargs):
    """Constructs a ResNet-50 model.
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    model = ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
    if pretrained:
    return model

def resnet101(pretrained=False, **kwargs):
    """Constructs a ResNet-101 model.
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    model = ResNet(Bottleneck, [3, 4, 23, 3], **kwargs)
    if pretrained:
    return model

def resnet152(pretrained=False, **kwargs):
    """Constructs a ResNet-152 model.
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    model = ResNet(Bottleneck, [3, 8, 36, 3], **kwargs)
    if pretrained:
    return model

The above are just different configurations for different ResNet structures.

Various SGDs

This post briefly documents variations of optimization algorithms and best practices. It’s a summary of this source . Also this course note is helpful for a review.

Tricks in checking if a gradient is implemented correctly: use centered difference instead of finite difference, because the first one has an error of O^2 if you try expand it by Taylor’s theorem. It’s more precise.


Interpretation: the momentum factor is multiplied on the past update velocity and decreases it. Momentum is more like a friction of the previous update scale. Also plus the newly learnt gradient. As a result if gradient goes to another direction it will be slowed down. If gradient goes to the same direction the scale of change will become faster and faster.

Nesterov Momentum

Similar to Momentum, but the update step, instead of calculating gradient at J(theta), we calculate it as theta + r * v, as a “look ahead step”. we know our momentum factor will take us to a different location and we get the gradient from that location.

Exponential decay of learning rate

As learning approaching the end, it makes sense to decrease the learning rate and go in smaller steps.


cache += dx ** 2

x += lr * dx / (sqrt(cache) + eps)

eps is a smoothing factor to avoid division by zero

Understanding: 1) the scaling factor is square root of sums for all previous gradients but in a per dimension fashion so that every dimension has its own learning rate 2) as a result, for more sparse parameters the learning rate is increased because the sum of gradients are smaller, and for dense parameters the learning rate is decreased because the sum of gradients are larger


It’s an adjustment to Adam in that the learning factor no longer goes monotonically decreasing (because the cache is always increasing). Instead, the cache here is “leaky” think of LSTM gates. The formula is :

cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

Question: what’s the initial cache value for RMSprop? some implementation states it’s zero. 


Adam looks like a mixture of RMSprop and momentum.

m = beta1 * m + (1 – beta1) * dx

v = beta2 * v + (1 – beta2) * (dx**2)

x += -lr * m / (sqrt(v) + eps)

Note here 1) the m and v are leaky just as in RMSprop and 2) the update step is also exactly like RMSprop except that we now us a “smooth” version m instead of the raw gradient dx and 3) the construction of m looks like momentum.


Some points to help really understand VAE

It took me a long while (~ 4 days?) to understand the theories of Variation autoencoder (VAE) and how to actually implement it. And it’s not entirely my fault, because:

  • The original paper (Auto-Encoding Variational Bayes) explains things from a probabilistic perspective, which
    • requires knowledge on Bayesian statistics to know the math is correct
    • requires knowledge on variational inference, or general statistical inference to fully understand motivations of introducing distributional assumptions, and meanings of equations. (In fact, I should be familiar with this because I did something related in my final year of college, but I forgot most of it..)
    • only offers a general and abstract framework for the major parts of the paper, but leaves results for specific distributions in the appendix.
  • Online tutorials over-simplify this model using neural net languages
    • Most online tutorials just add a “sampling random noise from normal distribution” step in the auto-encoder framework. But often these tutorials cannot hold up in their writing and have logic loopholes. e.g. they never explain why the loss term have a KL-divergence between sampled and true z distributions, and how to calculate that distribution using empirical data (this is a key).
    • Most online implementations miss a point: if reconstructed x is sampled, how can we use MSE loss between original data and reconstructed data? They are not even the same data point. Some tries to explain this by acknowledging the reconstructed x is sampled from conditional P(x|z_i) and thus establishing the correspondence, but still, reconstructed x still has a randomness in it, and cannot use MSE loss.
    • Even the lecture on VAE from Stanford  does not explicitly explain this point above.
    • One will feel the online tutorial and the original paper do not match up.

In other words, most online tutorials do not bridge the gap from the general variational lower bound to the actual implementation equation, because they do not introduce three crucial distributional assumptions.

In fact, one just cannot derive the loss function in the original paper from a neural net perspective, because neural net does not make distributional assumptions. Also, the original paper is not about neural networks.

So, this post just tries point out the key ideas in understanding the original paper:

  1. The whole paper sets in the probabilistic framework. We do not have loss functions here. Instead, we make an assumption for p(x), and the goal is to find parameters for p to maximize likelihood to account for the current observed data x. Once we find these parameters, we can sample new data from this distribution.
  2. We introduce a latent variable z in order to better estimate p(x)
  3. We don’t know the conditional distribution of z|x. So we assume it follows normal, i.e. q(z|x)
  4. After introducing q(z|x), while we still can’t analytically tackle p(x), we can at least get a lower bound, which is - D_{KL} (q(z|x_i) || p(z)) + E_q ( log p(x|z)) You can find how to get this lower bound from many online tutorials, e.g. this.
  5. With the important lower bound we got from 4, now it’s time to plug in some concrete distributional assumptions. In the original paper appendix B, an analytical solution for - D_{KL} (q(z|x_i) || p(z)) is given when p(z) and q(x|z) are both Gaussian
  6. But then still we need to get $latex E_q ( log p(x|z))$. This is simplified to l2 loss when x is assumed to follow i.i.d. Gaussian. To see why, think about linear regression and how MLE is the same as minimizing the total sum of squares in terms of getting estimates for regression coefficients. Also this slide from UIUC!
  7. So now we got a concrete instance for VAE under Gaussian assumptions. How does this apply to neural network and autoencoders? Here is a glossary:
    1. encoder: output z|x, and the neural net is to simulate sampling from p(z|x). Used to get empirical observations of z. we have a loss term here in that we want to minimize the KL divergence between this output z|x and our assumption that z|x follows normal.
    2. decoder: output x|z. the neural net is to simulate sampling from p(x|z)

Preparing 3D building data from CityGML for machine learning projects


I spent the last week of 2018 and the first week of 2019 preparing CityGML data for a machine learning project. In particular, I need to extract 3D point cloud representation of individual buildings in New York (and Berlin and Zurich) so they can be the training / validation / test data.

During the process, I had to pick up some knowledge on 1) the Linux file system / disk management, 2) PostgreSQL, 3) CityGML (a data format for 3D city maps) and 4) FME (a data warehouse software for ETL) and 5) PCL command tools and 6) shell programming. I acquired these bits of knowledge by asking on Stackoverflow, FME forums, and the issue forum of corresponding open source tools on Github. Most importantly, thanks to the help from people all around the world (!), I am finally able to figure it out. This post documents the necessary steps to finish this task.

Since there are many details involved, this post will mostly point the readers to the tools I used and places to find how to use them. It covers the following sections:

  • CityGML : what is it and database setup
  • FME: the angel that does the heavy lifting for you for free, from CityGML to mesh
  • PCL command tools: a not-so-smart way (i.e. the Engineering way) from mesh to point cloud

1. CityGML Setup

1.1 What is CityGML

CityGML is a markup format to document 3D city maps. It is a text document, basically telling you where a building is, what a building is composed of, where a bridge is, what’s the coordinate of a wall surface, where are the lines of a wall, etc…

CityGML is based on XML and GML (geometric markup language). XML specifies document encoding scheme, e.g. class relations, labels and tags… kind of like HTML. GML extends XML by adding sets of primitives, like topology, features, geometry. Finally, CityGML is based on GML but with more constraints that are specific to cities, e.g.

A 3D city model can have various level of details (LoD). LoD1 means the map is only 2D. LoD2 means 3D objects will be extruded from 2D maps into 3D shapes. LoD3 means a building will have windows, roofs and other openings. LoD4 means a building will have interior furnitures.

For example, part of a sample CityGML file might look like this :

This specifies several surfaces of a building.

A sample visualization of a CityGML LoD4 model might look like this:

You can see furnitures of the building in the main Window.

References to know more about CityGML:

  • Groger and Plumer (2012) CityGML – Interoperable semantic 3D city models. Link. This paper goes over details of CityGML and is a more organized guide than the official CityGML website.
  • CityGML website for downloading data for specific cities. Many of the Germany cities are available in CityGML format.

1.2 3D City Database

3D City Database is a tool to store geographic databases, and it provides good support for CityGML data.

Now why do we need a database to store CityGML data? Because a city map might be very large and structured, e.g. New York LoD2 city model is ~30G. Such a large file cannot be easily manipulated by text processing tools (e.g. gedit) or visualized because of memory constraint. More, the 3D city database can easily parse modules and objects in the CityGML file and store them in structured ways. So if someone asks the question, how many buildings are there in New York? This question is hard to answer with only the CityGML file, but very easy to answer with a SQL query run on the 3D city database.

To use the 3D City Database, we need to go over the following steps:

1. Set up a PostgreSQL database locally or on a remote server. This database tool also supports Oracle, but I did not make that work. PostgreSQL is a free database tool available for Linux. The documentation is good for specific queries, but if you are new to practical database management, this book might be more helpful in offering a general road trip.

Specifically, on Linux you need to first make sure you have a disk that has at least 50 GB of free space, and your current user can own directory on that disk. I spent 4 days trouble shooting this because the hard disk on my Laptop has a file system that is actually Windows NTFS, so my sudo user cannot own directories there. Commands like chmod or chown did not work. To solve that problem, I had to back up everything (compress and upload to Google Drive) and reformat that disk into Linux File system. Useful commands and tools here:

  • parted : a tool to partition a new disk. A disk has to be partitioned before it can be mounted and used.
  • mount : add a disk’s file system to the current operation system file system tree
  • lsblk : list block and check their file systems
  • mkfs: make filesystem. ext4 is a linux format.

Linux wrapper tools useful for PostgreSQL actions:

  • pg_createcluster: create a cluster. Used because the default PostgreSQL directories lie somewhere under /home, and this disk is usually small. If we want a cluster in another location should use this command. Note the path should be absolute path instead of relative path.
  • pg_cltcluster : start / stop / drop a cluster.
  • psql: used to connect to a running server. Note PostgreSQL has a default user “postgres”, and 1) you need to set up the password for it before a database can be connected from other clients (e.g. 3D database importer / exporter) and 2) this “postgres” user is not normally logged into directly, rather use “sudo -u postgres psql -p” to only use it to log into postgres server.

2. Connect the 3D City Database tool to the PostgreSQL database. Download the 3D City DB tools here. After running the install .jar file, you will notice there are basically a few sets of tools, of which we need to use two:

  • The SQL and Shell scripts used for setting up a geographic database
  • The importer / exporter used for importing data into the database

Before diving into details, here are two helpful resources you should check out for specific questions:

  • this documentation is very helpful and one actually needs to read it to proceed… i.e. no other better online documents…
  • This online Q&A on Github is actually active, the developers will answer questions there. I discovered it too late !

After getting the documentation, these sections are helpful:

  • Step 1 is to set up a PostgreSQL database using the scripts included in this 3D DB tool to set up all the schemas, this starts from page 102 / 318 of the version 4.0 documentation. Note you will need to create a postgis extension, details here. On Ubuntu you can get it with “apt install postgis” or something like this.
  • Step 2 is to use the importer / exporter to connect to that database and import data. The importer/exporter located in the “bin” folder, and you just run the .sh file to start it. Details can be found on page 125 / 318 of the version 4.0 documentation.

3. create tiled version of NYC data Some CityGML data is too large to load for other softwares, and the importer / exporter can help create tiles. Details on this are on Documentation Page 142 / 318 and 171 / 318. Essentially, we need to first read everything into a database, then output different tiles of the map. Note we should set up one database for one city.

On a high level, to set up the tiled export a user needs to 1) activated spatial indices under the “database” tab and 2) specified number of bounding boxes in the “preference” -> “Bounding Box” tab and 3) specified the bounding area in the “Export” tab. These would be enough for the program to start tiled export.

The documentation is not very clear on the details. Here are also two Q&As of help I got from other users:


2. FME Workbench: from cityGML to mesh

After the last section, we should already have a bunch of .gml files that are maps of different regions of a city. You can visualize a .gml file using FME software.  FME is a data warehouse tool and has two tools: FME workbench and FME data inspector. Data inspector is just for looking at data, while workbench can be used to edit data. You will need a license to open the software, but a student account can apply for free trial. The application process takes a few hours. Note you need to register your code on the FME licensing assistant and download the license file, or else the activation will void next time you open it.

For example, this is a region of Zurich looks like:

The next step is to extract individual buildings from it, as well as converting them into mesh files like .obj. The basic idea is:

  • First identify and output individual buildings from a single .gml file. This is done by a fanout operation.  You will also need an aggregator because a “building” consists of 1) roof surfaces, 2) wall surfaces, and 3) ground surfaces. Buildings can be identified by a parent id.
  • Second you will need to automate the process with all .gml files. This is done through setting up a file and document reader… i.e. you will need another FME workspace to read in different files in the same directory, and call the sub-workspace from this parent workspace.

FME also has a little bit of learning curve. To understand the above two operations, you might need to know:

Finally, here is a Q&A I posted on FME forum and people helped pointing me to the correct places.

Eventually, individual buildings will look like this:


3. PCL command line tool: from mesh to point cloud

The last step is to create point cloud from mesh, because we want to test methods specifically applicable to point cloud. I found the following tools :

  • This tool that someone gave me: It uses Poisson disk sampling. This tool have trouble processing 2D meshes (some building data seems to be dirty) as the function won’t return. I tried python package “signal” and “multiprocessing” to kill timeout function but neither work. So I gave up with this tool.
  • Cloud Compare: An open source file to edit cloud files. This tool has a command line tool but reports error when I try to save point cloud… So I gave up with this tool too.
  • PCL : Point Cloud Library.
  • pyntcloud: a python library. It seemed bit troublesome to read .obj instead of .ply files. It seems it mainly supports .ply files so I gave up with this tool

I consulted this Stackoverflow Q&A 

I eventually settled with use pcl_mesh_sampling. It can be installed on Ubuntu 16.04 with apt-get pcltools or something like that. The command is very easy to use: the first argument is the input file name the second argument is the output file name. Then you specify the number of points to sample.

A remaining issue is how to automate it. Since it is a command line a natural way is to use bash script, so you need some string manipulation in bash. A bigger problem is every time this tool generates and saves a point cloud, it will visualize it using a window. Until you close that Window, the process won’t finish. So we need to automatically “close” the window with anther shell script that calls xdotool at a fixed time interval to automatically close the specified Window. Note we cannot use the “windowkill ” option for xdotool but need to simulate the key stroke alt+F4 (i do not know why). Full command is

xdotool search “$WINDOWNAME” windowactivate –sync key –window 0 –clearmodifiers alt+F4

This Q&A is helpful (others are less).

Again, any questions please direct to…

Distributed Stochastic Gradient Descent with MPI and PyTorch

This post describes how to implement stochastic gradient in a distributed fashion with MPI. It will cover the following topics in a high-level fashion, as it is challenging to cover every details in a single post. I will point to other resources helpful in understanding key concepts.

  • What is MPI
  • How to do distributed SGD
  • PyTorch modules necessary for writing distributed SGD and how to design the program
  • Engineering caveats for building PyTorch with MPI backend 

What is MPI

I understand MPI as a programming framework that handles communication between computers. 

It’s like the lower-level software underneath MapReduce, another programming framework for distributed computing. In MapReduce you need to specify a map operation and a reduce operation and that’s all. The system take care of the allocation of workers and server and memory, which can of course taken care of by meta-information input into a shell script. The main challenge is designing the parallel algorithms at the algorithmic level, for example how to parse the two matrices when doing matrix multiplication, or how to design the mapper and reducer for min-hash and shingling documents. 

But in MPI, the programmer needs to specify the worker and server by their ID, and actively coordinate the computers, which probably requires another set of algorithms. For example, if many workers need to communicate with the server at the same time but the server cannot serve them all, there needs to be a scheduling algorithm e.g. round-robin. In MPI, the programmer needs to code the scheduling by themselves so that messages from different workers won’t be mixed. We’ll see an example later in the asynchronous SGD. 

MPI has many concrete implementations, for example OpenMPI and MPICH. 

MPI has the following basic operations:

  • point-to-point communications: blocking and non-blocking. This enables a single computer to send message to another, and another computer to receive message. blocking means the process is blocked until the send / receive operation is finished, while block means the process is returned immediately, not wait till the process is finished.
  • collective communications : examples are reduce and gather.

Here is a good place to learn basic operations of MPI. You need to go through all tutorials on that website to be able to understand and write a project that does distributed SGD with PyTorch.

How to do distributed SGD

There are two types of distributed SGD, depending on when to synchronize gradients computed on individual workers.

  •  synchronous SGD:
    • all workers have a copy of the model. they are all the same
    • at every mini-batch, all workers compute their share of gradient, and then compute average
    • the model update on every worker
    • then move on to the next batch
    • if there is a server,
      • the server has a copy of the model
      • the workers also have copies of the model
      • workers are assigned data to calculate forward and backward
      • gradients are sent to the server to take care of the averaging
      • workers receive updated gradients from the server
    • if there is no server, the gradient is calculated in all_reduce. all_reduce can (but not necessarily) be implemented in the ring algorithm: every model send the results to its neighbor ?
  • asynchronous SGD:
    • all workers have a copy of the model, but they are not the same. the model is inconsistent across workers
    • at every mini-batch:
      • workers get parameters from the server
      • workers get data from its own data loader, or randomly selected dataset (is there an epoch concept anymore?)
      • workers calculate forward and gradient
      • once the calculation is done, gradient is sent to the server,
      • the server updates the parameters

Now we need to conceptually understand how this workflow can be implemented using data structure and APIs provided by PyTorch. To do this, let’s first analyze how a single machine SGD is implemented in PyTorch.

In particular, in a single machine SGD implemented with PyTorch, 1) the output of a mini-batch is calculated by calling the forward function of the model, 2) the loss is calculated by sending the output and the target to a loss function, and 3) the gradient is calculated by calling loss.backward, and 4) the update to parameters is done by calling optimizer.step(), and we would pass the model parameters to the optimizer beforehand.

Now for this single-machine workflow, key functions are:

  • We calculate the gradients in the loss.backward() step
  • The gradients are stored in model.parameters()
  • The model parameters are updated when we call optimizer.step()

So, in a multi-machine scenario, the following questions need to be considered:

  1. How many copies of the model should be created?
  2. What should the server store?
  3. What should the slaves store?
  4. What should they communicate with each other?

To see why 1) is important, note the way we deploy MPI is by having the same script sending to every machine, but use MPI_rank to identify servers and slaves, and use if-else condition to decide which block of code should be run on which machine. So, we can theoretically create some objects when we encounter a server, but do not create these objects when we encounter a slave. To answer 1), an obvious answer is everybody has its own copy of a whole model (i.e. the whole computation graph created by model = myModelClass()), but is this necessary?

It turned out it is. Why? Although server only need the parameter value and gradient values to perform updates, and theoretically we only need to put the optimizer on server, and model parameters on slaves, this is not doable in operation because in PyTorch optimizer is tied to a set of model parameters. More, the whole computation graph contains more than data, but also relations. So, we must initialize a model and a optimizer on both server and slaves, and use communication to make sure their values are on the same page.

To answer question 4, here is the logic flow:

  • a slave establishes connection with the server
  • a slave fetches initial parameter values from the server
  • a slave breaks connection with the server
  • a slave calculate the loss on its own batch of data
  • a slave calculates the gradients by calling loss.backward()
  • a slave establishes connection with the server
  • a slave send the gradients and loss to the server
  • a slave breaks connection with the server
  • the server updates its parameter value.

Note here we have multiple steps concerning how to set up connections. In effect, parameter values have many layers corresponding to different layers of a neural network, and we have multiple salves. So if multiple salves trying to send the same set of parameters to the server, data from different sets might be messed up. In other words, MPI needs to know who sends data and programmers need to use a protocol to ensure that during the transmission, only one pair of connection is in effect.

Note MPI does not allow one slave to “block” other slaves, so we need to code this “block” up by using “handshake” technique in TCP. The idea is a slave first send a “hello” message to the server, and when a server is busy this message will wait in a queue until being received. And when the server is idle, and it receives the “hello” message, it will realize someone else is calling it and will wait for this particular person by only receiving message from the same ID for this round. It will also send a message to this ID, telling it “I’m ready for you and no one else is able to mess around”. When a slave receives this confirmation from the server, it can go on to send the important gradient information to the server. After it finishes, it will receive the information send from the server. And when the server finished sending update information to the slave, it will be idle again and able to receive “hello” from another slave. This is a very interesting process to establish connections!

Engineering caveats for building PyTorch with MPI backend 

Some caveats to watch out for:

  • To use PyTorch with MPI, it cannot be installed from Conda. You have to build it yourself. But PyTorch is a large package so check your disk space and act accordingly. It took me 4 days to build it successfully….

Some links:

  • PyTorch tutorial for distributed SGD. Note this guy uses a synchronized SGD so the process is easier – you can just run an all-reduce and you do not have to worry about slaves mess up with each other.
  • My implementation of the distributed SGD process…

Finally, this blog post is not very clearly written. If someone really reads this and has questions, please email me at

2D convolution with tiling technique on GPU

This post describes how to write CUDA C code to perform 2D convolution on GPU with tiling technique. It is very brief, only covers basic concepts but with links to code examples. We are not going to use cuDNN, only the bare bones of CUDA. But here is an excellent post if you want to use cuDNN – for this particular task, cuDNN is slower than the code I will present below.

You can find the source code for this post here.

This post will consist of the following sections:

  • Installation CUDA
  • Basic GPU programming model
  • Vanilla convolution on GPU
  • Constant memory in GPU
  • Tiling technique and indexing
  • Link to code example

Install CUDA

First, make sure if you have a NVIDIA GPU on your machine. Here is how. Or just search the model online and ask on reddit 🙂

Next, follow the official NVIDIA guide here to download CUDA Toolkit. Note not every card support every version of CUDA kit.

Note also the download path, as you will need to specify that to the compiler later. If you are using Linux you can type whereis nvcc to find out. nvcc is the NVIDIA C compiler.

It’s easy to get wrong. Just follow carefully every step in the official installation guide. In particular don’t forget to finish the post-installation set-ups! Wrong PATH environment variables can be a real pain.

Basic CUDA Programming model

The most prominent characteristic of CUDA programming is parallization. When you write one line of code in a CUDA kernel (explained later), it will be executed by thousands of threads on different data. This is called the SIMD (same instruction multiple data) parallel paradigm.

With this idea in mind, here is what a typical CUDA C code look like:

  • Setting up the program on the host (CPU) by initializing variables and populate them. Write non-parallel sequential code.
  • Preparing for GPU work by initializing device (GPU) variables that reside in device memory, and copy contents from the corresponding host variables to it.
  • Write a GPU kernel function to deal with the parallel part of computation (e.g. convolution).
  • Set up the parallel configurations by specifying how many blocks and grids are used for the kernel. These two variables will decide how many threads we started for parallel execution.
  • Call the kernel from host code. Store the results in a device variable.
  • Copy contents from device to the host.

Some concepts to clear:

  • How many threads we use in the GPU is controlled by Grid size and Block size. A grid consists of many blocks. Grid and Block are both 3D structures. For example, a grid of dimension <2, 2, 2> will have 8 blocks in total, and a block of dimension <2, 2, 2> will have 8 threads in total. A block cannot have more than 1024 threads.

Here is a basic matrix addition example:


// function declarations 
__global__ void vecAddKernel(float * a, float * b, float * c, unsigned int N);

// main function 
int main()
    int N = 10;    // length of vector 
    float * a, * b, * c;  // a and b are vectors. c is the result
    unsigned int size = N * sizeof(float);  // number of bytes to allocate 
    a = (float *)calloc(N, sizeof(float));
    b = (float *)calloc(N, sizeof(float));

    int i = 0;
    float sum = 0;
    for (i = 0; i < N; i++){
        a[i] = (float)i / 0.23 + 1;
        b[i] = (float)i / 5.89 + 9;
        sum += a[i] + b[i];

    c = (float*) malloc(size);
    // 1. allocate memory on CUDA
    float * d_a, * d_b, * d_c;   // device memory 
    cudaError_t err1 =  cudaMalloc((void **) & d_a, size);
    cudaError_t err2 = cudaMalloc((void **) & d_b, size);
    cudaError_t err3 = cudaMalloc((void **) & d_c, size);
    if (err1 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err1), __FILE__, __LINE__);
    if (err2 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err2), __FILE__, __LINE__);
    if (err3 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err3), __FILE__, __LINE__);
    // copy memory 
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. operate on kernels 
    vecAddKernel<<<ceil(N/256.0), 256>>>(d_a, d_b, d_c, N);

    // 3. copy the results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    cudaError_t error = cudaGetLastError();
        fprintf(stderr,"ERROR: %s\n", cudaGetErrorString(error) );
    float cuda_res = 0;
    for(i = 0; i < N; i++){
        printf("%.2f\t", c[i]);
        cuda_res += c[i];
    printf("Results from host :%.2f\n", sum);
    printf("Results from device:%.2f\n", cuda_res);


    return 0;

void vecAddKernel(float * a, float * b, float * c, unsigned int N){
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i<N)  c[i] = a[i] + b[i];


Vanilla convolution on GPU

Where to parallelize for convolution? We use every thread to correspond to one single element in the output, and all threads do the same convolution operation but with different data chunks.

Constant memory in GPU

Constant memory variables are shared by all threads in all blocks. This memory is small but fast. So we should put the filter in it because everyone needs it, and it’s small.

Tiling technique and indexing

Tiling is making use of the shared memory on GPU. Shared memory is shared by all threads in the same block, it is small but fast. So the idea is for every thread in a block, they will operate on data regions that have some overlaps. And we make use of that to copy the small region of the larger picture that these threads need into the shared memory and dealing with them.

We use all threads to copy these data elements. Here is a further illustration of this tiling idea.

Code example



Here is the tiling organization :

Understanding GRUs

This post follows this post last year about vanilla Recurrent Neural Network structures.

One of the ultimate goal of a recurrent neural network is to summarize a sequence of vectors into a single vector representing all the information.

In the simplest case, imagine for the past month, every day  you and your friend did some push-ups. Let’s denote your number as x_i at day i, and your friend’s number as y_i at day i. Now a month has passed and you two want to compare who is more keen on working out.

So what you do you? Directly comparing two vectors is hard, because there is no obvious “greater-than” relations across dimensions? One way is to calculate the simple average of the two sequences, i.e. on average you did \bar{x} and your friend did \bar{y} per day. Whichever number is larger, you can bet that person is a better athlete!

Now in this example, time order does not really matter, because we are treating numbers from 30 days ago as the same as number yesterday. Another example where time does matter is interests from bank savings. Suppose every year you save b_i amount of Yuan into the bank, and the bank pays out interests at the end of year, once per year. How much will your total savings be at the end of the fifth year? You can image the oldest money will receive the highest interests, because of confounding.

With the same logic, we can apply this idea to summarizing sentences. Suppose every word can be represented as a n-D vector in a semantic space. How do we produce one single vector that represent the meaning of this sentence?

Taking average is a way, but note every word will have some influence on the words after it. e.g. “this burger is delicious.” Note how the “burger” constrains word choices after “is” … And that’s why we need recurrent neural network : at every step of the model, the hidden vector (which contains information about previous words) will concatenate with the current word, and becomes the input into producing the next hidden vector. So every word will have a lasting influence in the final output.

However, simple RNN has a vanishing (explosive) gradient problem. Mathematically, the older a word is, the higher order it will has on the multiplication factor of weight matrix. When taking first-order gradients, the weight matrix will have an order of n-1. Now if one term is larger than 1, as n approaches infinity this will will approach infinity too. If one term is smaller than 1, as n approaches infinity this will will approach zero and thus the model will “stop learning” (i.e. weights will not update).

Let’s formulate the intuition above more formally. For a simple RNN, every update we have

h_t =g (W x_t + U h_{t-1} + b)

Here g can be any non-linear activation function, e.g. a RELU or a sigmoid. Now we consider the gradient of h_t with regard to W.


h_t = g (O_t)


O_t =W x_t + U h_{t-1} + b

Using the chain rule we have:

\frac{\partial h_t}{\partial W} =\frac{\partial h_t}{\partial O_t} \frac{\partial O_t}{\partial W}

We also know:

\frac{\partial O_t}{\partial W} = X_t + U_h \frac{\partial h_{t-1}}{\partial W}

So plug in the second equation above into the first equation, we have:

\frac{\partial h_t}{\partial W} = {\partial g} \cdot (X_t + U_h \frac{\partial h_{t-1}}{\partial W})

We can already see a recurrent relation between \frac{\partial h_t}{\partial W} and \frac{\partial h_{t-1}}{\partial W}. To make things clear, write \frac{\partial h_t}{\partial W} = \alpha_t, and expand the equation a few more steps:

\alpha_t = {\partial g} \cdot (X_t + U_h \alpha_{t-1})

\alpha_t = {\partial g} \cdot (X_t + U_h \cdot \partial g \cdot (X_{t-1} + \alpha_{t-2}) )


\alpha_t = C + (\partial g U_h)^{n-1} \alpha_0

Here $C$ represent the other terms (with lower order of \alpha_t-s) in the formula that we omitted for simplicity. So we can see, as n increases, if norm of \partial g U_h is greater than 1 this term will approach infinity, or if is less than 1 it will approach zero.

The main reason is the same term is multiplied n-1 times. i.e. information always flow at the same rate every step. If you think about a sentence, this does not really make sense as words meaning / importance does not really decrease (increase) exponentially w.r.t. it’s distance to the end of the sentence. The first word might very well be very important (e.g. a female name which will influence later pronoun choices “he” or “her”), but this vanilla RNN structure is too simple to capture that. We need more dynamic structures to allow information flow freely between time stamps.

Gated recurrent unit solves the problem by adapting different terms in every update step. There are two key ideas:

  • Introduces an external “memory” vector to capture long distance dependencies.
  • Allow error messages to flow at different strengths depending on inputs.

It achieves this by first computing two “gates”:

update gate : z_t = \sigma (W_z x_t + U_z h_{t-1} + b_z)

reset gate: r_t = \sigma (W_r x_t + U_r h_{t-1} + b_r)

They are both continuous vectors of the same length as the hidden state, constructed by passing the current word x_t and the last hidden vector h_{t-1} through an MLP. The sigmoid function makes every element between 0 and 1, so when used to perform element-wise multiplications \o, these two gates essentially controls how much information “flows” through.

There is also a candidate vector representing “new memory content”:

\tilde{h_t} = \text{tanh} (W_h X_t + r_t \circ (U_h h_{t-1} + b_h)

The usage of tanh and W_h X_t is pretty standard as in all other unites. But note the meaning of reset gate r_t here. If r_t is close to zero, then we “forget” all the previous information in the h_{t-1} vector, and just use the current word information.

An example where the reset gate is useful is sentiment analysis on a movie review, where the author spend many sentences describing and summarizing the movie plot, but conclude that “But the movie is really boring”. With this sentence, all the previous information become useless. This reset gate can help the model “forget” the previous summaries.

Finally, we update the new hidden vector as :

h_t =  z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h_t}

You can see the update controls how much information is remembered from old state, and how much information is acquired from the new memory content.

A caveat is why do we need the reset gate at all now that we have the update gate? Doesn’t we already have a gate to control how much old information is retained by using an update gate? Doesn’t the update gate also have the ability to eliminate all historical information embedded in h_{t-1}? I did not find satisfactory information about this point online.

Now think about why GRU can solve the vanish gradient problem. The update gate allows us to retain all previous information by setting all elements of z_t to 1, and h_t is exactly the same as h_{t-1}. In this case, we do not have the vanishing gradient problem because no weight matrix is multiplied.

For example, in the same movie review suppose the author says “I love this movie so much” and then goes on to explain the movie plot. Now the first sentence is really important, but with recurrent neural network, we update the gradient at every step and the content will be washed out as time passes. But now the update gate allows the model to retain the first sentence without exponentially decay its content.

Units with short term memories usually have reset gate very activate (i.e. numbers close to 1), and thus forget most about the past…

If reset gate is entirely 1 and update gate is entirely zero, then we just have a standard RNN.

Why tanh and sigmoid?

  • In theory tanh can also be RElu, but in practice it just performs better.
  • Sigmoid will control value between 0 and 1, but again no formal justification.

Finally, here is an illustration that pictures what’s going on in a GRU unit:


pointnet 论文理解

这几天(上周)在看 pointnet 系列的论文,具体研究问题是如何处理点云数据。大的研究领域是机器学习如何处理3D视觉问题,在无人车和机器人领域有应用,因为 depth camera,传感器,激光雷达传回来的图片有些是 point cloud。

点云数据是用点来代表三维图像的一种数据格式,例如一张椅子的三维图。从椅子平面上抽取2400个点,每个点是 (x, y, z) 的一个三维向量,那么把这些点在三维空间里画出来,就会得到飞机的图片。这里有一个可以在线观看的例子。这张图片里的点很多,所以观看效果很逼真。如果点少的话,就会变成这样

不同于2D图片,点云数据是无序的。传统 CNN 通过滑动 kernel 的卷积操作,能提取出 local 像素点之间的依赖关系(例如 max pooling,以及 convolution 是 weighted sum of local points)。但点云数据没有清晰的 local 关系。如何解决呢?1)可以把点云数据复原到三维,把它们分割成一小格(voxel)然后进行3D卷积。2)可以把点云数据投影到二维,代表算法有MV3D 和 AVDO

pointnet 是这系列的重点论文,2016年发表。这篇论文说不需要进行这样的处理,而是直接把点云数据输入到 multi-layer perceptron 中。步骤如下:

  1. 假设 batch size = 32, 每张图片有 1024个点。那么每批次输入的数据形状为 32 * 1024 * 3
  2. 首先把这个数据中每一张图片和一个 3 * 3 的 transform matrix 相乘。整体的形状,就是 32 * 1024 * 3 的 tensor乘上 32 * 3 * 3 的 tensor,使用 tf.matmul 函数,得到的结果是 32 * 1024 * 3。物理意义是把这个点进行坐标变换。这样做的原因是这篇讲  spatial transformer 的论文,还没看。
    • 这个 transform matrix 并不是随机生成,而是一个较复杂的 T-net。它的网络结构和这个大的网络结构一样,也是将输入图片几次升维,非线性变换 + bias,然后降维,得到的矩阵。所以不同的图片虽然分享了网络参数,但是并没有乘以相同系数。
    • 为什么这个网络这么复杂?其实和GRU这类网络有异曲同工之处,都是用当下数据既生成权重,又用这个权重再次处理当下数据。
  3. 图片处理好之后的形状是  32 * 1024 * 3. 接下来是第一次升维。具体操作是,每个点的三个维度,求 weighted sum 获得一个 scalar。然后这样的操作重复64次,得到一个64维的向量。其实也就是长度为3的向量乘以 3 * 64 的矩阵。这样会得到 32 * 1024 * 64 维的数据。将这个数据加上随机的 bias 之后,做 batch normalization,形状不变。
    • 在作者的源代码中,这一步没有用拼接 64 个线性层的方法,而是用 tensorflow 中的 conv2d 实现。通过变换 channel 数来代表64个 fully connected layer。
  4. 接下来又是一次“升维”操作,和上一步一样,但是输出的维度仍然是64.
  5. 现在的数据形状是 32 * 1024 * 64.接下来是 feature transform。通过同样的网络结构,获得一个 32 * 64 * 64 的 tensor。用 tf.matmul 把每一个点再次变换,原理和2)一样。
  6. transform 之后,将图片进行升维。先升到64维,然后是128,最后是 1024.最后我们得到一个 32 * 1024 * 1024 的数据形状。
    • 在作者的源码中,数据形状是32 * 1024 * 1 * 1024,因为要使用 conv2d 操作。
  7. 接下来是关键一步:要概括总结上一步中每个图片中所有的点,得到代表全局信息的向量。这里通过 maxpooling 实现,也就是对每个 channel ,取图片中所有点中的最大值。也就是 element-wise max pooling。
    • 源码中的实现是通过tf.max_pool 函数,输入就是 32 * 1024 * 1 * 1024 的这个数据,kernel size = [1, num_points (1024), 1, 1],输出是 32 * 1 * 1 * 1024。
    • 为什么不用 sum 或者 average?这篇论文针对 max pooling 操作提出了理论性质,我还没看。
  8. 一番折腾后,终于得到了代表每张图片的全局向量。下一步是reshape ,成为形状为 32 * 1024 的一个数据块。把两个多余维度扔掉。


  • classification:只需要全局表征即可。把这个表征通过三层 fully connected layers 生成长度为 40 的向量(因为一共40类),通过 softmax 层得到概率,用 cross entropy 的均值作为 loss function,通过一维逼近得到参数的最优值。
  • segmentation:作者的处理是对每个点做 classification,每个点都贴上这个全局变量后,进行分类。
  • semantic segmentation,不是很懂,略过。

这篇文章的重点在于1)将简单的网络结构直接用于点云数据,2)maxpooling 操作的性质(invariant to input orders)

Writing Python tests: doctest, unittest, and pytest

So I set myself another task the other day: writing tests for an open-source project —- a baby step towards software engineering! This post describes the tools and practices I’ve learnt.

The project I’ve committed to is here. The goal is to write tests for an experiment class, whose objective is to save state of machine learning experiments and quickly retrieve them.

There are two sets of tools to be learnt: 1) conceptually, what is a test and how to  write one? 2) practically, what tools does Python offer for writing and organizing test code, and how to use them? I will answer these two sets of questions below.

What is software testing and how to write it?

Testing is just write code to test if a software behaves as expected. If it does not, then there is a bug, and bugs are bad. Most of the online materials / books on this are kind of fluffy, describing concepts that are hard to define and unnecessarily convoluted. But really I just understand test to be like, this piece of software is supposed to output 1 when I input 3, is that what it is behaving?

There are also edge cases to be tested and taken care of accordingly.

So the real issue seems to be how to write good tests. This is a very broad topic with no definite answer, and I’ll omit discussion in this post. For this project, I’m following this guide.

Testing tools in Python 

  1. doctest
  2. unittest
  3. pytest

doctest  is packaged with Python. The usage is the following:

  • Open a Python interpreter, and interactively test a bunch of results with the function to be tested
  • Copy and paste all output from the interactive session to the docstring of that specific function
  • add a if __name__ == "__main__" line in the script to be tested so as to detect if it is imported or being executed standalone. If it is executed standalone, add one line doctest.testmod() so the doctest will run the test mode..

this module will parse the docstring to get expected results for every input. It will then compare the actual output with the expected input, and return errors if they do not match. If everything runs fine, the module will not output anything.

unittest the biggest difference with doctest is test cases are no longer defined in the original script. Instead, we now create a new script, import the old script and its methods, and test all out test cases there. The good news is now script is separated from tests, the bad news is we need to write more code.

How to use unittest module?

  • First create a new python script, import bothunittest and the original script in it. This script will be run alone, so need to specify the shebang line as well.
  • Create a unittest.TestCase class. syntax is class MyTestCase(unittest.TestCase), i.e. MyTestCase is a subclass of unittest.TestCase.
  • So the real issue is what methods we can use with unittest.TestCase. These methods are used to assert if the expected value is equal to the real reaults. Here is a link to the useful parts of its documentation. Basically, the testing process is a series of assertion processes.
  • The most naive way is just to define a bunch of assertTrue() methods in the class one by one. A more mature way is to use the setUp and tearDown methods to put all test cases into a data structure, e.g. a list, and in the test function body use a for loop to iterate through that data structure. Sample code here :

pytest is a third party package. It is easier to use than the unittest module, because it does not require all the class and method setups.

  • The simplest case, first create a python script with 1) the function to be tested and a function to test that function with assert. e.g. assert func(4) == 5
  • Then, run the scrip with pytest from command line in the folder containing the script. You do not even need to import pytest in!
  • On the other hand, if we do not specify an argument to pytest, it will run test on every file of the form * or test_*.py in the current directory. This is a convention to discover test files. Also see this link.

Finally, adding some good resources on pytest that I just found:

  • A webinar :
  • The accompany git repo:

Some updates on using pytest with PyCharm

  • If you want to run a single test instead of the whole script, see this SO discussion. The pipeline might be broken though.
  • Pycharm professional has a test-runner tab where you could use auto-test and it will re-run a test function after modification. Link here.
  • Add verbose option for pytest. It’s in “edit configurations” — pytest — additional argument. Specify a -v.
  • Put tests in a class so you can run them together. i.e. if they are related and always need to be run together.
  • auto test —- you can have test run continuously when you type

Happy testing!

Using ROS with Conda

For the past two weeks I’ve been playing around with ROS. All was fine until I tried to embed tensorflow and “deep learning” stuff into ROS with conda.

Things went south immediately. Pipelines broke down, what used to work no longer worked anymore. roslaunch showed `not a package` error, import in Python showed package not found even though I just pip installed it…

This post aims to clarify what happens, why my system became a mess, and how to resolve them.

The first problem with error importing modules in python is because ROS conflicts with Conda. This is because ROS sneakly changes the Python path in `~/.bash_rc` file. According to this post by Udacity NanoDegree github:

The problem here is that Python is looking for a package in /opt/ros/kinetic/lib/python2.7/dist-packages/ when it should be looking in the Anaconda installation directories. The reason for this is that with your ROS install, your .bashrc file contains the line source /opt/ros/kinetic/setup.bash, which sets your PYTHONPATH variable to the /opt/ros/kinetic/... location.

The solution is to unset your PYTHONPATH variable in your ~/.bash_rc file, right after sourcing ROS, but before sourcing your Anaconda environment. Use this command unset PYTHONPATH.

The second issue is I used many package installers to install python packages (and possibly python itself). Be careful of online tutorials my dude! I ends up having tens of python paths, scattered around different virtual environments, and things became very convoluted.

Package managers I used include:

  • pip (system wide and within conda; I have five pips …)
  • apt (did not figure out what it did until too late)
  • conda

Where do these package managers install packages?

  • pip:
    • When used without conda environment, will show the base python dir
    • pip when used with virtualenv will generally install packages in the path <virtualenv_name>/lib/<python_ver>/site-packages. For example, I created a test virtualenv named venv_test with Python 2.7, and the django folder is in venv_test/lib/python2.7/site-packages/django.
  • conda will install in the environment’s python site-packages.
    • If there is no Python2 but we want a python2 package, conda seems will install it to the system-wide python site-packages

A detailed comparison about pip and apt:

pip is used to download and install packages directly from PyPI. PyPI is hosted by Python Software Foundation. It is a specialized package manager that only deals with python packages.

apt-get is used to download and install packages from Ubuntu repositories which are hosted by Canonical.

Some of the differences between installing python packages from apt-get and pip are as follows:

  • Canonical only provides packages for selected python modules. Whereas, PyPI hosts a much broader range of python modules. So, there are a lot of python modules which you won’t be able to install using apt-get.
  • Canonical only hosts a single version of any package (generally the latest or the one released in recent past). So, with apt-get we cannot decide the version of python-package that we want. pip helps us in this situation. We can install any version of the package that has previously been uploaded on PyPI. This is extremely helpful in case of conflict in dependencies.
  • apt-get installs python modules in system-wide location. We cannot just install modules in our project virtualenv. pip solves this problem for us. If we are using pip after activating the virtualenv, it is intelligent enough to only install the modules in our project virtualenv. As mentioned in previous point, if there is a version of a particular python package already installed in system-wide location, and one of our project requires an older version of the same python package, in such situations we can use virtualenv and pip to install that older version of python package without any conflicts.
  • As @Radu Rădeanu pointed out in this answer, there would generally be difference in names of packages as well. Canonical usually names Python 2 packages as python-<package_name> and Python 3 packages as python3-<package_name>. Whereas for pip we generally just need to use <package_name> for both Python 2 as well as Python3 packages.

Which one should you use:

Both apt-get and pip are mature package managers which automatically install any other package dependency while installing. You may use anyone as you like. However, if you need to install a particular version of python-package, or install the package in a virtualenv, or install a package which is only hosted on PyPI; only pip would help you solve that issue. Otherwise, if you don’t mind installing the packages in system-wide location it doesn’t really matter whether you use apt-get or pip.

Also, this link that summarizes this problem perfectly:


The third issue is I confused Python packages with ROS packages somewhere in my projects. This is more conceptual but did misguide my actions. Here are some clarifications:


  • ROS is a programming ecosystem/framework that provide middleware for modules (packages) to communicate, especially for the purpose of robotics systems
  • ROS provide distribution in Python and C++
  • ROS code is organized by packages: folders containing src, data, other libraries…
  • ROS has a $ROS_PACKAGE_PATH, which is an environment variable indicating where to find packages for ROS. For example: `echo $ROS_PACKAGE_PATH` will give the output /opt/ros/kinetic/share


  • Python path: environment variables. Where to find import packages… the module search path
  • modules: individual scripts
  • package: a collection of modules
  • Given this structure, if the pkg directory resides in a location where it can be found (in one of the directories contained in sys.path), you can refer to the two modules with dot notation (pkg.mod1, pkg.mod2) and import them with the syntax you are already familiar with
  • The file in a package will import the modules into the package namespace, so that we can directly use imported modules in the main function.


Finally, some useful scripts for inspection and cleaning:

  • Find all Python:  sudo find / -type f -executable -iname 'python*' -exec file -i '{}' \; | awk -F: '/x-executable; charset=binary/ {print $1}' | xargs readlink -f | sort -u | xargs -I % sh -c 'echo -n "%: "; % -V'
  • `whereis pip`   finds all pips