Rereading: Hierarchical Attention Networks for Document Classification

The key idea of this paper is that adding information about a document's hierarchical structure to the model helps produce better text representations.

The document structure in question is that documents are composed of sentences, forming a hierarchy. Previous approaches concatenated all sentences into a single sequence and fed it to one RNN, which loses hierarchical structure such as paragraphs.

An alternative is to feed the words of each sentence, step by step, into one RNN to produce sentence representations; then feed all the sentence representations into another RNN to produce a document representation; and finally classify using the document representation. The two networks do not share parameters, and the data passes through an RNN twice, so this approach can capture the hierarchical structure.

Processing text hierarchically also allows a more flexible use of attention. Applying attention separately at the word level and the sentence level lets the representation draw more information from the specific "important" elements.

The concrete network implementation has a few notable features:

The implementation of the attention mechanism:

  • Feed the word vectors of each sentence into a GRU (see the last post for an explanation of what a GRU is) and collect the hidden state output at every step (so you cannot directly call PyTorch's nn.GRU as-is; a small change is needed: write a for-loop and store the results)
  • Feed all the hidden states into an MLP to produce a representation of each word
  • Randomly initialize a context vector, which is meant to capture the meaning of the sentence. Take its dot product with each word representation to get an attention score. The higher the score, the more similar the two vectors, meaning the word carries more salient meaning in this sentence, and so it should receive a higher attention weight.
  • Pass the attention scores through a softmax to get the weights, then take the weighted sum of the original hidden state sequence with these weights to obtain the representation of the whole sentence (see the sketch after this list).
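Here is a minimal PyTorch sketch of this word-level attention; the dimensions and names are my own illustration, not the paper's code. (Note that nn.GRU's output tensor already contains the hidden state at every step, which is exactly what the attention needs.)

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    # Word-level encoder plus attention: a sketch of the HAN word layer.
    def __init__(self, emb_dim=100, hid_dim=50):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.mlp = nn.Linear(2 * hid_dim, 2 * hid_dim)          # per-word representation
        self.context = nn.Parameter(torch.randn(2 * hid_dim))   # randomly initialized context vector

    def forward(self, words):                 # words: (batch, seq_len, emb_dim)
        h, _ = self.gru(words)                # hidden states at every step: (batch, seq_len, 2*hid_dim)
        u = torch.tanh(self.mlp(h))           # MLP over the hidden states
        scores = u @ self.context             # dot products with the context vector: (batch, seq_len)
        alpha = F.softmax(scores, dim=1)      # attention weights
        return (alpha.unsqueeze(-1) * h).sum(dim=1)   # weighted sum = sentence representation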

The implementation of the hierarchical structure:

  • We train only two sets of GRU units in total: one summarizes sentences, the other summarizes documents. Therefore all sentences must have the same length, and all documents must have the same number of sentences. During preprocessing, sentences that are too long are truncated and those that are too short are padded; the specific lengths can be chosen by looking at the length distribution of the training data and picking a quantile.
  • Once the data is regularized this way, first compute each sentence's representation with the method above.
  • Next, for each document, feed all of its sentence representations into the second GRU to get the document representation.
  • Finally, use this representation for classification (see the sketch after this list).
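A sketch of how the two levels compose into a classifier, reusing the WordAttention module from the sketch above (shapes and class names are illustrative):

class HAN(nn.Module):
    # Compose the word-level and sentence-level encoders described above.
    def __init__(self, emb_dim=100, hid_dim=50, n_classes=5):
        super().__init__()
        self.word_att = WordAttention(emb_dim, hid_dim)
        self.sent_att = WordAttention(2 * hid_dim, hid_dim)   # same mechanism, one level up
        self.out = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, docs):   # docs: (batch, n_sents, n_words, emb_dim), padded/truncated
        b, s, w, e = docs.shape
        sents = self.word_att(docs.view(b * s, w, e)).view(b, s, -1)  # sentence vectors
        doc = self.sent_att(sents)        # document vector from the sentence-level GRU + attention
        return self.out(doc)              # logits for the classifier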

Paper link: https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf

2D convolution with tiling technique on GPU

This post describes how to write CUDA C code to perform 2D convolution on a GPU with the tiling technique. It is very brief and only covers basic concepts, but comes with links to code examples. We are not going to use cuDNN, only the bare bones of CUDA. But here is an excellent post if you want to use cuDNN; for this particular task, cuDNN is slower than the code I present below.

You can find the source code for this post here.

This post will consist of the following sections:

  • Install CUDA
  • Basic CUDA programming model
  • Vanilla convolution on GPU
  • Constant memory in GPU
  • Tiling technique and indexing
  • Link to code example

Install CUDA

First, make sure you have an NVIDIA GPU on your machine. Here is how. Or just search the model online and ask on reddit 🙂

Next, follow the official NVIDIA guide here to download the CUDA Toolkit. Note that not every card supports every version of CUDA.

Also note the download path, as you will need to specify it to the compiler later. If you are using Linux you can type whereis nvcc to find out; nvcc is the NVIDIA C compiler.

It's easy to get this wrong, so just follow every step in the official installation guide carefully. In particular, don't forget to finish the post-installation set-up! Wrong PATH environment variables can be a real pain.

Basic CUDA programming model

The most prominent characteristic of CUDA programming is parallelization. When you write one line of code in a CUDA kernel (explained later), it is executed by thousands of threads on different data. This is called the SIMD (single instruction, multiple data) parallel paradigm.

With this idea in mind, here is what a typical CUDA C program looks like:

  • Set up the program on the host (CPU) by initializing variables and populating them. This is ordinary sequential code.
  • Prepare for GPU work by initializing device (GPU) variables that reside in device memory, and copy the contents of the corresponding host variables to them.
  • Write a GPU kernel function to handle the parallel part of the computation (e.g. convolution).
  • Set up the parallel configuration by specifying how many grids and blocks the kernel uses. These two numbers decide how many threads are started for parallel execution.
  • Call the kernel from host code and store the results in a device variable.
  • Copy the contents from the device back to the host.

Some concepts to clarify:

  • How many threads the GPU runs is controlled by the grid size and the block size. A grid consists of many blocks, and grid and block are both 3D structures. For example, a grid of dimension <2, 2, 2> has 8 blocks in total, and a block of dimension <2, 2, 2> has 8 threads in total. A block cannot have more than 1024 threads.

Here is a basic vector addition example:

#include<stdio.h>
#include<cuda.h>
#include<math.h>   // for ceil() in the kernel launch configuration

// function declarations 
__global__ void vecAddKernel(float * a, float * b, float * c, unsigned int N);

// main function 
int main()
{   
    int N = 10;    // length of vector 
    float * a, * b, * c;  // a and b are vectors. c is the result
    unsigned int size = N * sizeof(float);  // number of bytes to allocate 
    a = (float *)calloc(N, sizeof(float));
    b = (float *)calloc(N, sizeof(float));

    int i = 0;
    float sum = 0;
    for (i = 0; i < N; i++){
        a[i] = (float)i / 0.23 + 1;
        b[i] = (float)i / 5.89 + 9;
        sum += a[i] + b[i];
    }

    c = (float*) malloc(size);
 
    // 1. allocate memory on CUDA
    float * d_a, * d_b, * d_c;   // device memory 
    cudaError_t err1 =  cudaMalloc((void **) & d_a, size);
    cudaError_t err2 = cudaMalloc((void **) & d_b, size);
    cudaError_t err3 = cudaMalloc((void **) & d_c, size);
    if (err1 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err1), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }
    if (err2 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err2), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }
    if (err3 != cudaSuccess){
        printf("%s in %s at line %d\n", cudaGetErrorString(err3), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }
     
     
     
    // copy memory 
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. operate on kernels 
    vecAddKernel<<<ceil(N/256.0), 256>>>(d_a, d_b, d_c, N);

    // 3. copy the results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();
    cudaError_t error = cudaGetLastError();
    if(error!=cudaSuccess)
    {
        fprintf(stderr,"ERROR: %s\n", cudaGetErrorString(error) );
        exit(-1);
    }
 
    float cuda_res = 0;
    for(i = 0; i < N; i++){
        printf("%.2f\t", c[i]);
        cuda_res += c[i];
    }
 
    printf("Results from host :%.2f\n", sum);
    printf("Results from device:%.2f\n", cuda_res);

    // free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(a);
    free(b);
    free(c);

    return 0;
}

__global__
void vecAddKernel(float * a, float * b, float * c, unsigned int N){
    int i = threadIdx.x + blockDim.x * blockIdx.x;  // global index of this thread
    if (i < N)  c[i] = a[i] + b[i];                 // guard: the last block may overshoot N
}

 

Vanilla convolution on GPU

Where do we parallelize convolution? We make every thread correspond to a single element of the output: all threads perform the same convolution operation, each on a different chunk of data.
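To make the thread-to-output mapping concrete, here is the per-element logic as a NumPy sketch (my own illustration, not this post's CUDA code): on the GPU the two outer loops disappear, and row and col come from the block and thread indices instead.

import numpy as np

def conv2d_naive(image, mask):
    # Each CUDA thread computes exactly one out[row, col] with this inner logic.
    H, W = image.shape
    k = mask.shape[0]              # assume a square mask with odd side length
    r = k // 2                     # mask radius
    out = np.zeros((H, W), dtype=image.dtype)
    for row in range(H):           # on the GPU: row = blockIdx.y * blockDim.y + threadIdx.y
        for col in range(W):       # on the GPU: col = blockIdx.x * blockDim.x + threadIdx.x
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    y, x = row + i - r, col + j - r
                    if 0 <= y < H and 0 <= x < W:   # ghost cells outside the image count as 0
                        acc += image[y, x] * mask[i, j]
            out[row, col] = acc
    return out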

Constant memory in GPU

Constant memory variables are shared by all threads in all blocks. This memory is small but fast, so we should put the convolution filter there: every thread needs it, and it is small.

Tiling technique and indexing

Tiling makes use of the shared memory on the GPU. Shared memory is shared by all threads in the same block; it is small but fast. The idea is that the threads in a block operate on data regions that overlap, and we exploit this by copying the small region of the larger picture that these threads need into shared memory and working on it there.

All threads cooperate to copy these data elements. Here is a further illustration of this tiling idea.
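Below is a rough NumPy simulation of the tiling idea (again my own sketch; TILE is an illustrative tile size). Each iteration of the two outer loops plays the role of one thread block, which first stages the patch its threads share (the tile plus a halo of mask radius r) and then computes outputs only from that staged patch.

import numpy as np

def conv2d_tiled(image, mask, TILE=16):
    # Simulate shared-memory tiling: one (by, bx) iteration = one thread block.
    H, W = image.shape
    k = mask.shape[0]
    r = k // 2
    out = np.zeros((H, W), dtype=image.dtype)
    padded = np.pad(image, r)     # ghost cells; the CUDA kernel uses bounds checks instead
    for by in range(0, H, TILE):
        for bx in range(0, W, TILE):
            # cooperative load: the (TILE + 2r) x (TILE + 2r) patch this block needs
            patch = padded[by:by + TILE + 2 * r, bx:bx + TILE + 2 * r]
            for ty in range(min(TILE, H - by)):       # one (ty, tx) pair per thread
                for tx in range(min(TILE, W - bx)):
                    # each thread convolves using only the staged patch
                    out[by + ty, bx + tx] = np.sum(patch[ty:ty + k, tx:tx + k] * mask)
    return out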

Code example

Here.

 

Here is the tiling organization:

Understanding GRUs

This post follows up on last year's post about vanilla recurrent neural network structures.

One of the ultimate goals of a recurrent neural network is to summarize a sequence of vectors into a single vector representing all the information.

In the simplest case, imagine that for the past month, you and your friend did some push-ups every day. Denote your count on day i as x_i, and your friend's count as y_i. Now that a month has passed, you two want to compare who is keener on working out.

So what do you do? Directly comparing two vectors is hard, because there is no obvious "greater-than" relation across dimensions. One way is to calculate the simple average of each sequence, i.e. on average you did \bar{x} push-ups per day and your friend did \bar{y}. Whichever number is larger, you can bet that person is the better athlete!

In this example, time order does not really matter: we treat numbers from 30 days ago the same as yesterday's numbers. An example where time does matter is interest on bank savings. Suppose every year you save b_i yuan in the bank, and the bank pays out interest at the end of each year. How much will your total savings be at the end of the fifth year? The oldest money receives the most interest, because of compounding.

With the same logic, we can apply this idea to summarizing sentences. Suppose every word can be represented as an n-dimensional vector in a semantic space. How do we produce one single vector that represents the meaning of the whole sentence?

Taking the average is one way, but note that every word has some influence on the words after it. E.g. in "this burger is delicious", note how "burger" constrains the word choices after "is". That is why we need a recurrent neural network: at every step of the model, the hidden vector (which contains information about previous words) is combined with the current word and becomes the input for producing the next hidden vector. So every word has a lasting influence on the final output.

However, the simple RNN has a vanishing (exploding) gradient problem. Mathematically, the older a word is, the higher the power of the weight matrix it gets multiplied by. When taking first-order gradients, the weight matrix appears raised to the power n-1. If one factor is larger than 1, this product approaches infinity as n grows; if one factor is smaller than 1, the product approaches zero and the model "stops learning" (i.e. the weights stop updating).

Let's formulate the intuition above more formally. For a simple RNN, at every update we have

h_t =g (W x_t + U h_{t-1} + b)

Here g can be any nonlinear activation function, e.g. a ReLU or a sigmoid. Now we consider the gradient of h_t with respect to W.

Write

h_t = g (O_t)

and

O_t = W x_t + U h_{t-1} + b

Using the chain rule we have:

\frac{\partial h_t}{\partial W} =\frac{\partial h_t}{\partial O_t} \frac{\partial O_t}{\partial W}

We also know:

\frac{\partial O_t}{\partial W} = x_t + U \frac{\partial h_{t-1}}{\partial W}

Plugging the second equation into the first, and writing g' for the derivative of g (evaluated at O_t), we have:

\frac{\partial h_t}{\partial W} = g' \cdot (x_t + U \frac{\partial h_{t-1}}{\partial W})

We can already see a recurrence relation between \frac{\partial h_t}{\partial W} and \frac{\partial h_{t-1}}{\partial W}. To make things clear, write \frac{\partial h_t}{\partial W} = \alpha_t, and expand the equation a few more steps:

\alpha_t = g' \cdot (x_t + U \alpha_{t-1})

\alpha_t = g' \cdot (x_t + U \cdot g' \cdot (x_{t-1} + U \alpha_{t-2}))

\ldots

\alpha_t = C + (g' U)^t \alpha_0

Here C collects the other, lower-order terms that we omit for simplicity. So we can see that as t increases, if the norm of g' U is greater than 1 this term approaches infinity, and if it is less than 1 it approaches zero.

The root cause is that the same term is multiplied over and over, i.e. information always flows at the same rate at every step. If you think about a sentence, this does not really make sense: a word's meaning or importance does not decrease (or increase) exponentially with its distance from the end of the sentence. The first word might very well be very important (e.g. a female name that influences a later pronoun choice between "he" and "she"), but this vanilla RNN structure is too rigid to capture that. We need a more dynamic structure that allows information to flow at varying strengths between time steps.

The gated recurrent unit solves the problem by adapting the terms in every update step. There are two key ideas:

  • Introduce an external "memory" vector to capture long-distance dependencies.
  • Allow error messages to flow at different strengths depending on the inputs.

It achieves this by first computing two “gates”:

update gate: z_t = \sigma (W_z x_t + U_z h_{t-1} + b_z)

reset gate: r_t = \sigma (W_r x_t + U_r h_{t-1} + b_r)

They are both continuous vectors of the same length as the hidden state, constructed by passing the current word x_t and the previous hidden vector h_{t-1} through an MLP. The sigmoid function makes every element lie between 0 and 1, so when used in element-wise multiplication \circ, these two gates essentially control how much information "flows" through.

There is also a candidate vector representing “new memory content”:

\tilde{h}_t = \text{tanh} (W_h x_t + r_t \circ (U_h h_{t-1} + b_h))

The use of tanh and W_h x_t is pretty standard, as in all other units. But note the meaning of the reset gate r_t here: if r_t is close to zero, we "forget" all the previous information in the h_{t-1} vector and just use the current word's information.

An example where the reset gate is useful is sentiment analysis of a movie review, where the author spends many sentences describing and summarizing the movie plot, but concludes "But the movie is really boring". With this sentence, all the previous information becomes useless. The reset gate helps the model "forget" the previous summaries.

Finally, we update the new hidden vector as :

h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t

You can see that the update gate controls how much information is remembered from the old state, and how much is acquired from the new memory content.
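To make the update rule concrete, here is a minimal NumPy sketch of a single GRU step implementing the three equations above (the parameter dict and its shapes are my own illustration):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    # One GRU update; p holds the weights W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h.
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])              # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])              # reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + r * (p["U_h"] @ h_prev + p["b_h"]))  # new memory content
    return z * h_prev + (1.0 - z) * h_tilde   # mix the old state with the new content

# Summarizing a sequence = folding gru_step over the word vectors:
#     h = np.zeros(hid_dim)
#     for x_t in sentence: h = gru_step(x_t, h, params)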

A caveat: why do we need the reset gate at all, now that we have the update gate? Don't we already have a gate that controls how much old information is retained? Doesn't the update gate also have the ability to eliminate all the historical information embedded in h_{t-1}? I did not find satisfactory information about this point online.

Now think about why the GRU can solve the vanishing gradient problem. The update gate allows us to retain all previous information by setting all elements of z_t to 1, so that h_t is exactly the same as h_{t-1}. In this case we do not have the vanishing gradient problem, because no weight matrix is multiplied in.

For example, suppose in the same movie review the author says "I love this movie so much" and then goes on to explain the plot. The first sentence is really important, but a vanilla recurrent network updates the state at every step, and the early content is washed out as time passes. The update gate allows the model to retain the first sentence without exponentially decaying its content.

Units with short-term memories usually have a very active reset gate (i.e. values close to 0), and thus forget most of the past…

If the reset gate is all 1s and the update gate is all 0s, we recover the standard RNN.

Why tanh and sigmoid?

  • In theory the tanh could also be a ReLU, but in practice tanh just performs better.
  • The sigmoid keeps values between 0 and 1, but again there is no formal justification.

Finally, here is an illustration of what goes on inside a GRU unit:


Understanding the PointNet paper

These past few days (last week) I have been reading the PointNet series of papers, whose concrete research problem is how to process point cloud data. The broader research area is machine learning for 3D vision, with applications in autonomous driving and robotics, because some of the data returned by depth cameras, sensors, and lidar comes as point clouds.

Point cloud data represents a 3D image with a set of points, for example a 3D image of a chair. Sample 2400 points from the chair's surfaces, each point a three-dimensional vector (x, y, z); plot these points in 3D space and you get a picture of the chair. Here is an example you can view online. That picture has many points, so it looks very realistic. With fewer points it looks like this

Unlike 2D images, point cloud data is unordered. A traditional CNN, through convolutions with a sliding kernel, extracts dependencies between local pixels (e.g. max pooling, and convolution as a weighted sum of local points). But point cloud data has no clean notion of local structure. How to solve this? 1) Restore the points to 3D space, partition it into small cells (voxels), and run 3D convolutions. 2) Project the point cloud onto 2D; representative algorithms are MV3D and AVOD.

PointNet, published in 2016, is the central paper of this series. It argues that no such processing is needed: point cloud data can be fed directly into multi-layer perceptrons. The steps are as follows:

  1. Suppose the batch size is 32 and each point cloud has 1024 points. Each input batch then has shape 32 * 1024 * 3.
  2. First multiply each cloud in the batch by a 3 * 3 transform matrix. Overall, a 32 * 1024 * 3 tensor is multiplied by a 32 * 3 * 3 tensor using tf.matmul, and the result is 32 * 1024 * 3. The physical meaning is a coordinate transformation of every point. The reason for doing this is the spatial transformer paper, which I have not read yet.
    • This transform matrix is not randomly generated but produced by a fairly complex T-net. Its architecture mirrors the larger network: it lifts the input points to higher dimensions several times with nonlinear transformations plus biases, then projects back down to obtain the matrix. So although different clouds share the network parameters, they are not multiplied by the same coefficients.
    • Why is this network so elaborate? It is in the same spirit as GRU-style networks: use the current data to generate weights, then use those weights to process the same data again.
  3. After this transformation the shape is 32 * 1024 * 3. Next comes the first dimension lift. Concretely, take a weighted sum of each point's three coordinates to obtain a scalar, and repeat this operation 64 times to get a 64-dimensional vector; this is just multiplying the length-3 vector by a 3 * 64 matrix. The result is a 32 * 1024 * 64 tensor. Add a bias to it and apply batch normalization; the shape is unchanged.
    • In the author's source code this step is implemented not by stacking 64 linear layers but with TensorFlow's conv2d, using the channel dimension to represent the 64 fully connected layers.
  4. Next is another such "lift", the same as the previous step, but the output dimension stays 64.
  5. The data shape is now 32 * 1024 * 64. Next comes the feature transform: the same kind of network produces a 32 * 64 * 64 tensor, and tf.matmul transforms every point again; the principle is the same as in step 2.
  6. After the transform, lift the dimension again: first to 64, then 128, and finally 1024. In the end we get a tensor of shape 32 * 1024 * 1024.
    • In the author's source code the shape is 32 * 1024 * 1 * 1024, because the conv2d operation is used.
  7. Now the key step: summarize all the points of each cloud into one vector of global information. This is done with max pooling: for each channel, take the maximum over all points of the cloud, i.e. element-wise max pooling.
    • The source code implements this with tf.nn.max_pool: the input is the 32 * 1024 * 1 * 1024 tensor, the kernel size is [1, num_points (1024), 1, 1], and the output is 32 * 1 * 1 * 1024.
    • Why not sum or average? This paper proposes theoretical properties of the max pooling operation; I have not read that yet.
  8. After all this we finally have a global vector for each cloud. The next step is to reshape it into a 32 * 1024 block, dropping the two redundant dimensions (see the sketch after this list).
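Here is a much-simplified NumPy sketch of this pipeline (my own illustration: the T-nets, biases, and batch normalization are omitted). It also demonstrates the order invariance mentioned at the end of this post.

import numpy as np

def pointnet_features(clouds, weights):
    # clouds: (32, 1024, 3); weights: matrices of shape 3*64, 64*64, 64*64, 64*128, 128*1024.
    x = clouds
    for W in weights:                # shared per-point MLP: the same W is applied to every point
        x = np.maximum(x @ W, 0.0)   # lift the dimension, then a ReLU nonlinearity
    return x.max(axis=1)             # element-wise max pooling over points -> (32, 1024)

rng = np.random.default_rng(0)
clouds = rng.normal(size=(32, 1024, 3))
ws = [rng.normal(size=s) for s in [(3, 64), (64, 64), (64, 64), (64, 128), (128, 1024)]]
g1 = pointnet_features(clouds, ws)
g2 = pointnet_features(clouds[:, rng.permutation(1024), :], ws)   # shuffle the points
assert np.allclose(g1, g2)           # the global vector is invariant to input order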

The steps above learn the representation. Depending on the concrete task, there are a few more steps:

  • Classification: only the global representation is needed. Pass it through three fully connected layers to produce a vector of length 40 (there are 40 classes in total), obtain probabilities through a softmax layer, use the mean cross entropy as the loss function, and optimize the parameters numerically.
  • Segmentation: the author's approach is to classify every point; the global vector is appended to each point, which is then classified.
  • Semantic segmentation: I do not really understand this part, skipping it.

The highlights of this paper are 1) applying a simple network structure directly to point cloud data, and 2) the properties of the max pooling operation (invariance to input order).

Writing Python tests: doctest, unittest, and pytest

So I set myself another task the other day: writing tests for an open-source project, a baby step towards software engineering! This post describes the tools and practices I've learnt.

The project I've committed to is here. The goal is to write tests for an experiment class, whose objective is to save the state of machine learning experiments and quickly retrieve it.

There are two sets of tools to be learnt: 1) conceptually, what is a test and how do you write one? 2) practically, what tools does Python offer for writing and organizing test code, and how do you use them? I will answer these two sets of questions below.

What is software testing and how do you write it?

Testing is just writing code to check whether a piece of software behaves as expected. If it does not, there is a bug, and bugs are bad. Most of the online materials and books on this are kind of fluffy, describing concepts that are hard to define and unnecessarily convoluted. But really I just understand a test to be: this piece of software is supposed to output 1 when I input 3; is that how it behaves?

There are also edge cases to be tested and taken care of accordingly.

So the real issue seems to be how to write good tests. This is a very broad topic with no definite answer, and I'll omit that discussion in this post. For this project, I'm following this guide.

Testing tools in Python 

  1. doctest
  2. unittest
  3. pytest

doctest is packaged with Python. The usage is the following:

  • Open a Python interpreter, and interactively test a bunch of results with the function to be tested
  • Copy and paste the output from the interactive session into the docstring of that specific function
  • Add an if __name__ == "__main__" line in the script to be tested to detect whether it is imported or executed standalone; if executed standalone, call doctest.testmod() so doctest will run the tests.

This module parses the docstring to get the expected result for every input. It then compares the actual output with the expected output, and reports errors if they do not match. If everything runs fine, the module outputs nothing.
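A minimal example of this workflow; the file mymath.py and the function add_one are made up for illustration:

# mymath.py
def add_one(x):
    """Return x plus one.

    >>> add_one(3)
    4
    >>> add_one(-1)
    0
    """
    return x + 1

if __name__ == "__main__":
    import doctest
    doctest.testmod()   # silent when all docstring examples pass

Running python mymath.py prints nothing if the tests pass; python mymath.py -v shows every example as it runs.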

unittest: the biggest difference from doctest is that test cases are no longer defined in the original script. Instead, we create a new script, import the old script and its methods, and test all our test cases there. The good news is the script is now separated from its tests; the bad news is we need to write more code.

How to use the unittest module?

  • First create a new Python script, and import both unittest and the original script in it. This script will be run alone, so specify the shebang line as well.
  • Create a unittest.TestCase class. The syntax is class MyTestCase(unittest.TestCase), i.e. MyTestCase is a subclass of unittest.TestCase.
  • So the real issue is what methods we can use with unittest.TestCase. These methods are used to assert that the expected value equals the real result. Here is a link to the useful parts of its documentation. Basically, the testing process is a series of assertions.
  • The most naive way is just to define a bunch of assertTrue() methods in the class one by one. A more mature way is to use the setUp and tearDown methods to put all test cases into a data structure, e.g. a list, and in the test function body use a for loop to iterate through that data structure. Sample code here: https://www.python-course.eu/python3_tests.php (a minimal sketch also follows this list).
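A minimal sketch of such a test script, reusing the hypothetical add_one from the doctest example above:

#!/usr/bin/env python
# test_mymath.py
import unittest
from mymath import add_one   # import the script under test

class MyTestCase(unittest.TestCase):
    def setUp(self):
        # put all test cases into a data structure
        self.cases = [(3, 4), (-1, 0), (0, 1)]

    def test_add_one(self):
        # iterate through the test cases with a for loop
        for arg, expected in self.cases:
            self.assertEqual(add_one(arg), expected)

if __name__ == "__main__":
    unittest.main()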

pytest is a third-party package. It is easier to use than the unittest module, because it does not require all the class and method setup.

  • In the simplest case, first create a Python script exercise.py with 1) the function to be tested and 2) a function that tests it with assert, e.g. assert func(4) == 5 (a full example follows this list)
  • Then run the script with pytest exercise.py from the command line, in the folder containing the script. You do not even need to import pytest in exercise.py!
  • On the other hand, if we do not specify an argument to pytest, it will run the tests in every file of the form *_test.py or test_*.py in the current directory. This is a convention for discovering test files. Also see this link.
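The same hypothetical function in pytest style; note that the file needs no imports or classes at all:

# exercise.py -- run with: pytest exercise.py
def func(x):
    return x + 1

def test_func():
    assert func(4) == 5    # pytest rewrites assert to report useful failure messages

def test_func_negative():
    assert func(-1) == 0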

Finally, some good resources on pytest that I just found:

  • A webinar : https://www.youtube.com/watch?v=ixqeebhUa-w
  • The accompanying git repo: https://github.com/okken/pycharm_2018_Feb

Some updates on using pytest with PyCharm

  • If you want to run a single test instead of the whole script, see this SO discussion. The pipeline might be broken though.
  • PyCharm Professional has a test-runner tab where you can use auto-test, which re-runs a test function after modification. Link here.
  • To add the verbose option for pytest: it's in "edit configurations", under pytest, additional arguments; specify a -v.
  • Put tests in a class so you can run them together, i.e. when they are related and always need to be run together.
  • Auto-test: you can have tests run continuously as you type.

Happy testing!

Using ROS with Conda

For the past two weeks I’ve been playing around with ROS. All was fine until I tried to embed tensorflow and “deep learning” stuff into ROS with conda.

Things went south immediately. Pipelines broke down; what used to work no longer worked. roslaunch showed a `not a package` error, and import in Python showed package not found even though I had just pip installed it…

This post aims to clarify what happened, why my system became a mess, and how to resolve it.

The first problem, errors when importing modules in Python, is because ROS conflicts with Conda: ROS sneakily changes the Python path in the `~/.bashrc` file. According to this post by the Udacity NanoDegree github:

The problem here is that Python is looking for a package in /opt/ros/kinetic/lib/python2.7/dist-packages/ when it should be looking in the Anaconda installation directories. The reason for this is that with your ROS install, your .bashrc file contains the line source /opt/ros/kinetic/setup.bash, which sets your PYTHONPATH variable to the /opt/ros/kinetic/... location.

The solution is to unset your PYTHONPATH variable in your ~/.bashrc file, right after sourcing ROS but before sourcing your Anaconda environment. Use the command unset PYTHONPATH.

The second issue is that I used many package installers to install Python packages (and possibly Python itself). Be careful with online tutorials, my dude! I ended up having tens of Python paths scattered around different virtual environments, and things became very convoluted.

Package managers I used include:

  • pip (system wide and within conda; I have five pips …)
  • apt (did not figure out what it did until it was too late)
  • conda

Where do these package managers install packages?

  • pip: https://stackoverflow.com/questions/29980798/where-does-pip-install-its-packages
    • When used without a conda environment, it will show the base Python directory
    • pip used with virtualenv will generally install packages in the path <virtualenv_name>/lib/<python_ver>/site-packages. For example, I created a test virtualenv named venv_test with Python 2.7, and the django folder is in venv_test/lib/python2.7/site-packages/django.
  • conda will install into the environment's Python site-packages.
    • If there is no Python 2 but we want a Python 2 package, conda seems to install it into the system-wide Python site-packages

A detailed comparison of pip and apt:

pip is used to download and install packages directly from PyPI. PyPI is hosted by Python Software Foundation. It is a specialized package manager that only deals with python packages.

apt-get is used to download and install packages from Ubuntu repositories, which are hosted by Canonical.

Some of the differences between installing python packages from apt-get and pip are as follows:

  • Canonical only provides packages for selected python modules, whereas PyPI hosts a much broader range of python modules. So there are a lot of python modules which you won't be able to install using apt-get.
  • Canonical only hosts a single version of any package (generally the latest, or the one released in the recent past). So with apt-get we cannot choose the version of the python package we want; pip helps us in this situation. We can install any version of a package that has previously been uploaded to PyPI. This is extremely helpful in case of conflicting dependencies.
  • apt-get installs python modules in a system-wide location. We cannot just install modules into our project virtualenv; pip solves this problem for us. If we use pip after activating the virtualenv, it is intelligent enough to install modules only into our project virtualenv. As mentioned in the previous point, if there is a version of a particular python package already installed in the system-wide location, and one of our projects requires an older version of the same package, we can use virtualenv and pip to install that older version without any conflicts.
  • As @Radu Rădeanu pointed out in this answer, there would generally be difference in names of packages as well. Canonical usually names Python 2 packages as python-<package_name> and Python 3 packages as python3-<package_name>. Whereas for pip we generally just need to use <package_name> for both Python 2 as well as Python3 packages.

Which one should you use?

Both apt-get and pip are mature package managers that automatically install any other package dependencies while installing. You may use either as you like. However, if you need to install a particular version of a python package, install the package in a virtualenv, or install a package that is only hosted on PyPI, only pip will solve that issue. Otherwise, if you don't mind installing packages in a system-wide location, it doesn't really matter whether you use apt-get or pip.

Also, this link summarizes the problem perfectly: https://stackoverflow.com/questions/14117945/too-many-different-python-versions-on-my-system-and-causing-problem

 

The third issue is that I confused Python packages with ROS packages somewhere in my projects. This is more conceptual, but it did misguide my actions. Here are some clarifications:

ROS

  • ROS is a programming ecosystem/framework that provides middleware for modules (packages) to communicate, especially for robotics systems
  • ROS provides distributions in Python and C++
  • ROS code is organized in packages: folders containing src, data, other libraries…
  • ROS has $ROS_PACKAGE_PATH, an environment variable indicating where to find ROS packages. For example, `echo $ROS_PACKAGE_PATH` gives the output /opt/ros/kinetic/share. https://answers.ros.org/question/241528/what-does-environment-set-up-do/

Python

  • Python path: an environment variable telling Python where to find packages to import: https://realpython.com/python-modules-packages/ (the module search path)
  • module: an individual script
  • package: a collection of modules https://realpython.com/python-modules-packages/#the-dir-function
  • Given this structure, if the pkg directory resides in a location where it can be found (in one of the directories contained in sys.path), you can refer to the two modules with dot notation (pkg.mod1, pkg.mod2) and import them with the syntax you are already familiar with
  • The __init__.py file in a package imports the modules into the package namespace, so that we can use the imported modules directly from the package

 

Finally, some useful commands for inspection and cleaning:

  • Find all Python executables: sudo find / -type f -executable -iname 'python*' -exec file -i '{}' \; | awk -F: '/x-executable; charset=binary/ {print $1}' | xargs readlink -f | sort -u | xargs -I % sh -c 'echo -n "%: "; % -V'
  • `whereis pip` finds all pips

Use the binary representation of an int to represent subsets of n elements

For the past two months I have spent many hours solving algorithmic problems. It's fun, but very time-consuming. I think the majority of the time is spent on trying to find accurate solutions and systematic ways of problem solving. I also spent a few weeks (!) grasping data structure fundamentals.

Anyway, this post just describes a technique I came across when reading this book (page 73): using integers to find all possible subsets of a list of length n. Example Leetcode problem: https://leetcode.com/problems/subsets/description/

The key insights are:

  • Every int is represented in a machine's memory in binary, as a sequence of 0s and 1s
  • The ints from 0 to 2**n - 1 cover all possible combinations of 0s and 1s for a list of length n
  • Think of each position's value (0 or 1) as an indicator of whether the element at that position appears or not
  • So for every int from 0 to 2**n - 1, we only need to extract all the positions that are 1 to get the combination of list elements this int represents
  • With the bitwise operator & and the shift operator <<, we can easily check whether the i-th bit of an int is 1 (turned on) or 0 (turned off). Try x & (1 << i).
  • Looping through all positions from 0 to n-1 we get all the set bits of this int, and we append the corresponding elements to the answer list for this int
  • Finally, we collect every answer list into a big list and return it (see the sketch after this list)
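Here is a minimal version of the technique (the author's original snippet is linked below):

def subsets(nums):
    # Enumerate all subsets of nums by counting from 0 to 2**n - 1.
    n = len(nums)
    result = []
    for mask in range(2 ** n):
        # collect the elements whose bit is turned on in this mask
        result.append([nums[i] for i in range(n) if mask & (1 << i)])
    return result

print(subsets([1, 2, 3]))
# [[], [1], [2], [1, 2], [3], [1, 3], [2, 3], [1, 2, 3]]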

A very smart and interesting strategy. Code snippet here: https://codepad.co/snippet/Tm6Z5ZZP

This strategy can be used for other exhaustive search problems that require looping through all possible combinations of n elements, with exponential time complexity (2**n). But the very fast bit manipulation helps it evade the time limit.

Another rant about reading the literature

My master's thesis still needs final revisions, and I have been working through the literature. Starting last Friday, it took me two full days, two evenings, and two early mornings to more or less understand Philip Converse's 1964 classic, The Nature of Belief Systems in Mass Publics. I had cited it many times, but only today did I truly understand it. I thought it was badly written; then I asked an American, who said the piece is written so elegantly that it becomes hard to understand.

The first few times I read it, I did not realize I wasn't understanding it, but my behavior showed it: slacking off, browsing the web, playing with my phone, chatting. In short, my body was honestly escaping. Around Saturday night I realized the problem might be that the article is simply hard. On Sunday I started splitting it into small sections and defeating them one by one. Yesterday I pushed forward a little. Today I finally figured out roughly what it is doing.

Having worked through this foundational piece, the later papers immediately fell into place: I could see what led to what. The humanities are hard in their own way. This is work I should have finished in 2015/2016, dragged out until now. The root cause is probably insufficient English reading ability plus insufficient patience. However I excuse my failure to write the thesis, the root cause is insufficient accumulation; and the cause of that is that whenever I got stuck on a hard point I took a detour, and after all that wandering I never got deep into any discipline.

Why is this article hard? First, the author's English twists and turns: a single sentence carries three commas and two subordinate clauses, so the meaning turns with it and the main point is hard to pick out. Of course, this also shows how rigorous his thinking is. Second, the structure of the article is not very clear. Contemporary social science authors have learned to respect the reader's patience: state the research question in the first sentence, lay out the alternative hypotheses in sentences two to four, give the conclusion in the fifth, and brag in the sixth. But this piece reads more like an essay; no such blunt structure is visible. Third, I am mostly clueless about political knowledge: all I know is liberal versus conservative, which in the author's classification places me among the ignorant masses at the bottom of society. The fourth difficulty is vocabulary.

On the positive side, the intellectual reward of finishing a hard article should not be small either. I have barely read any books these two years, and my speech has become less and less cultured, while I still thought I was something.

This post wants to mention two more things. One is that training in the humanities probably does make people more complicated and conflicted, because the work of studying social facts is to take things that could be kept simple and chew them over in analysis: people's psychology, society's direction, expectations of the future. For the past half year my life has contained almost no social events, so you can imagine my emotional intelligence has stagnated again.

The second: this paper mentions at the end that political elites participate in making the history of ideas, and their decisions in turn influence the masses. When I studied Marx in college I seem to have seen a similar idea (class consciousness?), and seem to have formed some ideals because of it.

The production of knowledge is politics, is power. The struggle and pain of the past two years may also have been resistance against an unfamiliar ideology trying to take control of my life. I was unwilling to leave sociology, and once brainwashed myself into transcending my class limitations: my father's generation laid bricks, and I refuse to lay bricks! But I am much happier now, having chosen a cognitively easier way of living. Laying bricks has its own happiness; interpersonal struggle drains the spirit and creates no value, and human intelligence should be spent where there is more aesthetic or practical value.

The jianbing guozi chef at 津门大饭店 went home today

(Another draft; putting it on the blog feels a bit like git add. Making it public feels like a milestone, without the embarrassment of posting it to WeChat. Writer's block can probably only be cured by shamelessness; when I have time I will come back and expand this disjointed piece…)

津门大饭店 sits by the sea. A senior schoolmate in green asked me, "Which restaurant? Is it next to that Japanese store?", looking more and more troubled as she spoke. Four years in Hong Kong and she did not know 津门大饭店; my heart went soft.

I said yes: from the McDonald's on Hill Road, go straight out; stride past the blue ParknShop sign on your right; turn your head and smile at the pile of toy elephants at the door of the Japan Home Centre across the street, people streaming in and out; swivel around, step daintily through the brown doors of Pacific Plaza; da-da-da, three steps at a time up the escalator! And again, strike while the iron is hot, one more floor up!

The view suddenly opens up: rosewood screens with white gauze, ink-wash paintings in light strokes, truly a whole new place! Like the Monkey King discovering the Water Curtain Cave, like Columbus discovering the New World. Inside the door hangs a vermilion plaque: A Hundred Years of Renown, Jingmen Fish Head with Pancake.

By then a server with ingratiating eyes has already come up: what will you have? No, wait: how many of you, miss? Leading you in, she asks: hot pot, or dishes? Past the screens, a dozen or so tables are scattered just so across the red-carpeted hall, and the square table that belongs to Monkey Bro and me waits quietly by the window.

The auntie server purses her lips, painted with pale pink lipstick, charmingly, and cranes her neck forward in welcome: what tea?

What tea do you have?

Pu'er, jasmine, chrysanthemum, xiangpian.

Xiangpian, then.

While we spoke, another auntie had already attentively brought over two menus, a dish of pickled cucumber, a dish of fried peanuts, and two white porcelain pots, one of boiling water, one of hot tea: boiling water to rinse the tableware, hot tea to cleanse the soul. Monkey Bro and I rinsed away merrily.

I had a college classmate from Tianjin. We were not close and could not find much to talk about. I thought her aloof; she thought me abstruse. Fifteen minutes into a meal, she would start looking for her phone and I would start looking for a topic. After two such awkward rounds, we tacitly never arranged to eat together again, and running into each other at a restaurant while dining with other people was still rather awkward. No affinity, but I trusted her, because she stands 1.78 meters tall. If she said 津门大饭店 was decent, then it was decent. That was the first time I heard of 津门大饭店. Honestly, the location is terrible: up on a second floor by the street, a big red sign hanging in the window, dingy; ordinary people would never think of going in. I have been worrying ever since about the place going under.

From then on 津门大饭店 was on my mind, the way a young rake pines for a pure beauty: the ex, McDonald's, left behind; the new love, the grand restaurant, hidden in the heart.

津门大饭店 has three treasures: roast duck, baozi, and jianbing guozi. Not that they are all that delicious; it is mainly that in Hong Kong, a barren and snobbish place, you either cannot get them anywhere else or cannot afford them. When there is no tiger in the mountains, the monkey is king. For a palate that has made its peace with reality, jianbing guozi is already more than one could ask for!

The roast duck's formal name is "Tianjin Famous Roast Stuffed Duck". What is good about it? The server asks whether you want the duck served two ways: besides the platter of roast meat, a pot of duck-bone soup, simmered snow-white, a little fat floating on top. Winter melon or greens there are none. The duck is sliced unevenly, some thin, some thick; the chef may not be doing it on purpose. The big white plate is laid out like flower petals: a dab of sweet bean sauce in the middle, around it a stack of cut cucumber, a stack of red radish, a stack of scallion, piled up like firewood, with the meat stacked aslant. The skin is roasted glossy and bright, the fat roasted translucent. It is not the melt-in-your-mouth of Quanjude; the fat is slightly yellow, slightly sticky. But on the whole, as Hong Kongers like to say, nothing lost, nothing gained.

About the jianbing guozi and the baozi there is another story. At the end of last year everything was gloomy and bleak, but Monkey Bro wanted jianbing guozi, and went to 津门大饭店 first thing in the morning. The server sat her down attentively and asked: baozi? Monkey Bro asked: do you have jianbing guozi? The chef arrives at nine, came the answer. Monkey Bro waited until nine, by then starving. At nine she asked: jianbing guozi now? Still no jianbing guozi. Monkey Bro starved on until ten. In the end it came out: good grief, the jianbing guozi chef had gone home!

Why is a restaurant like this even open? Why?

Later I finally tried the jianbing guozi with friends, and oh my, it moved us to silent tears. That time it was two junior schoolmates, one about to study in the US, one about to do fieldwork in Beijing. We met at 津门大饭店 for a farewell meal and ordered jianbing guozi for the US-bound one to taste. After endless waiting it arrived: a yellow egg skin wrapped around a white pancake, with a stick of deep-fried dough, crisp and fragrant, tucked inside. The egg had been cracked carelessly onto the pancake, white and yolk spread unevenly, so it tasted a little different from bite to bite. A few scallions scattered on the plate, together with the bean paste, rounded out the flavor of the pancake.

What else does 津门大饭店 have? Tofu and vermicelli soup, of middling freshness. It comes in a big bowl, always with a chip in it. A few leaves of greens float in the bowl; the tofu sinks to the bottom; the greens are boiled to mush, the cabbage stems cooked translucent. In one corner of the bowl sit some dried shrimp, chewy with a burst of brine. And fried pork strips. I once read a northeastern writer describing her high school canteen: the chef fried the pork strips golden, scooped up a ladleful, gave it a shake to drain the oil, the pork trembling in a heap in the ladle, to be eaten straight off it. Grab a piece while it is hot: soft, warm, the breadcrumbs crisp, the pork fragrant and sweet. The fried pork strips here are probably not that crisp and airy, but they are enough to satisfy a southerner's imagination of the northeast's great chunks of meat.

The name 津门大饭店, "Grand Restaurant", is rather ambitious. In a place like Hong Kong, where a living room is the size of my bathroom back home, taking up two floors already makes you a grand restaurant. On the mainland, a grand restaurant would need five or six storeys and a line of little girls in red to greet the guests; here there is none of that, and the place declares itself grand all the same!

After dinner Monkey Bro and I stood by the sea. Monkey Bro said: it is all Hong Kong's doing. You see, in Hong Kong, the moment you think about the future, think about housing prices, you cannot wait to go make money. I thought: even with no money it would be all right; we could go together to eat at 津门大饭店, and at night sprawl over the tables of the Hill Road McDonald's. A commoner's happiness is really quite easy: call up friends, eat and drink to your heart's content, and one could probably live happily like that for eight or ten years.

Yet another lost STEM student in North America prepares to become a code monkey

(The prose is rough; this really is a draft, but today is the deadline I forced on myself, so done it must be, even on my knees. The story is mostly fiction with a sliver of truth.)

Every era has its stories of shattered ideals, and that is truly saddening. Yesterday the youth withdrew from their tents; today the youth fold their umbrellas and go back to school. In the East the youth say, workers of the world, unite; in the West the youth say, we are the 99 percent. The literary youth stop updating; the startup youth cannot raise money; the angry youth turn to self-cultivation; the leftist youth say Churchill said that anyone who is still a liberal after twenty-five has no brain.

Hard to say. Sometimes the ideal is at fault, sometimes reality is. Perhaps neither is at fault, but facing the dreary life that follows the collapse of a constructed ideal, one needs to blame the disappointments on the assorted misunderstandings bred when two parallel worldviews fail to communicate.

Anyway, for us science and engineering youth in North America in this era, the shattering of ideals usually comes with the following behaviors:
– Complaining frantically about one's original field to relatives and netizens, and discussing in depth the possibility of switching to CS
– Buying all kinds of programming books, e.g. Learn Python the Hard Way, Head First Java, Problem Solving With C++
– Subscribing to a LeetCode VIP account
– Registering a 1point3acres account, eyeballing the offers of Chinese old-timers and newcomers every day, and grinding through interview write-ups

I know such a youth, Xiao Wang. We were college classmates; he was a pretty funny guy. We often studied and ate together, and at the time I never thought him different. But when he was applying to graduate school, he declared he would absolutely not become a code monkey. "A life of ignoble hustling is not what I want!" Heroic, soaring with ambition, he shot off to do a master's in art history, convinced it was the luminous beginning of a life of transcendence.

Now, going literary can take several forms. Clawing your way up to big-V status on Weibo or Zhihu, or updating a WeChat account diligently and paraphrase-plagiarizing at scale, can still lead to success. But Xiao Wang seemed to have neither the head nor the ambition for that. Not long after he switched to art history, I lost touch with him. I heard he was in a relationship. Then I heard they broke up. Then I heard he was in a bad state, that an already moody man had become more dramatic. By the end, when the rest of us were about to graduate, there was still no word of him handing in his thesis.

Then one Saturday afternoon I got a WeChat message from Xiao Wang: he wanted to change careers and asked me to help him scout the road. I said fine and arranged to meet on the weekend. Xiao Wang said glumly: I have to switch, but I don't know to what. I urged him to switch to coding; he said he had not made up his mind. I asked what he wanted to do; he said he did not know. I asked, then why switch at all? Xiao Wang said: my field has no future, and I want to make money.

Honestly, I have seen too many people like Xiao Wang. Chemistry PhDs changing careers, comparative literature majors turning to code. I even met a physics one: talkative, boastful, rather narcissistic, once clutching the American dream and bent on academia, now stuck at the margins, unable to advance or retreat. You cannot quite call these people cowardly; nor quite call them greedy. Then what is it? Perhaps they matured too late; perhaps they missed the tide of the times while carrying a restless heart.

I went on listening to Xiao Wang. "There is no way around it: housing here is just too expensive. Next to our school there are a few luxury estates; the most expensive one is called 宝翠园. At the gate stand guards in blue uniforms with orange batons; come down the steps and they wave you off with those batons. Now and then a red taxi slips into the estate carrying who knows whom, and you must stand at the roadside and render it your respectful gaze, because it is their private road."

Once, in a bad mood and having drunk too much, Xiao Wang told me: damn it, it is only making money; if I really went for it I would be no worse than you, so stop acting superior. I gathered that some crisis had appeared in his life. Maybe some second-generation rich kid had needled him. Maybe he did not realize how petty, snobbish, sour, and vulgar, and how utterly un-literary, he was when he said it.

Bourdieu says aesthetic taste has a class character. Some people's money comes easily, but Xiao Wang did not have that ease. He would need twenty years of rolling in the red dust before he could wear that pampered, effortless poise. I think Xiao Wang never did jump out of his own frame: a man who has known poverty, at the critical moment, falls back on refined self-interest.

When he sobered up, he never brought it up again; I suppose he forgot, or else was ashamed.

Nor did I bring it up. As I see it, people like Xiao Wang are best suited to fantasy, and to trades bordering on fantasy. The ones who found startups straight out of school. The ones who go to Tibet. The ones who divorce. The ones who have babies. On the whole, such people are too greedy: their craving for the fresh and the thrilling is too strong, and they have no patience to endure the mediocrity of the everyday. They do not know that a life, too, is lived out one day at a time. Xiao Wang, oh Xiao Wang: you are twenty-five this year; you must learn to stand on the ground with both feet, and stand firm.

Xiao Wang, it seems, has learned this life lesson. He said: there is no way around it, my mother is getting old. When I was an undergraduate she came to Hong Kong alone, without a phone, found our school from Lo Wu, and then found me inside the school. This time she came to Hong Kong, it was raining, the sky was dark, and she wore sandals and stepped straight into a foul gutter. My mother is a nurse; dirty water is what she fears most, and she would have to scrub herself when she got back.

He said: I realized that if I want to be a proper person, I cannot keep messing around like this, you see? In that state, muttering to myself all day, I could not stay. I cannot not lay bricks: laying bricks calms my heart and makes me feel needed. Mine is a cheap life.

He still hesitated a little about becoming a code monkey. "A mechanized, alienated life: what meaning does it have?"

I had no words. Xiao Wang, oh Xiao Wang: rice in Chang'an is dear, and living there is not easy. And you still speak of the soul! Still ask about meaning? You said you would write essays; did you write them? You said you would contribute to society; did you contribute? Listen to the freight trains of empty talk clattering out of your mouth.

That afternoon, I finally watched Xiao Wang, of his own accord and desolately, bid farewell to his own ambition. He all but shed two streams of tears, feeling that life amounted to no more than this. He had imagined himself as XX, as XX, as XX; now those imaginings were gone. With his own hands he had killed off infinite possibility. The storms of blood and the grand stratagems would, from this day on, have nothing to do with Xiao Wang.

Not that they had ever had much to do with him; but today he had to brace himself and face the facts. From this day on, be an honest, plain code monkey. Give every tech company a warm nickname: Google is the G house, Facebook is the F house, Amazon is the mystic factory. Sitting in a bright classroom, he decided it so.

I advised him to sign up for LeetCode VIP, take the Java course seriously, open the site, and grind through the problems diligently, just like in high school. He will discuss with the people around him: wow, XXX got into Google, amazing! XX went to Microsoft, how impressive. Wow, XXXX went from the humanities to coding in three months, truly…

I believe Xiao Wang can pull it off. But by the time he does, once he has become a professional, I am afraid he will no longer be literary, no longer the endearing Xiao Wang I once knew.

The meal was finished, and I had nothing more to say. Had I said it, Xiao Wang would have thought me harsh. Who did not have ideals when young? Some lines from Cui Jian, which I first heard my political science teacher sing, go to Xiao Wang:

Look at you, how did you turn out like this
A few years back, when you were poor, you still had ideals
Now that a few days have passed and you've made a little money
I'd say you change faster than the world does, or else you've shown your true colors
You are so good at joking, so good at making yourself liked
The ideals you once had have now become tools
You say this is life, you say this is flavor
Why not just say you are willing to keep smiling amid the loss