Pok Fu Lam Love Story (5)

Yao Yao was startled. She had never heard Guo Yanyan complain about her boyfriend and had assumed the two were in a stable place; apparently Yanyan had simply kept things to herself. She said nothing and waited for Yanyan to go on, but Yanyan picked up her chopsticks again and ate her noodles slowly. After waiting a while longer, Yao Yao asked, "Why?"

Guo Yanyan answered, "No particular reason. I just don't like him anymore," and went back to her noodles. Yao Yao was puzzled but said nothing. The love letter was obviously off the table now; what confused her more was how a four-year relationship could end just because someone "didn't like him anymore." The more lightly Yanyan brushed it off, the less Yao Yao understood. After dinner, Yao Yao asked whether she wanted to go for tong sui, really meaning to leave Yanyan one more chance that evening to pour out her troubles. But Yanyan said she wanted to go back and be by herself, and shuffled off with her head down.

Once Yanyan left, Yao Yao had nothing else to do, so she turned and walked back up the hill to campus. She had already planned the first semester of her final year week by week, September through December: finish the project Bruno had assigned over the summer by early October, prepare for midterms in mid-October, ask for recommendation letters in the first week of November, complete her applications by December, then start a new project over the winter break. This was the first week of term, which was why she still had time for dinner with Yanyan; had the breakup happened in November, Yao Yao would have been far too frazzled to care. Since Yanyan wasn't bringing it up, Yao Yao didn't intend to pry; it wasn't her business, after all. What she cared about more right now was the new student in Bruno's lab.

At the group meeting that afternoon, Yao Yao noticed an unfamiliar girl in the room: short, all smiles, sitting next to the senior student with glasses. Bruno introduced her as Alice, a third-year, currently helping that senior with odd jobs. After the meeting, as Yao Yao was packing up, she heard a sugary "Hello, senior!" and saw Alice coming over with a grin. "I heard ages ago that you were in this group too; finally we meet! Have you been busy lately? What project are you working on?" Yao Yao answered, "Just some odds and ends." Alice pressed on: "So are you planning to apply abroad afterwards, or go straight to work?" Yao Yao started walking toward the door. "Apply, I suppose." Alice kept pace with her. "Wow! Someone as impressive as you will have no trouble at all. When you get into Stanford or Berkeley, you'll have to carry us along with you!" Yao Yao mumbled "oh, hardly, hardly," hurried out the door, and got away from Alice on the pretext of having something to do.

(Notes to self:
1. Since Guo Yanyan has already broken up with her boyfriend, she shouldn't have been so cheerful about calling Yao Yao down for hotpot at the start of the previous installment. To be revised...
2. Alice is going to make trouble for Yao Yao on purpose.)

Pok Fu Lam Love Story (4)

The summer break had bored Guo Yanyan half to death. She had interned in Shanghai for three months and come back to Hong Kong just as she was getting to know a new circle of girlfriends. On the first Friday of the semester, the moment class ended she sidled up to Yao Yao: let's go to Café de Coral for mini hotpot tonight!

Guo Yanyan was from Jiangsu, with red lips, white teeth, and delicate features, but a personality as careless as a boy's. From freshman to senior year she lost an iPhone every single summer. Freshman year, on a trip to Nanjing, she was in such a hurry to get vermicelli soup at the Confucius Temple that her phone vanished somewhere in the crowd. Sophomore year, on the Shanghai metro, an auntie at Wujiaochang station asked her for directions; by the time she turned back around, her purse was empty. Junior year, at Victoria Harbour on New Year's Eve, she was gazing up at the fireworks, giggling, about to take a selfie, and reached for a phone that was already gone. She had lost everything from an iPhone 4 to an iPhone 6, so this year, senior year, she meant to keep a close eye on this one. And yet after class she dragged Yao Yao in a beeline into the Westwood, and the moment the food hit the table she slapped her phone down beside it, grabbed her chopsticks, and started eating.

Yao Yao sat across from her, worrying on her behalf. The two had lived in neighboring dorm rooms during military training. Yanyan wore white-framed glasses, Yao Yao black-framed ones; after only a few days of baking under the sun, both had applied to join the arts squad and met in the political commissar's air-conditioned office. Yao Yao wrote the articles; Yanyan reviewed them and chatted with everyone in between. One day Yanyan had an idea: "Hey, Yao Yao, why don't we sneak over to the west campus and have some fun?" Yanyan was bold as brass and utterly unscrupulous. She told the review team leader, "We're off to the training ground to gather material," then told the squad leader at the training ground, "We're off to the office to review articles." Back in the dorm, camouflage uniforms off, they were two free women. Afraid of being spotted by classmates, they looped a long way around through the back streets, and even paying for food they acted like thieves. From then on, good girl Yao Yao's record was never quite so clean again, all thanks to falling in with the wrong friend.

By senior year they were still the best of friends. Yao Yao's life was monotonous, and Yanyan brought her no end of fun; of every ten trips Yao Yao made to Causeway Bay, nine were in Yanyan's company. Yanyan was lively and loved to play, and Yao Yao never presumed to lecture her. Ever since she did badly on an exam in sophomore year, Yanyan had let her studies slide, but luckily she wasn't the type to claw her way upward anyway; she simply believed that being with family and living a happy life mattered most. She had a childhood-sweetheart boyfriend in Shanghai. They had been in neighboring classes in high school, and in the summer after the college entrance exam Yanyan had confessed first; they had been a couple ever since, almost four years now. Yanyan was pretty and had plenty of suitors over those four years, but for all her talk about wanting to meet some rich heir, she quietly turned every one of them down. So she's loyal at heart after all, Yao Yao thought. Maybe I should tell her about the anonymous love letter I received.

Just as she was about to speak, Guo Yanyan lifted her head from her bowl, took a sip of milk tea, and said, "Yao Yao, I broke up with my boyfriend."

(The story is dragging... The second female lead has made her entrance; she is a comic character. Also, by my count, at 500 characters every two days that is only 80,000 characters a year, and after cutting and distilling the plot I would be lucky to keep 40,000. Two years just to end up with a novella...)

Pok Fu Lam Love Story (3)

In the first winter break after graduation, Yao Yao came back to see Wuyang. The two of them walked out of the East Gate and kept going downhill. Yao Yao walked on his left; Wuyang saw a car passing on the roadside, slowed a few steps, and came around so that she was shielded on the inside of the road. By the time Yao Yao realized, they were both a little embarrassed, but neither said anything, and they kept walking. Wuyang felt that the evening breeze was no longer cool; it carried the warmth of her arm. They had graduated and kept in touch every weekend, but gradually each had their own things going on: next time, next time, and who knew when the next meeting would be. That Yao Yao had come back to see him surprised Wuyang a little, too.

"What do you feel like eating?" Wuyang said.

"Anything. Let's look around."

What did Yao Yao like to eat? Wuyang had turned the question over countless times. Back when the whole group ate together, he often watched what she ordered: usually the simplest two dishes and a soup, sometimes roast meat. Her tastes ran light; she would never order a Sichuan dish with red chilies floating all over the bowl. Once Wuyang had dragged her to McDonald's, and even from that calorie-laden menu she managed to pick out the healthiest few items: a burger, orange juice, a salad. Yao Yao was the kind of person who set goals even for keeping her figure. Looking over the restaurants on High Street one by one, Wuyang kept wondering, would she like this one? The more he thought, the more anxious he became. In the end it was Yao Yao who decided; they walked into a Thai restaurant and ordered.

After Wuyang's confession was rejected, the two of them actually became more candid with each other. The things he had once been too careful, too afraid to say, the things he could not say, he could now discuss with Yao Yao, since his cards were on the table. Having known the kind of heartache that bores out from the chest through the whole body, an ache that numbs the scalp and chills the fingers, having all but died and come back to life, Wuyang had nothing left to fear. And Yao Yao, having accepted his secret and witnessed his pain, had become his companion in it. It was not the kind of companionship Wuyang had hoped for, but it was still time and experience the two of them shared, a scrap of attention Wuyang had wrung out of Yao Yao by laying his heart bare. Whether or not it satisfied him, the matter had to end there.

Yao Yao was troubled too. She truly did not want to hurt Wuyang, yet in the face of his earth-shaking pain there was nothing she could do. Confronting that same heavy sorrow, both of them had reddened eyes. It was Yao Yao who spoke first. Wuyang was like a child crouched in a corner of his collapsed world, and she needed to pull them both out of that hopelessness and back into reality. "Do you regret it?"

Wuyang said, "No, I don't."

Yao Yao said, "You are a peak." Wuyang said, "You are my peak too. I won't have that kind of courage again." Never again the courage to love someone to that degree; the love settled into his heart and became a green patch of camelthorn in the desert. Years later, when the two no longer spoke at all, Wuyang would still think of her often: still young, thin, in a blue top, a ring strung on her necklace as a pendant, eyebrows drawn too heavily, a girl in a hurry to grow up. And thinking of her still hurt, like stepping on thorns. Perhaps it hurt even more, because he loved every version of her: made up, bare-faced, angry, indifferent, lips pressed tight in frustration, head thrown back laughing until her face went red. He loved her to the bone, and the more he loved, the more he missed her, and the more it hurt.

(A scene from later in the story; it still needs careful polishing and coloring...)

Pok Fu Lam Love Story (2)

When Li Wuyang moved into Morrison, his roommate had already claimed half the room. A locked suitcase sat carelessly under the desk; the wardrobe door stood open, two suits in dry-cleaning bags hanging neatly inside, the shirts in various colors folded and sorted just so. A half-used bottle of cologne stood in the corner. A few books lay on the desk: a green-covered Financial Mathematics and a very thick, still-new accounting textbook. Before long he would see those same books listed on Xu Fenfei's WeChat Moments, sold to underclassmen at ninety percent of the original price, but at this point he had not yet met Xu Fenfei; he only knew, from the textbooks on the desk, oh, the new roommate studies finance. The roommate's bed was already made up with dark blue sheets; he seemed nearly done settling in, except for a quilt not yet in its cover and a yellowed pillow tossed onto Wuyang's bare bed. Wuyang was stuck. If he didn't move the roommate's things, there was nowhere to put the bedding he had hauled over on his back in a careless bundle; if he did move them, he felt guilty about touching someone else's belongings without asking. In the end he piled his bedding on the desk, crouched down, and started unpacking the box of books he had half killed himself dragging over, setting them on the shelf one by one, only to find halfway through that his hand was covered in dust: he had forgotten to wipe the shelf first. He went to the kitchen for paper towels, gave the shelf a swipe, and got dust all over himself. Fed up, he simply left for class, planning to finish when he got back.

When Wuyang came back, the roommate was in the room and jumped at the door suddenly opening; the two came face to face. The roommate was dark-skinned, stubble shaved smooth, hair slicked back, yet wearing gold-rimmed glasses. Before Wuyang could speak, the roommate narrowed his eyes, piled on a smile, and said, "Hello! You must be Li Wuyang? I'm Xu Fenfei, I study finance. Heard a lot about you. You're in social sciences, right?" A pause, then, "Very glad to be your roommate. Hope we get along." He held out a hand. Wuyang froze for a second, then hurried to shake it. "Yes, yes, I'm Li Wuyang. Pleasure, pleasure." Out of the corner of his eye he saw that the quilt on his bed had already been taken away. Xu Fenfei went on, "I keep late hours, I'm a bit of a clean freak, and this semester I'll be very busy, so please bear with me." Wuyang quickly said, "Of course, of course." Xu Fenfei then turned back to his laptop, and Wuyang went on sorting the mess of things on his desk. Seeing that Xu Fenfei was working in silence with no intention of further conversation, Wuyang tactfully kept quiet and careful too. But he had too much stuff, and the more careful he tried to be the clumsier he got, until, with a thud, the stack of books he had just painstakingly built on the desk collapsed, and Marx, Weber, Foucault, and Simmel slid everywhere. He could only pick them up one by one. At that point Xu Fenfei closed his laptop, stood up, and walked out of the room, the door shutting behind him with a bang.

Pok Fu Lam Love Story (1)

(Right now I don't know how to write a novel, and I have never written one. But I can't keep putting this off. This year I will finish a first draft: however bad, however unsatisfying, the rough plot and characters have to take shape. If there is time left this year, a second pass will cut and revise, adjust the narrative structure, rewrite certain scenes, and add detail. A third pass will polish the language and style and keep fine-tuning the narrative. I'll post the drafts on the blog first. At 500 characters per installment, 100 installments make 50,000 characters, enough for a first draft. Updating 500 characters every two days will take half a year, finishing around September. Fingers crossed this one doesn't fizzle out. Once it's done, I'll know where I fall short.)

Every Friday at 9:20 in the morning, Yao Yao got out of her statistics lecture and left Room 103 of the Meng Wah Complex. She would push open the red door on her right, hurry down two flights of stairs, cross the yellow-painted road that ran down from the hillside, and come out onto the footpath behind the Run Run Shaw Building. It was her favorite walk on campus, because the whole way she could see the lush woods on her left. The campus in fact sat halfway up the hill: people had carved a slice out of the mountain, held it back with a high concrete wall, and forced a campus into being. The woods, not knowing their home had been lost, still poked their branches over the railing, so that in summer the students walking under the wall had a patch of shade overhead.

From Meng Wah to Haking Wong took Yao Yao six minutes. At the dark-red tiled lift lobby on the ground floor of Haking Wong it was still early, and nobody was queuing. The lift ride down to the third basement level took a minute. Down there stretched a long row of yellow: the lockers the engineering faculty assigned to students. Yao Yao's was number 6673, second from the top in the third row after the corner. At 9:27 she would stuff the folder of lecture notes into the locker and pull out the black bag holding her sneakers and workout clothes. At eleven she was due on the fourth floor for the group meeting with Bruno and his PhD students; in the hour and a half in between she would work out at the campus gym, shower, tidy herself up, then appear in the meeting room five minutes early and sit in a corner waiting for the senior students to arrive. She liked Friday mornings like this: regular, tightly scheduled, everything going to plan.

It was now 9:27. Yao Yao had just opened her locker when a letter fell out. She picked it up: a white envelope with no name on it. She opened it and found that it was indeed addressed to her.

"Hello, Yao Yao! I've been paying attention to you for a long time, since the summer, though you probably haven't noticed me. I feel there is something special about you; you're the quiet, independent kind of girl. And of course you're lovely to look at. I get nervous every time I see you. I like you very much. You must be busy this semester. PS: my handwriting isn't much to show, so I've tried to write as neatly as I can, haha."

No signature. This was not the first love letter Yao Yao had received, but it was the first unsigned one, and it piqued her curiosity. Who could it be? "Since the summer" meant someone she had met, or at least gotten to know, only recently. Knowing which locker was hers meant a classmate she saw often. Was it Xu Fenfei, the business student with the slicked-back hair and the suits? They had been taking statistics together lately and often discussed homework late into the evening; had that fellow taken a liking to her? But writing love letters didn't seem like his style. Who still wrote love letters in this day and age? That suggested an artsy type. Could it be Xu Fenfei's roommate, Li Wuyang? That guy was always moping around in plaid shirts, looking like some down-and-out man of letters. Or else it was Zhang Shuo, the earnest, awkward one from the lab next door; "nervous every time I see you" certainly sounded like him. Yao Yao let her mind wander for a while, and suddenly it was past ten. With her schedule already derailed, she simply went straight up to the meeting room on the fourth floor, took out her laptop, and started on her homework, but she couldn't focus at all; her head was full of curiosity about who had written that letter.

ResNet: annotated version

This post annotates PyTorch's implementation of ResNet.

ResNet is one of the most widely used network architectures for image tasks in industry.

The motivation for the original idea is that deep neural networks, if we simply stack layers together, sometimes perform worse than shallow networks. For example, in the original paper a 56-layer plain network has higher training error than a 20-layer one. This is strange because, in theory, a deep network should never be worse than a shallow one: a constructed solution is to copy the shallow network and make all the extra layers identity mappings, and the two networks should then perform the same.

  • The difficulty in optimization is NOT caused by the vanishing gradient problem: 1) the authors use batch normalization after every conv layer, so forward-propagated signals have non-zero variances; 2) the authors checked the norms of back-propagated gradients and found them healthy. Also, the 34-layer plain net still has competitive accuracy, so the gradients do not vanish.
  • The training error cannot be reduced simply by adding more iterations.
  • The conjecture is that deep plain nets may have exponentially low convergence rates.

To solve this problem, the paper's key idea is to let some of the layers learn the "residual" of a function instead of the function itself. There is no rigorous mathematical proof, but the intuition is: we already know that one way to make a deep network no worse than a shallow one is to make all the extra layers identity transformations. So we can pass this information to those layers and reduce their optimization burden by manually adding an identity transformation to their outputs. These layers then only need to learn the "residual", i.e. the original transformation minus the identity. Mathematically, if the "real" transformation these layers are trying to learn is H(x), they now only need to learn F(x) = H(x) - x.
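In symbols (this is just a restatement of the paper's eq. (1)-(2), with $latex \sigma$ denoting ReLU and biases omitted), a two-layer residual block computes

$latex y = F(x, \{W_i\}) + x, \qquad F(x) = W_2\,\sigma(W_1 x)$

so the stacked layers are asked to fit the residual F(x) = H(x) - x rather than H(x) itself.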

Now, practically, should we assume H(x) and x always have the same dimension? (Here "dimension" seems to refer mainly to the number of channels, as shown in Figure 3 of the original paper.) The answer is no. The authors provide two options for handling mismatched dimensions.

  1. Perform a linear projection W_s in the shortcut connection so that H(x) = F(x) + W_s x, i.e. x is passed through a linear transformation to make the dimensions match.
  2. Still use the identity mapping, but zero-pad the extra channels.

Theoretically, we could also do this when the dimensions already match, i.e. transform x before adding it to the residual. But the experiments show this is unnecessary: the identity shortcut alone is sufficient. Using projection shortcuts everywhere, instead of only where the number of channels increases, is only marginally better, and the gain is likely due to the extra parameters.

Note also that in practice F(x), i.e. the residual, is implemented as 2~3 linear or conv layers. In the implementation we are going to see below, they are conv layers.

Another insight for me is that "deep" networks are really deep and complex. I have only coded basic building blocks for such networks, e.g. convolutions and LSTMs, but I have never tried stacking 100 layers of them! ResNet proposes one model with 152 layers. A practical question is how to implement a network when there are so many layers. Also, when a network is this deep there are many design decisions to make: how large should a filter's kernel be? What should the stride and padding be? What about dilation? I do not have experience tuning all these hyper-parameters.

Some of the design principles I read from the ResNet paper, inspired by VGG nets:

  • most filters are 3x3. In fact, three stacked conv3x3 layers have the same receptive field as one conv7x7 layer but fewer parameters (27C² vs 49C² weights for C channels), so stacking many small filters is more economical than using one big filter (at the price of more dependencies between output feature maps); see the sketch after this list.
  • number of filters * feature map size should be roughly constant.
    • For example, for the same output feature map size, the conv layers have the same number of channels.
    • If we halve the output feature map size with a conv layer, we usually double the number of filters to preserve the time complexity per layer.
  • a conv layer can also be understood as a downsampling layer, e.g. a conv layer with stride=2 halves the feature map size.
  • shortcut connections add essentially no parameters or computational complexity (only an element-wise addition)
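As a quick check of the receptive-field and parameter claim above, here is a small sketch of my own (not from the paper) that counts the weights of three stacked conv3x3 layers versus one conv7x7 layer at a fixed channel width C = 64:

import torch.nn as nn

C = 64  # channel width, kept fixed for a fair comparison

# three stacked 3x3 convs: 7x7 receptive field, 3 * 9 * C * C weights
stacked = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)])

# one 7x7 conv: same receptive field, 49 * C * C weights
single = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), count(single))  # 110592 vs 200704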

Fortunately, PyTorch offers a ready-made implementation in torch.vision package. Here is my annotated version of the code.

import torch.nn as nn
import torch.utils.model_zoo as model_zoo


__all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101',
           'resnet152']


model_urls = {
    'resnet18': 'https://download.pytorch.org/models/resnet18-5c106cde.pth',
    'resnet34': 'https://download.pytorch.org/models/resnet34-333f7ec4.pth',
    'resnet50': 'https://download.pytorch.org/models/resnet50-19c8e357.pth',
    'resnet101': 'https://download.pytorch.org/models/resnet101-5d3b4d8f.pth',
    'resnet152': 'https://download.pytorch.org/models/resnet152-b121ed2d.pth',
}

This is just the boilerplate code. Note the `model_urls` dict stores URLs of pre-trained weights for each network configuration. "resnet18" means "residual net with 18 layers".


def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)

These are the two basic convolution filters used in ResNet. Notice that ResNet uses no other filter sizes, and the default stride is 1. The 1x1 filter is used to reshape the channel dimension. For example, an input might have 64 channels of 28x28 pixels each. A conv1x1 layer with 256 output channels computes, for each output channel, a weighted sum across the 64 input channels at every pixel, each output channel having its own set of weights; it can equally be used to reduce the number of channels. Since the residual addition needs the input and output to have the same dimensions, we need such an operation to adjust shapes.
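For example, a minimal sketch of this channel reshaping (the shapes are my own illustration):

import torch

x = torch.randn(1, 64, 28, 28)      # batch of 1, 64 channels, 28x28 pixels
expand = conv1x1(64, 256)           # conv1x1 defined above
reduce = conv1x1(256, 64)

print(expand(x).shape)              # torch.Size([1, 256, 28, 28])
print(reduce(expand(x)).shape)      # torch.Size([1, 64, 28, 28])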

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

Here `inplanes` and `planes` are simply the numbers of input and output channels. This is the basic module that gets stacked to build ResNet. It consists of two conv3x3 layers, each followed by a batch normalization layer. The first conv's output is also passed through a ReLU for nonlinearity. (Note that ReLU zeroes out all negative inputs, so if you plan to use its output as a divisor, be careful not to divide by zero.)

Then there is an option to downsample or not. If you look at the `_make_layer` function in the `ResNet` class, you will notice that `downsample` is an `nn.Sequential` consisting of 1) a conv1x1 layer that adjusts the number of channels of the input, and 2) a batch normalization layer that follows it. The downsample path is enabled automatically when the input channel count does not match the block's output channel count (explained in the `ResNet` class annotation) or when the stride is greater than 1.
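A small usage sketch of the block (the shapes are my own example): when the channel count or stride changes, a matching downsample module has to be supplied, which is exactly what `_make_layer` does further down.

import torch
import torch.nn as nn

# same shape in, same shape out: the identity shortcut is enough
block = BasicBlock(64, 64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)        # torch.Size([1, 64, 56, 56])

# channels 64 -> 128 and stride 2: the shortcut needs a conv1x1 + BN to match
down = nn.Sequential(conv1x1(64, 128, stride=2), nn.BatchNorm2d(128))
block2 = BasicBlock(64, 128, stride=2, downsample=down)
print(block2(x).shape)       # torch.Size([1, 128, 28, 28])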

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = conv1x1(inplanes, planes)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = conv3x3(planes, planes, stride)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = conv1x1(planes, planes * self.expansion)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

The `Bottleneck` class defines a three-layer block. Here are the meanings of the parameters:

  • inplanes: number of input channels to the first conv layer
  • planes: number of channels for the intermediate conv layer
  • The final number of output channels is planes * 4, because the expansion factor is 4.

This whole module first reduces the number of channels from inplanes to planes with a 1x1 conv, then applies a 3x3 conv at that reduced width (with the given stride), and finally expands the output to 4 * planes channels with another 1x1 conv. Why do we need such a structure? Mainly for computational efficiency and to reduce the number of parameters. Compare two structures: 1) two 3x3 conv layers with 256 channels in and out, and 2) one 1x1 conv from 256 to 64 channels, one 3x3 conv with 64 channels, and one 1x1 conv from 64 back to 256 channels. The first has 2 * (3*3*256*256) ≈ 1.18M weights; the second has 256*64 + 3*3*64*64 + 64*256 ≈ 70K.
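A quick sketch of my own to verify these counts with PyTorch (weights only, no BN parameters):

import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

plain = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
    nn.Conv2d(256, 256, 3, padding=1, bias=False),
)
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, 1, bias=False),
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.Conv2d(64, 256, 1, bias=False),
)
print(count(plain))        # 1179648
print(count(bottleneck))   # 69632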


class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False):
        super(ResNet, self).__init__()
        self.inplanes = 64
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                nn.BatchNorm2d(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x

The key to understanding this chunk of code is the `_make_layer` function. It takes the following parameters:

  • block: either the `BasicBlock` or the `Bottleneck` class; the first consists of two conv3x3 layers, the second of a 1x1-3x3-1x1 stack
  • planes: the base number of channels; each block outputs planes * block.expansion channels. (Q: why do we need self.expansion at all? Because a Bottleneck block outputs 4 * planes channels, so the next layer and the final fc layer need this factor to know the actual width.)
  • blocks: the number of blocks (BasicBlock or Bottleneck) to stack in this layer
  • stride: the stride of the first block in the layer; the remaining blocks use stride 1

def resnet18(pretrained=False, **kwargs):
    """Constructs a ResNet-18 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(BasicBlock, [2, 2, 2, 2], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet18']))
    return model


def resnet34(pretrained=False, **kwargs):
    """Constructs a ResNet-34 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(BasicBlock, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet34']))
    return model


def resnet50(pretrained=False, **kwargs):
    """Constructs a ResNet-50 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(Bottleneck, [3, 4, 6, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet50']))
    return model


def resnet101(pretrained=False, **kwargs):
    """Constructs a ResNet-101 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(Bottleneck, [3, 4, 23, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet101']))
    return model


def resnet152(pretrained=False, **kwargs):
    """Constructs a ResNet-152 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(Bottleneck, [3, 8, 36, 3], **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['resnet152']))
    return model

The above are just different configurations for different ResNet structures.
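As a usage sketch (my own example, not part of torchvision's file): build a model and run a dummy ImageNet-sized batch through it. With pretrained=True the weights listed in `model_urls` are downloaded first.

import torch

model = resnet50(pretrained=False)   # Bottleneck blocks, layer sizes [3, 4, 6, 3]
x = torch.randn(2, 3, 224, 224)      # 2 RGB images, 224x224
y = model(x)
print(y.shape)                       # torch.Size([2, 1000]), one logit per ImageNet class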

Various SGDs

This post briefly documents variations of stochastic gradient descent and some best practices. It is a summary of this source. This course note is also helpful for a review.

A trick for checking whether a gradient is implemented correctly: use the centered difference instead of the one-sided finite difference, because the centered difference has an error of O(h^2) if you expand it by Taylor's theorem, so it is more precise.
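A minimal numerical gradient check along these lines (my own sketch, not from the source):

import numpy as np

def f(x):
    return np.sum(x ** 3)               # toy function with known gradient 3x^2

def centered_diff_grad(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * h)   # error is O(h^2)
    return grad

x = np.random.randn(5)
print(np.allclose(centered_diff_grad(f, x), 3 * x ** 2))  # True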

Momentum 

Interpretation: the momentum factor multiplies the previous update velocity and decays it, acting like friction on the past update; the newly computed gradient step is then added. As a result, if the gradient flips to another direction the update is slowed down, and if the gradient keeps pointing in the same direction the updates accumulate and become faster and faster.
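In update-rule form (the usual cs231n-style formulation; mu is the momentum factor and v starts at zero):

v = mu * v - learning_rate * dx   # decay the old velocity, add the new (negative) gradient step
x += v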

Nesterov Momentum

Similar to momentum, but when computing the gradient for the update step, instead of evaluating it at theta we evaluate it at theta + mu * v, a "look-ahead" point: we know the momentum term will carry us to a different location, so we take the gradient from that location.
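In update-rule form (same notation as above):

x_ahead = x + mu * v                    # look-ahead position the momentum would take us to
v = mu * v - learning_rate * dx_ahead   # dx_ahead = gradient evaluated at x_ahead
x += v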

Exponential decay of learning rate

As training approaches the end, it makes sense to decrease the learning rate and take smaller steps.
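For example (lr0, k, and step_size are placeholders of my own choosing):

lr = lr0 * np.exp(-k * t)                 # exponential decay at iteration t
lr = lr0 * (0.5 ** (epoch // step_size))  # or step decay: halve every step_size epochs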

Adagrad

cache += dx ** 2

x += -lr * dx / (np.sqrt(cache) + eps)

eps is a smoothing factor to avoid division by zero

Understanding: 1) the scaling factor is the square root of the sum of all previous squared gradients, computed per dimension, so every dimension effectively has its own learning rate; 2) as a result, sparse parameters (whose accumulated gradients are small) get relatively larger effective learning rates, while dense parameters (whose accumulated gradients are large) get smaller ones.

RMSprop

It is an adjustment to Adagrad: the effective learning rate no longer decreases monotonically (in Adagrad the cache only ever grows). Instead, the cache here is "leaky", think of LSTM gates. The formula is:

cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

Question: what is the initial cache value for RMSprop? Some implementations set it to zero.

Adam

Adam looks like a mixture of RMSprop and momentum.

m = beta1 * m + (1 - beta1) * dx

v = beta2 * v + (1 - beta2) * (dx**2)

x += -lr * m / (np.sqrt(v) + eps)

Note that 1) m and v are leaky, just as in RMSprop; 2) the update step is also exactly like RMSprop, except that we now use the "smoothed" version m instead of the raw gradient dx; and 3) the construction of m looks like momentum. (The full Adam update also includes bias-correction terms for m and v, omitted here.)
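For completeness, here is a self-contained toy run of my own of the three leaky/momentum updates above on f(x) = x², whose minimum is at 0:

import numpy as np

def grad(x):                       # gradient of f(x) = x^2
    return 2 * x

def run(update, steps=200):
    x, state = 5.0, {}
    for t in range(steps):
        x = update(x, grad(x), state)
    return x

def momentum(x, dx, s, lr=0.1, mu=0.9):
    s['v'] = mu * s.get('v', 0.0) - lr * dx
    return x + s['v']

def rmsprop(x, dx, s, lr=0.1, decay=0.9, eps=1e-8):
    s['cache'] = decay * s.get('cache', 0.0) + (1 - decay) * dx ** 2
    return x - lr * dx / (np.sqrt(s['cache']) + eps)

def adam(x, dx, s, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    s['m'] = b1 * s.get('m', 0.0) + (1 - b1) * dx
    s['v'] = b2 * s.get('v', 0.0) + (1 - b2) * dx ** 2
    return x - lr * s['m'] / (np.sqrt(s['v']) + eps)   # bias correction omitted, as in the notes

for name, upd in [('momentum', momentum), ('rmsprop', rmsprop), ('adam', adam)]:
    print(name, run(upd))   # each moves from x = 5.0 to roughly the minimum at 0
                            # (the adaptive methods hover within about one learning rate of it)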

 

Some points to help really understand VAE

It took me a long while (~4 days?) to understand the theory behind the Variational Autoencoder (VAE) and how to actually implement it. And it is not entirely my fault, because:

  • The original paper (Auto-Encoding Variational Bayes) explains things from a probabilistic perspective, which
    • requires knowledge of Bayesian statistics to verify that the math is correct
    • requires knowledge of variational inference, or of statistical inference in general, to fully understand the motivation for introducing the distributional assumptions and the meaning of the equations. (In fact, I should be familiar with this because I did something related in my final year of college, but I have forgotten most of it..)
    • only offers a general and abstract framework in the main text, leaving the results for specific distributions to the appendix.
  • Online tutorials over-simplify this model in neural-network language
    • Most online tutorials just add a "sample random noise from a normal distribution" step to the auto-encoder framework. But these tutorials often do not hold up and have logical loopholes, e.g. they never explain why the loss has a KL-divergence term between the approximate and true z distributions, or how to compute that term from empirical data (this is key).
    • Most online implementations miss a point: if the reconstructed x is sampled, how can we use an MSE loss between the original data and the reconstruction? They are not even the same data point. Some try to explain this by noting that the reconstruction is sampled from the conditional p(x|z_i), which establishes the correspondence, but the reconstructed x still contains randomness, so a plain MSE loss does not obviously apply.
    • Even the Stanford lecture on VAEs does not explicitly address this point.
    • One is left feeling that the online tutorials and the original paper do not match up.

In other words, most online tutorials do not bridge the gap between the general variational lower bound and the equation that is actually implemented, because they do not introduce the three crucial distributional assumptions.

In fact, one simply cannot derive the loss function in the original paper from a pure neural-net perspective, because a neural net by itself makes no distributional assumptions. Also, the original paper is not really about neural networks.

So this post just tries to point out the key ideas for understanding the original paper:

  1. The whole paper is set in a probabilistic framework. There are no loss functions here. Instead, we assume a model p(x), and the goal is to find the parameters of p that maximize the likelihood of the observed data x. Once we have these parameters, we can sample new data from the distribution.
  2. We introduce a latent variable z in order to better model p(x).
  3. We do not know the true posterior of z given x, so we approximate it with a Gaussian, q(z|x).
  4. After introducing q(z|x), we still cannot handle p(x) analytically, but we can at least obtain a lower bound: - D_{KL}(q(z|x_i) || p(z)) + E_q[log p(x|z)]. You can find the derivation of this lower bound in many online tutorials, e.g. this.
  5. With the lower bound from step 4, it is time to plug in concrete distributional assumptions. Appendix B of the original paper gives an analytical solution for - D_{KL}(q(z|x_i) || p(z)) when both p(z) and q(z|x) are Gaussian.
  6. We still need $latex E_q ( log p(x|z))$. This simplifies to an l2 loss when x given z is assumed to be i.i.d. Gaussian. To see why, think of linear regression and how MLE under Gaussian noise is the same as minimizing the total sum of squares when estimating the regression coefficients. This slide from UIUC also helps!
  7. So now we have a concrete instance of the VAE under Gaussian assumptions. How does this map onto neural networks and autoencoders? Here is a glossary (a minimal code sketch follows this list):
    1. encoder: outputs the parameters of q(z|x); the neural net simulates sampling z from q(z|x) and is used to obtain empirical observations of z. The KL term of the loss pushes this q(z|x) toward the assumed prior p(z).
    2. decoder: outputs x given z; the neural net simulates sampling from p(x|z), and under the Gaussian assumption its reconstruction term becomes the l2 loss.
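To make the mapping concrete, here is a minimal PyTorch sketch of a Gaussian VAE loss under exactly the assumptions above (the layer sizes and names are my own illustration, not the paper's): the encoder outputs the mean and log-variance of q(z|x), z is drawn with the reparameterization trick, the KL term uses the closed form from Appendix B, and the reconstruction term reduces to a squared error.

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # E_q[log p(x|z)] under i.i.d. Gaussian p(x|z) becomes a squared error (up to constants)
    rec = ((x_rec - x) ** 2).sum()
    # closed-form D_KL(q(z|x) || p(z)) for two Gaussians (Appendix B), added as a loss term
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum()
    return rec + kl

model = TinyVAE()
x = torch.rand(16, 784)                  # a dummy batch
x_rec, mu, logvar = model(x)
loss = vae_loss(x, x_rec, mu, logvar)
loss.backward()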

Preparing 3D building data from CityGML for machine learning projects

Introduction

I spent the last week of 2018 and the first week of 2019 preparing CityGML data for a machine learning project. In particular, I needed to extract 3D point cloud representations of individual buildings in New York (and Berlin and Zurich) so that they can serve as training / validation / test data.

During the process, I had to pick up some knowledge of 1) the Linux file system and disk management, 2) PostgreSQL, 3) CityGML (a data format for 3D city maps), 4) FME (a data warehouse software for ETL), 5) the PCL command-line tools, and 6) shell programming. I acquired these bits of knowledge by asking on Stackoverflow, the FME forums, and the GitHub issue pages of the corresponding open-source tools. Most importantly, thanks to help from people all around the world (!), I was finally able to figure it out. This post documents the steps needed to finish this task.

Since there are many details involved, this post will mostly point the readers to the tools I used and places to find how to use them. It covers the following sections:

  • CityGML : what is it and database setup
  • FME: the angel that does the heavy lifting for you for free, from CityGML to mesh
  • PCL command tools: a not-so-smart way (i.e. the Engineering way) from mesh to point cloud

1. CityGML Setup

1.1 What is CityGML

CityGML is a markup format for documenting 3D city maps. It is a text document that basically tells you where a building is, what the building is composed of, where a bridge is, what the coordinates of a wall surface are, where the edges of a wall run, and so on.

CityGML is based on XML and GML (Geography Markup Language). XML specifies the document encoding scheme, e.g. class relations, labels and tags, much like HTML. GML extends XML with sets of geographic primitives such as topology, features, and geometry. Finally, CityGML builds on GML with additional constraints specific to cities, e.g. predefined object classes for buildings, bridges, vegetation, and terrain.
A 3D city model can have various levels of detail (LoD). Roughly, LoD0 is a 2D footprint map; LoD1 extrudes the 2D footprints into simple 3D block models; LoD2 adds roof shapes; LoD3 adds windows, doors, and other openings; LoD4 adds interior furniture.

For example, part of a sample CityGML file might look like this (figure omitted: a snippet specifying several surfaces of a building).

A sample visualization of a CityGML LoD4 model might look like this (figure omitted: the building's interior furniture is visible in the main window).

References to know more about CityGML:

  • Groger and Plumer (2012) CityGML – Interoperable semantic 3D city models. Link. This paper goes over details of CityGML and is a more organized guide than the official CityGML website.
  • The CityGML website for downloading data for specific cities. Many German cities are available in CityGML format.

1.2 3D City Database

3D City Database is a tool for storing geographic databases, and it provides good support for CityGML data.

Now, why do we need a database to store CityGML data? Because a city map can be very large and highly structured, e.g. the New York LoD2 city model is ~30 GB. Such a large file cannot easily be manipulated with text-processing tools (e.g. gedit) or visualized, because of memory constraints. Moreover, the 3D City Database can parse the modules and objects in a CityGML file and store them in a structured way. So if someone asks how many buildings there are in New York, that question is hard to answer from the raw CityGML file but very easy to answer with a SQL query against the 3D City Database.
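For instance, once the data has been imported (setup described below), such a count is a one-line query. A hedged sketch, assuming a local 3DCityDB instance whose default schema exposes a building table; the database name, user, and password here are placeholders:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect(dbname="citydb_nyc", user="postgres", password="...", host="localhost")
cur = conn.cursor()
cur.execute("SELECT count(*) FROM citydb.building;")  # table name per the 3DCityDB schema docs
print(cur.fetchone()[0])
conn.close()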

To use the 3D City Database, we need to go over the following steps:

1. Set up a PostgreSQL database locally or on a remote server. The 3D City Database also supports Oracle, but I did not get that to work. PostgreSQL is a free database system available for Linux. Its documentation is good for looking up specific queries, but if you are new to practical database administration, this book might be more helpful as a general road map.

Specifically, on Linux you first need to make sure you have a disk with at least 50 GB of free space, and that your current user can own directories on that disk. I spent four days troubleshooting this because the disk in my laptop used the Windows NTFS file system, so my sudo user could not own directories there; commands like chmod or chown did not work. To solve that problem, I had to back everything up (compress and upload to Google Drive) and reformat the disk with a Linux file system. Useful commands and tools here:

  • parted: a tool to partition a new disk. A disk has to be partitioned before it can be mounted and used.
  • mount: attach a disk's file system to the current operating system's file-system tree
  • lsblk: list block devices and check their file systems
  • mkfs: make a file system; ext4 is a Linux format.

Linux wrapper tools useful for PostgreSQL actions:

  • pg_createcluster: create a cluster. Used because the default PostgreSQL data directory lives on the system disk (under /var on Ubuntu), which is usually small; if you want a cluster in another location, use this command. Note the path should be an absolute path, not a relative one.
  • pg_ctlcluster: start / stop / drop a cluster.
  • psql: used to connect to a running server. Note PostgreSQL has a default user "postgres", and 1) you need to set a password for it before the database can be connected to from other clients (e.g. the 3D City DB importer / exporter), and 2) you do not normally log in as "postgres" directly; rather, use "sudo -u postgres psql" to use it only for logging into the postgres server.

2. Connect the 3D City Database tool to the PostgreSQL database. Download the 3D City DB tools here. After running the installer .jar file, you will notice there are basically a few sets of tools, of which we need two:

  • The SQL and Shell scripts used for setting up a geographic database
  • The importer / exporter used for importing data into the database

Before diving into details, here are two helpful resources you should check out for specific questions:

  • this documentation is very helpful, and one actually needs to read it to proceed, i.e. there are no better online documents
  • This online Q&A on GitHub is actually active; the developers answer questions there. I discovered it too late!

After getting the documentation, these sections are helpful:

  • Step 1 is to set up a PostgreSQL database using the scripts included in the 3D City DB tool to create all the schemas; this starts at page 102/318 of the version 4.0 documentation. Note you will need to create a postgis extension, details here. On Ubuntu you can get it with "apt install postgis" or something like that.
  • Step 2 is to use the importer / exporter to connect to that database and import data. The importer / exporter is located in the "bin" folder; you just run the .sh file to start it. Details can be found on page 125/318 of the version 4.0 documentation.

3. Create a tiled version of the NYC data. Some CityGML files are too large for other software to load, and the importer / exporter can help create tiles. Details are on pages 142/318 and 171/318 of the documentation. Essentially, we first read everything into a database, then export different tiles of the map. Note we should set up one database per city.

At a high level, to set up a tiled export a user needs to 1) activate spatial indexes under the "Database" tab, 2) specify the number of bounding boxes under "Preferences" -> "Bounding Box", and 3) specify the bounding area in the "Export" tab. That is enough for the program to start a tiled export.

The documentation is not very clear on the details. Here are also two Q&As where other users helped me:

  • https://knowledge.safe.com/questions/84869/output-buildings-into-individual-obj-files.html?childToView=85184#comment-85184
  • https://github.com/3dcitydb/importer-exporter/issues/73

2. FME Workbench: from cityGML to mesh

After the last section, we should already have a bunch of .gml files, each a map of a different region of a city. You can visualize a .gml file with the FME software. FME is a data warehouse tool and comes with two programs: FME Workbench and FME Data Inspector. Data Inspector is just for looking at data, while Workbench can be used to edit data. You need a license to open the software, but students can apply for a free trial; the application process takes a few hours. Note you need to register your code with the FME licensing assistant and download the license file, or else the activation will be void the next time you open it.

For example, this is what a region of Zurich looks like (figure omitted):

The next step is to extract individual buildings from it, as well as converting them into mesh files like .obj. The basic idea is:

  • First, identify and output individual buildings from a single .gml file. This is done by a fanout operation. You will also need an aggregator, because a "building" consists of 1) roof surfaces, 2) wall surfaces, and 3) ground surfaces; the parts of one building can be identified by a shared parent id.
  • Second, automate the process over all .gml files. This is done by setting up a file and document reader, i.e. you need another FME workspace that reads the different files in the same directory and calls the sub-workspace from this parent workspace.

FME also has a bit of a learning curve, and understanding the above two operations takes some digging into FME's own concepts and documentation.

Finally, here is a Q&A I posted on the FME forum, where people helped point me to the right places.

Eventually, individual buildings will look like this (figure omitted):

 

3. PCL command line tool: from mesh to point cloud

The last step is to create point clouds from the meshes, because we want to test methods specifically applicable to point clouds. I found the following tools:

  • This tool that someone pointed me to: https://github.com/simbaforrest/trimesh2pointcloud. It uses Poisson disk sampling. It has trouble with 2D meshes (some of the building data seems to be dirty): the function never returns. I tried the Python "signal" and "multiprocessing" packages to kill the function on timeout, but neither worked, so I gave up on this tool.
  • CloudCompare: an open-source tool for editing point cloud files. It has a command-line mode, but it reported an error whenever I tried to save a point cloud, so I gave up on this tool too.
  • PCL: the Point Cloud Library.
  • pyntcloud: a Python library. It seemed a bit troublesome to read .obj rather than .ply files; it mainly supports .ply, so I gave up on this tool as well.

I consulted this Stackoverflow Q&A 

I eventually settled on pcl_mesh_sampling. It can be installed on Ubuntu 16.04 with apt-get (the package is called pcl-tools or something like that). The command is very easy to use: the first argument is the input file name, the second is the output file name, and then you specify the number of points to sample.

A remaining issue is how to automate this. Since it is a command-line tool, a natural way is a bash script, which means some string manipulation in bash (a Python sketch of the loop is given below). A bigger problem is that every time this tool generates and saves a point cloud, it pops up a visualization window, and the process does not finish until that window is closed. So we need to "close" the window automatically with another script that calls xdotool at a fixed interval to close the specified window. Note we cannot use xdotool's "windowkill" option; we need to simulate the keystroke alt+F4 instead (I do not know why). The full command is:

xdotool search "$WINDOWNAME" windowactivate --sync key --window 0 --clearmodifiers alt+F4

This Q&A was helpful (others less so).
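Since bash string manipulation gets awkward, here is the same automation loop sketched in Python instead (my own sketch; the pcl_mesh_sampling arguments, the folder layout, and the point count follow the description above and may need adjusting for your install):

import glob
import os
import subprocess

N_POINTS = "20000"   # number of points to sample per building

for obj_path in glob.glob("buildings/*.obj"):
    pcd_path = os.path.splitext(obj_path)[0] + ".pcd"
    # first argument: input mesh, second: output cloud, then the sample count
    subprocess.run(["pcl_mesh_sampling", obj_path, pcd_path, "-n_samples", N_POINTS])
    # a separate script keeps calling xdotool (as above) to close the viewer window,
    # otherwise this subprocess.run call never returns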

Again, any questions please direct to greenstone1564@gmail.com…

Distributed Stochastic Gradient Descent with MPI and PyTorch

This post describes how to implement stochastic gradient descent in a distributed fashion with MPI. It covers the following topics at a high level, since it is hard to cover every detail in a single post; I will point to other resources that help with the key concepts.

  • What is MPI
  • How to do distributed SGD
  • PyTorch modules necessary for writing distributed SGD and how to design the program
  • Engineering caveats for building PyTorch with MPI backend 

What is MPI

I understand MPI as a programming framework that handles communication between computers. 

It is like the lower-level software underneath MapReduce, another programming framework for distributed computing. In MapReduce you only need to specify a map operation and a reduce operation; the system takes care of allocating workers, servers, and memory (configurable through meta-information passed in a shell script). The main challenge is designing the parallel algorithm itself, for example how to partition the two matrices when doing matrix multiplication, or how to design the mapper and reducer for min-hashing and shingling documents.

But in MPI, the programmer needs to address workers and the server by their IDs and actively coordinate the machines, which usually requires another set of algorithms. For example, if many workers need to talk to the server at the same time and the server cannot serve them all, there must be a scheduling scheme, e.g. round-robin. In MPI the programmer codes this scheduling themselves so that messages from different workers do not get mixed up. We will see an example later in the asynchronous SGD.

MPI has many concrete implementations, for example OpenMPI and MPICH. 

MPI has the following basic operations:

  • point-to-point communication: blocking and non-blocking. This lets one machine send a message and another receive it. Blocking means the call does not return until the send / receive operation has finished, while non-blocking means the call returns immediately without waiting for the operation to complete.
  • collective communication: examples are reduce, all-reduce, and gather.

Here is a good place to learn basic operations of MPI. You need to go through all tutorials on that website to be able to understand and write a project that does distributed SGD with PyTorch.

How to do distributed SGD

There are two types of distributed SGD, depending on when to synchronize gradients computed on individual workers.

  • synchronous SGD:
    • all workers have a copy of the model, and the copies are always identical
    • at every mini-batch, each worker computes its share of the gradient, and the gradients are then averaged
    • the model is updated on every worker
    • then everyone moves on to the next batch
    • if there is a server,
      • the server has a copy of the model
      • the workers also have copies of the model
      • workers are assigned data and compute the forward and backward passes
      • gradients are sent to the server, which takes care of the averaging
      • workers receive the averaged update from the server
    • if there is no server, the gradient averaging is done with all_reduce. all_reduce can (but need not) be implemented with a ring algorithm, in which every worker passes its partial result to its neighbor (see the sketch after this list)
  • asynchronous SGD:
    • all workers have a copy of the model, but the copies are not kept identical; the model is inconsistent across workers
    • at every mini-batch:
      • a worker gets the current parameters from the server
      • the worker gets data from its own data loader, or from a randomly selected subset (is there still an epoch concept?)
      • the worker computes the forward pass and the gradients
      • once the computation is done, the gradients are sent to the server
      • the server updates the parameters
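For the server-less synchronous case, PyTorch's torch.distributed package (which can run on an MPI backend) already exposes all_reduce, so the gradient averaging takes only a few lines. A minimal sketch, assuming dist.init_process_group has been called and that model, optimizer, loss_fn, and data_loader are already defined on every rank:

import torch.distributed as dist

def average_gradients(model):
    """All-reduce each parameter's gradient and divide by the number of workers."""
    world_size = float(dist.get_world_size())
    for p in model.parameters():
        dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
        p.grad.data /= world_size

for data, target in data_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    average_gradients(model)   # every worker now holds the same averaged gradient
    optimizer.step()           # so every copy of the model stays identical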

Now we need to conceptually understand how this workflow can be implemented using data structure and APIs provided by PyTorch. To do this, let’s first analyze how a single machine SGD is implemented in PyTorch.

In particular, in single-machine SGD implemented with PyTorch, 1) the output of a mini-batch is computed by calling the model's forward function, 2) the loss is computed by passing the output and the target to a loss function, 3) the gradients are computed by calling loss.backward(), and 4) the parameter update is performed by calling optimizer.step(), with the model parameters having been passed to the optimizer beforehand.

Now for this single-machine workflow, key functions are:

  • We compute the gradients in the loss.backward() step
  • The gradients are stored on the model parameters (each parameter's .grad field)
  • The model parameters are updated when we call optimizer.step() (see the sketch below)
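Putting those four steps into code (a generic sketch; the model, data, and loss here are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                 # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # parameters handed to the optimizer beforehand
loss_fn = nn.CrossEntropyLoss()

data, target = torch.randn(32, 10), torch.randint(0, 2, (32,))
output = model(data)              # 1) forward pass on the mini-batch
loss = loss_fn(output, target)    # 2) loss from output and target
optimizer.zero_grad()
loss.backward()                   # 3) gradients, stored in each parameter's .grad
optimizer.step()                  # 4) parameter update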

So, in a multi-machine scenario, the following questions need to be considered:

  1. How many copies of the model should be created?
  2. What should the server store?
  3. What should the slaves store?
  4. What should they communicate with each other?

To see why question 1 matters, note that the way we deploy MPI is by shipping the same script to every machine, using the MPI rank to tell the server from the slaves, and using if-else branches to decide which block of code runs on which machine. So we could, in principle, create some objects only when we are on the server and skip them on the slaves. An obvious answer to question 1 is that everybody gets its own copy of the whole model (i.e. the whole computation graph created by model = myModelClass()), but is this necessary?

It turns out it is. Why? Although the server only needs the parameter values and gradient values to perform updates, so that in theory we could keep only the optimizer on the server and the model parameters on the slaves, this does not work in practice, because in PyTorch an optimizer is tied to a concrete set of model parameters. Moreover, the computation graph contains more than data; it also contains the relations between tensors. So we must initialize a model and an optimizer on both the server and the slaves, and use communication to keep their values in sync.

To answer question 4, here is the logic flow:

  • a slave establishes a connection with the server
  • a slave fetches the initial parameter values from the server
  • a slave closes the connection with the server
  • a slave computes the loss on its own batch of data
  • a slave computes the gradients by calling loss.backward()
  • a slave establishes a connection with the server
  • a slave sends the gradients and the loss to the server
  • a slave closes the connection with the server
  • the server updates its parameter values.

Note that several of these steps are about setting up connections. In practice, the parameters are split across the many layers of a neural network, and we have multiple slaves, so if several slaves try to send their parameter tensors to the server at the same time, data from different slaves could get interleaved. In other words, MPI needs to know who is sending, and the programmer needs a protocol that ensures only one slave-server exchange is in flight at a time.

Note MPI does not let one slave "block" the others, so we code this blocking ourselves with a "handshake", as in TCP. The idea: a slave first sends a "hello" message to the server; when the server is busy, this message waits in a queue until it is received. When the server becomes idle and receives the "hello", it knows someone is calling and reserves this round for that particular slave by receiving only from the same rank. It also sends a message back to that rank, saying, in effect, "I'm ready for you and no one else can interfere." When the slave receives this confirmation, it goes on to send the actual gradients to the server. After that, it receives the updated information sent back by the server. And once the server has finished sending the update to that slave, it becomes idle again and can receive a "hello" from another slave. This is a very interesting way to establish connections! A sketch of the protocol is below.
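The sketch uses torch.distributed's blocking send/recv on an MPI backend; the message layout, the control codes, and the single-tensor "gradient" are simplifications of my own, and the process group is assumed to be initialized already.

import torch
import torch.distributed as dist

HELLO, READY = 0, 1   # tiny control messages

def slave_round(grad):
    dist.send(torch.tensor([HELLO]), dst=0)            # knock on the server's door
    ack = torch.zeros(1, dtype=torch.long)
    dist.recv(ack, src=0)                              # wait until the server reserves us
    dist.send(grad, dst=0)                             # now safely send the gradient
    new_params = torch.zeros_like(grad)
    dist.recv(new_params, src=0)                       # get updated parameters back
    return new_params

def server_round(params, lr=0.01):
    hello = torch.zeros(1, dtype=torch.long)
    src = dist.recv(hello)                             # recv with no src: take whoever knocked first
    dist.send(torch.tensor([READY]), dst=src)          # reserve this round for that slave
    grad = torch.zeros_like(params)
    dist.recv(grad, src=src)                           # listen only to the reserved slave
    params -= lr * grad                                # asynchronous update
    dist.send(params, dst=src)
    return params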

Engineering caveats for building PyTorch with MPI backend 

Some caveats to watch out for:

  • To use PyTorch with an MPI backend, you cannot install it from Conda; you have to build it from source. PyTorch is a large package, so check your disk space and act accordingly. It took me 4 days to build it successfully….

Some links:

  • The PyTorch tutorial for distributed SGD. Note this author uses synchronous SGD, so the process is easier: you can just run an all_reduce and do not have to worry about slaves messing each other up. https://pytorch.org/tutorials/intermediate/dist_tuto.html#advanced-topics
  • My implementation of the distributed SGD process… https://github.com/yuqli/hpc/blob/master/lab4/src/lab4_multiplenode.py

Finally, this blog post is not very clearly written. If someone really reads this and has questions, please email me at greenstone1564@gmail.com.