memo: Fully Convolutional Networks by Chainer 〜 revisit 1 〜

Introduction

　In the previous post, a simplified Fully Convolutional Network (FCN) was implemented by means of Chainer. Its accuracy is lower than that of the original FCN. In this post, I have remarkably improved the accuracy by the following modifications:

The original implementation of the FCN is written with Caffe. Caffe supports a "CropLayer" which the original code uses. Unfortunately Chainer does not support layers corresponding to the "CropLayer." I have noticed, however, that the similar outcome is achieved by utilizing an argument "outsize" which is passed to a function "chainer.functions.deconvolution_2d." The use of the function allows my code to accept inputs of any size. In the previous post, an input was made square with 224$\times$224 in advance.
In the previous post, I didn't understand the fact that an image with information on labels in VOC dataset is not RGB mode, but index one. Therefore, I spent time to implement a function by which RGB is converted to an integer value. In this post, an index image is read using index mode directly.
The procedures of the construction of the original FCN is fcn32s $\rightarrow$ fcn16s $\rightarrow$ fcn8s. The accuracy of the FCN depends mostly on that of the fcn32s, and the contributions from the fcn16s and the fcn8s are at most 1%. Though I showed the implementation of the fcn8s in the previous post, in this post the code of the fcn32s is demonstrated.
As a mini batch learning was employed in the previous post, "chainer.links.BatchNormalization" was able to be introduced. In this post, there is no room to introduce the normalization because of the use of a one-by-one learning.
In the previous post, weights in deconvolutional layers were learned using "chainer.links.Deconvolution2D." In this post, the deconvolutional layers are fixed to the simple bilinear interpolation.

Learning Curve

　The resultant learning curves are as follows:

Accuracy

Loss

Both curves have the ideal behaviour. The accuracy for test images achieves about 92%, which is much better than the previous accuracy. (If the trained model of fcn32s is succeeded by fcn16s and fcn8s, the accuracy will be about 93%.) More detail is provided below.

Computation Environment

　The same instance as the previous post, g2.2xlarge in the Amazon EC2, is used. As input images have various sizes, there is no choice but to run the one-by-one learning. The mini batch learning is not able to be used. It takes 3.5 days for the above learning curves to converge.

Dataset

　Just as before, the FCN is trained on a dataset VOC2012 which includes ground truth segmentations. The number of images is 2913. I divided them by the split ratio 4:1. The former set of images corresponds to a training dataset, and the latter a testing one.

number of train	number of test
2330	583

As described above, all images are not resized in advance. The number of the categories is 20+1(background).

label	category
0	background
1	aeroplane
2	bicycle
3	bird
4	boat
5	bottle
6	bus
7	car
8	cat
9	chair
10	cow
11	diningtable
12	dog
13	horse
14	motorbike
15	person
16	potted plant
17	sheep
18	sofa
19	train
20	tv/monitor

Network Structure

　The detailed structure of a network is as follows:

In the original FCN, there are some layers that follow the layer pool5. For simplicity, these layers are again removed. Under the assumption that an image is a square with 224 $\times$ 224, one side length of an image is written to a column "input" of the table shown above. It should be noticed that the FCN implemented in this post is actually permitted to accept inputs of any size. Each column of the table indicates:

name: a layer name
input: a size of an input feature map
in_channels: the number of input feature maps
out_channels: the number of output feature maps
ksize: a kernel size
stride: a stride size
pad: a padding size
output: a size of an output feature map

Because the fcn32s is implemented, only score-pool5 following pool5 is required. I refer to an output of score-pool5 as p5. The number of feature maps of p5 is equal to that of categories (21).

Though a column "input" of that figure also shows one side length of an image under the same assumption as shown above, inputs of any size are permitted. After upsampling p5 back to input size, it is compared with a label image.

Implementation of Network

　The network can be written in Chainer like this:

-- myfcn_32s_with_any_size.py -- By assigning a label of -1 to pixels on the borderline between objects, we can ignore contribution from those pixels when calculating softmax_cross_entropy (see the Chainer's specification for details). Using an argument "ignore_label" passed to a function "F.accuracy" allows us to calculate the accuracy without contribution from pixels on the borderlines. The function "F.deconvolution_2d" shown from the 99th line to the 100th line upsamples p5 back to the input size.

Training

　The script for training is as follows:

-- train_32s_with_any_size.py -- I used copy_model and VGGNet.py described in the previous post. The procedures shown in the script train_32s_with_any_size.py are as follows:

make an instance of the type MiniBatchLoaderWithAnySize
make an instance of the type VGGNet
make an instance of the type MyFcn32sWithAnySize
copy parameters of the instance of VGGNet to those of that of MyFcn32sWithAnySize
select MomentumSGD as an optimization algorithm
run a loop to train the net

The script mini_batch_loader_with_any_size.py is written like this:

-- mini_batch_loader_with_any_size.py --
The function load_voc_label shown from the 91th line to the 96th line loads a label image using the index mode and replaces 255 with -1. To load all the training data on the GPU memory at a time causes the error "cudaErrorMemoryAllocation: out of memory" to occur. Therefore, only one pair of an image and a label is loaded every time a training procedure requires it.

Prediction Results

　The left column shows prediction images overlaid on testing images. A value under each image indicates an accuracy which is defined as the percentage of the pixels which have labels classified correctly that are not on the borderlines between objects. The right column shows the ground truths.

I think that if fcn16s and fcn8s follow fcn32s, outlines of the objects will be much finer.

Download

　You can download tag 2016-07-09

memo

2016年7月16日土曜日

Fully Convolutional Networks by Chainer 〜 revisit 1 〜