## 2016年7月16日土曜日

### Fully Convolutional Networks by Chainer 〜 revisit 1 〜

in Japanese

#### Introduction

In the previous post, a simplified Fully Convolutional Network (FCN) was implemented by means of Chainer. Its accuracy is lower than that of the original FCN. In this post, I have remarkably improved the accuracy by the following modifications:
1. The original implementation of the FCN is written with Caffe. Caffe supports a "CropLayer" which the original code uses. Unfortunately Chainer does not support layers corresponding to the "CropLayer." I have noticed, however, that the similar outcome is achieved by utilizing an argument "outsize" which is passed to a function "chainer.functions.deconvolution_2d." The use of the function allows my code to accept inputs of any size. In the previous post, an input was made square with 224$\times$224 in advance.
2. In the previous post, I didn't understand the fact that an image with information on labels in VOC dataset is not RGB mode, but index one. Therefore, I spent time to implement a function by which RGB is converted to an integer value. In this post, an index image is read using index mode directly.
3. The procedures of the construction of the original FCN is fcn32s $\rightarrow$ fcn16s $\rightarrow$ fcn8s. The accuracy of the FCN depends mostly on that of the fcn32s, and the contributions from the fcn16s and the fcn8s are at most 1%. Though I showed the implementation of the fcn8s in the previous post, in this post the code of the fcn32s is demonstrated.
4. As a mini batch learning was employed in the previous post, "chainer.links.BatchNormalization" was able to be introduced. In this post, there is no room to introduce the normalization because of the use of a one-by-one learning.
5. In the previous post, weights in deconvolutional layers were learned using "chainer.links.Deconvolution2D." In this post, the deconvolutional layers are fixed to the simple bilinear interpolation.

#### Learning Curve

The resultant learning curves are as follows:

Accuracy
Loss
Both curves have the ideal behaviour. The accuracy for test images achieves about 92%, which is much better than the previous accuracy. (If the trained model of fcn32s is succeeded by fcn16s and fcn8s, the accuracy will be about 93%.) More detail is provided below.

#### Computation Environment

The same instance as the previous post, g2.2xlarge in the Amazon EC2, is used. As input images have various sizes, there is no choice but to run the one-by-one learning. The mini batch learning is not able to be used. It takes 3.5 days for the above learning curves to converge.

#### Dataset

Just as before, the FCN is trained on a dataset VOC2012 which includes ground truth segmentations. The number of images is 2913. I divided them by the split ratio 4:1. The former set of images corresponds to a training dataset, and the latter a testing one.

number of train number of test
2330 583

As described above, all images are not resized in advance. The number of the categories is 20+1(background).

label category
0 background
1 aeroplane
2 bicycle
3 bird
4 boat
5 bottle
6 bus
7 car
8 cat
9 chair
10 cow
11 diningtable
12 dog
13 horse
14 motorbike
15 person
16 potted plant
17 sheep
18 sofa
19 train
20 tv/monitor

#### Network Structure

The detailed structure of a network is as follows:
In the original FCN, there are some layers that follow the layer pool5. For simplicity, these layers are again removed. Under the assumption that an image is a square with 224 $\times$ 224, one side length of an image is written to a column "input" of the table shown above. It should be noticed that the FCN implemented in this post is actually permitted to accept inputs of any size. Each column of the table indicates:
1. name: a layer name
2. input: a size of an input feature map
3. in_channels: the number of input feature maps
4. out_channels: the number of output feature maps
5. ksize: a kernel size
6. stride: a stride size
8. output: a size of an output feature map
Because the fcn32s is implemented, only score-pool5 following pool5 is required. I refer to an output of score-pool5 as p5. The number of feature maps of p5 is equal to that of categories (21).
Though a column "input" of that figure also shows one side length of an image under the same assumption as shown above, inputs of any size are permitted. After upsampling p5 back to input size, it is compared with a label image.

#### Implementation of Network

The network can be written in Chainer like this:

-- myfcn_32s_with_any_size.py -- By assigning a label of -1 to pixels on the borderline between objects, we can ignore contribution from those pixels when calculating softmax_cross_entropy (see the Chainer's specification for details). Using an argument "ignore_label" passed to a function "F.accuracy" allows us to calculate the accuracy without contribution from pixels on the borderlines. The function "F.deconvolution_2d" shown from the 99th line to the 100th line upsamples p5 back to the input size.

#### Training

The script for training is as follows:

-- train_32s_with_any_size.py -- I used copy_model and VGGNet.py described in the previous post. The procedures shown in the script train_32s_with_any_size.py are as follows:
1. make an instance of the type MiniBatchLoaderWithAnySize
2. make an instance of the type VGGNet
3. make an instance of the type MyFcn32sWithAnySize
4. copy parameters of the instance of VGGNet to those of that of MyFcn32sWithAnySize
5. select MomentumSGD as an optimization algorithm
6. run a loop to train the net
The script mini_batch_loader_with_any_size.py is written like this:

The function load_voc_label shown from the 91th line to the 96th line loads a label image using the index mode and replaces 255 with -1. To load all the training data on the GPU memory at a time causes the error "cudaErrorMemoryAllocation: out of memory" to occur. Therefore, only one pair of an image and a label is loaded every time a training procedure requires it.

#### Prediction Results

The left column shows prediction images overlaid on testing images. A value under each image indicates an accuracy which is defined as the percentage of the pixels which have labels classified correctly that are not on the borderlines between objects. The right column shows the ground truths.

I think that if fcn16s and fcn8s follow fcn32s, outlines of the objects will be much finer.