memo: Fully Convolutional Networks by Chainer

Introduction

In the previous post, scene recognition (15-class classification) was performed using Chainer. In this post, a simplified Fully Convolutional Network (FCN) is implemented in Chainer.

Computation Environment

The same instance as in the previous post, a g2.2xlarge on Amazon EC2, is used.

Dataset

In this work, an FCN is trained on the VOC2012 dataset, which includes ground-truth segmentations. Examples are given below:

The dataset contains 2913 images. I split them at a ratio of 4:1; the larger part is the training dataset and the smaller part is the testing dataset.

number of training images	number of testing images
2330	580
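
The counts in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the 4:1 split is taken first and each part is then rounded down to a multiple of its mini-batch size (10 for training, 5 for testing). The function name `split_sizes` is hypothetical and does not come from the original scripts.

```python
def split_sizes(total, train_batch=10, test_batch=5):
    """Return (n_train, n_test) for a 4:1 split, each rounded down to a batch multiple."""
    n_train = (total * 4 // 5) // train_batch * train_batch
    n_test = (total // 5) // test_batch * test_batch
    return n_train, n_test


print(split_sizes(2913))  # (2330, 580)
```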

The number of training (testing) images is rounded down to a multiple of 10 (5), which is the mini-batch size for the training (testing) images. Although the original FCN operates on an input of any size, for simplicity the FCN in this work takes a fixed-size (224$\times$224) input. Therefore, all images are resized in advance as shown below:

I cropped the largest centered square from each image and resized it to 224$\times$224.
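
The crop window for this preprocessing step could be computed as in the sketch below. It only works out the coordinates of the largest centered square. The helper name `center_square_box` is hypothetical, and the actual cropping and resizing to 224$\times$224 are left to whichever image library is used (the box follows the left, upper, right, lower convention that Pillow's `Image.crop` accepts). For the label images, nearest-neighbor interpolation would presumably be needed so that resizing does not create class indices that are not in the ground truth; the original post does not say which interpolation was used.

```python
def center_square_box(width, height):
    """Return (left, upper, right, lower) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return left, upper, left + side, upper + side


# A 500x375 landscape image keeps its full height and loses the side margins.
print(center_square_box(500, 375))  # (62, 0, 437, 375)
```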
VOC2012 provides 21 classes, as listed below:

label	category
0	background
1	aeroplane
2	bicycle
3	bird
4	boat
5	bottle
6	bus
7	car
8	cat
9	chair
10	cow
11	diningtable
12	dog
13	horse
14	motorbike
15	person
16	potted plant
17	sheep
18	sofa
19	train
20	tv/monitor

Network Structure

The detailed structure of the network used in this work is as follows:

In the original FCN, some additional layers follow the layer pool5. For simplicity, these layers are removed in this work.

name: a layer name
input: a size of an input feature map
in_channels: the number of input feature maps
out_channels: the number of output feature maps
ksize: a kernel size
stride: a stride size
pad: a padding size
output: a size of an output feature map
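
The relation between these columns is the usual output-size formula, output = floor((input + 2 * pad - ksize) / stride) + 1. The sketch below applies it to a 224$\times$224 input. The layer settings are assumptions based on the standard VGG16 layout (3$\times$3 convolutions with stride 1 and pad 1, 2$\times$2 pooling with stride 2); they are not read from the table in this post.

```python
def conv_output_size(size, ksize, stride, pad):
    """Spatial output size of a convolution or pooling layer (floor rounding)."""
    return (size + 2 * pad - ksize) // stride + 1


size = 224
# Assumed VGG16-style settings: 3x3 convolutions (stride 1, pad 1) and
# 2x2 max pooling (stride 2, pad 0).
conv_counts = [2, 2, 3, 3, 3]  # number of convolutions before pool1..pool5
pool_sizes = []
for n_conv in conv_counts:
    for _ in range(n_conv):
        size = conv_output_size(size, ksize=3, stride=1, pad=1)
    size = conv_output_size(size, ksize=2, stride=2, pad=0)
    pool_sizes.append(size)

print(pool_sizes)  # [112, 56, 28, 14, 7]
```

Under these assumptions, pool3, pool4, and pool5 produce 28$\times$28, 14$\times$14, and 7$\times$7 feature maps.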

pool3, pool4, and pool5 are followed by score-pool3, score-pool4, and score-pool5, respectively. Their parameters are shown below:

I call the outputs of score-pool3, score-pool4, and score-pool5 p3, p4, and p5, respectively.

The layers upsample_pool4 and upsample_pool5 shown in the table above are applied to p4 and p5, respectively. Their outputs are summed with p3, and the sum is upsampled back to the original image size by the layer upsample_final shown in the table. This network structure corresponds to FCN-8s in the original FCN paper.
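
For the sum to be possible, the two upsampled maps must have the same spatial size as p3. The sketch below checks this with the output-size formula of a transposed convolution, output = stride * (input - 1) + ksize - 2 * pad. The feature-map sizes (28, 14, 7) and the kernel settings are hypothetical values chosen so that each layer upsamples exactly by its stride; the parameters in the post's table may differ.

```python
def deconv_output_size(size, ksize, stride, pad):
    """Spatial output size of a transposed convolution (deconvolution)."""
    return stride * (size - 1) + ksize - 2 * pad


# Assumed sizes of p3, p4 and p5 for a 224x224 input.
p3, p4, p5 = 28, 14, 7

# Hypothetical kernel settings; each layer upsamples by its stride.
up4 = deconv_output_size(p4, ksize=4, stride=2, pad=1)     # upsample_pool4
up5 = deconv_output_size(p5, ksize=8, stride=4, pad=2)     # upsample_pool5
assert up4 == up5 == p3  # the element-wise sum needs equal sizes

final = deconv_output_size(p3, ksize=16, stride=8, pad=4)  # upsample_final
print(up4, up5, final)  # 28 28 224
```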

Implementation of Network

The network can be written in Chainer as follows:

-- myfcn.py --

I assigned the label -1 to pixels on the borderline between objects and the background. As a result, those pixels make no contribution when softmax_cross_entropy is calculated (see the Chainer specification for details). The function calculate_accuracy also discards the contribution from the borderline pixels. The function add is defined as:

-- add.py --
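
The effect of the -1 label on the loss can be illustrated outside Chainer. The following NumPy sketch is not the code from myfcn.py; it is a stand-alone approximation of a softmax cross entropy that skips pixels labeled -1. The function name `masked_softmax_cross_entropy` and the choice to average over the remaining pixels are assumptions of this sketch.

```python
import numpy as np


def masked_softmax_cross_entropy(scores, labels, ignore_label=-1):
    """Mean cross entropy over pixels whose label is not `ignore_label`.

    scores: float array of shape (n_classes, H, W)
    labels: int array of shape (H, W)
    """
    # Numerically stable log-softmax over the class axis.
    shifted = scores - scores.max(axis=0, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))

    valid = labels != ignore_label
    if not valid.any():
        return 0.0
    rows, cols = np.nonzero(valid)
    # Log-probability of the true class at every valid pixel.
    picked = log_prob[labels[rows, cols], rows, cols]
    return float(-picked.mean())


scores = np.zeros((21, 2, 2))           # uniform scores over 21 classes
labels = np.array([[0, 15], [-1, -1]])  # bottom row is borderline
loss = masked_softmax_cross_entropy(scores, labels)
print(round(loss, 4))  # 3.0445, i.e. log(21)
```

Changing the scores at the two borderline pixels leaves the loss unchanged, which is the behavior the -1 label is meant to achieve.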

Training

The training script is as follows:

-- train.py --

-- mini_batch_loader.py --

I used the function copy_model described on the page. The file VGGNet.py was downloaded from the page. The procedure in train.py is as follows:

make an instance of the type MiniBatchLoader
make an instance of the type VGGNet
make an instance of the type MyFcn
copy the parameters of the VGGNet instance to the MyFcn instance
select MomentumSGD as an optimization algorithm
run a loop to train the net
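
The parameter-copying step in the list above can be pictured with the sketch below. It is not the copy_model function itself: plain dictionaries of NumPy arrays stand in for the VGGNet and MyFcn instances, and the layer names and shapes are hypothetical. The assumed idea is that a parameter is copied only when the same layer name exists in both networks with the same shape, so layers that exist only in one network are left alone.

```python
import numpy as np


def copy_matching_params(src, dst):
    """Copy arrays from src into dst where both the name and the shape match.

    src, dst: dicts mapping a layer name to a NumPy array.
    Returns the list of names that were copied.
    """
    copied = []
    for name, value in src.items():
        if name in dst and dst[name].shape == value.shape:
            dst[name][...] = value  # in-place copy keeps dst's own buffer
            copied.append(name)
    return copied


# Hypothetical layer names and shapes, for illustration only.
vgg = {"conv1_1": np.ones((64, 3, 3, 3)), "fc8": np.ones((1000, 4096))}
fcn = {"conv1_1": np.zeros((64, 3, 3, 3)),
       "score_pool3": np.zeros((21, 256, 1, 1))}
print(copy_matching_params(vgg, fcn))  # ['conv1_1']
```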

Loading all the training data into GPU memory at once causes the error "cudaErrorMemoryAllocation: out of memory". Therefore, only one mini-batch of data is loaded each time the training procedure requires it.
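
The idea of loading one mini-batch at a time can be sketched with a generator. This is a generic illustration, not the MiniBatchLoader class from mini_batch_loader.py; the names `iterate_minibatches` and `load_fn` are hypothetical, and a dummy loader replaces real image decoding.

```python
def iterate_minibatches(paths, batch_size, load_fn):
    """Yield one loaded mini-batch at a time instead of the whole dataset."""
    for start in range(0, len(paths), batch_size):
        batch_paths = paths[start:start + batch_size]
        # Only these few images are read and held in memory at this point.
        yield [load_fn(p) for p in batch_paths]


# Dummy loader (len) standing in for real image decoding.
paths = ["img_{:04d}.jpg".format(i) for i in range(30)]
batches = list(iterate_minibatches(paths, 10, load_fn=len))
print(len(batches), len(batches[0]))  # 3 10
```

In a real training loop each yielded batch would be transferred to the GPU, used for one update, and then released before the next batch is loaded.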

Results

Training was terminated at the 62nd epoch. The accuracies and losses for the training and testing datasets are shown below.

-- Training Dataset --

-- Testing Dataset --

The accuracy is defined as the percentage of correctly classified pixels per image, averaged over the training or testing dataset. The accuracy and loss for the training dataset behave as expected, while the loss curve for the testing dataset surges at some point, which seems to indicate overfitting. The final accuracies are 99% for the training dataset and 82% for the testing dataset. A sample set consisting of an input image, a predicted segmentation, and a ground truth is shown below with its accuracy:

Other samples of predicted and ground-truth images are given below:

-- Training Dataset --

-- Testing Dataset --

Because the accuracy benefits from the background pixels, I think the accuracy has to exceed 90% to give satisfactory results.
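
The accuracy used in this post (fraction of correctly labeled pixels per image, averaged over the dataset, with borderline pixels excluded) can be sketched as follows. This is not the calculate_accuracy function from myfcn.py; the name `mean_pixel_accuracy` and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def mean_pixel_accuracy(predictions, labels, ignore_label=-1):
    """Average over images of the fraction of correctly labeled valid pixels.

    predictions, labels: int arrays of shape (N, H, W).
    """
    per_image = []
    for pred, truth in zip(predictions, labels):
        valid = truth != ignore_label  # drop borderline pixels
        if valid.any():
            per_image.append((pred[valid] == truth[valid]).mean())
    return float(np.mean(per_image)) if per_image else 0.0


truth = np.array([[[0, 0], [15, -1]],
                  [[0, 8], [8, 8]]])
pred = np.array([[[0, 0], [0, 15]],
                 [[0, 8], [8, 0]]])
# Image 1 scores 2/3 (one pixel ignored), image 2 scores 3/4.
print(mean_pixel_accuracy(pred, truth))  # about 0.708
```

In the first toy image both correct pixels are background, which shows how background pixels can lift this metric even when the object itself is missed.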

2 comments:

Anonymous, April 25, 2016, 12:50
Hey, thank you for sharing this great article!
I found a mistake in the function load_data of MiniBatchLoader. In the following section:
img = cv2.imread(path)
if img is None:
    raise RuntimeError("invalid image: {i}".format(i=path))
xs[i, :, :, :] = ((img - self.mean)/255).transpose(2, 0, 1)
I think the image type must be float32, so
'img = cv2.imread(path)'
should be
'img = cv2.imread(path).astype(np.float32)'.
What do you think?

Monday, April 4, 2016
