Introduction
On this page, I perform scene recognition using the Caffe library. Fine-tuning the pre-trained model that Caffe provides on scene images achieves a recognition accuracy of about 95%.
Computation Environment
I used a g2.2xlarge instance on Amazon EC2, which is equipped with a GPU.
Dataset
I trained the CNN using the LSP15 dataset available on this page. The dataset consists of the following 15 directories:
- MITcoast
- MITforest
- MIThighway
- MITinsidecity
- MITmountain
- MITopencountry
- MITstreet
- MITtallbuilding
- bedroom
- CALsuburb
- industrial
- kitchen
- livingroom
- PARoffice
- store
The name of each directory represents a scene category.
Each directory contains about 200 to 300 images belonging to that category.
Data Augmentation
To augment the dataset, I added mirror images to it. The images were then split into two groups, "train" and "test." Each image is 256 $\times$ 256 pixels with 3 channels. The number of images in each category is as follows:
| label | name | number of train | number of test |
|---|---|---|---|
| 0 | MITcoast | 610 | 100 |
| 1 | MIThighway | 440 | 70 |
| 2 | MITmountain | 630 | 100 |
| 3 | MITstreet | 490 | 80 |
| 4 | MITforest | 550 | 90 |
| 5 | MITinsidecity | 520 | 80 |
| 6 | MITopencountry | 690 | 110 |
| 7 | MITtallbuilding | 600 | 100 |
| 8 | bedroom | 360 | 60 |
| 9 | CALsuburb | 400 | 60 |
| 10 | industrial | 520 | 80 |
| 11 | kitchen | 360 | 60 |
| 12 | livingroom | 490 | 80 |
| 13 | PARoffice | 360 | 60 |
| 14 | store | 540 | 90 |
| | Total | 7560 | 1220 |
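The mirroring and splitting steps can be sketched as follows. This is a minimal illustration rather than the script actually used: the helper names are hypothetical, the image is modeled as plain rows of pixels, the `_flipped` suffix follows the file names that appear in the list files, and the random split is an assumption because the original splitting rule is not stated. The count of 355 source images is inferred from the MITcoast row of the table (610 + 100 = 2 × 355).

```python
import random


def mirror(image):
    """Return the horizontal mirror of an image given as rows of pixels."""
    return [list(reversed(row)) for row in image]


def flipped_name(filename):
    """Name of the mirrored copy, e.g. a.jpg -> a_flipped.jpg."""
    stem, ext = filename.rsplit(".", 1)
    return f"{stem}_flipped.{ext}"


def augment_and_split(filenames, n_test, seed=0):
    """Add one mirrored copy per file, shuffle, and hold out n_test files.

    Assumption: the split is random and happens after mirroring.
    """
    augmented = list(filenames) + [flipped_name(f) for f in filenames]
    random.Random(seed).shuffle(augmented)
    return augmented[n_test:], augmented[:n_test]


# 355 source images, inferred from the MITcoast row (610 + 100 = 2 * 355).
names = [f"MITcoast_image_{i:04d}.jpg" for i in range(1, 356)]
train, test = augment_and_split(names, n_test=100)
print(len(train), len(test))  # 610 100
```

Splitting after mirroring, as this sketch does, can place an image in "train" and its mirror in "test"; splitting before mirroring would avoid that.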
Dataset for Caffe
Caffe requires the following directories and files:
- a directory which contains training images
- a directory which contains test images
- a text file listing the file name and label of each training image
- a text file listing the file name and label of each test image
In my environment, they are located at the following paths:
- /home/ubuntu/data/caffe_256_15/train/
- /home/ubuntu/data/caffe_256_15/test/
- /home/ubuntu/data/caffe_256_15/train.txt
- /home/ubuntu/data/caffe_256_15/test.txt
The contents of the file "test.txt" are as follows:
MITstreet_image_0179_flipped.jpg 3
MITtallbuilding_image_0173_flipped.jpg 7
MITcoast_image_0126.jpg 0
store_image_0158_flipped.jpg 14
MITinsidecity_image_0102_flipped.jpg 5
MITforest_image_0200_flipped.jpg 4
industrial_image_0189_flipped.jpg 10
MITcoast_image_0142.jpg 0
kitchen_image_0019_flipped.jpg 11
bedroom_image_0210_flipped.jpg 8
bedroom_image_0116_flipped.jpg 8
livingroom_image_0008_flipped.jpg 12
kitchen_image_0051_flipped.jpg 11
MITstreet_image_0167_flipped.jpg 3
MITcoast_image_0315.jpg 0
....
The contents of the file "train.txt" are as follows:
industrial_image_0190.jpg 10
CALsuburb_image_0103_flipped.jpg 9
bedroom_image_0022_flipped.jpg 8
MITopencountry_image_0222.jpg 6
MITstreet_image_0040.jpg 3
MIThighway_image_0053_flipped.jpg 1
livingroom_image_0063_flipped.jpg 12
store_image_0106_flipped.jpg 14
industrial_image_0144.jpg 10
kitchen_image_0085_flipped.jpg 11
bedroom_image_0040.jpg 8
MIThighway_image_0088_flipped.jpg 1
industrial_image_0264.jpg 10
bedroom_image_0117_flipped.jpg 8
MITcoast_image_0021_flipped.jpg 0
...
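Each line of these list files is a file name followed by a space and the integer label from the table above. A small sketch of how such lines could be generated is shown here; the function names are hypothetical, and the category is taken from the part of the file name before `_image_`, which matches the names listed above.

```python
# Label assignment copied from the table in the "Data Augmentation" section.
CATEGORIES = [
    "MITcoast", "MIThighway", "MITmountain", "MITstreet", "MITforest",
    "MITinsidecity", "MITopencountry", "MITtallbuilding", "bedroom",
    "CALsuburb", "industrial", "kitchen", "livingroom", "PARoffice", "store",
]
LABELS = {name: index for index, name in enumerate(CATEGORIES)}


def list_line(filename):
    """Return one 'file name, space, label' line for a list file."""
    category = filename.split("_image_")[0]
    return f"{filename} {LABELS[category]}"


def write_list(filenames, path):
    """Write one line per image to a list file such as train.txt."""
    with open(path, "w") as f:
        for name in filenames:
            f.write(list_line(name) + "\n")


print(list_line("MITstreet_image_0179_flipped.jpg"))
# MITstreet_image_0179_flipped.jpg 3
```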
After storing the images listed in "test.txt" and "train.txt" in the "test" and "train" directories respectively, I ran this script to create the dataset for Caffe.
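The script itself is not reproduced here. A conversion step of this kind can be sketched with Caffe's `convert_imageset` tool, which takes an image root folder, a list file, and an output database name. The location of the binary under `build/tools/`, the `--shuffle` flag, and the relative output names are assumptions, not details taken from the original script.

```python
import subprocess

# Caffe checkout path as cited elsewhere in this post; binary location assumed.
CAFFE_ROOT = "/home/ubuntu/buildspace/caffe-master"
DATA_ROOT = "/home/ubuntu/data/caffe_256_15"


def convert_command(split):
    """Build the convert_imageset command for 'train' or 'test'."""
    return [
        f"{CAFFE_ROOT}/build/tools/convert_imageset",
        "--backend=leveldb",          # write LevelDB instead of the default LMDB
        "--shuffle",                  # assumption: shuffle the list before writing
        f"{DATA_ROOT}/{split}/",      # root folder holding the images
        f"{DATA_ROOT}/{split}.txt",   # list file: "<file name> <label>"
        f"{split}_leveldb",           # output database
    ]


def convert(split):
    """Run the conversion; requires a built Caffe at CAFFE_ROOT."""
    subprocess.run(convert_command(split), check=True)
```

Calling `convert("train")` and `convert("test")` would then produce the two databases.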
"test_leveldb" and "train_leveldb" which are the inputs for Caffe are output as shown below.
Definition of CNN
I defined the structure of the CNN in the file "model/scene_recognition/train_val.prototxt", which is based on the file "/home/ubuntu/buildspace/caffe-master/models/bvlc_reference_caffenet/train_val.prototxt". My file differs from the original in the "data" and "fc8" layers. I replaced the "fc8" layer with a new layer named "scene_fc8" and set its parameters in accordance with the explanation on this page.
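As an illustration of what a replaced final layer can look like, the sketch below prints a "scene_fc8" definition for 15 classes. It is not the author's file: the `layer { ... }` syntax of newer Caffe versions is assumed, and the learning-rate multipliers of 10 and 20 follow the convention of Caffe's own fine-tuning example rather than values stated in this post. Renaming the layer matters because Caffe copies pre-trained weights by layer name, so a layer with a new name is initialized from scratch.

```python
def scene_fc8_layer(num_classes=15, weight_lr_mult=10, bias_lr_mult=20):
    """Return a prototxt fragment for the new final layer (illustrative values)."""
    return f"""layer {{
  name: "scene_fc8"
  type: "InnerProduct"
  bottom: "fc7"
  top: "scene_fc8"
  # Assumed multipliers: learn the new layer faster than the copied ones.
  param {{ lr_mult: {weight_lr_mult} decay_mult: 1 }}
  param {{ lr_mult: {bias_lr_mult} decay_mult: 0 }}
  inner_product_param {{
    num_output: {num_classes}
    weight_filler {{ type: "gaussian" std: 0.01 }}
    bias_filler {{ type: "constant" value: 0 }}
  }}
}}"""


print(scene_fc8_layer())
```

The "loss" and "accuracy" layers would also need their `bottom` changed from "fc8" to "scene_fc8".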
Definition of Solver
The solver file used for training the CNN is based on the file "models/bvlc_reference_caffenet/solver.prototxt".
Its path is "model/scene_recognition/solver.prototxt".
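Only a few solver settings can be inferred from this post: the network file, the test interval of 500 iterations, the total of 80,000 iterations, and a snapshot prefix ending in "scene_train". The sketch below renders just those fields. The GPU mode is an assumption based on the instance type, and the remaining fields (learning rate, learning-rate policy, test iterations, and so on) are left out because their values are not stated.

```python
# Fields inferred from this post; everything else is deliberately omitted.
SOLVER_FIELDS = {
    "net": '"model/scene_recognition/train_val.prototxt"',
    "test_interval": 500,
    "max_iter": 80000,
    "snapshot_prefix": '"scene_train"',  # snapshot directory is not stated
    "solver_mode": "GPU",                # assumption: training ran on the GPU
}


def solver_text(fields):
    """Render the fields as 'key: value' lines in prototxt style."""
    return "\n".join(f"{key}: {value}" for key, value in fields.items())


print(solver_text(SOLVER_FIELDS))
```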
Training
I ran this script to train the CNN.
The pre-trained model that Caffe provides, "models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel", is passed as the argument of the "-weights" command option. The script fine-tunes the pre-trained model on the current dataset.
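A fine-tuning run of this kind is typically started with the `caffe train` command, giving the solver and the pre-trained weights. The sketch below builds such a command; the location of the `caffe` binary is an assumption, and the original script may contain additional options.

```python
import subprocess


def train_command(caffe_bin="build/tools/caffe"):
    """Build the fine-tuning command (binary location is an assumption)."""
    return [
        caffe_bin,
        "train",
        "-solver=model/scene_recognition/solver.prototxt",
        "-weights=models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel",
    ]


def train():
    """Start training; requires a built Caffe and the files named above."""
    subprocess.run(train_command(), check=True)
```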
Result
In the accuracy plot, the x-axis indicates the iteration number in units of 500 iterations, and the y-axis indicates the recognition accuracy. Because training runs for 80,000 iterations in total and the solver outputs the accuracy every 500 iterations, the maximum value on the x-axis is 160 (= 80,000 / 500). The recognition accuracy reaches about 95%.
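The relation between the x-axis value and the iteration number is simple arithmetic, shown here with the two numbers from the text.

```python
TEST_INTERVAL = 500   # accuracy is reported every 500 iterations
MAX_ITER = 80000      # total number of training iterations


def iteration_at(x):
    """Convert an x-axis value of the plot to a training iteration."""
    return x * TEST_INTERVAL


x_max = MAX_ITER // TEST_INTERVAL
print(x_max, iteration_at(x_max))  # 160 80000
```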
Construction of Classifier
After the training, the file "scene_train_iter_80000.caffemodel" is created. It contains the parameters of the fine-tuned CNN. To construct a classifier from this model file, a network definition file is also needed.
That file is named "deploy.prototxt". It is made from the file "model/scene_recognition/train_val.prototxt" by the following steps:
- Remove the "data" layer and add four "input_dim" lines in its place.
- Remove the "loss" and "accuracy" layers and add a new layer in their place.
The four lines that replace the "data" layer have the following meanings:
- input_dim: 20 --- batch size
- input_dim: 3 --- channel number
- input_dim: 227 --- height of an image
- input_dim: 227 --- width of an image
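The two edits can be sketched as text fragments generated from Python. The four `input_dim` values are the ones listed above, in Caffe's blob order (batch, channels, height, width). The softmax layer named "prob" is an assumption about which layer replaces "loss" and "accuracy", and the `layer { ... }` syntax of newer Caffe versions is assumed as well.

```python
def input_dims(batch=20, channels=3, height=227, width=227):
    """Return the four input_dim lines in Caffe's N, C, H, W order."""
    return "\n".join(f"input_dim: {d}" for d in (batch, channels, height, width))


def prob_layer(bottom="scene_fc8"):
    """Return a softmax output layer (assumed replacement for loss/accuracy)."""
    return f"""layer {{
  name: "prob"
  type: "Softmax"
  bottom: "{bottom}"
  top: "prob"
}}"""


print(input_dims())
print(prob_layer())
```

Deploy files that use `input_dim` lines normally also declare the input blob name with a line such as `input: "data"` before them.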
The code to classify an image is implemented in a file named "classifier.py".
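The original listing is not reproduced here. A classifier of this kind can be sketched with pycaffe's `caffe.Classifier`, as below. The location of "deploy.prototxt", the preprocessing arguments (`image_dims`, `raw_scale`, `channel_swap`), and the helper names are assumptions, and no mean subtraction is applied; if the training data layer subtracts a mean, the same mean should be passed through the `mean` argument.

```python
# Category names in label order, copied from the table in this post.
CATEGORIES = [
    "MITcoast", "MIThighway", "MITmountain", "MITstreet", "MITforest",
    "MITinsidecity", "MITopencountry", "MITtallbuilding", "bedroom",
    "CALsuburb", "industrial", "kitchen", "livingroom", "PARoffice", "store",
]


def top_prediction(probabilities, names=CATEGORIES):
    """Return (name, probability) of the most likely category."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return names[best], probabilities[best]


def classify(image_path,
             deploy="model/scene_recognition/deploy.prototxt",  # assumed location
             weights="scene_train_iter_80000.caffemodel"):
    """Classify one image with the fine-tuned model (requires pycaffe)."""
    import caffe  # imported here so top_prediction works without Caffe installed

    net = caffe.Classifier(
        deploy, weights,
        image_dims=(256, 256),    # resize inputs to the training image size
        raw_scale=255,            # load_image returns values in [0, 1]
        channel_swap=(2, 1, 0),   # RGB -> BGR, as the reference model expects
    )
    image = caffe.io.load_image(image_path)
    # oversample=False classifies the center crop only.
    probabilities = net.predict([image], oversample=False)[0]
    return top_prediction(list(probabilities))
```

With Caffe installed and the model files in place, `classify("some_image.jpg")` would return a pair of category name and probability.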
Now I can classify the images.