Sunday, November 22, 2015

3D Object Recognition by CNN and SVM

Introduction


 A method for 3D object recognition, based on the work by H. Su et al., has already been proposed and evaluated on two 3D object datasets, ModelNet10 and ModelNet40. The classification accuracy for the ModelNet10 dataset has been shown here and that for the ModelNet40 dataset here. That method uses a convolutional neural network (CNN) as both a feature extractor and a classifier. In this page the method is extended so that the CNN serves only as a feature extractor and the classification is performed by a support vector machine (SVM). It is demonstrated that the classification accuracy of the extended method is slightly higher than those of the previous methods.

Classification Accuracies


 The following table shows classification accuracies for the previous and the proposed methods. The previous results are quoted from the Princeton ModelNet page.

Algorithm             Accuracy (ModelNet10)   Accuracy (ModelNet40)
MVCNN                 -                       90.1%
VoxNet                92%                     80.3%
DeepPano              85.45%                  77.63%
3DShapeNets           83.5%                   77%
the proposed method   92.84%                  90.92%

Proposed Method


 In the proposed method the CNN is used as a feature extractor and the SVM is applied as a classifier.

Feature Extractor:

 The CNN model (bvlc_reference_caffenet.caffemodel) that Caffe provides is fine-tuned on the training set of ModelNet10 or ModelNet40. The fine-tuning procedure is described in detail in this page. After training, the output of the layer "fc7" is used as a feature vector (4096 dimensions). As explained here, in the proposed method a 3D object yields 20 gray images, which are converted from depth images. The fine-tuned CNN, used as the feature extractor, is applied to each of them, so 20 feature vectors are obtained per 3D object.


Following the work by H. Su et al., the element-wise maximum across the 20 vectors is taken to form a single feature vector with 4096 dimensions, as shown below.
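The element-wise maximum over the per-view feature vectors can be sketched in a few lines of plain Python. This is only an illustration, not the code used for the experiments: the function name `max_pool_views` is hypothetical, and the per-view vectors are assumed to be already available as lists of floats of equal length (4096 in the setting above, 3 in the toy example).

```python
def max_pool_views(view_features):
    """Return the element-wise maximum across per-view feature vectors.

    view_features: list of equal-length lists of floats, one per rendered view
    (hypothetical input layout; the post uses 20 views of 4096 dimensions).
    """
    if not view_features:
        raise ValueError("need at least one view")
    dim = len(view_features[0])
    if any(len(v) != dim for v in view_features):
        raise ValueError("all views must have the same dimension")
    # zip(*...) groups the i-th element of every view into one column
    return [max(column) for column in zip(*view_features)]


# Toy example with 3 views of 3 dimensions each
views = [
    [0.1, 2.0, 0.0],
    [0.5, 1.0, 0.0],
    [0.3, 3.0, 0.2],
]
pooled = max_pool_views(views)
```

Each output element keeps the strongest response to that feature over all views, so the result does not depend on the order in which the views are rendered.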


Classifier:

 Since each 3D object is represented by one feature vector, classification by an SVM is straightforward. The current work uses LIBLINEAR, a library that implements linear SVMs. One SVM is trained on the ModelNet10 training set and another on the ModelNet40 training set, each with the "train" program offered by LIBLINEAR. Note that, to improve the accuracy, the square root of every element of the feature vector is computed before the linear SVM is applied. In other words, a Hellinger kernel classifier is employed. After training, the "predict" program, also offered by LIBLINEAR, is run on the testing sets of both ModelNet10 and ModelNet40.
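The square-root step and the preparation of input for LIBLINEAR can be sketched as follows. This is a minimal illustration under assumptions, not the code used for the experiments: the helper names `hellinger_map` and `to_liblinear_line` are hypothetical, the features are assumed to be non-negative (which should hold if "fc7" is read after its ReLU), and the output follows the sparse "label index:value" text format that LIBLINEAR reads, with 1-based indices and zeros omitted.

```python
import math


def hellinger_map(features):
    """Element-wise square root of a feature vector.

    Assumes every element is non-negative; math.sqrt raises ValueError otherwise.
    """
    return [math.sqrt(x) for x in features]


def to_liblinear_line(label, features):
    """Format one example as a sparse 'label index:value ...' text line.

    Indices are 1-based and zero-valued features are skipped.
    """
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0.0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


# Toy example: a 3-dimensional pooled feature vector with class label 3
mapped = hellinger_map([0.0, 4.0, 9.0])
line = to_liblinear_line(3, mapped)
```

A linear SVM on the mapped vectors computes inner products of the form sum_i sqrt(x_i * y_i), which is why the square root followed by a linear classifier behaves like a Hellinger kernel classifier.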

References

  1. MVCNN: Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller, Multi-view Convolutional Neural Networks for 3D Shape Recognition
  2. VoxNet: D. Maturana and S. Scherer. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. IROS2015.
  3. DeepPano: B Shi, S Bai, Z Zhou, X Bai. DeepPano: Deep Panoramic Representation for 3-D Shape Recognition. Signal Processing Letters 2015.
  4. 3DShapeNets: Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang and J. Xiao. 3D ShapeNets: A Deep Representation for Volumetric Shapes. CVPR2015.

Saturday, November 14, 2015

3D Object Recognition by Caffe 〜 40-class classification 〜

Introduction


 In the previous page, a new method for 3D object recognition was proposed. The method is based on the work of H. Su et al. and was applied to a 10-class classification using the ModelNet10 dataset, where it achieved a classification accuracy of 90.2%. In this page, the method is evaluated on the ModelNet40 dataset. It is shown that a classification accuracy of 87.16% is reached.

Dataset


 The CNN model (bvlc_reference_caffenet.caffemodel) that Caffe provides is fine-tuned using the ModelNet40 dataset which consists of 40 categories. Each category has training and testing 3D models which are in Object File Format (OFF). The numbers of the models are as follows:

training models   testing models
9843              2468

Fine-tuning Result


 The following figure shows a fine-tuning process.
The x-axis indicates the iteration number and the y-axis the recognition accuracy. Because the total number of iterations is 90,000 and the solver is configured to output the accuracy once every 1000 iterations, the maximum value on the x-axis is 90 (= 90,000/1000). The recognition accuracy reaches about 83.6%.

Classification Accuracy


 As described in the previous page, a 3D model yields 20 gray images. The fine-tuned CNN is applied to each of them and 20 labels are obtained per 3D model. The final label is decided by majority vote. This algorithm is evaluated on the 3D models in the testing set, which number 2468 as shown above. The classification accuracy reaches 87.16% (= 2151/2468). The following table is quoted from the Princeton ModelNet page. Despite its simple procedure, the proposed algorithm is second only to MVCNN among the listed methods.

Algorithm                Accuracy
MVCNN                    90.1%
VoxNet                   80.3%
DeepPano                 77.63%
3DShapeNets              77%
the proposed algorithm   87.16%
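The majority-vote step and the accuracy figure reported above can be sketched in plain Python. This is only an illustration under assumptions: the function names are hypothetical, and the tie-breaking rule (the smallest label wins) is an assumption, since the post does not say how ties among the 20 per-view labels are resolved.

```python
from collections import Counter


def majority_vote(view_labels):
    """Return the most frequent label among the per-view predictions.

    Ties are broken by choosing the smallest label (an assumption; the
    original procedure does not specify a tie-breaking rule).
    """
    counts = Counter(view_labels)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


def accuracy_percent(num_correct, num_total):
    """Classification accuracy as a percentage rounded to two decimals."""
    return round(100.0 * num_correct / num_total, 2)


# Toy example: 20 per-view labels for one 3D model, 12 votes for class 3
final_label = majority_vote([3] * 12 + [8] * 8)

# Reproduces the figure quoted above: 2151 correct out of 2468 test models
reported = accuracy_percent(2151, 2468)
```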

The detailed information on accuracy for each category is shown below.

References

  1. Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller, Multi-view Convolutional Neural Networks for 3D Shape Recognition
  2. D. Maturana and S. Scherer. VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition. IROS2015.
  3. B Shi, S Bai, Z Zhou, X Bai. DeepPano: Deep Panoramic Representation for 3-D Shape Recognition. Signal Processing Letters 2015.
  4. Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang and J. Xiao. 3D ShapeNets: A Deep Representation for Volumetric Shapes. CVPR2015.

Wednesday, November 11, 2015

3D Object Recognition by Caffe 〜 20-class classification 〜


Introduction


 In the previous page, a new method for 3D object recognition was proposed. The method was applied to a 10-class classification and a classification accuracy of 90.2% was achieved. In this page, the method is evaluated on a 20-class classification. It is shown that a high classification accuracy of 95.3% is reached.

Dataset


 The pre-trained CNN model (bvlc_reference_caffenet.caffemodel) that Caffe provides is fine-tuned using a dataset of 20 categories chosen from the ModelNet40 dataset. These categories are shown below.
  1. airplane
  2. bathtub
  3. bed
  4. bench
  5. bookshelf
  6. bottle
  7. bowl
  8. car
  9. chair
  10. cone
  11. cup
  12. curtain
  13. desk
  14. door
  15. dresser
  16. flower_pot
  17. glass_box
  18. guitar
  19. keyboard
  20. lamp
Each category has training and testing 3D models which are in Object File Format (OFF). The number of models in each category is as follows:
label   name         training models   testing models
0       airplane     626               100
1       bathtub      106               50
2       bed          515               100
3       bench        173               20
4       bookshelf    572               100
5       bottle       335               100
6       bowl         64                20
7       car          197               100
8       chair        889               100
9       cone         167               20
10      cup          79                20
11      curtain      138               20
12      desk         200               86
13      door         109               20
14      dresser      200               86
15      flower_pot   149               20
16      glass_box    171               100
17      guitar       155               100
18      keyboard     145               20
19      lamp         124               20
total                5114              1202

Results of CNN and Classification


 The fine-tuning yields a high recognition accuracy of 93% as shown below.
As described in the previous page, a 3D model yields 20 gray images. The fine-tuned CNN is applied to each of them and 20 labels are obtained per 3D model. The final label is decided by majority vote. This algorithm is evaluated on the 1202 3D models in the testing set, as listed above. A classification accuracy of 95.3% is reached.

Thursday, November 5, 2015

my own memo: cpplinq

Sample code


Output