Object Detection using Deep Neural Network
Wan-Ru Lin, 2016/10/27
Outline • Introduction • Background • R-CNN (2014) • SPPnet (2014) – speeds up R-CNN • Fast R-CNN (2015) • Faster R-CNN (2015) • YOLO (2015)
Introduction • Object detection has long been an interesting task in computer vision • Location (x, y, w, h) • Classification
Introduction • Before Fast R-CNN (2015): Region proposal, Feature extraction, Classifier (example output: cat) • After Fast R-CNN: Region proposal, Feature extraction, Classifier (example output: cat) [R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015]
Introduction • [Figure: timeline of the methods covered, from 2014 to YOLO (2015)]
Background • Convolutional Neural Network (CNN) • Feature extractor: Convolution, Nonlinearity (sigmoid, ReLU), Pooling • Classifier
Background • Pooling • reduces the spatial size • adds translation invariance • Loss function • Error backpropagation
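To make the pooling bullets concrete, here is a minimal sketch of 2x2 max pooling with stride 2 in plain Python. The window size, the stride, the function name `max_pool_2x2`, and the toy feature maps are illustrative assumptions, not taken from the slides.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D list (even height and width assumed)."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            # Collect the four activations covered by this window.
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)  # 4x4 input becomes 2x2 output

# Same map, but the strongest top-left activation moved by one pixel
# inside its pooling window.
shifted = [
    [4, 3, 2, 0],
    [1, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled_shifted = max_pool_2x2(shifted)
```

The output has half the height and width of the input, and `pooled_shifted` equals `pooled`, which is the small-shift invariance the slide refers to.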
Background • PASCAL VOC • Location • Class • 20 classes
Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

| Dataset | Training images |
|---|---|
| PASCAL VOC 2007 | 9,963 |
| PASCAL VOC 2010 | 10,103 |
| PASCAL VOC 2012 | 11,530 |
Background • Pre-training • ILSVRC dataset, ~1.2 million images • Fine-tuning • PASCAL VOC 2012
R-CNN • Multi-stage pipeline: Selective Search, CNN, SVM
R-CNN • Selective Search • Generate possible object locations
R-CNN • Training • Supervised pre-training: ILSVRC 2012 • Domain-specific fine-tuning: • warp input • number of outputs: 1000 -> 20 + 1 (background) • SVM • Separates data with a hyperplane
R-CNN • Disadvantages of R-CNN • Distortion due to warping • Training is a multi-stage pipeline • Training is expensive in space and time • Object detection is slow: VGG takes 47 s/image
R-CNN • VOC 2012 results (mAP and per-class AP)

| Method | mAP | Aero | Bike | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | Mbike | Person | Plant | Sheep | Sofa | Train | TV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-CNN VGG | 59.2 | 76.8 | 70.9 | 56.6 | 37.5 | 36.9 | 62.9 | 63.6 | 81.1 | 35.7 | 64.3 | 43.9 | 80.4 | 71.6 | 74.0 | 60.0 | 30.8 | 63.4 | 52.0 | 63.5 | 58.7 |
| Fast R-CNN | 68.4 | 82.3 | 78.4 | 70.8 | 52.3 | 38.7 | 77.8 | 71.6 | 89.3 | 44.2 | 73.0 | 55.0 | 87.5 | 80.5 | 80.8 | 72.0 | 35.1 | 68.3 | 65.7 | 80.4 | 64.2 |
| Faster R-CNN | 70.4 | 84.9 | 79.8 | 74.3 | 53.9 | 49.8 | 77.5 | 75.9 | 88.5 | 45.6 | 77.1 | 55.3 | 86.9 | 81.7 | 80.9 | 79.6 | 40.1 | 72.6 | 60.9 | 81.2 | 61.5 |
| Fast R-CNN + YOLO | 70.7 | 83.4 | 78.5 | 73.5 | 55.8 | 43.4 | 79.1 | 73.1 | 89.4 | 49.4 | 75.5 | 57.0 | 87.5 | 80.9 | 81.0 | 74.7 | 41.8 | 71.5 | 68.5 | 82.1 | 67.2 |
| YOLO | 57.9 | 77.0 | 67.2 | 57.7 | 38.3 | 22.7 | 68.3 | 55.9 | 81.4 | 36.2 | 60.8 | 48.5 | 77.2 | 72.3 | 71.3 | 63.5 | 28.9 | 52.2 | 54.8 | 73.9 | 50.8 |
SPPnet • Sharing feature maps speeds up R-CNN • Achieves mAP comparable to R-CNN

| | SPP (1-scale) | SPP (5-scale) | R-CNN (ZF-5) |
|---|---|---|---|
| mAP | 58.0 | 59.2 | |
| Conv time | 0.053 s | 0.293 s | 14.37 s |
| FC time | 0.089 s | 0.089 s | 0.089 s |
| Total time (GPU) | 0.142 s | 0.382 s | 14.46 s |
| Speedup (vs. R-CNN) | 102x | 38x | - |
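The speedup figures follow directly from the per-stage timings. The short sketch below recomputes them from the conv and FC times quoted above; the dictionary layout and variable names are illustrative assumptions.

```python
# Per-image GPU timings in seconds, copied from the SPPnet timing figures above.
conv_time = {
    "SPP (1-scale)": 0.053,
    "SPP (5-scale)": 0.293,
    "R-CNN (ZF-5)": 14.37,
}
fc_time = 0.089  # identical for all three systems

# Total time is conv time plus FC time.
total_time = {name: t + fc_time for name, t in conv_time.items()}

# Speedup is measured against the R-CNN total.
baseline = total_time["R-CNN (ZF-5)"]
speedup = {name: baseline / t for name, t in total_time.items()}
```

Almost all of the R-CNN time is spent in the convolutional layers, which is why computing the feature map once per image pays off so much.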
Fast R-CNN • Selective Search, ~2K proposals • 1-scale SPP layer (7 x 7) • Single-stage training • Training can update all network layers
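The 1-scale SPP layer is what the Fast R-CNN paper calls RoI pooling: each proposal, whatever its size, is max-pooled into a fixed grid. The following is a minimal sketch in plain Python, assuming a single-channel feature map stored as a 2-D list and a region given as `(r0, c0, r1, c1)` with exclusive end indices. The function name, argument layout, and bin-boundary rule are illustrative assumptions, not the authors' implementation.

```python
def roi_max_pool(feature_map, roi, out_size=7):
    """Max-pool the region roi = (r0, c0, r1, c1) into an out_size x out_size grid.

    r1 and c1 are exclusive; the region must contain at least one cell.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0

    def ceil_div(a, b):
        return -(-a // b)

    pooled = []
    for i in range(out_size):
        # Bin i covers rows [floor(i*h/out), ceil((i+1)*h/out)), never empty.
        rs = r0 + (i * h) // out_size
        re = r0 + ceil_div((i + 1) * h, out_size)
        row = []
        for j in range(out_size):
            cs = c0 + (j * w) // out_size
            ce = c0 + ceil_div((j + 1) * w, out_size)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# Hypothetical 8x8 feature map whose value encodes its position.
fmap = [[r * 8 + c for c in range(8)] for r in range(8)]

small_grid = roi_max_pool(fmap, (0, 0, 4, 6), out_size=2)
large_roi = roi_max_pool(fmap, (0, 0, 8, 8))  # 8x8 region
small_roi = roi_max_pool(fmap, (1, 2, 4, 5))  # 3x3 region
```

Both `large_roi` and `small_roi` come out as 7 x 7 grids, so the fully connected layers always see an input of the same size.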
Fast R-CNN • Contributions • Higher m. AP than R-CNN and SPPnet • Training is single-stage, using multi-task loss • Training can update all network layers • No disk storage is required for feature caching
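The multi-task loss is not spelled out on the slide. As a sketch, the form given in the Fast R-CNN paper [3] adds a log loss for the class to a smooth L1 loss for the box offsets, with the box term switched off for background. Function names, the argument layout, and the example numbers below are illustrative assumptions; class index 0 is taken to mean background.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere, so outliers are penalized less than with L2."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def multi_task_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss plus lam * localization loss (foreground only)."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so only the class term counts.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


# Hypothetical RoIs: one background, one foreground with two box offsets off target.
background_loss = multi_task_loss([0.5, 0.5], 0, (0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
foreground_loss = multi_task_loss([0.2, 0.8], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
```

Because both terms are computed from the same forward pass, the classifier and the box regressor can be trained together in one stage.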
Faster R-CNN • Selective Search takes most of the running time • Faster R-CNN = Fast R-CNN + Region Proposal Network (RPN)
Faster R-CNN • Region Proposal Network (RPN) • Picks the top-ranked 100 proposals at test time
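Picking the top-ranked proposals amounts to sorting the boxes by their RPN score and keeping the first N. The sketch below shows only that ranking step. The function name, the box format, and the example scores are illustrative assumptions, and the non-maximum suppression that the full pipeline applies before ranking is left out.

```python
def top_proposals(boxes, scores, n=100):
    """Return the n boxes with the highest scores, best first."""
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:n]]


# Hypothetical proposals as (x, y, w, h) with their objectness scores.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 4, 4)]
scores = [0.2, 0.9, 0.5]
kept = top_proposals(boxes, scores, n=2)
```

Keeping fewer proposals reduces the region-wise work that Fast R-CNN has to do per image.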
Faster R-CNN • Timing (ms)

| Model | System | Conv | Proposal | Region-wise | Total | Rate |
|---|---|---|---|---|---|---|
| VGG | SS + Fast R-CNN | 146 | 1510 | 174 | 1830 | 0.5 fps |
| VGG | RPN + Fast R-CNN | 141 | 10 | 47 | 198 | 5 fps |
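The totals and frame rates can be checked from the per-stage times: the total is the sum of the three stages, and the rate is 1000 ms divided by the total. The dictionary layout and variable names below are illustrative assumptions.

```python
# Per-stage timings in milliseconds, copied from the timing figures above.
stage_ms = {
    "SS + Fast R-CNN": {"conv": 146, "proposal": 1510, "region_wise": 174},
    "RPN + Fast R-CNN": {"conv": 141, "proposal": 10, "region_wise": 47},
}

# Total time per image is the sum of the stages.
total_ms = {name: sum(stages.values()) for name, stages in stage_ms.items()}

# Frames per second is 1000 ms divided by the time per image.
fps = {name: 1000 / ms for name, ms in total_ms.items()}
```

The proposal stage drops from 1510 ms to 10 ms, which accounts for nearly all of the difference between the two systems.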
Faster R-CNN • Contribution • Present RPNs for efficient and accurate region proposal generation • Sharing convolutional features for region proposal and object detection
YOLO • Uses features from the entire image to predict each bounding box • Single neural network for: Region proposal, Feature extraction, Classification, Bounding box regression
YOLO • Intersection over Union (IOU) • [Figure: example box pairs with IOU = 0.8 and IOU = 0.3]
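IOU is the area of overlap between two boxes divided by the area of their union. Here is a minimal sketch, assuming boxes in the (x, y, w, h) format from the introduction with (x, y) taken as the top-left corner. That corner convention, the function name, and the example boxes are assumptions chosen so the results match the two values quoted above.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes, (x, y) = top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth box and three predictions.
truth = (0, 0, 10, 10)
good = iou(truth, (0, 0, 10, 8))      # prediction covers most of the object
poor = iou(truth, (0, 0, 10, 3))      # prediction covers a thin strip
disjoint = iou(truth, (20, 20, 5, 5)) # no overlap at all
```

With these boxes `good` is 0.8 and `poor` is 0.3, so a higher IOU means a tighter match to the ground truth.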
YOLO • VOC 2007

| Method | Train | mAP | FPS |
|---|---|---|---|
| YOLO | 2007+2012 | 63.4 | 45 |
| Fast R-CNN | 2007+2012 | 70.0 | 0.5 |
| Faster R-CNN VGG | 2007+2012 | 73.2 | 7 |
YOLO • Limitations • Struggles with small objects that appear in groups • Struggles to generalize to objects in new or unusual aspect ratios or configurations
Reference
[1] R. Girshick, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[2] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. "Selective search for object recognition." International Journal of Computer Vision, 104(2): 154–171, 2013.
[3] R. B. Girshick. "Fast R-CNN." CoRR, abs/1504.08083, 2015.
[4] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." arXiv preprint arXiv:1506.01497, 2015.
[5] J. Redmon, et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640, 2015.
[6] K. He, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision, Springer International Publishing, 2014.