Object Detection using Deep Neural Network WanRu Lin

Object Detection using Deep Neural Network Wan-Ru, Lin 2016/10/27

Outline • Introduction • Background • R-CNN (2014) • SPPnet (2014) – speedup R-CNN • Fast R-CNN (2015) • Faster R-CNN (2015) • YOLO (2015)

Introduction • Object detection has long been an interesting task in computer vision ✓ Location (x, y, w, h) ✓ Classification

Introduction • Before Fast R-CNN (2015): [figure: Region proposal, Feature extraction, Classifier; output "cat"] • After Fast R-CNN: [figure: Region proposal, Feature extraction, Classifier; output "cat"] [R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015]

Introduction (2014) YOLO (2015)

Background • Convolutional Neural Network (CNN) • Feature extractor: convolution, nonlinearity (sigmoid, ReLU), pooling • Classifier
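
The two nonlinearities named on this slide can be written in a few lines. This is a minimal sketch using scalar inputs; the function names `sigmoid` and `relu` are illustrative choices, not taken from the slides.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)

# Apply both nonlinearities to a few sample pre-activations.
samples = [-2.0, 0.0, 3.0]
sigmoid_out = [sigmoid(v) for v in samples]
relu_out = [relu(v) for v in samples]  # [0.0, 0.0, 3.0]
```

In a CNN these functions are applied element-wise to every value of a convolution output.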

Background • Pooling: reduces the spatial size; provides some translation invariance • Loss function • Error backpropagation
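
As an illustration of how pooling reduces spatial size, here is a minimal sketch of 2 × 2 max pooling with stride 2 on a single-channel feature map stored as nested lists. The helper name `max_pool_2x2`, the window size, and the toy input are assumptions made for this example.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a single-channel map.

    Assumes the height and width are both even (hypothetical helper).
    """
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            # Keep only the largest value of each non-overlapping 2x2 block.
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]

x = [[1, 2, 0, 0],
     [3, 4, 0, 1],
     [0, 0, 5, 6],
     [0, 2, 7, 8]]
pooled = max_pool_2x2(x)  # [[4, 1], [2, 8]]
```

The 4 × 4 input becomes 2 × 2, and moving a maximum to another position inside its 2 × 2 block leaves the output unchanged, which is the sense in which pooling tolerates small translations.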

Background • PASCAL VOC • Location • Class • 20 classes:
  Person: person
  Animal: bird, cat, cow, dog, horse, sheep
  Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
  Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

  Dataset          Training images
  PASCAL VOC 2007   9,963
  PASCAL VOC 2010  10,103
  PASCAL VOC 2012  11,530

Background • Pre-training • ILSVRC dataset: ~1.2 million images • Fine-tuning • PASCAL VOC 2012

R-CNN • Multi-stage pipeline: Selective Search → CNN feature extraction → SVM

R-CNN • Selective Search • Generate possible object locations

R-CNN • Training • Supervised pre-training: ILSVRC 2012 • Domain-specific fine-tuning: warp the input; change the number of outputs from 1000 to 20 + 1 (background) • SVM: separates data with a hyperplane
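
The "separate data with a hyperplane" bullet can be sketched as the decision rule of a linear SVM: a feature vector is scored by its signed distance (up to scale) from the hyperplane w·x + b = 0. The weights, bias, and feature values below are made-up numbers for illustration; in R-CNN one such linear SVM is trained per class on CNN features.

```python
def svm_score(weights, bias, features):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def svm_predict(weights, bias, features):
    """Return +1 (class present) or -1 (class absent)."""
    return 1 if svm_score(weights, bias, features) >= 0 else -1

# Hypothetical 2-D example: the hyperplane is the line x0 - x1 - 0.5 = 0.
weights, bias = [1.0, -1.0], -0.5
positive = svm_predict(weights, bias, [2.0, 0.5])   # score  1.0 -> +1
negative = svm_predict(weights, bias, [0.0, 1.0])   # score -1.5 -> -1
```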

R-CNN • Disadvantages of R-CNN • Distortion due to warping • Training is a multi-stage pipeline • Training is expensive in space and time • Object detection is slow: VGG takes 47 s/image

R-CNN • VOC 2012 results (mAP and per-class AP, %)

  Class   R-CNN VGG  Fast R-CNN  Faster R-CNN  Fast R-CNN + YOLO  YOLO
  mAP     59.2       68.4        70.4          70.7               57.9
  Aero    76.8       82.3        84.9          83.4               77.0
  Bike    70.9       78.4        79.8          78.5               67.2
  Bird    56.6       70.8        74.3          73.5               57.7
  Boat    37.5       52.3        53.9          55.8               38.3
  Bottle  36.9       38.7        49.8          43.4               22.7
  Bus     62.9       77.8        77.5          79.1               68.3
  Car     63.6       71.6        75.9          73.1               55.9
  Cat     81.1       89.3        88.5          89.4               81.4
  Chair   35.7       44.2        45.6          49.4               36.2
  Cow     64.3       73.0        77.1          75.5               60.8
  Table   43.9       55.0        55.3          57.0               48.5
  Dog     80.4       87.5        86.9          87.5               77.2
  Horse   71.6       -           81.7          80.9               72.3
  Mbike   74.0       80.8        80.9          81.0               71.3
  Person  60.0       72.0        79.6          74.7               63.5
  Plant   30.8       35.1        40.1          41.8               28.9
  Sheep   63.4       68.3        72.6          71.5               52.2
  Sofa    52.0       65.7        60.9          68.5               54.8
  Train   63.5       80.4        81.2          82.1               73.9
  TV      58.7       64.2        61.5          67.2               50.8

SPPnet • Sharing feature maps speeds up R-CNN • Achieves mAP comparable to R-CNN

                       SPP (1-scale)  SPP (5-scale)  R-CNN (ZF-5)
  mAP                  58.0           59.2           -
  Conv time            0.053 s        0.293 s        14.37 s
  FC time              0.089 s        0.089 s        0.089 s
  Total time (GPU)     0.142 s        0.382 s        14.46 s
  Speedup (vs. R-CNN)  102×           38×            -
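
Feature maps can be shared across proposals because spatial pyramid pooling turns a region of any size into a fixed-length vector. Below is a minimal single-channel sketch. The pyramid levels (1, 2, 4), the simplified bin boundaries, and the name `spp_pool` are assumptions for illustration and may differ from the exact configuration in the SPPnet paper.

```python
import math

def spp_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a single-channel map into n x n bins for each pyramid level.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i; floor/ceil keeps every bin non-empty.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled

# Two regions of different sizes yield vectors of the same length (1 + 4 + 16 = 21).
region_a = [[r * 7 + c for c in range(7)] for r in range(5)]   # 5 x 7
region_b = [[r * 6 + c for c in range(6)] for r in range(8)]   # 8 x 6
vec_a, vec_b = spp_pool(region_a), spp_pool(region_b)
```

Because the vector length is fixed, the convolutional layers run once per image and only this pooling step plus the fully connected layers run per proposal, consistent with the conv-time gap in the table above.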

Fast R-CNN • Selective Search: ~2K region proposals • RoI pooling: a 1-scale SPP layer (7 × 7) • Single-stage training • Training can update all network layers

Fast R-CNN • Contributions • Higher mAP than R-CNN and SPPnet • Training is single-stage, using a multi-task loss • Training can update all network layers • No disk storage is required for feature caching
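
The multi-task loss mentioned above combines a classification term with a box-regression term, as defined in the Fast R-CNN paper: L = L_cls + λ·[u ≥ 1]·L_loc, where L_cls is the log loss of the true class u and L_loc sums a smooth L1 penalty over the four box offsets. The sketch below handles a single region of interest. The function names, the default λ = 1, and the sample numbers are illustrative assumptions.

```python
import math

def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear for |x| >= 1 (less sensitive to outliers)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multi_task_loss(class_probs, true_class, pred_box, target_box, lam=1.0):
    """Classification log loss plus box-regression loss for one RoI.

    Class 0 is treated as background, which has no box target.
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss

# Foreground RoI: both terms contribute.
fg_loss = multi_task_loss([0.2, 0.8], 1, [0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0])
# Background RoI: only the classification term is used.
bg_loss = multi_task_loss([0.5, 0.5], 0, [0.0] * 4, [0.0] * 4)
```

Training both terms through one network is what lets Fast R-CNN drop the separate SVM and regressor stages of R-CNN.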

Faster R-CNN • Selective Search consumes much of the running time • Faster R-CNN = Fast R-CNN + region proposal network (RPN)

Faster R-CNN • Region proposal network (RPN) • Pick the 100 top-ranked proposals at test time
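
Picking the top-ranked proposals amounts to sorting candidate boxes by their objectness score and keeping the first N. The sketch below shows only that ranking step. The helper name `top_proposals` and the toy boxes are assumptions, and the non-maximum suppression that the full Faster R-CNN pipeline applies to proposals is deliberately left out.

```python
def top_proposals(boxes, scores, n=100):
    """Return the n highest-scoring boxes and their scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return [boxes[i] for i in order], [scores[i] for i in order]

# Hypothetical proposals as (x, y, w, h) with objectness scores.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (8, 2, 6, 12), (1, 1, 4, 4)]
scores = [0.30, 0.95, 0.60, 0.10]
kept_boxes, kept_scores = top_proposals(boxes, scores, n=2)
# kept_boxes == [(5, 5, 20, 20), (8, 2, 6, 12)]
```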

Faster R-CNN • Timing (ms)

  Model  System            Conv  Proposal  Region-wise  Total  Rate
  VGG    SS + Fast R-CNN   146   1510      174          1830   0.5 fps
  VGG    RPN + Fast R-CNN  141   10        47           198    5 fps
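
The frame rates in the timing table follow directly from the per-stage times: sum the stages to get milliseconds per image, then divide 1000 by that total. A short check, using the numbers from the table and an illustrative helper name:

```python
def frames_per_second(stage_ms):
    """Convert per-stage latencies in milliseconds into frames per second."""
    return 1000.0 / sum(stage_ms)

# Stage order: conv, proposal, region-wise (values from the timing table).
ss_total = sum([146, 1510, 174])                 # 1830 ms
rpn_total = sum([141, 10, 47])                   # 198 ms
ss_fps = frames_per_second([146, 1510, 174])     # about 0.55 fps
rpn_fps = frames_per_second([141, 10, 47])       # about 5.05 fps
```

Almost all of the gain comes from the proposal stage, which drops from 1510 ms to 10 ms.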

Faster R-CNN • Contributions • Introduces RPNs for efficient and accurate region proposal generation • Shares convolutional features between region proposal and object detection

YOLO • Uses features from the entire image to predict each bounding box • A single neural network replaces the separate stages: region proposal, feature extraction, classification, bounding box regression

YOLO • Intersection over union (IOU) • [figure: example box pairs with IOU = 0.8 and IOU = 0.3]
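
IOU is the area where two boxes overlap divided by the area of their union. The sketch below assumes boxes are given as (x, y, w, h) with (x, y) the top-left corner. That convention and the function name `iou` are assumptions, since the slides do not specify them.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes in (x, y, w, h) form."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

same = iou((0, 0, 10, 10), (0, 0, 10, 10))       # 1.0: identical boxes
half = iou((0, 0, 10, 10), (5, 0, 10, 10))       # 50 / 150, about 0.33
apart = iou((0, 0, 10, 10), (20, 20, 5, 5))      # 0.0: no overlap
```

A value such as 0.8 therefore indicates a prediction that nearly coincides with the ground-truth box, while 0.3 indicates a loose match.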

YOLO • VOC 2007

  Model             Train      mAP   FPS
  YOLO              2007+2012  63.4  45
  Fast R-CNN        2007+2012  70    0.5
  Faster R-CNN VGG  2007+2012  73.2  7

YOLO • Limitations • Struggles with small objects that appear in groups • Struggles to generalize to objects in new or unusual aspect ratios or configurations

References
[1] R. Girshick et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[2] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. "Selective search for object recognition." International Journal of Computer Vision, 104(2):154–171, 2013.
[3] R. B. Girshick. "Fast R-CNN." CoRR, abs/1504.08083, 2015.
[4] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." arXiv preprint arXiv:1506.01497, 2015.
[5] J. Redmon et al. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640, 2015.
[6] K. He et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision, Springer International Publishing, 2014.