CNNs for object detection

Sliding window detector using object classifier for each window
• Take a set of image patches
• For each patch, classify what object the patch contains, using an object classifier trained with hand-crafted features

Brute force method: high latency to improve recall
• The location of the object of interest is unknown, so the window slides at small strides
• The size/aspect ratio of the object of interest is unknown, so windows of different sizes are used
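The brute force search can be sketched in a few lines of Python. This is only an illustration: the function name `sliding_windows`, the image size, the window sizes and the stride are all made up for the example, and a real detector would run a classifier on every box produced.

```python
def sliding_windows(img_w, img_h, window_sizes, stride):
    """Yield (x, y, w, h) boxes for every window size at every stride step."""
    for w, h in window_sizes:
        # Slide over every position where the window still fits in the image.
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)


# Hypothetical 64 x 64 image, one square and one wide-short window.
windows = list(sliding_windows(64, 64, [(32, 32), (32, 16)], stride=16))
print(len(windows), "windows to classify")
```

Smaller strides and more window sizes raise the number of windows, and therefore the number of classifier runs, which is where the latency comes from.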

Non-maximum suppression (to improve precision)
Sample heuristics:
• Remove all boxes with confidence < T1
• For each object class, pick the box with the highest confidence and remove all other boxes whose IoU with it is > T2; repeat on the remaining boxes and stop when all boxes have been considered
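A minimal sketch of the two heuristics above, in plain Python. The box format (x1, y1, x2, y2), the detection tuple layout and the threshold values are assumptions chosen for the example, not part of the slides.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, t1, t2):
    """detections: list of (box, confidence, class_label) tuples."""
    # Heuristic 1: drop low-confidence boxes, grouping the rest by class.
    by_class = {}
    for det in detections:
        if det[1] >= t1:
            by_class.setdefault(det[2], []).append(det)

    # Heuristic 2: per class, keep the best box and drop its heavy overlaps.
    kept = []
    for dets in by_class.values():
        dets.sort(key=lambda d: d[1], reverse=True)
        while dets:
            best = dets.pop(0)
            kept.append(best)
            dets = [d for d in dets if iou(best[0], d[0]) <= t2]
    return kept


detections = [
    ((0, 0, 10, 10), 0.9, "car"),
    ((1, 1, 11, 11), 0.8, "car"),      # overlaps the first car box heavily
    ((20, 20, 30, 30), 0.7, "car"),    # a separate car
    ((0, 0, 10, 10), 0.2, "car"),      # below the confidence threshold
    ((0, 0, 10, 10), 0.6, "person"),   # other class, handled independently
]
kept = nms(detections, t1=0.3, t2=0.5)
```

Running suppression per class means a person box is never removed because it overlaps a car box.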

R-CNN (Girshick et al. 2013)
• Region proposals reduce the brute force search of the sliding window approach
• Each proposed region is warped to match the input size expected by the CNN
• The CNN gives more discriminative features than hand-crafted ones
• A multi-class SVM is trained with the CNN features

Regions from selective search

Bounding box regressor
During training, a bounding box regressor is learnt by minimizing the regression loss against the ground truth bounding boxes.
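As a rough illustration of what the regressor learns, the sketch below uses one common parameterization: offsets of the box center scaled by the proposal size, and log ratios of width and height. The slides do not spell out the parameterization or the loss, so the function names, the (center_x, center_y, width, height) box format and the squared-error loss are assumptions.

```python
import math


def regression_targets(proposal, ground_truth):
    """Targets the regressor should output for this proposal.

    Boxes are (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def apply_offsets(proposal, offsets):
    """Turn predicted offsets back into a refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))


def squared_error(predicted, target):
    """Regression loss for one proposal (sum of squared differences)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 48.0, 30.0, 40.0)
targets = regression_targets(proposal, ground_truth)
refined = apply_offsets(proposal, targets)  # recovers the ground truth box
```

A perfect prediction of the targets gives zero loss and maps the proposal exactly onto the ground truth box.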

R-CNN
Number of CNN runs = number of proposals, hence high latency

Fast R-CNN (Girshick et al. 2015)
• 1 CNN run for the whole image
• Proposed regions are extracted from the CNN feature map, instead of the image

New operation: Region of Interest (ROI) pooling
Necessary to convert variable sized ROIs to the fixed size inputs expected by the following FC layers (similar to warping original image patches in R-CNN to get fixed sized inputs for the conv layers)

ROI pooling
1. 8 x 8 conv feature map
2. Region of Interest (ROI); look at the Fast R-CNN GitHub code to see how the ROI is divided
3. 2 x 2 intended pooling output
4. Output
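The four steps can be sketched in plain Python as below. The way the ROI is split into bins here (floor for the bin start, ceil for the bin end, so neighbouring bins may share a row or column) is an assumption for illustration and may not match the Fast R-CNN code exactly; the feature map values and the ROI are made up.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one ROI of a 2D feature map down to out_h x out_w.

    roi = (row_start, col_start, row_end, col_end), end exclusive.
    """
    r0, c0, r1, c1 = roi
    roi_h, roi_w = r1 - r0, c1 - c0
    output = []
    for i in range(out_h):
        # Row range covered by output row i.
        rs = r0 + math.floor(i * roi_h / out_h)
        re = r0 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            # Column range covered by output column j.
            cs = c0 + math.floor(j * roi_w / out_w)
            ce = c0 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        output.append(row)
    return output


# 1. 8 x 8 feature map (dummy values), 2. a 5 x 7 ROI, 3. 2 x 2 output.
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]
pooled = roi_max_pool(feature_map, roi=(3, 0, 8, 7), out_h=2, out_w=2)
```

Whatever the ROI size, the output is always 2 x 2, which is what the following FC layers need.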

Faster R-CNN (Ren et al. 2016)

Region Proposal Network (RPN)
Instead of relying on external region proposals as Fast R-CNN does, Faster R-CNN generates proposals by re-using the same initial convolutional feature map.

Region Proposal Network (RPN)
• Foreground vs. background probabilities
• x, y: decide the location of the region proposal
• width, height: decide the shape/size of the region proposal (adjust if we are only looking for faces (square), cars (wide-short) or pedestrians (narrow-tall))
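One way to picture the candidate shapes is to generate a set of reference boxes (anchors) at every feature map location, as in the sketch below. The function name `make_anchors`, the stride, the scales and the aspect ratios are assumptions for the example; the RPN then predicts foreground/background probabilities and x, y, width, height adjustments for each box.

```python
def make_anchors(feat_h, feat_w, stride, scales, aspect_ratios):
    """Return (center_x, center_y, width, height) boxes in image pixels.

    aspect_ratios are width / height; each box keeps an area of scale ** 2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature map cell, mapped back to the image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in aspect_ratios:
                    w = scale * ratio ** 0.5
                    h = scale / ratio ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors


# 0.5 = narrow-tall (pedestrians), 1.0 = square (faces), 2.0 = wide-short (cars)
anchors = make_anchors(feat_h=2, feat_w=3, stride=16,
                       scales=[64, 128], aspect_ratios=[0.5, 1.0, 2.0])
```

If only one kind of object matters, the list of aspect ratios can be cut down to the matching shape, which reduces the number of boxes to score.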

Yolo: You Only Look Once (Redmon et al. 2016)
R-CNN, Faster R-CNN:
• Propose regions with foreground/background labels
• Give class labels for each foreground region
Yolo:
• Propose regions with class labels

Yolo: CNN output for 2 bounding boxes (similar to anchor boxes in RPN)
Size of output y: 3 (#grid rows) x 3 (#grid columns) x 2 (#bounding boxes) x 8 (confidence 1, coord 4, class 3)
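The layout of y can be sketched with nested Python lists. The order inside each 8-vector follows the slide (confidence, then 4 coordinates, then 3 class scores); the helper names, the meaning of the 4 coordinates as (x, y, w, h) and the threshold are assumptions for illustration.

```python
GRID_ROWS, GRID_COLS, NUM_BOXES, NUM_CLASSES = 3, 3, 2, 3
BOX_VECTOR = 1 + 4 + NUM_CLASSES  # confidence, 4 coords, class scores = 8


def empty_output():
    """All-zero y with shape 3 x 3 x 2 x 8."""
    return [[[[0.0] * BOX_VECTOR for _ in range(NUM_BOXES)]
             for _ in range(GRID_COLS)]
            for _ in range(GRID_ROWS)]


def encode_object(y, row, col, box_index, box, class_id):
    """Write one object into the grid cell that contains it."""
    vec = y[row][col][box_index]
    vec[0] = 1.0               # confidence
    vec[1:5] = box             # assumed (x, y, w, h)
    vec[5 + class_id] = 1.0    # one-hot class score


def decode(y, threshold):
    """List (row, col, box_index, box, class_id) for confident boxes."""
    detections = []
    for r in range(GRID_ROWS):
        for c in range(GRID_COLS):
            for b in range(NUM_BOXES):
                vec = y[r][c][b]
                if vec[0] >= threshold:
                    scores = vec[5:]
                    class_id = max(range(NUM_CLASSES), key=lambda k: scores[k])
                    detections.append((r, c, b, tuple(vec[1:5]), class_id))
    return detections


y = empty_output()
encode_object(y, row=1, col=2, box_index=0, box=(0.5, 0.5, 0.2, 0.4), class_id=2)
found = decode(y, threshold=0.5)
```

The whole output holds 3 x 3 x 2 x 8 = 144 numbers, produced in a single CNN run.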

Yolo: Non-maximum suppression
Very similar to sliding window detectors with NMS (CNN features instead of hand-crafted features, feature extraction pipeline shared instead of per window).

Yolo: accuracy vs. latency
Precision for class c: TP(c) / (TP(c) + FP(c))
• TP(c): a proposal for class c has IoU > T with a ground truth label
• FP(c): a proposal for class c has IoU > T with no ground truth label
• FN(c): a ground truth label has IoU > T with no proposal (affects recall)
Mean Average Precision (mAP)
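The counts above can be turned into precision and recall for one class as sketched below. The (x1, y1, x2, y2) box format and the rule that each ground truth label can be matched by at most one proposal are assumptions added for the example; mAP would additionally average precision over confidence thresholds and classes, which is not shown here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(proposals, ground_truths, iou_threshold):
    """Precision and recall for one class from lists of boxes."""
    matched = set()
    tp = fp = 0
    for p in proposals:
        # Find the best still-unmatched ground truth box above the threshold.
        best_idx, best_iou = None, iou_threshold
        for i, g in enumerate(ground_truths):
            overlap = iou(p, g)
            if i not in matched and overlap > best_iou:
                best_idx, best_iou = i, overlap
        if best_idx is None:
            fp += 1
        else:
            tp += 1
            matched.add(best_idx)
    fn = len(ground_truths) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


proposals = [(0, 0, 10, 10), (50, 50, 60, 60)]
ground_truths = [(1, 1, 11, 11), (100, 100, 110, 110)]
precision, recall = precision_recall(proposals, ground_truths, iou_threshold=0.5)
```

In this toy case one proposal matches a label, one proposal matches nothing (FP) and one label is missed (FN).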

SSD: Single Shot MultiBox Detector (Liu et al. 2016)
Uses multiple feature map resolutions to detect objects at different sizes/aspect ratios (similar to the different sized image patches in sliding window detection).

SSD vs. Yolo
• Smaller input image size and lack of FC layers improve performance (speed)
• More conv layers for different feature map sizes improve accuracy

Object classification CNNs summary
• AlexNet, 2012: first CNN based classifier, 8 layers
• VGGNet, 2014: smaller conv filters with same effective receptive field, 16 or 19 layers
• GoogLeNet, 2014: multiple conv filters in each layer (inception module), bottleneck layers to reduce computation and dimensions, 22 layers
• ResNet, 2015: shortcut connections to help train deeper networks, up to 152 layers

Object detection CNNs summary
• R-CNN, 2013: gets region proposals externally, runs CNN on each proposal
• Fast R-CNN, 2015: gets region proposals externally, CNN run once over whole image
• Faster R-CNN, 2016: generates region proposals, shares conv layers between proposal and classification
• Yolo, 2016: no proposals, only CNN based classification of fixed bounding boxes
• SSD, 2016: fixed bounding boxes (different resolutions)

Android CNN projects
• How many credit students?
• How many students will do project?
• Android programming experience?
• Next week:
  – Recurrent Neural Networks, LSTM (Tue)
  – Metrics and trade-offs (Fri)