
Fully Convolutional Networks for Semantic Segmentation
Jonathan Long*, Evan Shelhamer*, Trevor Darrell (UC Berkeley)
Presented by: Gordon Christie
Slide credit: Jonathan Long

Overview
• Reinterpret standard classification convnets as "fully convolutional" networks (FCNs) for semantic segmentation
• Use AlexNet, VGG, and GoogLeNet in experiments
• Novel architecture: combine information from different layers for segmentation
• State-of-the-art segmentation on PASCAL VOC 2011/2012, NYUDv2, and SIFT Flow at the time
• Inference in less than one fifth of a second for a typical image

pixels in, pixels out: semantic segmentation, monocular depth estimation (Liu et al. 2015), boundary prediction (Xie & Tu 2015)

convnets perform classification: end-to-end learning, a 1000-dim class vector ("tabby cat") out, < 1 millisecond

R-CNN does detection ("dog", "cat"): many seconds per image

R-CNN (figure: Girshick et al.)

? ? ? end-to-end learning, < 1/5 second

a classification network ("tabby cat")

becoming fully convolutional: fully connected layers are reinterpreted as convolutions, so the network can take a larger input and produce a spatial output
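
To make this step concrete: a fully connected layer that sees a C × k × k input can be rewritten as a k × k convolution with the same weights. The sketch below is a minimal PyTorch-style version of that surgery; it is my own illustration (the authors' released code is in Caffe), and the 512 × 7 × 7 example shape is the usual VGG-16 pool5 output.

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, k: int) -> nn.Conv2d:
    """Reinterpret a fully connected layer over a (in_channels, k, k) input
    as an equivalent k x k convolution that can slide over larger inputs."""
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=k)
    with torch.no_grad():
        conv.weight.copy_(fc.weight.view(fc.out_features, in_channels, k, k))
        conv.bias.copy_(fc.bias)
    return conv

# e.g. VGG-16's fc6 sees a 512 x 7 x 7 pooled feature map:
fc6 = nn.Linear(512 * 7 * 7, 4096)
conv6 = fc_to_conv(fc6, in_channels=512, k=7)
```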

upsampling output: the coarse score map is upsampled back to pixel resolution inside the network
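
In-network upsampling can be implemented as a (transposed) convolution whose kernel is initialized to bilinear interpolation and then learned. Below is a minimal sketch of that initialization; it is a PyTorch-style illustration of mine, not the authors' Caffe code, and the factor-32 / 21-class numbers simply mirror the FCN-32s setting.

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels: int, factor: int) -> torch.Tensor:
    """Transposed-convolution weights that upsample each channel bilinearly by `factor`."""
    k = 2 * factor - factor % 2
    center = (k - 1) / 2 if k % 2 == 1 else factor - 0.5
    pos = torch.arange(k, dtype=torch.float32)
    filt_1d = 1 - (pos - center).abs() / factor
    kernel_2d = filt_1d[:, None] * filt_1d[None, :]
    weight = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        weight[c, c] = kernel_2d          # each class channel is upsampled independently
    return weight

num_classes, factor = 21, 32              # FCN-32s: upsample stride-32 scores to pixels
k = 2 * factor - factor % 2
upsample = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=k,
                              stride=factor, padding=(k - factor) // 2, bias=False)
with torch.no_grad():
    upsample.weight.copy_(bilinear_kernel(num_classes, factor))
```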

end-to-end, pixels-to-pixels network: conv, pool, nonlinearity + upsampling + pixelwise output + loss
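
The "pixelwise output + loss" is simply per-pixel classification over the upsampled score map. A minimal sketch follows (PyTorch-style, my assumption rather than the original Caffe loss layer; using 255 as the ignore label is a common PASCAL convention):

```python
import torch
import torch.nn.functional as F

def pixelwise_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-pixel softmax classification loss over an upsampled score map.
    scores: (N, C, H, W) class scores; labels: (N, H, W) ints, 255 = unlabeled."""
    return F.cross_entropy(scores, labels, ignore_index=255)

scores = torch.randn(1, 21, 256, 256, requires_grad=True)
labels = torch.randint(0, 21, (1, 256, 256))
loss = pixelwise_loss(scores, labels)
loss.backward()
```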

Dense Predictions
• Shift-and-stitch: a trick that yields dense predictions without interpolation (see the sketch after this list)
• Upsampling via deconvolution
• Shift-and-stitch was used in preliminary experiments but is not included in the final model
• Upsampling was found to be more effective and efficient
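
To make shift-and-stitch concrete: run the stride-f predictor on every one of the f × f shifted copies of the input and interleave the coarse outputs into a dense map. The toy sketch below uses a stand-in "predictor" that simply subsamples, so stitching recovers the full-resolution signal; it is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def coarse_predict(x, f):
    # Stand-in for a convnet with total stride f: one value per f x f block.
    return x[::f, ::f]

def shift_and_stitch(x, f):
    """Dense output from a stride-f predictor: run it on the f*f shifted copies
    of the input and interleave the coarse outputs."""
    H, W = x.shape
    dense = np.zeros((H, W), dtype=x.dtype)
    for dy in range(f):
        for dx in range(f):
            coarse = coarse_predict(x[dy:, dx:], f)   # shift input up/left by (dy, dx)
            h, w = coarse.shape
            dense[dy:dy + h * f:f, dx:dx + w * f:f] = coarse
    return dense

x = np.arange(64, dtype=float).reshape(8, 8)
assert np.allclose(shift_and_stitch(x, 4), x)  # the identity-like predictor is recovered densely
```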

Classifier to Dense FCN
• Convolutionalize proven classification architectures: AlexNet, VGG, and GoogLeNet (reimplementation)
• Remove the classification layer and convert all fully connected layers to convolutions
• Append a 1 x 1 convolution with channel dimension 21 to predict scores at each coarse output location (20 PASCAL object categories + background), as sketched below
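
A rough sketch of what the resulting head looks like for a VGG-16 backbone, in PyTorch style. The layer names (fc6, fc7) and sizes follow VGG-16, but the exact module structure here is my own assumption, not the released Caffe model (which also pads the input and crops the output).

```python
import torch.nn as nn

NUM_CLASSES = 21  # 20 PASCAL VOC object categories + background

class VGGHeadAsFCN(nn.Module):
    """fc6/fc7 of VGG-16 rewritten as convolutions, followed by a 1x1
    convolution that scores each coarse (stride-32) output location."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone                       # conv/pool stack ending at pool5
        self.fc6 = nn.Conv2d(512, 4096, kernel_size=7)
        self.fc7 = nn.Conv2d(4096, 4096, kernel_size=1)
        self.score = nn.Conv2d(4096, NUM_CLASSES, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout2d()

    def forward(self, x):
        x = self.backbone(x)                           # (N, 512, H/32, W/32)
        x = self.drop(self.relu(self.fc6(x)))
        x = self.drop(self.relu(self.fc7(x)))
        return self.score(x)                           # (N, 21, h, w) coarse class scores
```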

Classifier to Dense FCN: cast ILSVRC classifiers into FCNs and compare performance on the PASCAL VOC 2011 validation set

spectrum of deep features: combine where (local, shallow) with what (global, deep); fuse features into a deep jet (cf. Hariharan et al. CVPR 2015 "hypercolumn")

skip layers: skip to fuse layers! interp + sum at each skip; end-to-end, joint learning of semantics and location; dense output
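
What "interp + sum" means in practice: score the shallower layer with its own 1x1 convolution, upsample the coarser scores by 2, and add the two. The sketch below is a PyTorch-style illustration under that reading (an FCN-16s-like fusion); the cropping and initialization details of the real model are omitted.

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """FCN-16s-style fusion: score a stride-16 skip layer with a 1x1 conv,
    upsample the stride-32 scores by 2, and sum the two score maps."""
    def __init__(self, skip_channels: int, num_classes: int = 21):
        super().__init__()
        self.score_skip = nn.Conv2d(skip_channels, num_classes, kernel_size=1)
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes,
                                      kernel_size=4, stride=2, padding=1, bias=False)

    def forward(self, coarse_scores, skip_features):
        up = self.up2(coarse_scores)                     # stride 32 -> stride 16
        skip = self.score_skip(skip_features)
        skip = skip[:, :, :up.shape[2], :up.shape[3]]    # crop (assumes skip map is >= up)
        return up + skip                                 # fused stride-16 scores

fused = SkipFusion(skip_channels=512)(torch.randn(1, 21, 8, 8),
                                      torch.randn(1, 512, 16, 16))
```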

skip layers (figure)

Comparison of skip FCNs: results on a subset of the PASCAL VOC 2011 validation set

skip layer refinement (figure): input image; stride 32 (no skips); stride 16 (1 skip); stride 8 (2 skips); ground truth

training + testing
• train on whole images at a time, without patch sampling (see the sketch below)
• reshape the network to take input of any size
• forward time is ~150 ms for a 500 × 500 × 21 output
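
A minimal whole-image training step, sketched in PyTorch style under the assumptions that `model` maps an image batch to per-pixel class scores at input resolution and that unlabeled pixels are marked 255; this is my illustration, not the authors' Caffe solver configuration.

```python
import torch
import torch.nn.functional as F

def train_step(model, image, labels, optimizer):
    """One whole-image training step: a full image in, a full-resolution
    label map as the target; no patch sampling anywhere."""
    model.train()
    scores = model(image)                          # (N, C, H, W) pixelwise class scores
    loss = F.cross_entropy(scores, labels, ignore_index=255)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```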

Results – PASCAL VOC 2011/12
VOC 2011: 8498 training images (from additional labeled data)

Results – NYUDv2: 1449 RGB-D images with pixelwise labels, 40 categories

Results – SIFT Flow
• 2688 images with pixel labels: 33 semantic categories, 3 geometric categories
• Learn both label spaces jointly (see the sketch after this list); joint learning and inference have performance and computation similar to independent models
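
One way to read "learn both label spaces jointly" is a shared trunk with two pixelwise score heads whose losses are summed. The sketch below is my PyTorch-style illustration of that reading; the 33/3 class counts come from the slide, everything else (trunk, channel count) is assumed.

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadedFCN(nn.Module):
    """Shared trunk with two pixelwise heads: 33 semantic and 3 geometric classes."""
    def __init__(self, trunk: nn.Module, trunk_channels: int):
        super().__init__()
        self.trunk = trunk
        self.semantic = nn.Conv2d(trunk_channels, 33, kernel_size=1)
        self.geometric = nn.Conv2d(trunk_channels, 3, kernel_size=1)

    def forward(self, x):
        f = self.trunk(x)                           # shared features
        return self.semantic(f), self.geometric(f)  # two score maps

def joint_loss(sem_scores, geo_scores, sem_labels, geo_labels):
    # Joint training: simply sum the two pixelwise classification losses.
    return (F.cross_entropy(sem_scores, sem_labels)
            + F.cross_entropy(geo_scores, geo_labels))
```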

Relative to the prior state of the art, SDS*: 20% relative improvement in mean IoU, 286× faster (figure: FCN vs. SDS vs. ground truth on input images)
*SDS = Simultaneous Detection and Segmentation, Hariharan et al. ECCV 2014

leaderboard (figure): multiple FCN entries; FCN == segmentation with Caffe

conclusion
• fully convolutional networks are fast, end-to-end models for pixelwise problems
• code in a Caffe branch (merged soon): caffe.berkeleyvision.org, github.com/BVLC/caffe
• models for PASCAL VOC, NYUDv2, SIFT Flow, PASCAL Context: fcn.berkeleyvision.org