End-to-End Text Recognition with Convolutional Neural Networks

End-to-End Text Recognition with Convolutional Neural Networks
Tao Wang*, David J. Wu*, Adam Coates, Andrew Y. Ng
Computer Science Department, Stanford University
* Denotes equal contribution
Tao Wang

Scene Text Recognition Overview
• Text “in the wild” is hard to recognize
• Wide range of variations in backgrounds, textures, fonts, and lighting conditions
ICDAR 2003 Dataset (S. Lucas et al., 2003)
Street View Text Dataset (K. Wang et al., 2011)

Two-Stage Framework
• Detection/Classification
• High-level Inference
“HOTEL”

Works | Classification and detection | High-level inference
Weinman et al., 2008 | Appearance + Geometry | Semi-Markov CRF
K. Wang et al., 2011 | HOG + Random Ferns | Pictorial Structure
Mishra et al., 2012 | HOG + SVM with RBF Kernel | CRF + N-gram model
Neumann and Matas, 2012 | MSER + SVM with RBF Kernel | Exhaustive Graph Search

Approach | Classification and detection | High-level inference
Most other approaches | Hand-designed features + off-the-shelf classifier | Graph based inference models
Our approach | Learnt features + 2-layer CNN | Simple off-the-shelf heuristics

Various Benchmarks
Detection/Classification
• ICDAR 62-way cropped character classification: SOTA
End-to-end system after high-level inference
• ICDAR and SVT cropped word recognition: SOTA on ICDAR
• ICDAR and SVT end-to-end text recognition (lexicon): SOTA

Unsupervised Feature Learning
• Contrast Normalization + ZCA whitening
• K-Means
Coates et al., 2011
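As a rough illustration of the steps named on this slide, the NumPy sketch below contrast-normalizes image patches, ZCA-whitens them, and runs plain K-means so that the centroids can serve as filters. The random patches, the 8x8 patch size, the epsilon values, the number of centroids, and all function names are assumptions made for the example and are not taken from the slides; the K-means variant described by Coates et al. may differ from the plain version used here.

```python
import numpy as np

def contrast_normalize(patches, eps=1e-5):
    """Subtract each patch's mean and divide by its standard deviation."""
    centered = patches - patches.mean(axis=1, keepdims=True)
    return centered / np.sqrt(centered.var(axis=1, keepdims=True) + eps)

def zca_whiten(patches, eps=0.1):
    """Return ZCA-whitened patches plus the data mean and whitening matrix."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    cov = centered.T @ centered / centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)  # cov is symmetric
    # eps (an assumed regularizer) keeps near-zero eigenvalues from blowing up
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ whitener, mean, whitener

def kmeans_filters(data, k, iters=10, seed=0):
    """Plain K-means; the returned centroids act as a filter bank."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(data.shape[0], size=k, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance from every patch to every centroid
        dists = ((data ** 2).sum(axis=1, keepdims=True)
                 - 2.0 * data @ centroids.T
                 + (centroids ** 2).sum(axis=1))
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members):  # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
patches = rng.random((500, 64))  # 500 made-up 8x8 grayscale patches
normalized = contrast_normalize(patches)
whitened, mean, whitener = zca_whiten(normalized)
filters = kmeans_filters(whitened, k=16)
print(filters.shape)
```

New patches would need the same normalization, `mean`, and `whitener` applied before being compared against `filters`.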

• ~10K parameters for detection
• ~50K parameters for classification
• √ Text / × Non-Text
• Large representation but not enough data. Overfitting?
Architecture diagram:
• 1st layer: Convolution, Spatial Pooling
• 2nd layer: Convolution, Spatial Pooling
• L2-SVM Classifier
• Backpropagation
• Diagram labels: 96, 256

Figure panels: Synthetic Data, Real Data, Unrealistic Synthetic Data
• Java.Font + Natural backgrounds
• Color Statistics
• Synthetic “hard negatives”

Detector Performance

Figure labels: Text Line Bounding boxes, Candidate spaces

Classifier Performance
62-way classification accuracy on ICDAR cropped characters (higher is better)
Chart axis: Accuracy (%)
83.9 (on ICDAR-Sample characters)

Figure axes: Char Class, Sliding window position

Word Recognition
Lexicon (score: max ∑):
  …
  MAKE    -5.45
  SERIES   7.82
  ESTATE  -1.74
  POKER   -9.02
  …
Recognized word: SERIES
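The “max ∑” on this slide suggests that each lexicon word is scored by the best sum of per-character classifier scores over positions in the word image, and that the highest-scoring word is returned. The sketch below is one way to do that with a small dynamic program; it is an assumption about the procedure, not the authors' code. The `scores` table, the toy lexicon, and the function names are invented for illustration.

```python
def alignment_score(word, scores, n_positions):
    """Best sum of character scores over left-to-right placements of `word`.

    scores[ch][t] is the (hypothetical) classifier score for character `ch`
    at sliding-window position `t`. Each character must be placed strictly
    to the right of the previous one.
    """
    neg_inf = float("-inf")
    # best[t]: best total so far with the current character placed at t
    best = [scores[word[0]][t] for t in range(n_positions)]
    for ch in word[1:]:
        running_max = neg_inf  # max of best[t'] over all t' < t
        nxt = [neg_inf] * n_positions
        for t in range(n_positions):
            if running_max > neg_inf:
                nxt[t] = running_max + scores[ch][t]
            running_max = max(running_max, best[t])
        best = nxt
    return max(best)

def recognize(lexicon, scores, n_positions):
    """Return the lexicon word with the highest alignment score."""
    return max(lexicon, key=lambda w: alignment_score(w, scores, n_positions))

# Toy score table over 4 window positions (values are made up).
scores = {
    "C": [2.0, -1.0, -1.0, -1.0],
    "A": [-1.0, 1.5, 0.0, -1.0],
    "T": [-1.0, -1.0, 0.5, 2.0],
    "O": [-1.0, -1.0, -1.0, -1.0],
}
lexicon = ["CAT", "AT", "COT"]
for w in lexicon:
    print(w, alignment_score(w, scores, 4))  # CAT 5.5, AT 3.5, COT 3.0
print(recognize(lexicon, scores, 4))         # CAT
```

A raw sum favours neither short nor long words in any principled way, so a real system would likely need some normalization or thresholding on top of this.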

Cropped Word Recognition Accuracy (higher is better)
Chart: Accuracy (%) on Cropped Words Benchmarks

Candidate spaces generated by detector … …
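The slide does not say how detector responses are turned into candidate spaces. One plausible approach, shown here purely as an assumption, is 1-D non-maximum suppression on the negated responses along a text line, so that positions where the detector is most confident there is no character become space candidates. The response values, `radius`, `threshold`, and the function name are hypothetical.

```python
def nms_1d(responses, radius, threshold):
    """Indices that are maxima within +/- radius and at least `threshold`.

    Tied maxima inside one window are all kept.
    """
    peaks = []
    for i, value in enumerate(responses):
        window = responses[max(0, i - radius): i + radius + 1]
        if value >= threshold and value == max(window):
            peaks.append(i)
    return peaks

# Made-up detector responses along one text line (positive = character-like).
detector = [0.9, 0.8, -0.7, 0.6, 0.9, -0.2, -0.9, -0.3, 0.7]

# Negate so that strongly "non-text" positions become peaks.
candidate_spaces = nms_1d([-r for r in detector], radius=1, threshold=0.5)
print(candidate_spaces)  # [2, 6]
```

Each candidate can then be accepted or rejected by the later word-level search, which is why over-generating candidates at this stage is usually acceptable.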

End-to-end text recognition results (higher is better)
Chart: F-Score on End-to-end Benchmarks

Sample Output Images from SVT

Sample Output Images from ICDAR-FULL

c -- “confidence margin”
Diagram labels: PEOSTEL, Hunspell, Suggested Words (POSE, POST, PEOPLE, PISTOL, …), LEXICON, POST
Our F-score: 0.38
Neumann and Matas, 2010: 0.40

Conclusion
• Learnt features + 2-layer CNN for character detection and classification
• Simple heuristics to build an end-to-end scene text recognition system
• State-of-the-art performance on
  - ICDAR cropped character classification
  - ICDAR cropped word recognition
  - Lexicon based end-to-end recognition on ICDAR and SVT
• Extensible to a more general lexicon with an off-the-shelf spelling checker
