Video Human Action Classification
Group No. 14: Linjun Li, Yifan Cao, and Zheng Zhou

Background
● Objective of action classification: identify the actions in a video and assign the corresponding labels to them.
UC San Diego | Jacobs School of Engineering

Background
Action detection is widely used in
● Human–computer interaction
● Medical diagnosis
● Sports analytics
The related spatio-temporal techniques also have great potential in other physical problems.

Why Convolutional Neural Networks (CNN)
● Feature learning
● End-to-end training
● Weight sharing to reduce parameters
● Achieved great success in various image-based tasks
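
The effect of weight sharing on parameter count can be illustrated with a small calculation. This is only a sketch: the 224×224 RGB input, the 3×3 kernel, and the 64 output channels are assumed values chosen for illustration and are not taken from the slides.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Parameters of a 2D convolution: one shared kernel (plus bias) per output channel."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def dense_params(in_features, out_features):
    """Parameters of a fully connected layer: one weight per input-output pair, plus biases."""
    return out_features * (in_features + 1)


# Assumed sizes for illustration only.
height, width = 224, 224
conv_count = conv2d_params(in_channels=3, out_channels=64, kernel_size=3)
dense_count = dense_params(in_features=3 * height * width,
                           out_features=64 * height * width)

print(f"shared 3x3 convolution: {conv_count:,} parameters")
print(f"dense layer, same input/output size: {dense_count:,} parameters")
```

The convolution reuses the same small kernel at every spatial position, so its parameter count does not depend on the image size, whereas the dense layer grows with the product of input and output sizes.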

Two-stream Models
These models are commonly used in learning tasks that require capturing information from different domains.
● Extract the features from the spatial and temporal domains individually;
● Fuse the features and obtain the prediction in a single run.
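
The two steps above can be sketched with NumPy as follows. The slides do not specify the fusion operator, so concatenation followed by a linear classifier is an assumption made here for illustration; the feature size (128), the batch size (4), and the random weights are likewise hypothetical placeholders for the outputs of trained spatial and temporal networks.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def fuse_and_classify(spatial_feat, temporal_feat, weights, bias):
    """Concatenate per-stream features, then apply a linear classifier and softmax."""
    fused = np.concatenate([spatial_feat, temporal_feat], axis=-1)
    return softmax(fused @ weights + bias)


# Hypothetical stand-ins for features produced by the two streams.
rng = np.random.default_rng(0)
spatial = rng.normal(size=(4, 128))    # appearance features from RGB frames
temporal = rng.normal(size=(4, 128))   # motion features from optical flow
W = rng.normal(scale=0.01, size=(256, 101))  # 101 classes, as in UCF101
b = np.zeros(101)

probs = fuse_and_classify(spatial, temporal, W, b)
predicted = probs.argmax(axis=1)
```

Other fusion choices (summing, multiplying, or averaging class scores) fit the same interface by changing only the body of `fuse_and_classify`.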

Literature Survey
● Simonyan et al. [1] first proposed the two-stream network, which combines an appearance stream and a motion stream.
● Wang et al. [2] proposed the temporal segment network, which first divides the video sequence into several segments and then feeds them to the two-stream network.
● Gammulle et al. [3] integrated recurrent neural networks (RNN) into the two-stream network to improve accuracy.

Dataset
● UCF101
  ○ 13320 videos from 101 action categories
  ○ Action types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, Sports
● HMDB51
  ○ 6849 videos from 51 action categories
  ○ Action types: General Facial Actions with Object Manipulation, General Body Movements with Object Interaction, Body Movements for Human Interaction

Feature
● RGB Frames
  ○ Images taken from the video with RGB channels
  ○ For each video, uniformly sample 10 frames
● Optical Flow Frames
  ○ Stacked optical flow images representing motion
  ○ 10 optical flow frames corresponding to the 10 RGB frames
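
Uniform sampling of 10 frames can be sketched as below. The slides do not say which frame inside each segment is taken, so picking the frame at the centre of each of 10 equal-length segments is an assumption; the function name `uniform_frame_indices` is hypothetical.

```python
def uniform_frame_indices(num_frames, num_samples=10):
    """Return num_samples frame indices spread evenly over a video.

    The video is split into num_samples equal segments and the index at the
    centre of each segment is returned (an assumed convention). Integer
    arithmetic avoids floating-point rounding surprises.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [((2 * i + 1) * num_frames) // (2 * num_samples)
            for i in range(num_samples)]


indices_long = uniform_frame_indices(100)   # video with 100 frames
indices_short = uniform_frame_indices(5)    # fewer frames than samples: indices repeat
print(indices_long)
print(indices_short)
```

The same indices can then be used to pick the optical flow frames that correspond to the sampled RGB frames.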

Optical Flow Stacking
● Optical Flow Frames
  ○ d_t(p) denotes the displacement vector at point p in frame t, which moves the point to its corresponding position in frame t+1
  ○ Decompose d_t into a horizontal component and a vertical component
  ○ Stack L frames, which gives 2L channels
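
The stacking step can be sketched with NumPy as follows. The channel order (horizontal and vertical components interleaved frame by frame) follows the convention of the two-stream paper [1]; the array layout `(L, H, W, 2)` and the function name `stack_optical_flow` are assumptions made for this example, and the random arrays stand in for flow fields computed by a real optical flow method.

```python
import numpy as np


def stack_optical_flow(flows):
    """Stack L flow fields of shape (H, W, 2) into one (H, W, 2L) input volume.

    flows[t][..., 0] is the horizontal component of d_t and flows[t][..., 1]
    the vertical component. In the output, channel 2*t holds the horizontal
    component of frame t and channel 2*t + 1 the vertical component.
    """
    flows = np.asarray(flows)
    if flows.ndim != 4 or flows.shape[-1] != 2:
        raise ValueError("expected an array of shape (L, H, W, 2)")
    num_frames, height, width, _ = flows.shape
    # (L, H, W, 2) -> (H, W, L, 2) -> (H, W, 2L)
    return flows.transpose(1, 2, 0, 3).reshape(height, width, 2 * num_frames)


# Placeholder flow fields: L = 10 frames of size 4 x 5.
rng = np.random.default_rng(0)
flows = rng.normal(size=(10, 4, 5, 2))
stacked = stack_optical_flow(flows)
print(stacked.shape)
```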

Model: Baseline

Improved Model: Overview

Improved Model: Details of block 5

Results and Observations
● Mean classification accuracy on 101 classes from UCF101
● The performance is evaluated by top-1 accuracy and top-5 accuracy

Model                          Top-1 Accuracy    Top-5 Accuracy
One-stream CNN Baseline [4]    72.85%            82.93%
Two-stream CNN Baseline [5]    81.71%            93.49%
ST-Mul Net                     85.43%            98.20%
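
For reference, top-k accuracy counts a sample as correct when its true label is among the k highest-scoring classes. A minimal NumPy sketch, using a made-up three-class score matrix rather than any result from the slides:

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # argsort is ascending, so the last k columns hold the k best classes.
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data: 3 samples, 3 classes (illustrative only).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Top-5 accuracy on 101 classes is the same computation with `k=5` and a score matrix with 101 columns.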

Results and Observations
● Confusion matrix on 101 classes
● Large values are highlighted
● Large diagonal entries indicate high classification accuracy
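
A confusion matrix of this kind can be built as sketched below, with rows indexed by the true class and columns by the predicted class (an assumed orientation; the slides do not state it). The labels here are toy values for three classes, not data from the experiments.

```python
import numpy as np


def confusion_matrix(true_labels, pred_labels, num_classes):
    """Count matrix with rows = true class and columns = predicted class."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(matrix, (np.asarray(true_labels), np.asarray(pred_labels)), 1)
    return matrix


def per_class_accuracy(matrix):
    """Diagonal divided by row sums; classes with no samples get 0."""
    row_sums = matrix.sum(axis=1)
    diagonal = np.diag(matrix).astype(float)
    return np.divide(diagonal, row_sums,
                     out=np.zeros_like(diagonal), where=row_sums > 0)


# Toy labels for 3 classes (illustrative only).
true_labels = [0, 0, 1, 2, 2, 2]
pred_labels = [0, 1, 1, 2, 2, 0]

cm = confusion_matrix(true_labels, pred_labels, num_classes=3)
class_acc = per_class_accuracy(cm)
```

Off-diagonal entries show which pairs of classes are confused with each other, which is what the highlighted values outside the diagonal would point to.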

Future Work
● Investigate different fusion types and structures;
● Reproduce the experiments on the HMDB51 dataset;
● Extend 2D convolutional kernels to 3D kernels;
● Combine the CNN-based features with long short-term memory (LSTM) to enhance temporal support;
● Extract the skeleton of the human body to capture subtle movements.

References
[1] Simonyan, K. and Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (pp. 568-576).
[2] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X. and Van Gool, L., 2016, October. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (pp. 20-36). Springer, Cham.
[3] Gammulle, H., Denman, S., Sridharan, S. and Fookes, C., 2017, March. Two stream LSTM: A deep fusion framework for human action recognition. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 177-186). IEEE.
[4] Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H. and Courville, A., 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4507-4515).
[5] Feichtenhofer, C., Pinz, A. and Zisserman, A., 2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1933-1941).