Understanding and Describing Tennis Videos Mohak Kumar Sukhwani

Understanding and Describing Tennis Videos Mohak Kumar Sukhwani 201307583 IIIT Hyderabad Advisor: Prof. C. V. Jawahar Center for Visual Information Technology, IIIT-Hyderabad, India

Sports Video Analysis
Cricket: Temporal segmentation and annotation of actions with semantic descriptions.
Snooker and volleyball: (Left) Analysis of shot trajectories and stroke analysis. (Right) Player identification and action recognition.

Ice hockey: Player recognition and tracking on field.
Soccer: Real-time football analysis, including automatic game summarization, player tracking, and highlight extraction.

Handball: Trajectory-based handball video understanding.
Basketball: Tracking players under global appearance constraints.

Computer Vision and Language Processing
< video slide – motivation >
How will you describe it?

Visual-Semantic Alignments (Varied Approaches)
Tags • {tennis, green, net, player, court}
Phrases • Playing tennis on court.
Captions • Group of people on tennis court playing lawn tennis.

Our Approach

Tennis Data
IN, Winner: Sharapova!!! Fine serve placed at T, Halep delivers a backhand return, couple of shots exchanged, Halep overhits a backhand in the rally.
IN, Winner: Djokovic!!! Slice serve, Tsonga fails to put it back.
IN, Winner: Venus!!! Williams arrows a good serve at T, Sharapova is unable to return it.
New Video: IN, Winner: Federer!!! Quick serve, Delpotro returns a quick forehand return, couple of shots exchanged, Delpotro nets a forehand down the line.

Does confining the domain help?
Frequency comparison of unrestricted tennis text (tennis news, blogs, etc., denoted by '*') with tennis commentaries.

Figure labels: Action Localization, Action Recognition, Phrase Recognition, Description Retrieval.
IN, Winner: Halep!!! Fine serve placed at T, Halep returns a quick forehand return, couple of shots exchanged, Sharapova fails to keep a cross-court forehand in the play.
IN, Winner: Sharapova!!! Fine serve placed at T, Halep delivers a backhand return, couple of shots exchanged, Halep over-hits a backhand in the rally.
IN, Winner: Sharapova!!! Fine serve, Halep is unable to return it.

Dataset
Name             | Contents                  | Role
Annotated-action | 250 videos & phrases      | Classification & Training
Video-commentary | 710 videos & commentaries | Testing
Tennis Text      | 435 K commentary lines    | Dictionary Learning & Retrieval
Figure panels: (a) Annotated-action. (b) Video commentary.

Text Corpus
Source: Tennis Earth - http://www.tennisearth.com/

Action Localization: Player Detection
Player detection on test videos. Phrase recognition accuracy averaged over the top 5 retrievals.

Player Recognition
Cues:
• Jersey information
• Player’s gait
Descriptors:
• Colour: colour-based descriptors (MPEG-7 SCD, CLD)
• Edge: edge-based descriptor (MPEG-7 EHD)
• Colour and texture information (MPEG-7-like CEDD)

Weak Learners for Action Recognition
Level of semantics:
• Action: forehand, backhand, volley
• Activity: waits for ball, serves a good one, crafts a forehand return
Pipeline: Feature Extraction (Dense Trajectory), Encoding and Pooling (Bag of Words), Discriminative Classifier (Multiclass SVM)
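The encoding and pooling step can be illustrated with a minimal bag-of-words sketch. It assumes a codebook has already been learned (for example by k-means) and that each clip is a list of trajectory descriptors. The toy 2-D descriptors, the codebook values, and the function names are hypothetical and not taken from the slides. The multiclass SVM that would consume these histograms is not shown.

```python
import math
from collections import Counter

def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry closest (Euclidean) to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of codeword assignments for one clip."""
    counts = Counter(nearest_codeword(d, codebook) for d in descriptors)
    total = sum(counts.values()) or 1  # avoid division by zero on empty clips
    return [counts.get(k, 0) / total for k in range(len(codebook))]

# Toy data (assumption): three codewords, four descriptors from one clip.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bow_histogram(descriptors, codebook)
print(hist)
```

Each clip becomes a fixed-length vector regardless of how many trajectories it contains, which is what lets a standard classifier be trained on top.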

Improved Dense Trajectories as a feature vector!
Dense sampling in each spatial scale, feature tracking, trajectory-aligned descriptors.
- Capture the intrinsic dynamic structures in video
- MBH is robust to camera motion
- Detect human body to remove spurious trajectories

What about camera motion?
Separate models for upper and lower action!

We are already done with training!

MRF-based Temporal Smoothing
Terms: SVM score; pairwise phrase cohesion.
We test on tennis point videos.
Retrieval Module
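Temporal smoothing over per-window classifier scores can be sketched as Viterbi decoding on a chain. This is a generic chain model, assuming one score per label per window and a pairwise table of label-to-label cohesion. The exact energy used in the talk is not given on the slide, and the toy numbers and the name `smooth_labels` are hypothetical.

```python
def smooth_labels(unary, pairwise):
    """Label sequence maximising the sum of unary and pairwise scores.

    unary[t][j]    -- classifier score of label j at window t
    pairwise[i][j] -- cohesion score for label i followed by label j
    """
    n_labels = len(unary[0])
    best = list(unary[0])   # best score of any path ending in each label
    back = []               # back-pointers, one list per transition
    for scores in unary[1:]:
        step, new_best = [], []
        for j in range(n_labels):
            i_star = max(range(n_labels),
                         key=lambda i: best[i] + pairwise[i][j])
            step.append(i_star)
            new_best.append(best[i_star] + pairwise[i_star][j] + scores[j])
        back.append(step)
        best = new_best
    # Trace the best path backwards from the best final label.
    label = max(range(n_labels), key=lambda j: best[j])
    path = [label]
    for step in reversed(back):
        label = step[label]
        path.append(label)
    return path[::-1]

# Toy scores (assumption): the middle window weakly prefers label 1.
unary = [[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]
pairwise = [[0.5, -0.5], [-0.5, 0.5]]
print(smooth_labels(unary, pairwise))
```

With the pairwise table set to zeros the decoder reduces to a per-window argmax, so the isolated middle label survives; with the cohesion term it is smoothed away.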

How about a joint model for phrase classification?
- Semi-automatic process for phrase alignment.
- No manual shot sampling.
- No tiring action annotations.


Commentary Text
9 phrase encodings: (subject), (object), (subject; verb), (object; verb), (subject; prep; object), (object; prep; object), (attribute; subject), (attribute; object) and (verb; prep; object).
Commentary and generated phrases:
IN, Winner: Serena!!! Huge serve. Ace!!!
<winner Serena>, <huge serve>, <ace>
IN, Winner: Sharapova!!! Quick serve, Sharapova crafts a forehand return, Serena goes for a forehand down the line but catches the net.
<winner Sharapova>, <quick serve>, <Sharapova craft return>, <Serena catch net>, <Serena go>
IN, Winner: Zvonareva!!! Good serve in the middle, Williams returns a quick forehand return, short rally, Serena cross-court fails to clear the net in the middle.
<winner Zvonareva>, <quick return>, <short rally>, <Williams return>, <cross-court fail>

IN, Winner: Serena!!! Huge serve. Ace!!!
<winner Serena>, <huge serve>, <ace>
IN, Winner: Zvonareva!!! Good serve in the middle, Williams returns a quick forehand return, short rally, Serena cross-court fails to clear the net in the middle.
<winner Zvonareva>, <quick return>, <short rally>, <Williams return>, <cross-court fail>

Probabilistic Label Consistent KSVD
[Figure: a sliding window over a tennis point video yields the action trajectory matrix Y; the phrase label matrix H has rows PC-1, PC-2, PC-3, PC-4, ... (PC = phrase cluster); the outputs are the optimal dictionary and the sparse code.]
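For orientation, the standard (non-probabilistic) label consistent K-SVD objective is written below. It is the textbook form and not necessarily the exact probabilistic variant used in this work. Only Y and H appear on the slide; D (dictionary), X (sparse codes), Q (target discriminative codes), A, W, the weights α and β, and the sparsity level T are notation assumed here.

```latex
\begin{equation*}
\langle D, W, A, X \rangle
  = \arg\min_{D, W, A, X}
    \; \lVert Y - D X \rVert_F^2
    + \alpha \lVert Q - A X \rVert_F^2
    + \beta  \lVert H - W X \rVert_F^2
\quad \text{s.t.} \quad \forall i,\; \lVert x_i \rVert_0 \le T
\end{equation*}
```

The first term is reconstruction error, the second encourages signals with the same label to share sparse codes, and the third trains a linear classifier W on the codes jointly with the dictionary.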

For test videos,

Commentary generation
[Figure: retrieval pipeline. Online: Phrases + Players are mapped to a query representation [TF-IDF/LSI]. Offline: the Commentary Collection is mapped to a document representation [TF-IDF/LSI] and indexed. A comparison function matches the query against the index.]
Voila!
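The online/offline retrieval flow can be sketched with plain TF-IDF and cosine similarity. This is a minimal illustration, assuming whitespace tokenisation and a smoothed IDF; the three-line corpus reuses commentary from the slides, while the query terms and function names are hypothetical. The LSI option is not shown.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Offline step: sparse TF-IDF vectors plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()}
               for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_terms, docs):
    """Online step: index of the commentary closest to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf[t] for t, tf in Counter(query_terms).items() if t in idf}
    return max(range(len(docs)), key=lambda i: cosine(q, vectors[i]))

corpus = [
    "huge serve ace",
    "quick serve sharapova crafts a forehand return serena catches the net",
    "good serve in the middle williams returns a quick forehand return short rally",
]
docs = [line.split() for line in corpus]
best = retrieve(["short", "rally", "quick", "return"], docs)
print(corpus[best])
```

In a real system the vectors and IDF table would be computed once offline and stored in the index, rather than rebuilt on every query as in this sketch.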

LSI for better text retrieval
[Figure: TF-IDF, term based indexing (W1, W2, W3, ..., Wn), is mapped by SVD to LSI, latent concept based indexing (U1, U2, U3, ..., Uk), with n > k.]
• Map documents (and terms) to a low-dimensional representation.
• Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
• Compute document similarity based on the inner product in this latent semantic space.
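The mapping described above is usually written as a truncated SVD. The notation here is assumed rather than taken from the slide: X is the n × m TF-IDF term-document matrix, k < n is the number of latent concepts, d is a document (or query) column vector in term space, and d̂ is its latent representation.

```latex
\begin{align*}
X &\approx U_k \Sigma_k V_k^{\top},
  \qquad U_k \in \mathbb{R}^{n \times k},\;
         \Sigma_k \in \mathbb{R}^{k \times k},\;
         V_k \in \mathbb{R}^{m \times k} \\
\hat{d} &= \Sigma_k^{-1} U_k^{\top} d \\
\operatorname{sim}(q, d) &= \hat{q}^{\top} \hat{d}
\end{align*}
```

Queries are folded into the same k-dimensional space as the documents, so two texts can score as similar even when they share no exact terms.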

Illustration of the approach
Input sequence of videos is first translated into a set of phrases, which are then used to produce the final description.

Quantitative Comparisons
Methods compared: Template based; CNN + RNN; CCA + Semantic Correlation Matching; CCA + SSVM.

The premise is indeed true. Confined domain does help!

Qualitative Results

Qualitative Comparisons
Youtube2Text: S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell & K. Saenko. Youtube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013.
RNN: A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Human Evaluation

Contribution
Tennis dataset
• Over 1000 lawn tennis game and commentary aligned videos.
• 400 K unaligned commentary text lines.
Fine grained action recognition
• Joint model to assign video frames into appropriate phrase bins under weak label supervision.
• Probabilistic label consistent dictionary learning for sparse modelling and classification.
Content Analysis & Commentary Generation
• Player and Court Recognition: use of domain information for action localization and player identification.
• Use of action phrases for fine grained retrieval, which assists in creation of verbose commentary text.

< video slide >

Other Applications
1. Smart theatrics: Narration generation for dance dramas (Ballet, Kabuki, Kathak).
2. Sports: Other sporting events (Volleyball, Cricket, Baseball).

Possible Extensions!
Longer Text
More realistic and exhaustive game description. (Requires better topic modelling and retrieval methods.)
Data collection is a challenge: too much variation.

Ball Tracking
Tried simple Kalman filtering. How about RNNs? Will it actually help and add to content understanding?
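As a reference point for what "simple Kalman filtering" might look like, here is a minimal constant-velocity filter on one image coordinate. The motion model, the noise values `q` and `r`, and the name `kalman_track` are all assumptions; the slides do not say how the filter was configured.

```python
def kalman_track(measurements, dt=1.0, q=1e-3, r=0.25):
    """Constant-velocity Kalman filter over 1-D position measurements.

    State is (position, velocity); only position is observed.
    q is the process noise added to both variances (a simplification),
    r is the measurement noise variance.
    """
    p, v = measurements[0], 0.0
    P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
    estimates = []
    for z in measurements:
        # Predict: x <- F x, P <- F P F^T + Q with F = [[1, dt], [0, 1]].
        p = p + dt * v
        a, b, c, d = P[0][0], P[0][1], P[1][0], P[1][1]
        P = [[a + dt * (b + c) + dt * dt * d + q, b + dt * d],
             [c + dt * d, d + q]]
        # Update with observation matrix H = [1, 0].
        s = P[0][0] + r                      # innovation variance
        k0, k1 = P[0][0] / s, P[1][0] / s    # Kalman gain
        resid = z - p
        p, v = p + k0 * resid, v + k1 * resid
        P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
             [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
        estimates.append((p, v))
    return estimates

# Toy track (assumption): a ball moving one pixel per frame, no noise.
track = kalman_track([float(t) for t in range(20)])
final_position, final_velocity = track[-1]
print(final_position, final_velocity)
```

A tennis ball changes direction abruptly at bounces and hits, which a constant-velocity model cannot anticipate; that limitation is one reason learned sequence models are raised as an alternative.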

Related Publications
§ Mohak Sukhwani and C. V. Jawahar, TennisVid2Text: Fine-Grained Descriptions for Domain Specific Videos, Proceedings of the 26th British Machine Vision Conference (BMVC), 07-10 Sep 2015, Swansea, UK.
§ Mohak Sukhwani and C. V. Jawahar, Frame level Annotations for Tennis Videos, 23rd International Conference on Pattern Recognition, ICPR 2016 (Under Review)

< video slide – human evaluation >