Sequence-to-Segments Networks for Segment Detection

Zijun Wei (1), Boyu Wang (1), Minh Hoai (1), Jianming Zhang (2), Xiaohui Shen (3), Zhe Lin (2), Radomír Měch (2), Dimitris Samaras (1)
zijwei@cs.stonybrook.edu
(1) Stony Brook University, (2) Adobe Research, (3) ByteDance AI Lab

Detecting Segments of Interest

Task: Given an input sequence, find the segments of interest.

Applications:
• Video summarization
• Video action proposals in untrimmed videos

Challenges:
• Global dependency: the interestingness of a segment also depends on the whole sequence
• Interdependency: segments are not independent of one another
• Efficiency: the segment search space increases exponentially

Our contributions

Sequence-to-Segments Network (S2N):
• an end-to-end network architecture for detecting segments in a sequence
Hungarian matching:
• customized for matching multiple predictions with ground truth
Earth Mover's Distance:
• models the segment localization loss
State-of-the-art performance:
• on both video summarization and video action proposal tasks

Problem Formulation

Input sequence:
Ground truth segments:
Output segments: start, end, score

S2N Model Overview

Overall:
• Global dependency: the sequence is represented by an encoding stage
• Interdependency: segments are decoded sequentially
• Efficiency: the network points to starting and ending positions directly (vs. sliding window)

SDU (Segment Detection Unit):
• GRU for state update
• Pointer Network modules for boundary localization
• MLP regression/classification for confidence prediction

The beginning and ending positions of the j-th segment are the encoder positions with the highest pointer score:

$b_j / d_j = \arg\max_i \, g(h_j, e_i)$, where $g(h_j, e_i) = v^\top \tanh(W_1 e_i + W_2 h_j)$

Training

S2N is trained end-to-end:
• localization loss: Earth Mover's Distance
Hungarian matching:
• matches G (ground truth) and S (proposals)

[Figure: Cross Entropy vs. EMD]
[Figure: Hungarian matching. A, B: ground truth segments; 1, 2, 3, 4: predicted segments]

Experimental Results: Action Proposals

Quantitative Results (THUMOS 14)
*Frequency: proposals per second

Experimental Results: Video Summarization

F1 score on the SumMe dataset

Qualitative Visualization

Future Directions
• Modify the GRU to record longer sequences (e.g. IndRNN)
• Explore other applications (EEG, NLP, etc.)
• Base S2N on a fully convolutional encoder-decoder (seqCNN)
• Apply S2N to action detection in untrimmed videos

Acknowledgements: This project was partially supported by NSF-CNS-1718014, NSF-IIS-1763981, NSF-IIS-1566248, the Partner University Fund, the SUNY 2020 Infrastructure Transportation Security Center, and a gift from Adobe.
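
Illustrative sketch (not the authors' implementation): the pointer score g(h_j, e_i) = v^T tanh(W_1 e_i + W_2 h_j) used for boundary localization, and a one-dimensional Earth Mover's Distance between a predicted boundary distribution and a ground-truth one, might be written as below. The function names, the softmax normalization of the scores, and the plain L1 form of the EMD (sum of absolute differences between cumulative distributions) are assumptions made for illustration; the poster does not specify which EMD variant is used or how the scores are normalized.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * xk for m, xk in zip(row, x)) for row in M]


def pointer_scores(h, E, W1, W2, v):
    """Return g(h, e_i) = v^T tanh(W1 e_i + W2 h) for every encoder state e_i in E."""
    w2h = matvec(W2, h)  # does not depend on i, so compute it once
    scores = []
    for e in E:
        w1e = matvec(W1, e)
        hidden = [math.tanh(a + b) for a, b in zip(w1e, w2h)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    return scores


def softmax(scores):
    """Turn raw scores into a distribution over positions (assumed normalization)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def emd_1d(p, q):
    """1-D EMD between two distributions over the same ordered positions.

    Computed as the sum of absolute differences of the cumulative sums
    (assumed L1 variant).
    """
    cum_p = cum_q = 0.0
    dist = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        dist += abs(cum_p - cum_q)
    return dist


# Toy example with hypothetical values: three encoder states of dimension 2.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.5, -0.5]                      # decoder state h_j
W1 = [[1.0, 0.0], [0.0, 1.0]]        # identity matrices, for readability only
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

scores = pointer_scores(h, E, W1, W2, v)
boundary = max(range(len(scores)), key=lambda i: scores[i])  # argmax over i
probs = softmax(scores)
target = [0.0, 0.0, 1.0]             # ground-truth boundary at position 2
loss = emd_1d(probs, target)
```

Unlike cross entropy, which looks only at the probability assigned to the correct position, this distance grows with how far the predicted mass lies from the true boundary: `emd_1d([1, 0, 0], [0, 0, 1])` is larger than `emd_1d([1, 0, 0], [0, 1, 0])`.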