Multi-person Articulated Tracking with Spatial and Temporal Embeddings
Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian
SenseTime Research, Tsinghua University, The University of Sydney
CVPR 2019
Background
o What is the task of pose tracking? It is an extension of the task of pose estimation.
  Pose estimation: 1) Top-down 2) Bottom-up
  Pose tracking: poses are linked across frames T and T+1
Method
o The method proposed in this paper: SpatialNet and TemporalNet
SpatialNet (CNN)
o Heatmaps: one joint <-> one heatmap
  J joints <-> J heatmaps (the number of joints is J)
  Example: the right-hand heatmap contains the right hands of Person-1 and Person-2; the left hand has its own heatmap
  For joint j, the ground truth is the Gaussian confidence heatmap
  Define: [equation shown on slide]
  Loss: [equation shown on slide]
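The Gaussian ground-truth heatmap can be sketched as follows. This is only an illustration: the standard deviation `sigma`, the pixel-wise maximum over people, and the mean-squared-error loss are assumptions, since the slide's own definition and loss equations are not reproduced here.

```python
import math

def gaussian_heatmap(height, width, centers, sigma=2.0):
    """Ground-truth heatmap for one joint type.

    centers: list of (x, y) joint locations, one per visible person.
    sigma and the pixel-wise max over people are illustrative assumptions.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Keep the strongest response when several people overlap.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap

def l2_loss(pred, target):
    """Mean squared error between two heatmaps of equal size (assumed loss)."""
    total = 0.0
    count = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += (p - t) ** 2
            count += 1
    return total / count

# One heatmap for the right hand, with two people in the image.
gt = gaussian_heatmap(8, 8, centers=[(2, 3), (6, 5)])
```

With J joint types, the same function would be called J times to build the J target heatmaps.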
SpatialNet (CNN)
o Heatmaps: (for detecting body parts)
o KE (Keypoint Embedding): (for grouping body parts)
  Goal: joints from the same person get the same embedding; joints from different persons get different embeddings
  one joint <-> one heatmap <-> one KE map (one-to-one correspondence); the number of KE maps is J
  Same person (e.g. right hand and left hand): the embedding values should be as close as possible
    Ground truth is the average embedding value of person K
    Loss: pull the embedding values of all joints toward the person's average embedding value
  Different persons: the embedding values should be different
    Loss: [equation shown on slide]
  KE is used to judge whether joints belong to the same person (according to the embedding value)
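A minimal sketch of the pull and push idea for KE, assuming scalar embeddings sampled at ground-truth joint locations. The pull term follows the slide (each joint is pulled toward its person's average embedding); the hinge form of the push term and the `margin` parameter are assumptions, not the paper's stated loss.

```python
def ke_grouping_loss(embeddings, margin=1.0):
    """embeddings: list over persons; each entry is a list of scalar
    embedding values read at that person's joint locations.
    Returns (pull, push). The hinge push term and margin are assumptions.
    """
    means = [sum(person) / len(person) for person in embeddings]

    # Pull: each joint embedding toward its own person's mean embedding.
    pull_terms = [
        (value - mean) ** 2
        for person, mean in zip(embeddings, means)
        for value in person
    ]
    pull = sum(pull_terms) / len(pull_terms)

    # Push: mean embeddings of different persons should be at least
    # `margin` apart; closer pairs are penalized.
    push_terms = [
        max(0.0, margin - abs(means[a] - means[b])) ** 2
        for a in range(len(means))
        for b in range(a + 1, len(means))
    ]
    push = sum(push_terms) / len(push_terms) if push_terms else 0.0
    return pull, push

# Two well-separated persons: both terms vanish.
pull, push = ke_grouping_loss([[0.0, 0.0, 0.0], [2.0, 2.0]])
```

At inference time, joints whose embedding values are close would be assigned to the same person.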
SpatialNet (CNN)
o Heatmaps: (for detecting body parts)
o KE: (for grouping body parts)
o SVF (Spatial Vector Fields): (for grouping body parts)
  SVF is a dense offset map
  Relative displacement from the human center to its absolute location p, in the X direction and the Y direction
  Loss: [equation shown on slide]
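The SVF offsets can be illustrated with a small sketch. The sign convention (location minus center), the L1 loss, and the helper names are assumptions made for illustration; the slide's loss equation is not reproduced here.

```python
def svf_ground_truth(points, center):
    """Offsets (dx, dy) from the human center to each location p.

    The sign convention p - center is an assumption.
    """
    cx, cy = center
    return [(px - cx, py - cy) for px, py in points]

def l1_loss(pred, target):
    """Mean absolute offset error per point (assumed loss)."""
    total = sum(
        abs(p_d - t_d)
        for p, t in zip(pred, target)
        for p_d, t_d in zip(p, t)
    )
    return total / len(pred)

def vote_center(point, offset):
    """Recover the human center implied by a location and its offset."""
    return (point[0] - offset[0], point[1] - offset[1])

joints = [(10, 4), (14, 9)]          # hypothetical joint locations
offsets = svf_ground_truth(joints, center=(12, 6))
```

Joints that vote for nearby centers are likely to belong to the same person, which is how a dense offset map can help grouping.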
SpatialNet (CNN)
o Heatmaps: (for detecting body parts)
o KE: (for grouping body parts)
o SVF: (for grouping body parts)
o Ordinal maps: (for grouping body parts; auxiliary training for KE)
  six auxiliary tasks <-> six maps, e.g. left-to-right (l2r)
  Same as KE: for joints from the same person, the embedding values should be as close as possible
  If the k-th person is on the left of the k′-th person, then Ord = 1; otherwise Ord = −1.
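The left-to-right ordinal label can be written out directly. The sketch assumes each person is represented by a center (x, y) in image coordinates, so a smaller x means further left; the function names are hypothetical.

```python
def ord_left_to_right(center_k, center_k_prime):
    """Ord = 1 if person k is on the left of person k', otherwise -1."""
    return 1 if center_k[0] < center_k_prime[0] else -1

def pairwise_ord(centers):
    """Ordinal labels for every ordered pair of distinct persons."""
    return {
        (k, k_prime): ord_left_to_right(centers[k], centers[k_prime])
        for k in range(len(centers))
        for k_prime in range(len(centers))
        if k != k_prime
    }

labels = pairwise_ord([(5, 10), (20, 12), (12, 3)])
```

The other auxiliary tasks would use the same pattern with a different ordering criterion.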
Pose Guided Grouping (PGG) Module (CNN)
o Heatmaps: (for detecting body parts)
o KE: (for grouping body parts)
o SVF: (for grouping body parts)
o Ordinal maps: (for grouping body parts)
  Spatial Instance Embedding (IE)
  Concat( ) = X′ ∈ [dimensions shown on slide]
  Generate mask M
  M • X′ -> reshape -> X ∈ [dimensions shown on slide]
  out -> final IE
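The mask-and-reshape step can be illustrated as below. The slide does not preserve how M is generated or what the target shape is, so everything specific here is an assumption: the mask is taken to be a Gaussian similarity between a KE map and a reference embedding value (with a hypothetical `gamma`), and the reshape flattens the masked maps into one feature vector per pixel.

```python
import math

def generate_mask(ke_map, ref_embedding, gamma=1.0):
    """Soft mask: close to 1 where the embedding matches the reference.

    The Gaussian similarity and gamma are illustrative assumptions.
    """
    return [
        [math.exp(-gamma * (value - ref_embedding) ** 2) for value in row]
        for row in ke_map
    ]

def masked_features(channels, mask):
    """Apply M to the concatenated maps X' and flatten to (H*W) x C.

    channels: list of C maps, each H x W (the concatenation X').
    """
    height, width = len(mask), len(mask[0])
    return [
        [mask[y][x] * channel[y][x] for channel in channels]
        for y in range(height)
        for x in range(width)
    ]

ke_map = [[0.0, 3.0]]                       # 1 x 2 toy KE map
x_prime = [[[1.0, 1.0]], [[2.0, 2.0]]]      # two concatenated 1 x 2 maps
mask = generate_mask(ke_map, ref_embedding=0.0)
x = masked_features(x_prime, mask)
```

In this toy case the first pixel keeps its features and the second is almost fully suppressed, because its embedding is far from the reference.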
TemporalNet (CNN, IE), frames t-1 and t
o HE (Human Embedding): pulls the HE of the same instance closer and pushes apart the embeddings of different instances
  Loss: HE is trained with triplet loss
o TIE (Temporal Instance Embedding):
  encodes the relative displacement from the human center in the (t − 1)-th frame to body parts in the t-th frame
  represents the offset from the current t-th frame body center to body parts in the previous frame
  Loss: [equation shown on slide]
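For HE, the standard triplet loss can be sketched as follows. The squared Euclidean distance and the `margin` value are assumptions; the slide only states that HE is trained with triplet loss.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss.

    anchor and positive: embeddings of the same person in frames t-1 and t.
    negative: embedding of a different person.
    The margin value is an illustrative assumption.
    """
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )

# Same person stays close, other person is far: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0])
# Same person drifts away while the other person coincides: penalized.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.0])
```

Minimizing this loss pulls embeddings of the same instance together across frames and pushes different instances apart by at least the margin.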
Experiments
o ICCV'17 PoseTrack Challenge: Pose Estimation, Tracking
o Ablation Study
Comments
o Pros: High performance
o Cons: A combination of previous work; performance is reported only on the validation set
Thanks!