Multi-person Articulated Tracking with Spatial and Temporal Embeddings

Sheng Jin, Wentao Liu, Wanli Ouyang, Chen Qian
SenseTime Research, Tsinghua University, The University of Sydney
CVPR 2019

Background
o What is the task of pose tracking? It is an extension of the task of pose estimation.
o Pose estimation has two families of approaches: 1) top-down and 2) bottom-up.

Background
o Pose estimation finds the body joints of each person in a single frame.
o Pose tracking also links each person's pose across consecutive frames (T, T+1).

Method
o The method proposed in this paper consists of two networks: SpatialNet and TemporalNet.

SpatialNet
o SpatialNet is built on a CNN.
o Heatmaps (for detecting body parts): one joint <-> one heatmap. For example, the right-hand heatmap contains the right hands of both Person-1 and Person-2, while the left hand has its own heatmap.
o J joints <-> J heatmaps, where J is the number of joints.
o For joint j, the ground truth is a Gaussian confidence heatmap. The loss compares the predicted heatmap with this ground truth.
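
As a rough illustration of the heatmap target, the sketch below builds a Gaussian confidence map for one joint type and scores a prediction against it. The function names, the `sigma` value, the per-pixel maximum used to merge people, and the plain mean squared error are all assumptions made for illustration; the slides do not preserve the exact definition or loss.

```python
import math

def gaussian_heatmap(height, width, centers, sigma=2.0):
    """Ground-truth map for ONE joint type. `centers` holds that joint's
    (x, y) location for every person in the image (assumed layout)."""
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Assumption: overlapping people are merged with a max.
                heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap

def mse_loss(pred, target):
    """Mean squared error over all pixels (an assumed loss form)."""
    total = sum((p - t) ** 2
                for pred_row, target_row in zip(pred, target)
                for p, t in zip(pred_row, target_row))
    return total / (len(target) * len(target[0]))

# Two people in the image, so the right-hand heatmap has two peaks.
right_hand = gaussian_heatmap(8, 8, centers=[(2, 2), (5, 6)])
blank_prediction = [[0.0] * 8 for _ in range(8)]
```

With J joints, this construction would be repeated J times, once per joint type, giving J target maps.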

SpatialNet
o KE (Keypoint Embedding, for grouping body parts). Goal: joints from the same person should have the same embedding; joints from different persons should have different embeddings.
o one joint <-> one heatmap <-> one KE map (one-to-one correspondence), so the number of KE maps is J.
o Same person (e.g., the right hand and left hand of one person): the embedding values should be as close as possible. The ground truth is the average embedding value of person k. Loss: pull the embedding values of all joints toward the person's average embedding value.
o Different persons (e.g., the right hands of two persons): the embedding values should be different. Loss: push the embeddings of different persons apart.
o KE is then used to judge whether a joint belongs to a given person, according to its embedding value.
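
The pull and push ideas for KE can be sketched as follows. This is a minimal illustration, assuming scalar embedding values and a Gaussian-shaped push penalty in the style of associative embedding; the helper names and `sigma` are hypothetical, and the exact loss on the slides is not preserved.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pull_loss(embeddings_per_person):
    """Average squared distance of every joint embedding to the mean
    embedding of the person it belongs to."""
    total, count = 0.0, 0
    for joint_embeddings in embeddings_per_person:
        reference = mean(joint_embeddings)  # the person's average embedding
        for e in joint_embeddings:
            total += (e - reference) ** 2
            count += 1
    return total / count

def push_loss(embeddings_per_person, sigma=1.0):
    """Penalty that is large when two persons' mean embeddings are close.
    The Gaussian form is an assumption, not taken from the slides."""
    refs = [mean(e) for e in embeddings_per_person]
    total, pairs = 0.0, 0
    for a in range(len(refs)):
        for b in range(a + 1, len(refs)):
            total += math.exp(-((refs[a] - refs[b]) ** 2) / (2.0 * sigma ** 2))
            pairs += 1
    return total / pairs if pairs else 0.0

# Each inner list holds the embedding values of one person's joints.
well_grouped = [[1.0, 1.1, 0.9], [5.0, 5.1, 4.9]]
mixed_up = [[1.0, 5.0, 3.0], [3.0, 1.0, 5.0]]
```

In this toy data, `well_grouped` gives low pull and push losses, while `mixed_up` is penalized by both terms because the two persons share the same mean embedding.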

SpatialNet
o SVF (Spatial Vector Fields, for grouping body parts): a dense offset map.
o At each location p, the SVF stores the relative displacement from the human center to the absolute location p, with one map for the x direction and one for the y direction.
o SVF is trained with its own loss term.
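
A small sketch of how such a dense offset map could be filled in for one person is given below. The sign convention (p minus center), the zero background, and the function names are assumptions for illustration only.

```python
def spatial_vector_field(height, width, person_pixels, center):
    """Return two maps (x direction, y direction). At every pixel p that
    belongs to the person, they hold the displacement between the person's
    center and p. Pixels outside the person stay at 0 (assumption)."""
    cx, cy = center
    svf_x = [[0.0] * width for _ in range(height)]
    svf_y = [[0.0] * width for _ in range(height)]
    for x, y in person_pixels:
        # Assumed sign convention: p minus center.
        svf_x[y][x] = float(x - cx)
        svf_y[y][x] = float(y - cy)
    return svf_x, svf_y

def recover_center(p, svf_x, svf_y):
    """Invert the offset at p to get back the person's center."""
    x, y = p
    return (x - svf_x[y][x], y - svf_y[y][x])

# One person with center (2, 3) and two body pixels.
svf_x, svf_y = spatial_vector_field(6, 6, [(4, 1), (1, 5)], center=(2, 3))
```

Because every body pixel points back to the same center, pixels whose recovered centers agree can be grouped into one person.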

SpatialNet
o Ordinal maps (for grouping body parts): six auxiliary tasks <-> six maps, e.g., left-to-right (l2r). They serve as auxiliary training for KE.
o As with KE, joints from the same person should have embedding values that are as close as possible.
o For l2r: if the k-th person is on the left of the k'-th person, then Ord = 1; otherwise Ord = -1.
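
The left-to-right rule can be written out directly. In this sketch, "on the left" is decided by comparing the x-coordinates of the person centers (with x growing to the right), and the diagonal of the pairwise table is set to 0; both choices are assumptions, as are the function names.

```python
def left_to_right_ordinal(x_k, x_k_prime):
    """Ord = 1 if person k is on the left of person k', otherwise -1."""
    return 1 if x_k < x_k_prime else -1

def ordinal_table(center_xs):
    """Pairwise Ord values for all persons. The diagonal (a person compared
    with itself) is set to 0 here, which is an assumption."""
    n = len(center_xs)
    return [[0 if k == kp else left_to_right_ordinal(center_xs[k], center_xs[kp])
             for kp in range(n)]
            for k in range(n)]

# x-coordinates of three person centers.
table = ordinal_table([120.0, 40.0, 300.0])
```

Here the second person (x = 40) is left of both others, so its row contains only positive entries off the diagonal.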

Pose Guided Grouping (PGG) Module
o Inputs: the SpatialNet outputs (heatmaps, KE, SVF, ordinal maps).
o The inputs are concatenated into a feature X'.
o A mask M is generated, and M • X' -> reshape -> X.
o Output: the final Spatial Instance Embedding (IE).
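
The masking and reshaping step can be illustrated with plain nested lists. The shapes used here (features as C x H x W, mask as H x W, output as C x (H*W)) are assumptions, since the dimensions on the slides are not preserved, and the helper names are hypothetical.

```python
def apply_mask(features, mask):
    """Element-wise product of an H x W mask with every channel of a
    C x H x W feature (shapes are assumed)."""
    height, width = len(mask), len(mask[0])
    return [[[mask[y][x] * channel[y][x] for x in range(width)]
             for y in range(height)]
            for channel in features]

def reshape_to_rows(masked):
    """Flatten each H x W channel into one row, giving C x (H*W)."""
    return [[value for row in channel for value in row] for channel in masked]

# Toy feature with C=2, H=2, W=2 and a mask that keeps two pixels.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
mask = [[1.0, 0.0],
        [0.0, 1.0]]
x = reshape_to_rows(apply_mask(features, mask))
```

The mask zeroes out locations that do not belong to the person of interest, so only that person's features contribute to what follows.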

TemporalNet
o TemporalNet is built on a CNN and works on frames t-1 and t.
o HE (Human Embedding): pulls the HE of the same instance closer and pushes apart the embeddings of different instances.
o Loss: HE is trained with a triplet loss.
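
For reference, a standard triplet loss looks like the sketch below. The Euclidean distance, the `margin` value, and the toy vectors are assumptions; the slides only state that HE is trained with a triplet loss.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss: zero once the negative is farther from the
    anchor than the positive by at least `margin` (margin is hypothetical)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# anchor and same_person: one person in frames t-1 and t; other_person: someone else.
anchor = [0.0, 0.0]
same_person = [0.1, 0.0]
other_person = [3.0, 4.0]

good = triplet_loss(anchor, same_person, other_person)
bad = triplet_loss(anchor, other_person, same_person)
```

The loss is zero when the same instance is already closer than the other instance by the margin, and positive when the roles are swapped.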

TemporalNet
o TIE (Temporal Instance Embedding): encodes the relative displacement from the human center in the (t-1)-th frame to the body parts in the t-th frame. In the reverse direction, it represents the offset from the body center in the current t-th frame to the body parts in the previous frame.
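
One way to picture how such temporal offsets could be used for association is sketched below. The offset convention (center in frame t-1 equals part location in frame t plus offset), the vote averaging, and the nearest-center matching are illustrative assumptions, not the matching procedure of the paper.

```python
import math

def predicted_previous_center(part_xy, offset_xy):
    # Assumed convention: center in frame t-1 = part in frame t + offset.
    return (part_xy[0] + offset_xy[0], part_xy[1] + offset_xy[1])

def match_to_previous(parts, offsets, previous_centers):
    """Average the centers pointed to by all parts of one person in frame t,
    then return the id of the nearest person center in frame t-1."""
    votes = [predicted_previous_center(p, o) for p, o in zip(parts, offsets)]
    mean_x = sum(v[0] for v in votes) / len(votes)
    mean_y = sum(v[1] for v in votes) / len(votes)
    return min(previous_centers,
               key=lambda pid: math.hypot(previous_centers[pid][0] - mean_x,
                                          previous_centers[pid][1] - mean_y))

# Two body parts of one person in frame t and their (hypothetical) offsets.
parts = [(10.0, 10.0), (12.0, 14.0)]
offsets = [(-4.0, 0.0), (-6.0, -4.0)]
previous_centers = {"person_a": (6.0, 11.0), "person_b": (30.0, 10.0)}
matched = match_to_previous(parts, offsets, previous_centers)
```

In this toy case both parts point to (6, 10), which is closest to `person_a`, so the pose in frame t would take over that identity.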