Edge Templates Template-Based Tracking and Recognition CSE 4310 – Computer Vision Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1

Contours • A contour is a curve (typically not a straight line) that delineates the boundary of a region, or the boundary between regions. 2

Shapes Without Texture • Letters/numbers. • Contours. • Edge templates. 3

Detecting Shapes Without Texture • Normalized correlation does not work well. • Slight misalignments have a great impact on the correlation score. (figures: edge images “star 1” and “star 3”, and the two combined) 4

Chamfer Distance • For each edge pixel in star 1: – How far is it from the nearest edge pixel in star 3? • The average of all those distances is the directed chamfer distance from star 1 to star 3. 5

Chamfer Distance • For each edge pixel in star 3: – How far is it from the nearest edge pixel in star 1? • The average of all those distances is the directed chamfer distance from star 3 to star 1. 9

Directed Chamfer Distance • Input: two sets of points. – red, green. • Directed chamfer distance c(red, green): – Average distance from each red point to nearest green point. (figure: links connect each red point to the nearest green point.) 10

Directed Chamfer Distance • Input: two sets of points. – red, green. • Directed chamfer distance c(red, green): – Average distance from each red point to nearest green point. • Directed chamfer distance c(green, red): – Average distance from each green point to nearest red point. (figure: links connect each green point to the nearest red point.) 11

(Undirected) Chamfer Distance • Input: two sets of points. – red, green. • c(red, green): – Average distance from each red point to nearest green point. • c(green, red): – Average distance from each green point to nearest red point. • Chamfer distance C(red, green), also known as “undirected chamfer distance”: – C(red, green) = c(red, green) + c(green, red) – The undirected distance is the sum of the two directed distances. (figure: links connect each green point to the nearest red point.) 12
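
A minimal MATLAB sketch of these definitions, assuming the two point sets are given as N×2 and M×2 matrices of (row, column) coordinates and that pdist2 (Statistics and Machine Learning Toolbox) is available; variable names are illustrative:

D = pdist2(red, green);              % pairwise distances between all red/green points
c_red_green = mean(min(D, [], 2));   % directed: each red point to its nearest green point
c_green_red = mean(min(D, [], 1));   % directed: each green point to its nearest red point
C = c_red_green + c_green_red;       % undirected chamfer distance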

Chamfer Distance • On two stars: – 31 pixels are nonzero in both images. • On star and crescent: – 33 pixels are nonzero in both images. • Correlation scores can be misleading. 13

Chamfer Distance • Chamfer distance is much smaller between the two stars than between the star and the crescent. 14

Detecting Hands (figures: input image and template) • Problem: hands are highly deformable. • Normalized correlation does not work as well. • Alternative: use edges. 15

Detecting Hands (figures: window and template) • Compute the chamfer distance between the template and every window, at all scales. • Which version? Directed or undirected? • We want a small distance with the correct window, and a large distance with incorrect windows. 16

Direction Matters • Chamfer distance from window to template: problems? 17

Direction Matters • Chamfer distance from window to template: problems? • Clutter (edges not belonging to the hand) causes the distance to be high. 18

Direction Matters • Chamfer distance from template to window: problems? 19

Direction Matters • Chamfer distance from template to window: problems? • What happens when comparing to a window with lots of edges? 20

Direction Matters • Chamfer distance from template to window: problems? • What happens when comparing to a window with lots of edges? The score is low, because every template edge pixel finds some nearby edge, even if the window does not contain a hand. 21

Choice of Direction • For detection, we compute the chamfer distance from template to window. • Being robust to clutter is a big plus: it ensures the correct results will be included. • Incorrect detections can be discarded with additional checks. 22

Computing the Chamfer Distance • Compute the chamfer distance between the template and every window, at all scales. • This can be very time-consuming. 23

Distance Transform (figures: edge image e1 and its distance transform d1) • For every pixel, compute the distance to the nearest edge pixel. d1 = bwdist(e1); 24

Distance Transform (figures: template t1, edge image e1, distance transform d1) • If template t1 is of size (r, c): • The chamfer score with a window (i:(i+r-1), j:(j+c-1)) of e1 can be written as: 25

Distance Transform • If template t1 is of size (r, c): • The chamfer score with a window (i:(i+r-1), j:(j+c-1)) of e1 can be written as: window = d1(i:(i+r-1), j:(j+c-1)); sum(sum(t1 .* window)) – Dividing this sum by the number of nonzero template pixels gives the directed chamfer distance from template to window. • Computing the image of chamfer scores for one scale: 26

Distance Transform • Computing the image of chamfer scores for one scale s: resized = imresize(image, s, 'bilinear'); resized_edges = canny(resized, 7); resized_dt = bwdist(resized_edges); chamfer_scores = imfilter(resized_dt, t1, 'symmetric'); figure(3); imshow(chamfer_scores, []); • How long does that take? Can it be more efficient? 27
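
A minimal sketch of wrapping the per-scale computation above in a search over scales, keeping the best-scoring window. The variable names (gray_image, scales, t1) are illustrative, and the built-in edge(..., 'canny') is used here in place of the course's canny helper:

best_score = inf;
for s = scales                                   % e.g., scales = 0.5:0.1:1.5
    resized = imresize(gray_image, s, 'bilinear');
    resized_edges = edge(resized, 'canny');      % built-in Canny edge detector
    resized_dt = bwdist(resized_edges);
    chamfer_scores = imfilter(resized_dt, double(t1), 'symmetric');
    [m, idx] = min(chamfer_scores(:));           % best window at this scale
    if m < best_score
        best_score = m;
        [best_i, best_j] = ind2sub(size(chamfer_scores), idx);
        best_scale = s;
    end
end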

Improving Efficiency • Which parts of the template contribute to the score of each window? 28

Improving Efficiency • Which parts of the template contribute to the score of each window? • Just the nonzero parts. • How can we use that? 29

Improving Efficiency • Which parts of the template contribute to the score of each window? Just the nonzero parts. • How can we use that? • Compute a list of non-zero pixels in the template. • Consider only those pixels when computing the sum for each window. 30
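
A minimal sketch of this idea, assuming t1 is a binary edge template and d1 is the distance transform of the edge image (names follow the slides; the explicit double loop is for clarity only):

[rows, cols] = find(t1);                  % coordinates of the nonzero template pixels
[r, c] = size(t1);
[R, C] = size(d1);
scores = zeros(R - r + 1, C - c + 1);     % one score per top-left window position
for i = 1:(R - r + 1)
    for j = 1:(C - c + 1)
        % sum the distance-transform values at the template's edge locations
        idx = sub2ind([R, C], rows + i - 1, cols + j - 1);
        scores(i, j) = sum(d1(idx));
    end
end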

Results for Single Scale Search • What is causing the false result? 31

Results for Single Scale Search • What is causing the false result? – Window with lots of edges. • How can we refine these results? 32

Results for Single Scale Search • What is causing the false result? – A window with lots of edges. • How can we refine these results? – Skin color detection, or background subtraction. 33
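
One possible refinement, sketched with hypothetical names (skin_mask would come from a skin-color detector, and min_skin_fraction is an assumed threshold): keep a chamfer detection only if its window contains enough skin-colored pixels.

window_mask = skin_mask(top:top+r-1, left:left+c-1);   % skin pixels inside the detected window
if mean(window_mask(:)) >= min_skin_fraction            % e.g., 0.3 (assumed)
    % accept the detection; otherwise discard it
end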

What Is Tracking? 34

What Is Tracking? • We are given: – the state of one or more objects in the previous frame. • We want to estimate: – the state of those objects in the current frame. 35

What Is Tracking? • We are given: – the state of one or more objects in the previous frame. • We want to estimate: – the state of those objects in the current frame. • “State” can include: – Bounding box. This will be the default case in our class, unless we specify otherwise. – Velocity (2D vector: motion along the y axis per frame and motion along the x axis per frame). – Precise pixel-by-pixel shape. – Orientation, scale, 3D orientation, 3D position, … 36

Why Do We Care About Tracking? 37

Why Do We Care About Tracking? • Improves speed. – We do not have to run detection at all locations, all scales, all orientations. 38

Why Do We Care About Tracking? • Improves speed. – We do not have to run detection at all locations, all scales, all orientations. • Allows us to establish correspondences across frames. – Provides representations such as “the person moved left”, as opposed to “there is a person at (i1, j1) at frame 1, and there is a person at (i2, j2) at frame 2”. – Needed in order to recognize gestures, actions, activity. 39

Example Applications • Activity recognition/surveillance. – Figure out if people are coming out of a car, or loading a truck. • Gesture recognition. – Respond to commands given via gestures. – Recognize sign language. • Traffic monitoring. – Figure out if any car is approaching a traffic light. – Figure out if a street/highway is congested. • In all these cases, we must track objects across multiple frames. 40

Estimating Motion of a Block • What is a block? – A rectangular region in the image. – In other words, an image window that is specified by a bounding box. • Given a block at frame t, how can we figure out where the block moved to at frame t+1? 41

Estimating Motion of a Block • What is a block? – A rectangular region in the image. – In other words, an image window that is specified by a bounding box. • Given a block at frame t, how can we figure out where the block moved to at frame t+1? • Simplest method: normalized correlation. 42
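
A minimal sketch of this step, assuming grayscale frames and using MATLAB's normxcorr2; the variable names are illustrative:

% block: image window from the previous frame; next_frame: current frame.
scores = normxcorr2(double(block), double(next_frame));
[~, idx] = max(scores(:));                    % location of the best match
[peak_row, peak_col] = ind2sub(size(scores), idx);
% normxcorr2 returns a padded result; convert the peak to the
% top-left corner of the matching window in next_frame.
top  = peak_row - size(block, 1) + 1;
left = peak_col - size(block, 2) + 1;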

Main Loop for Block Tracking Input: block extracted from previous frame. 1. read current frame. 2. find best match of block in current frame (here we can search the entire image or just a region close to the location in the previous frame). 3. (optional) update description of block to match the appearance in the current frame. 4. advance frame counter. 5. goto 1. • What is missing to make this framework fully automatic? 43

Main Loop for Block Tracking Input: block extracted from previous frame. 1. read current frame. 2. find best match of block in current frame (here we can search the entire image or just a region close to the location in the previous frame). 3. (optional) update description of block to match the appearance in the current frame. 4. advance frame counter. 5. goto 1. • What is missing to make this framework fully automatic? – Detection/initialization: find the object, obtain an initial object description. 44

Main Loop for Block Tracking Input: block extracted from previous frame. 1. read current frame. 2. find best match of block in current frame (here we can search the entire image or just a region close to the location in the previous frame). 3. (optional) update description of block to match the appearance in the current frame. 4. advance frame counter. 5. goto 1. • Tracking methods ignore the initialization problem. • Any detection method can be used to address that problem. 45

Source of Efficiency Input: block extracted from previous frame. 1. read current frame. 2. find best match of block in current frame (here we can search the entire image or just a region close to the location in the previous frame). 3. (optional) update description of block to match the appearance in the current frame. 4. advance frame counter. 5. goto 1. • Why exactly is tracking more efficient than detection? In what lines of the pseudocode is efficiency improved? 46

Source of Efficiency Input: block extracted from previous frame. 1. read current frame. 2. find best match of block in current frame (here we can search the entire image or just a region close to the location in the previous frame). 3. (optional) update description of block to match the appearance in the current frame. 4. advance frame counter. 5. goto 1. • Why exactly is tracking more efficient than detection? In what lines of the pseudocode is efficiency improved? – Line 2. We search fewer locations/scales/orientations. 47

Updating Object Description Input: block extracted from previous frame. 1. read current frame. 2. find best match of block in current frame (here we can search the entire image or just a region close to the location in the previous frame). 3. (optional) update description of block to match the appearance in the current frame. 4. advance frame counter. 5. goto 1. • How can we update the block description in Step 3? – The simplest approach is to store the image subwindow corresponding to the bounding box that was found in step 2. 48
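
A minimal sketch of the whole loop, reusing the normalized-correlation matching step sketched earlier; read_frame, num_frames, and initial_block are illustrative names:

block = double(initial_block);                 % from a detector (initialization)
for t = 2:num_frames
    frame = double(read_frame(t));             % hypothetical frame reader
    scores = normxcorr2(block, frame);         % step 2: find the best match
    [~, idx] = max(scores(:));
    [pr, pc] = ind2sub(size(scores), idx);
    top  = pr - size(block, 1) + 1;            % top-left corner of the best window
    left = pc - size(block, 2) + 1;
    % step 3 (optional): update the block description; this risks drift
    block = frame(top:top+size(block,1)-1, left:left+size(block,2)-1);
end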

Drifting Input: block extracted from previous frame. 1. read current frame. 2. find best match of block in current frame (here we can search the entire image or just a region close to the location in the previous frame). 3. (optional) update description of block to match the appearance in the current frame. 4. advance frame counter. 5. goto 1. • The estimate can be off by a pixel or so at each frame. – Sometimes larger errors occur. • If we update the appearance, errors can accumulate. 49

Changing Appearance • Sometimes the appearance of an object changes from frame to frame. – Example: the left foot and right foot in the walkstraight sequence. • There is a fundamental dilemma between avoiding drift and updating the appearance of the object. – If we do not update the object description, at some point the description is not good enough. – If we update the object description at each frame, the slightest tracking error means that we update the description using the wrong bounding box, which can lead to more tracking errors in subsequent frames. 50

Occlusion • The object we track can temporarily be occluded (fully or partially) by other objects. • If appearance is updated at each frame, when the object is occluded it is unlikely to be found again. 51

Improving Tracking Stability • Check every match using a detector. – If we track a face, then the best match, in addition to having a good correlation score, should also have a good detection score under a general face detector. – If the face is occluded, the tracker can figure that out, because no face is detected. – When the face reappears, the detector will find it again. 52

Improving Tracking Stability • Remembering appearance history. – An object may have a small number of possible appearances. • The appearance of the head depends on the viewing angle. – If we remember each appearance, we can have less drifting. • When the current appearance is similar to a stored appearance, we do not need to make any updates. 53
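
One way to implement this, sketched under assumed names: gallery is a cell array of stored appearance windows, all the same size as current_window, and 0.9 is an illustrative novelty threshold.

sims = zeros(1, numel(gallery));
for k = 1:numel(gallery)
    % correlation coefficient between two same-size windows
    sims(k) = corr2(double(gallery{k}), double(current_window));
end
if max(sims) < 0.9                     % nothing similar is stored yet
    gallery{end+1} = current_window;   % remember the new appearance
end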

Improving Tracking Stability • Multiple hypothesis tracking. – Real-world systems almost always maintain multiple hypotheses. – This way, when the right answer is not clear (e.g., because of occlusions), the system does not have to commit to a single answer. – Instead, the system maintains a list of most likely correct answers, and it updates that list when it sees a new frame. 54

Recognition • By “recognition”, we refer to the problem of recognizing the type of object that we see in a specific image subwindow. • Typically, before we do “recognition”, we need to do detection to identify the image subwindow containing the object that we want to recognize. • For example: given a photograph, – First we do face detection, to find bounding boxes of faces. – Then we do face recognition, to identify the person that each face is associated with. • Another example: given a video, process each frame by: – Detecting the hand. – Recognizing the handshape of the hand. 55

Example: Face Recognition • First we do face detection, to find bounding boxes of faces. • Then we do face recognition, to identify the person that each face is associated with. (figures: input window, and training examples of faces of several individuals) 56

Example: Handshape Recognition • We may have an application that responds to a specific (and relatively small) number of handshapes. – This can be part of a game-playing interface. – Or, it can be an interface that allows the user to interact with some program. (figures: input window, and training examples of various handshapes) 57

Example: Digit Recognition • For example, this can be useful for recognizing ZIP codes in postal addresses. – First, detect the location of each digit of the ZIP code. – Then, recognize each digit. (figures: input window, and training examples of various digits) 58

Template-Based Recognition • In template-based recognition, we have a training set of templates that show examples of each class that we want to recognize. – For each training template, we also know the ground truth, i.e., the class that the template belongs to. (figures: input window, and training examples of various handshapes) 59

Template-Based Recognition • Given an input window to classify, we do nearest neighbor classification: – We measure the distance between the input window and each template. – We identify the most similar template (the nearest neighbor). – We use the class of the nearest neighbor as our prediction for the input window. (figures: input window, and training examples of various handshapes) 60

Template-Based Recognition • To do nearest neighbor classification, we must use a distance measure or similarity measure. • Examples of similarity measures: normalized cross-correlation. • Examples of distance measures: Euclidean distance, chamfer distance (directed or undirected). (figures: input window, and training examples of various handshapes) 61
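
A minimal sketch of nearest-neighbor classification with the directed chamfer distance (template to window), assuming every training template is a binary edge image of the same size as the input window; templates, labels, and window_edges are illustrative names:

dt = bwdist(window_edges);              % distance transform of the input window's edges
best_distance = inf;
predicted_class = NaN;
for k = 1:numel(templates)
    % average distance from template k's edge pixels to the window's edges
    d = mean(dt(templates{k} > 0));
    if d < best_distance
        best_distance = d;
        predicted_class = labels(k);
    end
end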

Detour: Machine Learning • We have seen how to use templates to solve some standard computer vision problems: – Detection. – Tracking. – Recognition. • Templates are a commonly used method, but there are lots of other methods for solving each of these problems. • A very common approach (or rather family of approaches) is to use machine learning methods. 62

Detour: Machine Learning • Machine learning is not the focus of this class. – CSE 4309 and CSE 6363 are classes that focus on machine learning. – Most of the methods covered in those classes can be applied to computer vision data. • We may look at one or two machine learning methods later in the semester, if we have time. – As an example, look at the “Rectangle Filters and AdaBoost” slides posted on the course website. 63

Example: Learning a Face Detector • How can we use a machine learning method (e.g., decision trees, neural networks, support vector machines) for face detection? • First, assemble a dataset. – Thousands of photographs that include faces, together with ground truth (the bounding box of each face in each photograph). – Thousands or millions of photographs that do not include faces. Photographs with no faces are easier to add to the training set, because they do not require us to mark the bounding box of faces. • Second, create a set of vectorized training examples. – Face examples are vectors that correspond to a face location. – Non-face examples are vectors that correspond to a non-face location. • Third, run your favorite machine learning method to learn a model. • Given an input image, use the model to classify each image window (possibly at multiple scales/orientations) as face or non-face. 64
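
A minimal sketch of the second and third steps, assuming equal-size grayscale training windows and a linear SVM via fitcsvm from the Statistics and Machine Learning Toolbox; all variable names are illustrative:

% face_windows, nonface_windows: cell arrays of equal-size grayscale windows.
n_pos = numel(face_windows);
n_neg = numel(nonface_windows);
d = numel(face_windows{1});                   % dimensionality of a vectorized window
X = zeros(n_pos + n_neg, d);
y = [ones(n_pos, 1); zeros(n_neg, 1)];        % 1 = face, 0 = non-face
for k = 1:n_pos
    X(k, :) = double(face_windows{k}(:))';    % one row vector per example
end
for k = 1:n_neg
    X(n_pos + k, :) = double(nonface_windows{k}(:))';
end
model = fitcsvm(X, y);                        % train an SVM classifier (linear by default)
% classify a new window of the same size
label = predict(model, double(new_window(:))');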