Machine Learning Methods for Human-Computer Interaction
Kerem Altun

Machine Learning Methods for Human-Computer Interaction
Kerem Altun
Postdoctoral Fellow, Department of Computer Science, University of British Columbia
IEEE Haptics Symposium, March 4, 2012, Vancouver, B.C., Canada

Machine learning
• pattern recognition: template matching, statistical pattern recognition, structural pattern recognition, neural networks
• statistical pattern recognition: supervised methods (including regression) and unsupervised methods

What is pattern recognition?
• the question even appears as a title in the International Association for Pattern Recognition (IAPR) newsletter
• many definitions exist
• simply: the process of labeling observations (x) with predefined categories (w)

Various applications of PR (figure) [Jain et al., 2000]

Supervised learning
• “tufa” – can you identify the other “tufa”s here?
• lifted from lecture notes by Josh Tenenbaum

Unsupervised learning
• how many categories are there? which image belongs to which category?
• lifted from lecture notes by Josh Tenenbaum

Pattern recognition in haptics/HCI [Altun et al., 2010a]
• human activity recognition
• body-worn inertial sensors: accelerometers and gyroscopes
• daily activities: sitting, standing, walking, stairs, etc.
• sports activities: walking/running, cycling, rowing, basketball, etc.

Pattern recognition in haptics/HCI [Altun et al., 2010a]
(figure: right-arm and left-arm accelerometer signals for walking vs. basketball)

Pattern recognition in haptics/HCI [Flagg et al., 2012]
• touch gesture recognition on a conductive fur patch

Pattern recognition in haptics/HCI [Flagg et al., 2012]
(figure: sensor signals for the stroke, scratch, and light touch gestures)

Other haptics/HCI applications?

Pattern recognition example [Duda et al., 2000]
• excellent example by Duda et al.: classifying incoming fish on a conveyor belt using a camera image
• two classes: sea bass and salmon

Pattern recognition example
• how to classify? what kind of information can distinguish these two species? (length, width, weight, etc.)
• suppose a fisherman tells us that salmon are usually shorter
• so, let's use length as a feature
• what to do to classify? capture image – find the fish in the image – measure length – make decision
• how to make the decision? how to find the threshold?

Pattern recognition example
(figure: histograms of length for the two classes) [Duda et al., 2000]

Pattern recognition example
• on average, salmon are shorter, but is this a good feature?
• let's try classifying according to the lightness of the fish scales

Pattern recognition example
(figure: histograms of lightness for the two classes) [Duda et al., 2000]

Pattern recognition example
• how to choose the threshold?

Pattern recognition example
• how to choose the threshold? minimize the probability of error
• sometimes we should also consider the costs of different errors:
  – salmon is more expensive
  – customers who order salmon but get sea bass instead will be angry
  – customers who order sea bass but occasionally get salmon instead will not be unhappy

Pattern recognition example
• we don't have to use just one feature: let's use lightness and width
• each point is a feature vector; the 2-D plane is the feature space
• a decision boundary separates the two classes in the feature space [Duda et al., 2000]

Pattern recognition example
• should we add as many features as we can?
  – do not use redundant features
  – consider noise in the measurements
• moreover, avoid adding too many features:
  – more features mean higher-dimensional feature vectors
  – it is difficult to work in high-dimensional spaces
  – this is called the curse of dimensionality; more on this later

Pattern recognition example
• how to choose the decision boundary? is this one better? (figure) [Duda et al., 2000]

Pattern recognition example
• how to choose the decision boundary? is this one better? (figure: another candidate boundary) [Duda et al., 2000]

Probability theory review
• a chance experiment, e.g., tossing a 6-sided die
  – 1, 2, 3, 4, 5, 6 are the possible outcomes
  – the set of all outcomes, Ω = {1, 2, 3, 4, 5, 6}, is the sample space
  – any subset of the sample space is an event; the event that the outcome is odd: A = {1, 3, 5}
• each event is assigned a number called the probability of the event: P(A)
• the assigned probabilities can be selected freely, as long as the Kolmogorov axioms are not violated

Probability axioms
• for any event, P(A) ≥ 0
• for the sample space, P(Ω) = 1
• for disjoint events, P(A ∪ B) = P(A) + P(B); the third axiom also covers countably many disjoint events
• die tossing – if all outcomes are equally likely, then for all i = 1…6 the probability of getting outcome i is 1/6

Conditional probability
• sometimes events occur and change the probabilities of other events
• example: ten coins in a bag
  – nine of them are fair coins – heads (H) and tails (T)
  – one of them is fake – both sides are heads (H)
• I randomly draw one coin from the bag, but I don't show it to you
  – H0: the coin is fake, both sides H
  – H1: the coin is fair – one side H, the other side T
• which of these events would you bet on?

Conditional probability
• suppose I flip the coin five times, obtaining the outcome HHHHH (five heads in a row); call this event F
  – H0: the coin is fake, both sides H
  – H1: the coin is fair – one side H, the other side T
• which of these events would you bet on now?

Conditional probability n definition: the conditional probability of event A given that event B

Conditional probability n definition: the conditional probability of event A given that event B has occurred: read as: "probability of A given B" n P(AB) is the probability of events A and B occurring together n Bayes’ theorem: IEEE Haptics Symposium 2012 30

Conditional probability
H0: the coin is fake, both sides H
H1: the coin is fair – one side H, the other side T
F: obtaining five heads in a row (HHHHH)
• we know that F occurred; we want to find P(H0|F) – difficult to compute directly – use Bayes' theorem:
  P(H0|F) = P(F|H0) P(H0) / P(F)
  – P(F|H0): probability of observing F if H0 were true
  – P(H0|F): posterior probability
  – P(H0): prior probability (before the observation F)
  – P(F): total probability of observing F, P(F) = P(F|H0) P(H0) + P(F|H1) P(H1)
• plugging in the values P(F|H0) = 1, P(H0) = 1/10, P(F|H1) = 1/32, P(H1) = 9/10:
  P(H0|F) = (1)(1/10) / [(1)(1/10) + (1/32)(9/10)] = 32/41
• which event would you bet on now?
• this is very similar to a pattern recognition problem!
• we can put the label "fake" on the coin based on our observations!
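A quick way to check this result is to redo the computation with exact arithmetic. The sketch below (Python, standard library only; not part of the original slides) reproduces the 32/41 posterior:

```python
# Coin example: posterior probability that the coin is fake given HHHHH.
# P(F|H0) = 1 (a fake coin always shows heads), P(F|H1) = (1/2)^5 = 1/32.
from fractions import Fraction

p_F_given_H0 = Fraction(1)
p_F_given_H1 = Fraction(1, 2) ** 5            # five independent fair flips
p_H0, p_H1 = Fraction(1, 10), Fraction(9, 10) # priors: 1 fake coin out of 10

p_F = p_F_given_H0 * p_H0 + p_F_given_H1 * p_H1  # total probability of F
p_H0_given_F = p_F_given_H0 * p_H0 / p_F         # Bayes' theorem

print(p_H0_given_F)  # 32/41 -- bet on the fake coin
```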

Bayesian inference
w0: the coin belongs to the "fake" class
w1: the coin belongs to the "fair" class
x: observation
• decide wi if its posterior probability P(wi|x) is higher than the others'
• this is called the MAP (maximum a posteriori) decision rule

Random variables
• we model the observations with random variables
• a random variable is a real number whose value depends on a chance experiment
• discrete random variable: the possible values form a discrete set
• continuous random variable: the possible values form a continuous set

Random variables
• a discrete random variable X is characterized by a probability mass function (pmf): p(x) = P(X = x)
• a pmf has two properties: p(x) ≥ 0, and Σx p(x) = 1

Random variables
• a continuous random variable X is characterized by a probability density function (pdf), denoted p(x), for all possible values
• probabilities are calculated for intervals: P(a ≤ X ≤ b) = ∫_a^b p(x) dx

Random variables
• a pdf also has two properties: p(x) ≥ 0, and ∫ p(x) dx = 1 (integrating over all x)

Expectation
• definition: E[X] = Σx x p(x) for discrete X, and E[X] = ∫ x p(x) dx for continuous X
• the average of the possible values of X, weighted by probabilities
• also called the expected value, or mean

Variance and standard deviation
• variance is the expected value of the deviation from the mean: Var(X) = E[(X − E[X])²]
• variance is always nonnegative; zero variance means X is not random
• standard deviation is the square root of the variance

Gaussian (normal) distribution
• possibly the most "natural" distribution, encountered frequently in nature
• central limit theorem: the sum of many i.i.d. random variables is asymptotically Gaussian
• definition: the random variable with pdf p(x) = (1/√(2πσ²)) exp(−(x − μ)² / (2σ²))
• two parameters: the mean μ and the variance σ²

Gaussian distribution
• it can be proved that: (figure showing the areas under the normal curve around the mean; lifted from http://assets.allbusiness.com)

Random vectors
• extension of the scalar case
  – pdf: p(x)
  – mean: μ = E[x]
  – covariance matrix: Σ = E[(x − μ)(x − μ)ᵀ]
• the covariance matrix is always symmetric and positive semidefinite

Multivariate Gaussian distribution
• probability density function: p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
• two parameters: μ and Σ
• compare with the univariate case: μ plays the role of the mean, Σ that of the variance σ²

Bivariate Gaussian exercise
• the scatter plots show 100 independent samples drawn from zero-mean Gaussian distributions with different covariance matrices
• match the covariance matrices with the scatter plots (a, b, c), by inspection only
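This exercise is easy to reproduce numerically. A minimal NumPy sketch, with hypothetical covariance matrices standing in for the ones shown on the slide:

```python
# Draw 100 samples from zero-mean bivariate Gaussians with different
# covariance structures and verify that the sample covariance matches.
import numpy as np

rng = np.random.default_rng(0)
covs = {                                   # hypothetical, for illustration
    "spherical":  np.array([[1.0, 0.0], [0.0, 1.0]]),
    "elongated":  np.array([[3.0, 0.0], [0.0, 0.3]]),
    "correlated": np.array([[1.0, 0.8], [0.8, 1.0]]),
}
for name, S in covs.items():
    x = rng.multivariate_normal(mean=[0.0, 0.0], cov=S, size=100)
    print(name, np.cov(x.T).round(2))      # sample covariance ~ S
```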

Bayesian decision theory
• Bayesian decision theory falls under the subjective interpretation of probability
• in the pattern recognition context, a prior belief about the class (category) of an observation is updated using the Bayes rule

Bayesian decision theory
• back to the fish example: say we have two classes (states of nature), w1 (sea bass) and w2 (salmon)
• let P(w1) be the prior probability that the fish is a sea bass, and P(w2) the prior probability that the fish is a salmon

Bayesian decision theory
• prior probabilities reflect our belief about which kind of fish to expect, before we observe it
• we can choose them according to the fishing location, time of year, etc.
• if we don't have any prior knowledge, we can choose equal priors (or uniform priors)

Bayesian decision theory
• let x be the feature vector obtained from our observations; it can include features like lightness, weight, length, etc.
• calculate the posterior probabilities P(w1|x) and P(w2|x)
• how to calculate them?

Bayesian decision theory
• p(x|wi) is called the class-conditional probability density function (CCPDF): the pdf of the observation x if the true class were wi
• the CCPDF is usually not known (e.g., it is impossible to know the pdf of the length of all sea bass in the world), but it can be estimated; more on this later
• for now, assume that the CCPDF is known: just substitute the observation x in p(x|wi)

Bayesian decision theory
• MAP rule (also called the minimum-error rule): decide w1 if P(w1|x) > P(w2|x); otherwise decide w2
• do we really have to calculate p(x)? (it is common to both posteriors, so comparing p(x|wi) P(wi) suffices)

Bayesian decision theory
• multiclass problems: the maximum a posteriori (MAP) decision rule decides the class wi that maximizes p(x|wi) P(wi)
• the MAP rule minimizes the error probability; this is the best performance that can be achieved (of course, if the CCPDFs are known)
• if the prior probabilities are equal, this reduces to the maximum likelihood (ML) decision rule: decide the class that maximizes p(x|wi)
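As an illustration only (not code from the tutorial), a minimal MAP classifier for classes with known univariate Gaussian CCPDFs; the means, standard deviations, and priors below are made-up stand-ins:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def map_decide(x, params, priors):
    """Return the index i maximizing p(x|w_i) * P(w_i)."""
    scores = [gauss_pdf(x, mu, s) * P for (mu, s), P in zip(params, priors)]
    return int(np.argmax(scores))

# hypothetical class-conditional densities, e.g. for fish lightness
params = [(4.0, 1.0), (7.0, 1.5)]   # (mean, std) for class w1, class w2
priors = [2 / 3, 1 / 3]             # assumed priors
print(map_decide(5.0, params, priors))
```

With equal priors the same function implements the ML rule, since the prior factors no longer affect the argmax.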

Exercise (single feature)
• given the class-conditional densities in the figure, find the maximum likelihood decision rule [Duda et al., 2000]

Exercise (single feature)
• for the same densities, find the MAP decision rule when the priors are as given on the slide [Duda et al., 2000]

Discriminant functions
• we can generalize this: let gi(x) be the discriminant function for the ith class
• decision rule: assign x to class i if gi(x) > gj(x) for all j ≠ i
• for the MAP rule: gi(x) = P(wi|x)

Discriminant functions
• the discriminant functions divide the feature space into decision regions that are separated by decision boundaries

Discriminant functions for Gaussian densities
• consider a multiclass problem (c classes) with Gaussian CCPDFs; discriminant functions: gi(x) = ln p(x|wi) + ln P(wi)
• it is easy to show analytically that the decision boundaries are hyperquadrics (conic sections if the feature space is 2-D)
• they become hyperplanes (lines in 2-D) if the covariance matrices are the same for all classes (degenerate case)

Examples
(figures: 2-D and 3-D cases with equal and spherical covariance matrices, and with equal covariance matrices) [Duda et al., 2000]

Examples
(figures: decision boundaries for further Gaussian cases) [Duda et al., 2000]

2-D example
• artificial data (figure) [Jain et al., 2000]

Density estimation
• but CCPDFs are usually unknown; that's why we need training data
• density estimation is either parametric (assume a class of densities, e.g., Gaussian, and find its parameters) or non-parametric (estimate the pdf directly, and numerically, from the training data)

Density estimation
• assume we have n samples of training vectors for a class: x1, …, xn
• we assume that these samples are independent and drawn from a certain probability distribution; this is called the generative approach

Parametric methods
• we will consider only the Gaussian case; underlying assumption: the samples are actually noise-corrupted versions of a single feature vector
• why Gaussian? three important properties:
  – completely specified by mean and variance
  – linear transformations remain Gaussian
  – central limit theorem: many phenomena encountered in reality are asymptotically Gaussian

Gaussian case
• assume x1, …, xn are drawn from a Gaussian distribution
• how to find the pdf? finding the mean and covariance is sufficient:
  – sample mean: μ̂ = (1/n) Σk xk
  – sample covariance: Σ̂ = (1/n) Σk (xk − μ̂)(xk − μ̂)ᵀ
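A minimal sketch of these two estimators in NumPy (illustrative, on synthetic data; the 1/n normalization follows the formulas above):

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic training vectors for one class (true parameters are known here)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)

mu_hat = X.mean(axis=0)            # sample mean
Xc = X - mu_hat
Sigma_hat = (Xc.T @ Xc) / len(X)   # sample covariance (1/n version, as above)

print(mu_hat.round(2), Sigma_hat.round(2))  # close to the true parameters
```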

2-D example
• back to the 2-D example: calculate the sample mean and covariance for each class, then apply the MAP rule

2-D example
• back to the 2-D example (figure)

2-D example
(figure: decision boundary with the true pdf vs. decision boundary with the estimated pdf)

Haptics example [Flagg et al., 2012]
• stroke, scratch, light touch: which feature should we use for discrimination?

Haptics example [Flagg et al., 2012]
• 7 participants performed each gesture 10 times: 210 samples in total
• we should find distinguishing features
• let's use one feature at a time: we assume the feature value is normally distributed and find its mean and variance

Haptics example
• assume equal priors and apply the ML rule

Haptics example
• apply the ML rule: what are the decision boundaries? (decision thresholds for 1-D)

Haptics example
• let's plot the 2-D distribution
• clearly this isn't a "good" classifier for this problem: the Gaussian assumption is not valid

Activity recognition example [Altun et al., 2010a]
• 4 participants (2 male, 2 female); activities: standing, ascending stairs, walking; 720 samples in total
• sensor: accelerometer on the right leg
• let's use the same features: minimum and maximum values

Activity recognition example
(figure: scatter plot of feature 2 vs. feature 1)

Activity recognition example
• the Gaussian assumption looks valid
• this is a "good" classifier for this problem

Activity recognition example
• decision boundaries (figure)

Haptics example
• how to solve the problem? either change the classifier, or change the features

Non-parametric methods
• let's estimate the CCPDF directly from samples; the simplest method is the histogram
• partition the feature space into (equally-sized) bins and count the samples in each bin:
  p(x) ≈ k / (nV)
  – k: number of samples in the bin that includes x
  – n: total number of samples
  – V: volume of the bin

Non-parametric methods
• how to choose the bin size?
• the number of bins increases exponentially with the dimension of the feature space
• we can do better than that!

Non-parametric methods
• compare the following density estimates: pdf estimates with six samples (image from http://en.wikipedia.org/wiki/Parzen_Windows)

Kernel density estimation
• a density estimate can be obtained as p̂(x) = (1/n) Σi K(x − xi; hn), where the kernel functions K are Gaussians centered at the samples xi
• K: Gaussian kernel; hn: width of the Gaussian

Kernel density estimation
• three different density estimates with different widths:
  – if the width is large, the pdf will be too smooth
  – if the width is small, the pdf will be too spiky
  – as the width approaches zero, the pdf converges to a sum of Dirac delta functions
[Duda et al., 2000]
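A from-scratch Gaussian KDE illustrating this width trade-off; the sample values below are hypothetical, not the ones in the figure:

```python
import numpy as np

def kde(x_grid, samples, h):
    """Gaussian kernel density estimate evaluated on x_grid."""
    # one Gaussian of width h centered at each sample, averaged over samples
    diffs = (x_grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])  # hypothetical samples
x = np.linspace(-6.0, 10.0, 400)
dx = x[1] - x[0]
for h in (0.3, 1.0, 3.0):          # small width: spiky; large width: oversmoothed
    p = kde(x, samples, h)
    print(h, (p * dx).sum().round(3))  # each estimate integrates to ~1
```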

KDE for the activity recognition data
(figures)

KDE for the gesture recognition data
(figure)

Other density estimation methods
• Gaussian mixture models: parametric; model the distribution as a sum of M Gaussians; optimization algorithm: expectation-maximization (EM)
• k-nearest neighbor estimation: non-parametric; variable width, fixed k

Another example (figure) [Aksoy, 2011]

Measuring classifier performance
• how do we know our classifiers will work?
• how do we measure the performance, i.e., decide that one classifier is better than another?
  – correct recognition rate
  – confusion matrix
• ideally, we should have more data, independent of the training set, to test the classifiers

Confusion matrix
(figure: confusion matrix for an 8-class problem) [Tunçel et al., 2009]

Measuring classifier performance
• using the training samples to test the classifiers is possible, but not good practice
• 100% correct classification rate for this example, because the classifier "memorized" the training samples instead of "learning" them [Duda et al., 2000]

Cross validation
• having a separate test data set might not be possible in some cases
• we can use cross validation: use some of the data for training, and the remaining data for testing
• how to divide the data?

Cross validation methods
• repeated random sub-sampling:
  – divide the data into two groups randomly (usually the training set is the larger one)
  – train and test, record the correct classification rate
  – repeat, and take the average

Cross validation methods
• K-fold cross validation (see the sketch below):
  – randomly divide the data into K sets
  – use K−1 sets for training, 1 set for testing
  – repeat K times, using a different set for testing at each fold
• leave-one-out cross validation:
  – use one sample for testing, and all the remaining samples for training
  – same as K-fold cross validation, with K equal to the total number of samples
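A sketch of K-fold cross validation in NumPy; `train_and_test` is a hypothetical user-supplied callback that trains a classifier on the training fold and returns the correct classification rate on the test fold:

```python
import numpy as np

def kfold_indices(n, K, rng):
    """Randomly split indices 0..n-1 into K (nearly) equal folds."""
    return np.array_split(rng.permutation(n), K)

def cross_validate(X, y, K, train_and_test, seed=0):
    """Average correct classification rate over K folds."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(X), K, rng)
    rates = []
    for k in range(K):
        test = folds[k]  # fold k is the test set; all others form training
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        rates.append(train_and_test(X[train], y[train], X[test], y[test]))
    return float(np.mean(rates))
```

Setting K equal to `len(X)` turns this into leave-one-out cross validation.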

Haptics example
• assume equal priors and apply the ML rule: 60.0% correct recognition rate
• the decision region for light touch is too small!

Haptics example
• apply the ML rule: 58.5% correct recognition rate

Haptics example
• correct recognition rates: 58.8% and 62.4%

Activity recognition example
• correct recognition rates: 75.8% and 71.9%

Activity recognition example
• correct recognition rate: 87.8%

Another cross-validation method
• used in HCI studies with multiple human subjects: subject-based leave-one-out cross validation
  – number of subjects: S
  – leave one subject's data out, train with the remaining data
  – repeat S times, each time testing with a different subject, then average
• gives an estimate of the expected correct recognition rate when a new user is encountered

Activity recognition example
                              minimum value   maximum value
K-fold                        75.8%           71.9%
subject-based leave-one-out   60.8%           61.6%

Activity recognition example
K-fold                        87.8%
subject-based leave-one-out   81.8%

Dimensionality reduction [Duda et al., 2000]
• for most problems, a few features are not enough
• adding features sometimes helps

Dimensionality reduction [Jain et al., 2000]
• should we add as many features as we can?
• what does this figure say?

Dimensionality reduction
• we should add features only up to a certain point: the more training samples, the farther away this point is
• more features = higher-dimensional spaces; in higher dimensions, we need more samples to estimate the parameters and the densities accurately
• the number of necessary training samples grows exponentially with the dimension of the feature space; this is called the curse of dimensionality

Dimensionality reduction
• how many features to use? rule of thumb: use at least ten times as many training samples as the number of features
• which features to use? difficult to know beforehand; one approach: consider many features and select among them

Pen input recognition (figure) [Willems, 2010]

Touch gesture recognition (figure) [Flagg et al., 2012]

Feature reduction and selection
• form a set of many features; some of them might be redundant
• feature reduction (sometimes called feature extraction): form linear or nonlinear combinations of the features; the features in the reduced set usually don't have physical meaning
• feature selection: select the most discriminative features from the set

Feature reduction
• we will only consider Principal Component Analysis (PCA)
• unsupervised method: we don't care about the class labels
• consider the distribution of all the feature vectors in the d-dimensional feature space
• PCA is the projection onto a lower-dimensional space that "best represents the data": get rid of unnecessary dimensions

Principal component analysis
• how to "best represent the data"? find the direction(s) in which the variance of the data is the largest

Principal component analysis
• find the covariance matrix Σ
• spectral decomposition: Σ = VΛVᵀ
  – eigenvalues: on the diagonal of Λ
  – eigenvectors: columns of V
• the covariance matrix is symmetric and positive semidefinite, so the eigenvalues are nonnegative and the eigenvectors are orthogonal

Principal component analysis
• put the eigenvalues in decreasing order; the corresponding eigenvectors show the principal directions, in which the variance of the data is largest
• say we want to keep m features only: project onto the space spanned by the first m eigenvectors (see the sketch below)
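The whole procedure is a few lines of NumPy. A sketch (illustrative, not the code used in the cited studies), using `eigh` since the covariance matrix is symmetric:

```python
import numpy as np

def pca_project(X, m):
    """Project the rows of X onto the first m principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / len(X)              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # symmetric matrix: use eigh
    order = np.argsort(eigvals)[::-1]         # decreasing eigenvalues
    V = eigvecs[:, order[:m]]                 # first m principal directions
    return Xc @ V, eigvals[order]

# synthetic 3-D data whose variance lives mostly in the first two axes
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 1.0, 0.1]), size=200)
Z, eigvals = pca_project(X, 2)
print(Z.shape, eigvals.round(2))   # (200, 2); eigenvalues near [5, 1, 0.1]
```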

Activity recognition example [Altun et al., 2010a]
• five sensor units (wrists, legs, chest); each unit has three accelerometers, three gyroscopes, and three magnetometers: 45 sensors in total
• computed 26 features from each sensor signal: mean, variance, min, max, Fourier transform, etc.
• 45 × 26 = 1170 features

Activity recognition example
• compute the covariance matrix, find its eigenvalues and eigenvectors
• plot the first 100 eigenvalues
• reduced the number of features to 30

Activity recognition example
(figure)

Activity recognition example
• what does the Bayesian decision making (BDM) result suggest?

Feature reduction
• ideally, this should be done for the training set only: estimate Σ from the training set, find the eigenvalues, eigenvectors, and the projection, then apply that projection to the test vectors
• for K-fold cross validation, this should be done K times: computationally expensive

Feature selection
• alternatively, we can select from our large feature set
• say we have d features and want to reduce their number to m
  – optimal way: evaluate all possibilities and choose the best one; not feasible except for small values of m and d
  – suboptimal methods: greedy search

Feature selection
• best individual features: evaluate all d features individually, select the best m features

Feature selection
• sequential forward selection (see the sketch below):
  – start with the empty set
  – evaluate all features one by one, select the best one, add it to the set
  – form pairs of features with this one and each of the remaining features; add the best one to the set
  – form triplets of features with these two and each of the remaining features; add the best one to the set
  – …
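The sketch referenced in the list above; `score` is a hypothetical figure of merit (e.g., cross-validated correct classification rate) supplied by the caller:

```python
import numpy as np

def sequential_forward_selection(X, y, m, score):
    """Greedily select m feature indices from the columns of X.
    score(X_sub, y) evaluates a candidate feature subset."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < m:
        # try adding each remaining feature; keep the one that scores best
        best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```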

Feature selection
• sequential backward selection:
  – start with the full feature set
  – evaluate by removing one feature at a time from the set, then remove the worst feature
  – continue the previous step with the current feature set
  – …

Feature selection
• plus p – take away r selection:
  – first enlarge the feature set by adding p features using sequential forward selection
  – then remove r features using sequential backward selection

Activity recognition example [Altun et al., 2010b]
• first 5 features selected by sequential forward selection vs. first 5 features selected by PCA
• SFS performs better than PCA for a few features; if 10–15 features are used, their performances become closer
• time-domain features and leg features are more discriminative

Activity recognition example
(figure) [Altun et al., 2010b]

Discriminative methods
• we talked about discriminant functions; for the MAP rule we used gi(x) = P(wi|x)
• discriminative methods try to find gi(x) directly from data

Linear discriminant functions
• consider a discriminant function that is a linear combination of the components of x: g(x) = wᵀx + w0
• for the two-class case, there is a single decision boundary: g(x) = 0

Linear discriminant functions
• for the multiclass case, there are options:
  – c two-class problems, separating each class wi from the others
  – considering the classes pairwise

Linear discriminant functions
(figures: distinguish one class from the others; consider the classes pairwise) [Duda et al., 2000]

Linear discriminant functions
• or, use the original definition: assign x to class i if gi(x) > gj(x) for all j ≠ i [Duda et al., 2000]

Nearest mean classifier
• find the means of the training vectors for each class
• assign a test vector y the class of the nearest mean
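A minimal NumPy sketch of the nearest mean classifier (illustrative; the toy data is made up):

```python
import numpy as np

class NearestMean:
    def fit(self, X, y):
        self.classes = np.unique(y)
        # one mean vector per class
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        return self

    def predict(self, Y):
        # distance from each test vector to each class mean
        d = np.linalg.norm(Y[:, None, :] - self.means[None, :, :], axis=2)
        return self.classes[d.argmin(axis=1)]

X = np.array([[0., 0.], [1., 0.], [5., 5.], [6., 5.]])
y = np.array([0, 0, 1, 1])
print(NearestMean().fit(X, y).predict(np.array([[0.5, 0.2], [5.5, 4.9]])))  # [0 1]
```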

2-D example
• artificial data (figure)

2-D example
• estimated parameters (figure: decision boundary with the true pdf vs. decision boundary with the nearest mean classifier)

Activity recognition example
(figure)

k-nearest neighbor method
• for a test vector y:
  – find the k closest training vectors
  – let ki be the number of training vectors belonging to class i among these k vectors; assign the class with the largest ki
• simplest case: k = 1 – just find the closest training vector and assign its class
• decision boundaries: a Voronoi tessellation of the space (a sketch of the decision rule follows below)
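The sketch referenced above: a minimal k-NN classifier in NumPy using Euclidean distance (illustrative; the toy data is made up):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training vectors."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(d)[:k]               # indices of the k closest samples
    votes = np.bincount(y_train[nearest])     # k_i for each class i
    return votes.argmax()

X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([0.4, 0.6]), k=3))   # 0
```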

1-nearest neighbor
• decision regions: this is called a Voronoi tessellation (figure) [Duda et al., 2000]

k-nearest neighbor
• test sample (circle), square class, triangle class
• note how the decision is different for k = 3 and k = 5 (figure from http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm)

k-nearest neighbor
• no training is needed, but the computation time for testing is high; many techniques exist to reduce the computational load
• alternatives exist for computing the distance: Manhattan distance (L1 norm), chessboard distance (L∞ norm)

Haptics example
K-fold                        63.3%
subject-based leave-one-out   59.0%

Activity recognition example
K-fold                        90.0%
subject-based leave-one-out   89.2%

Activity recognition example
• decision boundaries for k = 3 (figure)

Feature normalization
• especially when computing distances, the scales of the feature axes are important: features with large ranges may be weighted more
• feature normalization can be applied so that the ranges are similar

Feature normalization
• linear scaling: x' = (x − l) / (u − l), where l is the lowest and u is the largest value of the feature x
• normalization to zero mean and unit variance: x' = (x − m) / s, where m is the mean and s is the standard deviation of the feature x
• other methods exist

Feature normalization
• ideally, the parameters l, u, m, and s should be estimated from the training set only, and then used on the test vectors
• for K-fold cross validation, this should be done K times
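A sketch of this train-only protocol for zero-mean, unit-variance normalization (illustrative; toy data):

```python
import numpy as np

def fit_zscore(X_train):
    """Estimate normalization parameters from the training set only."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_zscore(X, m, s):
    return (X - m) / s

X_train = np.array([[1., 100.], [2., 200.], [3., 300.]])
X_test  = np.array([[2., 250.]])
m, s = fit_zscore(X_train)
# test vectors are normalized with the training-set m and s
print(apply_zscore(X_test, m, s))
```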

Discriminative methods
• another popular method is the binary decision tree:
  – start from the root node
  – proceed in the tree by setting thresholds on the feature values
  – proceed by sequentially answering questions like "is feature j less than threshold Tk?"

Activity recognition example
(figure)

Discriminative methods
• one very popular method is the support vector machine (SVM) classifier
• a linear classifier applicable to linearly separable data
• if the data is not linearly separable, it maps the data to a higher-dimensional space (usually a Hilbert space) [Aksoy, 2011]

Comparison for activity recognition
• 1170 features reduced to 30 by PCA
• 19 activities, 8 participants

References
• S. Aksoy, Pattern Recognition lecture notes, Bilkent University, Ankara, Turkey, 2011.
• A. Moore, Statistical Data Mining tutorials (http://www.autonlab.org/tutorials).
• J. Tenenbaum, The Cognitive Science of Intuitive Theories lecture notes, Massachusetts Institute of Technology, MA, USA, 2006 (accessed online: http://www.mit.edu/~jbt/9.iap/9.94.Tenenbaum.ppt).
• R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000.
• A. K. Jain, R. P. W. Duin, J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, January 2000.
• A. R. Webb, Statistical Pattern Recognition, 2nd ed., John Wiley & Sons, West Sussex, England, 2002.
• V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., Springer-Verlag New York, Inc., 2000.
• K. Altun, B. Barshan, O. Tunçel (2010a), "Comparative study on classifying human activities with miniature inertial/magnetic sensors," Pattern Recognition, 43(10):3605–3620, October 2010.
• K. Altun, B. Barshan (2010b), "Human activity recognition using inertial/magnetic sensor units," in Human Behavior Understanding, Lecture Notes in Computer Science, A. A. Salah et al. (eds.), vol. 6219, pp. 38–51, Springer, Berlin, Heidelberg, August 2010.
• A. Flagg, D. Tam, K. MacLean, R. Flagg, "Conductive fur sensing for a gesture-aware furry robot," Proceedings of the IEEE 2012 Haptics Symposium, March 4–7, 2012, Vancouver, B.C., Canada.
• O. Tunçel, K. Altun, B. Barshan, "Classifying human leg motions with uniaxial piezoelectric gyroscopes," Sensors, 9(11):8508–8546, November 2009.
• D. Willems, Interactive Maps – using the pen in human-computer interaction, Ph.D. thesis, Radboud University Nijmegen, Netherlands, 2010 (accessed online: http://www.donwillems.net/waaaa/InteractiveMaps_PhDThesis_DWillems.pdf).