Machine Learning for Signal Processing Regression and Prediction
- Slides: 89
Machine Learning for Signal Processing Regression and Prediction Class 14. 17 Oct 2012 Instructor: Bhiksha Raj 17 Oct 2013 11755/18797 1
Matrix Identities • The derivative of a scalar function w. r. t. a vector is a vector 17 Oct 2013 11755/18797 2
Matrix Identities • The derivative of a scalar function w. r. t. a vector is a vector • The derivative w. r. t. a matrix is a matrix 17 Oct 2013 11755/18797 3
Matrix Identities • The derivative of a vector function w. r. t. a vector is a matrix – Note transposition of order 17 Oct 2013 11755/18797 4
Derivatives , , Ux. V Nx 1 Ux. Vx. N • In general: Differentiating an Mx. N function by a Ux. V argument results in an Mx. Nx. Ux. V tensor derivative 17 Oct 2013 11755/18797 5
Matrix derivative identities X is a matrix, a is a vector. Solution may also be XT A is a matrix • Some basic linear and quadratic identities 17 Oct 2013 11755/18797 6
A Common Problem • Can you spot the glitches? 17 Oct 2013 11755/18797 7
How to fix this problem? • “Glitches” in audio – Must be detected – How? • Then what? • Glitches must be “fixed” – Delete the glitch • Results in a “hole” – Fill in the hole – How? 17 Oct 2013 11755/18797 8
Interpolation. . • “Extend” the curve on the left to “predict” the values in the “blank” region – Forward prediction • Extend the blue curve on the right leftwards to predict the blank region – Backward prediction • How? – Regression analysis. . 17 Oct 2013 11755/18797 9
Detecting the Glitch OK NOT OK • Regression-based reconstruction can be done anywhere • Reconstructed value will not match actual value • Large error of reconstruction identifies glitches 17 Oct 2013 11755/18797 10
What is a regression • Analyzing relationship between variables • Expressed in many forms • Wikipedia – Linear regression, Simple regression, Ordinary least squares, Polynomial regression, General linear model, Generalized linear model, Discrete choice, Logistic regression, Multinomial logit, Mixed logit, Probit, Multinomial probit, …. • Generally a tool to predict variables 17 Oct 2013 11755/18797 11
Regressions for prediction • y = f(x; Q) + e • Different possibilities – y is a scalar • y is real • y is categorical (classification) – y is a vector – x is a vector • x is a set of real valued variables • x is a set of categorical variables • x is a combination of the two – f(. ) is a linear or affine function – f(. ) is a non-linear function – f(. ) is a time-series model 17 Oct 2013 11755/18797 12
A linear regression Y X • Assumption: relationship between variables is linear – – A linear trend may be found relating x and y y = dependent variable x = explanatory variable Given x, y can be predicted as an affine function of x 17 Oct 2013 11755/18797 13
An imaginary regression. . • • http: //pages. cs. wisc. edu/~kovar/hall. html Check this shit out (Fig. 1). That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data possible. Now, let's look a bit more closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered. 17 Oct 2013 11755/18797 14
Linear Regressions • y = Ax + b + e – e = prediction error • Given a “training” set of {x, y} values: estimate A and b – y 1 = Ax 1 + b + e 1 – y 2 = Ax 2 + b + e 2 – y 3 = Ax 3 + b+ e 3 –… • If A and b are well estimated, prediction error will be small 17 Oct 2013 11755/18797 15
Linear Regression to a scalar y 1 = a. Tx 1 + b + e 1 y 2 = a. Tx 2 + b + e 2 y 3 = a. Tx 3 + b + e 3 n Define: • Rewrite 17 Oct 2013 11755/18797 16
Learning the parameters Assuming no error • Given training data: several x, y • Can define a “divergence”: D(y, ) – Measures how much differs from y – Ideally, if the model is accurate this should be small • Estimate A, b to minimize D(y, ) 17 Oct 2013 11755/18797 17
The prediction error as divergence y 1 = a. Tx 1 + b + e 1 y 2 = a. Tx 2 + b + e 2 y 3 = a. Tx 3 + b + e 3 • Define divergence as sum of the squared error in predicting y 17 Oct 2013 11755/18797 18
Prediction error as divergence • y = a. Tx + e – e = prediction error – Find the “slope” a such that the total squared length of the error lines is minimized 17 Oct 2013 11755/18797 19
Solving a linear regression • Minimize squared error • Differentiating w. r. t A and equating to 0 17 Oct 2013 11755/18797 20
Regression in multiple dimensions y 1 = ATx 1 + b + e 1 y 2 = ATx 2 + b + e 2 y 3 = ATx 3 + b + e 3 yi is a vector yij = jth component of vector yi ai = ith column of A • Also called multiple regression bj = jth component of • Equivalent of saying: yi 1 = a 1 Txi + b 1 + ei 1 yi 2 = a 2 Txi + b 2 + ei 2 yi 3 = a 3 Txi + b 3 + ei 3 yi = ATxi + b + ei • Fundamentally no different from N separate single regressions – But we can use the relationship between ys to our benefit 17 Oct 2013 11755/18797 21 b
Multiple Regression Dx 1 vector of ones • Differentiating and equating to 0 17 Oct 2013 11755/18797 22
A Different Perspective = + • y is a noisy reading of ATx • Error e is Gaussian • Estimate A from 17 Oct 2013 11755/18797 23
The Likelihood of the data • Probability of observing a specific y, given x, for a particular matrix A • Probability of collection: • Assuming IID for convenience (not necessary) 17 Oct 2013 11755/18797 24
A Maximum Likelihood Estimate • Maximizing the log probability is identical to minimizing the trace – Identical to the least squares solution 17 Oct 2013 11755/18797 25
Predicting an output • From a collection of training data, have learned A • Given x for a new instance, but not y, what is y? • Simple solution: 17 Oct 2013 11755/18797 26
Applying it to our problem • Prediction by regression • Forward regression • xt = a 1 xt-1+ a 2 xt-2…akxt-k+et • Backward regression • xt = b 1 xt+1+ b 2 xt+2…bkxt+k +et 17 Oct 2013 11755/18797 27
Applying it to our problem • Forward prediction 17 Oct 2013 11755/18797 28
Applying it to our problem • Backward prediction 17 Oct 2013 11755/18797 29
Finding the burst • At each time – Learn a “forward” predictor at – At each time, predict next sample xtest = Si at, kxt-k – Compute error: ferrt=|xt-xtest |2 – Learn a “backward” predict and compute backward error • berrt – Compute average prediction error over window, threshold 17 Oct 2013 11755/18797 30
Filling the hole • Learn “forward” predictor at left edge of “hole” – For each missing sample – At each time, predict next sample xtest = Si at, kxt-k • Use estimated samples if real samples are not available • Learn “backward” predictor at left edge of “hole” – For each missing sample – At each time, predict next sample xtest = Si bt, kxt+k • Use estimated samples if real samples are not available • Average forward and backward predictions 17 Oct 2013 11755/18797 31
Reconstruction zoom in Reconstruction area Next glitch Distorted signal Recovered signal Interpolation result 17 Oct 2013 Actual data 11755/18797 32
Incrementally learning the regression Requires knowledge of all (x, y) pairs • Can we learn A incrementally instead? – As data comes in? • The Widrow Hoff rule • Note the structure Scalar prediction version error – Can also be done in batch mode! 17 Oct 2013 11755/18797 33
Predicting a value • What are we doing exactly? – For the explanation we are assuming no “b” (X is 0 mean) – Explanation generalizes easily even otherwise n Let and n n 17 Oct 2013 Whitening x N-0. 5 C-0. 5 is the whitening matrix for x 11755/18797 34
Predicting a value • What are we doing exactly? 17 Oct 2013 11755/18797 35
Predicting a value • Given training instances (xi, yi) for i = 1. . N, estimate y for a new test instance of x with unknown y : • y is simply a weighted sum of the yi instances from the training data • The weight of any yi is simply the inner product between its corresponding xi and the new x – With due whitening and scaling. . 17 Oct 2013 11755/18797 36
What are we doing: A different perspective • Assumes XXT is invertible • What if it is not – Dimensionality of X is greater than number of observations? – Underdetermined • In this case XTX will generally be invertible 17 Oct 2013 11755/18797 37
High-dimensional regression • XTX is the “Gram Matrix” 17 Oct 2013 11755/18797 38
High-dimensional regression • Normalize Y by the inverse of the gram matrix • Working our way down. . 17 Oct 2013 11755/18797 39
Linear Regression in High-dimensional Spaces • Given training instances (xi, yi) for i = 1. . N, estimate y for a new test instance of x with unknown y : • y is simply a weighted sum of the normalized yi instances from the training data – The normalization is done via the Gram Matrix • The weight of any yi is simply the inner product between its corresponding xi and the new x 17 Oct 2013 11755/18797 40
Relationships are not always linear • How do we model these? • Multiple solutions 17 Oct 2013 11755/18797 41
Non-linear regression • y = Aj(x)+e n n Y = AF(X)+e Replace X with F(X) in earlier equations for solution 17 Oct 2013 11755/18797 42
Problem n n Y = AF(X)+e Replace X with F(X) in earlier equations for solution F(X) may be in a very high-dimensional space The high-dimensional space (or the transform F(X)) may be unknown. . 17 Oct 2013 11755/18797 43
The regression is in high dimensions • Linear regression: • High-dimensional regression 17 Oct 2013 11755/18797 44
Doing it with Kernels • High-dimensional regression with Kernels: • Regression in Kernel Hilbert Space. . 17 Oct 2013 11755/18797 45
A different way of finding nonlinear relationships: Locally linear regression • Previous discussion: Regression parameters are optimized over the entire training set • Minimize • Single global regression is estimated and applied to all future x • Alternative: Local regression • Learn a regression that is specific to x 17 Oct 2013 11755/18797 46
Being non-committal: Local Regression • Estimate the regression to be applied to any x using training instances near x • The resultant regression has the form – Note : this regression is specific to x • A separate regression must be learned for every x 17 Oct 2013 11755/18797 47
Local Regression • But what is d()? – For linear regression d() is an inner product • More generic form: Choose d() as a function of the distance between x and xj • If d() falls off rapidly with |x and xj| the “neighbhorhood” requirement can be relaxed 17 Oct 2013 11755/18797 48
Kernel Regression: d() = K() • Typical Kernel functions: Gaussian, Laplacian, other density functions – Must fall off rapidly with increasing distance between x and xj • Regression is local to every x : Local regression • Actually a non-parametric MAP estimator of y – But first. . MAP estimators. . 49
Map Estimators • MAP (Maximum A Posteriori): Find a “best guess” for y (statistically), given known x y = argmax Y P(Y|x) • ML (Maximum Likelihood): Find that value of y for which the statistical best guess of x would have been the observed x y = argmax Y P(x|Y) • MAP is simpler to visualize 17 Oct 2013 11755/18797 50
MAP estimation: Gaussian PDF Assume X and Y are jointly Gaussian The parameters of the F 1 Y Gaussian are learned from training data 17 Oct 2013 F 0 X 11755/18797 51
Learning the parameters of the Gaussian 17 Oct 2013 11755/18797 52
Learning the parameters of the Gaussian 17 Oct 2013 11755/18797 53
MAP estimation: Gaussian PDF Assume X and Y are jointly Gaussian The parameters of the F 1 Y Gaussian are learned from training data 17 Oct 2013 F 0 X 11755/18797 54
MAP Estimator for Gaussian RV Assume X and Y are jointly Gaussian Level set of Gaussian The parameters of the Gaussian are learned from training data X 0 Now we are given an X, but no Y What is Y? 11755/18797 55
MAP estimator for Gaussian RV x 0 17 Oct 2013 11755/18797 56
MAP estimation: Gaussian PDF F 1 Y 17 Oct 2013 X 11755/18797 57
MAP estimation: The Gaussian at a particular value of X F 1 17 Oct 2013 x 0 11755/18797 58
MAP estimation: The Gaussian at a particular value of X Most likely value F 1 17 Oct 2013 x 0 11755/18797 59
MAP Estimation of a Gaussian RV Y = argmaxy P(y| X) ? ? ? x 0 17 Oct 2013 11755/18797 60
MAP Estimation of a Gaussian RV x 0 17 Oct 2013 11755/18797 61
MAP Estimation of a Gaussian RV Y = argmaxy P(y| X) x 0 17 Oct 2013 11755/18797 62
So what is this value? • Clearly a line • Equation of Line: • Scalar version given; vector version is identical • Derivation? Later in the program a bit – Note the similarity to regression 17 Oct 2013 11755/18797 63
This is a multiple regression • This is the MAP estimate of y – y = argmax Y P(Y|x) • What about the ML estimate of y – argmax Y P(x|Y) • Note: Neither of these may be the regression line! – MAP estimation of y is the regression on Y for Gaussian RVs – But this is not the MAP estimation of the regression parameter 17 Oct 2013 11755/18797 64
Its also a minimum-mean-squared error estimate • General principle of MMSE estimation: – y is unknown, x is known – Must estimate it such that the expected squared error is minimized – Minimize above term 17 Oct 2013 11755/18797 65
Its also a minimum-mean-squared error estimate • Minimize error: • Differentiating and equating to 0: The MMSE estimate is the mean of the distribution 17 Oct 2013 11755/18797 66
For the Gaussian: MAP = MMSE Most likely value is also The MEAN value § Would be true of any symmetric distribution 17 Oct 2013 11755/18797 67
MMSE estimates for mixture distributions n n n Let P(y|x) be a mixture density The MMSE estimate of y is given by Just a weighted combination of the MMSE estimates from the component distributions 68
MMSE estimates from a Gaussian mixture n Let P(x, y) be a Gaussian Mixture n P(y|x) is also a Gaussian mixture 17 Oct 2013 11755/18797 69
MMSE estimates from a Gaussian mixture n Let P(y|x) is a Gaussian Mixture 17 Oct 2013 11755/18797 70
MMSE estimates from a Gaussian mixture n P(y|x) is a mixture Gaussian density n E[y|x] is also a mixture 17 Oct 2013 11755/18797 71
MMSE estimates from a Gaussian mixture n A mixture of estimates from individual Gaussians 17 Oct 2013 11755/18797 72
Voice Morphing • Align training recordings from both speakers – Cepstral vector sequence • Learn a GMM on joint vectors • Given speech from one speaker, find MMSE estimate of the other • Synthesize from cepstra 17 Oct 2013 11755/18797 73
MMSE with GMM: Voice Transformation - Festvox GMM transformation suite (Toda) awb bdl jmk slt 17 Oct 2013 11755/18797 74
A problem with regressions • ML fit is sensitive – Error is squared – Small variations in data large variations in weights – Outliers affect it adversely • Unstable – If dimension of X >= no. of instances • (XXT) is not invertible 17 Oct 2013 11755/18797 75
MAP estimation of weights a y=a. TX+e X e • Assume weights drawn from a Gaussian – P(a) = N(0, s 2 I) • Max. Likelihood estimate • Maximum a posteriori estimate 17 Oct 2013 11755/18797 76
MAP estimation of weights q q P(a) = N(0, s 2 I) Log P(a) = C – log s – 0. 5 s-2 ||a||2 • Similar to ML estimate with an additional term 17 Oct 2013 11755/18797 77
MAP estimate of weights • Equivalent to diagonal loading of correlation matrix – Improves condition number of correlation matrix • Can be inverted with greater stability – Will not affect the estimation from well-conditioned data – Also called Tikhonov Regularization • Dual form: Ridge regression • MAP estimate of weights – Not to be confused with MAP estimate of Y 17 Oct 2013 11755/18797 78
MAP estimate priors • Left: Gaussian Prior on W • Right: Laplacian Prior 17 Oct 2013 11755/18797 79
MAP estimation of weights with laplacian prior • Assume weights drawn from a Laplacian – P(a) = l-1 exp(-l-1|a|1) • Maximum a posteriori estimate • No closed form solution – Quadratic programming solution required • Non-trivial 17 Oct 2013 11755/18797 80
MAP estimation of weights with laplacian prior • Assume weights drawn from a Laplacian – P(a) = l-1 exp(-l-1|a|1) • Maximum a posteriori estimate –… • Identical to L 1 regularized least-squares estimation 17 Oct 2013 11755/18797 81
L 1 -regularized LSE • No closed form solution – Quadratic programming solutions required • Dual formulation subject to • “LASSO” – Least absolute shrinkage and selection operator 17 Oct 2013 11755/18797 82
LASSO Algorithms • Various convex optimization algorithms • LARS: Least angle regression • Pathwise coordinate descent. . • Matlab code available from web 17 Oct 2013 11755/18797 83
Regularized least squares Image Credit: Tibshirani • Regularization results in selection of suboptimal (in least-squares sense) solution – One of the loci outside center • Tikhonov regularization selects shortest solution • L 1 regularization selects sparsest solution 17 Oct 2013 11755/18797 84
LASSO and Compressive Sensing Y = X a • Given Y and X, estimate sparse W • LASSO: – X = explanatory variable – Y = dependent variable – a = weights of regression • CS: – X = measurement matrix – Y = measurement – a = data 17 Oct 2013 11755/18797 85
An interesting problem: Predicting War! • Economists measure a number of social indicators for countries weekly – Happiness index – Hunger index – Freedom index – Twitter records –… • Question: Will there be a revolution or war next week? 17 Oct 2013 11755/18797 86
An interesting problem: Predicting War! • Issues: – Dissatisfaction builds up – not an instantaneous phenomenon • Usually – War / rebellion build up much faster • Often in hours • Important to predict – Preparedness for security – Economic impact 17 Oct 2013 11755/18797 87
Predicting War O 1 O 2 O 3 O 4 W W S Given S S S O 5 O 6 O 7 O 8 W W S S wk 1 wk 2 wk 3 wk 4 wk 5 wk 6 wk 7 wk 8 – Sequence of economic indicators for each week – Sequence of unrest markers for each week • At the end of each week we know if war happened or not that week • Predict probability of unrest next week – This could be a new unrest or persistence of a current one 17 Oct 2013 11755/18797 88
Predicting Time Series • Need time-series models • HMMs – later in the course 17 Oct 2013 11755/18797 89
- On formalizing fairness in prediction with machine learning
- Machine learning algorithms for restaurants
- Inversion attacks
- Prediction of nba games based on machine learning methods
- Stealing machine learning models via prediction apis
- Ebk regression prediction
- Linear regression andrew ng
- Linear regression multiple features
- K nearest neighbors python
- Linear regression with multiple variables machine learning
- Simple multiple linear regression
- Linear regression vs multiple regression
- Logistic regression vs linear regression
- Logistic regression vs linear regression
- Weisfeiler-lehman neural machine for link prediction
- Inductive vs analytical learning
- Deductive learning vs inductive learning
- Lazy learning vs eager learning
- Baseband signal and bandpass signal
- Baseband signal and bandpass signal
- Product of two odd signal is
- Concept learning task in machine learning
- Analytical learning in machine learning
- Pac learning model in machine learning
- Machine learning t mitchell
- Instance based learning in machine learning
- Inductive learning machine learning
- First order rule learning in machine learning
- Deep learning vs machine learning
- Digital signal as a composite analog signal
- Cuadro comparativo de e-learning
- Support vector machine regression
- 인과성
- Parallel system tsample tclock
- Unfolding in vlsi signal processing
- Ladder structure realization of iir filter
- Types of signals
- Dr sujan ali
- Digital signal processing
- Signal processing toolbox matlab
- Digital signal processing
- Ica
- Signal processing for big data
- Financial signal processing
- Super audio cd
- Digital signal processing
- Open architecture radar
- Image processing
- Digital signal processing
- Digital signal processing
- Sysc
- Genomic signal processing
- What is dsp
- Properties of roc in dsp
- Digital image processing
- Signal processing solutions
- High-performance digital signal processing
- Precision analog signal processing
- Digital signal processing
- Point processing and neighbourhood processing
- Primary food processing examples
- Batch processing vs interactive processing
- Mealy or moore machine
- Chapter 10 energy work and simple machines answer key
- Top down vs bottom up processing
- Bottom up processing vs top down processing
- Bottom up processing
- What is point processing in digital image processing
- Histogram processing in digital image processing
- Parallel processing vs concurrent processing
- A generalization of unsharp masking is
- Image processing
- Morphological dilation
- Top down vs bottom up psychology
- Formuö
- Typiska novell drag
- Nationell inriktning för artificiell intelligens
- Returpilarna
- Shingelfrisyren
- En lathund för arbete med kontinuitetshantering
- Personalliggare bygg undantag
- Tidbok
- Sura för anatom
- Förklara densitet för barn
- Datorkunskap för nybörjare
- Boverket ka
- Att skriva debattartikel
- Delegerande ledarskap
- Nyckelkompetenser för livslångt lärande
- Påbyggnader för flakfordon