Optimal Learning Informs Tut ORials October 2008 Warren

Optimal Learning Informs Tut. ORials October, 2008 . Warren Powell Peter Frazier With research by Ilya Ryzhov Princeton University © 2008 B. Warren B. Powell © 2008 Warren Powell, Princeton University 1

Outline n Introduction © 2008 Warren B. Powell Slide 2

Applications n Sports » Who should be in the batting lineup for a baseball team? » What is the best group of five basketball players out of a team of 12 to be your starting lineup? » Who are the best four people to man the four-person boat for crew racing? » Who will perform the best in competition for your gymnastics team? © 2008 Warren B. Powell 3

Applications n Figure out Manhattan: » » » Walking Subway/walking Taxi Street bus Driving

Applications n Biomedical research » How do we find the best drug to cure cancer? » There are millions of combinations, with laboratory budgets that cannot test everything. » We need a method for sequencing experiments. © 2008 Warren B. Powell 5

Applications n Biosurveillance Deaths from vector-born diseases » What is the prevalence of drug-resistant TB, MRSA, HIV/AIDS, malaria…. , in the population? » How do we efficiently collect information about the state of disease around the world? » What are the best strategies for minimizing transmission? © 2008 Warren B. Powell 6

Applications n High technology National Ignition Facility » What is the best sensor to use to evaluate the status of optics for the National Ignition Facility? » When should lenses be inspected? » How often should an experiment be run to test a new hypothesis on the physics of fusion? © 2008 Warren B. Powell 7

Applications n Stochastic optimization » Stochastic search over surfaces that can only be measured with uncertainty » Simulation-optimization – What is the best set of parameters to produce the best manufacturing configuration? » Active learning – How do we choose which samples to collect for machine learning applications? » Exploration vs. exploitation in approximation dynamic programming – How do we decide which states to visit to balance our need to estimate the value of being in a state versus the reward from visiting a state? © 2008 Warren B. Powell 8

Introduction n Deterministic optimization » Find the choice with the highest reward (assumed known):

Introduction n Stochastic optimization » Now assume the reward you will earn is stochastic, drawn from a normal distribution. The reward is revealed after the choice is made. The winner! © 2008 Warren B. Powell 10

Introduction n Optimal learning » Now, you have a budget of 10 measurements to determine which of the 5 choices is best. You have an initial probability distribution for the reward that each will return, but you are willing to change your belief as you make choices. How should you sequence your measurements to produce the best answer in the end? • We might keep trying the option we think is best: • … but what if the third or fourth choice is actually the best? © 2008 Warren B. Powell 11

Introduction n Now assume we have five choices, with uncertainty in our belief about how well each one will perform. Imagine you can make a single measurement, after which you have to make a choice about which one is best. What would you do? 1 © 2008 Warren B. Powell 2 3 4 5 12

Introduction n Now assume we have five choices, with uncertainty in our belief about how well each one will perform. Imagine you can make a single measurement, after which you have to make a choice about which one is best. What would you do? No improvement 1 © 2008 Warren B. Powell 2 3 4 5 13

Introduction n Now assume we have five choices, with uncertainty in our belief about how well each one will perform. Imagine you can make a single measurement, after which you have to make a choice about which one is best. What would you do? New solution 1 2 3 4 5 The value of learning is that it may change your decision. © 2008 Warren B. Powell 14

Outline n Types of learning problems © 2008 Warren B. Powell Slide 15

Elements of a learning problem n Things we have to think about: » How do we make measurements? What is the nature of the measurement decision? » What is the effect of a measurement? How does it change our state of knowledge? » What do we do with the results of what we learn from a measurement? What is the nature of the implementation decision? » How do we evaluate how well we have done with the results of our measurement? » Do we learn as we go, or are we able to make a series of measurements before solving a problem? © 2008 Warren B. Powell 16

Elements of a learning problem n Types of measurement decisions » • Stopping problems – observe until you have to make a decision, such as selling an asset. » • Finite (and not too big) set of choices » • Subset selection – What is the best group of people for a sports team – What is the best subset of energy saving technologies for a building » • What is the best price, density, temperature, speed » • Linear, nonlinear and integer programming © 2008 Warren B. Powell 17

Elements of a learning problem n Optimal learning » Now assume that you do not know the distribution of the reward, although you have an estimate (a “prior”). » After you make your choice, you observe the actual reward which changes your belief about the distribution of rewards. Observation © 2008 Warren B. Powell 18

Elements of a learning problem n Updating the distribution » Frequentist view • Assume we start with observations: • Statistics: • Frequentist interpretation: – and are random variables reflecting the randomness in the observations of © 2008 Warren B. Powell 19

Elements of a learning problem n Updating the distribution » Bayesian view • We assume we start with a distribution of belief about the true mean • Next we observe , which we assume comes from a distribution with variance (we assume the variance is known). • Using Bayes theorem, we can show that our new distribution of belief about the true mean is normally distributed with mean and variance. We first define the precision of a distribution as the inverse variance: – • The updating formulas are © 2008 Warren B. Powell 20

Elements of a learning problem n Frequentist vs. Bayesian » For optimal learning applications, we are generally in the situation where we have some knowledge about our choices, and we have to decide which one to measure to improve our final decision. » The state of knowledge: • Frequentist view: • Bayesian view: » For the remainder of our talk, we will adopt a Bayesian view since it allows us to introduce prior knowledge, a common property of learning problems. © 2008 Warren B. Powell 21

Elements of a learning problem n Relationships between beliefs and measurements » Beliefs • Uncorrelated – What we know about one choice tells us nothing about what we know about another choice • Correlated – If our belief of one choice is high, our belief about another choice might be higher » Measurement noise • Uncorrelated - If we were to make two measurements at the same time, the measurements are independent. • Correlated: – At a point in time – Simultaneous measurements are correlated. – Over time – Measurements of different choices may or may not be correlated, but measurements of the same choice at different points in time are correlated. © 2008 Warren B. Powell 22

Elements of a learning problem n Types of learning probems » On-line learning • Learn as you earn • Give example problems – Finding the best path to work – What is the best set of energy-saving technologies to use for your building? – What is the best medication to control your diabetes? » Off-line learning • There is a phase of information collection with a finite (sometimes small) budget. • You are allowed to make a series of measurements, after which you make an implementation decision. • Examples: – Finding the best drug compound through laboratory experiments – Finding the best design of a manufacturing configuration or engineering design which is evaluated using an expensive simulation. – What is the best combination of designs for hydrogen production, storage and conversion. © 2008 Warren B. Powell 23

Elements of a learning problem n Measuring the benefits of knowledge: » Minimizing/maximizing a cost or reward • Minimizing expected cost/maximizing reward or utility • Minimizing expected opportunity cost (minimizing the gap from the best possible) • Collecting information to produce a better solution to an optimization problem. » Making the right choice • Maximizing the probability of making the correct selection • Indifference zone selection – maximizing the probability of collecting a choice whose performance is within of the optimal. » Statistical measures • Minimizing a measure (square, absolute value) of the distance between observations and a predictive function (classical estimation) • Minimizing a metric (e. g. Kullback-Leibler divergence) measuring the distance between actual and predicted probability distributions. • Minimizing entropy (or entropic loss) © 2008 Warren B. Powell 24

Outline n Measurement policies © 2008 Warren B. Powell Slide 25

Measurement policies n What do we know? » The real average path times: • • Path 1 Path 2 Path 3 Path 4 » Mean time 20 minutes 22 minutes 24 minutes 26 minutes • Errors are +/- 10 minutes » What we think: • • Path 1 Path 2 Path 3 Path 4 25 minutes 24 minutes 22 minutes 20 minutes » We act by choosing the path that we “think” is the best. The only way we learn anything new is by choosing a path. © 2008 Warren B. Powell 26

Measurement policies n Illustration of calculations: © 2008 Warren B. Powell 27

Measurement policies © 2008 Warren B. Powell 28

Measurement policies © 2008 Warren B. Powell 29

Measurement policies © 2008 Warren B. Powell 30

Measurement policies © 2008 Warren B. Powell 31

Measurement policies n For problems with a finite number of alternatives » On-line learning (learn as you earn) • This is known in the literature as the multi-armed bandit problem, where you are trying to find the slot machine with the highest payoff. • It is necessary to trade off what you think you will earn with each decision, against the value of the information you will gain that might improve decisions in the future. » Off-line learning • You have a budget for taking measurements. After your budget is exhausted, you have to make a final choice. • This is known as the ranking and selection problem. © 2008 Warren B. Powell 32

Measurement policies n Elements of a measurement policy: » Deterministic or sequential • Deterministic policy - you decide what you are going to measure in advance. • Sequential policy – Future measurements depend on past observations. » Designing a measurement policy • We have to strike a balance between the value of a good measurement policy and the cost of computing it • If we are drilling oil exploration holes, we might be willing to spend a day on the computer deciding what to do next • We may need a trivial calculation if we are guiding an algorithm that will perform thousands of iterations. » Evaluating a policy • The goal is to find a policy that gets us close enough to the truth that we make the optimal (or near-optimal) decisions • To do this, we have to assume a truth, and then use a policy to try to guess at the truth. © 2008 Warren B. Powell 33

Measurement policies n Finding an optimal policy » Dynamic programming formulation • Let be the “state of knowledge” – E. g. if we have 10 choices, each with a mean and variance, our state would be • An optimal learning policy is characterized by Bellman’s equation: » Computational challenges • State variable has 20 dimensions, each is continuous. • Solving this is impossible (and this is a simple problem!) © 2008 Warren B. Powell 34

Measurement policies n Special case: on-line learning with independent beliefs » Multi-armed bandit problem – Which slot machine should I try next to maximize total expected rewards? » Breakthrough (Gittins and Jones, 1974) • • Do not need to solve the high-dimensional dynamic program Compute a single index (the “Gittins index”) for each slot machine Try the slot machine with the largest index For normally distributed rewards, the index looks like: » Notes Standard deviation of measurement Gittins index mean zero, variance 1 x Current estimate of for the reward from machine • Yao (2006) and Brezzi and Lai (2002) provide analytical approximation for • Despite extensive literature on index policies, range of applications is fairly limited. © 2008 Warren B. Powell 35

Measurement policies n Heuristic measurement policies » Pure exploitation – Always make the choice that appears to be the best. » Pure exploration – Make choices at random so that you are always learning more, but without regard to the cost of the decision. » Hybrid • Explore with probability and exploit with probability • Epsilon-greedy exploration – explore with probability to zero as , but not too quickly. . Goes » Boltzmann exploration • Explore choice x with probability » Interval estimation (upper confidence bounding) • Choose x which maximizes © 2008 Warren B. Powell 36

Measurement policies n Approximate policies for off-line learning » Optimal computing budget allocation (Chen et al) • Formulates the problem of allocating a set of observations as an optimization problem subject to a budget constraint » LL(s) – Batch linear loss (Chick et al) » Maximizing the expected value of a single measurement • (R 1, …, R 1) Gupta and Miescke (1996) • EVI (Chick, Branke and Schmidt, under review) • “Knowledge gradient” (Frazier and Powell, 2008) © 2008 Warren B. Powell 37

Measurement policies n Evaluating measurement policies » How do we compare one measurement policy

Measurement policies n Illustration » Setup: • Option 1 is worth 15 • Remaining 999 options are worth 10 • Standard deviation of a measurement is 5 » Policy 1: • Measure each option 10 times » Policy 2: • Measure remaining 999 options once. • Measure first option 9, 001 times » Which measurement policy produces the best result? © 2008 Warren B. Powell 39

Measurement policies n Measuring each alternative 10 times Best choice © 2008 Warren B.

Measurement policies n Measuring option 1 9, 001 times, and everything else once. Lucky

Measurement policies n What did we find? » Although option 1 is best, we will almost always identify some other option as being better, just through randomness. This method rewards collecting too little information. n A better way: » Assume a truth for each x. We do this by choosing a sample realization of a truth from a prior probability distribution for the mean. » Given this truth, apply policy to produce statistical estimates given by. Let be the best solution based on these estimates. Repeat this n times and evaluate the policy using » Note: This must be done with realistic (but not real) data. © 2008 Warren B. Powell 42

Outline n The knowledge gradient policy © 2008 Warren B. Powell Slide 43

The knowledge gradient n Basic principle: » Assume you can make only one measurement, after which you have to make a final choice (the implementation decision). » What choice would you make now to maximize the expected value of the implementation decision? Change which produces a change in the decision. Change in estimate of value of option 5 due to measurement. 1 © 2008 Warren B. Powell 2 3 4 5 44

The knowledge gradient n General model » Off-line learning – We have a measurement budget of N observations. After we do our measurements, we have to make an implementation decision. » Notation: © 2008 Warren B. Powell 45

The knowledge gradient n The knowledge gradient » The knowledge gradient is the expected value of a single measurement x, given by Marginal. Expectation value of optimization measuring x. Updated (the knowledge gradient) Optimization given what we New problem over different measurement outcomes Implementation decision knowledge stateproblem given measurement x know Knowledge state » The challenge is a computational one: how do we compute the expectation? © 2008 Warren B. Powell 46

The knowledge gradient n Derivation » Notation » We update the precision using »

The knowledge gradient n Derivation » The change in variance can be found to

The knowledge gradient n Knowledge gradient 1 © 2008 Warren B. Powell 2 3

The knowledge gradient n The knowledge gradient policy n Properties » Effectively a myopic policy, but also similar to steepest ascent for nonlinear programming. » The best single measurement you can make (by construction) » Asymptotically optimal (more difficult proof). As the measurement budget grows, we get the optimal solution. » The knowledge gradient policy is the only stationary policy with this behavior. • Many policies are asymptotically optimal (e. g. pure exploration, hybrid exploration/exploitation, epsilon-greedy), but are not myopically optimal. © 2008 Warren B. Powell 50

The knowledge gradient Current estimate of of value standard of a decision deviation Value

The knowledge gradient © 2008 Warren B. Powell 52

The knowledge gradient © 2008 Warren B. Powell 53

The knowledge gradient n Experimental comparisons: » KG vs: • Boltzmann • Interval estimation • Equal allocation KG - Boltzmann KG – Interval est. KG – Equal alloc KG – OCBA KG – Pure exploit © 2008 Warren B. Powell • OCBA • Pure exploitation • Linear loss KG – LL(S) 54

The knowledge gradient n Notes: » KG slightly outperforms Interval Estimation (IE), OCBA, and LL(S), and is easier to compute than OCBA and LL(S). » KG is fairly easy to compute for independent, normally distributed rewards. » But KG is a general concept which generalizes to other important problem classes: • • On-line applications Correlated beliefs Correlated measurements (e. g. Common Random Numbers) … more general optimization problems © 2008 Warren B. Powell 55

Outline n The knowledge gradient for on-line applications Ø Research by Ilya Ryzhov, Ph.

KG for on-line learning problems n Setting » Assume we have a budget of N measurements and that we have made n measurements. » As before, let be the value of measuring x after n measurements. » Proposal: Assume that each of the remaining decisions is improved by. So an estimate of the value of measuring x is given by What we think we will earn now The value of the information now on future decisions » For infinite horizon problems: © 2008 Warren B. Powell 57

KG for on-line learning problems n On-line KG vs. Gittins On-line KG matches Gittins.

KG for on-line learning problems n Experimental comparisons: » KG-online vs: • Gittins indices • Interval estimation • Pure exploitation KG versus: Heterogeneous prior Uniform prior Gittins Interval estimation Pure exploitation © 2008 Warren B. Powell 59

Outline n The knowledge gradient with correlated beliefs Ø Presented by Peter Frazier, Ph.

The correlated knowledge gradient (CKG) n In the independent knowledge gradient, measuring one alternative taught us nothing about the others. 1 2 3 4 5 n Sometimes, this is not true. In such cases we should use the correlated knowledge gradient. 1 © 2008 Warren B. Powell 2 3 4 5 61

The correlated knowledge gradient (CKG) n Measuring one alternative can teach us about other alternatives. n Examples » finding the best price at which to sell a product. • demand at a price of $8 is close to demand at a price of $9 » marketing budget allocation. • two budgets that differ only in the amount allocated to radio advertising will produce similar effects. » choosing the temperature and pressure at which to run a chemical process. » choosing a combination of products to sell in a product line. • two product lines with all but one product in common will produce similar revenue streams. » choosing a combination of drugs to treat a disease. » finding a chemical for a particular medical or industrial purpose. • two chemicals sharing similar molecular structures behave similarly. © 2008 Warren B. Powell 62

The correlated knowledge gradient (CKG) n To compute the correlated KG policy, we use the same formula, n but now F(y, K(x)) changes from F(y, x) for every y, not just for y=x. This changes the expectation. 1 2 3 4 5 measure. . . these beliefs change too. here. . . © 2008 Warren B. Powell 63

CKG for Wind Farm Location The posterior knowledge state K(x) is written here as

CKG for Wind Farm Location © 2008 Warren B. Powell 65

CKG for Wind Farm Location © 2008 Warren B. Powell 66

CKG for Wind Farm Location © 2008 Warren B. Powell 67

CKG for Wind Farm Location © 2008 Warren B. Powell 68

CKG for Wind Farm Location © 2008 Warren B. Powell 69

CKG for Wind Farm Location © 2008 Warren B. Powell 70

CKG for Wind Farm Location © 2008 Warren B. Powell 71

CKG for Wind Farm Location © 2008 Warren B. Powell 72

CKG for Wind Farm Location © 2008 Warren B. Powell 73

CKG for Wind Farm Location © 2008 Warren B. Powell 74

CKG for Wind Farm Location © 2008 Warren B. Powell 75

CKG for Wind Farm Location © 2008 Warren B. Powell 76

CKG for Wind Farm Location © 2008 Warren B. Powell 77

CKG for Wind Farm Location © 2008 Warren B. Powell 78

CKG for Wind Farm Location © 2008 Warren B. Powell 79

CKG for Wind Farm Location © 2008 Warren B. Powell 80

CKG for Wind Farm Location © 2008 Warren B. Powell 81

CKG for Wind Farm Location © 2008 Warren B. Powell 82

CKG for Wind Farm Location Best location is 1. F(1, K)+s 1 W ©

CKG for Wind Farm Location Best location is 34. F(34, K)+s 34 W ©

CKG for Wind Farm Location Best location is 48. s 48 W )+ F(48,

CKG for Wind Farm Location Best location is 49 W )+ (49, K F

CKG for Wind Farm Location s 49 K)+ 9, s 48 K)+ W F(4 8, F(1, K)+s 1 W F(34, K)+s 34 W F(4 where n 1, 34, 48, 49 are the locations that may be best. n a 1, . . . , a 4 are the values of F(y, K) at these locations. n b 1, . . . , b 4 are the values of sy at these locations. © 2008 Warren B. Powell 87 W

CKG for Wind Farm Location b 4 a 4+ a 1+b 1 W a 2+b 2 W W b 3 W + a 3 where n 1, 34, 48, 49 are the locations that may be best. n a 1, . . . , a 4 are the values of F(y, K) at these locations. n b 1, . . . , b 4 are the values of sy at these locations. © 2008 Warren B. Powell 88

CKG for Wind Farm Location Then the knowledge gradient is given by where f(z) = (z) + z (z), is the normal pdf, and is the normal cdf. a 2+b 2 W a 1+b 1 W W a 3+b 3 W a 4+b 4 W © 2008 Warren B. Powell 89

Wind Speed CKG for Wind Farm Location Function’s true value Our belief (mean and

CKG for Wind Farm Location Wind Speed Measurement Locations © 2008 Warren B. Powell

Wind Speed CKG for Wind Farm Locations © 2008 Warren B. Powell 92

Wind Speed CKG for Wind Farm Locations © 2008 Warren B. Powell 93

CKG in 1 Dimension Comparison of: CKG = correlated knowledge gradient SKO = sequential kriging optimization (Huang et al. 2006) IKG = independent knowledge gradient. y # of measurements Opportunity Cost (OC) = (value of best alternative) - (value of implemented alternative). © 2008 Warren B. Powell 94

CKG in 2 Dimensions Branin function Typical behavior Average behavior © 2008 Warren B.

CKG for Diabetes Treatments n Healthy people have fasting plasma glucose (FPG) levels of 4 to 6 mmol/L. n Diabetes patients have FPG levels of 10 to 15 mmol/L. n 4 diabetes drugs A, B, C, D. » By themselves, each drug reduces FPG between 0. 5 and 2 mmol/L. 6 mmol/L A B C D 0 n Goal: find the combination of 3 drugs that most reduces FPG. » FPG reduction is approximately additive. 6 mmol/L ABC BCD ABD Combination effect 0 © 2008 Warren B. Powell 96

CKG for Diabetes Treatments n Individual drug and combination effect are unknown. n We have a prior belief on their values. 6 mmol/L A B C D BCD ABD ACD 0 6 mmol/L ABC 0 . . n The FPG reduction achieved by a combination can be measured with noise. n What sequence of combinations should we test to find the best one? © 2008 Warren B. Powell 97

CKG for Diabetes Treatments n When we measure one drug we learn about all

CKG for Diabetes Treatments n When we measure a combination of drugs, we learn about all other combinations with drugs in common. Before measurement of ABC mmol/L A B C D BCD ABD ACD 0 mmol/L ABC 0 6 ABC BCD ABD ACD mmol/L 0 © 2008 Warren B. Powell 100

CKG for Diabetes Treatments n When we measure a combination of drugs, we learn about all other combinations with drugs in common. After measurement of ABC mmol/L A B C D BCD ABD ACD 0 mmol/L ABC 0 6 ABC BCD ABD ACD mmol/L 0 © 2008 Warren B. Powell 101

CKG for Diabetes Treatment The knowledge gradient for this problem is given by, where f(z) = (z) + z (z), is the normal pdf, and is the normal cdf. n Formula is identical to that from the wind farm example. n ai corresponds to F(y, K), as before. n bi is the “sensitivity” of F(y, K(x)) to the observation resulting from x. This is computed to reflect the correlation structure of the diabetes treatments. © 2008 Warren B. Powell 102

Outline n Optimal learning on a graph Ø Research by Ilya Ryzhov, Ph. D.

Information collection on a graph n Congestion detection » We have a vehicle that is traveling segments of the network to evaluate congestion. Your information will then be used to route emergency vehicles (the implementation decision). » Traveling an arc from i to j means it is now possible to look at arcs out of j. » How should you traverse the network? © 2008 Warren B. Powell 104

Information collection on a graph n Biosurveillance » You are part of a medical team traveling around Africa to measure the presence of malaria in different parts of the continent. » Visiting one location changes the cost of visiting other locations. » How do you sequence your measurements to have the great impact on prevention strategies (the implementation decision). © 2008 Warren B. Powell Presence of malaria 105

Information collection on a graph n Optimal routing over a graph: node (e. g.

Information collection on a graph n Ranking and selection/bandit problems » Making a measurement changes our belief about the value of each choice 1 2 3 4 5 New state of knowledge Current state of knowledge Decision to make a measurement © 2008 Warren B. Powell 107

Information collection on a graph n Notation » Bellman’s equation can be used to describe transitions to different nodes of a network, or changes in our distributional knowledge about the value of each decision. » To distinguish between a physical state (the node we are at) and the knowledge state (the probability distribution describing our understanding of the value of a decision), let Physical state (“resource state”) © 2008 Warren B. Powell Knowledge state 108

Information collection on a graph n Optimal routing over a graph: © 2008 Warren

Information collection on a graph n Optimal routing over a graph » The shortest

Information collection on a graph n Optimal routing over a graph » The shortest path » The path we explore » Now we have a new shortest path » How do we decide which links to measure? © 2008 Warren B. Powell 112

Information collection on a graph n The knowledge gradient on a graph » We can apply the knowledge gradient concept directly Expected valuepath of updated shortest problem Shortest problem after thepath update weproblem believe about Current path Updated distributions ofshortest arc What costs link Based on what wecosts know » How do we compute the expected value of a stochastic shortest path problem? © 2008 Warren B. Powell 113

Information collection on a graph n The knowledge gradient on a graph » The knowledge gradient is almost the same as it was for the ranking and selection problem » Just think of each path through the network as a possible choice » Recall that the knowledge gradient is given by » where » and © 2008 Warren B. Powell 114

Information collection on a graph n Optimal routing over a graph » For the graph problem, the normalized influence is given by » The normalized influence is given by Value of best path that includes link (i, j) Value of best path that does not include link (i, j) Measurement link (i, j) © 2008 Warren B. Powell 115