
Beyond demonstrations: Learning behavior from higher-level supervision
Hal Daumé III
Microsoft Research, University of Maryland
me@hal3.name, @haldaume3, he/him/his

Hybrid imitation / reinforcement learning
Teacher provides high-level feedback and agent applies standard RL at the low level → speed-up in learning
[evaluate on Montezuma’s Revenge, theory bounds number of queries to expert]
with: Hoang Le, Nan Jiang, Alekh Agarwal, Yisong Yue, Miro Dudík

Reinforcement learning with convex constraints
Teacher provides high-level constraints/preferences and agent applies interleaved RL & online updates → speed-up in learning
[evaluate on “Mars rover” domains, “safety” and “diversity” constraints]
with: Sobhan Miryoosefi, Kianté Brantley, Miro Dudík, Robert Schapire


Reinforcement learning with convex constraints
- Best mixture policy
- Expected measurement vector
- Arbitrary convex set
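The three labels above annotate a formula that did not survive on this slide. One way to write a feasibility problem they are consistent with is sketched below. The symbols (\mu for the mixture policy, \Delta(\Pi) for the set of mixtures over policies, \bar{z} for the expected measurement vector, \mathcal{C} for the convex set) are assumed notation, not taken from the slide.

```latex
% Assumed notation: \mu = mixture policy, \bar{z}(\mu) = expected measurement
% vector under \mu, \mathcal{C} = arbitrary convex target set.
\[
  \text{find } \mu \in \Delta(\Pi)
  \quad \text{such that} \quad
  \bar{z}(\mu) \in \mathcal{C}
\]
% An equivalent view: choose the mixture whose expected measurement vector
% is closest to the set.
\[
  \min_{\mu \in \Delta(\Pi)} \; \operatorname{dist}\bigl(\bar{z}(\mu), \mathcal{C}\bigr)
\]
```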

- Best mixture policy
- Best-response on policy = Run RL against measurement vector
- Expected measurement vector
- Online gradient descent on “dual” variables
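The interplay between the best-response step and the online gradient descent step can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration: the target set is taken to be the nonpositive orthant {z : z <= 0}, the "RL oracle" is replaced by an argmin over two hypothetical policies with known measurement vectors, and the step size is a simple 1/sqrt(T) choice. It is not the implementation from the talk.

```python
import numpy as np

# Hypothetical toy problem: two policies with known expected measurement vectors.
# Neither satisfies z <= 0 alone, but their 50/50 mixture does.
POLICY_MEASUREMENTS = np.array([[1.0, -2.0], [-2.0, 1.0]])


def best_response(lam):
    """Stand-in for the RL oracle: pick the policy minimizing lam . z(policy)."""
    return int(np.argmin(POLICY_MEASUREMENTS @ lam))


def project_dual(lam):
    """Project onto the nonnegative orthant intersected with the unit ball.

    The nonnegative orthant is the polar cone of the assumed target set.
    """
    lam = np.maximum(lam, 0.0)
    norm = np.linalg.norm(lam)
    return lam / norm if norm > 1.0 else lam


def distance_to_target(z):
    """Euclidean distance from z to the nonpositive orthant."""
    return float(np.linalg.norm(np.maximum(z, 0.0)))


def solve(num_rounds=2000):
    """Alternate best responses with gradient ascent on the dual variables."""
    lam = np.zeros(POLICY_MEASUREMENTS.shape[1])
    step = 1.0 / np.sqrt(num_rounds)
    counts = np.zeros(len(POLICY_MEASUREMENTS))
    for _ in range(num_rounds):
        k = best_response(lam)  # policy player: "run RL" against lam
        counts[k] += 1
        lam = project_dual(lam + step * POLICY_MEASUREMENTS[k])  # dual player
    weights = counts / num_rounds  # the returned mixture policy
    return weights, weights @ POLICY_MEASUREMENTS


if __name__ == "__main__":
    weights, z_bar = solve()
    print("mixture weights:", weights)
    print("expected measurement:", z_bar)
    print("distance to target set:", distance_to_target(z_bar))
```

The returned mixture is the empirical distribution over the policies chosen in each round. In a real setting the argmin would be replaced by an RL run whose reward is the negated inner product with the current dual vector.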

Constraints:
1. High reward
2. Low probability of failure
3. High diversity (optional)
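As a rough illustration of how the first two constraints define a convex set, the sketch below treats the measurement vector as a pair (reward, failure probability) and the target set as a box, so projection reduces to clipping. The thresholds and function names are hypothetical, and the optional diversity constraint is left out.

```python
import numpy as np


def project_onto_constraints(z, min_reward, max_failure_prob):
    """Euclidean projection of z = (reward, failure_prob) onto the box
    {reward >= min_reward, failure_prob <= max_failure_prob}."""
    reward, failure_prob = z
    return np.array([max(reward, min_reward), min(failure_prob, max_failure_prob)])


def constraint_violation(z, min_reward, max_failure_prob):
    """Distance from z to the constraint set; zero means all constraints hold."""
    z = np.asarray(z, dtype=float)
    projected = project_onto_constraints(z, min_reward, max_failure_prob)
    return float(np.linalg.norm(z - projected))


if __name__ == "__main__":
    # Hypothetical thresholds: reward at least 0.8, failure probability at most 0.1.
    print(constraint_violation([0.5, 0.3], min_reward=0.8, max_failure_prob=0.1))
    print(constraint_violation([0.9, 0.05], min_reward=0.8, max_failure_prob=0.1))
```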

Theorem (hand-wavy):
Given:
- Positive response RL oracle with tolerance
- Estimation oracle (of measurement) with tolerance
- Projection oracle (onto convex set)
After T rounds:
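To make the estimation oracle more concrete, here is a minimal Monte Carlo sketch that averages discounted per-step measurement vectors over sampled episodes. It assumes the measurement of a policy is a discounted sum of per-step vectors. The rollout interface, the toy episode generator, and all parameter values are hypothetical. The oracle's tolerance would correspond to the error of this average.

```python
import numpy as np


def discounted_measurement(step_measurements, discount):
    """Discounted sum of the per-step measurement vectors of one episode."""
    steps = np.asarray(step_measurements, dtype=float)
    weights = discount ** np.arange(len(steps))
    return weights @ steps


def estimate_measurement(rollout, num_episodes, discount, seed=0):
    """Average the discounted measurement over sampled episodes."""
    rng = np.random.default_rng(seed)
    totals = [discounted_measurement(rollout(rng), discount)
              for _ in range(num_episodes)]
    return np.mean(totals, axis=0)


def toy_rollout(rng, horizon=5, failure_rate=0.1):
    """Hypothetical episode: reward 1 at every step, plus a failure flag
    that fires independently with a fixed probability."""
    failures = rng.random(horizon) < failure_rate
    return [[1.0, float(failed)] for failed in failures]


if __name__ == "__main__":
    print(estimate_measurement(toy_rollout, num_episodes=200, discount=0.9))
```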
