Dylan Hadfield-Menell Aug 17, 2017
“Be careful what you wish for!” – we’ve all heard it! The story of King Midas is there to warn us of what might happen when we’re not. Midas, a king who loves gold, runs into a satyr and wishes that everything he touches would turn to gold. Initially, this is fun and he walks around turning items to gold. But his happiness is short lived. Midas realizes the downsides of his wish when he hugs his daughter and she turns into a golden statue.
We, humans, have a notoriously difficult time specifying what we actually want, and the AI systems we build suffer from it. With AI, this warning actually becomes “Be careful what you reward!”. When we design and deploy an AI agent for some application, we need to specify what we want it to do, and this typically takes the form of a reward function: a function that tells the agent which state and action combinations are good. A car reaching its destination is good, and a car crashing into another car is not so good.
AI research has made a lot of progress on algorithms for generating AI behavior that performs well according to the stated reward function, from classifiers that correctly label images with what’s in them, to cars that are starting to drive on their own. But, as the example of King Midas teaches us, it’s not the stated reward function that matters: what we really need are algorithms for generating AI behavior that performs well according to the designer or user’s intended reward function.
Our recent work on Cooperative Inverse Reinforcement Learning formalizes and investigates optimal solutions to this value alignment problem — the joint problem of eliciting and optimizing a user’s intended objective.
Open AI gave a recent example of the difference between stated vs. intended reward functions. The system designers were working on reinforcement learning for racing games. They decided to reward the system for obtaining points; this seems reasonable as we expect policies that win races to get a lot of points. Unfortunately, this lead to quite suboptimal behavior in several environments:
This video demonstrates a racing strategy that pursues points and nothing else, failing to actually win the race. This is clearly distinct from the desired behavior, yet the designers did get exactly the behavior they asked for.
For a less light-hearted example of value misalignment, we can look back to late June 2015. Google had just released an image classifier feature that leveraged some of the recent advances in image classification. Unfortunately for one user, the system decided to classify his African-American friend as a gorilla.