Reinforcement Learning: Crash Course AI#9

Hey, I’m Jabril and welcome to Crash Course
AI. Say I want to get a cookie from a jar that’s
on a tall shelf. There isn’t one “right way” to get the
cookies. Maybe I find a ladder, use a lasso, or build
a complicated system of pulleys. These could all be brilliant or terrible ideas,
but if something works, I get the sweet taste of victory… and I learn that doing that
same thing could get me another cookie in the future. We learn lots of things by trial-and-error,
and this kind of “learning by doing” to achieve complicated goals is called Reinforcement
Learning. So far, we’ve talked about two types of
learning in Crash Course AI: Supervised Learning, where a teacher gives an AI answers to learn
from, and Unsupervised Learning, where an AI tries to find patterns in the world. Reinforcement Learning is particularly useful
for situations where we want to train AIs to have certain skills we don’t fully understand
ourselves. For example, I’m pretty good at walking,
but trying to explain the process of walking is kind of difficult. What angle should your femur be relative to
your foot? And should you move it with an average angular
velocity of… yeah, never mind… it’s really difficult. With reinforcement learning, we can train
AIs to perform complicated tasks. But unlike other techniques, we only have
to tell them at the very end of the task if they succeeded, and then ask them to tell
us how they did it. (We’re going to focus on this general case,
but sometimes this feedback could come earlier.) So if we want an AI to learn to walk, we give
them a reward if they’re both standing up and moving forward, and then figure out what
steps they took to get to that point. The longer the AI stands up and moves forward,
the longer it’s walking, and the more reward it gets. So you can kind of see how the key to reinforcement
learning is just trial-and-error, again and again. For humans, a reward might be a cookie or
the joy of winning a board game. But for an AI system, a reward is just a small
positive signal that basically tells it “good job” and “do that again”! Google DeepMind got some pretty impressive
results when they used reinforcement learning to teach virtual AI systems to walk, jump,
and even duck under obstacles. It looks kinda silly, but works pretty well! Other researchers have even helped real-life
robots learn to walk. So seeing the end result is pretty fun and
can help us understand the goals of reinforcement learning. But to really understand how reinforcement
learning works, we have to learn some new language to talk about these AIs and what they’re
doing. Similar to previous episodes, we have an AI
(or Agent) as our loyal subject that’s going to learn. An agent makes predictions or performs Actions,
like moving a tiny bit forward, or picking the next best move in a game. And it performs actions based on its current
inputs, which we call the State. In supervised learning, after /each/ action,
we would have a training label that tells our AI whether it did the right thing or not. We can’t do that here with reinforcement
learning, because we don’t know what the “right thing” actually is until it’s
completely done with the task. This difference actually highlights one of
the hardest parts of reinforcement learning, called credit assignment. It’s hard to know which actions helped us
get to the reward (and should get credit) and which actions slowed down our AI when
we don’t pause to think after every action. So the agent ends up interacting with its
Environment for a while, whether that’s a game board, a virtual maze, or a real-life
kitchen. And the agent takes many actions until it
gets a Reward, which we give out when it wins a game or gets that cookie jar from that really
tall shelf. Then, every time the agent wins (or succeeds
at its task), we can look back on the actions it took and slowly figure out which game states
were helpful and which weren’t. During this reflection, we’re assigning
Value to those different game states and deciding on a Policy for which actions work best. We need Values and Policies to get anything done in reinforcement learning. Let’s say I see some food in the kitchen:
a box, a small bag, and a plate with a donut. So my brain can assign each of these a value,
a numerical yummy-ness value. The box probably has 6 donuts in it, the bag
probably has 2, and the plate just has 1… so the values I assign are 6, 2, and 1. Now that I’ve assigned each of them a value,
I can decide on a policy to plan what action to take! The simplest policy is to go to the highest
value (that box of possibly 6 donuts). But I can’t see inside of it, and that could
be a box of bagels, so it’s high reward but high risk. Another policy could be low reward but low
risk, going with the plate with 1 guaranteed delicious donut. Personally, I’d pick a middle-ground policy,
and go for the bag, because I have a better chance of guessing that there are donuts inside it
than inside the box, and a value of 1 donut isn’t enough. That’s a lot of vocab, so let’s see these
concepts in action to help us remember everything. Our example is going to focus on a mathematical
framework that could be used with different underlying machine learning techniques. Let’s say John-Green-bot wants to go to
the charging station to recharge his batteries. In this example, John-Green-bot is a brand
new Agent, and the room is the Environment he needs to learn about. From where he is now in the room, he has four
possible Actions: moving up, down, left, or right. And his State is a couple of different inputs:
where he is, where he came from, and what he sees. For this example, we’ll assume John-Green-bot
can see the whole room. So when he moves up (or any direction), his
state changes. But he doesn’t know yet if moving up was
a good idea, because he hasn’t reached a goal. So go on, John-Green-bot… explore! He found the battery, so he got a Reward (that
little plus one). Now, we can look back at the path he took
and give all the cells he walked through a Value — specifically, a higher value for
those near the goal, and lower for those farther away. These higher and lower values help with the
trial-and-error of reinforcement learning, and they give our agent more information about
better actions to take when he tries again! So if we put John-Green-bot back at the start,
he’ll want to decide on a Policy that maximizes reward. Since he already knows a path to the battery,
he’ll walk along that path, and he’s guaranteed another +1. But that’s… too easy. And kind of boring if John-Green-bot just
takes the same long and winding path every time. So another important concept in reinforcement
learning is the trade-off between exploitation and exploration. Now that John-Green-bot knows one way to get
to the battery, he could just exploit this knowledge by always taking the same 10 actions. It’s not a terrible idea — he knows he
won’t get lost and he’ll definitely get a reward. But this 10-action path is also pretty inefficient,
and there are probably more efficient paths out there. So exploitation may not be the best strategy. It’s usually worth trying lots of different
actions to see what happens, which is a strategy called exploration. Every new path John-Green-bot takes will give
him a bit more data about the best way to get a reward. So let’s let John-Green-bot explore for
100 actions, and after he completes a path, we’ll update the values of the cells he’s
been to. Now we can look at all these new values! During exploration, John-Green-bot found a
short-cut, so now he knows a path that only takes 4 actions to get to the goal. This means our new policy (which always chooses
the best value for the next action) will take John-Green-bot down this faster path to the
target. That’s much better than before, but we paid
a cost, because during those 100 actions of exploration, he took some paths that were
even /more/ inefficient than the first 10-action try and only got a total of 6 points. If John-Green-bot had just exploited his knowledge
of the first path he took for those 100 actions, he could have made it to the battery 10 times
and gotten 10 points. So you could say that exploration was a waste
of time. BUT if we started a new competition between
the new John-Green-bot (who knows a 4-action path) and his younger, more foolish self (who
knows a 10-action path), over 100 actions, the new John-Green-bot would be able to get
25 points because his path is much faster. His reinforcement learning helped! So should we explore more to try and find
an even better path? Or should we just use exploitation right away
to collect more points? In many reinforcement learning problems, we
need a balance of exploitation and exploration, and people are actively researching this trade-off. These kinds of problems can get even more
complicated if we add different kinds of rewards, like a +1 battery and a +3 bigger battery. Or there could even be Negative Rewards that
John-Green-bot needs to learn to avoid, like this black hole. If we let John-Green-bot explore this new
environment using reinforcement learning, sometimes he falls into the black hole. So the cells will end up having different
values than the earlier environment, and there could be a different best policy. Plus, the whole environment could change in
many of these problems. If we have an AI in our car helping us drive
home, the same road will have different people, bicycles, cars, and black holes on it every
day. There might even be construction that completely
reroutes us. This is where reinforcement learning problems
get more fun, but much harder. When John-Green-bot was learning how to navigate
on that small grid, cells closer to the battery had higher values than those far away. But for many problems, we’ll want to use
a value function to think about what we’ve done so far, and decide on the next move using
math. For example, in this situation where an AI
is helping us drive home, if we’re optimizing for safety and we see the brake lights of the
car in front of us, it’s probably time to slow down, but if we saw a bag of donuts in
the street, we would want to stop. So reinforcement learning is a powerful tool
that’s been around for decades, but a lot of problems need a ton of data and a ton of
time to solve. There have been really impressive results
recently thanks to deep reinforcement learning on large-scale computing. These systems can explore massive environments
and a huge number of states, leading to results like AIs learning to play games. At the core of a lot of these problems are
discrete symbols, like a command for forward or the squares on a game board, so how to
reason and plan in these spaces is a key part of AI. Next week, we’ll dive into symbolic AI and
how it’s a powerful tool for systems we use every day. See you then. Crash Course AI is produced in association
with PBS Digital Studios. If you want to help keep Crash Course free
for everyone, forever, you can join our community on Patreon. And if you want to learn other approaches
to control robot behavior, check out this video on Crash Course Computer Science.
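
As a supplement to the John-Green-bot grid example in this episode, here is a minimal Python sketch of giving cells on a rewarded path higher values near the goal, then using a policy that always moves to the highest-valued neighboring cell. The 3x3 grid, the winding 8-action path, the fade factor `GAMMA = 0.9`, and the function names are all assumptions made for illustration; none of them come from the episode, and this is only one simple way the idea could be coded.

```python
# Hypothetical setup: a 3x3 grid with the battery at (row 0, col 2).
GAMMA = 0.9  # assumed: how quickly value fades with distance from the goal
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def assign_values(path, reward=1.0, gamma=GAMMA, values=None):
    """Give each cell on a rewarded path a value that shrinks with distance from the goal.

    `path` is a list of (row, col) cells that ends at the goal. If a cell
    already has a value, the larger of the old and new value is kept.
    """
    values = dict(values or {})
    for steps_from_goal, cell in enumerate(reversed(path)):
        candidate = reward * gamma ** steps_from_goal
        values[cell] = max(values.get(cell, 0.0), candidate)
    return values


def greedy_action(cell, values):
    """Policy: choose the move whose destination has the highest known value.

    Cells that were never visited (or are off the grid) count as value 0.
    """
    def value_after(move):
        d_row, d_col = MOVES[move]
        return values.get((cell[0] + d_row, cell[1] + d_col), 0.0)
    return max(MOVES, key=value_after)


def follow_policy(start, goal, values, max_steps=50):
    """Walk from `start` using the greedy policy until the goal is reached."""
    route = [start]
    while route[-1] != goal and len(route) <= max_steps:
        row, col = route[-1]
        d_row, d_col = MOVES[greedy_action(route[-1], values)]
        route.append((row + d_row, col + d_col))
    return route


# A made-up long and winding first path from the start (2, 0) to the goal (0, 2).
long_path = [(2, 0), (1, 0), (0, 0), (0, 1), (1, 1),
             (2, 1), (2, 2), (1, 2), (0, 2)]
values = assign_values(long_path)
route = follow_policy((2, 0), (0, 2), values)

print(len(long_path) - 1)  # 8 actions on the first path
print(len(route) - 1)      # 4 actions when following the values greedily
```

In this toy grid, looking at neighboring values is already enough to skip part of the winding path, because the path passed next to cells that were closer to the goal. In the episode, the shortcut is found through extra exploration instead.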

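The exploitation versus exploration numbers from the episode can also be checked with a little arithmetic: repeating a 10-action path for 100 actions earns 10 points, while a 4-action path earns 25. The sketch below also shows one common way to mix the two strategies, usually called epsilon-greedy. The episode does not name a specific technique, so `epsilon_greedy`, its `epsilon` parameter, and the example action values are assumptions for illustration.

```python
import random


def rewards_collected(total_actions, actions_per_trip, reward_per_trip=1):
    """Points earned by repeating one known path for a fixed budget of actions."""
    return (total_actions // actions_per_trip) * reward_per_trip


def epsilon_greedy(action_values, epsilon, rng=random):
    """Explore with probability `epsilon`, otherwise exploit the best-known action.

    `action_values` maps each action name to its current estimated value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(action_values))       # explore: try any action
    return max(action_values, key=action_values.get)  # exploit: best so far


print(rewards_collected(100, 10))  # 10 points with the first 10-action path
print(rewards_collected(100, 4))   # 25 points with the 4-action shortcut

# Made-up value estimates for John-Green-bot's four actions in some cell.
estimates = {"up": 0.2, "down": 0.0, "left": 0.1, "right": 0.9}
print(epsilon_greedy(estimates, epsilon=0.0))  # always exploits: right
```

Setting `epsilon` to 0 means pure exploitation and setting it to 1 means pure exploration; values in between are one simple way to balance the two.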
