Reinforcement Learning#
Introduction#
At the very core of learning is the idea of interacting with an environment. This interaction produces a wealth of information about cause-and-effect relationships and the consequences of actions. An infant, for instance, has no explicit teacher, yet through its sensorimotor connection to its surroundings it learns about the world around it and learns to walk and talk. Similarly, when driving a car we improve by noticing how the environment responds to the actions we take. This kind of interaction is a foundation for learning and intelligence. The computational approach to learning from interaction, and the various methods involved, is called Reinforcement Learning (RL). It is a goal-directed learning process, in contrast to other machine learning approaches such as the widely used supervised learning, which we will discuss later in the book.
The idea is that a learner, called an agent, learns control strategies by interacting with an environment. The agent’s job is to learn what to do, i.e. how to map situations to actions, through trial and error. The agent receives rewards from the environment as feedback for its actions.
The agent must eventually learn which actions yield the most reward by trying them. We can think of this as learning a set of actions through positive reinforcement. In most cases, an action affects not only the immediate reward but also the rewards received at subsequent time steps. The agent must therefore learn to take actions that maximize the total reward over time.
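As a rough illustration of the difference between immediate and total reward, here is a minimal Python sketch. The reward lists and the optional discount factor gamma are assumptions made purely for illustration, not part of any particular algorithm.

```python
# A minimal sketch of the idea that the agent cares about total reward over
# time, not just the immediate reward. Rewards arrive as a simple list of
# numbers; future rewards may optionally be discounted by a factor gamma
# (a common convention, assumed here for illustration).

def total_return(rewards, gamma=1.0):
    """Sum of (optionally discounted) rewards collected over an episode."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A greedy choice now (reward 1, then nothing) can lose to a patient one
# (nothing now, then reward 2 later) once we look at the total.
print(total_return([1, 0, 0]))   # 1.0
print(total_return([0, 0, 2]))   # 2.0
```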
Note
Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision making. Trial-and-error search and delayed reward are its most important distinguishing features.
This can be thought of as an optimization problem posed over a dynamical system.
Note
A dynamical system is a mathematical concept that describes how the state of a system changes over time.
State Space: The system has a “state” at any given moment, represented by a point in an abstract space. Example: the position and velocity of a moving object.
Dynamics: The rules that govern how the state changes over time.
Time Evolution: The system evolves over time according to these rules, which can be deterministic or stochastic.
Time Steps: The system’s state is updated at discrete time steps or continuously.
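To make the note concrete, here is a tiny sketch of a discrete-time dynamical system in Python; the specific update rule and noise level are invented purely for illustration.

```python
import random

# Toy discrete-time dynamical system: the state is a single number, and a
# fixed rule maps the current state (plus optional noise) to the next state.
# The 0.9 decay factor and the Gaussian noise are illustrative assumptions.
def step(state, noise=0.0):
    return 0.9 * state + noise

state = 10.0                       # a point in a (one-dimensional) state space
for t in range(5):                 # time evolution over discrete time steps
    state = step(state, noise=random.gauss(0.0, 0.1))
    print(f"t={t + 1}, state={state:.3f}")
```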
The agent gets to measure the current state of the environment, i.e. a state S. It does not necessarily observe everything about the environment; it only knows which state it is in now and which states it has been in before. Based on this, the agent gets to take an action, deciding what to do next.
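The observe-act loop described above might look like the following sketch. `ToyEnvironment` and `random_policy` are hypothetical stand-ins invented for illustration, not a real library API.

```python
import random

# A bare-bones interaction loop: at each time step the agent observes the
# current state, picks an action, and the environment returns the next state
# and a reward. The dynamics and reward below are placeholders.

class ToyEnvironment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action                      # next state
        reward = 1.0 if self.state == 3 else 0.0  # reward signal
        return self.state, reward

def random_policy(state, actions=(-1, +1)):
    return random.choice(actions)

env = ToyEnvironment()
state, total_reward = env.state, 0.0
for t in range(10):
    action = random_policy(state)
    state, reward = env.step(action)
    total_reward += reward
print("total reward:", total_reward)
```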
Difference between Reinforcement Learning and other types of learning#
Supervised Learning#
Reinforcement learning is very different from supervised learning, where you have a training set of labeled examples provided by an external supervisor. Each example is a situation and each label is the correct action to take in that situation. We learn a function that maps situations to actions and use it to extrapolate, or generalize, to new situations. This is an important kind of learning, but on its own it is inadequate because there is no learning from interaction.
In interactive learning problems, it is often impractical to obtain examples of desired behaviour for the numerous situations the agent may encounter. In such cases, an agent must be able to learn from its own experience.
Unsupervised Learning#
In unsupervised learning, the idea is to find patterns and structure hidden in unlabeled data, so there is no external supervisor and no labeled data from which to learn a function mapping. This can cause confusion, since reinforcement learning also does not rely on labeled data; the difference is that its underlying goal is to maximize a reward signal rather than to find hidden structure. Any uncovering of hidden structure happens implicitly through the agent’s experience.
Note
Reinforcement learning can be put as a third paradigm of machine learning, alongside supervised and unsupervised learning.
Exploration vs Exploitation#
To obtain a lot of reward, the agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. So the agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore new actions to discover their effects and make better action selections in the future. This is called the exploration-exploitation trade-off.
On a stochastic task, each action must be tried multiple times to get a good estimate of its expected reward. The agent must try a variety of actions and progressively favour the most promising ones in order to make progress on the task. This exploration-exploitation dilemma is a fundamental problem in reinforcement learning; researchers have studied it for decades and it remains unresolved.
We do not have this problem in supervised or unsupervised learning.
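One common (though by no means the only) way of handling the trade-off is epsilon-greedy action selection, sketched below on a made-up three-action task; the reward probabilities and the value of epsilon are assumptions chosen for illustration.

```python
import random

# Epsilon-greedy selection on a simple bandit-style task: with probability
# epsilon the agent explores a random action, otherwise it exploits the
# action with the highest estimated reward so far.

random.seed(0)
true_reward_prob = [0.3, 0.7, 0.5]   # unknown to the agent (illustrative)
estimates = [0.0, 0.0, 0.0]          # running estimates of each action's value
counts = [0, 0, 0]
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)                  # explore: try something new
    else:
        action = estimates.index(max(estimates))      # exploit: best so far
    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    # incremental average keeps an estimate of the expected reward
    estimates[action] += (reward - estimates[action]) / counts[action]

print("estimates:", [round(e, 2) for e in estimates])
print("counts:", counts)
```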
Engineering and Scientific disciplines#
Reinforcement learning has been part of Artificial Intelligence and Machine Learning for decades and is tightly integrated with statistics, optimization, and control theory. Since it is the closest to the kind of learning that humans and animals do, it is strongly related to psychology and neuroscience, and was originally inspired by biological learning systems.
Examples#
A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.
A master chess player makes a move. The choice is informed both by planning—anticipating possible replies and counterreplies—and by immediate, intuitive judgments of the desirability of particular positions and moves.
A child learns to walk by falling down and getting up again.
A person prepares breakfast. It is a mundane task, but it rests on a complex web of conditional behavior acquired over time and tuned by its interactive consequences. Rapid judgements are made based on a series of eye movements that gather information and guide reaching for objects. Each step is goal-directed and optimized, such as deciding what to carry first and what to put on the dining table so as to minimize trips to the kitchen while avoiding dropping things. Whether the agent, in this case the human, is aware of it or not, it is constantly determining the action to take in a given state.
All of these involve interaction, an environment, a decision-making agent, and finally a goal.
The agent’s actions are permitted to affect the future state of the environment (e.g., the next chess position) thereby affecting the actions and opportunities available to the agent at later times.
At the same time, the effects of actions cannot be fully predicted, so the agent must monitor its environment frequently, that is, observe each state and react appropriately. The chess player makes a move, observes the opponent’s response, and eventually learns whether or not he wins.
The agent can use its experience to improve its performance over time. The chess player refines the intuition he uses to evaluate positions, thereby improving his play.
The agent can either bring knowledge from previous experience or learn from scratch by evolution, but interaction with the environment is essential for adjusting behavior to exploit specific features of the task.
Elements of Reinforcement Learning#
Along with the agent and environment, these are the subelements of a reinforcement learning system:
Policy: This defines a learning agent’s way of behaving at a given time. It is a mapping from perceived states of the environment to actions to be taken when in those states; in psychology this corresponds to stimulus-response associations. A policy can be a simple lookup table or a complex function approximator such as a neural network. It can be deterministic (a fixed rule mapping each state to one action) or stochastic (specifying probabilities for each action given the state). With a deterministic policy the chosen action for a state never changes, whereas a stochastic policy introduces randomness, which is useful for exploring other possibilities.
Reward Signal: This defines the goal of the reinforcement learning problem. At each time step, the environment sends the agent a single number called the reward. The agent’s sole objective is to maximize the total reward it receives over the long run. Think of this as the experiences of pleasure and pain in a biological system. The reward signal is the primary basis for altering the policy: if an action selected by the policy is followed by low reward, the policy may be changed to select some other action in that situation in the future.
Value Function: Whereas a reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Rewards determine the immediate desirability of a state; values indicate its long-term desirability, taking into account the states that are likely to follow and the rewards available in those states.
Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, action choices are made based on value judgments: we seek actions that lead to states of highest value rather than highest immediate reward, because those actions obtain the greatest amount of reward over the long run. Values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime.
State: The situation the agent is in, or the information it has about the environment, at a given time. The state signal provides a sense of how the environment looks at that moment. For example, in a chess game the state is the current position of the pieces on the board; in a driving game it is the current position of the car, its speed, and other relevant information about the environment. A small sketch after this list ties these elements together.
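Here is a minimal sketch tying policy, reward signal, value function, and state together on a made-up chain of five states. The update is a simple temporal-difference style estimate, and all constants (step size, discount, number of episodes) are illustrative assumptions rather than recommended settings.

```python
# States 0..4 form a chain; the policy always moves right, the environment
# gives a reward of 1 for reaching the last state, and a tabular value
# function is updated from the experience generated by that policy.

n_states = 5
values = [0.0] * n_states                    # value function: long-term desirability
policy = {s: +1 for s in range(n_states)}    # deterministic policy: always move right
alpha, gamma = 0.1, 0.9                      # step size and discount (assumed)

for episode in range(200):
    state = 0
    while state < n_states - 1:
        action = policy[state]
        next_state = min(state + action, n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # reward signal
        # temporal-difference style update of the value estimate
        target = reward + gamma * values[next_state]
        values[state] += alpha * (target - values[state])
        state = next_state

print([round(v, 2) for v in values])   # states closer to the goal get higher values
```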
Note
One might think that evolutionary algorithms and methods such as genetic programming are alternative ways of solving the same optimization problem. However, they do not learn while interacting with the environment, and in most complex cases they cannot sense the complete state of the environment. Instead, there are multiple static policies (multiple generations with different genes/chromosomes), each interacting with the environment separately. The policies that obtain the most reward (a high fitness value) are selected to produce the next generation of policies. Here we are neither learning nor estimating a value function, and an evolutionary method ignores the fact that the policy being searched for is a function from states to actions; it does not notice which states an individual passes through.