Terminology #
- Agent - An entity that takes actions
- e.g. A robot, a self-driving car, a human, etc.
- The algorithm itself is the agent
- Environment - The world that the agent interacts with
- e.g. The physical world, a video game, etc.
- Action - The behavior that the agent can take
- e.g. Moving left, moving right, jumping, etc.
- Action space A - The set of all possible actions (this may be discrete or continuous)
- Observation - How the environment responds to the agent
- Observations are reports of states and rewards
- State - A concrete/immediate situation that the agent is in at a moment
- Reward - A scalar value that the agent receives from the environment that is a measure of how well the agent is doing
- Deep Q Networks (DQN) - A neural network designed to learn the `Q` function
- Policy Function - `pi(s)` - Given a state, returns the best action to take
Examples #
- AlphaGo
- This used a supervised learning approach to bootstrap a decent agent
- Then, it used a reinforcement learning approach, playing against itself, to improve beyond human-level performance
- AlphaGo Zero
- This used a reinforcement learning approach from the start (without any human supervision)
- AlphaStar (StarCraft)
- OpenAI Five (Dota 2)
Deep Reinforcement Learning (Deep RL) #
- The key idea of reinforcement learning is to learn without any human supervision, by interacting with the environment and learning from the outcomes of those interactions
Classes of Learning Problems #
- Supervised learning
- We are given input data, and labels y
- The goal is to learn a function that maps the input data to the labels
- This is supervised because we are providing the labels
- Unsupervised learning
- We are given input data, but no labels
- The goal is to learn the underlying structure of the data
- This is unsupervised because we are not providing the labels
- Reinforcement Learning
- Given data in state-action pairs
- States - observations that the neural network can see
- Actions - behaviors that the agent can take based on the states
- Rewards - The goal is to maximize the future rewards over time steps
Reinforcement Learning Key Concepts #
- Total Reward (`R_t`) - The sum of all rewards over time
- Discounted Sum - A discount factor `gamma` is defined that discounts rewards achieved further in the future
- This enforces some short-term greediness in the algorithm
- The `Q` function - this is the foundation of value-based algorithms
- The goal of this function is, given a state, to find the best action to take
- It captures the expected total future reward for taking action `a` in state `s`: `Q(s, a) = E[R_t | s, a]`
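As a rough illustration of the discounted sum, here is a minimal Python sketch; the `gamma` value and the reward list are made-up examples, not values from the notes above:

```python
# Minimal sketch of a discounted return: R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# gamma and the reward list are made-up example values.
def discounted_return(rewards, gamma=0.99):
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# A reward far in the future counts for less than an immediate one.
print(discounted_return([1.0, 0.0, 0.0, 5.0]))  # ~5.85 instead of 6.0
```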
Deep Reinforcement Learning Algorithms #
- Value Learning
- Determine the `Q` function via deep learning - value learning uses deep learning to estimate the `Q` function
- Use the `Q` function to determine the best action to take
- Policy Learning (Policy Gradients)
- Try to directly optimize the policy - what action to take in a given state
- If we can directly optimize the policy, we don’t need to learn the `Q` function and can directly take the action
More on the Q function #
- It can be very difficult even for humans to accurately estimate Q-values
- Deep Q Networks (DQN)
- Naive approach: take in a state (e.g. a screenshot of the environment) and an action, and output a single number - the expected `Q` value for that state-action pair
- This would require us to run the network once for each action in the action space
- Instead, we can have the network output a vector of `Q` values, one for each action in the action space
- This is called the action-value function
- This is the approach used in DQNs
- So, to recap, a DQN
- Takes the state as input
- Outputs a vector of `Q` values, one for each action in the action space
- The policy function can thus be defined as `argmax_a Q(s, a)`
- This is the action that maximizes the `Q` value for a given state
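A minimal PyTorch sketch of this recap, assuming made-up sizes (`state_dim = 4`, `num_actions = 2`) and a small fully connected network; this is an illustration, not a full DQN implementation:

```python
import torch
import torch.nn as nn

# Sketch of a DQN: map a state to a vector of Q values, one per discrete action.
state_dim, num_actions = 4, 2   # made-up sizes for illustration

q_network = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),
)

state = torch.randn(1, state_dim)         # stand-in for a preprocessed observation
q_values = q_network(state)               # shape: (1, num_actions)
action = torch.argmax(q_values, dim=1)    # policy: argmax_a Q(s, a)
```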
DQN Networks #
- The loss function for training a DQN
- The target is the immediate reward plus the discounted sum of the best future action rewards: `r + gamma * max_a' Q(s', a')`
- Take the squared difference between this target `Q` value and the `Q` value the network actually predicted: `(r + gamma * max_a' Q(s', a') - Q(s, a))^2`
- After selecting an action, the action gets sent to the agent, which executes it in the environment
- The environment transitions to a new state `S'`
- The environment provides a reward signal `R` (a scalar value)
- The agent uses this data to update the `Q` function
- This happens for many iterations until a proper `Q` function is learned
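A sketch of the loss and update described above, reusing the `q_network` idea from the previous snippet; the single-transition setup, tensor shapes, and `gamma` are assumptions for illustration (a real DQN also uses a separate target network and a replay buffer, both omitted here):

```python
import torch
import torch.nn.functional as F

gamma = 0.99  # made-up discount factor

def dqn_loss(q_network, state, action, reward, next_state):
    # state, next_state: (batch, state_dim); action: LongTensor (batch, 1); reward: (batch, 1)
    q_pred = q_network(state).gather(1, action)       # Q(s, a) the network predicted
    with torch.no_grad():                             # the target is treated as a constant
        best_next = q_network(next_state).max(dim=1, keepdim=True).values
        target = reward + gamma * best_next           # r + gamma * max_a' Q(s', a')
    return F.mse_loss(q_pred, target)                 # squared difference
```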
- Downsides of Q-learning
- It’s good for handling discrete action spaces, but not continuous action spaces
- The policy function is deterministic
- This means that the same state will always result in the same action
- This is not good for exploration and having the agent try new things
- This is not good for stochastic environments
- To address these shortcomings, policy gradient algorithms (policy learning) try to build a neural network to directly learn policy functions from the data
Policy Gradient Algorithms #
- Goal: Train a network that, given a state, outputs a probability distribution over actions - the probability that each action is the best one to take
- We don’t care about the value of each action (like in a Q function); we only care about whether it’s the best action to take
- This is why we’re learning a policy function, not a value function like in DQNs
- Because this outputs a probability distribution, we can sample from this distribution to get the action to take
- This allows for stochasticity in the policy function
- Probability distributions allow this to be used in a continuous action space
- The output probability distribution of the policy function must sum to 1
- This helps make it easier to train the network
- For a continuous action space, the output of the policy network is no longer a vector of probabilities
- Instead, the output is the mean and variance of a distribution over actions (see the sketch below)
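A minimal sketch of such a policy network for a continuous action space, using a Gaussian distribution whose mean and (log) standard deviation are produced by the network; the sizes here are made up for illustration:

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 1   # made-up sizes (e.g. a single steering angle)

class GaussianPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, action_dim)      # mean of the action distribution
        self.log_std_head = nn.Linear(64, action_dim)   # log standard deviation

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mean_head(h), self.log_std_head(h).exp())

policy = GaussianPolicy()
dist = policy(torch.randn(1, state_dim))
action = dist.sample()             # stochastic: the same state can yield different actions
log_prob = dist.log_prob(action)   # used later in the policy gradient loss
```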
Training Policy Gradients - Case Study #
- Case study: self-driving cars (simplified example)
- Agent: The car
- State: Camera, lidar, etc.
- Action: Steering angle
- Reward: Distance traveled
Training Algorithm
Training in policy gradients requires both data collection and optimization
1. Initialize the agent
2. Run a policy until termination
3. Record all states, actions, and rewards that were taken until termination
4. Training step: decrease the probability of actions that resulted in low rewards (actions taken closer to the crash)
5. Increase the probability of actions that resulted in high rewards (actions taken further from the crash)
6. Repeat this over and over again until the agent converges
- This is super interesting because the car learned on its own - it never learned about the physics of the car, what a road is, etc.
- How can we formalize steps #4 and #5?
- Loss: `loss = -log(pi(a | s)) * R_t`
- `pi(a | s)` - The probability of taking action `a` given state `s`
- `R_t` - The total (discounted) reward
- This is designed so that actions with high rewards but low probabilities produce a large loss, pushing the network to increase their probability
- A standard gradient algorithm can be used to optimize this loss
- If you are optimizing the policy function directly, you can equivalently use gradient ascent to maximize the probability of taking the best actions
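A sketch of this loss over one recorded episode, assuming `log_probs` is a list of scalar log-probability tensors from the policy (e.g. from the Gaussian policy sketch above) and `rewards` is the recorded per-step reward list; `gamma` is a made-up value:

```python
import torch

gamma = 0.99  # made-up discount factor

def policy_gradient_loss(log_probs, rewards):
    # Compute the discounted return R_t for each time step, working backwards.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    returns = torch.tensor(returns)

    log_probs = torch.stack(log_probs)     # one scalar log pi(a|s) per time step
    return -(log_probs * returns).sum()    # loss = -log(pi(a|s)) * R_t, summed over the episode
```

Minimizing this loss with a standard optimizer increases the probability of actions that were followed by high returns.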
- In real life, we can’t run a policy until termination (e.g. if it’s a car)
- One way around this is to try and train this in simulation
- Modern simulation environments do not fully capture the phenomena of the real world
- This is called the reality gap
- This is a huge problem in robotics
- This is why we can’t just train a policy in simulation and expect it to work in the real world
- Photo-realistic simulators (like VISTA from MIT) use real photos to close the sim-to-real gap
- This approach was used to train a self-driving car that worked in the real world
RLHF #
- Writing a loss function to capture what is “better” in an LLM response is extremely difficult
- What if we could use human preferences to help with that loss function instead?
- RLHF uses methods from reinforcement learning to directly optimize a language model with human feedback
- Once we have an LLM, the next step in RLHF is to train a reward model
- This is how human preferences are integrated into the system
- Reward model
- The goal is to train a model that takes in a sequence of text and returns a scalar reward which should numerically represent human preference
- There are multiple ways to implement this but a scalar output is required
- e.g. This could be another language model fine-tuned or trained from scratch on the preference data
- Training the reward model (RM)
- Human annotators rank the text outputs from the LLM
- These rankings are relative to other outputs to remove noise, differences in ranking amongst annotators, etc.
- The head-to-head Elo system is most commonly used for this
- These rankings are normalized into a scalar reward signal for training
- Now that we have an initial LLM to generate text, and a reward model to rank it, we can use reinforcement learning to optimize the original language model with respect to the reward model
- In this finetuning, a large number of parameters are frozen, and a policy gradient algorithm, Proximal Policy Optimization (PPO), is used
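One common way to turn these pairwise rankings into a scalar-output reward model is a pairwise ranking loss; this is a sketch under that assumption, where `reward_model` is a hypothetical module mapping a tokenized sequence to a scalar score (not necessarily what any specific RLHF pipeline uses):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, preferred_tokens, rejected_tokens):
    # Scalar scores for the human-preferred and the rejected output.
    r_preferred = reward_model(preferred_tokens)
    r_rejected = reward_model(rejected_tokens)
    # Push the preferred score above the rejected score (pairwise ranking loss).
    return -F.logsigmoid(r_preferred - r_rejected).mean()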
- Formalized definitions for RL in LMs
- Policy - A language model that takes in a sequence of text and outputs another sequence of text
- Action space - All the tokens of the language model
- Observation space - The distribution of possible input sequences (quite large)
- Reward function - A combination of the preference model and a constraint on policy shift
- Tying this all together
- Given a prompt `x`, text `y` is generated
- Concatenate `x` and `y` to get the full text, and feed this to the reward model for evaluation
- Per-token probability distributions from the in-progress finetuned model and the original model are compared to ensure the RL isn’t making the RLHF-finetuned model stray too far from the original model
- This ensures the LM doesn’t output gibberish that fools the reward model
- The update rule is the parameter update from PPO that maximizes the reward metric on the given batch of data
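As a rough sketch of how the reward model score and the policy-shift constraint can be combined for the PPO step: assume `rm_score` is the reward model's scalar output and the log-probability tensors cover the generated tokens; `beta` and the function name are placeholders, not the exact formulation from any particular implementation:

```python
import torch

beta = 0.1  # made-up weight on the policy-shift penalty

def rlhf_reward(rm_score, policy_logprobs, reference_logprobs):
    # Penalize drift of the finetuned model away from the frozen original model
    # (an approximate KL penalty summed over the generated tokens).
    kl_penalty = (policy_logprobs - reference_logprobs).sum()
    return rm_score - beta * kl_penalty

# Example with made-up numbers: a decent RM score, but some drift from the original model.
reward = rlhf_reward(torch.tensor(2.0), torch.tensor([-1.1, -0.7]), torch.tensor([-1.3, -0.9]))
```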
Downsides #
- Have to train two separate models
- This is complex and unstable as you don’t want the finetuned LM to drift far from the original LM
Miscellaneous Notes #
- RL is very useful because you can run thousands of simulations in parallel
- This allows you to get to a better solution faster