features are not just visual, they are often quantitative, abstract, etc.
Hidden layer - A layer of neurons that sits between the input layer and the output layer
Forward Propagation - Input data is fed into a neural network and used to compute an output; that output may be passed to further layers to compute the final output
Feed Forward - Describes the architecture of a neural network where information moves in one direction and there are no cycles
Hyperparameters - Parameters that determine the network structure and how the network is trained
Examples: learning rate, number of hidden layers, number of epochs, batch size, etc.
Layer - A collection of neurons that are connected to the same inputs and output
In a feed forward model this is just one step of the feed forward process (inputs -> layer 1 -> layer 2 -> output)
Neuron - Represents one unit in a neural network – it takes inputs, applies a function to them, and produces an output
A perceptron is a specific type of neuron in a neural network
LSTMs - Long short-term memory – a type of RNN that uses gated cells to remember relevant information and forget irrelevant information
over time
These are designed so long-term dependencies can be tracked and backpropagation through time can be done without vanishing gradients
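A minimal PyTorch sketch of an LSTM layer processing a batch of sequences (the sizes here are arbitrary illustration values):

```python
import torch
import torch.nn as nn

# Toy LSTM: 10 input features per timestep, 32 hidden units (arbitrary sizes)
lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)

x = torch.randn(4, 20, 10)       # batch of 4 sequences, 20 timesteps, 10 features each
output, (h_n, c_n) = lstm(x)     # output holds the hidden state at every timestep

print(output.shape)  # torch.Size([4, 20, 32])
print(h_n.shape)     # torch.Size([1, 4, 32]) - final hidden state
```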
Fully Connected Neural Network - A neural network where each neuron in a hidden layer is connected to every neuron in the next layer
Dense Layer - A layer where each neuron receives input from every neuron in the previous layer
Optimizer - An algorithm that is used to update the weights of the neural network
Examples: gradient descent, stochastic gradient descent, etc. Many optimization algorithms use gradient descent as a backbone, but not all do
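A rough sketch of how an optimizer is used in a PyTorch training step (the model and data are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent
loss_fn = nn.MSELoss()

x, y = torch.randn(8, 3), torch.randn(8, 1)               # dummy batch

optimizer.zero_grad()        # clear gradients from the previous step
loss = loss_fn(model(x), y)  # measure how far off the predictions are
loss.backward()              # backpropagation: compute gradients of the loss w.r.t. the weights
optimizer.step()             # update the weights using the gradients
```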
Residual Neural Network - A neural network that uses skip connections to jump over layers
Autoregressive - Autoregressive models take in historical state data as input to predict future state
This is different from models that just take in a fixed set of defined feature inputs
Many stock market models are autoregressive
ChatGPT is autoregressive because the next token in a sequence generation depends on the previous tokens
The output from one iteration is used as the input into the next
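A toy sketch of the autoregressive loop, where each output is appended to the history and fed back in as input (predict_next is a dummy stand-in for a trained model):

```python
def predict_next(history):
    # dummy stand-in for a trained model
    return history[-1] + 1

history = [1, 2, 3]                        # historical state data
for _ in range(5):
    next_value = predict_next(history)     # predict the next state from past states
    history.append(next_value)             # the output becomes part of the next input

print(history)  # [1, 2, 3, 4, 5, 6, 7, 8]
```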
LoRA - Low-Rank Adaptation of Large Language Models
A training method that accelerates the training of large models while consuming less memory
This essentially freezes the pre-trained weights and injects trainable parameters into each layer of the architecture
This is critical, since redeploying a full new set of weights for every fine-tuned 70B+ parameter model is not feasible
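A hedged sketch using Hugging Face's peft library; the model name and target_modules below are illustrative and model-specific:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a large pre-trained model

config = LoraConfig(
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling factor applied to the LoRA updates
    lora_dropout=0.05,
    target_modules=["c_attn"],   # which submodules get adapters (model-specific)
)

model = get_peft_model(base, config)  # base weights frozen, small trainable adapters injected
model.print_trainable_parameters()    # only a tiny fraction of parameters are trainable
```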
Emergent abilities - Functions that the model develops that aren’t explicitly in the model’s training objective
Policy Learning - A reinforcement learning technique where the agent learns to optimize the decision making policy directly
without ever having to learn the Q function
Direct Preference Optimization (DPO) - A newer approach for aligning models to human preferences during fine-tuning, as an alternative to RLHF
Residual Connections - Basically a skip connection, where the output from one layer can skip over layers in the network
Layer Normalization - A layer that normalizes the features so they share the same distribution
e.g. in a student loan model, age and value could be features that are on very different scales
Gradient descent algorithms take much longer to converge in training when input features are not all on the same scale (intuitive, because of learning rate)
But the raw input features aren’t the only inputs in a model; each hidden layer receives the previous layer’s outputs, which can also end up on very different scales
A normalization layer ensures the outputs of each hidden layer are normalized to a standard mean and std deviation
Generally this will normalize them to a mean of 0 and a standard deviation of 1
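A small numpy sketch of what a normalization layer does to one example's activations (real implementations add a small epsilon and learnable scale and shift parameters):

```python
import numpy as np

# Activations of one hidden layer for a single example (arbitrary values on very different scales)
h = np.array([120.0, 0.5, -3.0, 45.0])

# Layer normalization: normalize across this example's features
normalized = (h - h.mean()) / h.std()

print(round(normalized.mean(), 6))  # ~0
print(round(normalized.std(), 6))   # ~1
```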
To train a model, you have to tell the model how far off its predictions are from the true values
This is defined as the loss of the model
The loss function is a continuous function used to calculate the loss (also called the cost function)
Loss functions look over the entire dataset and calculate the loss for each data point
The loss function measures the error between the actual and predicted outcomes, and training aims to minimize that error
You want to minimize the loss on average across the entire dataset
The goal is to find the weights that achieve the lowest loss on average
A loss function is just a function of the set of weights
Conceptually, you can plot the total loss for each set of weights
You can then choose some optimization algorithm to find the set of weights that minimizes the loss function
Gradient descent is one such algorithm
How do you compute the gradient?
Backpropagation
Training is done by computing the gradient and multiplying it by a learning rate (i.e. how big of a step to take in the direction of the negative gradient)
The learning rate is a hyperparameter that you can tune
Setting the learning rate is important because if you set it too high, you can overshoot the minimum and if you set it too low, it will take a long time to converge
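A minimal numpy sketch of gradient descent on a single weight with a fixed learning rate (the data and values here are made up):

```python
import numpy as np

# Tiny made-up dataset: y is roughly 2*x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w = 0.0        # single weight to learn
lr = 0.01      # learning rate (hyperparameter)

for step in range(200):
    pred = w * x
    loss = np.mean((pred - y) ** 2)      # average loss over the whole dataset
    grad = np.mean(2 * (pred - y) * x)   # gradient of the loss with respect to w
    w -= lr * grad                       # step against the gradient, scaled by the learning rate

print(w)  # close to 2
```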
How do you set the learning rates?
Try out a bunch of learning rates and see what works best
Design an adaptive learning rate that “adapts” to the landscape
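Adam is a common example of an adaptive method; a rough PyTorch sketch (the model is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)  # placeholder model

# Fixed step size: plain stochastic gradient descent
sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# Adaptive step sizes: Adam scales each parameter's update based on its gradient history
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```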
A gradient (over the whole dataset) is very expensive to compute for a large neural network
Instead, you can pick a single point from your dataset and compute the gradient at that point
This is called stochastic gradient descent because it is a random point on the dataset (it’s noisy but easy to compute)
Alternatively, you can sample a “mini-batch” of points from the dataset and compute the gradient over that batch
10s or 100s of examples in your dataset
This is called mini-batch gradient descent and is a happy medium between stochastic and full gradient descent
This is the most common way to train neural networks
More accurate gradient calculations lead to smoother convergence and a larger learning rate (since we can trust the gradient more)
This is parallelizable because the work can be split across many batches
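A numpy sketch of the mini-batch idea, continuing the toy example above (the data and batch size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Larger made-up dataset: y is roughly 2*x plus noise
x = rng.uniform(0, 10, size=1000)
y = 2 * x + rng.normal(0, 0.5, size=1000)

w, lr, batch_size = 0.0, 0.001, 32

for step in range(500):
    idx = rng.choice(len(x), size=batch_size, replace=False)  # sample a mini-batch
    xb, yb = x[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)   # gradient estimated on the batch only
    w -= lr * grad

print(w)  # close to 2
```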
Overfitting
When a model is too complex and it starts to memorize the training data
You want to build models that generalize well to new data and don’t just mimic the training data
To address this problem you can introduce regularization to penalize the model for being too complex
Regularization 1: Dropout – randomly set some activations to 0 on some iterations of the training
This forces the model to not rely on any one node and instead use the entire network with multiple pathways (see the sketch after the early-stopping notes below)
Regularization 2: Early stopping
Stop training before we have a chance to overfit
Set some of the training data aside as a validation set and use it to monitor how well the network is doing on non-training data
When the training loss keeps decreasing but the loss on the validation set starts increasing, you stop training
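A rough PyTorch sketch combining both ideas, dropout inside the model and early stopping on validation loss (the data here is random placeholder data, so nothing meaningful is actually learned):

```python
import torch
import torch.nn as nn

# Toy model with dropout between layers (sizes are arbitrary)
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zero out 50% of activations during training
    nn.Linear(64, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy train/validation split
x_train, y_train = torch.randn(800, 20), torch.randn(800, 1)
x_val, y_val = torch.randn(200, 20), torch.randn(200, 1)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()                       # dropout active
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()                        # dropout disabled for evaluation
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    # Early stopping: stop once validation loss stops improving
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```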
Modern deep networks can perfectly fit to random data
This means that models don’t necessarily generalize well beyond the training data
The network is just generating a function that fits the training data – there is no guarantee that the function
structure beyond the training data is meaningful
It could be an arbitrary function that just happens to fit over the training data
The network is only as good as the training data it is given
Algorithmic bias
Computationally expensive…
These are problems that ongoing research can address
Difficult to encode structure and prior human knowledge into models
Alignment vs. capability is similar to accuracy vs. precision
Capability - How well a model can maximize its objective function (e.g. a model designed to maximize the accuracy of its stock market predictions)
Alignment - How well the model does what we want it to do, as opposed to what it was trained to do
For example, a model designed to predict bird types may maximize its objective function extremely well, but that objective function may not actually correspond to making good predictions in the way we care about
Models like the original GPT-3 are misaligned – they can produce large amounts of text but not be consistent with human intent
These models’ objective function is a probability distribution over tokens (capability), but they are expected to perform cognitive reasoning (alignment)
ChatGPT was trained with human feedback (RLHF) to better align its output with human intent
ChatGPT was the first model to put RLHF into production
Alignment also includes producing non-toxic and unbiased responses
Supervised fine tuning (SFT) is a technique to train a model by fine-tuning it on a set of high-quality output data labeled by humans
This is the standard fine tuning process
The huggingface transformers library is the most popular way to fine-tune models
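A minimal sketch of one fine-tuning step with the transformers library, using gpt2 as a stand-in model and a single hand-written example (a real SFT run would loop over a full labeled dataset for many steps):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the model being fine-tuned
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# One human-written, high-quality example (a placeholder for a real labeled dataset)
text = "Question: What is an epoch?\nAnswer: One full pass over the training data."
batch = tokenizer(text, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# For causal LM fine-tuning, the labels are the input tokens themselves
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()   # backpropagate the language-modeling loss
optimizer.step()          # one parameter update; a real run iterates over the whole dataset
```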
Distilled supervised fine tuning (dSFT) is a technique to train a model on a smaller dataset by using a larger model as a teacher
The teacher model is used to generate labels for the smaller dataset
The smaller dataset is then used to train the smaller model
This is a form of transfer learning, but the student model never reaches the performance of the teacher model and is not aligned
In general, training on human feedback > dSFT
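A rough sketch of the dSFT labeling step described above, with gpt2-large purely as a stand-in for a much larger teacher model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2-large" stands in for the larger teacher model
teacher_tok = AutoTokenizer.from_pretrained("gpt2-large")
teacher = AutoModelForCausalLM.from_pretrained("gpt2-large")

prompt = "Explain what a hidden layer is."          # an unlabeled prompt

# Step 1: the teacher generates an answer, which becomes the label
inputs = teacher_tok(prompt, return_tensors="pt")
generated = teacher.generate(**inputs, max_new_tokens=50)
label_text = teacher_tok.decode(generated[0], skip_special_tokens=True)

# Step 2: the (prompt, label_text) pairs are then used for supervised fine-tuning
# of the smaller student model, exactly as in the SFT sketch above
print(label_text)
```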
One epoch in model training is when all the training data is used once
This gives each piece of training data an ability to update the model parameters
One limitation of this study is that GPT-4 was used as the evaluator for the benchmarks and
it’s known to be biased towards models derived from itself
Loss function - The mathematical function that measures the error between the actual and predicted outcomes; training minimizes this error