Models output a probability distribution over their predictions, regardless of the input
The predicted probabilities are not a confidence score
Uncertainty estimation gives us a confidence measure for the model's prediction in addition to the probability distribution
Types of uncertainty
Data uncertainty (Aleatoric uncertainty) - very similar inputs that have very different outputs
Data uncertainty is irreducible
All types of models (e.g. image classification, regression, etc.) have data uncertainty
Model uncertainty (Epistemic uncertainty) - no samples in the training data that are similar to the input – input points that lie outside the training distribution
This can be reduced by adding more data to regions of the input space with high model uncertainty
Estimating aleatoric uncertainty
Goal: learn a variance for each input, capturing how noisy the data is at that point
Higher variance = more noise
This variance is a function of the input x
Train a model that also outputs a variance alongside its prediction
We can use a loss function that takes the predicted variance into account (e.g. a Gaussian negative log-likelihood), as in the sketch below
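A minimal sketch of this idea, assuming a small PyTorch regression network that outputs a mean and a log-variance per input (the architecture and names here are illustrative, not from the notes):

```python
import torch
import torch.nn as nn

class HeteroscedasticNet(nn.Module):
    """Regression network that predicts a mean and a log-variance per input."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.mean_head = nn.Linear(hidden_dim, 1)
        self.log_var_head = nn.Linear(hidden_dim, 1)  # predict log-variance for numerical stability

    def forward(self, x):
        h = self.backbone(x)
        return self.mean_head(h), self.log_var_head(h)

def gaussian_nll(mean, log_var, target):
    """Negative log-likelihood of a Gaussian with predicted mean and variance."""
    return (0.5 * torch.exp(-log_var) * (target - mean) ** 2 + 0.5 * log_var).mean()

# Usage sketch: the loss lets the model assign high variance to noisy regions of the data.
model = HeteroscedasticNet(in_dim=8)
x, y = torch.randn(32, 8), torch.randn(32, 1)
mean, log_var = model(x)
loss = gaussian_nll(mean, log_var, y)
loss.backward()
```

The exp(-log_var) factor down-weights the squared error wherever the model predicts high noise, while the log_var term keeps the predicted variance from growing without bound.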
Estimating epistemic uncertainty
What if we train the same network multiple times (an ensemble of networks) and compare outputs?
All networks in the ensemble share the same architecture and hyperparameters, differing only in their random initialization
This works because for familiar inputs the networks should converge on similar outputs
For unfamiliar inputs, the networks should diverge on outputs
Training and running an entire ensemble is computationally expensive (see the sketch below)
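A minimal sketch of ensemble-based epistemic uncertainty, assuming a few independently trained copies of the same small regression network (illustrative names; in practice each copy is trained separately on the full dataset):

```python
import torch
import torch.nn as nn

def make_net(in_dim=8, hidden_dim=64):
    """Same architecture and hyperparameters each time; only the random init differs."""
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

# In practice each network would be trained separately on the same data.
ensemble = [make_net() for _ in range(5)]

def ensemble_predict(nets, x):
    """Mean prediction plus the spread (disagreement) across the ensemble."""
    with torch.no_grad():
        preds = torch.stack([net(x) for net in nets])  # shape: (n_nets, batch, 1)
    return preds.mean(dim=0), preds.var(dim=0)  # variance across members ~ epistemic uncertainty

x = torch.randn(16, 8)
mean, epistemic_var = ensemble_predict(ensemble, x)
```

The variance across members is largest exactly where the networks disagree, i.e. on unfamiliar inputs.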
Instead, we can use dropout layers to introduce stochasticity into the model
We’ve seen dropout used during training to prevent overfitting, then turned off at test time
We can keep dropout active at inference time and run multiple stochastic forward passes (Monte Carlo dropout) to estimate epistemic uncertainty, as in the sketch below
This sampling can still be pretty expensive
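A minimal sketch of Monte Carlo dropout, assuming a PyTorch network whose dropout layers are kept active at inference by leaving the model in train mode (names are illustrative):

```python
import torch
import torch.nn as nn

# A small regression net with a dropout layer between the hidden and output layers.
net = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=50):
    """Run several stochastic forward passes with dropout still active;
    the spread across samples approximates epistemic uncertainty."""
    model.train()  # keeps dropout enabled even though we are doing inference
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)

x = torch.randn(16, 8)
mean, epistemic_var = mc_dropout_predict(net, x)
```

Each forward pass drops a different random subset of units, so the samples behave like an implicit ensemble drawn from one trained network.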
Another way to estimate epistemic uncertainty is to use reconstruction error to measure how confident the model is in a prediction: inputs it cannot reconstruct well are likely outside the training distribution
This is just an autoencoder or VAE!!
We then have to train a whole decoder to test this…which is also expensive (see the sketch below)
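A minimal sketch of the reconstruction-error approach, assuming a plain autoencoder as a stand-in for the autoencoder/VAE mentioned above (names are illustrative):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encoder compresses the input to a small latent code; decoder tries to reconstruct it."""
    def __init__(self, in_dim=8, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 16), nn.ReLU(), nn.Linear(16, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_error(model, x):
    """Per-example squared error; high error suggests the input is unlike the training data."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

# After training on in-distribution data, high reconstruction error flags unfamiliar inputs.
ae = AutoEncoder()
x = torch.randn(16, 8)
scores = reconstruction_error(ae, x)
```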
One last way is to learn the uncertainty directly by placing priors on the output distribution
This is the most advanced approach and the most compute-efficient: it requires only a single forward pass, with no sampling or ensembling
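A rough sketch, under the assumption that "placing priors on the distribution" refers to an evidential-style regression head that outputs the parameters of a Normal-Inverse-Gamma prior; the closed-form uncertainty expressions below follow that formulation, and all names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    """Outputs the four parameters of a Normal-Inverse-Gamma prior over the target."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.out = nn.Linear(in_dim, 4)

    def forward(self, h):
        gamma, log_nu, log_alpha, log_beta = self.out(h).chunk(4, dim=-1)
        nu = F.softplus(log_nu)              # > 0
        alpha = F.softplus(log_alpha) + 1.0  # > 1 so the variances below stay finite
        beta = F.softplus(log_beta)          # > 0
        return gamma, nu, alpha, beta

def uncertainties(gamma, nu, alpha, beta):
    """Closed-form aleatoric and epistemic uncertainty from the prior parameters."""
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return aleatoric, epistemic

# Single forward pass: no sampling and no ensemble needed.
backbone = nn.Sequential(nn.Linear(8, 64), nn.ReLU())
head = EvidentialHead(64)
x = torch.randn(16, 8)
gamma, nu, alpha, beta = head(backbone(x))
aleatoric, epistemic = uncertainties(gamma, nu, alpha, beta)
```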