CNNs (convolutional neural networks) are the core model architecture for computer vision
Steps
Convolution: apply filters to generate feature maps
Non-linearity: apply a non-linear activation function (often ReLU)
Pooling: downsample the feature maps to reduce spatial dimensionality (and downstream computation)
Feed the features into a fully-connected neural network to classify the image – we can use the standard
fully-connected architecture here since the image has already been distilled into features and is no longer a raw 2D input
Computing a neuron in the hidden layer:
Apply the filter to a patch of the input
Compute the weighted sum of that patch
Add a bias to the weighted sum
Apply the non-linearity (see the sketch after this list)
For CNNs this is generally ReLU, which is just max(0, x) applied elementwise
Each neuron then only sees a portion of the input (its receptive field)
This is important because it allows the model to scale – it doesn't need a separate weight for every pixel of the input
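A minimal NumPy sketch of this computation – the function name conv2d_relu, the shapes, and the bias value are all made up for illustration:

```python
import numpy as np

def conv2d_relu(image, kernel, bias):
    """Slide one filter over a 2D image: weighted sum + bias + ReLU at each location."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]          # the portion this neuron "sees"
            out[i, j] = np.sum(patch * kernel) + bias  # weighted sum, then bias
    return np.maximum(out, 0)                          # ReLU: max(0, x) elementwise

image = np.random.rand(8, 8)
kernel = np.random.randn(3, 3)   # a 3x3 filter needs only 9 weights, whatever the image size
print(conv2d_relu(image, kernel, bias=0.1).shape)   # (6, 6) with stride 1, no padding
```

Note the scaling point: the filter's 9 weights are reused at every location, instead of one weight per input pixel.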
A single convolutional layer may have multiple filters
We should think of the layer as outputting a 3D volume (a stack of 2D feature maps, one per filter)
The output volume has shape h × w × d, where the depth d is the number of filters
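For instance, a quick shape check using PyTorch's nn.Conv2d (the sizes here are arbitrary; note PyTorch orders the depth axis first, as d × h × w):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)  # 16 filters
x = torch.randn(1, 3, 32, 32)   # a batch of one 32x32 RGB image
print(conv(x).shape)            # torch.Size([1, 16, 32, 32]) – d=16 feature maps, each 32x32
```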
Pooling
Reduce the spatial dimensionality of the feature maps as you go through the convolutional layers
Max pooling: slide a window over each feature map and take the max value in the window (sketched below)
This reduces the dimensionality of the output while preserving the features, because the strongest activations survive
A pooling layer takes in the output of a convolutional layer and outputs a smaller volume
So the layers are structured as convolutional -> pooling -> convolutional -> pooling, etc.
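A quick shape check for max pooling (again a sketch; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # 2x2 sliding window, keep the max value
volume = torch.randn(1, 16, 32, 32)            # e.g. the output volume of a conv layer
print(pool(volume).shape)                      # torch.Size([1, 16, 16, 16]) – h and w halved, depth unchanged
```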
In a CNN, we can conceptualize this as:
The first layer detects edges
The second layer detects shapes (eyes, nose, mouth, etc.)
The third layer detects objects (faces, cars, etc.)
Each successive layer detects higher-level, more complex features
A CNN for classification has two key parts:
Feature extraction
Convolutional layers
Classification
Fully connected layers
Can do this using a softmax function – this just collapses the raw output scores into a probability distribution (sketched below)
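A minimal softmax sketch in NumPy (the example scores are made up):

```python
import numpy as np

def softmax(logits):
    """Collapse raw scores into probabilities that are positive and sum to 1."""
    exps = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw outputs of the final fully-connected layer
print(softmax(scores))               # ~[0.659 0.242 0.099], sums to 1
```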
CNNs are super powerful because they are designed to extract features – anything after that can be very flexible
What you do with the features can use any architecture
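Putting the two parts together – a hypothetical end-to-end classifier in PyTorch (layer sizes, image size, and the 10-class output are all arbitrary choices for the sketch):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    # Feature extraction: convolution -> ReLU -> pooling, repeated
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                  # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                  # 16x16 -> 8x8
    # Classification: flatten the feature volume, then fully-connected layers
    nn.Flatten(),                     # 32 * 8 * 8 = 2048 features
    nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
    nn.Linear(64, 10),                # raw class scores; softmax is applied at loss time
)

x = torch.randn(1, 3, 32, 32)
print(model(x).shape)                 # torch.Size([1, 10])
```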
Object Detection
This is both a classification problem (what is the object?) and a regression problem (where is its bounding box?)
Naive approach: draw a random box on the image and feed that box into the CNN for classification
Repeat this for many random samples
There are obviously far too many candidate boxes for this to be tractable (see the rough count below)
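A rough count makes the point – even restricting to axis-aligned boxes on pixel boundaries in a 224×224 image:

```python
from math import comb

# Choose 2 of the 225 horizontal grid lines and 2 of the 225 vertical ones
boxes = comb(225, 2) ** 2
print(f"{boxes:,}")   # 635,040,000 candidate boxes – far too many to classify one by one
```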
R-CNN algorithm: find regions we think contain objects (via a heuristic such as selective search), then use a CNN to classify each region
This is brittle and slow, since every proposed region is run through the CNN separately
Faster R-CNN algorithm: use a CNN itself (a region proposal network) to find regions likely to contain objects, then use the CNN to classify each one
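A hedged usage sketch with torchvision's pretrained Faster R-CNN (assumes a recent torchvision install; the input image here is dummy data):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two-stage detector: a region proposal network suggests boxes, a head classifies each one
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # downloads weights on first use

image = torch.rand(3, 480, 640)   # dummy RGB image with values in [0, 1]
with torch.no_grad():
    (pred,) = model([image])
print(pred["boxes"].shape, pred["scores"][:3])   # boxes are (x1, y1, x2, y2) with confidence scores
```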
All of these image-based model architectures start with the same feature-extraction steps
They can then branch off in what they do once the features are extracted