CNN Layers
A nice summary: YouTube
Detects patterns in images
With each layer, the number of filters must be specified
Each layer receives an input, transforms it, and passes the output on to the next layer
Remember that in PyTorch (for a plain NN), we just convert the image to a 1D vector. The model then treats this as a simple vector of numbers, when in reality a picture is a 2D grid with different things in different places of the grid. A plain NN is not aware of this, which matters a lot when analyzing images, since pixels that are close together in the grid tend to be more related than pixels that are far apart.
e.g. The only reason an NN was okay for the MNIST handwriting dataset is that each digit in an MNIST image sits in essentially the same position, so spatial awareness isn't a factor for classification in this dataset.
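A quick illustration of this difference (a minimal sketch, assuming an MNIST-style 28x28 grayscale image):

```python
import torch

# A toy 28x28 grayscale image (the shape MNIST uses)
image = torch.rand(1, 28, 28)

# For a plain NN we flatten it into a 1D vector of 784 numbers,
# losing the information about which pixels were neighbours
flat = image.view(-1)
print(flat.shape)   # torch.Size([784])

# A CNN keeps the 2D grid (plus a channel dimension), so spatial
# relationships between nearby pixels are preserved
print(image.shape)  # torch.Size([1, 28, 28])
```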
Does a hidden node need to be connected to every input node like above?
Not really. We could split it into sections (indicated by the colours) and have it connected like so instead:
This way, there are far fewer connections and thus fewer calculations to make during training. Each node is responsible for finding patterns in its own spatial region, as opposed to every node being responsible for finding all patterns in all spatial regions (which is what makes the fully connected NN so redundant).
Note: This is the meaning of sparsely/locally vs. fully connected layers. CNNs use sparsely connected layers while plain NNs use fully connected layers. We can also easily expand the number of patterns we want to find by introducing more hidden nodes like below.
If we were to fully connect that many neurons per layer, that's a lot of operations
CNNs extract features of images, then convert the image into a lower dimension without losing its characteristics, which decreases the number of parameters the model needs to compute
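A quick sanity check of that parameter saving (the layer sizes below are arbitrary examples, not from any specific model):

```python
import torch.nn as nn

# Fully connected layer: flattened 28x28 image -> 128 hidden nodes
fc = nn.Linear(28 * 28, 128)

# Convolutional layer: 16 filters of size 3x3 on a 1-channel image
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 100480 (784*128 weights + 128 biases)
print(count(conv))  # 160    (16 filters * 3*3*1 weights + 16 biases)
```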
Remember that 1 image can be represented by a matrix/2D tensor
Each convolution applied on image creates a new matrix/2D tensor
These tensors are stacked into 1 3D tensor.
e.g. Stacking a convolution that detects horizontal lines on top of a convolution that detects vertical lines, etc.
Result: The image in 3D.
Moving horizontally across the first tensor of the 3D tensor = moving horizontally across the image
Moving vertically down the first tensor of the 3D tensor = moving vertically down the image
Moving from first tensor → second tensor → third tensor of the 3D tensor = moving from one convolution output to another (this is the channel dimension)
Each set of convolutions applied at the same time is a layer.
Layer 1: Takes raw pixel intensities and translates them into a 3D tensor indicating where vertical/horizontal lines occur
Layer 2: Takes the map from layer 1 as input and convolves it with more filters to find more complex patterns
This continues on through the deeper layers
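In PyTorch this stacking falls out of the channel dimension automatically (the filter counts below are arbitrary examples):

```python
import torch
import torch.nn as nn

image = torch.rand(1, 1, 28, 28)          # (batch, channels, height, width)

layer1 = nn.Conv2d(1, 8, kernel_size=3)   # 8 filters -> 8 stacked feature maps
layer2 = nn.Conv2d(8, 16, kernel_size=3)  # takes layer 1's 3D output as its input

out1 = layer1(image)
out2 = layer2(out1)
print(out1.shape)  # torch.Size([1, 8, 26, 26])  - channel dim = # of filters
print(out2.shape)  # torch.Size([1, 16, 24, 24])
```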
See Input Layer Notes
High-Level Idea
Feature extractor
Detects patterns (edge, corner, circle, square)
Produced by applying a series of filters to an image, which produces one image (feature map) per filter
These images are stacked to form a convolutional layer with depth = # of filters applied
Makes the image "deeper" - look at the image above and see that the rectangular prisms are getting deeper and skinnier (skinnier due to pooling)
In the OpenCV notebook, we defined our own weights in the kernel. However, neural networks learn what the best weights are as they train.
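For example (a minimal sketch), a PyTorch conv layer's kernel weights start out random and are updated by backpropagation, rather than being hand-picked like in the OpenCV notebook:

```python
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3)

# The 3x3 kernel is a learnable parameter, initialized randomly
print(conv.weight.shape)          # torch.Size([1, 1, 3, 3])
print(conv.weight.requires_grad)  # True -> updated during training
```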
Uses:
Filter out irrelevant image information
Amplify distinguishing traits or object boundaries
Hyperparameter Tuning
Increasing the # of filters increases the number of patterns your network will learn
Increasing the size of filters increases the size of detected patterns
Stride
Padding
As you go deeper in the network, the filters become more complex and detailed
Usually has a ReLU activation function applied (which means this function is applied to every square of the feature map) to force all negative values to be 0
The filter/kernel (all the same thing) starts in the top-left corner of the input image and slides/convolves right across all areas of the input image (covering an FxF area at a time)
Region FxF is the receptive field
Also an array of numbers (aka the weights/parameters)
What is happening as it shifts from one area to the next?
Multiply the values in the filter with the original pixel values of the image and sum it all up
i.e. you get a single number from this entire area (hence why the dimensions shrink in the feature map)
N x N: Image dimensions
F x F: Filter dimensions
Output feature map dimensions: (N - F + 1) x (N - F + 1) (assuming stride 1 and no padding)
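One step of this, done by hand (the pixel and filter values below are made up purely for illustration):

```python
import torch

image = torch.arange(25, dtype=torch.float32).reshape(5, 5)  # toy 5x5 "image"
filt = torch.tensor([[-1., 0., 1.],
                     [-1., 0., 1.],
                     [-1., 0., 1.]])                          # 3x3 filter

# Take the 3x3 receptive field in the top-left corner,
# multiply element-wise with the filter, and sum it all up
patch = image[0:3, 0:3]
value = (patch * filt).sum()
print(value)  # a single number -> one entry of the 3x3 feature map
```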
Sharpens image
Enhances high-frequency parts of an image
Convolutions + filters are used to detect edges
Darker part → smaller pixel value
Lighter part → larger pixel value
High vs. low frequency images
High frequency images: Rapid changes in brightness (e.g. a striped shirt has black and white areas)
High frequency components in image help detect edges in image
Low Frequency images: Minimal changes in brightness (e.g. a blank white page)
Sharpens image and emphasizes edges
Enhances the high-frequency parts of the image (the parts where rapid changes in brightness occur... aka edges!)
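A sketch of this with OpenCV (the kernel is one common high-pass example and the image path is a placeholder):

```python
import cv2
import numpy as np

# High-pass kernel: weights sum to 0, so flat (low-frequency) regions
# produce values near 0 while edges produce large responses
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]], dtype=np.float32)

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)  # placeholder path
edges = cv2.filter2D(img, -1, kernel)
```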
Question: Of the four kernels, which would be best for finding and enhancing horizontal edges and lines in an image?
Solution: d) This kernel finds the difference between the top and bottom edges surrounding a given pixel.
Question: Which one is true?
There are more visual patterns that can be captured by large convolutions
There are fewer visual patterns that can be captured by large convolutions
The number of visual patterns that can be captured by large convolutions is the same as the number of visual patterns that can be captured by small convolutions?
Solution: There are more visual patterns that can be captured by large convolutions.
Anything a 2x2 convolution can capture, a 3x3 convolution can also capture, but some things a 3x3 convolution can capture, a 2x2 convolution cannot (since it's smaller)
Denotes the # of steps we move at each step of the convolution (default is 1)
Notice that size of output (feature map) is smaller than the input
We use padding when we want to maintain the input shape's dimensions.
The number in the padded squares are just 0s.
Now you can see the size of the output is the same as the input!
→ The amount of padding needed depends on the dimensions of the filter.
If we set the above stride to 2, there would be times where the filter would be outside of the image. In these cases, we have two options:
Drop the unknown values that were outside the image. However, this risks not learning any information in certain parts of the image.
Use padding
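A small check of how stride and padding change the output size (shapes assumed for a 6x6 input and 3x3 filter):

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 6, 6)

# No padding: the output shrinks (6 - 3 + 1 = 4)
print(nn.Conv2d(1, 1, kernel_size=3, padding=0)(x).shape)            # [1, 1, 4, 4]

# Padding of 1 on each side keeps the 6x6 shape for a 3x3 filter
print(nn.Conv2d(1, 1, kernel_size=3, padding=1)(x).shape)            # [1, 1, 6, 6]

# Stride 2: the filter jumps 2 pixels at a time, roughly halving the output
print(nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)(x).shape)  # [1, 1, 3, 3]
```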
A nice summary: YouTube
Reduces spatial volume of input image AFTER convolution
Used between 2 convolution layers
Applying a FC layer directly after a Conv layer is computationally expensive
In the input matrix, find the maximum value and that is your output feature value
Why do we do max pooling?
Max pooling reduces the resolution of the input, so it reduces the number of parameters and the computational load
Reducing the # of parameters also prevents overfitting
If we have an input of W x H x D, where F is the filter size and S is the stride (both hyperparameters)
Dimensions of the output after being processed by the Max Pooling Layer: W_out = (W - F)/S + 1, H_out = (H - F)/S + 1, D_out = D
Example:
Set filter size (eg. 2x2)
Set stride (eg. 2)
Find the maximum number in each 2x2 region and store it in another grid (which builds up to be the output). Keep going in the intervals set by the stride.
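The same example in PyTorch (the input values are made up):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [9., 1., 2., 0.],
                    [5., 6., 3., 4.]]]])       # shape (1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # 2x2 filter, stride 2
print(pool(x))
# tensor([[[[4., 8.],
#           [9., 4.]]]]) - the maximum of each 2x2 block
```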
Same as max pooling, but instead takes the average of the pixels that the filter covers
Typically not used for image classification since max pooling is better for edge detection
Used for smoothing images
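Same setup as the max pooling sketch above, but averaging instead:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [9., 1., 2., 0.],
                    [5., 6., 3., 4.]]]])

pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(pool(x))
# tensor([[[[2.5000, 6.5000],
#           [5.2500, 2.2500]]]]) - the mean of each 2x2 block
```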
A series of numbers (a 1D vector). It contains information describing an object's important characteristics. After the convolutional layers extract specific features (e.g. for a car, "there are wheels here"), the feature vector passes this information on so a classification can be made.
Involves weights, biases, and neurons
Connects 1 layer to neurons in another layer (shown in diagram above)
Last layer
Logistic/Sigmoid - Binary classification
Softmax - Multi-class classification
Contains the label in one-hot encoded form
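Putting the pieces together, a minimal sketch of a full CNN classifier (layer sizes and names are arbitrary examples; the softmax is left to the loss function, since PyTorch's CrossEntropyLoss applies it internally):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # feature extractor
            nn.ReLU(),                                    # zero out negative values
            nn.MaxPool2d(2, 2),                           # 28x28 -> 14x14
        )
        self.classifier = nn.Linear(16 * 14 * 14, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)              # feature vector (one 1D vector per image)
        return self.classifier(x)     # raw class scores

model = SmallCNN()
print(model(torch.rand(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
```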