Backpropagation (Reverse-Mode Differentiation)
Refer to Calculus on Computational Graphs: Backpropagation
Prerequisite Knowledge
Gradient Descent: first-order iterative optimization algorithm for finding a local minimum of the cost function
Derivatives of the cost with respect to all the parameters tell gradient descent which way to step (see the sketch after this list)
i.e. Derivatives are used to train models
Neural networks can have millions of parameters
Dependency Graphs: Depict how elements in a codebase depend on one another (methods, types, namespaces, whole assemblies)
Call Graphs: Control-flow graph of a program showing which functions call which other functions (e.g. recursion appears as cycles in the call graph)
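A minimal sketch of the gradient descent item above, assuming a made-up one-parameter cost (the function names are illustrative, not from the article):

```python
def gradient_descent(grad_fn, w, lr=0.1, steps=100):
    """Repeatedly step the parameter against the gradient of the cost."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# Toy cost f(w) = (w - 3)^2, so df/dw = 2 * (w - 3); the minimum is at w = 3.
print(gradient_descent(lambda w: 2 * (w - 3), w=10.0))  # converges near 3.0
```

In a real network the single parameter w becomes millions of parameters, and backpropagation is what supplies grad_fn cheaply.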
High Level Explanation
Technique for calculating partial derivatives quickly
Makes training with gradient descent far faster in modern deep learning neural networks
Algorithm reinvented many times
Aka reverse-mode differentiation
Computational Graphs
Broken-down way of thinking about mathematical expressions
Example: e = (a + b) ∗ (b + 1), broken into c = a + b, d = b + 1, e = c ∗ d
How to graph them (Refer to link for visualization)
Each equation and each input variable is a node
i.e. the equations above plus a and b
Arrows point from each input to the equations that use it
Can also assign specific values to the input variables so that each node has a corresponding value (a sketch follows this list)
The "top" node is what the expression evaluates to
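A minimal sketch of evaluating the example graph node by node, using a = 2 and b = 1 as in the linked article (the plain-variable representation is just an illustrative assumption):

```python
# Inputs (leaf nodes of the graph).
a, b = 2.0, 1.0

# Intermediate nodes: each arrow feeds an input into an operation.
c = a + b      # c = a + b
d = b + 1      # d = b + 1

# The "top" node is the value of the whole expression.
e = c * d      # e = (a + b) * (b + 1)
print(c, d, e)  # 3.0 2.0 6.0
```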
Applications
Dependency graphs
Call graphs
Abstraction behind deep learning frameworks
e.g. Theano
Derivatives on Computational Graphs
Want to see how changing one variable affects the expression
i.e. how a node affects the node it points to
This is the partial derivative, which labels each edge of the graph (the arrows)
To understand how nodes that are not directly connected affect each other, use the sum over paths rule
Multiply the edge derivatives along each path, then sum over all paths (again refer to the website; a worked check follows this list)
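A worked check of the sum-over-paths rule on the example graph, assuming the same values a = 2, b = 1: b reaches e along two paths (through c and through d), so the edge derivatives on each path are multiplied and the two paths are summed:

```python
a, b = 2.0, 1.0
c, d = a + b, b + 1
e = c * d

# Edge derivatives (the labels on the arrows of the graph).
de_dc, de_dd = d, c          # e = c * d
dc_da, dc_db = 1.0, 1.0      # c = a + b
dd_db = 1.0                  # d = b + 1

# Sum over paths: b -> c -> e and b -> d -> e.
de_db = de_dc * dc_db + de_dd * dd_db
print(de_db)  # 5.0 at a = 2, b = 1

# Only one path from a: a -> c -> e.
de_da = de_dc * dc_da
print(de_da)  # 2.0
```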
Factoring Paths
Reduces the combinatorial explosion of paths into a more compact computation. Example:
Say X, Y, and Z are nodes with 3 paths from X to Y and 3 paths from Y to Z => 9 different paths from X to Z are possible
i.e. This can get very large as the # of nodes or the # of paths between nodes increases (the number of paths grows exponentially)
We have ∂Z/∂X = αδ + αϵ + αζ + βδ + βϵ + βζ + γδ + γϵ + γζ, which can be factored/reduced to ∂Z/∂X = (α + β + γ)(δ + ϵ + ζ)
Forward-mode and reverse-mode differentiation both compute this sum more efficiently by factoring paths like this (a numeric check follows)
Each edge is touched exactly once
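A quick numeric check of the factoring identity above, with made-up values standing in for the path derivatives α, β, γ (X to Y) and δ, ϵ, ζ (Y to Z):

```python
import itertools

alpha, beta, gamma = 2.0, 3.0, 5.0   # X -> Y path derivatives (made-up values)
delta, eps, zeta = 7.0, 11.0, 13.0   # Y -> Z path derivatives (made-up values)

# Naive sum over all 9 paths: one product per (X->Y path, Y->Z path) pair.
naive = sum(p * q for p, q in itertools.product(
    [alpha, beta, gamma], [delta, eps, zeta]))

# Factored form: merge at Y once instead of enumerating every path.
factored = (alpha + beta + gamma) * (delta + eps + zeta)

print(naive, factored)  # both 310.0
```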
Forward-mode Differentiation
Starts at input
Moves towards end
At each node, sums all the paths feeding in
Each path represents one way the input affects that node; summing them gives the derivative
Summary: Tracks how 1 input affects every node
i.e. One forward pass gives the derivative of every node with respect to that one input (sketch below)
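A minimal forward-mode sketch on the example graph: seed the derivative of one input (here ∂/∂b) and push values and derivatives forward together, so every node ends up with its derivative with respect to b (variable names are just illustrative):

```python
# Forward-mode differentiation: carry (value, derivative w.r.t. b) pairs forward.
a, da = 2.0, 0.0   # seed: d(a)/d(b) = 0
b, db = 1.0, 1.0   # seed: d(b)/d(b) = 1

c, dc = a + b, da + db            # sum rule
d, dd = b + 1, db                 # the constant contributes no derivative
e, de = c * d, dc * d + c * dd    # product rule

print(e, de)  # 6.0 and d(e)/d(b) = 5.0
```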
Reverse-mode Differentiation
Starts at output
Moves towards beginning
At each node, merges all paths which originated at that node
Summary: Tracks how every node affects 1 output
i.e. One backward pass gives the derivative of that one output with respect to every node (sketch below)
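A minimal reverse-mode sketch on the same graph: starting from ∂e/∂e = 1 at the output and walking backwards, every node, including both inputs, gets its derivative with respect to e in a single pass (again, names are only illustrative):

```python
# Forward pass: compute and remember every node's value.
a, b = 2.0, 1.0
c = a + b
d = b + 1
e = c * d

# Reverse pass: propagate d(e)/d(node) from the output back to the inputs.
grad_e = 1.0
grad_c = grad_e * d                   # e = c * d
grad_d = grad_e * c
grad_a = grad_c * 1.0                 # c = a + b
grad_b = grad_c * 1.0 + grad_d * 1.0  # paths through c and d merge here

print(grad_a, grad_b)  # 2.0 5.0 -- derivatives w.r.t. every input in one sweep
```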
Seeing the Advantages of Reverse-mode Differentiation
Refer to diagram in this section
Forward-mode differentiation gives the derivative of the output with respect to one input
i.e. You have to sweep through the graph once per input to cover all inputs
Reverse-mode differentiation gives the derivative of the output with respect to all inputs
i.e. You only sweep through the graph once to get all of that information
For the graph shown, that is a factor-of-two speed-up (it can be far larger when there are millions of inputs)
Note:
Forward-mode differentiation can still be faster than reverse-mode differentiation
e.g. If a neural network has one input and many outputs
Other Benefits of Backpropagation
Helps to see how derivatives flow through a network
Can see why some models are hard to optimize
e.g. Vanishing gradients in RNNs (toy illustration below)
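A toy illustration of the vanishing-gradient point: by the chain rule, the gradient through a long chain of steps is a product of per-step derivatives, and a sigmoid's derivative is at most 0.25, so the product shrinks quickly. The 50-step sigmoid chain below is an arbitrary assumption, not the article's example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Chain rule through 50 repeated sigmoid steps: multiply the local derivatives.
grad = 1.0
x = 0.5                      # arbitrary starting activation
for _ in range(50):
    s = sigmoid(x)
    grad *= s * (1 - s)      # sigmoid'(x) = s * (1 - s) <= 0.25
    x = s

print(grad)  # a vanishingly small number (far smaller than 1e-30)
```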