Neural Networks are powerful because of the self-organization of the intermediate layers.

Neural Network Layer

\begin{equation} z = Wx + b \end{equation}

for the output, and the activations:

\begin{equation} h = f(z) \end{equation}
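As a concrete sketch (numpy; the sizes here are made up for illustration), one layer computes the affine transform z = Wx + b and then applies the activation f elementwise:

```python
import numpy as np

def layer_forward(W, b, x, f):
    """One neural-network layer: affine transform, then elementwise activation."""
    z = W @ x + b  # pre-activation: z = Wx + b
    h = f(z)       # activation f applied elementwise
    return h

# hypothetical sizes: 3 inputs, 2 hidden units
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
b = np.zeros(2)
x = rng.standard_normal(3)
h = layer_forward(W, b, x, np.tanh)
print(h.shape)  # (2,)
```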
where the activation function f is applied element-wise.

Why are NNs Non-Linear?

Stacking multiple linear layers adds no representational power, since a composition of linear maps is itself linear (though there are better learning/convergence properties even with big linear networks!). And most things are non-linear!

Activation Function

We want non-linear and non-threshold (0/1) activation functions, because a smooth function has a slope, meaning we can perform gradient-based learning.

sigmoid

\begin{equation} \sigma(z) = \frac{1}{1+e^{-z}} \end{equation}

sigmoid: it pushes stuff toward 1 or 0.
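A quick numeric check of the logistic sigmoid (a sketch, assuming the standard form 1/(1+e^{-z})): it is 0.5 at the origin and saturates toward 0 or 1 at the tails.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 exactly at the origin
print(sigmoid(10.0))   # very close to 1: large inputs saturate
print(sigmoid(-10.0))  # very close to 0
```

The saturation at the tails is exactly why the slope (and hence the gradient signal) vanishes for large |z|.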
tanh

\begin{equation} \tanh(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \end{equation}

this is just a rescaled logistic: it's twice as steep, since \tanh(z) = 2\sigma(2z) - 1.

hard tanh

tanh but funny: a piecewise-linear version, because exp is hard to compute:

\begin{equation} \operatorname{hardtanh}(z) = \max\left(-1, \min\left(1, z\right)\right) \end{equation}
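The "rescaled logistic" identity tanh(z) = 2·sigmoid(2z) − 1 can be verified numerically (a small sketch using numpy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
lhs = np.tanh(z)
# tanh is the logistic rescaled to (-1, 1) and made twice as steep
rhs = 2.0 * sigmoid(2.0 * z) - 1.0
print(np.allclose(lhs, rhs))  # True
```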
this motivates ReLU:

relu

\begin{equation} \mathrm{ReLU}(z) = \max(0, z) \end{equation}

the slope is 1 wherever the unit is active, so it's easy to compute, etc.
Leaky ReLU

"ReLU-like, but not actually dead": a small slope \alpha on the negative side keeps the gradient from vanishing entirely:

\begin{equation} \mathrm{LeakyReLU}(z) = \max(\alpha z, z) \end{equation}
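A small numpy comparison of ReLU and Leaky ReLU (a sketch; alpha = 0.01 is an assumed, typical choice): ReLU zeroes out all negative inputs, while Leaky ReLU keeps a small negative slope so those units still get gradient.

```python
import numpy as np

def relu(z):
    """ReLU: clip negatives to 0."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but with small slope alpha on the negative side,
    so units never go completely 'dead' (zero gradient)."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [ 0.  0.  0.  0.5 2. ]
print(leaky_relu(z))  # [-0.02 -0.005 0. 0.5 2. ]
```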
or slightly more funny ones, like ELU (this is ReLU on the positive side, and an exponential pulled down to a negative value on the early bits):

\begin{equation} \mathrm{ELU}(z) = \begin{cases} z & z > 0 \\ \alpha\left(e^{z}-1\right) & z \le 0 \end{cases} \end{equation}

swish

slope of 1 for large positive inputs, and then it "swishes back" below zero. Gives ReLU-like behavior but never with a discontinuity in the gradient:

\begin{equation} \mathrm{swish}(x) = x \cdot \sigma(\beta x) \end{equation}
usually \beta = 1.

GELU

\begin{equation} \mathrm{GELU}(x) = x \cdot \Phi\left(x\right) \end{equation}
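Both swish (with \beta = 1) and GELU can be sketched directly from their definitions; this uses the exact erf form of the normal CDF rather than any tanh approximation. Both behave like ReLU for large positive x and dip slightly below zero before flattening out:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(x, beta=1.0):
    """swish: x * sigmoid(beta * x); beta = 1 is the usual choice."""
    return x * sigmoid(beta * x)

def gelu(x):
    """GELU: x * Phi(x), where Phi is the standard normal CDF,
    written exactly via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:5.1f}  swish={swish(x):8.4f}  gelu={gelu(x):8.4f}")
```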
where \Phi = \mathrm{CDF}_{\mathcal{N}} is the CDF of the standard normal distribution.

Vectorized Calculus

A multi-input function f(x_1, \dots, x_n) has a single output, so its gradient is a vector with one partial derivative per input:
where we have:

\begin{equation} \nabla f = \left[ \pdv{f}{x_1}, \pdv{f}{x_2}, \dots, \pdv{f}{x_n} \right] \end{equation}
if we have multiple outputs f_1, \dots, f_m, we get the Jacobian:

\begin{equation} J = \begin{bmatrix} \pdv{f_1}{x_1} & \cdots & \pdv{f_1}{x_n} \\ \vdots & \ddots & \vdots \\ \pdv{f_m}{x_1} & \cdots & \pdv{f_m}{x_n} \end{bmatrix} \end{equation}
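For intuition, a Jacobian can be approximated column by column with finite differences (a sketch; `numerical_jacobian` is a helper written here, not a library function). Applied to an elementwise function like tanh, the off-diagonal entries come out as zero, since output i depends only on input i:

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian: J[i, j] approximates d f_i / d x_j."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += h
        xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)  # central difference in coord j
    return J

x = np.array([0.3, -1.2, 0.7])
J = numerical_jacobian(np.tanh, x)
print(np.round(J, 4))
# off-diagonals are 0: tanh is elementwise, so h_i depends only on x_i
```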
Transposes

Consider a scalar s and a column vector x \in \mathbb{R}^{n}: in Jacobian form, \pdv{s}{x} is a 1 \times n row vector,

but because of shape conventions we call it the n \times 1 column vector \left(\pdv{s}{x}\right)^{T}, matching the shape of x.
Useful Jacobians!

\begin{equation} \pdv{(Wx + b)}{x} = W, \qquad \pdv{f(z)}{z} = \operatorname{diag}\left(f'(z)\right), \qquad \pdv{(u^{T}h)}{h} = u^{T} \end{equation}

Why is the middle one diagonal? Because the activations f are applied elementwise, only the diagonal has values and the off-diagonals are all 0 (because \pdv{h(x_1)}{x_2} = 0).

Shape Convention

We will always write the output in the same shape as the parameters.

shape convention: derivatives w.r.t. matrices have the shape of the matrix

Jacobian form: derivatives w.r.t. matrices are row vectors

we use the first one.

Actual Backprop

1. create a topological sort of your computation graph
2. calculate each variable in that order (forward pass)
3. calculate the backwards pass in reverse order

Check Gradient

\begin{equation} f'(x) \approx \frac{f(x+h) - f(x-h)}{2h} \end{equation}
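The central-difference check above can be put to work on a tiny example (a sketch; the one-unit sigmoid "network" and its hand-derived backprop gradient are illustrative, not from the notes): compare the analytic gradient against the numeric one coordinate by coordinate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    """Tiny model: squared error of a one-unit sigmoid 'network'."""
    return 0.5 * (sigmoid(w @ x) - y) ** 2

def analytic_grad(w, x, y):
    """Backprop by hand: chain rule through the loss, the sigmoid, and w @ x."""
    a = sigmoid(w @ x)
    return (a - y) * a * (1 - a) * x

def numeric_grad(w, x, y, h=1e-5):
    """Central difference f'(x) ~ (f(x+h) - f(x-h)) / 2h, per coordinate."""
    g = np.zeros_like(w)
    for i in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[i] += h
        wm[i] -= h
        g[i] = (loss(wp, x, y) - loss(wm, x, y)) / (2 * h)
    return g

w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
y = 1.0
print(np.allclose(analytic_grad(w, x, y), numeric_grad(w, x, y), atol=1e-7))  # True
```

If the two disagree beyond the finite-difference error (roughly O(h^2) for the central difference), the backprop implementation has a bug.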