Neural networks are a non-linear learning architecture that combines matrix multiplications with entry-wise non-linear operations.

two-layer constituents

Consider a two-layer neural network with:

m hidden units
d-dimensional input x \in \mathbb{R}^{d}

requirements

\begin{align} &\forall j \in \left\{1, \dots, m\right\} \\ &z_{j} = {w_{j}^{(1)}}^{T} x + b_{j}^{(1)} \\ &a_{j} = \text{ReLU}\left(z_{j}\right) \\ &a = \left(a_1, \dots, a_{m}\right)^{T} \in \mathbb{R}^{m} \\ &h_{\theta}\left(x\right) = {w^{(2)}}^{T} a + b^{(2)} \end{align}

z_{j} are the hidden units, a_{j} are the activated hidden units, and h_{\theta} is the prediction function.

vectorized two-layer constituents

m hidden units per layer
d input dimension

requirements

\begin{equation} W^{(1)} = \mqty[{w_1^{(1)}}^{T} \\ \dots \\ {w_m^{(1)}}^{T}] \end{equation}

which is an m \times d matrix. So this gives:

\begin{equation} \mqty[z_1 \\ \dots \\ z_{m}] = \mqty[{w_1^{(1)}}^{T} \\ \dots \\ {w_m^{(1)}}^{T}] \mqty[x_1 \\ \dots \\ x_{d}] + \mqty[b_1^{(1)} \\ \dots \\ b_m^{(1)}] \end{equation}

where z \in \mathbb{R}^{m \times 1}, W^{(1)} \in \mathbb{R}^{m \times d}, x \in \mathbb{R}^{d \times 1}, b^{(1)} \in \mathbb{R}^{m \times 1}. Writing this as matrix operations:

\begin{equation} z = W^{(1)} x + b^{(1)} \end{equation}

and

\begin{equation} a = \text{ReLU}\left(z\right) \end{equation}

with:

\begin{equation} h_{\theta}\left(x\right) = {w^{(2)}}^{T} a + b^{(2)} \end{equation}
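The vectorized two-layer forward pass can be sketched in numpy as follows; the shapes mirror the equations above, while the random initialization and dimensions (d = 4, m = 3) are purely illustrative:

```python
import numpy as np

# Minimal sketch of the two-layer forward pass; values are illustrative.
rng = np.random.default_rng(0)
d, m = 4, 3                      # input dimension, hidden units

x = rng.normal(size=(d, 1))      # x in R^{d x 1}
W1 = rng.normal(size=(m, d))     # W^{(1)} in R^{m x d}
b1 = rng.normal(size=(m, 1))     # b^{(1)} in R^{m x 1}
w2 = rng.normal(size=(m, 1))     # w^{(2)} in R^{m}
b2 = rng.normal()                # b^{(2)}, a scalar

z = W1 @ x + b1                  # z = W^{(1)} x + b^{(1)}
a = np.maximum(z, 0)             # a = ReLU(z), applied entry-wise
h = (w2.T @ a + b2).item()       # h_theta(x) = w^{(2)T} a + b^{(2)}

print(z.shape, a.shape)          # (3, 1) (3, 1)
```

`np.maximum(z, 0)` applies the ReLU entry-wise, which is exactly what the non-bold ReLU on a vector means in the equations above.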

multi-layer

\begin{equation} a^{(1)} = \text{ReLU}\left(W^{(1)} x + b^{(1)}\right) \end{equation}

\begin{equation} a^{(2)} = \text{ReLU}\left(W^{(2)} a^{(1)} + b^{(2)}\right) \end{equation}

and so on…

\begin{equation} a^{(r-1)} = \text{ReLU}\left(W^{(r-1)} a^{(r-2)} + b^{(r-1)}\right) \end{equation}
\begin{equation} h_{\theta}\left(x\right) = W^{(r)} a^{(r-1)} + b^{(r)} \end{equation}
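The r-layer recursion above is just a loop: ReLU after every hidden layer, and a purely affine final layer. A sketch, with made-up widths m_1 = 8, m_2 = 6, m_r = 1 and d = 5:

```python
import numpy as np

# Sketch of the multi-layer forward pass; widths and values are illustrative.
rng = np.random.default_rng(0)
d = 5                                  # input dimension
widths = [8, 6, 1]                     # m_1, m_2, m_r

# W^{(k)} has shape m_k x m_{k-1}, with m_0 = d.
dims = [d] + widths
params = [(rng.normal(size=(dims[k + 1], dims[k])),
           rng.normal(size=(dims[k + 1], 1)))
          for k in range(len(widths))]

a = rng.normal(size=(d, 1))            # a^{(0)} = x
for k, (W, b) in enumerate(params):
    z = W @ a + b
    a = np.maximum(z, 0) if k < len(params) - 1 else z  # last layer stays linear

# Parameter count: (d+1)m_1 + (m_1+1)m_2 + (m_2+1)m_3 = 48 + 54 + 7 = 109.
n_params = sum(W.size + b.size for W, b in params)
print(a.shape, n_params)               # (1, 1) 109
```

Counting `W.size + b.size` per layer reproduces the closed-form parameter count, since each layer contributes m_k \times m_{k-1} weights plus m_k biases.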

metadata

total number of neurons: m_1 + m_2 + \dots + m_{r}
number of parameters: \left(d+1\right) m_1 + \left(m_{1}+1\right)m_{2} + \dots + \left(m_{r-1}+1\right)m_{r}

additional information

The training objective of a neural network is non-convex: it admits local optima, and we cannot in general guarantee finding a global optimum.

neuron

Consider first a single-neuron neural network in one dimension. For instance, let's think of a slightly non-linear case first:

\begin{align} h_{\theta}\left(x\right) &= \max \left(wx+b, 0\right) \end{align}

It has two parameters, \theta = \left(w, b\right) \in \mathbb{R}^{2}. Such a function is called the ReLU (rectified linear unit) function. What if we have multiple input features? Consider x \in \mathbb{R}^{d}, w \in \mathbb{R}^{d}, and b \in \mathbb{R}. Now:

\begin{equation} h_{\theta} \left(x\right) = \text{ReLU}\left(w^{\top}x + b\right) \end{equation}
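A single ReLU neuron on a multi-dimensional input is a dot product, a bias shift, and a clip at zero. A tiny worked sketch (the particular w, b, x values are made up):

```python
import numpy as np

# One ReLU neuron: h_theta(x) = ReLU(w^T x + b); values are illustrative.
w = np.array([1.0, -2.0, 0.5])
b = -0.5
x = np.array([2.0, 0.5, 1.0])

pre = w @ x + b        # w^T x + b = 2.0 - 1.0 + 0.5 - 0.5 = 1.0
h = max(pre, 0.0)      # ReLU clips negative pre-activations to zero
print(h)               # 1.0
```

With a pre-activation of 1.0 the neuron is "active" and passes its value through; had `pre` been negative, the output would be exactly 0.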

We refer to the ReLU function as an "Activation Function".

neurons

We can write latent units in terms of the input units, together with parameters that weight them; for instance:

\begin{equation} a_1 = \text{ReLU}\left(\theta_{1} x_1 + \theta_{2} x_2 + \theta_{3}\right) \end{equation}

Instead of hand-picking these connections, we can simply connect every neuron to every input (a fully connected layer), resulting in:

\begin{equation} a_1 = \text{ReLU}\left(w_1^{T} x + b_1\right) \end{equation}
\begin{equation} a_2 = \text{ReLU}\left(w_2^{T} x + b_2\right) \end{equation}

and so on.

why would the neurons learn different things?

Because random initialization breaks symmetry: each neuron starts from different weights, descends toward a different local minimum, and so specializes.

some activation functions

see Activation Function

see also

Neural Networks

kernel methods vs deep learning

Instead of using the Kernel Trick and a hand-designed feature map to extract features yourself, deep learning promises to learn the correct feature map through multiple non-linear layers. Consider \beta as the parameters of a fully-connected neural network up to its last layer; then the final hypothesis function of a neural network is:

\begin{equation} h_{\theta}\left(x\right) = W^{(r)} \phi_{\beta}\left(x\right) + b^{(r)} \end{equation}

In some sense, the entire damn neural network is a feature map for the final, linear output head. We can therefore think of training a neural network as simultaneously finding a feature map \phi_{\beta} and learning a linear classifier on top of it.
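This view can be made concrete by freezing \phi_{\beta} and fitting only the linear head. A sketch with a random (untrained) ReLU layer as the feature map and a least-squares head; all names, dimensions, and the sine target are illustrative, and in a real network \beta would of course be learned too:

```python
import numpy as np

# Feature-map view: frozen random hidden layer as phi_beta, linear head on top.
rng = np.random.default_rng(0)
d, m, n = 3, 32, 200

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # some non-linear target

W1 = rng.normal(size=(m, d))                     # frozen feature-map params
b1 = rng.normal(size=m)
Phi = np.maximum(X @ W1.T + b1, 0)               # phi_beta(x) for every row of X

# Fit only the linear head by least squares (ones column = bias b^{(r)}).
A = np.hstack([Phi, np.ones((n, 1))])
head, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ head
print(np.mean((pred - y) ** 2))                  # training MSE of the head
```

Even with random features the linear head fits the target far better than a constant predictor would, which illustrates the point: the hidden layers manufacture features, and the last layer is ordinary linear regression/classification on them.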
