# Transformers

## Motivation

### Lower Sequence-Length Time Complexity

#### Minimize Linear Interaction Distance
In recurrent models, interaction distance scales as O(l) with sequence length l: the gradient is affected by the linear interaction distance, so linear order is baked in.

#### Maximize Parallelization
Recurrent forward and backward passes require waiting (waiting for the computation to roll from left to right); attention can instead be computed in parallel.

### Key Advantages
- Maximum interaction distance is O(1): each word is connected to every other word.
- The number of unparallelizable operations does not grow with sequence length.

## Self-Attention
Self-attention is formulated as each word in a sequence attending to every word in the same sequence.

### Calculating QKV
\begin{equation}
\begin{cases}
q_{i} = W^{(Q)} x_{i} \\
k_{i} = W^{(K)} x_{i} \\
v_{i} = W^{(V)} x_{i}
\end{cases}
\end{equation}
and then you have a standard good time using reduced-rank multiplicative attention:
\begin{equation}
e_{ij} = q_{i}^{\top} k_{j}
\end{equation}
and normalize with a softmax:
\begin{equation}
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
\end{equation}
to obtain each output as a weighted sum of values:
\begin{equation}
o_{i} = \sum_{j} \alpha_{ij} v_{j}
\end{equation}
Vectorized, with Q = XW^{(Q)}, K = XW^{(K)}, V = XW^{(V)} (one row per token):
\begin{equation}
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{\top})\,V
\end{equation}
and scaled dot-product attention divides the logits by \sqrt{d_{k}}:
\begin{equation}
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V
\end{equation}
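The derivation above, from the QKV projections through the scaled softmax, can be sketched in NumPy (a minimal sketch; the weight shapes and random inputs are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (l, d), one row per token
    Q, K, V = X @ W_q, X @ W_k, X @ W_v    # q_i = W^(Q) x_i etc., batched over tokens
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)        # scaled dot-product scores e_ij
    alpha = softmax(logits, axis=-1)       # normalize over j: rows sum to 1
    return alpha @ V                       # o_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
l, d, d_k = 4, 8, 8
X = rng.normal(size=(l, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Every output position attends to every input position in one matrix multiply, which is where the O(1) interaction distance and the parallelism come from.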
why divide by \sqrt{d_{k}}? See Tricks in Training.

## Transformer Block
Self-attention on its own can be described as simply a rolling (weighted) average; it is linear in the values. To introduce nonlinearity, we apply a linear layer with a ReLU after it.

## Tricks in Training
- Skip connections: x_{l} = F(x_{l-1}) + x_{l-1}
- Layernorm (normalize each layer to mean zero and standard deviation one, so we protect against lower layers' distribution shifts):
\begin{equation}
x^{(l')} = \frac{x^{(l)} - \mu^{(l)}}{\sigma^{(l)} + \epsilon}
\end{equation}
We use the population mean and population standard deviation.
- The mean of a sum is the sum of the means, so after layernorm the input has mean 0, which is good. Yet the variance of a sum of independent terms is the sum of the variances, so a dot product over dimension d_{k} of unit-variance entries has variance d_{k}. So we normalize the attention logits by \sqrt{d_{k}}.

## Word Order

### Sinusoidal Position Embeddings
No one uses it lol. ABSOLUTE position doesn't really matter. See Relative Position Embeddings.

### Relative Position Embeddings
Relative positions are LEARNED and added to the self-attention outputs, so we learn embeddings for relative offsets rather than absolute positions.

## Multi-Head Attention
Perform attention multiple times, get a series of self-attention embeddings, and concatenate. For each single head, divide the dimensionality by the number of heads (so you end up doing the same amount of computation).

## Transformer Drawbacks
- Quadratic compute of self-attention: computing all pairs of interactions means computation grows quadratically with sequence length. Linformer attempts to solve this.
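The block structure above (self-attention, then a ReLU feed-forward layer, each wrapped in a skip connection plus layernorm) can be sketched as follows; the post-norm ordering and the layer sizes are assumptions of this sketch, not something the notes pin down:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layernorm(x, eps=1e-5):
    # normalize each position to mean 0, std 1 (population statistics)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def block(X, Wq, Wk, Wv, W1, b1, W2, b2):
    # self-attention sublayer: F(x) + x, then layernorm
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    X = layernorm(A + X)
    # feed-forward sublayer (linear -> ReLU -> linear) with the same tricks
    F = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
    return layernorm(F + X)

rng = np.random.default_rng(1)
l, d, d_ff = 4, 8, 32
X = rng.normal(size=(l, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1, W2 = rng.normal(size=(d, d_ff)) * 0.1, rng.normal(size=(d_ff, d)) * 0.1
b1, b2 = np.zeros(d_ff), np.zeros(d)
out = block(X, Wq, Wk, Wv, W1, b1, W2, b2)
```

The ReLU feed-forward layer is what keeps the block from collapsing into a single weighted average, and each layernorm re-centers the residual stream so lower layers' shifts don't compound.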
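Multi-head attention as described, h heads each of size d/h whose outputs are concatenated, can be sketched as below (the output projection `Wo` and all sizes are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    l, d = X.shape
    d_h = d // h                          # per-head size: divide d by the number of heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # (l, d) each
    # split the feature dimension into h heads of size d_h -> (h, l, d_h)
    split = lambda M: M.reshape(l, h, d_h).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # scaled dot-product attention within each head, batched over heads
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_h)) @ Vh   # (h, l, d_h)
    # concatenate the h head outputs back to (l, d), then mix them
    concat = A.transpose(1, 0, 2).reshape(l, d)
    return concat @ Wo

rng = np.random.default_rng(2)
l, d, h = 4, 8, 2
X = rng.normal(size=(l, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
```

Because each head works in d/h dimensions, the total matrix-multiply cost matches single-head attention over d dimensions, which is the "same amount of computation" point above.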