SU-CS224N MAY092024 — Jemoka Knowledge Base

Floating Point 4 bytes

\begin{equation} (-1)^{B} + e^{E-127} \times \left(1 + \sum_{i=1}^{23} b_{23-i}2^{-i}\right) \end{equation}

usually E is a 8 bytes, and 23 digits of b. With more E, we will have more range, with more b, we will have more precision. Mixed Precision Training copy the model in FP32 Run forward pass in FP16 Scale loss to be large enough to not be rounded away Compute gradients in FP16 Convert the gradients onto FP32 Scale the gradients down apply to the model BFloat16 To not need to scale, we can use a scheme that has less precision but the same amount of dynamic range (i.e. allocate the same E, chop off b) —no need to scale, just have more dynamic range. Distributed Data Parallel every GPU has a copy of the model each GPU all-reduce reduce each copy of the weights down Deepspeed Zero reduce-scatter squish down and send each part tot the right GPU all gather send eveything over everybody else Stage 1 We cache a slice of the optimizer state on each GPU. Stage 2 perform a backwards pass at each layer, compute gradient look up who in the cluster is responsible for that layer Stage 3 divide the model parameters into FSDP units shard each unit across multiple GPUs run forward pass run backward pass each GPU updates its own shard using the full gradient from earlier (unlike stages 1 and 2, you need to stream in your parameters—more communication overhead!) Lessons always use mix precision training always use bfloat16 PEFT See PEFT