# Subword Modeling

We use subword modeling to deal with:

- complex morphology (many inflected word forms): "a single word has a million forms in Finnish"
- novel derived words ("transformify")
- misspellings and expressive lengthening ("gooooood vibessssss")

Each subword carries a combining marker so that actual word endings can be recovered.

## Byte-Pair Encoding

"Find pieces of words that are common and treat them as vocabulary items."

1. Start with a vocab containing only characters and an end-of-word symbol.
2. Look at the corpus and find the most common pair of adjacent tokens.
3. Replace all instances of that pair with a new subword.
4. Repeat steps 2–3 until the vocab is big enough.

## Writing Systems

- phonemic (directly transcribing sounds; see Spanish)
- fossilized phonemic (English, where spelling and sound have drifted apart)
- syllabic/moraic (each syllable written down)
- ideographic (like syllabic, but symbols carry meaning rather than sound)
- a combination of the above (Japanese)

# Whole-Model Pretraining

- all parameters are initialized via pretraining
- don't even bother training word vectors separately

MLM and NTP are "universal tasks": performing well at masked language modeling and next-token prediction across different circumstances requires {local knowledge, scene representations, language, etc.}.

## Why Pretraining Helps

- maybe local minima near the pretraining weights generalize well
- or maybe, because the outputs are sensible, gradients propagate nicely because they are well modulated

# Types of Architecture

## Encoders

Bidirectional context: can condition on the future.

### BERT

For each input token selected for prediction:

- replace it with [MASK] 80% of the time
- replace it with a RANDOM WORD 10% of the time
- leave it unchanged 10% of the time

i.e. BERT has to recover a proper sentence representation from lots of noise.

Original BERT was also pretrained with a next-sentence-prediction loss in addition to MLM, but that turned out to be unnecessary.

### BERT variants

- RoBERTa: train on longer contexts
- SpanBERT: mask a contiguous span

## Encoder/Decoder

Pretraining for both objectives at once may be hard.

### T5

Encoder/Decoder model.
Pretraining task: blank infilling (span corruption):

- original: "Thank you for inviting me to your party last week"
- input: "Thank you <x> me to your party <y> week"
- target: "<x> for inviting <y> last <z>"

This actually works better than the plain LM training objective.

## Decoders

- general LMs use this architecture
- nice to generate from, but cannot condition on future words

### In-Context Learning

- really only becomes capable at hundreds of billions of parameters
- uses no gradient steps: the model just reads the prompt and attends to the examples
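The BPE procedure from steps 1–4 above can be sketched in Python. This is a toy illustration, not a production tokenizer: the corpus, `</w>` end-of-word marker, and function names are all assumptions for the example, and real implementations (e.g. `subword-nmt` or `tokenizers`) handle tie-breaking and efficiency more carefully.

```python
import collections
import re

def get_pair_counts(corpus):
    """Count adjacent symbol pairs over a corpus of space-separated words."""
    counts = collections.Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, corpus):
    """Replace every occurrence of `pair` with the merged subword."""
    # lookarounds keep the match aligned to whole symbols, not substrings
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in corpus.items()}

def learn_bpe(corpus, vocab_size):
    """Grow the vocab by repeatedly merging the most common adjacent pair."""
    vocab = {sym for word in corpus for sym in word.split()}  # chars + </w>
    merges = []
    while len(vocab) < vocab_size:
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = counts.most_common(1)[0][0]   # most common adjacent pair
        corpus = merge_pair(best, corpus)
        merges.append(best)
        vocab.add("".join(best))
    return merges, vocab

# Toy corpus: words pre-split into characters, with an end-of-word marker.
corpus = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, vocab = learn_bpe(corpus, vocab_size=20)
```

On this corpus, frequent fragments like `es`, `est`, and `est</w>` get merged early, which is exactly the "common pieces of words" intuition above.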

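BERT's 80/10/10 corruption rule can be sketched as follows. The 15% selection rate, toy vocabulary, and function names are assumptions for illustration, not BERT's actual code; a real implementation works over wordpiece IDs in tensors.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "sat", "mat", "the", "on"]  # stand-in vocabulary

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """BERT-style corruption: each selected token is replaced by [MASK]
    80% of the time, a random word 10%, and left unchanged 10%."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)             # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(TOY_VOCAB))
            else:
                corrupted.append(tok)      # unchanged, but still predicted
        else:
            labels.append(None)            # no loss at this position
            corrupted.append(tok)
    return corrupted, labels

# Example: corrupt a toy sentence (assumes whitespace tokenization).
tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, mask_prob=0.5, rng=random.Random(0))
```

The 10% random-word and 10% unchanged cases are why BERT sees "lots of noise": it cannot trust that an unmasked token is actually correct.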