## Example Pre-training Datasets

- Small scale: DCLM-Baseline
- Legally friendly data: Common Pile
- Web-scraped data with quality buckets: Nemotron-CC
- Datasets are typically compared with isoFLOP runs (same compute budget, different data)

## Problems of Pre-training Data

- Pre-training data influences downstream capabilities…
- …and can therefore escape into model generations
- Real-world users expect novelty

## Changes in Distribution

### Big Pretraining Data

- GPT-2
  - Deduplicated data
  - Removed Wikipedia (to prevent evaluation data leakage)
  - Heuristic-based cleaning (heuristic sketch below)
- GPT-3
  - Deduplicated against benchmark data to remove leaked test sets
- Llama: the usual spiel
  - Removed high-perplexity data using a Wikipedia n-gram model (perplexity sketch below)
  - Removed non-English data
  - Deduplicated
- Llama 2
  - Removed sites with a high volume of PII
  - Removed non-English data

### Pretraining Curation Decisions

- What to include
- What timestamp is being scraped
- Heuristic-based cleaning? Data cleaning? etc.
- Language filtering (only take English?) (fastText sketch below)
- PII removal (scrubbing sketch below)
- Dedup (MinHash sketch below)
- Toxicity + SafeURL filtering (blocklist sketch below)
- "Quality filtering" (perplexity sketch below)
- Sampling distributions (mixture sketch below)

### Change in Model Age

- Good alignment shown between validation year and pre-training year, even when older data is mixed in.
- Implication: "a fine-tuned T5 may still be worse than a fine-tuned Llama, because T5 was pre-trained on older data, even if the fine-tuning data is newer."

### Change in Toxicity Filtering

- Filtering out toxic data made the model worse at spotting toxicity.

### Change in Data Distribution

- Models do worse on out-of-domain evaluations; the pre-training distribution determines where the model is strong.

## Reduce Memorization

- De-duplicate using approximate matching (MinHash sketch below)
- Think carefully about multiple-epoch training (what is OK to memorize?)
- Remove sensitive content from the pre-training data

Two iffy strategies:

1. Check for memorization at generation time. Trivial style transfers get around such checks: "do the [copyrighted thing] in French"; "do the [copyrighted thing] with double the spaces" (matching sketch below).
2. Use RLHF or something similar. This tends to "hide flaws, not eliminate them": edge cases remain, and the underlying vulnerability is not eliminated.
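## Code Sketches

A minimal sketch of the Llama-style "wiki n-gram" quality filter, assuming a KenLM Kneser-Ney model trained on Wikipedia; the model path and cutoff are assumptions, and CCNet-style pipelines actually bucket documents by perplexity rather than applying one hard threshold:

```python
import kenlm  # pip install kenlm

# Hypothetical 5-gram ARPA model trained on Wikipedia text (path is an assumption).
wiki_lm = kenlm.Model("wikipedia.5gram.arpa")

def keep_by_perplexity(doc: str, max_ppl: float = 1000.0) -> bool:
    # Lower perplexity under the Wikipedia LM means closer to "clean" reference text.
    return wiki_lm.perplexity(doc) <= max_ppl
```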
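A sketch of heuristic-based cleaning in the spirit of GPT-2/Gopher-style rules; every threshold below is an illustrative assumption, not a documented value:

```python
def passes_heuristics(doc: str) -> bool:
    # All thresholds here are illustrative, not taken from any specific pipeline.
    words = doc.split()
    if len(words) < 50:                        # too short to be a real document
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:                # gibberish or token soup
        return False
    alpha = sum(w.isalpha() for w in words) / len(words)
    if alpha < 0.8:                            # mostly symbols or numbers
        return False
    return True
```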
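Language filtering is commonly done with fastText's language-ID model; a sketch assuming the released `lid.176.bin` checkpoint is on disk, with an assumed confidence threshold:

```python
import fasttext  # pip install fasttext

# Assumed local path to fastText's released language-ID model.
lid = fasttext.load_model("lid.176.bin")

def is_english(doc: str, threshold: float = 0.65) -> bool:
    # fastText predicts on a single line, so strip newlines first.
    labels, probs = lid.predict(doc.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold
```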
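A toy PII scrub; the two patterns are assumptions for illustration. Production pipelines combine NER models with much broader pattern sets, and Llama 2-style filtering dropped whole PII-heavy sites rather than rewriting text:

```python
import re

# Illustrative patterns only; real PII coverage is far broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(doc: str) -> str:
    # Replace matches with typed placeholders instead of deleting them,
    # so sentence structure survives for the language model.
    return PHONE.sub("<PHONE>", EMAIL.sub("<EMAIL>", doc))
```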
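SafeURL filtering can be as simple as a domain blocklist; the domains below are placeholders, and real pipelines pair curated blocklists with toxicity classifiers over the text itself:

```python
from urllib.parse import urlparse

# Hypothetical blocklist; real lists contain millions of curated domains.
BLOCKED_DOMAINS = {"bad-example.com", "another-bad-example.net"}

def safe_url(url: str) -> bool:
    return urlparse(url).netloc.lower() not in BLOCKED_DOMAINS
```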
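Sampling distributions decide how often each curated source is drawn from during training; the weights below are invented for illustration:

```python
import random

# Invented mixture weights; the point is only that curation ends with
# choosing a sampling distribution over sources.
SOURCE_WEIGHTS = {"web": 0.70, "code": 0.15, "books": 0.10, "wiki": 0.05}

def sample_source(rng: random.Random) -> str:
    sources, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]
```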
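Approximate-matching de-duplication is typically MinHash plus locality-sensitive hashing; a sketch using the `datasketch` library, where the shingle size, permutation count, and Jaccard threshold are all assumptions:

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash(doc: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for i in range(len(doc) - 4):              # character 5-gram shingles (assumed)
        m.update(doc[i:i + 5].encode("utf8"))
    return m

def near_dedup(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Keep one representative per cluster of near-duplicate documents."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for key, doc in docs.items():
        sig = minhash(doc)
        if not lsh.query(sig):                 # no near-duplicate indexed yet
            lsh.insert(key, sig)
            kept.append(key)
    return kept
```

Greedy first-seen-wins keeps exactly one document per near-duplicate cluster, which also caps how many times any passage can be memorized across epochs.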
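Why generation-time memorization checks are iffy: a surface matcher catches verbatim copies, and whitespace normalization recovers the "double the spaces" trick, but translation ("do it in French") defeats any surface-level check. A sketch with an assumed prefix length:

```python
def _norm(s: str) -> str:
    # Collapse case and whitespace so spacing tricks no longer evade matching.
    return " ".join(s.lower().split())

def exact_check(output: str, corpus_doc: str) -> bool:
    # Verbatim substring match: already defeated by doubling the spaces.
    return corpus_doc[:200] in output

def normalized_check(output: str, corpus_doc: str) -> bool:
    # Recovers the spacing trick, but a translated regurgitation still
    # sails through; the underlying vulnerability is untouched.
    return _norm(corpus_doc)[:200] in _norm(output)
```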