## One-Liner

“If you have infinite compute but limited samples, how do you pretrain?”

## Novelty

A closed-form best approach that reduces loss given the current data budget.

## Notable Methods Outline

- Setup: take 200 million tokens from a corpus and roughly 300 million parameters, measure validation loss, and vary the training recipe.
- Regularized parameter scaling: stick in more weight decay as you add parameters (more parameters, more weight decay); a sketch follows at the end of this note.

| Model | Weight decay |
| --- | --- |
| 150M | 0.8 |
| 300M | 1.6 |
| 600M | 3.2 |
| 1.4B | 3.2 |

- Ensembling: train a bunch of separate models (e.g. with different random data shuffles and different initializations) and then parameter-merge them; see the merging sketch at the end of this note.

## Key Results

Varying the training recipe gives varying results:

- Epoching: eventually overfits, and larger models overfit sooner.
- Regularized parameter scaling: faster improvements in loss as parameter count scales.
- Ensembling: a lower loss asymptote.

## Takeaways

- Training recipes: the current approach overfits; regularization restores scaling-law behavior; ensembling decreases loss at the asymptote.
- Inference-time efficiency: this is an ad for ensembling, and ensembling is expensive at inference. You can distill the model down to a single dense size, and a 4-ensemble distilled down densely can even outperform an optimal 4-ensemble (distillation sketch at the end of this note). Self-distillation can be good as well!
- Continual pre-training: training with ensembling gives efficiency gains even at large data scale regimes (4B tokens, etc.).

## New Concepts

## Notes
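The weight-decay scaling above is simple enough to express directly. A minimal sketch, assuming a PyTorch/AdamW training setup: the size-to-decay mapping copies the table in this note, while the learning rate and the toy model are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

# Weight decay per model size, taken from the table in these notes.
WEIGHT_DECAY_BY_SIZE = {
    "150M": 0.8,
    "300M": 1.6,
    "600M": 3.2,
    "1.4B": 3.2,
}

def make_optimizer(model: nn.Module, size: str, lr: float = 3e-4) -> torch.optim.AdamW:
    """Build AdamW with the size-dependent weight decay from the notes' table."""
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=WEIGHT_DECAY_BY_SIZE[size])

# Usage with a toy stand-in model (the real models are 150M-1.4B language models).
toy_model = nn.Linear(512, 512)
opt = make_optimizer(toy_model, "300M")
```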
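A minimal sketch of the ensembling recipe as described in these notes: train several models independently (different seeds and data shuffles), then merge their parameters by uniform averaging. Whether the method merges weights or averages predictions isn't settled in the note, so treat the weight-averaging step, the toy model, and all hyperparameters here as assumptions.

```python
import copy
import torch
import torch.nn as nn

def train_one(seed: int, data: torch.Tensor, targets: torch.Tensor, steps: int = 100) -> nn.Module:
    """Train one ensemble member from its own random init and its own data order."""
    torch.manual_seed(seed)                      # different initialization per member
    model = nn.Linear(data.shape[1], 1)          # toy stand-in for the real model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.8)
    perm = torch.randperm(data.shape[0])         # member-specific data shuffle
    for step in range(steps):
        i = perm[step % len(perm)]
        loss = (model(data[i]) - targets[i]).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

def merge_parameters(models: list[nn.Module]) -> nn.Module:
    """Uniformly average the parameters of independently trained models."""
    merged = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in models])
            param.copy_(stacked.mean(dim=0))
    return merged

# Build a 4-member ensemble on toy data, then merge into one set of weights.
x, y = torch.randn(256, 16), torch.randn(256, 1)
members = [train_one(seed, x, y) for seed in range(4)]
ensemble_model = merge_parameters(members)
```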
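A minimal sketch of distilling an ensemble into a single dense model, the step the notes point to for recovering inference-time efficiency. The teacher signal is the average of the members' probabilities and the student matches it with a KL loss; the architectures, temperature-free softmax, and optimizer settings are placeholders, not the paper's recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 100, 32
torch.manual_seed(0)
teachers = [nn.Linear(dim, vocab) for _ in range(4)]  # stand-ins for 4 trained ensemble members
student = nn.Linear(dim, vocab)                        # single dense model of the target size
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(64, dim)                           # stand-in for pretraining inputs
    with torch.no_grad():
        # Ensemble teacher: average the members' output probabilities.
        teacher_probs = torch.stack([F.softmax(t(x), dim=-1) for t in teachers]).mean(dim=0)
    student_log_probs = F.log_softmax(student(x), dim=-1)
    # KL(teacher || student), the usual distillation objective.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```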
