# Some ideas of model validation

## Cross Validation

### Hold-out cross-validation

For instance, you can do:

- 70% for training
- 30% held out for testing

At very large dataset scales, the held-out set can instead be capped at a fixed size: you can hold out as little as 0.1% and still have, say, 10k samples. (A minimal sketch appears below.)

### k-fold cross validation

- shuffle the data
- divide the data into k equal-sized pieces
- repeatedly train the algorithm on k - 1 of the pieces and test on the remaining one (e.g., with k = 5, train on 4/5 of the data and test on the remaining 1/5)

In practice, people commonly use 10 folds. (See the sketch below.)

### LOOCV

See Leave-One-Out Cross Validation.

### Test Set

In academic settings (not in production), we can report results on a third, held-out split of the dataset, which gives an unbiased estimate of performance.

## LLM Evaluation Types

### Intrinsic Evaluation

In-Vitro Evaluation, or Intrinsic Evaluation, focuses on evaluating a language model's performance at, well, language modeling. Typically, we use perplexity. (A small numeric sketch appears below.)

- directly measures language model performance
- doesn't necessarily correspond with performance in real applications

### Extrinsic Evaluation

Extrinsic Evaluation, also known as In-Vivo Evaluation, benchmarks two language models by comparing their performance on a downstream test task. (A minimal comparison sketch appears below.)
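A minimal sketch of the hold-out split, using scikit-learn's `train_test_split` (assuming `X` and `y` are your features and labels). Note that `test_size` accepts either a fraction or an absolute count, which also covers the fixed-cap case at large scales:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (hypothetical shapes).
X = np.random.randn(1_000, 20)
y = np.random.randint(0, 2, size=1_000)

# Classic 70/30 hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# At very large scales, cap the held-out set at a fixed size instead:
# test_size also accepts an absolute count (100 here for the toy data;
# e.g. 10_000 on a 10M-example dataset would be a 0.1% hold-out).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=100, random_state=0
)
```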
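And a minimal 10-fold sketch with scikit-learn's `KFold`; `LogisticRegression` is just a stand-in for whatever estimator you are validating:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.random.randn(500, 10)
y = np.random.randint(0, 2, size=500)

# shuffle=True shuffles once before splitting into k equal-sized folds.
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()            # stand-in estimator
    model.fit(X[train_idx], y[train_idx])   # train on k - 1 folds
    preds = model.predict(X[test_idx])      # test on the held-out fold
    scores.append(accuracy_score(y[test_idx], preds))

print(f"10-fold accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```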
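For the intrinsic case: perplexity is the exponentiated average negative log-likelihood per token. A small numeric sketch, assuming `token_probs` holds the (hypothetical) probabilities a model assigned to each token of a held-out text:

```python
import numpy as np

# Hypothetical per-token probabilities from a language model on held-out text.
token_probs = np.array([0.20, 0.05, 0.10, 0.30, 0.02])

# Perplexity = exp(mean negative log-likelihood per token).
nll = -np.log(token_probs)
perplexity = np.exp(nll.mean())

print(f"perplexity: {perplexity:.2f}")  # lower is better
```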
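For the extrinsic case, a hedged sketch: score two models on the same downstream test set and compare the task metric directly, rather than perplexity. Everything here is hypothetical (the two stand-in "models" and the tiny sentiment task); the point is only the shape of the comparison:

```python
import numpy as np

def downstream_accuracy(predict, X_test, y_test):
    """Task-level score for one model; predict is any callable
    mapping an input to a predicted label (hypothetical interface)."""
    preds = np.array([predict(x) for x in X_test])
    return (preds == np.array(y_test)).mean()

# Hypothetical downstream test task shared by both models.
X_test = ["great movie", "terrible plot", "loved it", "boring"]
y_test = [1, 0, 1, 0]

model_a = lambda x: int("great" in x or "loved" in x)  # stand-in model A
model_b = lambda x: int("boring" not in x)             # stand-in model B

print("model A:", downstream_accuracy(model_a, X_test, y_test))
print("model B:", downstream_accuracy(model_b, X_test, y_test))
```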