- **Benchmark tradeoffs**
    - baseline too high: no one can beat it
    - baseline too low: no differentiation
- **Close-ended evaluation**
    - use standard ML metrics ("accuracy") because there is one of a few known answers
    - example tasks: SST, IMDB, Yelp; SNLI
    - most common multi-task benchmark: SuperGLUE
    - difficulties:
        - which metrics do you choose?
        - how to aggregate across metrics (a plain average? see the macro-averaging sketch after this outline)
        - label statistics and spurious correlations
- **Open-ended evaluation**
    - long generations with too many correct answers, so classic ML metrics can't be applied directly
    - there are better and worse answers (quality is relative)
- **Content overlap metrics**
    - compare lexical similarity between generated and gold text, usually with n-gram overlap metrics: BLEU (usually considered a precision metric), ROUGE (usually considered a recall metric), METEOR, CIDEr, etc. (see the n-gram overlap sketch after this outline)
    - doesn't consider semantic relatedness, but is fast!
- **Semantic metrics**
    - BERTScore: get contextual embeddings of each sequence from a BERT model, then match tokens by embedding similarity and average the matches
    - word embeddings: average all the token embeddings of each text and compare the averages, e.g. with cosine similarity (see the embedding-averaging sketch after this outline)
    - BLEURT: start from a pretrained BERT, continue pretraining it on synthetic pairs with signals like BLEU, then fine-tune on human annotation data
- **Model-based metrics**
    - AlpacaEval and MT-Bench: ask GPT-4 to score a particular sample (see the judge sketch after this outline)
    - worries: self-bias (a judge favoring outputs similar to its own), need for length normalization
- **Humans!**
    - automatic evaluations need to be compared against what humans would have judged
    - ask humans to evaluate along some axis ("fluency", "coherence", etc.)
    - slow and expensive
    - inter-annotator disagreement; intra-annotator disagreement over time
    - not reproducible
    - is a measure of precision, not recall
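To make the aggregation question concrete: one common choice is to compute accuracy per task and then macro-average across tasks, so every task counts equally regardless of size. This is a minimal sketch; the task names and predictions below are invented for illustration, not from any real benchmark run.

```python
from typing import Dict, List


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def macro_average(per_task: Dict[str, float]) -> float:
    """Unweighted mean over tasks: each task contributes equally."""
    return sum(per_task.values()) / len(per_task)


# Hypothetical predictions and labels for two close-ended tasks.
results = {
    "SST (sentiment)": accuracy(["pos", "neg", "pos"], ["pos", "neg", "neg"]),
    "SNLI (entailment)": accuracy(["entail", "contradict"], ["entail", "entail"]),
}
print(results, "macro avg:", macro_average(results))
```

Whether a plain macro-average is the right aggregation is exactly the open question noted above; weighting by task size or difficulty gives different leaderboards.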
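A toy illustration of the n-gram overlap idea: BLEU-style scores ask what fraction of the candidate's n-grams appear in the reference (precision), while ROUGE-style scores ask what fraction of the reference's n-grams are recovered by the candidate (recall). This sketch omits the real metrics' details (brevity penalty, smoothing, multiple references, higher-order n-gram combination).

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_precision(candidate, reference, n=1):
    """BLEU-flavored: clipped fraction of candidate n-grams found in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / max(sum(cand.values()), 1)


def overlap_recall(candidate, reference, n=1):
    """ROUGE-flavored: fraction of reference n-grams recovered by the candidate."""
    return overlap_precision(reference, candidate, n)


gen = "the cat sat on the mat".split()
gold = "a cat was sitting on the mat".split()
print(overlap_precision(gen, gold, n=1), overlap_recall(gen, gold, n=1))
```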
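The word-embedding variant of semantic metrics fits in a few lines: embed each token, average the vectors per text, and compare the averages with cosine similarity. The tiny embedding table below is invented purely for illustration; in practice the vectors would come from something like GloVe or word2vec, or from a BERT encoder in the BERTScore case.

```python
import numpy as np

# Toy 3-d "embedding table"; real systems would load pretrained vectors or a BERT model.
EMB = {
    "good": np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "movie": np.array([0.1, 0.9, 0.3]),
    "film": np.array([0.2, 0.8, 0.4]),
    "terrible": np.array([-0.7, 0.1, 0.5]),
}


def text_vector(tokens):
    """Average the embeddings of all known tokens (out-of-vocabulary words are skipped)."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Paraphrases score high, opposites lower, even with zero lexical overlap.
print(cosine(text_vector("great movie".split()), text_vector("good film".split())))
print(cosine(text_vector("great movie".split()), text_vector("terrible film".split())))
```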

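The AlpacaEval / MT-Bench style of evaluation reduces to a judge prompt plus a call to a strong model. The snippet below is a minimal sketch assuming the OpenAI Python client; the prompt wording, model name, and score parsing are illustrative, not the actual AlpacaEval or MT-Bench implementation.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer's helpfulness from 1 to 10 and reply with just the number."
)


def judge_score(question: str, answer: str, model: str = "gpt-4o") -> int:
    """Ask a judge model for a 1-10 rating and parse the first integer in its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    text = response.choices[0].message.content
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


# The caveats from the notes apply here: judges can favor outputs like their own
# (self-bias) and longer answers, so scores are often length-normalized or
# collected as pairwise comparisons instead of absolute ratings.
```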