Evaluation is quite hard: you need a principled way to estimate a model's ability.

**Classical Test Theory (CTT)**
- "Just average each test" (think MUC, B³, etc.)
- Gives test-dependent ability estimation
- Bad, because each test may have a different difficulty

**Item Response Theory (IRT)**
- Models both item and test-taker characteristics
- Gives test-invariant (subset-invariant) ability estimation
- Enables adaptive testing
- Problem: requires calibration first, which is quite costly

**Flash-HELM**
- HELM, but prioritizing higher-ranked models: evaluate good models more.

**Sang's Method**
We want to estimate $\theta$ with a budget of $K$ questions.
- The test taker's ability is fixed but unknown: $\theta \sim p(\theta)$.
- There is some difficulty function $z(q) \to Z \in \triangle$ for a question $q \in Q$.
- The response model is then $p(y = 1 \mid z; \theta) = \sigma(\theta - z)$.
- For every candidate question, we ask what its Fisher information is, and ask the most informative one (see the first two sketches below).
- After every test result, we update the ability estimate in the response model via MLE.

**Amortized calibration**
- Learn to compute the calibration difficulty $z$ directly, rather than calibrating each item from scratch (see the third sketch below).

**Advantages**
- More reliable and efficient across empirical settings
- Incorporates amortized (learned) calibration to reduce calibration costs
- Introduces conditional question generation to produce questions of specific difficulties (see the last sketch below)
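To make the response model concrete, here is a minimal sketch of the one-parameter logistic (Rasch-style) model the notes describe, together with the Fisher information an item of difficulty $z$ carries about $\theta$. The function names are mine, not from the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def response_prob(theta, z):
    """p(y = 1 | z; theta) = sigma(theta - z): chance of a correct answer."""
    return sigmoid(theta - z)

def fisher_information(theta, z):
    """Fisher information about theta from one item of difficulty z.

    For this one-parameter logistic model, I(theta; z) = p * (1 - p),
    which peaks when the item difficulty z matches the ability theta.
    """
    p = response_prob(theta, z)
    return p * (1.0 - p)

# The most informative question is the one whose difficulty is closest
# to the current ability estimate.
print(fisher_information(0.0, 0.0))   # 0.25, the maximum
print(fisher_information(0.0, 3.0))   # ~0.045, much less informative
```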
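Putting the pieces together, a toy adaptive-testing loop under the same assumptions: pick the unasked question with maximal Fisher information at the current estimate, observe a (simulated) response, and refit $\theta$ by MLE. The question pool, budget, true ability, and grid-search MLE are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mle_theta(zs, ys, grid=np.linspace(-4.0, 4.0, 801)):
    """MLE of theta over a 1-D grid: maximize the Bernoulli log-likelihood
    of the observed responses under p(y=1|z;theta) = sigma(theta - z)."""
    p = sigmoid(grid[:, None] - np.asarray(zs)[None, :])      # (grid, items)
    y = np.asarray(ys, dtype=float)[None, :]
    loglik = (y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

pool = rng.normal(0.0, 1.5, size=500)   # pre-calibrated difficulties z(q)
theta_true = 0.7                        # unknown to the procedure
K = 30                                  # question budget
asked, zs, ys = set(), [], []
theta_hat = 0.0                         # initial ability estimate

for _ in range(K):
    # Score every remaining question by its Fisher information p(1 - p)
    # at the current ability estimate, and ask the most informative one.
    candidates = np.array([i for i in range(len(pool)) if i not in asked])
    p = sigmoid(theta_hat - pool[candidates])
    i = int(candidates[np.argmax(p * (1.0 - p))])
    asked.add(i)
    # Simulate the test taker's correct/incorrect response.
    y = int(rng.random() < sigmoid(theta_true - pool[i]))
    zs.append(pool[i]); ys.append(y)
    theta_hat = mle_theta(zs, ys)       # refit ability after each result

print(f"true theta = {theta_true:.2f}, estimate after {K} items = {theta_hat:.2f}")
```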
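The notes only say that amortized calibration learns to compute $z$; one plausible minimal sketch, assuming each question comes with a feature embedding, is a ridge regression from embeddings of already-calibrated questions to their empirical difficulties. The embedding dimensionality and training data here are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data: embeddings of already-calibrated questions
# paired with their empirically estimated difficulties z. In practice the
# embeddings would come from a text encoder; here they are random stand-ins.
n, d = 2000, 64
X_train = rng.normal(size=(n, d))
w_true = rng.normal(size=d) / np.sqrt(d)
z_train = X_train @ w_true + 0.1 * rng.normal(size=n)   # noisy calibrated z

# Amortized calibrator: ridge regression z_hat(q) = w^T embed(q).
lam = 1.0
w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d), X_train.T @ z_train)

# New, uncalibrated questions now get difficulty estimates for free,
# with no per-item calibration runs.
X_new = rng.normal(size=(5, d))
z_new = X_new @ w
print("predicted difficulties for new questions:", np.round(z_new, 2))
```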
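The notes name conditional question generation without giving a mechanism. One illustrative realization (not necessarily the authors' method) is generate-then-filter: sample candidate questions, score each with the amortized calibrator, and keep those whose predicted difficulty lies near the target. `generate_candidates` is a hypothetical stub for a real generator such as a prompted LLM.

```python
import numpy as np

rng = np.random.default_rng(2)

def generate_candidates(n, d):
    """Hypothetical stand-in for a question generator: in practice this
    would produce text; here it returns random question embeddings."""
    return rng.normal(size=(n, d))

def generate_at_difficulty(z_target, w, d=64, n_candidates=200, tol=0.1):
    """Generate-then-filter: keep candidates whose predicted difficulty
    (from the amortized calibrator w) falls within tol of the target."""
    cands = generate_candidates(n_candidates, d)
    z_hat = cands @ w                       # amortized difficulty estimates
    keep = np.abs(z_hat - z_target) < tol
    return cands[keep], z_hat[keep]

w = rng.normal(size=64) / 8.0               # hypothetical calibrator weights
qs, zs = generate_at_difficulty(z_target=1.0, w=w)
print(f"kept {len(qs)} candidates with predicted z near 1.0")
```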