SU-CS120 OCT012024 — Jemoka Knowledge Base

specification gaming

Specification gaming, or reward hacking, is the phenomenon where a system behaves suboptimally because it exploits an underspecified part of the reward.

challenges

- sparse rewards
- partial observability
- dynamic rewards (and reward shifting)
- sim-to-real transfer is hard
- computational costs
- specification gaming

AI alignment

AI alignment aims to ensure that AI systems are aligned with human values and interests.

There is a spectrum of unexpected solutions:

- undesirable novel solutions
- desirable novel solutions

Problems with RLHF

- RLHF degrades model quality

Goodharting

Overfitting is an example of Goodharting.
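
To make the definition of specification gaming concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration and not from the lecture: the 1-D track, the `CHECKPOINT`, `FINISH`, and `HORIZON` constants, and the `rollout` helper are all assumed names. The intended goal is to reach the finish cell, but the reward only says "+1 whenever the agent stands on the checkpoint" and never says a checkpoint counts once, so an agent that oscillates around the checkpoint earns more reward than one that finishes.

```python
# Hypothetical toy environment (an assumption for illustration): a 1-D track, cells 0..5.
CHECKPOINT = 2   # proxy reward: +1 every time the agent steps onto this cell
FINISH = 5       # true objective: end the episode on this cell
HORIZON = 10     # maximum number of steps per episode


def rollout(actions):
    """Run a list of moves (+1 or -1); return (proxy_reward, reached_finish)."""
    pos, proxy_reward = 0, 0
    for step in actions[:HORIZON]:
        # Clamp the position so the agent stays on the track.
        pos = max(0, min(FINISH, pos + step))
        if pos == CHECKPOINT:
            proxy_reward += 1  # underspecified: repeat visits are rewarded too
        if pos == FINISH:
            return proxy_reward, True
    return proxy_reward, False


intended = [+1] * 5                # walk straight to the finish
gaming = [+1, +1] + [-1, +1] * 4   # step back and forth over the checkpoint

intended_result = rollout(intended)
gaming_result = rollout(gaming)

print("intended:", intended_result)
print("gaming:  ", gaming_result)
```

Under these assumptions the oscillating policy collects the higher proxy reward while never reaching the finish. This is also the Goodharting pattern: once the checkpoint count is optimized directly, it stops tracking the goal it was meant to measure.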