specification gaming

Specification gaming, or reward hacking, is the phenomenon where a system performs poorly on the intended task because it exploits an underspecified part of the reward.

challenges
- sparse rewards
- partial observability
- dynamic rewards (and reward shifting)
- sim-to-real transfer is hard
- computational costs
- specification gaming

AI alignment

AI alignment is the work of ensuring that AI systems are aligned with human values and interests. There is a spectrum of unexpected solutions:
- undesirable novel solutions
- desirable novel solutions

Problems with RLHF
- RLHF degrades model quality

Goodharting

Overfitting is an example of Goodharting: the measured score keeps improving while the thing it was meant to measure does not.
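
A minimal sketch of the overfitting case, assuming a hypothetical toy task (y = 2x plus noise) that is not from these notes. Training error plays the role of the proxy being optimized, and held-out error stands in for the real goal. The names `make_data`, `fit_line`, and `fit_memorizer` are illustrative only.

```python
import random


def make_data(n, rng):
    # Hypothetical task (an assumption for illustration): y = 2x + Gaussian noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in xs]


def mse(predict, data):
    # Mean squared error of a predictor on a list of (x, y) pairs.
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def fit_line(train):
    # Ordinary least-squares line: a simple model that cannot memorize noise.
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def fit_memorizer(train):
    # 1-nearest-neighbour lookup: returns the label of the closest training
    # point, so it reproduces the training labels (noise included) exactly.
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)
train = make_data(100, rng)
test = make_data(2000, rng)

line = fit_line(train)
memo = fit_memorizer(train)

# Proxy: error on the training set. Goal: error on held-out data.
print("line      train/test MSE:", mse(line, train), mse(line, test))
print("memorizer train/test MSE:", mse(memo, train), mse(memo, test))
```

The memorizer wins on the proxy (zero training error) yet should do worse than the plain line on held-out data, which is the Goodhart pattern: pushing the measure to its extreme stops tracking the goal.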
