About

I am a PhD student in machine learning at the University of Cambridge, working on AI safety with David Krueger. I think it’s plausible that within the next decade we’ll build AI systems capable of doing everything humans can do in front of a computer, and I hope to ensure that the (inevitably widespread) deployment of such systems won’t lead humanity to permanently lose control over our future.

I earned my master’s degree in AI from the University of Amsterdam, and worked with UC Berkeley’s Center for Human-Compatible AI during and after my studies. I also spent a year working on deep RL and robotics at Sony AI in Zurich.

Selected publications

See the full list on Google Scholar.

  1. [work in progress] Steering clear: a systematic study of activation steering in a toy setup. Dmitrii Krasheninnikov, David Krueger. MINT workshop at NeurIPS 2024. Paper.

  2. Stress-testing capability elicitation with password-locked models. Ryan Greenblatt*, Fabien Roger*, Dmitrii Krasheninnikov, David Krueger. NeurIPS 2024. Paper.

  3. Implicit meta-learning may lead language models to trust more reliable sources. Dmitrii Krasheninnikov*, Egor Krasheninnikov*, Bruno Mlodozeniec, David Krueger. ICML 2024. Paper, poster, code.

  4. Defining and characterizing reward hacking. Joar Skalse*, Nikolaus Howe, Dmitrii Krasheninnikov, David Krueger*. NeurIPS 2022. Paper.

  5. Preferences implicit in the state of the world. Rohin Shah*, Dmitrii Krasheninnikov*, Jordan Alexander, Anca Dragan, Pieter Abbeel. ICLR 2019. Paper, blog post, poster, code.

* Equal contribution