I am a PhD student in machine learning at the University of Cambridge, working on AI safety with David Krueger and Rich Turner. I think it’s plausible that within the next decade we’ll build AI systems capable of doing everything humans can do with a computer, and I hope to ensure that the inevitably widespread deployment of such systems doesn’t lead humanity to permanently lose control over our future.
Before Cambridge, I earned my MSc in AI from the University of Amsterdam, and worked with UC Berkeley’s Center for Human-Compatible AI during and after my studies. I also spent a year at Sony AI Zurich researching deep RL and robotics.
Selected publications
See full list on Google Scholar.
- Stress-testing capability elicitation with password-locked models. Ryan Greenblatt*, Fabien Roger*, Dmitrii Krasheninnikov, David Krueger. NeurIPS 2024. Paper.
- Implicit meta-learning may lead language models to trust more reliable sources. Dmitrii Krasheninnikov*, Egor Krasheninnikov*, Bruno Mlodozeniec, David Krueger. ICML 2024. Paper, poster, code.
- Defining and characterizing reward hacking. Joar Skalse*, Nikolaus Howe, Dmitrii Krasheninnikov, David Krueger*. NeurIPS 2022. Paper.
- Preferences implicit in the state of the world. Rohin Shah*, Dmitrii Krasheninnikov*, Jordan Alexander, Anca Dragan, Pieter Abbeel. ICLR 2019. Paper, blog post, poster, code.
\* Equal contribution