I am a final-year PhD student at Cambridge, working on AI safety with David Krueger and Rich Turner. I think we might be less than a decade away from AI systems capable of doing everything humans can do using computers – including further AI research. I want to make sure this won’t lead humanity to permanently lose control of our future.

I’m into a pretty broad range of research topics, lately focusing on a mix of interpretability / science of deep learning / control / security. My claim to fame is coining “out-of-context learning” in an early version of this paper. More recently, we showed that language models linearly encode when during training they learned a given fact.

Before Cambridge, I earned an MSc in AI from the University of Amsterdam, and worked with UC Berkeley’s Center for Human-Compatible AI during and after my studies. I also spent a year at Sony AI Zurich researching deep RL and robotics.

Selected publications

See full list on Google Scholar.

  1. Fresh in memory: Training-order recency is linearly encoded in language model activations. Dmitrii Krasheninnikov, Richard E. Turner, David Krueger. Best paper runner-up at the MemFM workshop @ ICML 2025. Paper, summary.

  2. Detecting High-Stakes Interactions with Activation Probes. Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov. NeurIPS 2025; also outstanding paper at the Actionable Interpretability workshop @ ICML 2025. Paper, blog post.

  3. Stress-testing capability elicitation with password-locked models. Ryan Greenblatt*, Fabien Roger*, Dmitrii Krasheninnikov, David Krueger. NeurIPS 2024. Paper, blog post.

  4. Implicit meta-learning may lead language models to trust more reliable sources. Dmitrii Krasheninnikov*, Egor Krasheninnikov*, Bruno Mlodozeniec, David Krueger. ICML 2024. Paper, summary, poster, code.

  5. Defining and characterizing reward hacking. Joar Skalse*, Nikolaus Howe, Dmitrii Krasheninnikov, David Krueger*. NeurIPS 2022. Paper.

  6. Preferences implicit in the state of the world. Rohin Shah*, Dmitrii Krasheninnikov*, Jordan Alexander, Anca Dragan, Pieter Abbeel. ICLR 2019. Paper, blog post, poster, code.

* Equal contribution