I am a final-year PhD student at Cambridge, working on AI safety with David Krueger and Rich Turner. I think we might be less than a decade away from AI systems capable of doing everything humans can do using computers – including further AI research. I want to make sure this won’t lead humanity to permanently lose control of our future.

I’m into a pretty broad range of research topics, lately focusing on a mix of interpretability / science of deep learning / control / security. My claim to fame is coining “out-of-context learning” in an early version of this paper. More recently, we showed that language models linearly encode when during training they learned a given fact.

Before Cambridge, I earned an MSc in AI from the University of Amsterdam, and worked with UC Berkeley’s Center for Human-Compatible AI during and after my studies. I also spent a year at Sony AI Zurich researching deep RL and robotics.

Selected publications

See full list on Google Scholar.

  1. Fresh in memory: Training-order recency is linearly encoded in language model activations. Dmitrii Krasheninnikov, Richard E. Turner, David Krueger. Best paper runner-up at the MemFM workshop @ ICML 2025. Paper, summary.

  2. Detecting High-Stakes Interactions with Activation Probes. Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov. NeurIPS 2025; also outstanding paper at the Actionable Interpretability workshop @ ICML 2025. Paper, blog post.

  3. Stress-testing capability elicitation with password-locked models. Ryan Greenblatt*, Fabien Roger*, Dmitrii Krasheninnikov, David Krueger. NeurIPS 2024. Paper, blog post.

  4. Implicit meta-learning may lead language models to trust more reliable sources. Dmitrii Krasheninnikov*, Egor Krasheninnikov*, Bruno Mlodozeniec, David Krueger. ICML 2024. Paper, summary, poster, code.

  5. Defining and characterizing reward hacking. Joar Skalse*, Nikolaus Howe, Dmitrii Krasheninnikov, David Krueger*. NeurIPS 2022. Paper.

  6. Preferences implicit in the state of the world. Rohin Shah*, Dmitrii Krasheninnikov*, Jordan Alexander, Anca Dragan, Pieter Abbeel. ICLR 2019. Paper, blog post, poster, code.

* Equal contribution