About

I am a PhD student in machine learning at the University of Cambridge, working on AI safety with David Krueger. I think it’s plausible that we’ll build AI systems capable of doing everything humans can do in front of computers within the next decade, and I hope to ensure that the (inevitably widespread) deployment of such systems won’t lead humanity to permanently lose control over our future.

I earned my master’s degree in AI from the University of Amsterdam and had the opportunity to work with UC Berkeley’s Center for Human-Compatible AI during and after my studies. I also spent a year working on deep RL and robotics at Sony AI Zurich.

Publications

  1. Stress-testing capability elicitation with password-locked models. Ryan Greenblatt*, Fabien Roger*, Dmitrii Krasheninnikov, David Krueger. NeurIPS 2024. Paper.

  2. Implicit meta-learning may lead language models to trust more reliable sources. Dmitrii Krasheninnikov*, Egor Krasheninnikov*, Bruno Mlodozeniec, David Krueger. ICML 2024. Paper, poster, code.

  3. Open problems and fundamental limitations of reinforcement learning from human feedback. Stephen Casper*, Xander Davies*, and 30 coauthors including Dmitrii Krasheninnikov. TMLR 2023. Paper.

  4. Harms from increasingly agentic algorithmic systems. Alan Chan and 21 coauthors including Dmitrii Krasheninnikov. ACM FAccT 2023. Paper.

  5. Assistance with large language models. Dmitrii Krasheninnikov*, Egor Krasheninnikov*, David Krueger. InterNLP, Human in the Loop Learning, and ML Safety workshops at NeurIPS 2022. Paper.

  6. Defining and characterizing reward hacking. Joar Skalse*, Nikolaus Howe, Dmitrii Krasheninnikov, David Krueger*. NeurIPS 2022. Paper.

  7. Benefits of assistance over reward learning. Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell. Best paper award at the Cooperative AI workshop at NeurIPS 2020. Paper, code.

  8. Combining reward information from multiple sources. Dmitrii Krasheninnikov, Rohin Shah, Herke van Hoof. Learning with Rich Experience and Safety & Robustness in Decision Making workshops at NeurIPS 2019. Paper, poster.

  9. Preferences implicit in the state of the world. Rohin Shah*, Dmitrii Krasheninnikov*, Jordan Alexander, Anca Dragan, Pieter Abbeel. ICLR 2019. Paper, blog post, poster, code.

* Equal contribution