About

I am a PhD student in machine learning at the University of Cambridge, working on AI safety with David Krueger. I think it's plausible that within the next decade or two we'll build AI systems capable of doing everything humans can do in front of a computer, and I am hoping to ensure that the (inevitably widespread) deployment of such systems won't lead humanity to permanently lose control over its future.

I earned my master’s degree in AI from the University of Amsterdam, and had the opportunity to work with UC Berkeley’s Center for Human-Compatible AI during and after my studies. I also spent a year working on deep RL and robotics at Sony AI Zurich.

Publications

Meta- (Out-of-Context) Learning in Neural Networks. Dmitrii Krasheninnikov*, Egor Krasheninnikov*, Bruno Mlodozeniec, David Krueger. Workshop on Understanding Foundation Models at ICLR 2023. Paper, poster, code.

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Stephen Casper*, Xander Davies*, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell. Paper.

Harms from Increasingly Agentic Algorithmic Systems. Alan Chan, Rebecca Salganik, Alva Markelius, Chris Pang, Nitarshan Rajkumar, Dmitrii Krasheninnikov, Lauro Langosco, Zhonghao He, Yawen Duan, Micah Carroll, Michelle Lin, Alex Mayhew, Katherine Collins, Maryam Molamohammadi, John Burden, Wanru Zhao, Shalaleh Rismani, Konstantinos Voudouris, Umang Bhatt, Adrian Weller, David Krueger, Tegan Maharaj. ACM FAccT 2023. Paper.

Assistance with Large Language Models. Dmitrii Krasheninnikov*, Egor Krasheninnikov*, David Krueger. InterNLP, Human in the Loop Learning, and ML Safety workshops at NeurIPS 2022. Paper.

Defining and Characterizing Reward Hacking. Joar Skalse*, Nikolaus Howe, Dmitrii Krasheninnikov, David Krueger*. NeurIPS 2022. Paper.

Benefits of Assistance over Reward Learning. Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell. Best paper award at the Cooperative AI workshop at NeurIPS 2020. Paper, code.

Combining Reward Information from Multiple Sources. Dmitrii Krasheninnikov, Rohin Shah, Herke van Hoof. Learning with Rich Experience and Safety & Robustness in Decision Making workshops at NeurIPS 2019. Paper, poster.

Preferences Implicit in the State of the World. Rohin Shah*, Dmitrii Krasheninnikov*, Jordan Alexander, Anca Dragan, Pieter Abbeel. International Conference on Learning Representations (ICLR) 2019. Paper, blog post, poster, code.

* Equal contribution