CV · Scholar · Twitter · Causal Incentives Working Group

I’m a final-year PhD student at Oxford, supervised by Robin Evans, where I work on theory involving causal models. I’m also a cofounder of the Causal Incentives Working Group, which uses causal models to reason about AI safety. Previously, I’ve been a research fellow at the Future of Humanity Institute, a research intern at DeepMind and OpenAI, and the founder of the EA Forum.

My research

I've been especially interested in finding concepts and tools for modelling AI safety problems.

One interesting problem is how to design a corrigible system - one that wants to follow its instructions rather than manipulate them. Even systems that try to learn the human's goals may be incorrigible. Moreover, corrigible systems may still behave unsafely, whereas "shutdown instructable" systems are safe.

A second problem is how to identify and shape an agent's incentives - such as whether its goal compels it to respond (un)fairly to sensitive demographic characteristics, or to influence delicate parts of the environment (un)safely. Sometimes the causal structure alone suffices to identify these incentives; this is closely related to the problem of identifying nonrequisite edges in an influence diagram. One can also modify an AI system so that it won't "try" to influence a delicate variable, and in fact many past safe AI algorithms implicitly follow this template.
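To make the nonrequisite-edge idea concrete, here is a minimal sketch (my own illustration, not code from any of the papers below) that checks which parents of a decision node are nonrequisite observations via d-separation in networkx. The graph, variable names, and single-decision setting are all assumptions made up for the example.

```python
# Sketch: finding nonrequisite observations in a single-decision influence diagram.
# An observation X in Pa(D) is nonrequisite if X is d-separated from the utility
# nodes downstream of D, given D and the remaining parents of D.
import networkx as nx

def _d_separated(G, x, y, z):
    # networkx renamed d_separated to is_d_separator in version 3.3
    fn = getattr(nx, "is_d_separator", None) or nx.d_separated
    return fn(G, x, y, z)

def nonrequisite_observations(G, decision, utility_nodes):
    """Parents of `decision` whose values cannot affect attainable expected utility."""
    downstream_utils = set(utility_nodes) & nx.descendants(G, decision)
    parents = set(G.predecessors(decision))
    nonrequisite = set()
    for x in parents:
        conditioning_set = (parents - {x}) | {decision}
        if _d_separated(G, {x}, downstream_utils, conditioning_set):
            nonrequisite.add(x)
    return nonrequisite

# Hypothetical diagram: a sensitive attribute X is observed by the decision D,
# but only D (not X) directly affects the utility U.
G = nx.DiGraph([("X", "D"), ("D", "U")])
print(nonrequisite_observations(G, "D", {"U"}))  # {'X'}: the edge X -> D is nonrequisite

# If X also affects U directly, the observation becomes requisite.
G.add_edge("X", "U")
print(nonrequisite_observations(G, "D", {"U"}))  # set()
```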

A third problem is specification gaming - where a system pursues an extreme version of its assigned goal, rather than the intended goal. One proposed remedy is for the AI system to quantilise the assigned objective, i.e. to sample from the best n% of actions performed by a human demonstrator. Quantilisation has some nice properties, but they don't hold for all kinds of goal mis-specification.
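As a toy illustration of the general idea (a sketch of my own, not code from the paper; the action space, proxy utility, and quantile are all made up):

```python
# Sketch of a q-quantilizer: rather than maximising a possibly mis-specified
# proxy utility, sample uniformly from the top q fraction of actions drawn
# from a human demonstrator's distribution.
import random

def quantilize(demonstrated_actions, proxy_utility, q=0.1):
    """Sample one action from the best q fraction of demonstrated actions."""
    ranked = sorted(demonstrated_actions, key=proxy_utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return random.choice(ranked[:cutoff])

# Hypothetical example: the proxy utility rewards extreme actions, which a pure
# maximiser would exploit; the quantilizer stays close to typical human behaviour.
human_actions = [random.gauss(0.0, 1.0) for _ in range(10_000)]
print(quantilize(human_actions, proxy_utility=abs, q=0.05))
```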

Since many of these analyses rely on graphical causal models, my PhD studies causality itself, including how best to represent marginalisation and conditionalisation in causal graphs.
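One standard way to represent marginalisation over latent variables is the latent projection of a DAG onto its observed nodes. Below is a minimal sketch of that construction (my own illustration on a made-up graph), returning the directed and bidirected edges of the projected ADMG.

```python
# Sketch of the latent projection: marginalising a causal DAG onto its observed
# nodes yields an ADMG, with directed edges from directed paths through latents
# and bidirected edges from latent common causes reachable through latents.
import networkx as nx
from itertools import combinations

def _observed_reachable(G, start, latents):
    """Observed nodes reachable from `start` by directed paths whose interior is latent."""
    reached, seen, stack = set(), set(), [start]
    while stack:
        node = stack.pop()
        for child in G.successors(node):
            if child in latents:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
            else:
                reached.add(child)
    return reached

def latent_projection(G, observed):
    latents = set(G.nodes) - set(observed)
    directed = {(a, b) for a in observed for b in _observed_reachable(G, a, latents)}
    bidirected = set()
    for latent in latents:
        for a, b in combinations(sorted(_observed_reachable(G, latent, latents)), 2):
            bidirected.add((a, b))
    return directed, bidirected

# Hypothetical example: marginalise out the latent confounder L and latent mediator M.
G = nx.DiGraph([("L", "X"), ("L", "Y"), ("X", "M"), ("M", "Y")])
print(latent_projection(G, observed={"X", "Y"}))
# directed: {('X', 'Y')} via the latent M; bidirected: {('X', 'Y')} via the latent L
```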

Selected Publications

Human Control: Definitions and Algorithms: We study definitions of human control, including variants of corrigibility and alignment, the assurances they offer for human autonomy, and the algorithms that can be used to obtain them. Ryan Carey, Tom Everitt. UAI. 2023.

Reasoning about Causality in Games: Introduces (structural) causal games, a single modelling framework that allows for both causal and game-theoretic reasoning. Lewis Hammond, James Fox, Tom Everitt, Ryan Carey, Alessandro Abate, Michael Wooldridge. Artificial Intelligence Journal, 2023.

Path-Specific Objectives for Safer Agent Incentives: How do you tell an ML system to optimize an objective, but not by any means? E.g. optimize user engagement without manipulating the user? Sebastian Farquhar, Ryan Carey, Tom Everitt. AAAI. 2022.

A Complete Criterion for Value of Information in Soluble Influence Diagrams: Presents a complete graphical criterion for value of information in influence diagrams with more than one decision node, along with ID homomorphisms and trees of systems. Chris van Merwijk*, Ryan Carey*, Tom Everitt. AAAI. 2022.

Why Fair Labels Can Yield Unfair Predictions: Graphical Conditions for Introduced Unfairness: When is unfairness incentivized? Perhaps surprisingly, unfairness can be incentivized even when labels are completely fair. Carolyn Ashurst, Ryan Carey, Silvia Chiappa, Tom Everitt. AAAI. 2022.

Agent Incentives: A Causal Perspective: An agent’s incentives are largely determined by its causal context. This paper gives sound and complete graphical criteria for four incentive concepts: value of information, value of control, response incentives, and control incentives. Tom Everitt*, Ryan Carey*, Eric Langlois*, Pedro A. Ortega, Shane Legg. AAAI. 2021.

Incorrigibility in the CIRL Framework: A study of how the value learning method, cooperative inverse reinforcement learning, may not prevent incorrigible behaviour. Ryan Carey. AIES. 2018.

Other Writing


[firstname].[lastname]@jesus.ox.ac.uk