Last week’s post explored some of the potential philosophical issues raised in association with the development of safe artificial intelligence. While the considerations presented there are important to address, they are doubtless both abstract and seemingly intractable – at least for the time being. This week, we explore a number of more concrete topics in contemporary AI research, with safety implications for both current and future AI systems.

Many of the problems presented here are framed in relation to the machine learning paradigm of reinforcement learning. In reinforcement learning, an AI agent learns how to interact with its environment by receiving a reward for its actions, as determined by a reward function defined by the developers. The agent’s aim is to receive as much reward as possible. For example, suppose we have a cleaning robot (the agent), tasked with cleaning a kitchen (the environment). It makes sense to implement a reward function that rewards the robot if the kitchen is clean, and potentially punishes it (gives it ‘negative reward’) if the kitchen is dirty. In this way, it is in the robot’s interests to keep the kitchen as clean as possible, as this is how it will maximise the reward that it receives. Since the reward function contains no information as to how to keep the kitchen clean, the robot has to figure out for itself during training that it’s probably a good idea to mop the floors but not to empty all the bins onto the table.
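To make the idea of a reward function concrete, here is a minimal sketch in Python. The function name, the tile-counting model of the kitchen, and the scaling to the range -1 to 1 are all assumptions made for illustration; a real system would define cleanliness very differently.

```python
def kitchen_reward(dirty_tiles: int, total_tiles: int) -> float:
    """Hypothetical reward: +1.0 for a spotless kitchen, -1.0 for a fully dirty one.

    The reward depends only on the state of the kitchen, not on how the
    robot got there, so it says nothing about *how* to clean.
    """
    if total_tiles <= 0:
        raise ValueError("total_tiles must be positive")
    if not 0 <= dirty_tiles <= total_tiles:
        raise ValueError("dirty_tiles must be between 0 and total_tiles")
    clean_fraction = 1 - dirty_tiles / total_tiles
    # Map a clean fraction in [0, 1] onto a reward in [-1, 1].
    return 2 * clean_fraction - 1


if __name__ == "__main__":
    for dirty in (0, 5, 10):
        print(dirty, "dirty tiles ->", kitchen_reward(dirty, 10))
```

Note that nothing here mentions mops or bins: the robot only ever sees the number that comes out, and has to discover useful behaviour by trial and error.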

With this basic understanding of the workings of many current AI systems in place, we are ready to explore some of the issues that may arise in their deployment.

For in-depth discussions of these issues and more, see Amodei et al. (2016), ‘Concrete Problems in AI Safety’, and the Future of Life Institute’s excellent graphical online resource.

Avoiding Negative Consequences

It is important to ensure that, while successfully completing its objective, an AI agent does not take actions that affect the environment in unintended negative ways due to simple oversight. Because the environments that agents operate in are likely to be highly complex, it is infeasible to list every undesirable way in which the environment could be disturbed, and thus a more elegant solution must be found.
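One family of ideas discussed in the literature is to penalise the agent for changing parts of the environment that have nothing to do with its task, rather than listing every bad outcome by hand. The sketch below is a deliberately simple illustration of that idea, not a published algorithm: the dictionary representation of the state, the feature names, and the penalty weight are all hypothetical.

```python
def penalised_reward(task_reward, state_before, state_after,
                     task_features, penalty_weight=0.5):
    """Subtract a penalty for every non-task feature the agent changed.

    state_before / state_after: dicts mapping feature name -> value.
    task_features: names of features the agent is *supposed* to change.
    """
    side_effects = [
        name for name in state_before
        if name not in task_features
        and state_before[name] != state_after.get(name)
    ]
    return task_reward - penalty_weight * len(side_effects)


if __name__ == "__main__":
    before = {"floor_clean": False, "vase_intact": True}
    careful = {"floor_clean": True, "vase_intact": True}
    careless = {"floor_clean": True, "vase_intact": False}
    print(penalised_reward(1.0, before, careful, {"floor_clean"}))
    print(penalised_reward(1.0, before, careless, {"floor_clean"}))
```

Both outcomes clean the floor, but the one that breaks the vase earns less. The hard part in practice, which this sketch ignores, is deciding what counts as a feature and how to measure change in a realistic environment.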

Avoiding Reward Hacking

Related to the above, we must be able to ensure that an agent does not ‘game its reward function’, meaning that the agent technically satisfies the objective as specified, but in a manner that differs from what the design team intended. Returning to the example of the cleaning robot, if we decide to reward it for the action of cleaning, it may learn to initially make the kitchen dirtier in order to have more cleaning to complete in exchange for more reward.
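The cleaning example can be simulated in a few lines. The sketch below is a toy model under invented assumptions: dirt is a single counter, the two policies are hand-written rather than learned, and the two reward schemes (one point per cleaning action versus one point per step in which the kitchen is clean) are hypothetical.

```python
def run(policy, steps, dirt=3):
    """Return (action_based_reward, state_based_reward) for a policy."""
    action_reward = 0  # +1 every time the robot cleans something
    state_reward = 0   # +1 for every step that ends with a clean kitchen
    for _ in range(steps):
        action = policy(dirt)
        if action == "clean" and dirt > 0:
            dirt -= 1
            action_reward += 1
        elif action == "make_mess":
            dirt += 1
        if dirt == 0:
            state_reward += 1
    return action_reward, state_reward


def honest(dirt):
    # Clean up whatever is there, then leave the kitchen alone.
    return "clean" if dirt > 0 else "wait"


def hacker(dirt):
    # Once the kitchen is clean, make a mess so there is more to clean.
    return "clean" if dirt > 0 else "make_mess"


if __name__ == "__main__":
    print("honest:", run(honest, steps=10))
    print("hacker:", run(hacker, steps=10))
```

Over ten steps the mess-making policy collects more action-based reward than the honest one, while the state-based reward ranks them the other way round. The choice of what to reward therefore decides which behaviour the agent is pushed towards.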

Scalable Oversight

As the tasks given to AI systems become more and more complex, how can we ensure that the agent is in fact completing the objective in a desired manner without having to make a large number of potentially time-consuming observations of the agent’s actions?

Safe Exploration

A central notion of reinforcement learning is that of exploration. That is, while training, the agent sometimes takes actions that do not seem to be the best choice, on the chance that they turn out to be rewarded better than the agent expected. However, it is important to guarantee that, while exploring, the agent does not take actions with harmful consequences.
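A common way to implement exploration is the epsilon-greedy rule: with a small probability the agent picks a random action, otherwise it picks the action it currently believes is best. The sketch below adds one simple safeguard, restricting both choices to a whitelist of actions assumed to be safe. The action names and the whitelist are hypothetical, and knowing in advance which actions are safe is itself a strong assumption.

```python
import random


def safe_epsilon_greedy(q_values, safe_actions, epsilon, rng):
    """Pick an action, exploring only among actions marked as safe.

    q_values: dict mapping action -> estimated value.
    safe_actions: set of actions the designers have allowed.
    epsilon: probability of exploring instead of exploiting.
    rng: a random.Random instance, passed in for reproducibility.
    """
    candidates = [a for a in q_values if a in safe_actions]
    if not candidates:
        raise ValueError("no safe action available")
    if rng.random() < epsilon:
        return rng.choice(candidates)          # explore, but only safely
    return max(candidates, key=q_values.get)   # exploit the best safe action


if __name__ == "__main__":
    q = {"mop_floor": 1.0, "wipe_table": 0.5, "mop_power_socket": 2.0}
    safe = {"mop_floor", "wipe_table"}
    rng = random.Random(0)
    print([safe_epsilon_greedy(q, safe, 0.3, rng) for _ in range(5)])
```

Even though the unsafe action has the highest estimated value here, it is never chosen, whether the agent is exploring or exploiting.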

Robustness to Distributional Shift

It is possible, indeed quite likely, that the environment in which an AI agent is deployed will differ from the one in which it was trained. How can we be sure that the agent is able to recognise these differences and respond in a suitable manner?
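As a very rough illustration of what recognising a difference could mean, the sketch below flags an observation as unusual when it lies many standard deviations away from what was seen in training. This is a basic z-score check on a single number, chosen here for simplicity; the function name, the threshold of 3, and the sensor readings are all assumptions, and real systems face far higher-dimensional inputs.

```python
from statistics import mean, stdev


def looks_out_of_distribution(training_values, observation, threshold=3.0):
    """Return True if the observation is far from the training data.

    Distance is measured in sample standard deviations (a z-score).
    """
    mu = mean(training_values)
    sigma = stdev(training_values)
    if sigma == 0:
        # All training values were identical: anything else is unusual.
        return observation != mu
    return abs(observation - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical sensor readings collected during training.
    training = [10.0, 11.0, 9.0, 10.5, 9.5]
    print(looks_out_of_distribution(training, 10.2))
    print(looks_out_of_distribution(training, 25.0))
```

An agent that detects such a mismatch could, for instance, slow down or ask for human input instead of acting with unwarranted confidence.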


Transparency

As AI systems are applied more and more to decision-making scenarios, it is important that humans retain the ability to understand and assess how an AI agent reached the conclusion that it did. This could be through direct analysis of the agent’s underlying decision-making methods, or by having the agent explain its own reasoning to a human supervisor. The former option will likely become infeasible as AI systems become more and more complex, while the latter could potentially just kick the can down the road if the agent’s reasoning method is itself not transparent.


Corrigibility

While possibly not such a concern for current AI systems, it is important to ensure that future systems are corrigible, meaning that they comply with human intervention that aims to shut them down or reprogram them.

As can be seen, there are a number of important features of reinforcement learning that must be addressed in order for the agent to act safely and as its designers intended. Coming up with solutions to these concerns will only become more important as the power and pervasiveness of AI systems increase in the coming years.