Unaligned Artificial Intelligence (AI) is widely considered by researchers to be one of the most serious potential causes of existential catastrophe that we face. The Oxford philosopher Toby Ord estimates that it carries approximately a one in ten chance of causing an existential catastrophe in the next century.

In this article we present a number of theses concerning how this could be the case. While they remain hotly debated hypotheticals, these propositions, if true, would suggest that an advanced AI system could pose a significant risk to humanity’s survival if we fail to instil it with goals that accurately encapsulate humanity’s values. In the next news post we will look at a number of more tangible open problems in current AI research that also play a significant role in how AI could pose an existential risk.


We begin with two theses formalised by the Swedish philosopher Nick Bostrom in his 2012 paper, ‘The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents’, that propose a relation between the intelligence and motivations of advanced AI agents. The first is known as ‘The Orthogonality Thesis’ and reads as follows:

The Orthogonality Thesis:

‘Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.’

If this proposition turns out to be true then, as researcher Phil Torres has put it, ‘there [isn’t] any principled reason to think that a superintelligent machine whose goal system is programmed to count the blades of grass on Harvard’s campus would stop and think, “I could be using my vast abilities to marvel at the cosmos, construct a ‘theory of everything’, and solve global poverty. This is a silly goal, so I’m going to refuse to do it.”’ Though this example is an amusing one, it is not hard to dream up more concerning scenarios in which an AI agent is given a clearly malicious task yet does not stop to consider the consequences of its actions.

The second of Bostrom’s theses is known as ‘The Instrumental Convergence Thesis’ and suggests that, regardless of its final goal, a superintelligent agent is likely to pursue a number of potentially dangerous instrumental sub-goals:

The Instrumental Convergence Thesis:

‘Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realised for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.’

One example of how this could endanger humanity, commonly given by Bostrom himself, is that of a ‘paperclip maximiser’. Imagine a sufficiently capable superintelligent AI agent given the task of producing as many paperclips as possible. The agent would eventually realise that, in order to complete its task as well as possible, it must cause the extinction of humanity, for two reasons. First, humans may have the power to turn the agent off or otherwise impede its progress on making paperclips, and so to avoid disturbance, it would be better if humans ceased to exist. Second, the matter that makes up the bodies of all humans contains atoms that would be more useful if they were in the form of paperclips (according to the agent’s final goal).


There are also a number of meta-ethical theses concerning the nature of human values. While these may seem abstract, if we are ever to imbue an AI system with ‘human values’, it is important that we know what ‘human values’ are. The three theses that address this are as follows:

The Perplexity of Value Thesis:

Despite centuries of philosophical enquiry into what human values are or should be, we seem to be nowhere near a consensus. In the absence of such a consensus, the developers of an advanced AI system will be left with a choice between competing ethical frameworks, leaving the door open to an AI system inheriting an unrepresentative picture of what humans value.

The Complexity of Value Thesis:

It is possible that, whatever human values turn out to be, they cannot be expressed in a succinct, mathematical, and objective form. This poses a problem when it comes to trying to implement these values in an AI system.

The Fragility of Value Thesis:

Finally, it is possible that value systems are fragile in the sense that a slight difference in one of their components results in wildly different outcomes. Even if both the Perplexity and Complexity problems are solved, and the time has come for human values to be ‘loaded in’ to an AI system, any bug, glitch, or minor misrepresentation of a constituent part of the framework could result in the AI having a very different picture of human values than was intended.


It is not difficult to see that only a couple of these hypotheses would need to be true for them to combine in a way that gives great cause for concern in future AI research. For example, if any one of the ‘value theses’ were true, it would at best be highly unlikely that we would be able to instil an AI system with the knowledge of even the most common-sense human values. If this were the case, then the scenario of the paperclip maximiser would become a seemingly likely corollary of the Instrumental Convergence Thesis. The agent would not have the common sense to realise that paperclips are not inherently valuable, and so that it is probably not a good idea to bring about a world in which there are no humans to make use of the paperclips that it has been tasked with producing. Furthermore, and perhaps more worryingly, the above theses are by no means a comprehensive list of the philosophical issues relating to the design and deployment of advanced AI, and there are most likely many more ways in which AI could endanger humanity that researchers have yet to consider.

If sufficiently capable AI is possible, we had better find answers to these, and many more, troubling questions.