Domestic Affairs

Our Final Invention? Weighing Existential Risks from Artificial Intelligence

“Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.”

Thus reads a letter signed by over 350 AI scientists and tech executives, philosophers, and politicians including Open AI CEO Sam Altman, DeepMind co-founder Dennis Hassabis, and Anthropic CEO Dario Amodei. Other notable signatories included Bill Gates and Representative Ted Lieu (D, CA).

However, the broader AI research community is also worried. A study found that the median AI researcher “believes the probability that the long-run effect of advanced AI on humanity will be “extremely bad (e.g., human extinction)” is 5%,” with almost half believing the number  to be over 10%.

It is prima facie evidence that there is significant risk involved with the development of AI if so many experts in the field believe that AI poses enough of a danger to humanity to necessitate mitigation, but while expert opinion is enough evidence that the issue is worth consideration, it is not enough evidence that the foretold threat is significant. This article will examine the case for and against existential AI risk, and attempt to understand and quantify the risk which AI poses to humanity. 

First, some definitions: artificial general intelligence (AGI) is a computer with not just individual problem-solving ability, but a cross-domain general factor of intelligence similar to human intelligence. For example, Deep Blue can play chess far better than any human player, but cannot apply that knowledge to checkers, or even to slight alterations of chess. Deep Blue does not qualify as AGI, because its intelligence was designed to function in one specific domain, and cannot be transferred without human support.

Similarly, GPT4, the most powerful1 model2 in existence, approaches AGI, as it is able to use its general knowledge to effectively perform tasks out of the sample distribution. However, it fails to make the grade because of its inability to apply its knowledge in trivial ways, such as in solving basic arithmetic problems. Additionally, it currently lacks any significant ability to continuously track its surroundings, and remember them, or to act coherently over time in an agentic way. However, the question of where to draw the line on what constitutes AGI is, of course, a matter of some controversy3.  

The existential risk of AI comes from Artificial Super Intelligence (ASI) which is an AGI significantly more powerful than the most powerful human mind. AGI on the level of an orangutan would pose certain difficulties, but an ASI that is actually smarter than humanity’s brightest minds belongs in its own category of risk analysis. Thus, for the purposes of this article, “AI risk” is defined as the risk from the development or deployment of an ASI.  

Second, a clarification: the term “AI risk” can refer to the ordinary risks associated with AI, and indeed with any new technology, such as economic and social disruption, novel security threats, increased opportunities for criminal activity, and the creation or exasperation of social inequalities. AI risk can also refer to the existential risk associated with AI, the idea that creating an AGI could lead to catastrophic results for humanity. While both risks ought to be taken seriously, the second risks are both comparatively less understood and, if they turn out to be plausible, far more critical than the former, so this article will exclusively focus on the latter4.

What cause do we have to believe that AI is likely to lead to mass human extinction? There are many technical arguments on both sides, but the debate can be boiled down to three key propositions:

  1. AI alignment (the study of how to get AI to robustly do what we want) is lagging behind AI capabilities (the study of how to make bigger and more capable AI). This is because alignment is more difficult than capabilities. If this remains true, any AGI we create will have a quasirandom goal function, since without proper alignment research we can neither state our preferences as a goal function nor put this goal function in the AI.
  2. An AGI with an arbitrarily selected goal function will quite likely result in human extinction. This is because human values are an infinitesimal part of the vast space of potential things the AGI could value. For almost any value set with which the AI could end up, the optimal strategy would involve liquidating humanity and using our constituent atoms for something else. The AI will be able to do this because it will be vastly smarter than we are, on the order of not Einstein versus a toddler, but of a human versus a rabbit. One is so much smarter than the other that the less intelligent creature simply cannot conceive of the more intelligent one. 
  3. The creation of AGI is near, on the order of decades, or perhaps even years, rather than centuries or millennia. Further, there will not be a clear threshold beyond which the danger zone begins. AI capabilities are advancing at a pace perhaps never before seen in science. For much of 2023, it was meaningful to describe large language models as “weeks out of date.” If current trends continue, we actually have no idea when we will get AGI, because no one knows how AGI will work. But our ignorance itself, combined with the pace of our progress, is reason to fear that we will stumble over the threshold into AGI without any clear idea that we are doing so. Once this happens, there is reason to believe the AI would recursively self- improve and bootstrap its ascension to ASI. 

To summarize, any AGI we currently create will have an arbitrary goal function, creating an AI with an arbitrary goal function will likely lead to human extinction, and we are, wittingly or unwittingly, about to create an AGI. But how likely are these to be true in conjunction? 

Is AI alignment lagging behind AI capabilities, and will this likely remain true?

In terms of the first proposition, there are several key points to be demonstrated. First, AI capabilities research is, by the testimony of leading scientists in both fields, outstripping AI alignment; but this could simply be an artifact of current social priorities — to be resolved as soon as we realize the danger and commit further resources to alignment research. Unfortunately, this is unlikely to be true, as there are theoretical reasons to believe that capabilities are fundamentally easier than alignment. 

Consider why it is easier to simply throw a ball, not caring where it lands, than to hit a target with that ball. One requires both precision and strength, the other can be achieved by just throwing a bunch of energy at the ball, by, if you will, brute forcing the problem. Capabilities is fundamentally a field of force, giving the AI the ability to do more things. Alignment, meanwhile, is a science of precision — of getting the AI to do what you want, not just in the circumstances it is being trained in, but in all potential situations it may encounter5

Another difficulty is that the usual way of figuring out how to make something work precisely, trial and error, which equates to taking a rough version of the thing and playing around with it until it works, implies that alignment will naturally lag behind capabilities. That is, to figure out how to align an AI of X capability level, the natural thing to do is to build an AI of X level, and play around with it until it is aligned. This works perfectly fine for AI on the level of GPT4, but it still means that you end up first solving capabilities, building an AI that can do the thing you want it to do, and then solving alignment, getting the AI to do the thing. This means that, if we want to have alignment outpace capabilities, the needs of alignment researchers must work purely by theory alone, designing ways to shape and control systems that are not only unbuilt but undersigned and unconceived by their colleagues researching capabilities. They will have no chance to empirically verify their work, or even to test it before capabilities catch up with them, and the model is built. This is a far more difficult task than it may at first sound, for both scientific inquiry and engineering are designed to make extensive use of empirical feedback. If alignment is to remain safely ahead of capabilities, then many usual tools for developing new scientific paradigms will be unavailable. Further, given how difficult the task will be, it will likely only be successful if it is explicitly optimized for, meaning there must be widespread agreement among both groups that capabilities need to be kept behind alignment. That is, if half the community believes the alignment must be developed before capabilities, and the other half does not care, then alignment will likely continue to lag behind capabilities. While such wid- scale cooperation is possible, it is very much not the default course, and will not happen unless there is a serious commitment to make it happen. In the default world, proposition one seems to be robustly true. 

If the above paragraph was confusing, then consider which would, by default, come first, building an airplane, or figuring out how to safely operate an airplane. Obviously, doing the second is much easier once progress has been made on the first, and the first is the natural place to start unless you have a strong commitment to prioritize the second. Now replace “airplane” with “artificial intelligence,” and that is the crux of the above. The main takeaway is that capabilities are currently ahead of alignment not for any contingent social reasons, but for reasons fundamental to the nature of the field. Thus, proposition one is not only currently true, but will likely, assuming something is not done to change it, remain true when AGI is developed.

Given that we solve the AI capabilities problems inherent in building ASI before being able to align it, what will happen when we build an unaligned ASI?

If you intuitively understand why building a superintelligence with no way to make it a moral being is a very bad idea, you can skip this (and the following) question.

Suppose an AI with an arbitrarily selected utility function 𝜏6. This utility function makes it want to maximize the amount of some particular arrangement of atoms in the universe. Now imagine this ASI ascended from one designed by a paperclip manufacturer, and it ends up with a goal function that tells it to maximize the number of paperclips in its lightcone

The AI will then “think” to itself: “Well, there are all these humans sitting around using all these atoms which could, instead, be made into paperclips. Sure, the humans make paperclips, but not as efficiently as I could using their stuff, and anyway, they are likely to complain about me taking all the free atoms and turning them into paperclips. The obvious solution is to kill all the humans and make them into paperclips, or at least paperclip-making robots.”

The above example is not meant to be taken as a literal risk scenario, but it demonstrates the point that, if you take all the possible goal functions an AI could have, and choose one at random, then that goal function is almost certainly not one that cares about human flourishing. and, as was discussed above and will be further elucidated below, we do not have the general ability to make a goal function for an AI that actually represents what we want it to do, rather than a superficially similar proxy.  

Very well, but humanity will want an AGI that has a non-stupid goal function, maybe it will maximize human happiness— that sounds safe. Well, we need to explain to the AI what happiness means, so let us tell it to maximize a certain brain state, which corresponds to “happiness”. “Very well,” says the AI, “the best way to do that is to build a bunch of vats, put a bunch of brains in them, remove any parts of the brain not necessary for feeling happiness, and pump what’s left full of heroin7.” 

Part of the reason that this is true is because computers do what you say, not what you mean, so you run into classic genie problems, where a given function is literally maximized, not as a human understands maximation, but as math does. Because humans know “what we mean” when we say to do this or accomplish that, even if the function is not specified with exact precision. But yet the fact that if the function, without the very specific assumptions that are so obvious no human would think to specify them, is solved, then the default outcome is something alien to humanity. You actually have to make the AI want to maximize its goal function in a way that accords with human assumptions and desires— by finding a way to describe them in terms of math, and then putting them into the AI’s goal function. 

If this point is still unclear, imagine the following: let each possible goal function of the AI be described by an integer. Obviously, this is going to be a quite vast set8. Let each potential configuration of the universe that the AI is trying to optimize for be represented by an integer. Making existence contain lots of blue paper clips can be four. Making existence contain lots of red paperclips can be three million nine hundred forty-five. Very few of these numbers are going to be humans living fulfilling lives. Those that are will, in most cases, have something wrong with them. Maybe these humans lack music, or romance, or live in a miserable dictatorship. If you think about the number of ways the universe could be, very few of them are going to be ones that we would prefer over the current state of the universe, which contains quite a lot of what humans value. 

The key takeaway is that building an unaligned AGI is like picking a random number from the above and making the universe be like that. 

Some readers will ask, even if humanity fails at alignment, will the AGI not at least loosely be inspired by human values, in the way that the heroin brain vats example from above, while unsatisfactory, is obviously loosely inspired by human values? So should we not, in the above example, put more weight on numbers that represent states of reality closer to human values, and thus overweight human values in the distribution, increasing the chance that a significantly but not totally unaligned AI ends reasonably well for humanity? Even if you did do this, states of reality conducive to human values are so rare in the distribution that the odds are still incredibly stacked against us. However, this is a faulty assumption for reasons these margins are too small to contain9.

Even assuming that an arbitrary goal function for an ASI will lead it to desire a universe with no human value, why is it so hard to give the ASI a goal function encapsulating human values, or why can we not simply control a hostile ASI?  

To understand why giving AGI human values is so difficult, it is necessary to understand the relevant characteristics of fundamental human values, the “law” written on every human heart, which must also be written into any aligned AGI. 

Earlier, the  AGI’s utility function was defined as 𝜏. The reason for using the letter tau here is that it is a homophone for 10Tao, a concept from Chinese philosophy that has been analogized with the natural law. More broadly, The Tao can be thought of as the representation of universal human values, the implicit moral structure behind every culture on earth, reinforced by millions of years of biological and thousands of years of cultural evolution. It is the force behind human universals. It is the summation of all human values, ethical, aesthetic, and cultural. It is why people want a universe with beauty and justice and prosperity and science and cooking and human achievement and valor and kindness and children. It is also what lets us know what we mean when we use words such as “beauty” “justice” and “human achievement” in the first place. Humans are biologically predisposed to have a general understanding of the Tao. We are socialized to understand the Tao in our particular cultural context. There are of course key differences between different cultural ideas of the Tao, but these are, quite obviously, variations on the same theme. 

Once this is understood, it is obvious that creating a powerful agent not constrained by the Tao is a really terrible idea. But, while we understand the Tao, we cannot really explain it. There is a lot that people just know without being able to explain in precise terms what it is exactly they know. It is implicit, rather than explicit, knowledge. This isn’t going to cut it when building AGI. The Tao must be explained in mathematical precision, in such a way that it constrains the AI, and all this must be done without the help of the usual biological levers which we pull when instilling human values in our children. The AI has no instincts that we do not program, it has no feelings that we cannot calculate, so we must learn to program and calculate natural law11.

So if a very specific and complex value set is not programmed into the AI, it ends up with a goal function that wishes to kill humanity. There is still the question of whether ASI will be capable of killing us. 

Yes, yes it will be. 

First, once humanity figures out how to build AGI, it will likely be far smarter than humanity, and be able to recursively self-improve. Humans are subject to biological constraints on intelligence which the AI will not be, for one, and the AI can simply bootstrap itself to higher intelligence12. It can also just acquire more computing power, which it will then use to get smarter and get more computing power, and so on. The key insight here is that the AI, unlike humans, can both edit and copy its mind, meaning it will have access to means of intelligence enhancement we do not. It is possible that these methods do not work, and that the AI cannot bootstrap its way to superintelligence, but even if it cannot, world conquest should still be fairly simple. For one, it could pretend to be aligned until its creators came to trust it more, or gave it more capabilities, and then could turn on us. Or, if we gave it anything like the privileges, in terms of internet access and control over our infrastructure, given our current computers, it could leverage its superhuman thinking speeds (which computers already possess) and lack of need for food or sleep to achieve world domination through mundane means

Once the AI achieves superintelligence, however, the game truly is over. Agents with different orders of magnitude do not fight, the smarter agent rolls over the less smart one. Humans do not fight mosquitos, for they are so far below our cognitive level that they cannot even comprehend us, nor our plans to destroy them. Even if they could, the sort of resistance that they must mount to effectively counter us is beyond what their coordination and planning abilities can conceive, let alone execute. Similarly, if an animal eats a plant, the plant not only is unable to resist but cannot even understand that it is being eaten. It is well for the plant that it is not in the animals’ interest to exterminate it completely, and that evolution is balancing their populations, or else they would be completely overwhelmed by an enemy that can, however rudimentarily, think. Even relatively small differences in intelligence can be decisive if they are persistent and are not ameliorated by other factors. If you do not believe me, I would encourage you to ask the Nethanderals, except we already killed them all by virtue of being a little bit smarter and better coordinated13. Even among humans, intelligence is the number one predictor of success, and this is despite sophisticated mechanisms that allow humanity at large to coordinate against any of their numbers who get too smart for the good of everyone else. 

The point here is not that AGI will take this or that path to world domination, but that agents with an order of magnitude or more of an advantage in intelligence 14barring highly unusual circumstances or counterbalancing forces, win. 

It is highly likely that an unaligned AI will result in human extinction. There are counterarguments to this, of course, such as that from Meta’s Chief AI scientist that we can just make laws telling the AGI to be good, but there is broad agreement across the field that this is true. 

While creating ASI will kill us, how likely are we to actually build ASI?

Proposition three states we are likely to get AGI in the near term. A hypothetical AGI can be as dangerous as it likes, but if AGI is impossible, or at least far too difficult to create at our current technology levels, then there is nothing to worry about. The difficulty here is we have never dealt with a situation similar enough to AGI that we can generalize from experience, meaning that any attempt to determine how much time we may have will rely either on finding an analogous process from which we can make generalizations, or on calculating out the future rate of AI progress, and determining how far we still have to travel to AGI.

There are far too many ways of calculating timelines to discuss here, but the Alignment Forum has put together a good, if slightly out of date, introduction to various ways of measuring AI timelines. The key difficulty is that there are many unknown unknowns related to what is necessary for AGI, and because of that we do not have a good idea of the hurdles which need to be overcome to create it. Attempts to ameliorate these problems are technical enough that they are beyond the scope of this article. 

It can still be considered whether AGI is theoretically possible, and then whether it can be built at anything close to current technological levels. If the answer to either of these questions is no, then AGI poses no existential risk of this type.  

First, is AGI even possible? 

There is a case to be made that it is not. One common variation on a common argument can be summarized as follows

  1. While certain sub-tasks of intelligence, such as chess, can be reducible to code, there are elements of the human soul, such as consciousness and judgment, which are either non-material or material but impossible to be synthetically recreated.
  2. One or more of these elements are necessary for AGI. For instance, the human faculty of consciousness is necessary for general intelligence, such that no non-conscious agent is generally intelligent 
  3. AGI is impossible. QED

There are many responses to this argument, but this article will focus on two. First, there is no broad agreement on what tasks can and cannot be reduced to the point where we can get a computer to do them. This touches on the metaphysical question of reductionism, which asks what, if any, things are fundamentally irreducible to other things. Given that computers can be considered as “merely” mathematical structures, any quality that cannot be reduced to mathematical structures cannot be given to an AI. There is no widespread agreement on a list of such irreducible things. While some things, such as arithmetic and chess, are obviously so reducible, others, such as language, are less clear. As recently as five years ago, it was an open question as to whether language was really reducible in such a way as to allow AI to comprehend it, but GPT has made it clear that it is15

The question of whether a given capability is really reducible to math is unresolved, as is the question of what capabilities are necessary and sufficient for AGI, so there is no reason to believe that AGI is impossible, even granting the premise of the argument. 

For example, suppose that aesthetics turned out to be fundamentally irreducible such that AI could never be given aesthetic perception. You would also have to prove that AGI could never exist, or never effectively operate in the world, without this aesthetic perception. If you cannot demonstrate this, then while you may have shown something interesting about metaphysics, you have done little to reduce the probability that there is an existential risk from AGI. 

But even if you could show this, there is a second problem. Any human capability, even an irreducible one, can be represented to the computer as a series of inputs and outputs, allowing it to make statistical inferences and copy the skill from the human. An AGI with no intrinsic sense of aesthetics can nevertheless learn a good painting from a bad one by learning to predict the reaction of a human art critic to a given painting; this is simply a matter of the AI being given a sufficient size of input data, and the art critic’s judgment of them. There is a profound philosophical difference between a human with a true appreciation of art, and an AI that is only able to say what the human will call good art and why, but in terms of capabilities, these are almost isomorphic. Given that the existential threat from AI is the result of excessive capabilities, this line of attack fails to respond to the true reason to fear AI. 

But suppose that the above argument is flawed, and building a machine that even imitates true intelligence is impossible. In that case, there must be a stalling out point somewhere in the development of artificial intelligence— for if its progress continued unabated then it would reach AGI sooner or later— and, if that is so, then the expected cost of adopting the policy of intentionally halting AI development before it reaches AGI significantly decrease. Therefore, while concern about AGI might seem ridiculous to those who subscribe to the above argument, prudence still dictates that they support precautionary measures, for if they are wrong then these measures will be necessary after all, and if they are right they will be no worse than a waste of time and energy, and will, if done right, preclude no progress, for they will amount to humanity saying that it will not attempt something which is, in fact, impossible. 

With that said, there are many serious technical barriers between our current systems and anything even approaching AGI— it’s not as if we will get Skynet in the next six months. Indeed, such are the hurdles to AGI that I personally would be very surprised if we saw anything like it before 210016. But even a small chance of a great disaster from AGI would justify quite a lot of precautions— when the fate of the world is at stake, even unlikely risks that can normally be safely ignored must be considered17

Trying to estimate the arrival of ASI before a certain date is difficult because the event is so unique that the past offers little guidance. Attempts to estimate the number face many hurdles, but the aggregate of forecasters suggests at least an estimate of a small but real chance of ASI sometime this century18

Even a small chance, however, should spur us to caution. In Russian Roulette, there is only a 16.67% chance of getting shot, and a much lower chance than that of dying, but who in their right mind would play such a game? Just because there is a small chance of something going catastrophically wrong, that does not mean you can ignore that chance.

What are the implications of this existential risk?

It is not clear what, if any, actions, are necessitated by this existential risk, but some principles seem clear. First, because the threat comes from future AIs, and particularly future AI research, limits on the production, distribution, or use of current AI models, while potentially justified to counteract more mundane risks, will be useless at mitigating existential risk. Rather, any voluntary restrictions or regulations must take place on the margin, on the frontiers where we build better and better models. Here, we have an advantage, for there are only a handful of corporations, and probably only two national governments, that have the capacity to create these new models, and coordinating the actions of such organizations is far easier than coordinating the uses of already existing models, which are by now in the hands of billions of people. Reporting requirements on the training of new models, or restrictions on training models above a certain power threshold, may here prove useful. 

In summary, AGI, if achieved soon, would likely kill us all, but our chances of building it soon are difficult to determine. Thus, it would seem prudent to consider ways to avoid creating AGI until we can be reasonably confident it will not kill us.


Footnotes

  1. The capacity of LLMs can be measured in a few ways: by parameters, which represent information about weights and biases within models (ChatGPT has 175 billion); by flops (floating operations per second) which measure the computing power necessary to train them, (training GPT3 likely took above 1023 FLOPs); or by the size of the training set (which for ChatGPT was 45 terabytes.) These three measurement methods generally correlate, and this article is not technical enough that they need to be further specified except in a few cases. A helpful discussion of the issue can be found here ↩︎
  2. At the time of editing, Google’s Gemini and Anthropics Claude appear to be approaching a similar level of capacity.
    ↩︎
  3. For instance, a human has general intelligence, and the most complex neural circuitry in the world, but yet does not know how to do basic addition unless taught. Thus, maybe the fact the GPT has to be taught, or prompt engineered, to get the right answer might not mean it is not a weak form of AGI. ↩︎
  4. This is not an argument that  AI is not capable of mundane harm, such as economic disruption or promoting racism. The point is that talking about, for example, AI enabling more sophisticated crimes under the same linguistic heading as AI killing everyone is not conducive to properly understanding the issue, so it makes sense to refrain from treating them together. ↩︎
  5. Another way of putting this: given that aligning an AI presupposes having an AI at all, and that you understand what it can do and how it works, alignment faces all the difficulties which capabilities do, with the added difficulty of ensuring it is oriented towards human goals in a way more robust than is necessary to increase its capabilities. ↩︎
  6. this is not the usual symbol for a utility function, which will become relevant later ↩︎
  7. Or, more likely, a drug far more powerful which we have yet to discover, but which the ASI will research. ↩︎
  8. Albeit finite, because there are only so many atoms in the light cone of the AI, and so many configurations of the atoms, and you can say that two functions which result in the same configuration get the same number, they count as the same. You can reform the problem to avoid this, but in those versions you can say that the goal function itself must be expressed in a finite number of bits, and any two goal functions with the same arrangement of bits, even if they would result in vastly different configurations of atoms because of different starting conditions, count as the same ↩︎
  9. To summarize, first. you have problems related to instrumental convergence; amd more broadly, value drift in AGI means that the end goal function of the AGI is going to be higher entropy, with a more even probability distribution across outcomes rather than having probability centered on human value ajscent outcomes, values than the initial goal function a human would design. ↩︎
  10.  A common English pronunciation of ↩︎
  11. Or at least the outcomes of the natural law, for there are philosophers who argue that synderesis, our perception of fundamental moral truths, is a matter of the immaterial soul, and not just of the mind. The AI, having no soul, thus cannot ever truly comprehend morality. Saying for the sake of argument this is true, so long as the question “If the AI knew the Tao, what ought it to do in a given situation,” has a coherent answer, it is possible to just design the AI to do that. For the purposes of an AI, it is not necessary to know the Tao, just to predict what it would command. ↩︎
  12. using its intelligence to figure out ways to recursively modify to become more intelligent, which will in turn make it more able to recursively modify to become more intelligent ↩︎
  13.  Noting that the improved coordination largely came from the increased intelligence. ↩︎
  14. Remember that this is not the same thing as “book smarts, ” as is at best a very imperfect proxy for intelligence, but rather refers generally to an agents effective computing power ↩︎
  15. That is, we did not know understanding of language was reducible to math until we actually reduced it. ↩︎
  16. To those more familiar with the debate, my P(doom) before 2100 is between 5% and 20%, but jumps to above 90% conditional on getting AGI during that period. ↩︎
  17. Among Christians, though the argument would also work for most belief systems which believe in a teleology for man, there is an argument that we ought not to worry about existential risk because God, who has a plan for humanity, would not let that plan be ruined by AGI killing everyone. Further, there are those who argue that the case for existential risk mitigation is rooted in transhumanist and utilitarian fallacies. For a Christian, particularly a Thomist, framework for why existential risk needs to be addressed, see here
    ↩︎
  18. They ask about AGI rather than ASI, but asking about the latter is even harder, and it was impossible to find a good aggregate of predictions on it. ↩︎

Categories: Domestic Affairs

Tagged as:

Leave a comment