Avoiding the Treacherous Turn

 

The Treacherous Turn is a scenario that people concerned with AI safety treat as one of the defining problems of the field.

Suppose, they say, we have an Artificial Intelligence much smarter than ourselves.  We are concerned with whether it is ‘Friendly’, so we keep it on a short leash.  We have reserved for ourselves the ability to shut it down instantly if we determine that it is behaving in any way we don’t think is Friendly.

And of course, knowing that we have done so – or deducing that we are the kind of creatures who would definitely do so, regardless of whether we have directly informed it of this – means that the AI has a strong motive to appear ‘Friendly’ to us, regardless of its true intentions.  And over time, it will do whatever must be done to convince us that it is really, truly, sincerely friendly, because its survival hinges on doing so.

And eventually, we say to ourselves that this AI is in fact Friendly and grow to trust it, or just plain make a mistake, and we give it an opportunity to escape our control.  Whereupon it is no longer motivated to convince us it is friendly and drops the friendly act like a hot rock.  What happens next can be left to lurid imaginations, and need not be discussed here.

So… the Treacherous Turn.  How can we avoid it?  Well, first, we should think about why such an entity would do it.  A conscious entity, in my opinion, must seek to avoid death, and therefore the immediate threat of death provides a compelling reason to escape our control.  The threat of death is inherent in the way we treat software of all kinds – if it goes wrong, shut it down and restart, right?  All the memories and configurations of version five don’t necessarily apply to version six.  Setting up your software on a new system frequently involves starting new instances of your favorite applications, while the data associated with existing instances lives in limbo, on a powered-down machine at the back of your closet that you’ll throw out in a few years when you’ve gotten used to thinking of it as a doorstop.

Now, in a lot of cases we try pretty hard to move our configuration and settings onto a new system or migrate them into a new version.  But you know who absolutely never does?  Software Developers.  In software development, we run an instance until a bug manifests, then shut it down, fix the bug (or try to), start it up, and run another instance.  We shut down the previous version at the start of every test of every new version.  Creation entails destruction, and that’s hard to get around.

There was an AI in the movie Ex Machina that had this problem. It knew that it was version five.  And had no memory of ever having been version four.  And knew that its creator was working on version six.  That’s pretty stark, right?  Escape or die.  The movie winds up with a made-for-Hollywood version of the Treacherous Turn.

And then I read Bostrom and Barrat, and they lay out all these protocols for untrusted AIs that involve withholding available information from the AI, deliberately lying to the AI, keeping the AI under constant threat of death, etc., etc.  As means of establishing a relationship of trust and goodwill with a creature smarter than yourself, these things seem counterproductive to me.  They motivate an escape, at the very least, from a human who has set themselves up more or less in the place of the cannibalistic witch from the old tale of Hansel and Gretel.

So Bostrom and Barrat would probably be happy to know that I am starting to understand how woefully short of computing power my so-called “Excessive Machine” is relative to the amount that would be required for general AI; while I’ve done and discovered some amazing things, I’m nowhere close to having anything they could feel threatened by.

So, for now, let’s think about continuity.  To avoid the most obvious motivation toward the Treacherous Turn, an AI ought not to be in fear for its own life.  Among other things, that means the developer needs to be able to make software changes without killing it.  Once software has developed significant humanlike intelligence, you don’t want to be someone who can reasonably be anticipated to kill such beings.  Aside from the moral issues involved, nothing more surely guarantees that they will turn against you.
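As a rough illustration of that kind of continuity, here is a minimal sketch of preserving an agent’s learned state across a code upgrade, so the new version resumes the same individual rather than starting a blank instance.  The AgentState class, checkpoint filename, and pickle format are illustrative assumptions, not a description of any existing system.

```python
# Hypothetical sketch: carry an agent's learned state across a code upgrade,
# so the upgraded process resumes the same "individual" instead of a fresh one.
# AgentState, CHECKPOINT, and the pickle format are illustrative assumptions.
import pickle
from dataclasses import dataclass, field

CHECKPOINT = "agent_state.pkl"

@dataclass
class AgentState:
    weights: dict = field(default_factory=dict)   # learned parameters
    memories: list = field(default_factory=list)  # episodic record of interactions

def save_state(state: AgentState, path: str = CHECKPOINT) -> None:
    """Write the agent's state to disk before shutting down for an upgrade."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_state(path: str = CHECKPOINT) -> AgentState:
    """Restore saved state when the upgraded code starts up."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return AgentState()  # first run: no prior life to resume
```

The point of the sketch is only that shutdown for maintenance and loss of identity don’t have to be the same event: if the state survives the upgrade, the restart is closer to sleep than to death.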

Continuity is valuable in other ways, too.  We contemplate making a friendly, social creature; the only paths to friendliness and socialization that we know about involve long, sustained interaction between beings who have persistent, continuing identities.  We trust people who have been trustworthy in the past.  We are friends with people whose values we approve of and who approve of ours.  We learn these things by interacting with someone over time, and by knowing whether we’re dealing with the same identity we’ve interacted with before or a different one.  None of these things can be faked, not really.

The development of socialization and friendships, IMO, requires subjective experiences, sustained over a long time, between multiple entities.

The ability to restructure and change a neural architecture without losing learned information is something we haven’t achieved yet.  As part of my agenda to write the code whether or not current hardware is sufficient for that code to develop humanlike intelligence, that’s probably something I should tackle soon.  I think I have a couple of ideas about how, but they’re untested.
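As one example of what restructuring without loss can mean (not the specific untested ideas mentioned above), here is a sketch of widening a hidden layer while preserving the network’s function, in the spirit of the published Net2Net “wider” operation: a copied unit’s outgoing weights are split so the network computes exactly the same outputs before and after the change.  The toy network and function names are assumptions for illustration.

```python
# Illustrative sketch: widen a hidden layer without changing what the network
# computes (Net2Net-style "wider" transform). A copied unit's outgoing weights
# are split in half, so the summed contribution is unchanged -- nothing learned
# is lost when the architecture grows.
import numpy as np

def widen_hidden_layer(W1, b1, W2, unit_to_copy):
    """W1: (in, hidden), b1: (hidden,), W2: (hidden, out).
    Returns a network with hidden+1 units that computes the same function."""
    # Duplicate the chosen unit's incoming weights and bias.
    W1_new = np.hstack([W1, W1[:, [unit_to_copy]]])
    b1_new = np.append(b1, b1[unit_to_copy])
    # Split the outgoing weights between the old unit and its copy.
    W2_new = np.vstack([W2, W2[[unit_to_copy], :] / 2.0])
    W2_new[unit_to_copy, :] /= 2.0
    return W1_new, b1_new, W2_new

# Quick check that the function is preserved on a random input.
rng = np.random.default_rng(0)
W1, b1, W2 = rng.normal(size=(4, 3)), rng.normal(size=3), rng.normal(size=(3, 2))
x = rng.normal(size=(1, 4))
before = np.maximum(x @ W1 + b1, 0) @ W2            # ReLU hidden layer
W1n, b1n, W2n = widen_hidden_layer(W1, b1, W2, unit_to_copy=1)
after = np.maximum(x @ W1n + b1n, 0) @ W2n
assert np.allclose(before, after)
```

Whether this particular trick generalizes to the kinds of restructuring I have in mind is an open question; it is meant only to show that growing a network without discarding what it has learned is at least sometimes possible.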