Training Without Examples

In a lot of machine-learning techniques, including neural networks, what you wind up doing is training reflexes.  That is, you have a whole bunch of examples where “this input ought to produce that output,” and you want the system to learn enough about the underlying system to figure out why those examples work, at least well enough to guess what output ought to be produced for cases it has never seen.

But how do you train something to interact with a complicated system in a way that brings that complicated system toward a specific state, when you don’t really have any examples of such an interaction to train on?  When maybe you don’t even know exactly what outputs would produce a good performance?  You’re asking the system to learn, not just from examples, but from ultimate results.

There are actually two different issues here.  First, our usual training methods apply mostly to single instances.  That’s great if we want a reflex action, or a decision based on a single point in time, or a classification.  But if we want an interaction instead, then the history matters rather than just the single point in time, and we now need a recurrent system.  The examples of correct action we’d be using to train now become full sequences rather than just input/output pairs, and the set of possible examples grows exponentially with the length of the sequence.  Training recurrent systems with backpropagation is hard too; the algorithm is called backpropagation through time.  It works, but it’s very hard to reinforce or downregulate any significant length of interaction if we’re just doing backpropagation from the very end of the process (where the “real” results become available).  So we need something that can give us some feedback every round about whether we’re (probably) doing the right thing.

The second issue, and more fundamental in this case, is that when we’re judging the interaction according to its result, the result doesn’t tell us how we ought to correct the interaction.

In some machine-learning techniques, these problems don’t really come up; for example, with genetic algorithms we let the evolved individuals interact with a simulation of the complex system.  Once they’ve run, we judge their fitness by the result they achieve.  We don’t have to have any idea in advance how they’re going to achieve it.

For an example, let’s say we have a system that provides control outputs which run the electrical motors in a robot arm.  Now, we want an interaction with a somewhat unpredictable system. We want the system to use its arm to pick up a ball and drop it through a hoop.  As fast as possible, as many times as possible, while dealing with an unpredictable world; for example there may be a kitten in the box who is much faster and stronger than the robot arm, and who likes playing with the ball.

If this were a genetic algorithm, we’d just code up a value function.  Did the ball get dropped through the hoop? +100 points.  Otherwise start adding up smaller partial achievements or add penalties for partial failures.  Subtract a quarter-point per turn spent standing still.  Subtract a half-point for every centimeter of final distance between the gripper and the ball.  Subtract a point for every centimeter of final distance between the ball and the hoop.  Is the ball above the hoop?  Add ten points.  If the ball is lower than the hoop, add three points if it’s being held by the gripper.  And so on.  The key to these things is that any relevant difference in performance should lead to a relevant difference in score, even if you wind up assigning scores to things that are still far from being your ultimate desired behavior.
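As a rough sketch, the scoring rules just listed could be coded up something like this.  The `FinalState` record and its field names are made up for illustration, and treating “not above the hoop” as “lower than the hoop” is a simplifying assumption, not something the rules above pin down.

```python
from dataclasses import dataclass


@dataclass
class FinalState:
    """Hypothetical summary of how one interaction ended."""
    ball_through_hoop: bool
    turns_standing_still: int
    gripper_to_ball_cm: float
    ball_to_hoop_cm: float
    ball_above_hoop: bool
    ball_held: bool


def score(s: FinalState) -> float:
    """Score one interaction using the point values from the text."""
    if s.ball_through_hoop:
        return 100.0
    # Otherwise add up partial achievements and penalties.
    total = 0.0
    total -= 0.25 * s.turns_standing_still   # quarter-point per idle turn
    total -= 0.5 * s.gripper_to_ball_cm      # half-point per cm, gripper to ball
    total -= 1.0 * s.ball_to_hoop_cm         # one point per cm, ball to hoop
    if s.ball_above_hoop:
        total += 10.0
    elif s.ball_held:                        # assumed: "not above" means "lower"
        total += 3.0
    return total
```

The particular numbers matter less than the shape: two runs that differ in any way we care about should come out with different scores.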

Then we’d make a system that relates the input information about the locations of the ball and hoop and the arm and a simulated kitten, to the outputs that run the motors, and let instances of that system evolve. Measure performance and reproductive fitness by the results of the interactions, until we come up with individuals that have discovered SOME way of getting the gripper to the ball, getting the ball closer to the hoop, grabbing the ball, getting the ball above the hoop, and dropping the ball through the hoop, while dealing with randomness introduced by the kitten.  Then we let it run with a real robot arm and a real kitten, and see how it does.
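Here is a minimal sketch of that evolve-and-measure loop, under heavy assumptions.  A “genome” is just a list of numbers, `toy_fitness` is a hypothetical stand-in for “run the simulated arm and kitten, then score the result,” and the selection scheme (keep the top half, refill with mutated copies) is one arbitrary choice among many.

```python
import random


def evolve(fitness, genome_len, pop_size=30, generations=40, sigma=0.3, seed=0):
    """Toy genetic algorithm: keep the fitter half, refill with mutated copies."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]           # survivors, unchanged
        children = [
            [gene + rng.gauss(0, sigma) for gene in rng.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness), history


# Hypothetical stand-in for "simulate the interaction and score it":
# fitness is higher the closer the genome is to an arbitrary target.
TARGET = [0.5, -0.2, 0.8]


def toy_fitness(genome):
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))


best, history = evolve(toy_fitness, genome_len=len(TARGET))
```

Notice that nothing in `evolve` knows how a good genome works.  It only ever sees the score.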

But with value propagation networks, including artificial neural networks, it’s difficult to take a “success” score based on the result of an interaction and know what to reinforce or downregulate as a result.  How in the world do we take our ‘value’ function and use it to figure out what the motor outputs given a particular sequence of inputs and previous outputs should have been?  Knowing that the ball has or hasn’t been dropped, or whether or not it’s anywhere near the hoop, doesn’t directly map to “this motor should have run longer” or “the gripper should have been released here.”  The actual outputs we’re producing determine how long motors run or when the gripper closes and releases, and knowing a final result doesn’t tell us what values we ought to have produced in order to achieve a better result.

So how do we take a value-propagation network and train it to run its motors in such a way that the result maximizes that arbitrary function?

We have to be a little bit sneaky and use feedback to bootstrap the system.

Let’s say we have the same value function we’d have written for the genetic algorithm, and we want to use it to train a neural network.  First we add outputs to the recurrent artificial neural network to predict, every turn, how well the network will score on that function during the current interaction.  This at least is something we can get a specific value for, so we can train it.  Let it run, figure out what score it got, regulate all the predictions it made along the way using backpropagation through time, and repeat as needed.  This prediction will rapidly converge on the system’s actual performance, and if we train it along with the system, then whenever the system learns to do better the prediction will also learn to predict it doing better at the same time.
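To make the idea concrete, here is a deliberately stripped-down sketch.  Instead of a recurrent network trained with backpropagation through time, it uses a plain linear predictor over per-turn features, nudged toward the final score by ordinary gradient descent.  That is a swapped-in simplification, and every name in it is hypothetical.  What it keeps is the essential move: every prediction made during an interaction gets corrected toward the score that interaction actually earned.

```python
def predict(w, b, x):
    """Predicted final score given one turn's feature vector x."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


def train_score_predictor(episodes, n_features, lr=0.1, epochs=200):
    """Fit a linear score predictor.

    episodes: list of (turns, final_score), where turns is a list of
    per-turn feature vectors observed during that interaction.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for turns, final_score in episodes:
            for x in turns:
                # Every turn's prediction is regulated toward the real result.
                err = predict(w, b, x) - final_score
                b -= lr * err
                w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w, b
```

In the real thing the predictor would share its internals with the motor-control outputs, so that the two improve together.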

Now comes the magical bit.  Every time the anticipated score improves between one round and the next, we must have done something right.  Every time the anticipated score drops between one round and the next, we must have done something wrong.   And now, while we still don’t know what values the motor-control outputs should be producing, we have a way of telling whether they got a good or bad result, every turn.

So depending on the change from one turn to the next of our anticipated score, we can either reinforce or downregulate the output just produced.  And using backpropagation through time, we can reinforce or downregulate whatever internal states and responses to those states led us to produce those particular outputs at that particular time.
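A minimal sketch of that reinforce-or-downregulate step, again with loud assumptions: the “network” here is just a list of action preferences run through a softmax, with no recurrence and no backpropagation through time, and `delta` stands for this turn’s anticipated score minus last turn’s.  The function names are made up for illustration.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(prefs, action, delta, lr=0.1):
    """Nudge the preference for the action just taken.

    delta > 0 (anticipated score went up): make `action` more likely.
    delta < 0 (anticipated score went down): make it less likely.
    """
    probs = softmax(prefs)
    return [
        p + lr * delta * ((1.0 if i == action else 0.0) - probs[i])
        for i, p in enumerate(prefs)
    ]
```

So if the anticipated score rose from 40 to 42 after taking action 1, calling `reinforce_step(prefs, 1, delta=2.0)` makes action 1 a bit more likely next time; a drop of the same size makes it a bit less likely.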

The system will start by learning some very low expectations.  But those expectations depend on its performance (and that of the kitten), and it will learn to maximize its performance relative to those expectations, and as it does so its expectations of its own performance also improve.

As its expectations of its own performance improve, it learns to expect a really good score, and to expect of itself only the kind of performance that can lead to that score.  Let’s say its prediction is that it can make ten baskets in the number of turns remaining, so it’s predicting 1000+ points.  But then the kitten manages to bat the ball away just before its gripper picks it up.  The ball happens to roll toward the hoop.  At the beginning of the training process, this would have gotten positive reinforcement (in hockey they call it an “assist”), because the ball winds up closer to the hoop, resulting in a score that beats the (very low) expectations.  But late in the training process, this results in a missed opportunity to make a basket.  The expectation drops by 100 points or so because the system has “wasted” time and now the prediction based on the new situation will be that it can only make nine more baskets.

Now there is an interesting thing; in order to bootstrap its performance here, the system had to form an expectation of the results of its own action, and constantly compare its own performance to its expectation of its performance.  On some level, it had to take its own actions into account when formulating its expectations of the future.  And that is perhaps similar to, or perhaps different in degree rather than kind from, things that animals do when they engage in what we call self-awareness.

Carried out with an ungodly huge amount of computing power and time, one can imagine this system eventually understanding the emotional states of the kitten, at least insofar as they relate to the accuracy of its own predictions about its success.

Even under such wildly optimistic definitions, and even if it understands the emotions of the kitten, it wouldn’t have anything in common with those emotional states.  An all-overwhelming urge to go on dropping a ball through a hoop is not a basis for empathy or communication or social interaction or curiosity or anything else.  Even if the system was a so-called ‘superintelligence’ capable of predicting human cognition and anticipating every word anyone said, having the entire universe come down to whether or not you can keep dropping a ball through a hoop isn’t a sufficient basis to provoke anything like the complexity of a subjective experience with a point of view.  No matter how smart a system like this is, and no matter how complete its model of itself and others has to be in order for it to do its job, there’s going to be ‘nobody home’ in any sense that we care about.

If it understands anything about that kitten, then it does so solely in relation to how it affects the odds of getting that ball through that hoop.  Similarly its self-awareness is limited and alien.  The only reason it would care about its arm, its motors, or its ability to move is that those things affect its ability to get that ball through that hoop.  This isn’t anything like ‘consciousness’ or ‘self-awareness’ as we understand it, even given infinite computing power.  And it isn’t even the glimmerings of ‘empathy’ or ‘fondness’ or anything else on our emotional radar.

In other words, we can create a recurrent Value Propagation Network – in this case a recurrent Artificial Neural Network – that learns from feedback without examples, optimizes behavior to get results in a chaotic world, has some kind of self-awareness, and is in some limited way even aware of the state of another being. But that is not the same as creating an individual with empathy, emotions, or moral agency.  It’s necessary, but not sufficient.

Real artificial intelligence – that is, trying to create a genuine synthetic mind as opposed to a function maximizer – remains hard.  Not impossible, because we have an existence proof inside our heads that such a thing can exist and be made of matter behaving according to the laws of physics.  But hard.  And the process of building it will teach us tons of stuff about our own minds that we won’t really understand until we relate the AI experiments back to our own example.

Building this function maximizer in terms of genetic algorithms, the way Mama Nature did, would probably have been more straightforward, but then I wouldn’t have gotten to talk about how some degree of self-consciousness and self-expectation, while necessary, is not sufficient for real artificial intelligence.  And I guess that’s where I was really going.

Maybe real artificial intelligence does turn out to be a function-maximizer – but understanding that function is difficult, and if that’s what it is, then nothing says it’s a function that always leads to a ‘mind’ as we understand it.  For biological intelligences, our bodies (including but not limited to our complicated brain meats) arise as function maximizers for genetic continuity.  The fitness function for that involves subgoals like staying fed, having babies, not getting eaten, not dying of exposure, and so on.

But look around us.  The function that worked to produce our minds – our brains – is anything but reliable for that purpose. Take that same  function, maximize it a few tens of millions of times, and you get a few tens of millions of different species optimized for hundreds of thousands of different genetic-continuity strategies.  Historically – and I’m talking about 365 million years of history here – we got real intelligence of our kind, exactly once.

And as far as we can see – thousands, even millions, of light years in every direction – as far as we can tell, still exactly once.   Minds can’t have been very likely to arise naturally, and so they’re probably not the sort of thing we’re likely to duplicate accidentally.

Which isn’t terribly relevant I suppose, because there’s no ‘accidental’ about it.  We’re trying, and we’re trying hard.  The design space of things we can build while searching for or learning the way to build real synthetic minds is huge and encompasses a lot of function maximizers that aren’t minds in any significant sense.  Some of them are frankly terrifying.  The fact that many of the terrifying ones are profitable, or serve power, at least in their early versions, is even more terrifying.

We’re going to solve Artificial Intelligence.  Then it’s going to solve us.  What good is science if nobody gets hurt?