So if we consider the prospect of a well-adapted “creature” made of bits, what world does that creature inhabit? The usual setup for function maximizers, where they have senses and outputs that are incredibly specific to a single task and a simple fitness function that reduces the entire universe to performance on that single task, is no good for a general intelligence; even a hamster has to be able to see and do a thousand different things in hundreds of different contexts. In fact that’s what we mean by “general” when we talk about “general artificial intelligence” – one that can undertake new tasks and perform them competently.
Continuing my ideas about provoking intelligence: a general intelligence needs to have a diverse set of needs, a set of senses capable of perceiving the things in its universe that it can use to meet those needs, and the ability to act on its own universe and perceive the results of its actions.
The kind of virtual reality that people doing situated AI have been using is just embarrassingly wrong, because it projects an AI into a model of a physical world with completely irrelevant features like simulated cliffs and simulated ladders and simulated bunches of bananas. And in the context of that model, there is nothing the AI can see or do that could ever be in the least bit relevant to its own needs or to what we value in abilities for an AI. It may be given simulated needs that it can meet in its simulated environment, and become adapted to that simulated environment – but that’s not what we mean by a well-adapted creature. We don’t want something that can only exist in a simulation of a world that doesn’t exist, or something that can take effective actions only on modeled entities that don’t map to anything we care about. We don’t want something whose perceptions are limited to that model. And even if such an “intelligence” could perceive things outside the model, having no needs related to things outside its model means it would have no reason to ever be aware of such perceptions.
An AI lives among files, programs and processes, network connections, data streams, and communication channels. We value its ability to effectively perceive, understand, and manipulate those things. The real capabilities of the AI include pretty much everything you could do with a computer UI.
It needs a general way of interacting with such things – like our hands provide us with a general way of interacting with the physical world. Hands are awesome. They are multipurpose manipulators that we use for everything from peeling oranges to swinging sledgehammers to casting fishing lines to typing words into a blog. They are exquisitely sensitive sensory organs that we use to test temperatures, textures, hardnesses, weights, electrical charges, vibrations, and exact and subtle motions.
So what can we construct that plays the role of a hand, in the context of a universe made of bits? I’m going to call them “foci” after their role as the focus of attention. Mostly, I think a focus would be a cluster of information channels – text, image, audio, and video, input and output. “Input” would play the role of senses, “output” the role of manipulation. The AI would probably need at least two foci – for some jobs you need to pipe input from one file or device directly to output at a different file or device, and the files/devices are in different organizational relationships to one another. As humans we’re inclined to say they’re in different places, but the inside of an information system has organizational contexts rather than locations, and they’re not quite the same idea. Then again, much of UI design for humans involves taking information in different organizational contexts and presenting it, or enabling it to be manipulated, as though it exists in a single context. So maybe an AI would only need a single focus plus some interfaces (like a bash shell) that allow it to span contexts.
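To make the two-foci case concrete, here is a minimal sketch in Python. The names `StreamFocus`, `direct`, `sense`, `act`, and `pipe` are all hypothetical, invented for illustration, and in-memory byte streams stand in for the files or devices the foci would really be pointed at.

```python
import io

class StreamFocus:
    """Hypothetical focus: attention directed at one stream of bits at a time."""

    def __init__(self):
        self.target = None

    def direct(self, stream):
        # Point the focus at a file-like object (file, device, socket wrapper...).
        self.target = stream
        return self

    def sense(self, n=4096):
        # Input side of the focus: read up to n bytes from the target.
        return self.target.read(n)

    def act(self, data):
        # Output side of the focus: write bytes to the target.
        return self.target.write(data)

def pipe(src, dst, chunk=4096):
    """Move everything sensed at one focus to the output of another."""
    moved = 0
    while True:
        data = src.sense(chunk)
        if not data:
            break
        moved += dst.act(data)
    return moved

source = io.BytesIO(b"x" * 10000)
sink = io.BytesIO()
moved = pipe(StreamFocus().direct(source), StreamFocus().direct(sink))
```

The point of the sketch is only that the source and the sink never need to share a context; each focus is attached to its own target, and the piping happens between them.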
Anyway, what is a focus? It’s a thing that the AI can direct at a specific set of bits – a file or device or whatever – and that contains most of the “ordinary” ways to make sense of those bits. So, it needs to have a convolutional network channel that can “see” the images in graphics files, a recurrent channel that can “read” text or other linear formats, a recurrent-graphics channel that can “watch” video, a recurrent waveform input channel that can “hear” audio formats, etc. All of these different input layers (and in some cases stacks of layers starting from an input layer) would be connected to a common “symbolic” level of input so that information arriving via multiple channels could provoke the same symbols for further processing. For each type of input, there would be a corresponding ability to produce output on that channel while simultaneously experiencing that output, the way a child can hear herself speak or see her own drawings. Most of these channels would be silent (and not consume compute resources to run that part of the neural network) most of the time.
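A rough sketch of that arrangement, assuming Python and with trivial functions standing in for the trained networks: each channel turns raw bits into members of one shared symbol set, and a channel is only built the first time the focus needs it. `Focus`, `attend`, `read_text`, and `see_image` are hypothetical names, not anything from an existing library.

```python
from typing import Callable, Dict, Set

def read_text(data: bytes) -> Set[str]:
    # Stand-in for a recurrent text channel: every lowercased word is a symbol.
    return set(data.decode("utf-8", errors="ignore").lower().split())

def see_image(data: bytes) -> Set[str]:
    # Stand-in for a convolutional channel: it only recognises a PNG signature.
    return {"image", "png"} if data.startswith(b"\x89PNG") else {"image"}

class Focus:
    """Hypothetical focus: several input channels feeding one symbolic level."""

    def __init__(self) -> None:
        # Factories, so a channel costs nothing until it is first used.
        self._factories: Dict[str, Callable[[], Callable[[bytes], Set[str]]]] = {
            "text": lambda: read_text,
            "image": lambda: see_image,
        }
        self.active: Dict[str, Callable[[bytes], Set[str]]] = {}

    def attend(self, kind: str, data: bytes) -> Set[str]:
        # Wake the channel on demand; all other channels stay silent.
        if kind not in self.active:
            self.active[kind] = self._factories[kind]()
        return self.active[kind](data)

focus = Focus()
silent_at_start = dict(focus.active)
from_text = focus.attend("text", b"An image of a cat")
from_pixels = focus.attend("image", b"\x89PNG\r\n\x1a\n")
shared = from_text & from_pixels
```

Here the symbol `image` is provoked both by reading the word and by looking at the picture, which is the whole job of the common symbolic level. The output half of each channel is left out of the sketch.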
Additionally the focus (or maybe the environment) ought to be associated with a set of input channels representing events that should attract the AI’s attention – i.e., things the AI becomes aware of because they happen, rather than because it’s specifically paying attention to them. In a Unix system, the logfiles constantly going by and the files in the /tmp directory getting written/read/erased would most likely be something like the sound of the wind to us. Unusual events in that stream would capture the AI’s attention the way unusual noises in the wind (like a freight train or a thunderclap) get ours.
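One very crude way to picture that background channel, as a Python sketch: count how often each "shape" of log line goes by, and flag a line only when its shape has been rare so far. The class name `AmbientStream`, the thresholds, and the idea of normalizing digits are all assumptions made for illustration; a learned novelty detector would replace the frequency count.

```python
import re
from collections import Counter

class AmbientStream:
    """Hypothetical background channel: familiar events fade, rare ones stand out."""

    def __init__(self, rare_below: float = 0.05, warmup: int = 20) -> None:
        self.counts: Counter = Counter()
        self.total = 0
        self.rare_below = rare_below  # fraction below which a shape counts as unusual
        self.warmup = warmup          # events to absorb before flagging anything

    @staticmethod
    def _shape(line: str) -> str:
        # Collapse digits so "cron[17]" and "cron[18]" count as the same event.
        return re.sub(r"\d+", "N", line)

    def observe(self, line: str) -> bool:
        """Return True if this event should capture attention."""
        key = self._shape(line)
        seen_before = self.counts[key]
        self.counts[key] += 1
        self.total += 1
        if self.total <= self.warmup:
            return False
        # Compare against everything seen before this event arrived.
        return seen_before / (self.total - 1) < self.rare_below

stream = AmbientStream()
wind = [stream.observe(f"cron[{i}]: job finished") for i in range(50)]
thunderclap = stream.observe("kernel: I/O error on device sda")
wind_again = stream.observe("cron[50]: job finished")
```

The routine cron lines are the wind; the one kernel error is the thunderclap.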
And finally, there ought to be a full set of tools that can be used via the foci to do things the foci by themselves can’t. Very much in the same way we use tools with our hands. Those tools would be programs, possibly the same programs that humans use to accomplish work on the systems. So, if there’s a file that isn’t in a format the AI understands, it could use a format converter to make a file that it does understand. The level of intelligence required to use general tools effectively is probably out of reach this decade, but in principle the AI ought to be able to use anything that a human can use, with a virtual terminal “seen” via the same convolutional subsystem that’s engaged when it reads graphic and/or video files (or if it’s a text-mode terminal like a bash shell, the same sense that’s engaged when it reads/writes text).
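The format-converter case can be sketched in a few lines of Python. Everything here is hypothetical scaffolding: `UNDERSTOOD` stands for the formats a focus can take in directly, `TOOLS` for whatever programs are lying around, and the only real tool used is the standard library's `gzip.decompress`. On a real system the tools would more likely be external programs.

```python
import gzip

# Formats the focus can take in directly (an assumption for this sketch).
UNDERSTOOD = {"text"}

# Tools keyed by (source format, result format).
TOOLS = {("gzip", "text"): gzip.decompress}

def perceive(fmt: str, data: bytes) -> str:
    """Read data directly if possible, otherwise reach for a converter."""
    if fmt in UNDERSTOOD:
        return data.decode("utf-8")
    for (src, dst), tool in TOOLS.items():
        if src == fmt and dst in UNDERSTOOD:
            # Use the tool, then look at its output with the ordinary sense.
            return perceive(dst, tool(data))
    raise ValueError(f"no tool turns {fmt!r} into something readable")

message = perceive("gzip", gzip.compress(b"hello from inside the archive"))
```

The hard part, of course, is everything the sketch assumes away: knowing which tool to pick up and noticing that the file was unreadable in the first place.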
In short — screw simulated environments that are just meaningless mockeries of a physical world. Let an AI experience the world as the world is where the AI really lives.
Now, as to what needs or desires the AI should have that would compel learning to perceive and understand these things, and learning to use its foci and tools, I know such needs must exist but they’re difficult to enumerate. I know that it has to have a basic need to consume many different kinds of data from many sources, and improve its own ability to predict the future data it consumes. When it looks at those pictures, or those text streams, or whatever, it should be learning to understand (and imitate) what it sees.
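As a toy version of that drive, here is a Python sketch of a predictor that guesses each next byte from the previous one and reports how surprised it was before learning from what it just consumed. The class `NextBytePredictor` and the choice of a bigram count model with add-one smoothing are assumptions for illustration, standing in for whatever the AI's real predictive machinery would be; falling surprise over time is the signal that consuming data is paying off.

```python
import math
from collections import defaultdict

class NextBytePredictor:
    """Hypothetical appetite: predict the next byte, then learn from the miss."""

    def __init__(self) -> None:
        # counts[prev][nxt] = how often byte nxt has followed byte prev
        self.counts = defaultdict(lambda: defaultdict(int))

    def surprise(self, data: bytes) -> float:
        """Average bits of surprise per predicted byte (add-one smoothing over 256 values)."""
        total = 0.0
        for prev, nxt in zip(data, data[1:]):
            row = self.counts[prev]
            p = (row[nxt] + 1) / (sum(row.values()) + 256)
            total += -math.log2(p)
        return total / max(len(data) - 1, 1)

    def consume(self, data: bytes) -> float:
        """Measure surprise on new data first, then update the model with it."""
        before = self.surprise(data)
        for prev, nxt in zip(data, data[1:]):
            self.counts[prev][nxt] += 1
        return before

predictor = NextBytePredictor()
first = predictor.consume(b"abab" * 10)
second = predictor.consume(b"abab" * 10)
```

On the first pass every byte is as surprising as a uniform guess over 256 values; on the second pass the same pattern is much less surprising.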
That motivates understanding its world, but not understanding or being aware of itself. To become aware of itself, it needs to have appetites that it can fulfill via its actions, it needs to be able to anticipate and perceive the results of its own actions, and the actions that meet these desires need to be responsive to the particulars of environment and context (or at least much more effective when they are). Does simply acquiring the data for a great variety of input types and sources qualify? Maybe?