Well, I think I can now wind down another major part of my crazy AI project. I now have a neuro-evolution algorithm that appears to be suitable for pretty much any neural network, including recurrent networks at the largest scale.
There’s nothing particularly specialized about it. It can be used for any type of artificial neural network. I’m interested in AI, so I’m most interested in recurrent networks with continuous I/O, but it’s equally applicable to pure feedforward networks of the conventional type.
It’s surprisingly short and simple. Good things always are, it seems. But to get here I’ve written many times as much code as now exists in these files. I try never to call them failures. They were necessary experiments to gain the knowledge I needed to write the success. When you don’t get what you wanted, you still get experience. But holy cow, this took a lot of experiments. I must have written and ripped apart twenty or thirty things more complicated than this along the way.
So, a little exposition of the algorithm. I’m glossing over a lot of details for the sake of brevity.
It’s an Evolutionary Algorithm: an EA seeks a balance of fitness across the whole population, where a classic Genetic Algorithm seeks a single most-fit individual.
It evolves a neural network in the context of a Cartesian map of locations, in 2 or 3 dimensions, which introduces ideas of distance that turn out to be useful. You start with a “blank” map on which the input and output nodes are each mapped to some location and associated with one or more units of source or destination bandwidth. The blank map should have equal amounts of source and destination bandwidth, corresponding to system input and output respectively. (The notion of evolving a network mapped to a coordinate system came from NEAT and HyperNEAT; the allocation of connections by bandwidth is my own idea.) Each node you add to the map will have equal amounts of input and output bandwidth, corresponding to the connections that node accepts and emits, whether to or from system input/output or another node.
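To make the setup concrete, here is a minimal sketch of a blank map in Python. The names (IOPoint, blank_map) and the 2-D tuples are my own illustration, not part of the algorithm proper:

```python
from dataclasses import dataclass

@dataclass
class IOPoint:
    # a system input or output pinned to a fixed map location (illustrative name)
    location: tuple   # 2-D Cartesian coordinates, e.g. (x, y)
    bandwidth: int    # units of source (inputs) or destination (outputs) bandwidth

def blank_map(inputs, outputs):
    # a blank map is well-formed only when total source and destination
    # bandwidth balance, corresponding to system input and output
    src = sum(p.bandwidth for p in inputs)
    dst = sum(p.bandwidth for p in outputs)
    assert src == dst, "source and destination bandwidth must be equal"
    return {"inputs": inputs, "outputs": outputs}
```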
Each genetic “carrier” or individual represents part of a neural network: one or more neural connections, each with an origin location, a destination location, a weight, and a bandwidth; or one or more nodes, each with a location and whatever other attributes you’re giving nodes. Additionally, each carrier has several other attributes, including its own mutation probability. This is very important: mutation probability is coded individually for each carrier and subject to mutation, not fixed for the gene pool as a whole. Otherwise this algorithm won’t scale to the very largest sizes.
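A connection-type carrier, then, might look something like this (field names are my own illustration; the fitness field F is explained below):

```python
from dataclasses import dataclass

@dataclass
class Carrier:
    # a connection-type genetic carrier (illustrative field names)
    origin: tuple         # desired origin location on the map
    destination: tuple    # desired destination location on the map
    weight: float         # connection weight
    bandwidth: int        # units of connection bandwidth
    mutation_prob: float  # per-carrier mutation probability, itself subject to mutation
    fitness: float        # F: estimated probability of membership in the optimal set
```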
To use a set of genetic carriers to build a neural network, you take the whole set, add up the bandwidth of all the connections specified by the carriers, and let that be the number of nodes in your network. You either pick node carriers according to their fitness until you have that number of nodes, or if you’re not using node carriers, you can select a location for each node at a random point in your map. Each node is assigned a source and destination bandwidth equal to one.
Then you iteratively find the connection whose origin and destination points are the shortest total distance from nodes with nonzero source and destination bandwidth respectively, add that connection to the map at those nodes, and subtract that carrier’s bandwidth from those nodes’ source and destination bandwidth, until you have processed all the carriers. This process, assuming you handle the overflows and roundoffs right, will use up all of the bandwidth available in the nodes and at the inputs and outputs. The result is your evolved network.
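Here’s a rough sketch of that embryogenesis step, assuming the Carrier sketch above and nodes represented as simple dicts. Overflow and round-off handling is elided, exactly as it’s glossed over in the prose:

```python
import math

def nearest(nodes, point, kind):
    # nearest node that still has bandwidth of the given kind ("src" or "dst")
    live = [n for n in nodes if n[kind] > 0]
    return min(live, key=lambda n: math.dist(n["loc"], point))

def build_network(carriers, nodes):
    # nodes: dicts like {"loc": (x, y), "src": 1, "dst": 1}
    connections = []
    pending = list(carriers)
    while pending:
        # total distance from this carrier's endpoints to the closest
        # nodes that still have source/destination bandwidth to give
        def slack(c):
            s = nearest(nodes, c.origin, "src")
            d = nearest(nodes, c.destination, "dst")
            return (math.dist(s["loc"], c.origin)
                    + math.dist(d["loc"], c.destination))
        c = min(pending, key=slack)
        pending.remove(c)
        s = nearest(nodes, c.origin, "src")
        d = nearest(nodes, c.destination, "dst")
        connections.append((s, d, c.weight))
        s["src"] -= c.bandwidth  # overflow/round-off handling elided
        d["dst"] -= c.bandwidth
    return connections
```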
So there is a form of artificial embryogenesis involved in the creation of individuals for test, and the individual genetic carriers involved in creating the individual are the population or gene pool of the evolutionary algorithm.
Three important numbers we have to decide are:
G, the number of carriers in the gene pool.
A, the number of carriers used to build an Agent (it must be substantially less than G).
R, or the learning rate, should be “small.” 1/T where T is the number of test cases would be optimal (according to Bayes’ rule) if our estimates of problem difficulty were perfectly accurate and our test cases were perfectly uncorrelated. But our estimates lag the reality, and test cases are usually correlated if there are patterns to be learned, so R somewhat larger than 1/T is usually reasonable.
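For instance (the numbers here are purely illustrative):

```python
T = 1000     # number of test cases (assumed for illustration)
R = 5.0 / T  # somewhat larger than 1/T, allowing for correlated tests
```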
There are test cases that the system is being evolved to work on, and each test case must be associated with a difficulty D, defined as the fraction of tested agents currently failing that test. D will change as evolution changes the capabilities of the agents being tested, and can be tracked for each test with a simple running average.
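The running average can be as simple as an exponential moving average per test; the smoothing rate here is an assumed value, not specified above:

```python
def update_difficulty(D, failed, rate=0.01):
    # move D toward 1 when an agent fails this test, toward 0 when it passes
    return D + rate * ((1.0 if failed else 0.0) - D)
```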
The competition of agents under each test is implicit in the tracking of D. Passing a test has more value the higher its difficulty, and failing it has a higher penalty the lower its difficulty.
Associated with each carrier is a number F, its fitness. The fitness F is defined as the current estimate of the probability that that genetic carrier is a member of the optimal population available for constructing an agent. As such it will range from 0 to 1, and its mean across the population will be A/G.
Each time an agent succeeds in a test, the fitness of its carriers is updated by the rule
F’ = F + R * (1 - F) * (D / (1 + R - D)) * (1/A)
and each time an agent fails a test, the fitness of its carriers is updated by the rule
F’ = F - R * F * ((1 - D) / (D + R)) * (1/G)
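In code, the two rules transcribe directly (applied to each carrier that went into the tested agent):

```python
def fitness_after_success(F, D, R, A):
    # F' = F + R * (1 - F) * (D / (1 + R - D)) * (1/A)
    return F + R * (1.0 - F) * (D / (1.0 + R - D)) / A

def fitness_after_failure(F, D, R, G):
    # F' = F - R * F * ((1 - D) / (D + R)) * (1/G)
    return F - R * F * ((1.0 - D) / (D + R)) / G
```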
Iteration of this procedure results in a population of carriers, each with a fitness score that is an estimate of the probability that it is included in the optimal set of carriers for an agent. The rule for selecting a set to make an agent for test is simply to iterate through the entire population, selecting each carrier with probability equal to its fitness score. Usually this will result in about A carriers, because of the fitness values the update rules asymptotically approach. I’m still trying to figure out whether to actually enforce A as an upper and/or lower bound.
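The selection rule, as a sketch (assuming carriers carry their fitness as in the Carrier sketch above):

```python
import random

def carriers_for_agent(pool):
    # pick each carrier independently with probability equal to its fitness;
    # since fitness averages A/G across the G carriers, this yields about A picks
    return [c for c in pool if random.random() < c.fitness]
```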
This update rule replaces my earlier reliance on the Delphic Casino method for this kind of EA carrier selection. It is more precise and considerably more efficient. Deriving it mathematically was surprisingly hard.
In an EA there is no need for crossover or recombination on the level of the population; recombination of genetic material takes place in the process of constructing testable individuals. Conversely, in an EA we are not looking for a single “best” individual in the population. Instead, we are optimizing to produce a mix of genetic material most likely to produce good testable individuals.
What a conventional GA does by changing the frequency with which a particular gene appears across the entire population, this type of EA does by increasing the fitness score of the carrier; there is no need to make more exact copies of the same carrier, because increasing its fitness increases the percentage of the time it appears in testable agents, just as additional copies would in a standard GA. The needs of evolution are served by introducing mutants occasionally.
Accordingly, when the fitness of a carrier has dropped below some threshold point, you can replace it with something else. You can make a ‘mutant’ copy of an existing carrier, or create a carrier by whatever other method you like, assign it a low fitness, and drop it back in.
Some of these mutants are indeed ‘flawed copies’ of their parents, with a location or a weight or a bandwidth slightly tweaked; these exploit opportunities for hill climbing. Some are ‘Lamarckian’ variants of their parents, exploiting the fact that the embryogenesis process forces some approximation: the same carrier can be expressed in different ways, and a ‘Lamarckian’ variant is a new carrier that is more likely to be expressed as the current phenotype than the carrier that actually produced that phenotype. Like flawed copies, Lamarckian variants exploit hill-climbing possibilities. And some mutants, a very few, are purely random.
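A ‘flawed copy’ might look like this (the jitter sizes and the newcomer fitness are assumed values, continuing the Carrier sketch above):

```python
import copy
import random

def flawed_copy(parent, jitter=0.05, newcomer_fitness=0.05):
    # the parent with location and weight slightly tweaked, re-entering
    # the pool at a low fitness as described above
    child = copy.deepcopy(parent)
    child.origin = tuple(x + random.gauss(0, jitter) for x in parent.origin)
    child.destination = tuple(x + random.gauss(0, jitter) for x in parent.destination)
    child.weight += random.gauss(0, jitter)
    # the mutation probability itself is subject to mutation
    child.mutation_prob = min(1.0, max(0.0, parent.mutation_prob + random.gauss(0, 0.01)))
    child.fitness = newcomer_fitness
    return child
```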
You can make mutants algorithmically if you know something about the map; for example, if you’ve got inputs and outputs mapped in a bilaterally symmetrical way, you can make a mutant by mirroring a carrier’s information across the axis of symmetry, on the supposition that an analogous structure can exist on the other side.
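For a map that’s symmetric about a vertical axis, the mirror mutant is just a reflection (the axis position is whatever symmetry your map was laid out with):

```python
import copy

def mirror_mutant(parent, axis_x=0.0, newcomer_fitness=0.05):
    # reflect origin and destination across the vertical axis at x = axis_x
    child = copy.deepcopy(parent)
    child.origin = (2 * axis_x - parent.origin[0], parent.origin[1])
    child.destination = (2 * axis_x - parent.destination[0], parent.destination[1])
    child.fitness = newcomer_fitness
    return child
```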
You can also make mutants algorithmically if you infer something about the structure that the carriers are inducing. For example, if you’re tracking excitation states by location and notice that increased excitation near location X is always followed two or three steps later by increased excitation near location Y, you can create a mutant that directly connects location X and location Y, on a Hebbian rationale, to see if that makes a more efficient, advantageous connection. If you see several carriers with low bandwidth and high fitness all going from nearly the same location to nearly the same location, you can see whether a single higher-bandwidth carrier with a higher weight will be advantageous as a way to achieve the same connectivity using a smaller number of carriers.
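The consolidation case, sketched; the endpoint averaging and the summed weight are my own illustrative choices, not prescribed by the algorithm:

```python
import copy

def consolidation_mutant(candidates, newcomer_fitness=0.05):
    # candidates: low-bandwidth, high-fitness carriers running between nearly
    # the same pair of locations; propose one higher-bandwidth replacement
    n = len(candidates)
    child = copy.deepcopy(candidates[0])
    child.origin = tuple(sum(c.origin[i] for c in candidates) / n for i in range(2))
    child.destination = tuple(sum(c.destination[i] for c in candidates) / n for i in range(2))
    child.bandwidth = sum(c.bandwidth for c in candidates)
    child.weight = sum(c.weight for c in candidates)  # one plausible "higher weight"
    child.fitness = newcomer_fitness
    return child
```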
You can even perform a wholesale mutation algorithmically. If you have a structure and you suppose that an additional copy of it would be useful, you can use geometry to remap carriers in the Cartesian space to make room for an additional copy, and then make geometrically transformed copies of the original structure’s carriers to occupy the new space.
In the usual case you should make mutants only in the neighborhood of ‘parent’ carriers whose mutation probabilities allow it, or as variants specifically of those parents; otherwise you’re looking at delicate structure you shouldn’t disrupt. The algorithm works as well as any GA even without respect for this rule, but to be fully scalable it has to simultaneously allow mutation in unsettled or non-critical parts of the map and preserve delicate structure in parts of the map where low mutation rates restrict change.
The specific methods available for making mutants are amazing, but what really matters about the map is the reason they’re possible, and it would matter even if the specific methods of mutation above didn’t obtain. The space metaphor gives the carriers a common context of location that they can refer to without referring to particular nodes or particular other carriers specifically. That means the evolution process is far less structure-dependent than any other neuro-evolution process I’ve seen described, and that is what gives it its particular power.
Because it isn’t dependent on its structure, it isn’t limited in what final structure it can find. Like water or sand, it can take any particular shape. I’ve seen a lot of algorithms where the author has the temerity to prove that they can find any neural structure, but the proofs offered rather remind me that someone could prove it’s possible for a one-franc coin dropped from the top of the Eiffel Tower in a windstorm to land, on its edge, on top of the Arc de Triomphe. Possible perhaps, but who’d believe it? And would such a windstorm really leave either structure standing?
This algorithm, on the other hand, does not require any information about some earlier node or connection to propagate through it in order to enable some later node or connection to become connected, or vice versa. It does not require tracking or reconciling the evolutionary path of connections. It does not require reconciling genomes of different length, or tracking derivation order of connections, or any of a thousand other complications. Connectivity does not depend on evolutionary path. The proof that it can find a network of any given topology is as simple as the proof that water can be any shape.
So, finally. I have a neuro-evolution algorithm that I’m truly satisfied with. One which I believe is fully up to the challenges ahead, whether those challenges are as humble as optimizing traffic flow by regulating the timing of traffic lights, or vast beyond our current imagining.
Not that that satisfaction will prevent me from tweaking and experimenting and trying to refine it and so on – that’s what I do, after all – but I finally have something I believe in.