One of the issues in AI is what’s called “Parameter Hell.”
This comes down to a question of where you set the parameters of your AI algorithms. In a neural network, we are constantly asking, what learning rate, how many layers of nodes, what learning algorithm, how many nodes per layer, how do we map inputs to signals and signals to outputs, what ‘momentum’ to set if using gradient-descent-with-momentum (one of the most commonly used, fairly reliable, learning algorithms), etc. We ask ourselves whether and how fast to reduce learning rate, what degree of accuracy is to be accepted as end-of-run, and how to regularize in order to prevent overfitting. We ask ourselves whether to use Radial Bias or Sigmoid networks, and more generally what output functions to use. In Deep Learning systems we are creating deep layers from which various aspects of the output of previous layers can be reconstructed, and again we have choices and choices and choices about exactly how to do it.
There is a similar set of questions facing work with genetic algorithms of all kinds. Is it a classic GA where you’re seeking the best possible individual solver, or is it an EA where solvers work together to produce a best possible solution? Where do you set the mutation rate? How do you encode the genome? What are your recombination operators? What are your mutation operators? If you have multiple different mutation and/or recombination operators, how often does each of them happen? Do you reduce the mutation rate during the run, and if so by what amount and how far? How do you select individuals for reproduction? What are your criteria of fitness? Fitness criteria alone could take up entire volumes of advice. You have to specify them so that solvers can’t get a better fitness score without creating actual increased value. That sounds much simpler than it is, because GA is endlessly inventive at coming up with ways to exploit your fitness criteria.
What frustrates most workers about Parameter Hell is that there is no one best set of answers and no proven procedure for finding the best set of answers for any particular problem. Different problems require different sets of parameters, and figuring out where to set them is, so far, more art and intuition than science and knowledge. To be fair, in neural networks it has probably progressed from art and intuition to at least the stage of being craft and experience. But there’s still no procedure that can take the description of a problem and tell you, without a lot of experimentation, where to set all your parameters, and there are more combinations of parameters than you can possibly test individually. It is still possible before actually performing the experiment, for experts to disagree in profound ways about what the effect of changing any particular parameter will be. In short Parameter Hell is one of many issues in AI work where we get our noses rubbed in the fact that we don’t have a complete understanding of the relationship between our problems and our means for trying to produce solutions.
Genetic algorithms, including EA, have long suffered from a limit on complexity, which I wrote about at length in my last post. But to recap:
This is the Complexity Conundrum. In order to preserve complexity, you need to preserve information. But both of your primary operators in GA – recombination and mutation – destroy information. Information is lost, especially in small populations, in recombination operations as the entire genome of the individual to be replaced is gone. The individual that replaces it contains only old information – copies of alleles already present in the parents. The balancing act of a successful GA is to introduce both good and bad information with mutation, and selectively preserve good and eliminate bad information with recombination. Recombination is constantly overwriting crap introduced by mutation with information that has already undergone selection, copied from a parent. But if the crap arrives too fast, recombination falls behind, and it may be so busy eliminating the crap that mutation has introduced over the last 20 generations that it fails to preserve the one non-crap mutation introduced in the current generation. At any rate mutation can keep recombination so busy eliminating crap that it never has time to produce complexity. The situation is complicated by the fact that different alleles are beneficial or harmful in different combinations, so the worthwhile mutation may not be beneficial in the context of the first individual where it appears. This is why we try to do recombination instead of just duplication. Anyway, if you set the mutation rate very low or the population very large, you get very slow progress. But if you set the mutation rate high or the population small, you can’t preserve sufficient information to allow significantly complex solutions to stabilize.
So how does mama nature do it? It turns out that evolution is a highly tuned process. In biological organisms incredible amounts of complexity are preserved in the genomes of animals and plants. In structures where a high rate of mutation would destroy delicate complexity, mutation rates are so low that these areas of the genome can barely be said to evolve at all. Other areas of the genome have much higher rates of mutation allowing the experiment of mutation and selection, and possibly the emergence of new complex structures, to proceed with things that are, for now, highly diverse and somewhat peripheral to basic function. Different kinds of mutation are more likely in some areas than others, and different kinds of sites are more probable for mutations than others in different areas. How did all this complexity come about? It came about the way all the rest of Mama Nature’s biological complexity has come about. It evolved.
Mama nature, in short, has an “all of the above” approach to parameter setting, with all or nearly all of the parameters directly encoded in the genome. The few attempts we’ve made to model this kind of process in computer GA’s have been defeated by failing to set up criteria under which the rewards for beneficial mutations are correctly (ie, in proportion to actual value) rewarded. Further, mama nature gets around the complexity conundrum by embracing a heterogenous solution in which different parts of the genome are subject to different parameter settings, which is something I haven’t seen at all in computer genetic algorithms yet.
As somebody who is trying to evolve something very complex, I am taking these lessons to heart. We should be setting up our systems so that things like this can first of all be expressed on the genome, and so that the rewards to a mutated individual are awarded according to the real value of that mutation. So as long as I had to rip it apart and build it back different anyway, I decided to code this.
In fact I figured I might as well do things that mama nature didn’t manage, such as allow Lamarckian evolution. I was familiar with the Lamarckian theory of evolution and it can be reasonably modeled on computers. Inspired, I searched for other evolutionary theories, but when I went looking for others the search engines uniformly directed me to useless crap. Mostly morons making noise about “creation science” and other ridiculous or useless delusions. I did eventually find a few other now-discredited-but-plausible theories of evolution. Unfortunately none of them were specific enough for me to figure out ways to model. Anyway, the language expressed by the genomes in the new system I’m building now has new instructions such as:
- initialize: If this gene is activated, all children must carry a random mutation in this location. This is the default for so-far-unused locations when the genome is expanded; doing this instead of directly assigning random genes means no alleles get ‘accidentally’ eliminated before they are even used. If something is to be eliminated I want it to be as a result of selection pressure, not random.
- speciate: at least 1 of the 4 recorded (inherited) parameters at this location must match between prospective parents for breeding to work. The more parameters match the more likely breeding is to succeed (‘structural’ mutation occasionally overwrites one parameter with a random value).
- stabilize: Parameters determine a modifier of the general mutation rate and a particular genome distance of this ‘stabilize’ instruction within which that modifier is active.
- stabilize_inst: Parameters determine particular types of instructions to make less (or more) subject to mutation than others. Like stabilize, it works within a limited area of the genome.
- stabilize_vals: As above but modulates the effect of value-changing (as opposed to frequency of instruction-changing) mutations.
- target_inst: Adjust the probability distribution of new instructions to be created by mutations.
- target_struc: Adjusts the probability distribution of genome structure changes to be created by mutations. Genomes can expand, contract, speciate at different rates, which parts of them are active or the order in which those parts are executed can change, instructions in the genome can be relocated putting them in range of different stabilize and target instructions without changing the order in which they are considered, etc.
- lamarckian_op: On a randomly-determined basis occasionally perform ‘directed mutation’, such as training a specified neural network using feedback or other means for a few rounds and modifying the genome’s output content to reflect the effects. Other types of solver may get different Lamarckian operations. But Lamarckian operations have a cost in Delphic wealth (reproduction opportunities) proportional to CPU time required to perform them.
- reproduction_limit: forces any mutant child to reproduce using the parent’s genome the first one to three times it reproduces (if it survives to reproduce at all) before starting reproduction using its genome as mutated. This gives selection extra time to respond to a maladapted mutation before risking introducing it to the general population and therefore reduces the complexity cost of mutation to the genome’s offspring and the system at large.
The usefulness of these new instructions is all speculative at this point, but I have hopes that putting as many parameters as possible onto the genome itself, and allowing the genome to produce complex mappings of those parameters, is the right approach to balance the preservation of existing complexity with the production of new complexity.