So, I’ve already decided that my current effort will be to produce a Recurrent Signal Propagation Network. Recurrent Signal Propagation Networks are considered to be extremely hard to train, and therefore lots of people have given up on them, saying they don’t scale to “real” problems. I’m so crazy that I’m going to try to use it on the biggest problem that exists – consciousness.
I’ve already decided that some parts of its structure will need to be built using Knowledge-Based Cascade Correlation Learning to get very efficient subnetworks. I’m probably going to use a modified KBCCL where I consider possible added nodes in small sets rather than one at a time.
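To pin down what I mean by “small sets”: here’s a toy sketch of the candidate-scoring step in cascade correlation, where a small pool of candidate hidden units is scored by how strongly each unit’s output covaries with the network’s current residual error, and the best of the pool gets kept. Everything here – the data, the pool size of 5, the tanh units – is invented for illustration, not the real KBCCL machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: inputs, and the residual error of the network being grown.
X = rng.normal(size=(200, 4))
residual = np.tanh(X @ rng.normal(size=4)) + 0.1 * rng.normal(size=200)

def candidate_correlation(w):
    """Cascade-correlation score: |covariance between candidate output and residual|."""
    out = np.tanh(X @ w)
    return abs(np.mean((out - out.mean()) * (residual - residual.mean())))

# Consider candidates in a small set (here, 5 at a time) rather than one by one,
# and keep the one whose output best tracks the residual.
candidates = [rng.normal(size=4) for _ in range(5)]
scores = [candidate_correlation(w) for w in candidates]
best = candidates[int(np.argmax(scores))]
```

In the real thing the candidates would be trained to maximize that correlation before selection; the sketch only shows the scoring and the set-at-a-time selection.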
I know that even where I can’t necessarily make progress on a visible goal, I can use KBCCL (or other techniques) to produce stackable autoencoders, and that doing so is likely to produce reduced-dimensionality signals that can be used to make progress on an ultimate goal.
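The stacking itself is simple enough to sketch (the dimensions, step counts, and learning rate below are invented): train one tied-weight linear autoencoder layer by plain gradient descent, then train a second layer on the first layer’s codes, yielding progressively reduced-dimensionality signals.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))  # raw signals

def train_autoencoder(data, code_dim, steps=800, lr=0.02):
    """One tied-weight linear autoencoder layer, trained by gradient descent
    on the reconstruction error ||data @ W @ W.T - data||^2."""
    n, d = data.shape
    W = rng.normal(scale=0.1, size=(d, code_dim))
    for _ in range(steps):
        err = data @ W @ W.T - data
        W -= lr * 2.0 * (data.T @ err @ W + err.T @ data @ W) / n
    return W

W1 = train_autoencoder(X, 4)      # first layer: 8 -> 4
codes = X @ W1                    # reduced-dimensionality signal
W2 = train_autoencoder(codes, 2)  # stack a second layer on the codes: 4 -> 2
```

The codes from each layer are exactly the reduced-dimensionality signals the paragraph above is talking about; a later goal-directed network can consume them instead of the raw inputs.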
For some things (such as sensory inputs) I’m going to need convolution networks of several different degrees. It may be possible to train convolution kernels using KBCCL.
I know that some parts of its structure, including those built by KBCCL, will need to be trained (or maybe only finetuned) using backpropagation. Recurrent neural networks are inherently deep (infinitely deep in fact, when regarded as a time series) and therefore have a reputation for being dastardly hard to train with backpropagation. Even with very small learning rates, it often produces wild shifts in behavior. I have the hubris to believe I can use backpropagation anyway, because I think that I understand why these gradient problems happen and how to ameliorate them. Of course, I could be wrong.
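Part of why I think the wild shifts are tameable: over T iterations, BPTT multiplies the backpropagated signal by the recurrent weight again and again, so the gradient scales roughly like w^T – astronomically large the moment |w| strays from 1. A toy linear recurrence (all numbers invented) shows the explosion, and the most obvious amelioration: clip the gradient before applying the update, so even a huge raw gradient produces a bounded step.

```python
def bptt_grad(w, xs, target):
    """Gradient of (h_T - target)^2 through the linear recurrence h_t = w*h_{t-1} + x_t."""
    hs = [0.0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    grad, dh = 0.0, 2.0 * (hs[-1] - target)
    for t in range(len(xs), 0, -1):
        grad += dh * hs[t - 1]   # dloss/dw contribution at iteration t
        dh *= w                  # propagate back one more iteration: another factor of w
    return grad

g = bptt_grad(1.5, [1.0] * 40, 0.0)   # 40 steps with |w| > 1: gradient ~ w**40
clipped = max(-1.0, min(1.0, g))      # bounded step regardless of raw magnitude
```

With |w| = 1.5 over 40 steps the raw gradient is in the millions, so even a tiny learning rate produces a wild parameter jump – unless the gradient is clipped (or normalized, or the architecture keeps the effective recurrent gain near 1).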
If I’m wrong then I’ll have to use genetic algorithms for training. Genetic Algorithms are currently among the most popular ways to train and structure Recurrent Neural Networks. GAs are inherently able to ignore locally extreme gradients that can screw up BPTT training – genes that produce behavior uncorrelated with good performance are simply eliminated. I’ve used them before but I dislike them. Getting them to scale, like getting Recurrent Signal Propagation Networks to scale, is dastardly hard.
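For the record, the GA loop itself is tiny – here’s a toy version that evolves a single recurrent weight purely by mutation and elitist selection, no gradients anywhere. The fitness function (drive a clamped recurrence’s settled value toward a target) and every number in it are invented for illustration.

```python
import random

random.seed(0)

def fitness(w, target=0.6, steps=20):
    """Run a clamped linear recurrence and score how close it settles to target."""
    h = 0.0
    for _ in range(steps):
        h = max(0.0, min(1.0, w * h + 0.5))
    return -abs(h - target)

# Elitist GA: keep the 10 best genes, refill the population with mutated copies.
pop = [random.uniform(-2.0, 2.0) for _ in range(30)]
for _ in range(50):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]
    pop = survivors + [w + random.gauss(0.0, 0.1) for w in survivors for _ in range(2)]
best = max(pop, key=fitness)
```

Note what’s absent: no derivative of fitness with respect to w is ever taken, which is exactly why locally extreme gradients can’t hurt it – and also part of why it scales so badly.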
Evolution of Signal Propagation Networks using GA can handle some signal types (literally, types – things like characters and strings) that are inherently non-differentiable and therefore cannot be trained by backpropagation. On the other hand, using some output signals to control a stochastic probability of doing different things with signals of those types can also be effective, depending on whether you can make nodes that have differentiable inputs and outputs whose “error” can be quantified in terms of what ought to have been done with the non-differentiable inputs and outputs.
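Here’s the stochastic-control idea in miniature, using a score-function (REINFORCE-style) update: a learnable logit vector sets the probability of picking each discrete string operation, and feedback about what ought to have been done becomes a gradient on the logits even though the operations themselves are non-differentiable. The two operations, the reward, and all constants are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

OPS = ["upper", "lower"]        # non-differentiable string operations
logits = np.zeros(2)            # differentiable control signals

def reward(result):
    """1.0 if the op did what ought to have been done, else 0.0."""
    return 1.0 if result == "HELLO" else 0.0

for _ in range(300):
    p = np.exp(logits) / np.exp(logits).sum()   # softmax: probability of each op
    i = int(rng.choice(2, p=p))                 # stochastic discrete choice
    r = reward(getattr("Hello", OPS[i])())      # apply the chosen op
    onehot = np.zeros(2)
    onehot[i] = 1.0
    logits += 0.1 * r * (onehot - p)            # score-function update on the logits
```

After training, the probability mass has shifted onto the operation that earns reward – the discrete, type-valued part stays non-differentiable, but the signals controlling it get a usable error gradient.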
Because I’m unlikely to be able to do this in the first place, I’m even less likely to be able to do it without Genetic Algorithms. So I will probably wind up implementing GA regardless. You can’t call it a failure until you’ve tried everything you know how to try.
One thing I know how to do that I haven’t seen anyone else talk about doing is train a recurrent network using retrospective backpropagation through time. Properly speaking, they may not have talked about it because it’s an implementation detail. It’s just an optimization; it doesn’t allow any fundamental capabilities that other techniques don’t allow, so researchers may have used it without thinking that it was worth writing about. But it does make exercising those capabilities more efficient – to the point where it could make a major difference in scalability. In a recurrent network you have a set of output nodes that produce output on every iteration. A straightforward step when trying to train a system is to train some of those nodes to predict how well it’s going to do in the future. Give it feedback on the accuracy of its prediction and on the degree by which it beats or falls short of its prediction. This is particularly effective because the same information that the prediction task extracts from lower layers and earlier iterations is highly useful to, and made available to, the nodes that decide what actions to take.
But the information that you can use to train the prediction network (and the actions) isn’t available until later. Eventually, a predicted event occurs, and it gives you feedback for all the predictions made along the way. Or you have a final outcome/reward, and you want to positively or negatively reinforce the actions that led to it at every step along the way. This gives you a bunch of feedback relevant to different moments in time. In principle, or in a naive implementation, you train by doing backpropagation through time on the whole retained history prior to each iteration you have feedback for, starting from each of those moments and reaching into its past. Retrospective backpropagation through time means you can start your backpropagation with the current output, then defer handling signals propagated from earlier iterations until completely finished with the subsequent iterations. This allows you to combine the backprop-through-time feedback from your subsequent iterations with the direct feedback for earlier iterations. In effect this means you can do a single BPTT iteration that uses all of the feedback that has become available, even when different parts of that feedback are applicable to different iterations.
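Maybe it’s clearer in code. For a toy linear recurrence with feedback arriving for three different iterations (weights, inputs, and feedback values all invented), the naive approach runs one full BPTT pass per feedback moment; the retrospective version makes a single backward pass from the latest iteration, folding each iteration’s direct feedback into the signal already propagated back from later iterations. Both yield the same total gradient.

```python
# Linear recurrence h_t = w*h_{t-1} + x_t, with direct feedback dL/dh_t
# available for several different iterations.
w = 0.9
xs = [1.0, 0.5, -0.3, 0.8, 0.2]
feedback = {1: 0.4, 3: -0.7, 4: 1.0}   # dL/dh_t for iterations that got feedback

hs = [0.0]
for x in xs:
    hs.append(w * hs[-1] + x)

def naive_grad():
    """One separate BPTT pass per feedback moment, each reaching into its past."""
    total = 0.0
    for t_fb, df in feedback.items():
        dh = df
        for t in range(t_fb, 0, -1):
            total += dh * hs[t - 1]
            dh *= w
    return total

def retrospective_grad():
    """A single backward pass, deferring earlier iterations until later ones are done."""
    total, dh = 0.0, 0.0
    for t in range(len(xs), 0, -1):
        dh += feedback.get(t, 0.0)   # fold in this iteration's direct feedback
        total += dh * hs[t - 1]
        dh *= w
    return total
```

The naive version touches each early timestep once per later feedback moment, so its cost grows with (history length × number of feedback moments); the retrospective version touches each timestep exactly once – that’s where the scalability win comes from.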
Is that about as clear as mud? Not that it matters much. This is just my implementation notebook, mostly for putting my own thoughts together. I don’t really expect more than a dozen people to ever read the blog. After all, I’m speaking Geek.
Anyway, the substructures that can most easily be made available as nodes within the Signal Propagation Network are Artificial Neural Networks (or individual artificial neurons or layers of them). This specifically includes Feedforward networks, which are familiar and can be independently trained outside of the Recurrent context. It includes Hopfield Networks, which serve as a kind of associative memory, and Bidirectional Hopfield Networks. Bidirectional Hopfield Networks are essentially two Hopfield networks with a common hidden layer – they serve as BAM, or Bidirectional Associative Memory. It includes LSTM or Long Short Term Memory, and MTRNN or Multiple Timescale Recurrent Neural Networks (which have different types of neural activity that take place in different timescales, but in which each type of neural activity depends on training which takes place in both its own and other timescales).
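For reference, the Hopfield case really does fit in a few lines – Hebbian storage of a couple of patterns, then asynchronous recall from a corrupted cue. The patterns and the corruption below are chosen arbitrarily.

```python
import numpy as np

# Store two bipolar patterns with the Hebbian outer-product rule.
patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1,  1, 1, -1, -1, -1]])
n = patterns.shape[1]
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0.0)          # no self-connections

def recall(state, sweeps=5):
    """Asynchronous updates until the state settles into a stored attractor."""
    state = state.copy()
    for _ in range(sweeps):
        for i in range(n):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

noisy = np.array([1, -1, 1, -1, 1, 1])   # first pattern with one bit flipped
```

Feeding the noisy cue back through `recall` recovers the first stored pattern – the associative-memory behavior that makes these worth embedding as subnetwork nodes.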
Other substructures that can be made available, with increased levels of trickiness, include various types of interfaces to machine memory – stacks, queues, and arrays, with operations on them controlled by propagated signals. This includes things that have been called Neural Turing Machines, Neural Finite-State Automata, and Neural Pushdown Automata.
Turing Machines operate on a “tape” or one-dimensional memory. One generalization of this is Tur-Mites, which are the same idea operating on a multi-dimensional memory. Tur-Mites, in turn, are a special case of Cellular Automata in which at any one time, only a single cell contains a “value” inducing further changes. And Cellular Automata, in their turn, are a special case of Recurrent Signal-Propagation Network in which the new value (output) of all cells is determined by applying an identical convolution kernel to the previous outputs of itself and its local surrounding cells. This brings us back to the Recurrent Signal Propagation Network as a model of general computation.*1
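The cellular-automaton end of that chain fits in a few lines: each cell’s next value comes from the same weighted sum over its neighborhood – a convolution kernel of [4, 2, 1] encoding the three cells as an integer – fed through one fixed lookup, identical everywhere. That’s exactly the structure of a single RSPN recurrence with shared weights. Rule 110 is chosen arbitrarily here.

```python
import numpy as np

RULE = 110  # elementary CA rule number; its bits are the lookup table

def step(cells):
    """One CA iteration: the same kernel applied to every cell's neighborhood."""
    left = np.roll(cells, 1)
    right = np.roll(cells, -1)
    idx = 4 * left + 2 * cells + right   # convolution with kernel [4, 2, 1]
    return (RULE >> idx) & 1             # identical lookup for all cells

cells = np.zeros(16, dtype=int)
cells[8] = 1                             # single live cell
for _ in range(8):
    cells = step(cells)
```

Eight iterations grow the single live cell into the familiar leftward Rule 110 triangle – all of it driven by one shared local rule, which is the point.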
Some “appliances” could be made available to directly handle specialized functions, including operations on strings, characters, graphics, etc. These include integer calculations, casting operations, grammatical transformations such as suffixing, prefixing, and stemming, interpreters, parsers, graphics filters and editors, compilers, database access, etc. Some of that is pretty darn wild and woolly – hard to imagine ways to integrate with an RSPN – and quite possibly includes things I’d give up on before actually doing them. Other parts of that are flatly necessary, the way eyes and ears and hands and the ability to produce language and a world to perceive and act upon are necessary to us meatbags. They are just things I have to do if I’m to provide an environment sufficiently interesting and stimulating to provoke the kind of intelligence I’m hoping for.
It’s not appropriate to model each of these things directly as Recurrent Signal Propagation Networks, because each of these specializations is useful in its own right. They give us different tradeoffs of memory, efficiency, and behavioral complexity, in many cases exceeding the capacity that we could directly model as an RSPN.
So how do we interface each of these things with our RSPN, without negating those advantages by modeling them directly as RSPNs? For a Finite State Automaton, a connected, trainable, SPN can be its transfer function. For a Turing Machine or a Tur-Mite, a Finite-State Automaton (and therefore a connected, trainable SPN transfer function) governs the action of the ‘head’ as it reads and writes. A Cellular Automaton is a special case of a Recurrent Signal Propagation Network in itself, having the property that the transformation function is determined by local connectivity and identical for all cells – it’s a convolution kernel.
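A stub of the first of those – a finite-state automaton whose transition function is a trainable weight tensor instead of a hand-written table. I’m only running it forward here; training would adjust the tensor, and using a softmax over next states instead of the hard argmax would make the whole thing differentiable. The sizes and the input sequence are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n_states, n_symbols = 3, 2

# Logits for "from this state, on this symbol, go to that state" -
# these weights are what a connected, trainable SPN would stand in for.
W = rng.normal(size=(n_states, n_symbols, n_states))

def transfer(state, symbol):
    """Hard transition: pick the highest-scoring next state."""
    return int(np.argmax(W[state, symbol]))

state = 0
for sym in [0, 1, 1, 0]:
    state = transfer(state, sym)
```

The same pattern lifts to the Turing Machine / Tur-Mite case by adding read and write outputs alongside the next-state choice.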
Making it easier for all these things to function means adding some I/O not necessarily associated with their usual model: for example the decision to pop or push from a stack (or a queue) may be easier to make if the top three values of the stack are visible, and the control of a Tur-Mite would be much simpler to learn when the surrounding cells are visible.
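The stack example made concrete (class and signal names are my own invention): the node’s observation always exposes the top three values, zero-padded, so the controller can see what it would be popping before it decides.

```python
class StackNode:
    """A stack 'appliance' whose observation exposes the top three values."""

    def __init__(self):
        self.items = []

    def observe(self):
        """Top three values, newest first, padded with zeros - extra I/O
        beyond the usual stack model, to make the push/pop decision easier."""
        top3 = self.items[-3:][::-1]
        return top3 + [0.0] * (3 - len(top3))

    def act(self, push_signal, value=0.0):
        """Push when the control signal is high; otherwise pop (if possible)."""
        if push_signal >= 0.5:
            self.items.append(value)
            return None
        return self.items.pop() if self.items else None

s = StackNode()
s.act(1.0, 7.0)   # push 7
s.act(1.0, 8.0)   # push 8
```

After the two pushes, `s.observe()` reports [8.0, 7.0, 0.0] – the controller sees the stack’s near surface without disturbing it.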
The issue in every case with these trickier constructs – most of which are for memory access but some of which are for interaction with the world – is how to train them. In some cases stochastic selection among discrete options, with error feedback affecting the probabilities, is clearly possible. In others, things are not so clear and genetic algorithms may be required.
Finally, I have to optimize the system to do what it does efficiently. This is not usually considered in Recurrent Artificial Neural Networks beyond attempting to find an optimal topology; the system simply has a topology, and its topology determines how much compute power is consumed in each recurrence, and that’s that. But if anything reasonably intelligent is going to run on “normal” computer hardware, I need to avoid wasting any cycles because there aren’t nearly enough cycles to do it that way.
I want to do many different things, even though doing all of them in all recurrences would be prohibitive. So, the Recurrent Signal Propagation Network will need to contain nodes spanning medium-sized sets of input and output, where one (expensive/slow) thing is done with the inputs to produce output in some circumstances, and a different (cheap/fast) thing is done with the inputs to produce output in usual circumstances. Maybe the cheap/fast approximation should identify cases where the expensive/slow thing happens. One special case would be a subnetwork – a full-on RSPN or RANN in its own right – that only gets an iteration when the “outer” RSPN calls for it – or maybe at some fixed, slow ratio relative to it.
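A sketch of such a node (the function names, the gate threshold, and the 1-in-10 schedule are all made up): a gate signal selects between a cheap fast path used in usual circumstances and an expensive path – standing in for a full subnetwork iteration – invoked only when called for.

```python
import numpy as np

rng = np.random.default_rng(4)

def cheap(x):
    """Fast rough summary, used in the usual circumstances."""
    return x.mean()

def expensive(x):
    """Stand-in for a costly full subnetwork iteration."""
    return np.tanh(x).mean()

calls = {"cheap": 0, "expensive": 0}

def node(x, gate):
    """Route the inputs down the expensive path only when the gate asks for it."""
    if gate >= 0.9:
        calls["expensive"] += 1
        return expensive(x)
    calls["cheap"] += 1
    return cheap(x)

xs = rng.normal(size=(100, 8))
# Here the gate fires on a fixed slow schedule - one iteration in ten;
# in the real thing it would itself be a propagated (and trainable) signal.
outs = [node(x, gate=float(i % 10 == 0)) for i, x in enumerate(xs)]
```

Over 100 iterations the expensive path runs only 10 times – the kind of cycle accounting that has to hold everywhere if this is going to run on normal hardware.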
And the point of writing all of this is to have it all in my mind while I go and code expanded functionality into my RSPN implementation. Which happens now.
*1: It should be mentioned that in each case – whether the memory controlled and accessed is a stack or a queue or the tape of a Turing Machine or the space of a Tur-Mite or the field of a Cellular Automaton – the theoretical model is of a device with infinite memory, which, alas, is not the case in the real world. This makes any particular implementation of any of these things, in practice, a special case of the Finite State Automaton.