Architecture of AI, part 2

This is a continuation of my last post.

I spent yesterday reading about a whole bunch of different machine learning techniques, and I think I’ve figured out what I want to develop.

Background: The best-known signal propagation networks (SPNs) are artificial (and biological) neural networks, but there are other types. In an SPN, values appear at inputs, further values are calculated from combinations of them at hidden nodes, and eventually values are calculated at a set of nodes designated as output nodes. In ANNs these values are usually real numbers bounded in a finite range, but that’s not the only kind of value that an SPN can handle.

There is a strategy for growing signal propagation networks called Cascade Correlation Learning (CCL). These days it is mostly considered an also-ran technology: it showed great promise and worked wonderfully well for “toy” problems, but didn’t scale to larger and more subtle problems the way Artificial Neural Networks (eventually) did. And for evolving the topology of artificial neural networks, when anyone bothers to even attempt it, Genetic Algorithms are today’s go-to technology. So CCL is almost forgotten.

Essentially, CCL means: add a candidate hidden node, train it to have the maximum possible correlation with all remaining error (under the assumption that the behavior of all other nodes remains constant), install it in the network, and repeat until the output accuracy rises to whatever level you’ve deemed acceptable. Once a hidden node has been installed you never train its input weights again, so the assumption that the behavior of all other nodes remains constant is in fact not an assumption at all; it’s a rule. From then on, the only nodes whose input weights ever get trained are the output nodes. One really nice thing about this is that it completely avoids the problems of exploding and vanishing gradients, because existing weights in deeper layers are never changed.
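Here’s a rough sketch of that candidate-training step in plain Python. It assumes a tanh candidate node and uses the unnormalized score from the classic formulation (sum over outputs of the absolute covariance between the candidate’s value and the residual error). The function names, the learning rate, and the step count are my own placeholders, not anything canonical.

```python
import math

def correlation_score(candidate_out, residual_err):
    """Sum over outputs of |covariance(candidate value, residual error)|.

    candidate_out[p]   : candidate node's value on training pattern p
    residual_err[p][o] : remaining error at output o on pattern p
    """
    n = len(candidate_out)
    v_mean = sum(candidate_out) / n
    score = 0.0
    for o in range(len(residual_err[0])):
        e_mean = sum(row[o] for row in residual_err) / n
        cov = sum((candidate_out[p] - v_mean) * (residual_err[p][o] - e_mean)
                  for p in range(n))
        score += abs(cov)
    return score

def candidate_values(inputs, weights):
    """tanh candidate: one value per pattern (inputs[p] includes any bias term)."""
    return [math.tanh(sum(w * x for w, x in zip(weights, row))) for row in inputs]

def train_candidate(inputs, residual_err, weights, lr=0.1, steps=200):
    """Gradient ascent on the score; everything else in the net stays frozen."""
    n = len(inputs)
    n_out = len(residual_err[0])
    e_means = [sum(row[o] for row in residual_err) / n for o in range(n_out)]
    w = list(weights)
    for _ in range(steps):
        vals = candidate_values(inputs, w)
        v_mean = sum(vals) / n
        grad = [0.0] * len(w)
        for o in range(n_out):
            cov = sum((vals[p] - v_mean) * (residual_err[p][o] - e_means[o])
                      for p in range(n))
            sign = 1.0 if cov >= 0 else -1.0
            for p in range(n):
                # d(score)/d(net_p) for this output; tanh'(z) = 1 - tanh(z)^2
                d = sign * (residual_err[p][o] - e_means[o]) * (1.0 - vals[p] ** 2)
                for i, x in enumerate(inputs[p]):
                    grad[i] += d * x
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w
```

Only the candidate’s own weights move here, so there is no backward pass through the rest of the network at all.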

Each added node in a Cascade Correlation Learning network, at least in classic Cascade Correlation, can sample output from all previous nodes, so each new node is effectively its own “layer” that can get input from all previous layers. Each new node takes input from all input nodes and previous nodes that are revealed by its training to have nonzero inputs required to achieve maximum correlation, and sends output to all the output nodes having an error with which it has a nonzero correlation, so connections to input and output nodes are ubiquitous throughout all layers. Hidden nodes are removed if and when they no longer have output edges. This happens if all the output nodes have decided they don’t need that hidden node’s output, and other hidden nodes taking its output have gotten deleted.
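To make the “every node is its own layer” wiring concrete, here’s a minimal forward pass, assuming tanh hidden nodes and linear outputs (both are my choices for the sketch). A weight of zero stands in for a connection that training decided it didn’t need; node deletion isn’t shown.

```python
import math

def cascade_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade-style network.

    x                 : list of input values
    hidden_weights[k] : weights of hidden node k over
                        [bias] + inputs + hidden nodes 0..k-1
    output_weights[o] : weights of output o over
                        [bias] + inputs + all hidden nodes
    """
    values = [1.0] + list(x)  # bias first, then the raw inputs
    for w in hidden_weights:
        if len(w) != len(values):
            raise ValueError("hidden node must see bias, inputs, and all earlier nodes")
        # Each new node reads everything computed so far, then joins the pool.
        values.append(math.tanh(sum(wi * vi for wi, vi in zip(w, values))))
    for w in output_weights:
        if len(w) != len(values):
            raise ValueError("output node must see bias, inputs, and all hidden nodes")
    return [sum(wi * vi for wi, vi in zip(w, values)) for w in output_weights]
```

Notice that hidden node k has k more weights than hidden node 0, which is exactly why the unrestricted version gets deep and densely connected as it grows.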

Typically Cascade Correlation Learning trains far faster than backpropagation and finds much smaller networks. One of the reasons it’s faster is that it doesn’t use backpropagation except on the single node being (discovered and) trained. One of the reasons the networks it finds are smaller is that it doesn’t train nodes that compete with each other for jobs and then create a network that requires multiple nodes doing each job. Each new node does something different from all the existing nodes.

There are a couple of additional refinements to Cascade Correlation Learning. One of them is to try to limit the depth of the CCLN. To do this, instead of just adding a single node that can take input from literally all previous nodes, you evaluate the addition of several different nodes, restricted to sampling the outputs of limited ranges of earlier hidden nodes, and prefer the one that gives you your best remaining-error correlation while sampling outputs of the hidden nodes closest to input. This grows CCL networks to a similar accuracy and about equally fast, but uses fewer connections and builds shallower networks which allow greater parallelism.
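One way to express that preference, as a sketch: score every candidate, keep the ones that come within some tolerance of the best score, and take the shallowest of those. The tolerance rule and the dictionary layout are assumptions of mine; the published variants may trade depth against correlation differently.

```python
def pick_candidate(candidates, tolerance=0.05):
    """Pick the shallowest candidate whose score is close to the best.

    candidates : list of dicts with keys "name", "score" (non-negative
                 correlation with remaining error) and "depth" (how far
                 from the inputs the deepest node it samples is)
    tolerance  : hypothetical knob; fraction of the best score we are
                 willing to give up in exchange for a shallower node
    """
    best = max(c["score"] for c in candidates)
    good_enough = [c for c in candidates if c["score"] >= (1.0 - tolerance) * best]
    # Shallowest first; among equally shallow candidates, highest score wins.
    return min(good_enough, key=lambda c: (c["depth"], -c["score"]))
```

With a tolerance of zero this degenerates to picking the top score, with depth used only to break exact ties.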

A second refinement to Cascade Correlation Learning is “Knowledge Based” CCL. This means using nodes that know some more complicated activation function “right out of the box”. This can be as complicated as a whole separately-evolved subnetwork, or a circuit diagram that performs arbitrary binary logic, or whatever else. It turns out that if you make these “weird” nodes available as candidates when you’re presenting possible new nodes, and keep picking whatever gives you outputs (in the case of weird nodes, possibly multiple outputs) with the greatest available correlation to remaining error, KBCCLN will recruit and incorporate these weird nodes, usually when and as appropriate, often making a much faster and much simpler (except for the diversity of nodes) final network.
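A toy illustration of the recruiting step, under heavy simplification: every candidate is just a function from an input pattern to a single value, and the winner is whichever has the largest absolute covariance with the remaining error. The candidate names and the XOR-shaped error are made up for the example, and multi-output “weird” nodes aren’t handled.

```python
import math

def score(values, errors):
    """|covariance| between a candidate's outputs and the remaining error."""
    n = len(values)
    v_mean = sum(values) / n
    e_mean = sum(errors) / n
    return abs(sum((v - v_mean) * (e - e_mean) for v, e in zip(values, errors)))

def best_candidate(candidates, patterns, errors):
    """Score every candidate, ordinary or prebuilt, on the same footing."""
    scored = {name: score([f(x) for x in patterns], errors)
              for name, f in candidates.items()}
    return max(scored, key=scored.get), scored

# Remaining error that happens to be XOR-shaped over two binary inputs.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
errors = [0.0, 1.0, 1.0, 0.0]

candidates = {
    # An ordinary (untrained, illustrative) tanh node.
    "tanh_sum": lambda x: math.tanh(x[0] + x[1]),
    # A prebuilt "knowledge" node: a ready-made XOR gate.
    "xor_gate": lambda x: float(x[0] != x[1]),
}

winner, scores = best_candidate(candidates, patterns, errors)
```

Here the prebuilt gate wins because it already matches the shape of the leftover error, which is the whole point of making such nodes available.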

I’ve already extended neural-network architecture to accommodate arbitrary (not necessarily layered) “weird” nodes that can have multiple outputs or outputs that aren’t the result of the usual activation functions, so this fits into my architecture just fine.

CCL has some problems, one of which is that it doesn’t do deep learning. If none of your available inputs and no combination of your available inputs shows any non-accidental correlation with the measured output error, then CCL will add nodes on the basis of spurious, sample-dependent coincidental correlations, and never get much of anywhere until (unless!) randomly-created higher-level structures start showing some real, non-spurious correlations or interactions that it can detect. KBCCL was an attempt to deal with this, and it helped, but it didn’t help nearly as much as people hoped. While CCL is great for building very efficient networks on “toy” problems where the inputs are related to outputs in some comprehensible if complex way, it has no advantage, and in fact is usually a futile exercise, on the more difficult tasks that deep networks are usually used for, like reading a grid of pixels and saying “cat” or “tennis player” or “guitar”, or sampling a bunch of waveforms and identifying the words “cat” or “tennis player” or “guitar” as voices speak them.

To deal with this, one of the things you can do is alternate between adding nodes and doing training passes (doing ordinary backprop) on the nodes you’ve already got. This sacrifices a good chunk of CCL’s speed advantage but it can get both dynamics working.
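The alternation is simple to write down as a control loop. In this sketch the network itself is opaque: add_node, backprop_pass, and error are hypothetical callables you would supply, and the number of refinement passes per added node is an arbitrary knob.

```python
def grow_and_refine(net, add_node, backprop_pass, error, target_error,
                    max_nodes=20, refine_passes=5):
    """Alternate between growing the network and refining it with backprop.

    add_node(net)      : trains and installs one new hidden node (CCL step)
    backprop_pass(net) : one ordinary backprop pass over the existing weights
    error(net)         : current error on the training set
    Returns the number of nodes added.
    """
    added = 0
    while error(net) > target_error and added < max_nodes:
        add_node(net)
        added += 1
        for _ in range(refine_passes):
            if error(net) <= target_error:
                break
            backprop_pass(net)
    return added
```

Every backprop pass here unfreezes weights that pure CCL would have left alone, which is where the speed advantage goes.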

Back when CCL was seriously considered, we were all using exponential-based activation functions like the logistic function and tanh, and those were the go-to functions when we wanted activation functions we could train using backpropagation. CCL as normally used builds deep networks without doing deep learning. Those networks were stable as long as we didn’t train a node’s inputs after the node had been added. But when we attempted to train deep networks using exponential-based activation functions and simple backpropagation, of course the gradients exploded (or vanished). So CCL was abandoned because it didn’t scale.

Of course you can see where this is going, right? The strategies that make deep neural networks trainable at the bottom layers where feedback is diffuse and confused and its net effect hovers near zero – that is to say non-exponential activation functions, convolution kernels, and stacking autoencoders – can help with the trainability of CCL networks too.
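A back-of-the-envelope illustration of the activation-function part, assuming a plain chain of nodes with unit weights (a deliberately crude model, not a real network): the backpropagated signal gets multiplied by one activation derivative per layer, and the logistic function’s derivative never exceeds 0.25.

```python
import math

def logistic_grad(z):
    """Derivative of the logistic function at z; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of max(0, z): 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chain_factor(grad_fn, pre_activations, weight=1.0):
    """Factor by which an error signal is scaled going back through the chain."""
    factor = 1.0
    for z in pre_activations:
        factor *= weight * grad_fn(z)
    return factor

# Twenty layers deep, logistic nodes at their most favorable operating point.
logistic_20 = chain_factor(logistic_grad, [0.0] * 20)  # 0.25 ** 20, about 9.1e-13
# Twenty layers deep, rectified-linear nodes that are all active.
relu_20 = chain_factor(relu_grad, [1.0] * 20)          # 1.0
```

Even in the logistic function’s best case the signal reaching the bottom layer is about a trillionth of what left the top, which is the “hovers near zero” problem in miniature. Larger weights can push the same product the other way and blow it up instead.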

Finally, CCLs can be made recurrent, and I already know that I need a recurrent network to do this with.

Is it a win?  Possibly.  I don’t know if I can do what I need to with it, and particularly whether I can get decent behavior out of it in recurrent networks. But it’s a good possibility, anyway, and one of the few ways we’ve got to build efficient networks that don’t have a whole heck of a lot more connections than they need.  And at the very least it can be used to create subsystems useful to a larger project.

So in my estimation it’s time to drag CCL networks out of history’s scrapheap and swap out those old sewing machine motors and training wheels for something a bit more industrial.  I am sometimes wrong, of course.  But that’s my estimation, because the efficiency of the final system is very important if I’m going to implement it on “normal” computer hardware.

“Normal” computer hardware isn’t essential to the project, of course, but it would make it a heck of a lot simpler to create.