The architecture of AI.

I’m going to start this post by apologizing to my few regular readers. I’m going to be speaking hardcore geek here, and mention a bunch of incredibly complicated things and obscure problems with them, that the layperson won’t even have heard of. And I’m going to do it without explaining them. Explanations will come in later posts, I hope; this is just a brain-dump of what’s going on in my mad-computer-science AI lab these days.

Here’s an interesting question. What implementation strategies can support real (conscious) artificial intelligence?

Let’s take a short survey of some available implementation technologies.

Artificial neural networks are the obvious one, because they’re the closest model to what produces the consciousness effect in our own heads.

That said, you have to be using a recurrent network, i.e., one where part of its previous state is used as an input to the current state, and it just keeps cycling. Feedforward (classical) networks produce stimulus-response reflexes, not the continuity required for thinking and consciousness. And the techniques we’ve learnt for training standard feedforward networks (i.e., backpropagation) have at best limited application to recurrent networks.
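For readers who want the distinction in concrete form, here is a minimal sketch. The weights, the two-node size and the function names are made up for illustration; the only point is that the recurrent step takes its own previous state as an extra input.

```python
import math

def feedforward_step(x, w_in):
    """Output depends only on the current input: a stimulus-response reflex."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def recurrent_step(x, h_prev, w_in, w_rec):
    """Output depends on the current input AND the previous state."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(row_in, x))
            + sum(w * hi for w, hi in zip(row_rec, h_prev))
        )
        for row_in, row_rec in zip(w_in, w_rec)
    ]

# Hypothetical 2-node network with a 1-dimensional input.
w_in = [[0.5], [-0.3]]
w_rec = [[0.0, 0.8], [-0.8, 0.0]]

h = [0.0, 0.0]
states = []
for t in range(5):
    h = recurrent_step([1.0], h, w_in, w_rec)  # same input on every cycle
    states.append(h)

# For a constant input the feedforward output never changes,
# while the recurrent state keeps moving because it feeds on itself.
ff_outputs = [feedforward_step([1.0], w_in) for _ in range(5)]
```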

First, you can’t use momentum to interlinearize training. Second, the Credit Assignment Paths (backpropagation paths) are infinite and require you to keep previous activation states. Third, the network is too big for all the second-order methods of optimizing feedback, which require global analysis to find n-dimensional vector products. Fourth, in a recurrent network you don’t just need second-order techniques, you need third-order techniques, so all your activation functions would have to have smoothly differentiable second derivatives. In a deep network that pretty much limits you to some hyperbolic version of the softplus function, because most of the classic sigmoid functions (tanh, logistic, etc.) are unstable in deep networks. The best-known stable sigmoid in deep networks is softsign, which makes training very slow. I have discovered another sigmoid (I call it “magic”) that is more stable than softsign, tolerates even very bad initializations and other metaparameters, and converges very reliably even in deep networks, but it is even slower to train. Using the hyperbolic softplus would mean you need at least twice as many nodes, so training that would also be slow. On the other hand, the technique of varying learning rates by node, which is my current experiment, might address the instability that would otherwise require the slow-training sigmoids or the excessively simple softplus activation functions. But that’s another experiment to do.
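To make the activation functions named above concrete, here is a small sketch of the standard ones and their first derivatives. It uses only the textbook definitions of tanh, logistic, softsign and softplus; the “hyperbolic softplus” and the “magic” sigmoid are not defined in this post, so they are deliberately left out. One plausible reason for the stability difference shows up in the numbers: the softsign gradient shrinks polynomially, as 1/(1+|x|)^2, while the tanh and logistic gradients shrink exponentially.

```python
import math

def logistic(x):
    # Numerically stable form: never exponentiates a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softsign(x):
    return x / (1.0 + abs(x))

def softplus(x):
    # log(1 + e^x), rearranged to avoid overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def d_logistic(x):
    s = logistic(x)
    return s * (1.0 - s)

def d_softsign(x):
    return 1.0 / (1.0 + abs(x)) ** 2

def d_softplus(x):
    # The derivative of softplus is the logistic function.
    return logistic(x)

# Compare how quickly each gradient vanishes as the input grows.
for x in (0.0, 2.0, 10.0):
    print(x, d_tanh(x), d_logistic(x), d_softsign(x), d_softplus(x))
```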

The short version of the story is that training recurrent artificial neural networks is hard. Usually the examples of recurrent networks in the literature (and in actual use) have been “toys” of no more than a few dozen nodes, and trained by such crude methods as output forcing. They do little other than learning how to settle into simple periodic patterns of time-varying output. You could use output forcing to teach them to produce *different* time-varying periodic patterns given different (stable) input, and that might turn out to be necessary and worthwhile. But there is no denying that it would be slow going.

So I really hope the idea I outline below, of having the network competing against its own predictions and getting post facto backpropagation on prediction accuracy, works. If it does, it would be so much faster than any of the other available techniques.

I think I have an approach to scaling recurrent neural networks, but I don’t know yet whether I’m right about it. If I’m right it will allow feedforward training techniques (such as stacked autoencoders) to be used in recurrent networks, as well as providing opportunities to do backpropagation on earlier cycles based on later information. Here’s how that works. Feedforward neural networks are good at learning to make predictions. If the system is continually predicting its future rewards based on current input, and I save its state (inputs and all the activation levels) as it runs, then when the timeframe of the predictions comes around, I can do feedback based on those older activation levels. In fact I can do this continuously, on every cycle of the network for every previous cycle I have saved. I can provide feedback about whether old reward predictions came true, using the activation levels that were in force at the time of those predictions. Given this state of anticipated reward, the system can also be trained to produce outputs which have the effect of maximizing its value.
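Here is a toy sketch of the “save the state, give feedback when the prediction comes due” idea. It is heavily simplified and everything in it is an assumption for illustration: a linear predictor stands in for the network, the horizon is a fixed 3 cycles, and the toy world pays a reward of 2*signal + 0.5 exactly that many cycles later. What it does show is the bookkeeping: each update uses the inputs that were in force when the old prediction was made.

```python
from collections import deque
import random

HORIZON = 3           # assumed: each prediction is about the reward 3 cycles ahead
LEARNING_RATE = 0.05  # assumed

def predict(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

def run(steps, seed=0):
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    saved = deque()      # (inputs, prediction) snapshots still awaiting their outcome
    in_flight = deque()  # toy environment: rewards that will arrive HORIZON cycles later
    abs_errors = []
    for _ in range(steps):
        signal = rng.uniform(-1.0, 1.0)
        inputs = [signal, 1.0]  # one signal plus a bias term
        saved.append((inputs, predict(weights, inputs)))
        in_flight.append(2.0 * signal + 0.5)
        if len(saved) > HORIZON:
            # The prediction made HORIZON cycles ago has now come due.
            old_inputs, old_prediction = saved.popleft()
            reward = in_flight.popleft()
            error = reward - old_prediction
            abs_errors.append(abs(error))
            # Delta-rule update using the inputs saved at prediction time.
            for i, x in enumerate(old_inputs):
                weights[i] += LEARNING_RATE * error * x
    return weights, abs_errors

weights, abs_errors = run(2000)
print(weights)
```

In a real network the saved snapshot would hold every activation level, not just the inputs, and the update would be a backpropagation pass over that snapshot rather than a one-line delta rule.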

Backpropagation through time from the later cycle into the previous cycle can be combined with this in a single pass (and I have source code to prove it!), so there’d be a single feedback process going back making one pass through all of the saved earlier cycles, given the feedback from later cycles plus all of the saved earlier output and feedback on the anticipation outputs. It all sounds plausible, doesn’t it? There is one more complication, which is that you have to make sure the feedback you’re giving each node is the *MINIMUM* feedback that would be produced by the network’s past and current states. And I don’t have source code to prove how easy or hard *that* is.

So backpropagation through time would have to be limited carefully, reduced exponentially, or stopped completely when applied to previous activations done prior to earlier training. On the other hand the previous input and current rewards are known, so training the predictive guesses would be relatively simple – except that on the gripping hand training means the system would likely have produced different *actions* since then, which, because its actions have an effect on the world and therefore the rewards, would affect the validity of the prediction. It’s all very intertwingled and woolly, but it’s at least heuristically approachable simply by applying the minimum training indicated by past and future state. Well, that’s an experiment I’ll be doing.

Right now I’m working on a different experiment – probably another blog post’s worth eventually. The current experiment involves having different nodes of the neural network have different learning rates. I mean drastically different, spread out in an exponential distribution from normal to a thousandth of normal. Contrary to what you might expect, this produces highly adaptive, self-stabilizing training, even in deep and/or recurrent neural networks. It also reduces catastrophic forgetting and has equivalent action to optimal continuous adjustment of the learning rate (annealing schedule).
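As a rough sketch of what “spread out from normal to a thousandth of normal” could look like in code: the geometric spacing, the base rate of 0.1 and the function names below are my assumptions for illustration, and the actual distribution used in the experiment may differ.

```python
def per_node_learning_rates(n_nodes, base_rate=0.1, slowest_fraction=1e-3):
    """Spread rates geometrically from base_rate down to base_rate * slowest_fraction."""
    if n_nodes == 1:
        return [base_rate]
    return [
        base_rate * slowest_fraction ** (i / (n_nodes - 1))
        for i in range(n_nodes)
    ]

def apply_update(weights, gradients, node_rates):
    """weights[j][i] is the weight from input i into node j.

    Every incoming weight of node j is moved using node j's own rate.
    """
    return [
        [w - rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row, rate in zip(weights, gradients, node_rates)
    ]

rates = per_node_learning_rates(4)
# Same gradient everywhere, very different step sizes per node.
new_weights = apply_update([[1.0]] * 4, [[1.0]] * 4, rates)
print(rates)
print(new_weights)
```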

If the training rate at any level is so high that it drives some nodes to instability, their output becomes unreliable and the downstream nodes learn to rely on more stable (lower learning rate) nodes, where the same amount of feedback produces useful results but not instability. And it’s even better than that because when downstream nodes reduce their dependence on the higher-rate nodes, the higher-rate nodes are therefore subject to less feedback, and become stable again. A balance is reached where learning, balanced between high and low-rate nodes, proceeds rapidly but relatively stably. This results in self-optimizing the gross training rate at each level of the network, by “grounding” excessive feedback in low-rate nodes which will be constructively affected, but won’t be unstable under it. It also handles changes in learning rate usually done by “simulated annealing”, because the balance shifts in favor of the lower-rate nodes as the system gets close to an optimum. Finally, it helps a lot with the “catastrophic forgetting” problem, allowing the system to rapidly learn new patterns (by moving its high-learning-rate nodes) while not forgetting much of its already-acquired knowledge. The low-learning-rate nodes are very little affected by the new training, which forces the new task to find a use for them close to their current output, and that leads to generalizing.

I’m immensely pleased with it. That’s why it’s the current experiment by itself, and I haven’t attempted combining it with my bizarre alterations on backpropagation through time in recurrent networks until I’m sure of its properties.

But artificial neural networks trained by backpropagation, or backpropagation through time, aren’t the only answer. They may not even be the best answer, although with my new technique of differential learning rates, and if I’m right about the above speculation on training methods, they may be the easiest.

Beyond some level of complexity where our current methodology doesn’t work, more than a few people train neural networks by using genetic algorithms. Given most methods of encoding neural networks in an artificial genome, genetic algorithms are approximately equal to a simple stochastic hill-climbing process with multiple starting points, which is to say, they’re quite slow and even more data-hungry than the above backprop through time technique, which is already so slow and data-hungry that it’s a bit ridiculous.
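To show what that comparison process looks like, here is a minimal sketch of stochastic hill climbing with multiple starting points. Note that this is the thing genetic algorithms are being compared to, not a genetic algorithm itself (there is no crossover or population). The “genome” is just a list of floats, and the fitness function and target are hypothetical stand-ins for “how well does the decoded network perform”.

```python
import random

def hill_climb(fitness, start, step, iterations, rng):
    """Mutate the current best; keep the mutant only if it scores higher."""
    best, best_fit = list(start), fitness(start)
    for _ in range(iterations):
        candidate = [g + rng.gauss(0.0, step) for g in best]
        candidate_fit = fitness(candidate)
        if candidate_fit > best_fit:
            best, best_fit = candidate, candidate_fit
    return best, best_fit

def multi_start(fitness, n_starts, n_genes, step=0.1, iterations=500, seed=0):
    """Run independent climbs from random starting genomes and keep the best."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_starts):
        start = [rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
        results.append(hill_climb(fitness, start, step, iterations, rng))
    return max(results, key=lambda result: result[1])

# Toy fitness: negative squared distance to a hypothetical ideal genome.
TARGET = [0.3, -0.7, 0.5]

def toy_fitness(genome):
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

best_genome, best_fitness = multi_start(toy_fitness, n_starts=5, n_genes=3)
print(best_genome, best_fitness)
```

Every fitness evaluation here is one full test of a candidate, which is where the slowness and data hunger come from once the candidate is a whole network.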

However, I admire genetic algorithm training for its ability to discover functioning network topologies which are at least moderately efficient. The topologies they discover are usually only about two or three times as complex as the problem actually requires, and often that’s more efficient than the ones you’d come up with setting meta-parameters by hand. The distinction is that with a GA, your topology *IS* related to the complexity of your problem, even if you start out not knowing the complexity of the problem. I think that’s probably too important to ignore. So I’ll have to get to them (or find other ways to auto-optimize network topologies) eventually.

However, whether trained by backpropagation through time or by genetic algorithms or both by turns, neural networks themselves aren’t the only answer; only the most intuitive.

Pretty much any machine learning algorithm applicable in recurrent models could be used, if you can figure out how to train it. And if you can’t figure out a reasonable way to train it there’s always genetic algorithms, which aren’t reasonable at all but eventually do produce results. They are so slow for any nontrivial system that I’d almost despair if they turned out to be the only training method available, but they do eventually work.

There’s a very sophisticated prediction/planning/action model that I haven’t gotten around to implementing yet, but which actually has some promise to be better (more tractable, computationally cheaper) in recurrent application than neural networks. Bagging Voronoi classifiers based on cluster analysis, in a metric space defined by time taken for temporal transitions. Can you say that three times fast? Anyway, they can be transformed into graphs with edges labeled by probabilities and transition inputs/outputs, which are navigable with fuzzy pathfinding algorithms and fuzzy search algorithms. I’ll do a blog post about this sometime.

This type of machine is actually better than neural networks for discovering temporal patterns and making and executing plans, because you can do fuzzy A-star directed searches for your desired state nodes, and the edges you need to traverse to get there from your current state node are labeled with the outputs you need to produce to effect that traversal. At the same time it formulates a plan (discovers a fuzzy and/or probabilistic path) to reach the desired state, it learns what to do (outputs to produce to maximize the probability of traversing an edge) and what to watch for (inputs expected associated with each edge, which predict the probability or degree to which that edge is activated) along the way.
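A small sketch of the planning step, under loud assumptions: the state graph below is invented, edge probabilities are treated as independent, and I have swapped plain Dijkstra over -log(probability) in for the fuzzy A-star search (Dijkstra is A-star with a zero heuristic). It finds the most probable path to a desired state and returns the outputs labeling the edges along the way, which is the plan.

```python
import heapq
import math

# Hypothetical state graph: state -> list of (next_state, probability, output_to_emit)
GRAPH = {
    "hungry": [("at_fridge", 0.9, "walk"), ("fed", 0.1, "wait")],
    "at_fridge": [("fed", 0.8, "eat"), ("hungry", 0.2, "give_up")],
    "fed": [],
}

def most_probable_plan(graph, start, goal):
    """Return (path_probability, outputs_to_emit), or None if goal is unreachable."""
    frontier = [(0.0, start, [])]  # (cost so far, state, outputs so far)
    settled = set()
    while frontier:
        cost, state, outputs = heapq.heappop(frontier)
        if state == goal:
            return math.exp(-cost), outputs
        if state in settled:
            continue
        settled.add(state)
        for next_state, prob, output in graph.get(state, []):
            if prob > 0.0 and next_state not in settled:
                # Adding -log(p) along a path multiplies the probabilities.
                heapq.heappush(
                    frontier,
                    (cost - math.log(prob), next_state, outputs + [output]),
                )
    return None

print(most_probable_plan(GRAPH, "hungry", "fed"))
```

Here the two-step route through the fridge wins (0.9 * 0.8 = 0.72) over the direct but unlikely edge (0.1).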

There are some kinks trying to use the temporal Voronoi graph for classification problems and learning patterns in raw data, because classical Voronoi regions have to be linearly separable. That said, Voronoi spaces are cheap in memory, though complicated in higher-dimensional geometry, to implement. The regions are relatively independent and local, and the training rules are pretty simple, so you can segment the space into more regions easily, and handle non-convex regions simply as sets of simpler adjoining regions that happen to wind up with identical edge connections.
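The non-convex trick can be sketched in a few lines. This is only an illustration under assumptions: it uses ordinary Euclidean distance rather than the temporal metric, the sites are invented, and a shared label stands in for “identical edge connections”. Each site owns a convex Voronoi cell, and three cells that share the label "A" together form an L-shaped, non-convex region.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(point, sites):
    """Return the label of the nearest site, i.e. the Voronoi cell the point falls in."""
    coords, label = min(sites, key=lambda site: sq_dist(point, site[0]))
    return label

# Three "A" cells around one "B" cell: region A is an L shape.
SITES = [
    ((0.0, 0.0), "A"),
    ((0.0, 2.0), "A"),
    ((2.0, 0.0), "A"),
    ((2.0, 2.0), "B"),
]

print(classify((0.1, 1.9), SITES))  # top of the L
print(classify((1.9, 0.1), SITES))  # foot of the L
print(classify((1.9, 1.9), SITES))  # the notch cut out of the L
```

Splitting a region further is just a matter of adding another site, which is why the regions stay local and cheap to train.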

Truth markets are another viable approach. In a truth market, randomly-generated agents buy and sell information from each other on a competitive market and make or lose money by wagering against each other on predictions based on that information. Whenever one goes broke, it gets erased and another (randomly generated) takes its place. Most of the randomly generated agents die quickly and without much effect on the market, but if they happen to have a good strategy they rapidly acquire wealth and their wagers begin to dominate the market. Thus the market tends to produce the “best” predictive approximations because the “best” agents control the wealth that defines the market, in delphic proportion to their fitness. So even when most of the agents are terrifyingly stupid, the output of the truth market is the best estimate available of what will actually happen, in the same way that even when most of the bettors are terrifyingly stupid, the odds on a horse to win at the race track are the best available estimate of its actual winning odds. This happens because the agents whose estimates are most accurate THEREFORE have more money to bet and more effect on the market. This applies emergent market dynamics to predictions and outputs, and auto-simplifies itself by simply having any agent that learns a simpler pattern (a pattern cutting out one or more information-brokering middlemen, in market terms) be more efficient. This eventually starves out its competition along with the middleman agents that the competition depended on. Truth markets are a little bit awesome in that they tend to produce results (in the form of the rulesets followed by the wealthiest few agents) which can be understood by human beings. This is in stark contrast to most machine-learning techniques.
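Here is a deliberately stripped-down sketch of the wagering half of a truth market. It leaves out the information brokering entirely, and every detail is an assumption made for illustration: each agent holds one fixed, randomly generated belief about a yes/no event, stakes its whole bankroll in proportion to that belief, and is paid at the market’s own odds. Agents that fall below a bankruptcy threshold are erased and replaced by fresh random ones with a new stake.

```python
import random

BROKE = 0.01  # assumed bankruptcy threshold

def market_estimate(agents):
    """Wealth-weighted average of the agents' probability estimates."""
    total = sum(a["wealth"] for a in agents)
    return sum(a["wealth"] * a["belief"] for a in agents) / total

def settle(agents, outcome):
    """Pay each agent at market odds; total wealth is conserved by this step."""
    p_market = market_estimate(agents)
    for a in agents:
        if outcome:
            a["wealth"] *= a["belief"] / p_market
        else:
            a["wealth"] *= (1.0 - a["belief"]) / (1.0 - p_market)

def simulate(rounds=2000, n_agents=50, true_p=0.7, seed=0):
    rng = random.Random(seed)

    def new_agent():
        # Beliefs stay strictly inside (0, 1) so the odds are always defined.
        return {"belief": rng.uniform(0.05, 0.95), "wealth": 1.0}

    agents = [new_agent() for _ in range(n_agents)]
    for _ in range(rounds):
        settle(agents, rng.random() < true_p)
        # Erase bankrupt agents and replace them with fresh random ones.
        agents = [a if a["wealth"] >= BROKE else new_agent() for a in agents]
    return market_estimate(agents)

print(simulate())
```

Even though most of the random agents hold poor beliefs, the wealth drifts toward the agents whose beliefs are closest to the true frequency, so the wealth-weighted market estimate should end up near it.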

Realizing that pretty much any learning algorithm at all that can be applied recurrently is a possible implementation technique really does open up the field.

To some extent it’s discouraging – all these experiments to do, and write up, and that’s hard work! But in another way it’s very exciting; all these discoveries to make!

Will I ever manage to make something as conscious as, say, a mouse? I dunno. But one way or another I’m learning a lot along the way.