Recently I’ve been happily working with and on neural networks – there is enough new knowledge, and there are enough new techniques, especially for working with networks deeper than just a couple of layers, to make it very exciting just now!
Thinking back, the last time I felt this much excitement about neural networks was when I was still completely new to them and had just written my first truly functional neural network program and watched it sort out its first fairly complicated problem.
And thinking about that experience led me to consider the long period of frustration that came before it. The period when I was looking at the Greek-alphabet soup of notation that most workers use to describe backpropagation and trying to work out what the hell they were referring to with each symbol. And because the Greek alphabet isn’t one I use all the time, sometimes even getting confused about which symbols were supposed to be the same and which different. Which symbol refers to an individual value, which to a vector of values, which to a summation over a vector, and which to a matrix? Hint: some of those assholes use different typefaces and sizes of the SAME variable name for ALL of those things! IN THE SAME EQUATION! The result is that even though the actual math is not too difficult, I had to get the math without any help from the equations. The math served me as the key, the equations as the ciphertext, and using the math to decrypt the equations, all I figured out was what the hell they were smoking when they made up their notation. And that information was completely useless, because they were smoking something else when they made up the notation describing the next system I needed to figure out.
You don’t have to be stupid to not get math notation, folks. I happen to have an IQ closer to 200 than it is to 100, a couple of patents, etc. I do hyper-dimensional coordinate transformations and geometry for fun, and apply differential calculus to solve genuine problems about twice a month. But the dense notation that mathematicians use often escapes my brain. Especially when, as with neural networks, they’re talking about long sequences of operations involving differential calculus, on sets of long sequences of matrix values.
Even though I do the math in deadly earnest to solve real problems, my thinking about it is in unambiguous terms of the sort I can express in computer code, with symbols that actually mean something to me instead of being single letters. Without using symbols that aren’t on my keyboard, without using the same symbol to mean different things in different sizes and typefaces, without having to guess whether a juxtaposition means multiplication or subscripting, without having to guess whether a subscript means an array index or per-value or per-instance or per-set or per-iteration or is-a-member-of, without having to guess whether they’re using multiplication to mean matrix or vector or scalar multiplication, and without reusing the same operator symbols to mean different things in every publication. I actually live in the land over which mathematicians have jurisdiction. And I wish they had a little more damn consideration and respect for their citizens out here, and would write the laws of the land in a language that didn’t leave so many ways to misinterpret them.
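To show what I mean, here’s a quick sketch in Python with NumPy – my own illustration, not anybody’s canonical formulation – of one layer of a feed-forward network written the way I actually think about it: every quantity wearing a name that says what it is, with its shape declared right where it’s used.

```python
import numpy as np

def layer_forward(inputs, weights, biases):
    """One layer of a feed-forward network, every symbol spelled out.

    inputs:  activations from the previous layer, shape (num_inputs,)
    weights: connection-weight matrix, shape (num_outputs, num_inputs)
    biases:  per-neuron bias vector, shape (num_outputs,)
    """
    # Matrix-vector product plus element-wise addition: each output
    # neuron's weighted sum over all of its inputs.
    weighted_sums = weights @ inputs + biases
    # Logistic sigmoid squashing function, applied element-wise.
    activations = 1.0 / (1.0 + np.exp(-weighted_sums))
    return weighted_sums, activations
```

No guessing required: `@` always means a matrix product, `*` would always mean element-wise, and every shape is written down at the point of use instead of six pages away.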
Lucky for me, I can eventually understand the relationships without much help from the gods-damned obscure equations. Usually. And I guess that puts me ahead of a lot of people who have trouble with the notation. Or maybe if I couldn’t get the math any other way, I’d have done a better job of learning the mathematicians-only language they use to make their descriptions of the math absolutely useless to everyone else. Then I’d be so over this problem, right, and think it’s something that only bothers stupid people? The way a lot of the guys who write those equations evidently think?
Hey guys? Buy a clue from programming languages. USE A DESCRIPTIVE WORD OR TWO to name your variables! Then FORMALLY DEFINE what types your operators are working on, and not just in an offhanded sentence in the middle of text six pages earlier that refers to other definitions also made in the middle of text nine pages later, nor just by inference from the context in which the idea that led to the variable came up! Vectors, matrices, or scalars? It’s a simple question with a simple answer, and you’re allowed to say what the answer is! Then use a DIFFERENT DAMN NOTATION when you mean a different operation, a consistent bracketing convention when you mean a subscript, and DIFFERENT KINDS of bracketing notations when you mean different things by your subscripts! Grrf. Sorry, I’ll try to be calmer. As I said, I was remembering that period of frustration when I was trying to understand the process from the equations, instead of being able to decipher the equations because I had finally figured out the process.
The operations in neural networks are sequential operations on multiple vectors of values using partial derivatives, so they are a “perfect storm” as far as notation is concerned: I was looking at something encrypted in three different ciphers, and eventually I had to work it ALL out for myself before I could even begin to see how the Greek-alphabet soup in front of me related to the clear relationships that were, you know, “obvious” in retrospect. Most equations involved in deriving real information from data using statistics and big-data operations suffer from the same problems.
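To make the storm concrete: the core backpropagation recurrence is usually compressed into something like δˡ = ((Wˡ⁺¹)ᵀ δˡ⁺¹) ⊙ σ′(zˡ), leaving the reader to decode which symbols are vectors, which are matrices, which product is which, and what the primes and superscripts mean. Here is the same relationship, continuing the sketch above (again my own naming, and assuming the logistic sigmoid), where each of those things is visibly different:

```python
def layer_backward(error_at_outputs, weights, previous_activations):
    """Carry the error signal one layer back, every symbol spelled out.

    error_at_outputs:     d(cost)/d(weighted_sum) for this layer's neurons,
                          shape (num_outputs,)
    weights:              this layer's weight matrix,
                          shape (num_outputs, num_inputs)
    previous_activations: sigmoid outputs of the layer below,
                          shape (num_inputs,)
    """
    # Transposed matrix-vector product: route each output neuron's error
    # back along the connections that fed it.
    error_shared_back = weights.T @ error_at_outputs
    # Slope of the logistic sigmoid, written in terms of its own output:
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
    sigmoid_slope = previous_activations * (1.0 - previous_activations)
    # Element-wise product, NOT a matrix product: the error attributable
    # to each of the previous layer's weighted sums.
    return error_shared_back * sigmoid_slope
```

One name per kind of thing, one operator per kind of operation, and the types stated where they are used.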
I *STILL* don’t have any intuition, even now that I know that different typefaces and sizes have to mean something, about exactly what the different things they mean are likely to turn out to be. I have to fully understand the mathematical relationships and processes that the equations DESCRIBE in order to understand, by a sort of reverse analysis, how the heck the equations are even relevant to those relationships and processes. Which is sort of ass-backwards from the way it’s supposed to work. I wind up using the relationships to explain the equations by inferring the notation, when the way it’s supposed to work is that people can use the equations to explain and understand the relationships because the notation is obvious.
This post is a digression. I intended to explain neural networks and gradient descent in a straightforward, unambiguous way, and instead I have rambled on for a thousand words or so about how mathematical notation, for those not steeped in it, serves to obscure rather than reveal, and about why such a straightforward explanation of neural networks is necessary.
But you know what? I think maybe this rant about notation is something that needs to be said. So I’ll leave it up here and save the unambiguous explanation of neural networks and gradient descent for next time.