Monday, Dec 17, 2012 - 00:02 SGT
Posted By: Gilbert

A Tangled Web

After the weakly amusing downers of the previous days, it's time for some pick-me-ups:



[Images: assorted pick-me-ups; source: redcafe.net, among others]


Neuraltic

I suppose it's time to try this neural network shtick in earnest, motivated by the guy who ended up winning the Mitosis Detection contest having won quite a few similar ones as well. I mean, if a particular technique seems particularly suited to a particular class of problem, it only makes sense to adopt it (perhaps unless one actually invented the other technique); otherwise, it would be a little like sticking with a screwdriver when confronted with a nail, just because the screwdriver was the first tool one picked out from a toolbox, or because some smug rival you don't like used the hammer before you.

It might be noted that SVMs were probably the most popular classification scheme in the conference contests because easy-to-use libraries are common (have I mentioned that if a researcher wants to popularize a new method, assuming that it is actually good, by far the best bet is probably to release it as a standalone multiplatform program with many examples?). Still, simply being "in" is no guarantee of performance, as sparse representation, for example, seems unsuited for this class of problem (but don't quote me... well, as if there's any chance of that).

Basically, if facing a classification problem, this is my opinion on how it should be handled:

  • Generally - apply favourite techniques, quite often nearest-neighbour or regression (many researchers will stick with what they know best, at least at first)
  • Then - try radial basis function SVM (search for parameters properly!) and random trees. Achieved maximum accuracy/F-measure shouldn't be too far off what is possible with a lot more work
  • Image patterns - try convolutional neural networks (today's topic, considerations covered later)
  • If significant prize money is involved - it's a highly-tuned ensemble composed of a Frankenstein mishmash of arcane techniques, baby! Nobody will be able to explain how it works, or maybe even why it does at all, but who cares if it's able to squeeze out the last 0.001% that wins the million dollars?
Of course, neural networks have gained a not wholly-undeserved reputation for being slightly "black-magic" by themselves - it's often not easy to explain exactly how they produce results, unless you simply look at them as heuristic error function minimizers (some see them as a sequence of matrix multiplications). They are known to be universal function approximators, which as I see it basically means that a properly set-up network should be able to classify (near-?) perfectly assuming sufficient noise-free data - a guarantee that, like so many theoretical ones, is unfortunately largely useless in practice.

A comparison with SVMs has been attempted here, where it was noted that neural networks were "(relatively) easier to implement from scratch". Well, the time has come to put my money where my mouth is, so here's a pure Javascript implementation of a general fully-connected network:


A short FAQ: Click "Build" to, well, build the network. The "2,3,1" in the adjacent textbox describes a network with 2 input nodes, 3 nodes in the second (hidden) layer, and 1 node in the last (output) layer. Try "2,3,2,1", for example.

Input nodes are in green, output nodes in red, and intermediate hidden nodes in grey. Each edge connecting nodes has an associated weight, next to the dot on the edge; while I have tried to spread out the numbers as much as possible, overlaps are inevitable with more nodes/edges due to the limited space.
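
For the programmatically-inclined, here's a minimal sketch of one way such a layer specification could be represented in plain Javascript - the buildNetwork name and the {sizes, weights} layout are inventions for illustration, not necessarily the demo's actual internals:

    // Sketch (not the demo's actual code): a "2,3,1" specification becomes
    // nested arrays of edge weights, where weights[k][j][i] is the edge from
    // node i in layer k to node j in layer k+1, initialised to small random values.
    function buildNetwork(layerSizes) {               // e.g. [2, 3, 1]
        var weights = [];
        for (var l = 1; l < layerSizes.length; l++) {
            var layer = [];
            for (var j = 0; j < layerSizes[l]; j++) {
                var incoming = [];
                for (var i = 0; i < layerSizes[l - 1]; i++) {
                    incoming.push(Math.random() - 0.5);
                }
                layer.push(incoming);
            }
            weights.push(layer);
        }
        return { sizes: layerSizes, weights: weights };
    }

    var net = buildNetwork([2, 3, 1]);                // the default "2,3,1" network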

A freshly-built network doesn't usually do anything useful, so we have to train it by giving it data. Note the textbox "0,0,0-0,1,1-1,0,1-1,1,0". Each set of inputs/desired outputs is separated by a dash, so what this actually says is:

Input Node 1   Input Node 2   Desired Output
     0              0               0
     0              1               1
     1              0               1
     1              1               0

which is the XOR function (known to be unlearnable by single-layer perceptrons, i.e. without the hidden layer).
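
For the record, unpacking that dash-and-comma format is a one-liner affair; a quick illustrative sketch (the parseTrainingData helper is made up here, and the demo's internal representation may well differ):

    // Sketch: split "0,0,0-0,1,1-1,0,1-1,1,0" into {inputs, targets} pairs,
    // treating everything after the first numInputs values as desired outputs
    function parseTrainingData(s, numInputs) {
        return s.split("-").map(function (group) {
            var nums = group.split(",").map(Number);
            return {
                inputs: nums.slice(0, numInputs),
                targets: nums.slice(numInputs)
            };
        });
    }

    var xorData = parseTrainingData("0,0,0-0,1,1-1,0,1-1,1,0", 2);
    // xorData[3] is { inputs: [1, 1], targets: [0] }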

[N.B. Being a rudimentary implementation, there is minimal input checking, so don't be surprised if it blows up with input of the wrong format]

Clicking the "Train" button trains the network such that it hopefully produces the desired output when presented with the given input. Note the "10000" and "0.05" values. "10000" means that the network will be trained for 10000 epoches, or "iterations using all the training data" (four sets, or vectors, in this case), while "0.05" is the learning rate, or how much the edge values can be adjusted in each epoch (more later). Note that the edge values are changed by the process, and the error after every 1000 epoches is displayed in the textbox below the network.

Finally, "Test" runs the given test data (default "1,1") through the trained network, which should give the desired output "0".

So how does this magic work? A more complete explanation would come from Wikipedia, or this project (which was perhaps the best code-cum-theory all-in-one I could find), but I'll try to make it very quick. How (not why) a neural network operates once it is trained is actually very simple. Note that training only ever affects the weights of the edges connecting the nodes. Everything else - the structure, the number of nodes, etc. - is untouched.

Using a trained network is simply a matter of passing inputs up, layer by layer. For example, the leftmost hidden node (layer two) has two edge connections to the first layer; say that their weights are 0.45346 and 0.46975 respectively. Then, for inputs of 1 and 1, the value of the hidden node is just F(0.45346×1 + 0.46975×1) - that is, F() applied to the weighted sum of the inputs - where F() is some squashing/activation function that ensures that the output remains within a certain (small) range regardless of the inputs. In this example, the hyperbolic tangent is used, so all weighted sums above 3 get squashed to (very near) 1.

The same goes for all nodes and... that's it.

Yes, that's it.
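
In the same illustrative Javascript as before (and again not the demo's literal code), the whole forward pass over the buildNetwork structure might look like this:

    // Sketch: forward propagation with tanh as the squashing function F().
    // Returns the activations of every layer; the last array is the output.
    function forwardPass(net, inputs) {
        var activations = [inputs];
        for (var l = 0; l < net.weights.length; l++) {
            var prev = activations[l];
            var current = [];
            for (var j = 0; j < net.weights[l].length; j++) {
                var sum = 0;
                for (var i = 0; i < prev.length; i++) {
                    sum += net.weights[l][j][i] * prev[i];  // weight x input
                }
                current.push(Math.tanh(sum));               // squash into (-1, 1)
            }
            activations.push(current);
        }
        return activations;
    }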

This is the forward propagation part. The only really tricky part, which can probably be described as being the heart of the network, is how to learn the edge connection values in the first place, which is backward propagation. The basic idea is actually not that farfetched - simply pass your inputs through the current state of your network, and discover how different the output is from the desired output, and in what direction it is different. The weights are then adjusted slightly in the "correct" direction, again layer-by-layer but backwards this time, hopefully producing a smaller error in the next training epoch. This is repeated until the error is small enough.

There is no actual guarantee that the true global optimum will be reached (especially with a ton of weights to determine), but results don't lie (if achieved). A historical note is that weights were once typically adjusted only once per epoch (batch learning) instead of after each training vector within the epoch (incremental learning), but batch learning has been convincingly argued to be inferior.
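
For completeness, here is a hedged sketch of that incremental weight update for the same structure, assuming tanh activations (whose derivative at activation a is 1 - a*a) and the usual squared-error nudge - it is meant to convey the shape of the computation, not to reproduce the demo's exact code:

    // Sketch: one incremental backpropagation step (weights updated per training vector)
    function backwardPass(net, inputs, targets, learningRate) {
        var activations = forwardPass(net, inputs);
        var deltas = [];

        // Output layer: delta = (desired - actual) * tanh'(actual)
        var outputs = activations[activations.length - 1];
        var outDeltas = [];
        for (var k = 0; k < outputs.length; k++) {
            outDeltas.push((targets[k] - outputs[k]) * (1 - outputs[k] * outputs[k]));
        }
        deltas[net.weights.length - 1] = outDeltas;

        // Hidden layers: pass the error backwards, layer by layer
        for (var l = net.weights.length - 2; l >= 0; l--) {
            var layerDeltas = [];
            var acts = activations[l + 1];
            for (var j = 0; j < acts.length; j++) {
                var err = 0;
                for (var k2 = 0; k2 < net.weights[l + 1].length; k2++) {
                    err += net.weights[l + 1][k2][j] * deltas[l + 1][k2];
                }
                layerDeltas.push(err * (1 - acts[j] * acts[j]));
            }
            deltas[l] = layerDeltas;
        }

        // Nudge every weight slightly in the "correct" direction
        for (var l2 = 0; l2 < net.weights.length; l2++) {
            for (var j2 = 0; j2 < net.weights[l2].length; j2++) {
                for (var i2 = 0; i2 < activations[l2].length; i2++) {
                    net.weights[l2][j2][i2] +=
                        learningRate * deltas[l2][j2] * activations[l2][i2];
                }
            }
        }
    }

With these three sketches chained together, train(net, xorData, 10000, 0.05) followed by forwardPass(net, [1, 1]) should, random initialisation permitting, land somewhere close to the XOR table above.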

Do note that in this implementation, which has no bias nodes, the output when all input nodes are zero must necessarily be zero (as any weight multiplied by zero is still zero). Come to think of it, Herr Ahm might see an analogy between this and his hierarchical message propagation distortion theory...
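
(The standard fix, for the record, is a bias - an extra input to each node that is fixed at 1, so the weighted sum no longer has to vanish for an all-zero input. A toy illustration, with an invented nodeValue helper:)

    // Sketch: a bias term shifts the weighted sum away from zero
    function nodeValue(weights, inputs, bias) {
        var sum = bias;
        for (var i = 0; i < inputs.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return Math.tanh(sum);
    }

    nodeValue([0.45346, 0.46975], [0, 0], 0);    // tanh(0) = 0, always
    nodeValue([0.45346, 0.46975], [0, 0], 0.1);  // tanh(0.1), trainable away from 0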

That's it for basic neural networks. Convolutional neural networks follow the same basics, but share weights between edges, and are no longer fully-connected, i.e. not all nodes in the previous layers connect to each node in the next layer, a way of reducing complexity in a clever and biologically-inspired manner. Not too ungraspable either. There have been proposed modifications, but those are details.
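
To make the weight-sharing idea a little more concrete, a toy one-dimensional sketch (illustrative Javascript once more, nowhere near a real CNN implementation): every output node sees only a small window of the input, and every window reuses the same few kernel weights.

    // Sketch: a 1D "convolutional layer" - one shared kernel slid across the input
    function convolve1D(input, kernel) {
        var output = [];
        for (var pos = 0; pos + kernel.length <= input.length; pos++) {
            var sum = 0;
            for (var k = 0; k < kernel.length; k++) {
                sum += kernel[k] * input[pos + k];  // same weights at every position
            }
            output.push(Math.tanh(sum));
        }
        return output;
    }

    convolve1D([0, 1, 1, 0, 1], [0.2, -0.5, 0.3]);  // 3 shared weights, not 5x3 = 15

Five inputs fully connected to three outputs would need fifteen weights; the shared kernel gets by with three, and a real convolutional network essentially does this in two dimensions, with many kernels per layer.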

However, even after this simplification (for the input, not for the programmer), such networks were still not too popular, as they took an awfully long time to train - it was mentioned that the winning contest entry would have taken a full year to train on a CPU, which is where the GPU comes in. Note that their power at number-crunching has already been acknowledged in our Bitcoin introduction last year.


Can't Run From The GPU

I'll be frank: learning a new programming paradigm can be a pain, especially under pressure, but in this case it might well be unavoidable. The future of computing is, as colourfully illustrated here, more tractors instead of faster tractors, due to heating constraints. Remember the giddy days when manufacturers boasted of faster processor speeds? I started out with a Pentium 166, upgraded to a 350, then lost track. Today, it's dual, quad, octa-cores.

Now, coding to take advantage of a GPU's capabilities is quite different from the CPU-based programming most of us (the common code monkeys, anyway) are accustomed to. The GPU can do a lot of things simultaneously - just that those things come with quite a few restrictions. It's up to the programmer to structure his algorithm logic such that the things done together are actually parallelizable, and unfortunately, as of now, he probably also has to take care of all the muck like shuffling memory between the CPU and GPU and whatnot.

As for the framework, the choice is largely between OpenCL and CUDA, with OpenCL (in theory) supported by many more vendors, and CUDA being Nvidia-specific (but still slightly more popular). Discussion with a lab senior has persuaded me to kick off with OpenCL for its portability, even though there appear to be more and better tutorials for CUDA. And yes, where there's speed, there's money.

While we're still on new (okay, currently less-popular) paradigms, good cases have been made for shifting from imperative to functional programming, it being hard but good. It might be a long wait before this percolates to industry, but Haskell isn't actually that tough, and *gasp* can actually do things! Who would have thought?

Oh, and it's always fun to rag on Java - see this gem, including sage observations like "Advocating Object-Oriented Programming is like advocating Pants-Oriented Clothing".

For those who just want to try GPU-based CNNs out (without worrying about integration into existing software), Theano and cuda-convnet are good bets, and there are a couple other projects out there for those interested in sample code. It should be remembered that a GPU implementation doesn't actually improve a CNN, though - it just speeds up the training a lot, which can be important in practice if not in theory.








