Amateur Hour

Reflections on NIPS (Part 2)

Or what I'm thinking about now
Dec 14, 2016. Filed under Technical
Tags: machine learning, research, ideas, NIPS

One nice thing about going to NIPS was how intellectually stimulating it was. I’ve spent most of the past few months thinking about other things, so being immersed in the machine learning world again was definitely reinvigorating. Before the intellectual high fades, I thought I’d sketch out the most interesting puzzles I’m now thinking about thanks to NIPS.

Learning Structure

Humans can learn from an incredibly small amount of data. For example, if I showed you a single picture of an Okapi, you could probably identify an Okapi if you ever saw one again. A neural network like Inception or AlexNet, on the other hand, requires much more than a single training example to start identifying Okapis.

Josh Tenenbaum gave another stark illustration in his talk at the Deep Reinforcement Learning workshop: DQN, the algorithm DeepMind used in its original Atari paper, can take hundreds to thousands of hours of playing time before it shows noticeable improvement. Humans can make huge amounts of progress in 15 minutes.

There are a lot of different approaches to this kind of few-shot learning problem (learning from a small number of examples), but in my opinion the obvious place to look is the prior knowledge humans have before starting a new task. Some of this knowledge is simple facts about the world (e.g. penguins live in Antarctica), but some of it is more fundamental assumptions about how the world is structured (e.g. that the world is composed of discrete, independent objects/entities).

This type of knowledge doesn’t need to be learned – cognitive science is full of examples of “built-in” knowledge that humans have about the world before any learning takes place. While it might be possible to learn something like “the world is made of discrete actors” purely from raw data, I’d bet that it’s pretty hard and super data-inefficient (and arguably impossible, depending on how much you believe certain cognitive science schools of thought). Instead, we should encode our prior information about the world as an inductive bias through the network architecture.

Convolutional neural networks are a success case here. 2-D convolutions by their nature implicitly encode notions of spatial locality and translation invariance, which makes them an extremely well-suited architecture for vision tasks. Although I suppose it’s possible to learn that “pixels next to each other in space are meaningful” purely from fully-connected layers, I’ve never seen it work in practice (and I’ve never personally been able to get dense networks to work well for toy computer vision tasks).
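
To make the inductive-bias point concrete, here’s a minimal sketch (plain NumPy, with made-up shapes) of what weight sharing buys you: the same 3×3 kernel is applied at every spatial position, so a pattern gets the same response wherever it appears, and the layer needs orders of magnitude fewer parameters than a dense layer over the same input.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a single small kernel over every spatial position ("valid" padding).

    The *same* weights are reused at each location -- that weight sharing is
    the built-in assumption: a feature is a feature no matter where it sits.
    """
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))                    # 9 shared parameters
image = np.zeros((32, 32))
image[5:8, 5:8] = 1.0                                   # a small "object"
shifted = np.roll(image, shift=(10, 10), axis=(0, 1))   # same object, moved

# Translation equivariance: the strongest response is identical, just shifted.
assert np.allclose(conv2d_valid(image, kernel).max(),
                   conv2d_valid(shifted, kernel).max())

# Parameter count of this conv layer vs. a dense layer over the same input.
print("conv parameters: ", kernel.size)                 # 9
print("dense parameters:", (32 * 32) * (30 * 30))       # ~920,000
```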

But how do we extend this beyond convolutions? In particular, I’m very interested in how to encode linguistic knowledge into our network architectures. As a committed Chomskyan, I think there’s an incredible amount of structure to human language, most of which is biologically innate and not learned. The most obvious example I can think of is that language is governed by hierarchical dependencies, not by linear word order. That implies that the RNNs (and LSTMs and GRUs) that are standard in NLP are unsuitable models for language, despite their admittedly impressive success in things like machine translation.

How to actually tackle this problem? Honestly, I have no idea. It’s unclear to me how to encode hierarchical information into neural networks, let alone all the other rich linguistic structure. I’d probably start by doing a refresher course in minimalist syntax – most of my knowledge comes from a syntax textbook I borrowed from my high school math teacher, and I’ve forgotten most of it by this point (and honestly, I bet that most of what I remember is now outdated). Then maybe I’d try training on some toy problems that require hierarchical dependencies (e.g. anaphor resolution and binding), along the lines of the sketch below. We’ll see where I go from there.
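
As a very rough illustration of what such a toy problem might look like, here’s a sketch of a data generator for a binding-style task – the vocabulary, the template, and the single-antecedent setup are all invented for illustration. The reflexive at the end has to agree with the hierarchically governing matrix subject, not with the linearly closer noun inside the relative clause, which is exactly the kind of dependency a purely sequential model is tempted to get wrong.

```python
import random

# Toy binding-style task (invented vocabulary and template): predict the
# reflexive at the end of the sentence. It must agree in number with the
# *matrix* subject, not with the linearly closer noun inside the relative
# clause, so a learner that only tracks linear order gets pulled toward the
# distractor.
SINGULAR = ["lawyer", "doctor", "pilot", "teacher"]
PLURAL = ["lawyers", "doctors", "pilots", "teachers"]

def make_example(rng):
    matrix_plural = rng.random() < 0.5
    subject = rng.choice(PLURAL if matrix_plural else SINGULAR)
    # The embedded distractor noun always has the opposite number.
    distractor = rng.choice(SINGULAR if matrix_plural else PLURAL)
    embedded_verb = "admires" if matrix_plural else "admire"  # agrees with distractor
    sentence = f"the {subject} that the {distractor} {embedded_verb} praised"
    label = "themselves" if matrix_plural else "himself"
    return sentence, label

rng = random.Random(0)
for sentence, label in (make_example(rng) for _ in range(5)):
    print(f"{sentence} -> {label}")
# e.g. "the lawyers that the pilot admires praised -> themselves"
```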

Parameter Efficiency

I’ll admit that I’ve always felt a little uneasy about neural networks. Although I can’t argue with the results, I’ve never seen a convincing explanation of why neural networks work, other than the fact that they do. The theoretician in me finds that incredibly suspicious.

At NIPS, I discovered that most neural networks can be “pruned” – that is, you can throw out weights (sometimes more than three quarters of them!) – without any noticeable difference in performance. (Apparently this was common knowledge to essentially everyone but me, which I guess shows how much implicit, tribal knowledge exists that is hard to see from the outside.) I also learned that neural networks are even more robust to low-precision arithmetic than I’d assumed: I knew that lots of neural nets used 16-bit floating point precision, but at NIPS I saw people using 8-bit and sometimes even 1-bit precision!
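
For concreteness, here’s roughly what the simplest version of pruning (plus a crude stab at low-precision weights) looks like – magnitude pruning on a single random linear layer. This only shows the mechanics; the surprising empirical fact is that trained networks tolerate this with barely any drop in accuracy, which a random layer won’t necessarily demonstrate.

```python
import numpy as np

def magnitude_prune(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of the weights."""
    threshold = np.quantile(np.abs(weights), fraction)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
W = 0.02 * rng.standard_normal((256, 256))   # stand-in for one trained layer
x = rng.standard_normal(256)

pruned = magnitude_prune(W, fraction=0.75)   # throw away 3/4 of the weights
print("fraction of weights kept:", np.mean(pruned != 0))   # ~0.25

# In a real network you'd compare validation accuracy before and after;
# here we can only look at how much this one layer's output moves.
print("relative change in output:",
      np.linalg.norm(W @ x - pruned @ x) / np.linalg.norm(W @ x))

# Crude symmetric 8-bit quantization of the surviving weights.
scale = np.abs(pruned).max() / 127.0
quantized = np.round(pruned / scale).astype(np.int8)      # stored as int8
dequantized = quantized.astype(np.float64) * scale        # used at compute time
print("max quantization error:", np.abs(pruned - dequantized).max())
```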

The ability to prune the majority of your extremely low-precision weights implies to me that, at the very least, we’re being super inefficient during training, and at most, that we’re massively overfitting most of our data. I’ve actually wondered whether the success of deep learning is “merely” because of how expressive neural networks are (composing simple functions is surprisingly expressive, and neural nets have an enormous number of parameters to fit), which would let you fit essentially any function given a large enough network. I think an interesting toy experiment would be to see whether deep networks can fit randomized noise as well as they can fit, say, images. If the training error on noise ends up just as low as the training error on images, then I think that’d be pretty good evidence that something fishy’s going on.

(I suspect that this has been done already, but I think it’s a good task to get my feet wet with deep learning outside of toy models fit on my laptop).
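
Here’s a sketch of how I imagine setting up that experiment – the architecture, dataset sizes, and use of scikit-learn’s MLPClassifier are all arbitrary choices on my part, just enough to compare training accuracy on structured data versus pure noise.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, d, n_classes = 2000, 64, 10

# (a) "Structured" data: inputs actually depend on the labels.
class_means = 3.0 * rng.standard_normal((n_classes, d))
y_real = rng.integers(0, n_classes, size=n)
X_real = class_means[y_real] + rng.standard_normal((n, d))

# (b) Pure noise: inputs and labels are completely unrelated.
X_noise = rng.standard_normal((n, d))
y_noise = rng.integers(0, n_classes, size=n)

for name, X, y in [("structured", X_real, y_real), ("noise", X_noise, y_noise)]:
    net = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=500)
    net.fit(X, y)
    print(name, "training accuracy:", net.score(X, y))

# If a big enough network drives *training* accuracy to ~1.0 on the noise too,
# that's the "expressive enough to fit anything" behaviour I'm worried about.
```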

Adversarial Learning

I learned about a new paper, Universal Adversarial Perturbations. Building off earlier work on adversarial perturbations, the authors discovered a universal adversarial perturbation. In other words, they found a single noise pattern that, when added to any image, causes the network to misclassify the perturbed image, even though the changes are imperceptible to the human eye. Even worse, these perturbations seem to generalize across networks, not just across images. Definitely something scary to look into…
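
To make the claim concrete, here’s a sketch of how one might measure the effect – the “fooling rate” of a single fixed perturbation across a batch of images. The classifier, images, and perturbation below are placeholders (the paper has its own method for finding the perturbation, which I’m not reproducing here); the point is just the bookkeeping: one noise pattern, clipped to be imperceptible, added to every image.

```python
import numpy as np

def fooling_rate(classify, images, perturbation, max_norm=10 / 255):
    """Fraction of images whose predicted label flips when one fixed, small
    perturbation is added to every image in the batch.

    `classify` is any function mapping a batch of images to integer labels;
    the perturbation is clipped to a small L-infinity ball so that the change
    stays (roughly) imperceptible.
    """
    v = np.clip(perturbation, -max_norm, max_norm)
    clean = classify(images)
    perturbed = classify(np.clip(images + v, 0.0, 1.0))
    return float(np.mean(clean != perturbed))

# Placeholder usage: a fake "classifier" and random images, purely to show the
# bookkeeping. The scary empirical result is that a single *real* perturbation
# achieves a high fooling rate on real networks and real images.
rng = np.random.default_rng(0)
images = rng.random((100, 32, 32, 3))
fake_classify = lambda x: (10 * x.sum(axis=(1, 2, 3))).astype(int) % 10
v = rng.uniform(-1, 1, size=(32, 32, 3)) * (10 / 255)
print("fooling rate:", fooling_rate(fake_classify, images, v))
```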