Great work by Zain Shah on learning a joint embedding of GIFs and sentences.

"Max-margin" is a better name for this objective function than "triplet loss", which is how I first heard of it. I think that was from the FaceNet paper from Google.

Key numbers: 5M comparisons to train, 100 cached/precomputed GIFs for comparison & ready to return for scaling load, AWS g2.8xlarge GPU instances to train (as of Nov '16), 1024D latent space (wonder if smaller is possible?), Microsoft Research Video Description Corpus for data: 120k one-sentence descriptions of 2k short YouTube clips.

They say a picture’s worth a thousand words, so GIFs are worth at least an order of magnitude more. But what are we to do when the experience of finding the right GIF is like searching for the right ten thousand words in a library full of books, and your only aid is the Dewey Decimal System?

what finding the right GIF is like

We build a GIF search engine of course!

what it should be like

This seems quite magical — you type in a phrase and get exactly the GIF you were thinking of — but behind the scenes it’s a matter of gluing two machine learning models pre-trained on massive datasets together by training a third, smaller model on a dataset of videos paired with sentence descriptions.

TL;DR — If you’re already familiar with the basics of deep learning, the following few paragraphs give a high-level overview of how the GIF search engine works and how it was trained. If not, don’t fret — below this overview I give a bottom-up explanation of the entire process, with minimal math background required.

By formulating the problem generally we can get away with only having to learn a shallow neural network model that embeds hidden layer representations from two pre-trained models — a convolutional neural network pre-trained to classify objects in images, and a recurrent neural network pre-trained to predict surrounding context in text. We train a shallow neural network to embed the representations from these models into a joint space together based on associations from a corpus of short videos and their sentence descriptions.

Specifically, these models are the VGG-16 16-layer CNN pre-trained on ImageNet, the Skip-Thoughts GRU RNN pre-trained on the BookCorpus, and a pair of linear embedding matrices trained jointly on the videos and sentence descriptions from the Microsoft Video Description Corpus. This framing does constrain the resulting model to only working well with live-action videos that are similar to the videos in the MVDC, but the pre-trained image and sentence models help it generalize to pairings in that domain it has never before seen.

Don’t worry if the above doesn’t make sense — if you’d like to know more read on and I’ll explain how the individual pieces work below.

The GIFs are processed by a CNN (left), and your query is processed by an RNN (right). The result is the GIF with the closest embedding to your query’s embedding.

First, we have our CNN or convolutional neural network, pre-trained to classify the objects found in images. At a high level, a convolutional neural network is a deep neural network with a specific pattern of parameter reuse that enables it to scale to large inputs (read: images). Researchers have trained convolutional neural networks to exhibit near-human performance in classifying objects in images, a landmark achievement in computer vision and artificial intelligence in general.

Now what does this have to do with GIFs? Well, as one may expect, the “skills" learned by the neural network in order to classify objects in an image should generalize to other tasks requiring understanding images. If you taught a robot to tell you what’s in an image for example, and then started asking it to draw the boundaries of such objects (a vision task that is harder, but requires much of the same knowledge), you’d hope it would pick up this task more quickly than if it had started on this new task from scratch.

We can use an understanding of how neural networks function to figure out exactly how to achieve such an effect. Deep neural networks are so called because they contain layers of composed pieces — each layer is simply a matrix multiplication followed by an activation function.

ReLU, tanh, and sigmoid activation functions. Taken from http://www.slideshare.net/oeuia/neural-network-as-a-function

This means, for a given input, we multiply it by a matrix, then pass it through one of those functions, then multiply it by another matrix, then pass it through one of those functions again, until we have the numbers we want. In classification, the numbers we want are a probability distribution over classes/categories, and this is necessarily far fewer numbers than in our original input.
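Here is a minimal sketch of that recipe in plain Python — multiply by a matrix, apply an activation, repeat, and finish with a distribution over classes. All weights and sizes below are made up for illustration:

```python
import math

def matvec(W, x):
    # multiply input vector x by weight matrix W (a list of rows)
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    # turn arbitrary numbers into a probability distribution
    exps = [math.exp(a) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

# A toy 2-layer network: 4-D input -> 3 hidden units -> 2 classes.
W1 = [[0.5, -0.2, 0.1, 0.0],
      [0.3, 0.8, -0.5, 0.2],
      [-0.1, 0.4, 0.9, -0.3]]
W2 = [[1.0, -1.0, 0.5],
      [-0.5, 0.7, 0.2]]

x = [1.0, 2.0, -1.0, 0.5]
hidden = relu(matvec(W1, x))         # matrix multiply, then activation
probs = softmax(matvec(W2, hidden))  # far fewer numbers than the input
```

Note that `probs` has only 2 numbers while the input had 4 — real classifiers compress millions of pixels down to a distribution over a few thousand classes the same way.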

It is well understood that matrix multiplications simply parametrize transformations of a space of information — e.g. for images you can imagine that each matrix multiplication warps the image a bit so that it is easier to understand for subsequent layers, amplifying certain features to cover a wider domain, and shrinking others that are less important. You can also imagine that, based on the shape of the common activation functions (they “saturate" at the limits of their domain from -∞ to ∞, and only have a narrow range around their center when they aren’t strictly one number or another), they are utilized to “destroy" irrelevant information by shifting and stretching their narrow range of effectiveness to the region of interest in the data. Further, by doing this many times rather than only once, the network can combine features from disparate parts of the image that are relevant to one another.

When you put these pieces together what you really have is a simple, but powerful, layered machine that destroys, combines, and warps information from an image until you only have the information relevant to the task at hand.

When the task at hand is classification, then it transforms the image information until only the information critical to making a class decision is available. We can leverage this understanding of the neural network to realize that just prior to the layer that outputs class probabilities we have a layer that does *most* of the dirty work in understanding the image *except* reducing it to class labels.

Now we can reuse learned abilities from our previous task, and generalize far beyond our limited training data for this new task. More concretely, for a given image, we recognize that this penultimate layer’s output may be a more useful representation than the original (the image itself) for a new task if it requires similar skills. For our GIF search engine this means that we’ll be using the output of the VGG-16 CNN from Oxford trained on the task of classifying images in the ImageNet dataset as our representation for GIFs (and thereby the input to the machine learning model we’d like to learn).

What is the equivalent representation for sentences? That brings us to the second piece of our puzzle, the Skip-Thoughts GRU (gated-recurrent-unit) RNN (recurrent neural network) trained on the BookCorpus. Just as convolutional networks share their parameters across the width and height of an image, recurrent ones share theirs across the length of a sequence. Convolutional networks’ parameter sharing relies on an assumption that only local features are relevant at each layer of the hierarchy, and these features are then integrated by moving up the hierarchy, incrementally summarizing and distilling the data below at each step. Recurrent networks, however, accumulate data over time, adding the input they are currently looking at to a history. In this manner, they effectively have “memory”, and can operate on arbitrary sequences of data — pen strokes, text, music, speech, etc. Like convolutional neural networks, they represent the state of the art in many sequence learning tasks like speech recognition, sentiment analysis from text, and even handwriting recognition.
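A toy sketch of that recurrence, with a single hidden unit and made-up weights — the key point being that the same parameters are reused at every step, so one network can consume a sequence of any length:

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    # one recurrent step: fold the current input x into the running state h
    return math.tanh(w_h * h + w_x * x)

def encode(sequence):
    h = 0.0                # empty "memory" to start
    for x in sequence:     # the same weights are applied at every position
        h = rnn_step(h, x)
    return h               # final state summarizes the whole sequence

# The same two parameters handle sequences of different lengths.
short = encode([0.1, 0.9])
long_ = encode([0.1, 0.9, -0.4, 0.7, 0.2])
```

Real GRUs add gating machinery on top of this loop so the network can learn what to keep and what to forget, but the pattern of reuse is the same.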

There are two interesting problems here:

1) it isn’t immediately straightforward how you represent words in a sentence like we do pixels in an image

2) there isn’t a clear analog to object classification in images for text.

While characterizing an image as a 2D array of numbers may be somewhat intuitive, transforming sentences into the same general space won’t be. This is because while you can easily treat the brightness/color of a pixel in an image as a number on some range, the same doesn’t seem as intuitive for words. Words are *discrete* while the colors of pixels are *continuous*.

Importantly (and necessarily for our application), these definitions aren’t as incontrovertible as they may seem. While words themselves are certainly distinct, they represent ideas that aren’t necessarily so black and white. There may not be any words between cat and dog, but we can certainly think of concepts between them. And for some pairs of words, there actually are plenty of words in between (i.e. between hot and cold).

colors are continuous, while color names are discrete

For example, the space of colors is certainly continuous, but the space of named colors is not. There are infinitely many colors between black and white, but we really only have a few words for them (grey, steel, charcoal, etc.).

**What if we could find that space of colors from the words for them, and use that space directly?**

We would only require that the words for colors that are similar also be close to each other in the color space. Then, despite the color names being a discrete space, operations we want to do on or between the colors (like mixing colors or finding similar ones) become simple once we first convert them to the continuous space.
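Here is a toy sketch of that idea, using RGB triples as a stand-in for a learned color space (the names and coordinates are illustrative, not learned):

```python
# Hypothetical named-color "embeddings": RGB triples stand in for a learned space.
color_space = {
    "black":    (0, 0, 0),
    "charcoal": (54, 69, 79),
    "steel":    (113, 121, 126),
    "grey":     (128, 128, 128),
    "white":    (255, 255, 255),
}

def nearest(point, space):
    # discrete names become comparable once mapped into the continuous space
    return min(space, key=lambda name: sum((a - b) ** 2
                                           for a, b in zip(space[name], point)))

def mix(name_a, name_b, space):
    # "mixing colors" is just averaging in the continuous space
    return tuple((a + b) / 2 for a, b in zip(space[name_a], space[name_b]))

blend = mix("black", "white", color_space)
closest = nearest(blend, color_space)  # the name nearest the midpoint
```

Mixing "black" and "white" lands at (127.5, 127.5, 127.5), whose nearest named color is "grey" — an operation that had no meaning in the discrete space of names.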

Our method for bringing the discrete world of language into a continuous space like images involves a step like that of the colors. We will find the space of meaning behind the words, by finding embeddings for every word such that words that are similar in meaning are close to one another.

Researchers at Google Brain did exactly this, with their software system Word2Vec. The key realization behind their implementation is that, although words don’t have a continuous definition of meaning we can use for the distance optimization, they do approximately obey a simple rule popular in the natural language processing literature:

The Distributional Hypothesis — *succinctly, it states*:

a word is characterized by the company it keeps

At a high level, this means that rather than optimizing for similar words to be close together, they assume that words that are often in similar contexts have similar meanings, and optimize for that directly instead.

More specifically, the prevailing success was with a model called Skip-grams, which tasked their model with directly outputting a probability distribution of neighboring words (not always directly neighboring, they would often skip a few words to make the data more diverse, hence the name “skip grams"). Once it was good at predicting the probability of words in its context, they took the hidden layer weight matrix and used it as a set of dense continuous vectors representing the words in their vocabulary.

Once this optimization is completed, the resulting word vectors have exactly the property we wanted them to have — amazing! We’ll see that this is exactly the pattern of success here — it doesn’t take much, just a good formulation of what you’re optimizing for. The initial Word2Vec results contained some pretty astonishing figures — in particular, they showed that not only were similar words near each other, but that the dimensions of variability were consistent with simple geometric operations.

For example, you could take the continuous vector representation for king, subtract from it the one for man, add the one for woman, and the closest vector to the result is the representation for queen.

king−man+woman=queen

That is the power of *density*: by forcing these representations to be close to one another, regularities in the language become regularities in the embedded space. This is a desirable property for getting the most out of your data, and is generally necessary in our representations if we are to expect generalization.
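As a toy illustration of that analogy arithmetic, with hand-built two-dimensional "vectors" (a royalty axis and a gender axis) standing in for the dense dimensions Word2Vec actually learns:

```python
# Toy word vectors: dimensions (royalty, gender), purely illustrative.
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}

def analogy(a, b, c, vecs):
    # compute a - b + c, then return the word nearest the result
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    return min(vecs, key=lambda w: sum((t - v) ** 2
                                       for t, v in zip(target, vecs[w])))

result = analogy("king", "man", "woman", vectors)  # -> "queen"
```

Subtracting "man" removes the gender component while leaving royalty intact; adding "woman" puts the other gender back, landing on "queen". Real embeddings exhibit this only approximately, but the geometry is the same.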

Now that we have a way to convert words from human-readable sequences of letters into computer-readable sequences of N-dimensional vectors, we can process our sentences similarly to our GIFs — as arrays with two dimensions: the dimensionality of the word vectors, and the sentence length.

This leaves us with our sentences looking somewhat like rectangles, with durations and heights, and our GIFs looking like rectangular prisms, with durations, heights, and widths.

Just like with the CNN — we’d like to take an RNN trained on a task that requires skills we want to reuse, and isolate the representation from the RNN that immediately precedes the specificity of said task. There are many classical natural language understanding tasks like sentiment analysis, named entity recognition, coreference resolution, etc. but surprisingly few of them require general language understanding. Often, classical NLP methods that pay attention to little more than distinct word categories perform about as well as state-of-the-art deep learning powered systems. Why? In all but rare cases, these problems simply don’t require much more than word level statistics. Classifying a sentence as positive or negative sentiment is roughly analogous to classifying whether an image is of the outdoors or indoors — you’ll do pretty well just learning which colors are outdoors/indoors exclusive and classifying your image on that alone. For sentiment analysis, that method amounts to learning negative/positive weights for every word in a vocabulary, then to classify a sentence multiply the words found in that sentence by their weights and add it all up.

There are more complex cases that require nuanced understanding of context and language to classify correctly, but those instances are infrequent. What often separates these remarkably simple cases from the more complex ones is the independence of the features: only weighting words as negative or positive would never correctly classify “The movie was not good” — at best it would appear neutral when you add up the effects of “not” and “good”. A model that understands the nuance of the language would need to integrate features across words — like our CNN does with its many layers, and our RNN is expected to do over time.
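A minimal sketch of why the independent-weights approach fails on negation, using made-up sentiment weights:

```python
# Per-word sentiment weights: a linear bag-of-words model (values are made up).
weights = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.5, "not": -0.2}

def bow_score(sentence):
    # sum the weights of the words that appear; word order is ignored entirely
    return sum(weights.get(word, 0.0) for word in sentence.lower().split())

score = bow_score("The movie was not good")
# "not" and "good" are scored independently, so the sentence still
# comes out positive — the model cannot see that "not" modifies "good".
```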

While the language tasks above rarely depend on this multi-step integration of features, some researchers at the University of Toronto found an objective that does — and called it Skip-Thoughts. Like the skip-grams objective for finding general word embeddings, the skip-thoughts objective is that of predicting the context around a sentence given the sentence. The embedding comes from a GRU RNN instead of a shallow single hidden-layer neural network, but the objective, and means of isolating the representation, are the same.

We now have most of the pieces required to build the GIF search engine of our dreams. We have generic, relatively low dimensional, dense representations for both GIFs and sentences — the next piece of the puzzle is comparing them to one another. Just as we did with words, we can embed completely different media into a joint space together, so long as we have a metric for their degree of association or similarity. Once a joint embedding like this is complete, we will be able to find synonymous GIFs the same way we did words — just return the ones closest in the embedding space.

Although it is conceptually simple, learning this embedding is a significant challenge — it helps that our representations for both GIFs and sentences are dense, but they are only low dimensional relative to the original media (~4K vs ~1M). There is an issue fundamental to data analysis in high-dimensional space known as the “curse of dimensionality”.

The Curse of Dimensionality — *from Wikipedia*:

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience… The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

You can think of our final set of parameters as a statistical result that requires significant evidence; each parameter update proceeds according to the data we present our training algorithm. While with a dense dataset this would mean each parameter update is likely to be evidenced by many neighboring data points, sparse high dimensional data makes that exponentially less likely. Thus we will need to ensure we have sufficient training data to overcome this burden.

We’ve formulated our problem as one of associating GIFs with their sentence descriptions, but this isn’t exactly a well trodden path — I searched and searched for a dataset specific to this to no avail. GIPHY.com is closest — with a substantial trove of GIFs, and associated human labels for each, but the labels are rarely of the contents of the image itself — instead they are often tags regarding popular culture references, names of people/objects, or general emotions associated with the imagery. By recognizing that we could focus on live action GIFs — which are just short, low resolution videos — I found the Microsoft Research Video Description Corpus, a dataset of 120k sentence descriptions for 2k short YouTube video clips.

In the Skip-Thoughts paper they show that their model returns vectors that are sufficiently generalizable that they can demonstrate competitive image-sentence ranking (a very similar task to ours, just with static images instead of GIFs) with a simple linear embedding of both image and sentence features into a joint 1000 dimensional space. Thus, we attempt to replicate those results but with the YouTube dataset. Our model will be assembled such that there are two embedding matrices, one from the image representation and one from the sentence representation, into a 1024 dimensional joint space. There will be no non-linearities, in order to prevent excessive loss of information. When learning a model, we need to know what makes a good set of parameters and what makes a bad one, so we can appropriately update the parameters and get a better model at the end of our learning process — this is called our “objective function".

Typically in supervised learning we know the exact answers that our model is supposed to be outputting, so we can directly minimize the difference between our model’s outputs and the correct answers for our dataset. In our case here, we don’t know exactly what the embedded vectors in this low dimensional space should be, only that for associated GIFs and sentences the embeddings should be close. We don’t want them to be exactly the same numbers though — there are multiple possible associations for every GIF and our model may draw stronger conclusions than it should. We can accomplish this objective with a formulation called max-margin, where for each training example we fetch one associated pair of GIFs and sentences, and one completely unassociated pair, then pull the associated ones closer to each other than the unassociated ones. We do this enough times (~5M times) and we have a model that accurately embeds GIFs and the sentences that describe them near one another.

The technical side of actually getting this working is both completely unrelated in content to this post and sufficiently involved that it deserves a post of its own, but in short I run the service on AWS g2.8xlarge GPU instances with some autoscaling to deal with variable load. I also obviously can’t compete with a service like GIPHY on content, so instead of managing my own database of GIFs I take a hybrid approach where I maintain a sharded cache across the instances available, and when necessary grab the top 100 results from GIPHY, then rerank this entire collection with respect to the query you typed in. When you type a query into the box at http://deepgif.tarzain.com the embedding process described above is run on your query. If there are precomputed, cached GIFs with a sufficiently high score then I return those results immediately, otherwise I download some GIFs from GIPHY, rerank them, and return relevant results.

Well that’s about it! Have fun playing around with it — and please share cool results with #DeepGIF

Oof, I just read the article Inspecting Algorithms for Bias. It starts off with the sensational quote "There’s software used across the country to predict future criminals. And it’s biased against blacks", attributes it to "ProPublica, a Pulitzer Prize-winning nonprofit news organization", then only in the second half reveals that well *actually* ProPublica's study was flawed and its conclusion is wrong. We can all agree that interpretability is critically important, but this reporting really misses the mark.

Specifically:

ProPublica compared false positive rates and false negative rates for blacks and whites and found them to be skewed in favor of whites.

Northpointe, in contrast, compared the PPVs for different races and found them to be similar.

(Northpointe authored the system, called COMPAS; PPV, the positive predictive value, is the proportion of the system's high-risk judgments that turn out to be right)

So ProPublica compared error rates (false positive and false negative rates) across races and found a difference, and Northpointe compared predictive values (PPV) and found they were the same. However, "the recidivism rates for blacks and whites do in fact differ", so *of course* the predictive-value approach is the fairer metric by which to judge this technology.
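To make the distinction concrete, here are made-up confusion-matrix numbers (not the actual COMPAS data) showing how a tool can have equal PPV across two groups yet unequal false positive rates whenever base rates differ:

```python
def rates(tp, fp, fn, tn):
    # fn is part of the full confusion matrix but unused by these two metrics
    ppv = tp / (tp + fp)   # of those flagged high-risk, how many reoffended
    fpr = fp / (fp + tn)   # of non-reoffenders, how many were flagged anyway
    return ppv, fpr

# Group A: 1000 people, 60% reoffend. Group B: 1000 people, 30% reoffend.
ppv_a, fpr_a = rates(tp=350, fp=150, fn=250, tn=250)
ppv_b, fpr_b = rates(tp=140, fp=60, fn=160, tn=640)
# PPV is 0.7 for both groups, yet FPR is 37.5% for A vs ~8.6% for B:
# with different base rates, you cannot equalize both metrics at once.
```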

But instead of opening with the fact that ProPublica was wrong, this article leads with the fact that ProPublica won a Pulitzer, and then buries the fact that its study was wrong under opaque statistical jargon in the second half of the piece. If I had to guess I'd say that 90% of the readers of this article will come away with a negative impression of COMPAS and an improved impression of ProPublica. Is this responsible journalism?

Props to them for linking to the rebuttal study, at least. From page 10 of that rebuttal:

Discussions about ARAIs (actual risk assessment instruments), statistics, methods, and test-bias may seem complex and uninteresting (we find them rather fascinating). We are at a unique time in history. We are being presented with the chance of a generation, and perhaps a lifetime, to reform sentencing and unwind mass incarceration in a scientific way and that opportunity is slipping away because of misinformation and misunderstanding about ARAIs. Poorly conducted research or misleading statements can lead to confusion and/or paralysis for those charged with making policy.

However, while I disagree with their implicit judgment, I do agree with their explicit conclusion. Both interpretability and fairness are critically important, especially for Sourceress (my employer), and they're both of keen interest to me. As for making sure that we can interpret why our models are making the decisions they are: displaying feature importances is a good start, and packages like LIME can explain individual decisions that we identify as suspicious - I actually discovered LIME at the Fairness workshop at NIPS this year. At Sourceress, anyone who discovers a suspicious automated judgment can send it to me and I'll analyze it for bias and share the results of my investigation with the team.

✍️ Use sparse matrices unless you're working with decision trees April 18th, 2018

What problems do they solve? How are they implemented? Which libraries have implementations? What are the performance implications and takeaways? tl;dr: most algorithms benefit massively by virtue of avoiding extraneous multiply/divide by zero operations; tree-based algorithms don't.

If you're working with data that has many zero values - for example, a set of tf-idf vectors, or a set of one-hot encoded data, or count vectors (as is the case at Sourceress!) - then that data is taking up tons of memory and most arithmetic that you'll do with this matrix is unnecessary (you don't need to multiply anything by zero to know what the result is going to be). Compressing this matrix could save a lot of space and time.

There are lots of implementations. The scipy one stores the matrix as three arrays:

The first array stores the cumulative count of nonzero values in all current and previous rows. The second array stores column index values for each nonzero value. And the third array stores all nonzero values. [1]

With this representation, the original matrix can be completely reconstructed, and addition and multiplication can be done efficiently, apparently. I bet the implementations of these operations were fun algorithms problems.
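The three-array layout described above can be built by hand for a small matrix — this is a from-scratch sketch mirroring scipy's CSR format, not scipy's actual implementation:

```python
dense = [
    [0, 0, 3],
    [4, 0, 0],
    [0, 5, 6],
]

indptr, indices, data = [0], [], []
for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            indices.append(col)   # column index of each nonzero
            data.append(value)    # the nonzero values themselves
    indptr.append(len(data))      # cumulative nonzero count after each row

# Reconstruct the dense matrix to confirm nothing was lost.
rebuilt = [[0] * len(dense[0]) for _ in dense]
for i in range(len(dense)):
    for k in range(indptr[i], indptr[i + 1]):
        rebuilt[i][indices[k]] = data[k]
```

For this matrix the arrays come out to `indptr = [0, 1, 2, 4]`, `indices = [2, 0, 1, 2]`, `data = [3, 4, 5, 6]` — five zeros never stored, and the dense form fully recoverable.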

This is just one possibility among many, though. One research paper I skimmed divided a matrix into many small dense submatrices ("Block compressed row storage" format), each of which might have included some zeros, because this representation was more hardware-agnostic, for some reason. Different hardware might necessitate different implementations.

Looks like SIMD for sparse matrices was an especially active research area from 2008-2014 [2] with the advent of highly parallel/multicore machines. All three of the most highly cited papers stated that sparse matrix-vector multiply was one of the most heavily used kernels in scientific computing [3] [4] [5] (which perhaps shouldn't surprise me, but I had thought that sparse matrices were most commonly used in NLP). Some of these papers also cited the need for a hardware-independent sparse matrix format [4] [5].

Sklearn has dozens of classifier and regressor implementations that accept sparse matrices, including most common algorithms. The way to check is to look at the docstring of the `fit` method and look for `{array-like, sparse matrix}`:

```
def fit(self, X, y, sample_weight=None):
"""Fit the model according to the given training data.
Parameters
----------
X : {array-like, sparse matrix}, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.
```

Scipy additionally has a `csr_matrix` function that can transform dense data into sparse data:

```
from scipy.sparse import csr_matrix
sparse_dataset = csr_matrix(dataset)
```

They can yield dramatic performance improvements on some classic ML algorithms, including Bernoulli naive Bayes (it took 12% as long to fit a sparse matrix as a dense matrix), logistic regression (66% as long), and SVMs (50% as long). Decision tree-based algorithms, however, did not improve, unfortunately.

Many algorithms may end up being faster - the main performance cost is probably paging data in and out of memory, not the additional computational cost of dealing with data that's organized in a more complex way.

Additionally, some data that might not have fit in RAM might be able to when compressed into a sparse representation.

Another benefit - algorithms that might have otherwise spent a lot of time multiplying a vector by 0 can skip these operations instead.

*Sources:*

[1] https://dziganto.github.io/Sparse-Matrices-For-Efficient-Machine-Learning/

[2] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=SIMD+sparse+matrix&btnG=

[3] https://www.sciencedirect.com/science/article/pii/S0167819108001403

[4] https://epubs.siam.org/doi/abs/10.1137/130930352

[5] http://vecpar.fe.up.pt/2016/docs/VECPAR_2016_paper_2.pdf

✍️ NIPS 2017 March 23rd, 2018

I went to NIPS and took 60 pages of notes and collected dozens of pictures.

Original posts:

I saw my first NIPS!! I took 58 pages of notes (https://github.com/JasonBenn/deep-learning-paper-notes/), which is...

Posted by Jason Benn on Sunday, December 10, 2017

Review of the Multimodal Compact Bilinear Pooling paper. I chose to review this one because it has achieved state-of-the-art results on visual question answering and visual grounding tasks, which is just plain cool.

Originally published to Medium (apologies for the wonky formatting).

How do we best identify salient areas of and answer questions about images?

Visual question answering: “What is the woman feeding the giraffe?" Correct answer: “Carrot"

Visual grounding: “A tattooed woman (red) with a green dress (blue) and yellow backpack (green) holding a water bottle (pink) is walking across the street." The bounding boxes drawn would be excellent answers to this task.

We can create good representations of images using convolutional nets and good representations of phrases and sentences using RNNs, so it makes sense that a “visual question answering” (VQA) task or a “visual grounding” task should involve combining these representations somehow. Typical approaches use vector concatenation, element-wise vector summing, or element-wise vector multiplication to combine these representations. The authors suspect these approaches are too simplistic to fully capture the relationships between images and text (makes sense), but are reluctant to use bilinear pooling (AKA a full outer product of vectors — see definitions of terms at the bottom of this article), because the number of resulting parameters is too high to be practical (in this case, our image and text vectors are of length 2048, so the resulting matrix would have 2048² elements, and we’d need to fully connect that matrix to 3000 classes, resulting in ~12.5 billion learnable parameters). The authors apply a recently invented technique called multimodal compact bilinear pooling (MCB), which captures most of the discriminating ability of bilinear pooling with only a few thousand parameters (16k) while still supporting backpropagation.

I have a theory that the term “multimodal compact bilinear pooling" was chosen specifically to maximize the impressiveness of this paper. It’s a mouthful but isn’t such a complicated concept: “bilinear pooling" simply means the outer product of the two input vectors; this is what happens when you multiply every element of one vector of length N by every element of the other vector of length M, resulting in a matrix of size NxM. This yields a TON of parameters, making it powerful but very expensive. “Compact" bilinear pooling means the authors applied dimensionality reduction techniques to get almost the same level of power with way fewer parameters. I don’t know exactly how it works — I didn’t read the paper where compact bilinear pooling was invented — but I have no idea how this could work as parameters change during training. In my head I’m imagining something similar to principal components analysis. “Multimodal", as far as I can tell, is a term that describes this specific use case but doesn’t really mean anything different — compact bilinear pooling could be done on any two vectors, practically speaking it doesn’t matter at all that one vector happens to represent one mode (images) while the other represents another (text).
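A quick sketch of plain (non-compact) bilinear pooling, just to make the parameter-count argument concrete — the tiny vectors are illustrative:

```python
# Bilinear pooling is just the outer product: every element of one vector
# times every element of the other, giving an N x M matrix.
def outer(u, v):
    return [[a * b for b in v] for a in u]

image_vec = [1.0, 2.0, 3.0]   # stand-in for a 2048-D image representation
text_vec = [4.0, 5.0]         # stand-in for a 2048-D text representation
pooled = outer(image_vec, text_vec)  # a 3 x 2 matrix of cross terms

# At real sizes the parameter count explodes: fully connecting a
# 2048 x 2048 outer product to 3000 answer classes needs ~12.6B weights.
full_params = 2048 * 2048 * 3000
```

The compact variant sidesteps this by projecting the outer product down (to 16k dimensions in this paper) without ever materializing the full matrix.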

Can we improve the state of the art in visual question answering and visual grounding tasks with this clever approach to bilinear pooling?

The authors developed a model that uses MCB at three different points where visual and textual representations need to be combined:

1. to predict spatial attention in the image

2. to predict an answer to the question

3. to relate the encoded multiple-choice answers to the combined question+image space (this only applies to the multiple choice problem).

They also use attention maps and additional training data, which feels a bit like sacrificing experimental purity to achieve state-of-the-art results.

> Like, is it a vector of dimensionality 2048, or a vector with 2048 elements, where each element has 3000 dimensions? Is that a 2048-dimensional vector, a 3000-dimensional vector, or a 6,144,000-dimensional vector? Why don’t we call this thing a matrix instead of a vector??

Here’s the architecture breakdown:

Visual question answering: Inputs are an image and a question, and the output is one of 3k classes.

Wait, the “open-ended" VQA is just a 3,000-way classification problem? Are these answers all one word long? How was this answer set generated — were the answers supplied first, then a set of questions and images created that were answerable by one of the 3,000?

The image representations come from the 152-layer ResNet model, pretrained on ImageNet. They threw out the final 1,000-way fully connected layer and applied L2 normalization to the second-to-last layer’s output (the “pool5" layer); this results in a 2048-dimensional vector.

The questions are tokenized into words, one-hot encoded, and passed through a learned embedding layer followed by a 2-layer LSTM; each LSTM layer outputs a 1024-dimensional vector, and the two are concatenated to form a 2048-dimensional vector.

The two 2048-dimensional vectors are combined with MCB into a 16,000-dimensional vector, which is fully connected to the 3,000 top answers.
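To keep the shapes straight, here’s a toy trace of that pipeline. None of these functions are real layers or a real API (the names are mine); each one just returns the output dimensionality described above:

```python
def resnet152_pool5(image):
    # 152-layer ResNet minus its 1,000-way FC layer; L2-normalized "pool5"
    return 2048

def question_encoder(tokens):
    # embedding -> 2-layer LSTM; each layer outputs 1024 dims, concatenated
    return 1024 + 1024

def mcb(dim_a, dim_b):
    # multimodal compact bilinear pooling of the two vectors
    return 16_000

def answer_classifier(dim):
    # fully connected layer over the 3,000 most common answers
    return 3_000

img_dim = resnet152_pool5("image.jpg")
q_dim = question_encoder(["what", "color", "is", "the", "cat"])
print(answer_classifier(mcb(img_dim, q_dim)))  # 3000
```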

Open-ended VQA architecture.

Attention and visual grounding: inputs are a phrase and an image, and the output is a set of bounding boxes. I need to read one or two of these papers to understand this bit:

1. Xu et al., 2015 “Show, attend and tell: Neural image caption generation with visual attention"

2. Xu and Saenko, 2016: “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering"

3. Yang et al., 2015: “Stacked attention networks for image question answering"

But I’ll take a stab at explaining the result anyway: it appears that there are “spatial grid locations" in certain layers of the conv net; the technique relies on merging the visual representation at each location with the language representation and predicting an “attention weight" for that location. From the paper: “Predicting attention maps… allows the model to effectively learn how to attend to salient locations based on both the visual and language representations."

Magic, man.

Architecture for visual grounding task.

Multiple choice answering: answers are encoded with a word embedding and LSTM layers, with the LSTM weights shared across all the answers. These are then merged with MCB and fully connected to N classes, where N is the number of multiple choice answers.

Architecture for multiple-choice VQA.

Visual question answering: significant improvement on the previous year’s state of the art (7.9%), moderate improvement over this year’s 2nd-place competing model (<1%), but most of this improvement is due to incorporating attention maps (and presumably some is due to the additional training data). Thankfully, they ran ablations, omitting parts of the model in turn, so we can see that MCB itself accounts for about 3% of the improvement in the results.

Visual grounding: moderate improvement in the state of the art: slightly under 1% on each of the two visual grounding datasets.

The authors acknowledge that their approach might beat the state of the art simply because their model involves more parameters. They compensate for this confounding variable by tacking fully connected layers onto the end of the non-bilinear-pooled models until the total number of parameters is comparable, then comparing the performance of these models. But I don’t know if I buy that this proves MCB is the winner, rather than just proving that more parameters at the point where the vector representations are combined is the difference maker. Maybe a better control would have been a random selection of dimensions from an outer product of the vectors such that the total dimensionality was the same as MCB’s (16,000)? This would show whether MCB provides superior results to randomly selected parameters at the point where the vectors meet. Further, how do they know that 16,000 is the optimal number of parameters at that point? If that is the information bottleneck, where a representation needs to be highly complex to capture the relationship between image and language, then why not experiment with more or fewer parameters at that location?

The authors were trying to win the competition; thankfully, they also broke down their results by technique. It seems like MCB is worth considering in pretty much any model that involves combining similarly dimensional vectors — it’s an improvement over the standard techniques of concatenation, addition, or multiplication because it captures more information about the relationship between the multimodal representations. As an added benefit, because it’s based on element-wise multiplication, any number of similarly sized vectors can be combined efficiently.

Is the data available? Yes.

How much computation? Not so much — they rely on pre-trained ImageNet weights for their convnet, and presumably used pretrained word embeddings as well.

Can the problem be scaled down? Yes, you could test the viability of alternate approaches using subsets of the Visual7W/Visual Genome/VQA datasets.

How much code development? Their code is available on Github! Nice!

How much work to turn this paper into a concrete and useful application? Well, if we’re specifically talking about improving the state of the art for the tasks discussed in the paper:

Visual question answering: How far can you get with the ability to answer questions about an image, provided your answers are already within a set of 3,000 single-word answers? What subset of useful common questions and answers could this represent? I’m not sure. However, it’s not hard to see how this is a step towards speaking with AI assistants — how long until we have the computational power to encode not just 3,000 short answers, but a much larger set of possible answers, perhaps one that encompasses the most common 80% of answers that people want? Alexa or Google Home probably rely on similar technology.

Visual grounding, on the other hand, could be used to automate cleaning up images, or for video-related tasks such as identifying pedestrians or another driver signaling to you in self-driving car applications. Improving the state of the art here is obviously useful.

But the larger principle (that MCB is a superior approach to combining different types of information, if you buy their methodology) is useful for any mixed-modality task.

The trickiest part about all this is that MCB requires the vectors to be the same length. But that might be impractical — imagine a self-driving plane, which combines visual data from a video camera on the wings with data from a less complex instrument, perhaps a wind sensor on the nose. Preferably you would be able to combine these representations efficiently (i.e., without using a full outer product) but without requiring that your wind sensor output a representation as high-dimensional as your video camera’s. I understand why the authors chose to test MCB on a VQA task — language and imagery are similarly complex and so warrant output vectors of similar dimensionality — it’s a perfect application. A technique for efficiently combining vectors of uneven dimensionality would be a good point of research.

I didn’t find many reviews of this paper, but I did find reviews of a later one criticized for being highly similar to this one, and the reviewer’s comments are relevant enough to be worth mentioning here:

“Impressive results on standard benchmarks show progress. While the novelty is limited, accepting the paper will help others build on the state-of-the-art results."

“However, this paper used many pretrained models and embeddings, so it would make the paper better if all these effects are better analyzed."

outer product of… vectors: fancy name for multiplying two vectors together into a matrix (multiplying each element of vector A by each element of vector B). [1, 2] ⊗ [3, 4] = [[3, 4], [6, 8]]

vs cartesian product: this is technically set theory, but it’s a similar concept to the outer product, except you’re collecting pairs, not products. Similar to the nested loops join algorithm in databases. [1, 2] × [3, 4] = [[[1, 3], [1, 4]], [[2, 3], [2, 4]]]. (Even the symbols show the relationship: ⊗ looks like the cartesian product’s × combined with ○, hinting at the multiplication happening inside.)

vs element-wise multiplication: multiplying two vectors of the same length element by element, producing a third vector of the same length. [1, 2] x [3, 4] = [3, 8].

vs Hadamard product: same as above!

vs dot product: element-wise multiplying two vectors of the same length, then summing the result. [1, 2] ⋅ [3, 4] = 11.

vs inner product: a more general form of the dot product; an inner product can be defined on other vector spaces, including infinite-dimensional ones and ones over the complex numbers (i.e., including imaginary numbers). The dot product is the standard inner product on vectors of real numbers.

vs tensor product: just another name for the outer product (at least when applied to two vectors).

vs cross product: finds a vector perpendicular to both of the given vectors. Useful for physics and engineering and lots of other things… not so much for neural nets :)
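All the products above fit in a few lines of plain Python, using the same example numbers:

```python
a, b = [1, 2], [3, 4]

# outer product: every element of a times every element of b -> a matrix
outer = [[x * y for y in b] for x in a]
assert outer == [[3, 4], [6, 8]]

# cartesian product: same nested loops, but collecting pairs, not products
cartesian = [[[x, y] for y in b] for x in a]
assert cartesian == [[[1, 3], [1, 4]], [[2, 3], [2, 4]]]

# element-wise (Hadamard) product: same-length vectors in, same length out
hadamard = [x * y for x, y in zip(a, b)]
assert hadamard == [3, 8]

# dot product: element-wise multiply, then sum
dot = sum(x * y for x, y in zip(a, b))
assert dot == 11
```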

Bilinear pooling: fancier name for the outer product of two vectors.

“Ablations without MCB": by ablations, they mean that they tried omitting MCB from each location in turn, and replacing it with something comparable (in this case, vector concatenation followed by multiple fully connected layers of similar total dimensionality), to test the added benefit of MCB.

Visual grounding: finding the bounding box of a phrase’s location in an image.

L2 normalization: scaling a vector to unit length by dividing it by its L2 (Euclidean) norm; this keeps feature magnitudes bounded, which helps avoid exploding gradients.
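Concretely, a two-line sketch:

```python
import math

# Divide a vector by its Euclidean length so the result has norm 1;
# this is what's applied to the ResNet "pool5" features above.
v = [3.0, 4.0]
norm = math.sqrt(sum(x * x for x in v))  # 5.0
v_normalized = [x / norm for x in v]
print(v_normalized)                      # [0.6, 0.8]
```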