A nice high-level overview of trends in DL hardware. My top takeaways are that mobile hardware for inference is increasingly ubiquitous, and that the near future of meta-learning will probably require hardware that unites training and inference. Most current hardware focuses on one mode or the other, if my understanding is correct - though that is less true of RL.
Almost two years ago I started to include a Hardware section in my Deep Learning presentations. It was dedicated to a review of the current state of the field and a set of trends for the next 1–5+ years.
Here is a version from April 2016, and here is an update from October 2017. Last year we saw a lot of interesting announcements, I gave some talks with updated slides, and now I am updating it for February/March 2018. I will publish it soon as a separate presentation, and these texts will be companion posts to the slides, with the goal of making them more readable and useful as a reference.
I started to write it as a single post, but it soon became too large, so I decided to split it into a series of bite-sized posts.
I will continually update the texts to fix errors and include recent news and announcements. See the release notes at the bottom of the current post. Feel free to comment on the posts and/or drop me an email at [email protected].
TL;DR / Executive Summary
Here is a short summary of what the series will be about.
•There are two distinct working modes for the currently used Neural Networks (NN, aka Deep Learning, DL): Training (learning a set of weights for a NN designed to solve a specific task) and Inference (using a trained NN). Training is a much more compute-intensive process than inference. Many applications separate these two modes, but some tasks (like Deep Reinforcement Learning) may require tight integration of both. There is, in principle, another mode of Meta-Learning (finding the right architecture, parameters and so on), but let’s set it aside for now.
•Deep Learning is going down the same path Bitcoin has already traveled. ̶M̶i̶n̶i̶n̶g̶ Training started on CPUs (Central Processing Units, ordinary processors from Intel/AMD like Core i7, Ryzen and so on), then switched to GPUs (Graphics Processing Units from NVIDIA/AMD like the NVIDIA GTX 1080 Ti), and is now switching to FPGAs (Field-Programmable Gate Arrays, integrated circuits designed to be programmed by a customer) and ASICs (Application-Specific Integrated Circuits, produced to be customized for special calculations rather than general-purpose computing).
•Right now most of the training happens on GPUs, and specifically NVIDIA GPUs. AMD has almost lost this battle, because their GPUs have very poor support in DL frameworks (despite having very good performance). FPGAs and ASICs are on the rise (among the recent examples is Google’s TPU, the Tensor Processing Unit).
•One of the next big things in DL hardware is mobile processors suited for DL, and almost every company is adding some ML/DL capabilities in the form of special instructions, optimized DSPs, and dedicated NPUs (Neural Processing Units). It’s mostly about inference, not training. Fast (and energy-efficient) mobile processors will allow using [already trained] models to process data instantaneously without the need to send it to the cloud, reducing latency and increasing privacy/security, and this will probably lead to another Cambrian explosion of AI applications.
•Having increased computing power at the edge (mobile, wearables, home devices, IoT, etc) could advance distributed training modes. This topic requires more research and experiments.
•There is an interesting field called neuromorphic computing, and some interesting things are happening there: IBM already has its TrueNorth chip, and Intel has announced its Loihi chip. The two main advantages of neuromorphic architectures are that 1) they are more suitable for brain-like computations; 2) they are potentially much more energy efficient (e.g. TrueNorth consumes only 70 mW; compare this to 250 W for the top NVIDIA GPUs).
•Memristors are being researched, and they could advance neuromorphic computing even further. They provide completely different circuitry compared to the previously mentioned neuromorphic processors, which are still based on transistors. Right now it’s hard to expect something useful in production.
•Quantum computers (QC) are on the rise; last year we saw a great deal of achievements from all the major companies including Google, IBM, Microsoft, Intel and so on. Quantum computing can advance the ML field in different ways, from speeding up many classical algorithms to enabling completely new ones and the development of the field of Quantum Machine Learning. But there are still many obstacles to overcome, and there is a long way to go before we can use QCs on real-life large datasets.
This library appears to be SOTA for automatic model selection and hyperparameter tuning.
Another popular and well-maintained library, TPOT, uses genetic programming to search through model and hyperparameter space; this is more parallelizable than auto-sklearn's strategy of Bayesian search. Neat! I'll try both.
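For reference, here's a minimal sketch of what trying TPOT might look like (standard TPOTClassifier API, with a toy scikit-learn dataset standing in for real data):

```python
# Minimal sketch of trying TPOT on a toy dataset (assumes tpot and scikit-learn are installed).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Genetic programming search over whole pipelines; tiny budget here just to see it run.
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best pipeline it found as plain scikit-learn code.
tpot.export('best_pipeline.py')
```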
Ever since Cal Newport wrote in Deep Work about how I learned to code by reading books and going to a bootcamp (which thousands of people do every year - I'm not special), I've been getting a few emails a week asking for more details. Most often, people ask for the specific books I read as I was preparing, or for general advice on breaking into the field. Below is a representative sample. If I sent this link to you, I hope it's helpful! If you have any further questions, please email me, and I'll use them to improve this guide.
"We greatly overestimate what we can do in one year. But we greatly underestimate what is possible for us in five years." -Peter Drucker. This guide probably would've scared the shit out of me back in 2013 when I was starting. But it's worth it! Software engineering is creative, intellectually stimulating, high-impact, flexible (you can work on almost any problem you want), more social than people realize (pair programming is great!), you can work outside or abroad, it's well compensated, and it has an unlimited skill ceiling. The last part is the best, because if you're more systematic about studying than your peers, then it's just a matter of time before you'll be so in-demand that you'll be able to dictate the terms of your employment such that your lifestyle is however you want it to be. I recommend it to most everyone that thinks they might learn to like it.
tl;dr: two steps to software mastery
"Hi Jason! I've read from Cal Newport's book 'DeepWork' that you already finished reading as many as 18 books in a few weeks/months by the time you attended the notorious Dev boot camp to become a full stack programmer. Could you tell me which are those software books you were reading. I'm looking for a career change into programming, so any inputs you give would give me would surely help me here."
Hi Yasir! You can do much better than the path I took. I recommend:
Step 1, in depth: how to become an employable web developer
"I am Nirmala from Hyderabad, India and I teach programming to children, teenagers and adults who wish to make a career change to programming.
I am curious about the methods that you adopted for getting into the deep focus mode, the methods that were briefly described in Cal Newport's book "DeepWork".
Would it be possible for you to share the strategy that worked for you? Which language were you trying to learn? What were the books you referred and your learning methodology? Would you be able to share any link that refers to your learning methodology?
The reason I am asking you is that the students, particularly the older ones, are barely able to remember what was taught, which makes them frustrated, and slowly their confidence and self-belief ebb away.
Some students have been very successful but I would attribute their success more to their own self-motivation."
Hi Nirmala! That's incredibly cool, I love what you're doing.
The things that I thought would work for me, that I actually implemented, were to cut myself off from the Internet and try to memorize the syntax and the APIs of the languages (I figured that not having to look things up as often would make programming much faster). Memorizing languages turns out to be a nice benefit but it's not nearly as important as I thought.
The things that I didn't realize would help me at first, but that turned out to help a great deal, were being excited and motivated by fun side projects and reading the code of other people working on similar problems.
As for barely being able to remember what you've learned -- hard for me to prescribe a solution when I'm not sure why they're having a hard time remembering. Are they moving too fast from subject to subject without exercising their fragile new knowledge? Are they not asking "why" often enough? Are they trying to memorize things they don't understand?
And my learning methodology is pretty much all from Cal Newport. So Good They Can't Ignore You changed my whole life, and I liked his earlier work about being a pro college student too. And Deep Work, obviously :)
Project idea: Snake in the browser in an HTML canvas
25% CLIENT-SIDE RENDERING WEBAPPS
Learn React, but... ugh, every tutorial also includes Webpack, ES2017, Redux, etc, etc, etc. By the time you get to this point, anything I recommend would be out of date, so just email me for recommendations.
At this point you'll be good enough for an intermediate role at any job you want.
Step 2, in depth: how to master software engineering
"My name is Jonathan and I am a junior developer.
I am currently reading Cal Newport's book and found your part to be really relatable and great. I'd like to ask some questions if you wouldn't mind answering them. As a junior developer, what do you think is a priority in becoming employable ASAP? If you could go back in time and give advice to yourself when you were still learning programming, what would it be? Lastly, what resources (books, online courses, etc.), old or new, would you deem fundamental for any developer? I believe I pick up on things well, but I still feel like I'm lacking the knowledge and understanding that CS graduates would have, as I'm not doing a comp sci degree. If you can't answer them all, that's fine. Any advice would be tremendous, I'm just really keen to absorb any useful information!
Glad you enjoyed the book. Kinda couldn't believe it when he featured me like that, I just responded to a mailing list blast and he interviewed me over Skype for an hour.
The long, slow path to employability is through computer science. There are lots of junior developers with basic web development skills but few of the foundational skills that you'll need for a long, satisfying, and stable career. Fortunately, you now have the perfect resource for getting there! Rejoice, it's teachyourselfcs.com. Give yourself 2-3 years to work through that list and you'll be golden. If you're interviewing now, then just focus on algorithms and data structures, as those will help you most during interview time. I personally found this list really fun and interesting - I finally understood all the terms and concepts that people at the office were talking about! Learning computer science after being a working programmer is like being prescribed glasses for the first time and going outside and noticing all the leaves on the trees. You just see and understand so much more.
Oh look, I already answered #3! You're welcome. I'm always super happy to share that resource because it's just so good. If you're in SF, you should consider taking the Bradfield CS courses too, it's the same content but with other students. I've now taken 6 and they've all leveled me up as a developer. Try to get a job and persuade your employer to pay for them. Once you're done with interviewing, here's the best order to take them in:
This path was a bit muddled because I didn't realize until halfway through that there are two paths in "machine learning": there's deep learning, which involves neural nets and is likely the new hotness you've been hearing all about; and then there's machine learning, AKA data science, which is decidedly less sexy but more universally useful. I learned both, and guess which one I spend all my time on at Sourceress? Data science. Don't worry, you still get to use neural nets.
If your goal is computer vision, then do fast.ai 1 and 2, half of the deep learning book, and CS231n.
If your goal is AI research but you don't want to get a PhD, then do all of the above. Once you've implemented 3 important papers and can pass the Google software engineering interview, you can get a job at Google Brain as a Research Software Engineer.
Goal: become a proper neckbeard. I've read OSTEP (highly recommend as an operating systems 101 resource), and now I'd like a practical guide that goes over topics like:
the Unix kernel
how the filesystem is organized
useful Bash/Shell utilities, etc.
the sequence of events when I start a shell
how to use subshells?!
useful /etc files
which of the 15 methods for bash expansion to actually use
Cuz I've been learning about all of these in a non-organized way for the last 5 years and I still stumble over all of them, so let's try learning about them systematically.
Myles' general guidance: "i don't know if this metaphor is useful but i've been using it recently: you can think of the kernel as a singleton object (i.e. you can't make another instance for testing without booting a whole new machine) with tons of private state and hundreds of methods. bash shell is just the most popular "high level wrapper" (like jquery) for it. all languages also wrap the kernel, but they tend to reimplement a bunch of stuff (like threads) so they can wrap more than one kernel (windows, bsd, etc). even c wraps the kernel. if something's consistently confusing, it always helps to ask "what does the kernel side of this look like? what are its private data structures and methods?" because those are "unwrapped" and they have to be small and fast and simple"
This guy on Quora suggests two others: "Once you've graduated from the Advanced Bash-Scripting Guide, I'd suggest the much more useful Greg's Wiki (especially the Pitfalls article). It's the single most useful Bash resource out there (please someone prove me wrong), and significantly, is an active (and actively moderated) wiki with even anonymous editing."
My take: the Advanced Bash-Scripting Guide assumes no programming knowledge, meh.
Bartosz on Greg's Wiki: "oooh, yes, Greg's wiki! no, really, don't skip it. it explicitly brings your attention to things that the official docs often inadvertently skip over. especially the pitfalls thing is extremely practical. basically it's super useful to have a resource written from a cranky old graybeard perspective of "okay, here's how Bash will try to kill you, plan accordingly"
Given Myles' advice to learn the kernel, I'll start with TLPI. And I'll keep LCL&SSB as a reference because, well, it's only $20. And I'll throw in the Pitfalls article from Greg's Wiki for good measure.
Bartosz: "Oh, for general Linux neckbeard upkeep, can't beat LWN:
BLS thinks that wholesale & retail buyers are vulnerable to automation, with total employment declining by 2% over the next 10 years.
The Times allows that perhaps more jobs are being _created_ by these innovations, but I wonder how many stylists they employ vs. the number of jobs that automation has already displaced. Plus - I doubt that these stylists are in either of these companies' long-term plans. All the work that they do powers future supervised algorithms.
High-Skilled White-Collar Work? Machines Can Do That, Too
A Stitch Fix warehouse in San Francisco. The company relies on algorithms to help personalize shipments to customers. Credit: Christie Hemm Klok for The New York Times
One of the best-selling T-shirts for the Indian e-commerce site Myntra is an olive, blue and yellow colorblocked design. It was conceived not by a human but by a computer algorithm — or rather two algorithms.
The first algorithm generated random images that it tried to pass off as clothing. The second had to distinguish between those images and clothes in Myntra’s inventory. Through a long game of one-upmanship, the first algorithm got better at producing images that resembled clothing, and the second got better at determining whether they were like — but not identical to — actual products.
This back and forth, an example of artificial intelligence at work, created designs whose sales are now “growing at 100 percent," said Ananth Narayanan, the company’s chief executive. “It’s working."
Clothing design is only the leading edge of the way algorithms are transforming the fashion and retail industries. Companies now routinely use artificial intelligence to decide which clothes to stock and what to recommend to customers.
And fashion, which has long shed blue-collar jobs in the United States, is in turn a leading example of how artificial intelligence is affecting a range of white-collar work as well. That’s especially true of jobs that place a premium on spotting patterns, from picking stocks to diagnosing cancer.
A popular T-shirt sold on the Indian e-commerce site Myntra was conceived by two algorithms. One generated random images; the other identified those that resembled existing designs without duplicating them.
“A much broader set of tasks will be automated or augmented by machines over the coming years," Erik Brynjolfsson, an economist at the Massachusetts Institute of Technology, and Tom Mitchell, a Carnegie Mellon computer scientist, wrote in the journal Science last year. They argued that most of the jobs affected would become partly automated rather than disappear altogether.
The fashion industry illustrates how machines can intrude even on workers known more for their creativity than for cold empirical judgments. Among those directly affected will be the buyers and merchandise planners who decide which dresses, tops and pants should populate their stores’ inventory.
A key part of a buyer’s job is to anticipate what customers will want using a well-honed sense of where fashion trends are headed. “Based on the fact that you sold 500 pairs of platform shoes last month, maybe you could sell 1,000 next month," said Kristina Shiroka, who spent several years as a buyer for the Outnet, an online retailer. “But people might be over it by then, so you cut the buy."
Merchandise planners then use the buyer’s input to figure out what mix of clothing — say, how many sandals, pumps and flats — will help the company reach its sales goals.
In the small but growing precincts of the industry where high-powered algorithms roam free, however, it is the machine — and not the buyer’s gut — that often anticipates what customers will want.
Stitch Fix’s business would probably not exist without the use of algorithms. Among other things, they project how many clients will be in a certain situation, or “state," several months in the future, and what volume of clothes people buy in each situation. Credit: Christie Hemm Klok for The New York Times
That’s the case at Stitch Fix, an online styling service that sends customers boxes of clothing whose contents they can keep or return, and maintains detailed profiles of customers to personalize their shipments.
Stitch Fix relies heavily on algorithms to guide its buying decisions — in fact, its business probably could not exist without them. Those algorithms project how many clients will be in a given situation, or “state," several months into the future (like expanding their wardrobe after, say, starting a new job), and what volume of clothes people tend to buy in each situation. The algorithms also know which styles people with different profiles tend to favor — say, a petite nurse with children who lives in Texas.
Myntra, the Indian online retailer, arms its buyers with algorithms that calculate the probability that an item will sell well based on how clothes with similar attributes — sleeves, colors, fabric — have sold in the past. (The buyers are free to ignore the projection.)
All of this has clouded the future of buyers and merchandise planners, high-status workers whose annual earnings can exceed $100,000.
At more conventional retailers, a team of buyers and support workers is assigned to each type of clothing (like designer, contemporary or casual) or each apparel category, like dresses or tops. Some retailers have separate teams for knit tops and woven tops. A parallel merchandise-planning group could employ nearly as many people.
Bombfell, which is similar to Stitch Fix but caters specifically to men, relies on a single buyer for its tops and accessories. The company’s data and algorithmic tools help the buyer project clothing demand. Credit: Jeenah Moon for The New York Times
Buyers say this specialization helps them intuitively understand trends in styles and colors. “You’re so immersed in it, you almost get a feeling," said Helena Levin, a longtime buyer at retailers like Charlotte Russe and ModCloth.
Ms. Levin cited mint-green dresses, a top seller earlier this decade. “One day it just died," she said. “It stopped. ‘O.K., everything mint, get out.’ Right after, it looked old. You could feel it."
But retailers adept at using algorithms and big data tend to employ fewer buyers and assign each a wider range of categories, partly because they rely less on intuition.
At Le Tote, an online rental and retail service for women’s clothing that does hundreds of millions of dollars in business each year, a six-person team handles buying for all branded apparel — dresses, tops, pants, jackets.
Brett Northart, a co-founder, said the company’s algorithms could identify what to add to its stock based on how many customers placed the items on their digital wish lists, along with factors like online ratings and recent purchases.
Nathan Cates, a Bombfell buyer, in a T-shirt from among the company’s offerings, with Will Noguchi, a stylist. Bombfell’s algorithms help Mr. Cates make choices about what to buy, but he is obsessive about touching the fabric before acquiring an item and almost always tries it on first. Credit: Jeenah Moon for The New York Times
Bombfell, a box service similar to Stitch Fix catering only to men, relies on a single employee, Nathan Cates, to buy all of its tops and accessories.
The company has built algorithmic tools and a vast repository of data to help Mr. Cates, who said he could more accurately project demand for clothing than a buyer at a traditional operation.
“We know exactly who our customers are," he said. “We know exactly where they live, what their jobs are, what their sizing is."
For now, at least, only a human can do parts of his job. Mr. Cates is obsessive about touching the fabric before purchasing an item and almost always tries it on first.
“If this is a light color, are we going to see your nipples?" he explained. (The verdict on a mint T-shirt he donned at the company’s headquarters in New York? “A little nipply.")
One sector of the fashion industry that is adding jobs is the ranks of the stylists who pick clothes to send to customers, based partly on recommendations from algorithms. One stylist, Mary Kate Makowski, fulfilled an order at Trunk Club. Credit: Whitten Sabbatini for The New York Times
There are other checks on automation. Negotiations with suppliers typically require a human touch. Even if an algorithm can help buyers make decisions more quickly and accurately, there are limits to the number of supplier relationships they can juggle.
Arti Zeighami, who oversees advanced analytics and artificial intelligence for the H & M group, which uses artificial intelligence to guide supply-chain decisions, said the company was “enhancing and empowering" human buyers and planners, not replacing them. But he conceded it was hard to predict the effect on employment in five to 10 years.
Experts say some of these jobs will be automated away. The Bureau of Labor Statistics expects employment of wholesale and retail buyers to contract by 2 percent over a decade, versus a 7 percent increase for all occupations. Some of this is because of the automation of less sophisticated tasks, like cataloging inventory, and buying for less stylistically demanding retailers (say, auto parts).
There is at least one area of the industry where the machines are creating jobs rather than eliminating them, however. Bombfell, Stitch Fix and many competitors in the box-fashion niche employ a growing army of human stylists who receive recommendations from algorithms about clothes that might work for a customer, but decide for themselves what to send.
“If they’re not overly enthusiastic upfront when I ask how do you feel about it, I’m making a note of it," said Jade Carmosino, a sales manager and stylist at Trunk Club, a Stitch Fix competitor owned by Nordstrom.
In this, stylists appear to reflect a broader trend in industries where artificial intelligence is automating white-collar jobs: the hiring of more humans to stand between machines and customers.
For example, Chida Khatua, the chief executive of EquBot, which helped create an exchange-traded fund that is actively managed by artificial intelligence, predicted that the asset-management industry would hire more financial advisers even as investing became largely automated.
Paper Club (plus a few friends and colleagues) is going to work through this content at a cadence of one meeting every other Wednesday starting late June 2018.
I reviewed content from Andrew Ng's Coursera course, and the textbook just seems both denser and more practical than Ng's course for the content where they overlap (which is about two thirds of it). I see several substitutions:
We can replace all four of these lessons:
Linear Regression with One Variable
Linear Regression with Multiple Variables
With just Chapter 4, and I think Chapter 4 is better anyway, so that's nice. Saves us almost a month!
These lessons have no analogue in the sklearn book (though there is one great chapter in the Deep Learning book by Goodfellow) and seem great:
Unsupervised Learning: it's just k-means, which we've covered, so we can skip.
Dimensionality Reduction: it's just PCA, which we could do either here or in sklearn with Chap 8. I'd bias Chap 8 because it's in sklearn, and because it also includes a brief overview of other techniques (including t-SNE and LDA, which are pretty common)
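For reference, a minimal scikit-learn sketch of that dimensionality-reduction piece (standard PCA and t-SNE APIs, with a toy dataset standing in):

```python
# Minimal dimensionality-reduction sketch with scikit-learn (toy dataset as a stand-in).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)             # 64-dimensional inputs

X_pca = PCA(n_components=2).fit_transform(X)    # linear projection onto the top 2 components
X_tsne = TSNE(n_components=2).fit_transform(X)  # non-linear embedding, mostly for visualization

print(X_pca.shape, X_tsne.shape)                # both (n_samples, 2)
```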
I'm not sure about these sections, so let's revisit this at the end:
Anomaly Detection - using Gaussian distributions. This should be really easy after all the Bayes stuff (a tiny sketch of the idea follows this list).
Recommender Systems - we already made a basic collaborative filter with fast.ai (though I'd be down to try this again with Sourceress data)
Large Scale Machine Learning - it's about stochastic/minibatch/regular gradient descent... and Map Reduce? I'm skeptical. Probably skip.
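And here's the tiny Gaussian anomaly-detection sketch promised above (my own toy illustration of the idea, not code from the course):

```python
# Toy sketch of Gaussian anomaly detection: fit a Gaussian to "normal" data,
# then flag points whose density falls below a threshold epsilon.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # stand-in for normal examples
new_points = np.array([[0.1, -0.2], [6.0, 6.0]])              # the second point is an outlier

mu = normal_data.mean(axis=0)
cov = np.cov(normal_data, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov).pdf(new_points)

epsilon = 1e-3                     # in practice, chosen on a labeled validation set
print(density < epsilon)           # flags the second point as anomalous
```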
The main last practical thing that this sequence misses is NLP and data preprocessing - once you understand how to transform text into numerical features, then all of the above techniques open up to you (ML algorithms only take numbers as inputs). I'll add 1 week on http://scikit-learn.org/stable/modules/feature_extraction.html & maybe some word vector stuff.
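A minimal sketch of that feature-extraction step in scikit-learn (standard CountVectorizer/TfidfVectorizer APIs, toy documents):

```python
# Minimal sketch of turning text into numeric features with scikit-learn's feature_extraction module.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great", "the movie was not good", "great acting, great plot"]

counts = CountVectorizer().fit_transform(docs)  # bag-of-words counts, shape (3, vocab_size)
tfidf = TfidfVectorizer().fit_transform(docs)   # same idea, reweighted by inverse document frequency

print(counts.shape, tfidf.shape)                # sparse matrices, ready for any sklearn estimator
```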
So here's the 10ish-week core sequence I recommend, optimized for real-world practicality and covering lots of interesting ground quickly:
We don't really measure AI progress with the Turing test or other games anymore, so that's a bit of a strawman. In fact, OpenAI explicitly measures progress on generalization with Gym/Universe by evaluating agents on their ability to play never-before-seen games.
This is a good framing of intelligence, though: autonomy as opposed to automation. And I like Moravec's Paradox: the simplest reality is more complex than the most complex game. AIs that achieve superhuman performance in games will still get crushed by reality.
In recent weeks I've been forced to reformulate and distill my views on AI. After my winter post went viral, many people contacted me over email and on Twitter with good suggestions. Since there is now more attention to what I have to offer, I decided to write down in a condensed form what I think is wrong with our approach to AI and what we could fix. Here are my 10 points:
We are trapped by Turing's definition of intelligence. In his famous formulation, Turing confined intelligence to a solution of a verbal game played against humans. This in particular frames intelligence as (1) a solution to a game, and (2) puts humans in the judgment position. This definition is extremely deceptive and has not served the field well. Dogs, monkeys, elephants and even rodents are very intelligent creatures but are not verbal and hence would fail the Turing test.
The central problem of AI is Moravec's Paradox. It is vastly more stark today than it was when originally formulated in 1988, and the fact that we've done so little to address it over those 30 years is embarrassing. The central thesis of the paradox is that the apparently simplest reality is more complex than the most complex game. We are obsessed with superhuman performance in games (and other restricted and well-defined universes of discourse such as datasets) as an indicator of intelligence, a position coherent with the Turing test. We completely ignore the fact that it is reality itself, rather than a committee of humans, that makes the ultimate judgments on the intelligence of actors.
Our models may even work, but often for the wrong reason. I've elaborated on that in my other posts; deep learning comes in as a handy example. We apparently solved object recognition, but numerous studies show that the reasons why deep nets recognize objects are vastly different from the reasons why humans detect objects. For a person concerned with fooling humans in the spirit of the Turing test this may not be important. For a person concerned with the ability of an artificial agent to deal with unexpected (out-of-domain) reality, this is of central importance.
Reality is not a game. If anything, it is an infinite collection of games with ever-changing rules. Anytime some major development happens, the rules of the game are rewritten and all the players need to adjust or they die. Intelligence is a mechanism that evolved to allow agents to solve this problem. Since intelligence is a mechanism to help us play the "game with ever-changing rules", it is no wonder that as a side effect it allows us to play actual games with a fixed set of rules. That said, the opposite is not true: building machines that exceed our capabilities at playing fixed-rule games tells us close to nothing about how to build a system that could play a "game with ever-changing rules".
There are certain rules in physical reality that don't change - these are the laws of physics. We have verbalized them and used them to make predictions that allowed us to build civilization. But every organism on this planet masters these rules non-verbally in order to behave in the physical environment. A child knows the apple will fall from the tree well before it learns about Newtonian dynamics.
Our statistical models for vision are vastly insufficient, as they rely only on the frozen-in-time appearance of things and human-assigned abstract labels. A deep net can see millions of images of apples on trees and will never figure out the law of gravity (and many other things which are absolutely obvious to us).
The hard thing about common sense is that it is so obvious to us, it is very hard to even verbalize and consequently label in the data. We have a giant blind spot that covers everything which is "obvious". Consequently we can't teach computers common sense, not only because it would likely be impractical, but more fundamentally because we don't even realize what it is. We don't realize it until our robot does something extremely stupid, and only then does a eureka moment arise: "oh, it does not understand that ... [put any obvious fact of choice here] ...".
If we want to address Moravec's paradox [which in my opinion should be the focal point of any serious AI effort today], we somehow need to mimic the ability of organisms to learn purely from observing the world, without the need for labels. A promising idea towards achieving this goal is to build systems that make temporal predictions of future events and learn by comparing the actual development with their prediction. Numerous experiments suggest that this is indeed what is going on in biological brains, and it makes a lot of sense from numerous perspectives, as these systems, among other things, would have to learn the laws of physics (as they appear to the observing agent, a.k.a. folk physics). The predictive vision model is a step in that direction, but certainly not the last step.
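As a toy illustration of that learn-by-comparing-prediction-with-reality loop (just the bare principle, not the predictive vision model itself), here is a linear predictor that updates purely from its own prediction error:

```python
# Toy illustration of learning purely from prediction error: a linear model predicts the next
# observation in a sequence and updates itself from the gap between prediction and reality.
import numpy as np

w = np.zeros(2)                            # predictor: next value from the two previous values
lr = 0.01
series = np.sin(np.linspace(0, 20, 500))   # stand-in for an observed signal

for t in range(2, len(series)):
    context = series[t-2:t]                # what the agent has seen so far
    prediction = w @ context               # its guess about the next observation
    error = series[t] - prediction         # compare with what actually happened
    w += lr * error * context              # adjust to predict better next time (no labels needed)

print(w)   # the weights adapt so the prediction error shrinks over time
```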
Almost all that we do today and call AI is some form of automation of things that can be verbalized. In many areas this may work, but it is really not very different from putting Excel in place of a paper spreadsheet to help accountants. The area which is (and always was) problematic is autonomy. Autonomy is not automation. Autonomy means a lot more than just automation, and it means a whole lot more if it is autonomy required to be safer than humans, as in self-driving cars. Autonomy should almost be synonymous with broadly defined intelligence, as it assumes the ability to deal with the unexpected, untrained, proverbial unknown unknowns.
These are the core points I'd like to convey. They have various nuances, which is why I write this blog. But certainly, if you acknowledge these points, we are pretty much on the same page. There are numerous other details which are heavily debated and which I don't think are essential, but for completeness let me express my views on a few of them:
Innate or learned? Certainly there are organisms with innate capabilities and certainly there are things we learn. This is, however, an implementation-related question and I don't think it has a definite answer. In our future development I'm sure we will use a combination of both.
Learned features or hand-crafted features? This is a related question. My broad view is that the vast majority of aspects of the "cortical computation" will be learned, that is, in the context of AI and autonomy (but that does not mean we can't handcraft something if it proves to be useful and otherwise hard to learn for some reason). There are also huge pieces of the brain that are most likely pre-wired. In the more specific application of automation, things can go both ways. There are cases in which learned features are clearly superior to hand-crafted ones (the whole sales pitch of deep learning), but there are numerous applications where carefully handcrafted and developed features are absolutely, unquestionably superior to any learned stuff. In general I think it is a false dichotomy.
Spiking, continuous, digital or analog, maybe quantum? I don't have an extremely strong position on that; each has advantages and disadvantages. Digital is simple, deterministic and readily available. Analog is hard to control but uses far less power. The same goes for spiking, though that has the added benefit of being closer to biology, which may suggest that for some reason it is the better solution. Quantum? I'm not sure there is any strong evidence for the necessity of quantum computation in solving intelligence, though we may find out it is necessary as we go. These are all questions about "how?". My main interest is in the question of "what?".
Since I want to keep it short (it is already too long) I'll stop here. Feel free to give me feedback in comments, emails and on twitter.
Great work by Zain Shah learning a joint embedding of GIFs and sentences.
"Max-margin" is a better name for this objective function than "triplet loss", which is how I first heard of it. Think that was from the FaceNet (?) paper from Facebook.
Key numbers: 5M comparisons to train, 100 cached/precomputed GIFs for comparison & ready to return for scaling load, AWS g2.8xlarge GPU instances to train (as of Nov '16), 1024D latent space (wonder if smaller is possible?), Microsoft Research Video Description Corpus for data: 120k one-sentence descriptions of 2k short YouTube clips.
Building a Deep Learning Powered GIF Search Engine
They say a picture’s worth a thousand words, so GIFs are worth at least an order of magnitude more. But what are we to do when the experience of finding the right GIF is like searching for the right ten thousand words in a library full of books, and your only aid is the Dewey Decimal System?
what finding the right GIF is like
We build a GIF search engine of course!
what it should be like
This seems quite magical — you type in a phrase and get exactly the GIF you were thinking of — but behind the scenes it’s a matter of gluing two machine learning models pre-trained on massive datasets together by training a third, smaller model on a dataset.
TL;DR — If you’re already familiar with the basics of deep learning, the following few paragraphs will cover a high level overview of how the GIF search engine works, how it was trained, etc. If not, don’t fret — I give a bottom up explanation of the entire process with minimal math background required below this overview.
By formulating the problem generally we can get away with only having to learn a shallow neural network model that embeds hidden layer representations from two pre-trained models — a convolutional neural network pre-trained to classify objects in images, and a recurrent neural network pre-trained to predict surrounding context in text. We train a shallow neural network to embed the representations from these models into a joint space together based on associations from a corpus of short videos and their sentence descriptions.
Specifically, these models are the VGG16 16-layer CNN pre-trained on ImageNet, the Skip Thoughts GRU RNN pre-trained on the BooksCorpus, and a set of 2 linear embedding matrices trained jointly with the others on the videos and sentence descriptions from the Microsoft Video Description Corpus. This framing does constrain the resulting model to only working well with live-action videos that are similar to the videos in the MVDC, but the pre-trained image and sentence models help it generalize to pairings in that domain it has never before seen.
Don’t worry if the above doesn’t make sense — if you’d like to know more read on and I’ll explain how the individual pieces work below.
The GIFs are processed by a CNN (left), and your query is processed by a RNN (right). The result is the GIF with the closest embedding to your query’s embedding
First, we have our CNN or convolutional neural network, pre-trained to classify the objects found in images. At a high level, a convolutional neural network is a deep neural network with a specific pattern of parameter reuse that enables it to scale to large inputs (read: images). Researchers have trained convolutional neural networks to exhibit near-human performance in classifying objects in images, a landmark achievement in computer vision and artificial intelligence in general.
Now what does this have to do with GIFs? Well, as one may expect, the “skills" learned by the neural network in order to classify objects in an image should generalize to other tasks requiring understanding images. If you taught a robot to tell you what’s in an image for example, and then started asking it to draw the boundaries of such objects (a vision task that is harder, but requires much of the same knowledge), you’d hope it would pick up this task more quickly than if it had started on this new task from scratch.
We can use an understanding of how neural networks function to figure out exactly how to achieve such an effect. Deep neural networks are so called because they contain layers of composed pieces — each layer is simply a matrix multiplication followed by an activation function.
This means, for a given input, we multiply it by a matrix, then pass it through one of those functions, then multiply it by another matrix, then pass it through one of those functions again, until we have the numbers we want. In classification, the numbers we want are a probability distribution over classes/categories, and this is necessarily far fewer numbers than in our original input.
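To make that concrete, here is a tiny numpy sketch of a couple of such layers with made-up sizes (an illustration only, not the actual networks discussed here):

```python
# One "layer" of a deep network: a matrix multiplication followed by an activation function.
import numpy as np

def layer(x, W, b):
    return np.tanh(W @ x + b)       # tanh as the saturating activation function

x = np.random.randn(4)              # a tiny made-up input
W1, b1 = np.random.randn(8, 4), np.random.randn(8)
W2, b2 = np.random.randn(3, 8), np.random.randn(3)

h = layer(x, W1, b1)                             # first layer: warp the input
scores = W2 @ h + b2                             # final layer: reduce to (here) 3 class scores
probs = np.exp(scores) / np.exp(scores).sum()    # softmax: a probability distribution over classes
print(probs)
```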
It is well understood that matrix multiplications simply parametrize transformations of a space of information — e.g. for images you can imagine that each matrix multiplication warps the image a bit so that it is easier to understand for subsequent layers, amplifying certain features to cover a wider domain, and shrinking others that are less important. You can also imagine that, based on the shape of the common activation functions (they “saturate" at the limits of their domain from -∞ to ∞, and only have a narrow range around their center when they aren’t strictly one number or another), they are utilized to “destroy" irrelevant information by shifting and stretching their narrow range of effectiveness to the region of interest in the data. Further, by doing this many times rather than only once, the network can combine features from disparate parts of the image that are relevant to one another.
When you put these pieces together what you really have is a simple, but powerful, layered machine that destroys, combines, and warps information from an image until you only have the information relevant to the task at hand.
When the task at hand is classification, then it transforms the image information until only the information critical to making a class decision is available. We can leverage this understanding of the neural network to realize that just prior to the layer that outputs class probabilities we have a layer that does most of the dirty work in understanding the image except reducing it to class labels.
Now we can reuse learned abilities from our previous task, and generalize far beyond our limited training data for this new task. More concretely, for a given image, we recognize that this penultimate layer’s output may be a more useful representation than the original (the image itself) for a new task if it requires similar skills. For our GIF search engine this means that we’ll be using the output of the VGG-16 CNN from Oxford trained on the task of classifying images in the ImageNet dataset as our representation for GIFs (and thereby the input to the machine learning model we’d like to learn).
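As an illustration of that step (sketched here with Keras's pre-trained VGG16, which exposes the 4096-dimensional penultimate "fc2" layer — not necessarily the exact tooling the original system used):

```python
# Sketch of extracting the penultimate-layer ("fc2") representation from VGG16 with Keras.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from keras.models import Model

base = VGG16(weights='imagenet')   # the full ImageNet classifier
features = Model(inputs=base.input, outputs=base.get_layer('fc2').output)  # drop the class layer

img = image.load_img('frame.jpg', target_size=(224, 224))   # 'frame.jpg' is a hypothetical GIF frame
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

vec = features.predict(x)
print(vec.shape)                   # (1, 4096): the image representation we reuse downstream
```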
What is the equivalent representation for sentences? That brings us to the second piece of our puzzle, the SkipThoughts GRU (gated-recurrent-unit) RNN (recurrent neural network) trained on the Books Corpus. Just as convolutional networks share their parameters across the width and height of an image, recurrent ones share their parameters across the length of a sequence. Convolutional networks’ parameter sharing relies on an assumption that only local features are relevant at each layer of the hierarchy, and these features are then integrated by moving up the hierarchy, incrementally summarizing and distilling the data below at each step. Recurrent networks however accumulate data over time, adding the input they are currently looking at to a history. In this manner, they effectively have “memory", and can operate on arbitrary sequences of data — pen strokes, text, music, speech, etc. Like convolutional neural networks, they represent the state of the art in many sequence learning tasks like speech recognition, sentiment analysis from text, and even handwriting recognition.
There are two interesting problems here:
1) it isn’t immediately straightforward how you represent words in a sentence like we do pixels in an image
2) there isn’t a clear analog to object classification in images for text.
While characterizing an image as a 2D array of numbers may be somewhat intuitive, transforming sentences into the same general space won’t be. This is because while you can easily treat the brightness/color of a pixel in an image as a number on some range, the same doesn’t seem as intuitive for words. Words are discrete while the colors of pixels are continuous.
Discrete vs Continuous
Importantly (and necessarily for our application), these definitions aren’t as incontrovertible as they may seem. While words themselves are certainly distinct, they represent ideas that aren’t necessarily so black and white. There may not be any words between cat and dog, but we can certainly think of concepts between them. And for some pairs of words, there actually are plenty of words in between (i.e. between hot and cold).
colors are continuous, while color names are discrete
For example, the space of colors is certainly continuous, but the space of named colors is not. There are infinitely many colors between black and white, but we really only have a few words for them (grey, steel, charcoal, etc.).
What if we could find that space of colors from the words for them, and use that space directly?
We would only require that the words for colors that are similar also be close to each other in the color space. Then, despite the color names being a discrete space, operations we want to do on or between the colors (like mixing colors or finding similar ones) become simple once we first convert them to the continuous space.
Our method for bringing the discrete world of language into a continuous space like images involves a step like that of the colors. We will find the space of meaning behind the words, by finding embeddings for every word such that words that are similar in meaning are close to one another.
Researchers at Google Brain did exactly this with their software system Word2Vec. The key realization behind their implementation is that, although words don’t have a continuous definition of meaning we can use for the distance optimization, they do approximately obey a simple rule popular in the natural language processing literature
The Distributional Hypothesis — succinctly, it states:
a word is characterized by the company it keeps
At a high level, this means that rather than optimizing for similar words to be close together, they assume that words that are often in similar contexts have similar meanings, and optimize for that directly instead.
More specifically, the prevailing success was with a model called Skip-grams, which tasked the model with directly outputting a probability distribution over neighboring words (not always directly neighboring — they would often skip a few words to make the data more diverse, hence the name “skip-grams"). Once it was good at predicting the probability of words in its context, they took the hidden layer weight matrix and used it as a set of dense continuous vectors representing the words in their vocabulary.
Once this optimization is completed, the resulting word vectors have exactly the property we wanted them to have — amazing! We’ll see that this is exactly the pattern of success here — it doesn’t take much, just a good formulation of what you’re optimizing for. The initial Word2Vec results contained some pretty astonishing figures — in particular, they showed that not only were similar words near each other, but that the dimensions of variability were consistent with simple geometric operations.
For example, you could take the continuous vector representation for king, subtract from it the one for man, add the one for woman, and the closest vector to the result is the representation for queen.
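With pre-trained vectors loaded via, say, gensim (the Google News vectors below are just the usual publicly available example), that analogy is literally a couple of lines:

```python
# Sketch of the king - man + woman ~ queen analogy with pre-trained word2vec vectors via gensim.
from gensim.models import KeyedVectors

# Assumes the publicly released Google News vectors have been downloaded locally.
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# -> [('queen', ...)] with these particular vectors
```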
That is the power of density: by forcing these representations to be close to one another, regularities in the language become regularities in the embedded space. This is a desirable property for getting the most out of your data, and is generally necessary in our representations if we are to expect generalization.
Now that we have a way to convert words from human-readable sequences of letters into computer-readable sequences of N-dimensional vectors, we can process our sentences similarly to our GIFs — with two dimensions: the dimensionality of the word vectors, and the sentence length.
This leaves us with our sentences looking somewhat like rectangles, with durations and heights, and our GIFs looking like rectangular prisms, with durations, heights, and widths.
Just like with the CNN — we’d like to take an RNN trained on a task that requires skills we want to reuse, and isolate the representation from the RNN that immediately precedes the specificity of said task. There are many classical natural language understanding tasks like sentiment analysis, named entity recognition, coreference resolution, etc. but surprisingly few of them require general language understanding. Often, classical NLP methods that pay attention to little more than distinct word categories perform about as well as state-of-the-art deep learning powered systems. Why? In all but rare cases, these problems simply don’t require much more than word level statistics. Classifying a sentence as positive or negative sentiment is roughly analogous to classifying whether an image is of the outdoors or indoors — you’ll do pretty well just learning which colors are outdoors/indoors exclusive and classifying your image on that alone. For sentiment analysis, that method amounts to learning negative/positive weights for every word in a vocabulary, then to classify a sentence multiply the words found in that sentence by their weights and add it all up.
There are more complex cases, that require nuanced understanding of context and language to classify correctly, but those instances are infrequent. What often separates these remarkably simple cases from the more complex ones is the independence of the features: only weighting words as negative or positive would never correctly classify “The movie was not good" — at best it would appear neutral when you add up the effects of “not" and “good". A model that understands the nuance of the language would need to integrate features across words — like our CNN does with its many layers, and our RNN is expected to do over time.
While the language tasks above rarely depend on this multi-step integration of features, some researchers at the University of Toronto found an objective that does — and called it Skip-Thoughts. Like the skip-grams objective for finding general word embeddings, the skip-thoughts objective is that of predicting the context around a sentence given the sentence. The embedding comes from a GRU RNN instead of a shallow single hidden-layer neural network, but the objective, and means of isolating the representation, are the same.
Learning a Joint Embedding
We now have most of the pieces required to build the GIF search engine of our dreams. We have generic, relatively low dimensional, dense representations for both GIFs and sentences — the next piece of the puzzle is comparing them to one another. Just as we did with words, we can embed completely different media into a joint space together, so long as we have a metric for their degree of association or similarity. Once a joint embedding like this is complete, we will be able to find synonymous GIFs the same way we did words — just return the ones closest in the embedding space.
The GIFs are processed by a CNN (left), and your query is processed by a RNN (right). The result is the GIF with the closest embedding to your query’s embedding
Although it is conceptually simple, learning this embedding is a significant challenge — it helps that our representations for both GIFs and sentences are dense but they are only low dimensional relative to the original media (~4K vs ~1M). There is an issue fundamental to data analysis in high-dimensional space known as the “curse of dimensionality"
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience… The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
You can think of our final set of parameters as a statistical result that requires significant evidence; each parameter update proceeds according to the data we present our training algorithm. While with a dense dataset this would mean each parameter update is likely to be evidenced by many neighboring data points, sparse high dimensional data makes that exponentially less likely. Thus we will need to ensure we have sufficient training data to overcome this burden.
We’ve formulated our problem as one of associating GIFs with their sentence descriptions, but this isn’t exactly a well trodden path — I searched and searched for a dataset specific to this to no avail. GIPHY.com is closest — with a substantial trove of GIFs, and associated human labels for each, but the labels are rarely of the contents of the image itself — instead they are often tags regarding popular culture references, names of people/objects, or general emotions associated with the imagery. By recognizing that we could focus on live action GIFs — which are just short, low resolution videos — I found the Microsoft Research Video Description Corpus, a dataset of 120k sentence descriptions for 2k short YouTube video clips.
In the Skip-Thoughts paper they show that their model returns vectors that are sufficiently generalizable that they can demonstrate competitive image-sentence ranking (a very similar task to ours, just with static images instead of GIFs) with a simple linear embedding of both image and sentence features into a joint 1000 dimensional space. Thus, we attempt to replicate those results but with the YouTube dataset. Our model will be assembled such that there are two embedding matrices, one from the image representation and one from the sentence representation, into a 1024 dimensional joint space. There will be no non-linearities, in order to prevent excessive loss of information. When learning a model, we need to know what makes a good set of parameters and what makes a bad one, so we can appropriately update the parameters and get a better model at the end of our learning process — this is called our “objective function".
Typically in supervised learning we know the exact answers that our model is supposed to be outputting, so we can directly minimize the difference between our model’s outputs and the correct answers for our dataset. In our case here, we don’t know exactly what the embedded vectors in this low dimensional space should be, only that for associated GIFs and sentences the embeddings should be close. We don’t want them to be exactly the same numbers though — there are multiple possible associations for every GIF and our model may draw stronger conclusions than it should. We can accomplish this objective with a formulation called max-margin, where for each training example we fetch one associated pair of GIFs and sentences, and one completely unassociated pair, then pull the associated ones closer to each other than the unassociated ones. We do this enough times (~5M times to be exact) and we have a model that accurately embeds GIFs and the sentences that describe them near one another.
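Putting the last two paragraphs together, a stripped-down numpy sketch of the two linear embeddings plus the max-margin objective for a single training triple might look like this (made-up dimensions and random stand-in vectors, not the actual training code):

```python
# Sketch of the joint embedding + max-margin objective: two linear maps into a shared space,
# trained so an associated (GIF, sentence) pair scores higher than an unassociated one by a margin.
import numpy as np

d_img, d_sent, d_joint = 4096, 4800, 1024        # image features, sentence features, joint space
W_img = np.random.randn(d_joint, d_img) * 0.01   # the two embedding matrices we learn
W_sent = np.random.randn(d_joint, d_sent) * 0.01

def embed(W, v):
    u = W @ v
    return u / np.linalg.norm(u)                 # unit-normalize so the dot product is cosine similarity

gif = np.random.randn(d_img)                     # stand-in for a GIF's CNN representation
s_pos = embed(W_sent, np.random.randn(d_sent))   # sentence that actually describes the GIF
s_neg = embed(W_sent, np.random.randn(d_sent))   # sentence sampled from an unrelated pair
g = embed(W_img, gif)

margin = 0.2
loss = max(0.0, margin - g @ s_pos + g @ s_neg)  # zero once the true pair wins by at least the margin
print(loss)                                      # gradients of this loss would update W_img and W_sent
```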
Turning it into a real service
The technical side of actually getting this working is both completely unrelated in content to this post and sufficiently involved that it deserves a post of its own, but in short I run the service on AWS g2.8xlarge GPU instances with some autoscaling to deal with variable load. I also obviously can’t compete with a service like GIPHY on content, so instead of managing my own database of GIFs I take a hybrid approach where I maintain a sharded cache across the instances available, and when necessary grab the top 100 results from GIPHY, then rerank this entire collection with respect to the query you typed in. When you type a query into the box at http://deepgif.tarzain.com the embedding process described above is run on your query. If there are precomputed, cached GIFs with a sufficiently high score then I return those results immediately, otherwise I download some GIFs from GIPHY, rerank them, and return relevant results.
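At serving time the reranking step boils down to something like this simplified numpy sketch (embed the query, score it against precomputed GIF embeddings, return the best; the caching and GIPHY fallback logic is omitted):

```python
# Simplified sketch of serving: score the query's embedding against precomputed GIF embeddings
# and return the highest-scoring GIFs.
import numpy as np

gif_embeddings = np.random.randn(100, 1024)     # stand-in for precomputed/cached GIF vectors
gif_embeddings /= np.linalg.norm(gif_embeddings, axis=1, keepdims=True)

query_embedding = np.random.randn(1024)         # output of the RNN + sentence embedding matrix
query_embedding /= np.linalg.norm(query_embedding)

scores = gif_embeddings @ query_embedding       # cosine similarity to every cached GIF
top = np.argsort(scores)[::-1][:5]              # indices of the 5 best-matching GIFs
print(top, scores[top])
```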
Well that’s about it! Have fun playing around with it — and please share cool results with #DeepGIF