In recent years, the AI field has made tremendous progress in developing AI systems that can learn from massive amounts of carefully labeled data. This paradigm of supervised learning has a proven track record for training specialist models that perform extremely well on the task they were trained to do. Unfortunately, there’s a limit to how far the field of AI can go with supervised learning alone.
Supervised learning is a bottleneck for building more intelligent generalist models that can do multiple tasks and acquire new skills without massive amounts of labeled data. Practically speaking, it’s impossible to label everything in the world. There are also some tasks for which there’s simply not enough labeled data, such as training translation systems for low-resource languages. If AI systems can glean a deeper, more nuanced understanding of reality beyond what’s specified in the training data set, they’ll be more useful and ultimately bring AI closer to human-level intelligence.
As babies, we learn how the world works largely by observation. We form generalized predictive models about objects in the world by learning concepts such as object permanence and gravity. Later in life, we observe the world, act on it, observe again, and build hypotheses to explain how our actions change our environment by trial and error.
A working hypothesis is that generalized knowledge about the world, or common sense, forms the bulk of biological intelligence in both humans and animals. This common sense ability is taken for granted in humans and animals, but has remained an open challenge in AI research since its inception. In a way, common sense is the dark matter of artificial intelligence.
Common sense helps people learn new skills without requiring massive amounts of teaching for every single task. For example, if we show just a few drawings of cows to small children, they’ll eventually be able to recognize any cow they see. By contrast, AI systems trained with supervised learning require many examples of cow images and might still fail to classify cows in unusual situations, such as lying on a beach. How is it that humans can learn to drive a car in about 20 hours of practice with very little supervision, while fully autonomous driving still eludes our best AI systems trained with thousands of hours of data from human drivers? The short answer is that humans rely on their previously acquired background knowledge of how the world works.
How do we get machines to do the same?
We believe that self-supervised learning (SSL) is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems.
Self-supervised learning enables AI systems to learn from orders of magnitude more data, which is important to recognize and understand patterns of more subtle, less common representations of the world. Self-supervised learning has long had great success in advancing the field of natural language processing (NLP), including the Collobert-Weston 2008 model, Word2Vec, GloVE, fastText, and, more recently, BERT, RoBERTa, XLM-R, and others. Systems pretrained this way yield considerably higher performance than when solely trained in a supervised manner.
Our latest research project SEER leverages SwAV and other methods to pretrain a large network on a billion random unlabeled images, yielding top accuracy on a diverse set of vision tasks. This progress demonstrates that self-supervised learning can excel at CV tasks in complex, real-world settings as well.
Today, we’re sharing details on why self-supervised learning may be helpful in unlocking the dark matter of intelligence — and the next frontier of AI. We’re also highlighting what we believe are some of the most promising new directions of energy-based models for prediction in the presence of uncertainty, joint embedding methods and latent-variable architectures for self-supervised learning and reasoning in AI systems.
Self-supervised learning obtains supervisory signals from the data itself, often leveraging the underlying structure in the data. The general technique of self-supervised learning is to predict any unobserved or hidden part (or property) of the input from any observed or unhidden part of the input. For example, as is common in NLP, we can hide part of a sentence and predict the hidden words from the remaining words. We can also predict past or future frames in a video (hidden data) from current ones (observed data). Since self-supervised learning uses the structure of the data itself, it can make use of a variety of supervisory signals across co-occurring modalities (e.g., video and audio) and across large data sets — all without relying on labels.
In self-supervised learning, the system is trained to predict hidden parts of the input (in gray) from visible parts of the input (in green).
As a result of the supervisory signals that inform self-supervised learning, the term “self-supervised learning” is more accepted than the previously used term “unsupervised learning.” Unsupervised learning is an ill-defined and misleading term that suggests that the learning uses no supervision at all. In fact, self-supervised learning is not unsupervised, as it uses far more feedback signals than standard supervised and reinforcement learning methods do.
Self-supervised learning has had a particularly profound impact on NLP, allowing us to train models such as BERT, RoBERTa, XLM-R, and others on large unlabeled text data sets and then use these models for downstream tasks. These models are pretrained in a self-supervised phase and then fine-tuned for a particular task, such as classifying the topic of a text. In the self-supervised pretraining phase, the system is shown a short text (typically 1,000 words) in which some of the words have been masked or replaced. The system is trained to predict the words that were masked or replaced. In doing so, the system learns to represent the meaning of the text so that it can do a good job at filling in “correct” words, or those that make sense in the context.
Predicting missing parts of the input is one of the more standard tasks for SSL pretraining. To complete a sentence such as “The (blank) chases the (blank) in the savanna,” the system must learn that lions or cheetahs can chase antelope or wildebeests, but that cats chase mice in the kitchen, not the savanna. As a consequence of the training, the system learns to represent the meaning of words, the syntactic role of words, and the meaning of entire texts.
These techniques, however, can’t be easily extended to new domains, such as CV. Despite promising early results, SSL has not yet brought about the same improvements in computer vision that we have seen in NLP (though this will change).
The main reason is that it is considerably more difficult to represent uncertainty in the prediction for images than it is for words. When the missing word cannot be predicted exactly (is it “lion” or “cheetah”?), the system can associate a score or a probability to all possible words in the vocabulary: high score for “lion,” “cheetah,” and a few other predators, and low scores for all other words in the vocabulary.
Training models at this scale also required a model architecture that was efficient in terms of both runtime and memory, without compromising on accuracy. Fortunately, a recent innovation by FAIR in the realm of architecture design led to a new model family called RegNets that perfectly fit these needs. RegNet models are ConvNets capable of scaling to billions or potentially even trillions of parameters, and can be optimized to fit different runtime and memory limitations.
But we do not know how to efficiently represent uncertainty when we predict missing frames in a video or missing patches in an image. We cannot list all possible video frames and associate a score to each of them, because there is an infinite number of them. While this problem has limited the performance improvement from SSL in vision, new techniques SSL techniques such as SwAV are starting to beat accuracy records in vision tasks. This is best demonstrated by the SEER system that uses a large convolutional network trained with billions of examples.
To better understand this challenge, we first need to understand the prediction uncertainty and the way it’s modeled in NLP compared with CV. In NLP, predicting the missing words involves computing a prediction score for every possible word in the vocabulary. While the vocabulary itself is large and predicting a missing word involves some uncertainty, it’s possible to produce a list of all the possible words in the vocabulary together with a probability estimate of the words’ appearance at that location. Typical machine learning systems do so by treating the prediction problem as a classification problem and computing scores for each outcome using a giant so-called softmax layer, which transforms raw scores into a probability distribution over words. With this technique, the uncertainty of the prediction is represented by a probability distribution over all possible outcomes, provided that there is a finite number of possible outcomes.
In CV, on the other hand, the analogous task of predicting “missing” frames in a video, missing patches in an image, or missing segment in a speech signal involves a prediction of high-dimensional continuous objects rather than discrete outcomes. There are an infinite number of possible video frames that can plausibly follow a given video clip. It is not possible to explicitly represent all the possible video frames and associate a prediction score to them. In fact, we may never have techniques to represent suitable probability distributions over high-dimensional continuous spaces, such as the set of all possible video frames.
This seems like an intractable problem.
There is a way to think about SSL within the unified framework of an energy-based model (EBM). An EBM is a trainable system that, given two inputs, x and y, tells us how incompatible they are with each other. For example, x could be a short video clip, and y another proposed video clip. The machine would tell us to what extent y is a good continuation for x. To indicate the incompatibility between x and y, the machine produces a single number, called an energy. If the energy is low, x and y are deemed compatible; if it is high, they are deemed incompatible.
An energy-based model (EBM) measures the compatibility between an observation x and a proposed prediction y. If x and y are compatible, the energy is a small number; if they are incompatible, the energy is a larger number.
Training an EBM consists of two parts: (1) showing it examples of x and y that are compatible and training it to produce a low energy, and (2) finding a way to ensure that for a particular x, the y values that are incompatible with x produce a higher energy than the y values that are compatible with x. Part one is simple, but part two is where the difficulty lies.
For image recognition, our model takes two images, x and y, as inputs. If x and y are slightly distorted versions of the same image, the model is trained to produce a low energy on its output. For example, x could be a photo of a car, and y a photo of the same car that was taken from a slightly different location at a different time of day, so that the car in y is shifted, rotated, larger, smaller, and displaying slightly different colors and shadows than the car in x.
Joint embedding, Siamese networks
A particular well-suited deep learning architecture to do so is the so-called Siamese networks or joint embedding architecture. The idea goes back to papers from Geoff Hinton’s lab and Yann LeCun’s group in the early 1990s (here and here) and mid-2000s (here, here, and here). It was relatively ignored for a long time but has enjoyed a revival since late 2019. A joint embedding architecture is composed of two identical (or almost identical) copies of the same network. One network is fed with x and the other with y. The networks produce output vectors called embeddings, which represent x and y. A third module, joining the networks at the head, computes the energy as the distance between the two embedding vectors. When the model is shown distorted versions of the same image, the parameters of the networks can easily be adjusted so that their outputs move closer together. This will ensure that the network will produce nearly identical representations (or embedding) of an object, regardless of the particular view of that object.
Joint embedding architecture. The function C at the top produces a scalar energy that measures the distance between the representation vectors (embeddings) produced by two identical twin networks sharing the same parameters (w). When x and y are slightly different versions of the same image, the system is trained to produce a low energy, which forces the model to produce similar embedding vectors for the two images. The difficult part is to train the model so that it produces high energy (i.e., different embeddings) for images that are different.
The difficulty is to make sure that the networks produce high energy, i.e. different embedding vectors, when x and y are different images. Without a specific way to do so, the two networks could happily ignore their inputs and always produce identical output embeddings. This phenomenon is called a collapse. When a collapse occurs, the energy is not higher for nonmatching x and y than it is for matching x and y.
There are two categories of techniques to avoid collapse: contrastive methods and regularization methods.
Contrastive energy-based SSL
Contrastive methods are based on the simple idea of constructing pairs of x and y that are not compatible, and adjusting the parameters of the model so that the corresponding output energy is large.
Training an EBM with a contrastive method consists in simultaneously pushing down on the energy of compatible (x,y) pairs from the training set, indicated by the blue dots, and pushing up on the energy of well chosen (x,y) pairs that are incompatible, symbolized by the green dots. In this simple example, x and y are both scalars, but in real situations, x and y could be an image or a video with millions of dimensions. Coming up with incompatible pairs that will shape the energy in suitable ways is challenging and expensive computationally.
The method used to train NLP systems by masking or substituting some input words belongs to the category of contrastive methods. But they don’t use the joint embedding architecture. Instead, they use a predictive architecture in which the model directly produces a prediction for y. One starts for a complete segment of text y, then corrupts it, e.g., by masking some words to produce the observation x. The corrupted input is fed to a large neural network that is trained to reproduce the original text y. An uncorrupted text will be reconstructed as itself (low reconstruction error), while a corrupted text will be reconstructed as an uncorrupted version of itself (large reconstruction error). If one interprets the reconstruction error as an energy, it will have the desired property: low energy for “clean” text and higher energy for “corrupted” text.
The general technique of training a model to restore a corrupted version of an input is called denoising auto-encoder. While early forms of this idea go back to the 1980s, it was revived in 2008 by Pascal Vincent and colleagues at the University of Montréal, introduced in the context of NLP by Collobert and Weston, and popularized by the BERT paper from our friends at Google.
A masked language model, which is an instance of denoising auto-encoder, itself an instance of contrastive self-supervised learning. Variable y is a text segment; x is a version of the text in which some words have been masked. The network is trained to reconstruct the uncorrupted text.
As we pointed out earlier, a predictive architecture of this type can produce only a single prediction for a given input. Since the model must be able to predict multiple possible outcomes, the prediction is not a single set of words but a series of scores for every word in the vocabulary for each missing word location.
But we cannot use this trick for images because we cannot enumerate all possible images. Is there a solution to this problem? The short answer is no. There are interesting ideas in this direction, but they have not yet led to results that are as good as joint embedding architectures. One interesting avenue is latent-variable predictive architectures.
A latent-variable predictive architecture. Given an observation x, the model must be able to produce a set of multiple compatible predictions symbolized by an S-shaped ribbon in the diagram. As the latent variable z varies within a set, symbolized by a gray square, the output varies over the set of plausible predictions.
Latent-variable predictive models contain an extra input variable (z). It is called latent because its value is never observed. With a properly trained model, as the latent variable varies over a given set, the output prediction varies over the set of plausible predictions compatible with the input x.
Latent-variable models can be trained with contrastive methods. A good example of this is a generative adversarial network (GAN). The critic (or discriminator) can be seen as computing an energy indicating whether the input y looks good. The generator network is trained to produce contrastive samples to which the critic is trained to associate high energy.
But contrastive methods have a major issue: They are very inefficient to train. In high-dimensional spaces such as images, there are many ways one image can be different from another. Finding a set of contrastive images that cover all the ways they can differ from a given image is a nearly impossible task. To paraphrase Leo Tolstoy’s Anna Karenina: “Happy families are all alike; every unhappy family is unhappy in its own way.” This applies to any family of high-dimensional objects, it seems.
What if it were possible to make sure the energy of incompatible pairs is higher than that of compatible pairs without explicitly pushing up on the energy of many incompatible pairs?
Non-contrastive energy-based SSL
Non-contrastive methods applied to joint embedding architectures is possibly the hottest topic in SSL for vision at the moment. The domain is still largely unexplored, but it seems very promising.
Non-contrastive methods for joint-embedding include DeeperCluster, ClusterFit, MoCo-v2, SwAV, SimSiam, Barlow Twins, BYOL from DeepMind, and a few others. They use various tricks, such as computing virtual target embeddings for groups of similar images (DeeperCluster, SwAV, SimSiam) or making the two joint embedding architectures slightly different through the architecture or the parameter vector (BYOL, MoCo). Barlow Twins tries to minimize the redundancy between the individual components of the embedding vectors.
Perhaps a better alternative in the long run will be to devise non-contrastive methods with latent-variable predictive models. The main obstacle is that they require a way to minimize the capacity of the latent variable. The volume of the set over which the latent variable can vary limits the volume of outputs that take low energy. By minimizing this volume, one automatically shapes the energy in the right way.
A successful example of such a method is the Variational Auto-Encoder (VAE), in which the latent variable is made “fuzzy”, which limits its capacity. But VAE have not yet been shown to produce good representations for downstream visual tasks. Another successful example is sparse modeling, but its use has been limited to simple architectures. No perfect recipe seems to exist to limit the capacity of latent variables.
The challenge of the next few years may be to devise non-contrastive methods for latent-variable energy-based model that successfully produce good representations of image, video, speech, and other signals and yield top performance in downstream supervised tasks without requiring large amounts of labeled data.
Most recently, we’ve created and open sourced a new billion-parameter self-supervised CV model called SEER that’s proven to work efficiently with complex, high-dimensional image data. It is based on the SwAV method applied to a convolutional network architecture (ConvNet) and can be trained from a vast number of random images without any metadata or annotations. The ConvNet is large enough to capture and learn every visual concept from this large and complex data. After pretraining on a billion random, unlabeled and uncurated public Instagram images, and supervised fine-tuning on ImageNet, SEER outperformed the most advanced, state-of-the-art self-supervised systems, reaching 84.2 percent top-1 accuracy on ImageNet.
These results show that we can bring the self-supervised learning paradigm shift to computer vision.
At Facebook, we’re not just advancing self-supervised learning techniques across many domains through fundamental, open scientific research, but we’re also applying this leading-edge work in production to quickly improve the accuracy of content understanding systems in our products that keep people safe on our platforms.
Self-supervision research, like our pretrained language model XLM, is accelerating important applications at Facebook today — including proactive detection of hate speech. And we’ve deployed XLM-R, a model that leverages our RoBERTa architecture, to improve our hate speech classifiers in multiple languages across Facebook and Instagram.This will enable hate speech detection even in languages for which there is very little training data.
We’re encouraged by the progress of self-supervision in recent years, though there’s still a long way to go until this method can help us uncover the dark matter of AI intelligence. Self-supervision is one step on the path to human-level intelligence, but there are surely many steps that lie behind this one. Long-term progress will be cumulative. That’s why we’re committed to working collaboratively with the broader AI community to achieve our goal of, one day, building machines with human-level intelligence. Our research has been made publicly available and published at top conferences. And we’ve organized workshops and released libraries to help accelerate the research in this area.