We’re introducing a neural network called CLIP, which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. Although CLIP and a standard ImageNet-trained model can reach the same accuracy on the ImageNet test set, CLIP’s performance is much more representative of how it will fare on datasets that measure accuracy in different, non-ImageNet settings.
We describe CLIP in two parts:
Background and related work
Approach
Background and related work
CLIP (Contrastive Language–Image Pre-training) builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade [8], but until recently it was mostly studied in computer vision as a way of generalizing to unseen object categories [9, 10]. A critical insight was to leverage natural language as a flexible prediction space to enable generalization and transfer.
Approach
In order to solve this task, our intuition is that CLIP models will need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. For instance, if the task of a dataset is classifying photos of dogs vs. cats, we check for each image whether a CLIP model predicts that the text description “a photo of a dog” or “a photo of a cat” is more likely to be paired with it.
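As an illustration, here is a minimal sketch of that zero-shot setup using the open-source clip Python package released alongside the model; the ViT-B/32 checkpoint and the image path are illustrative choices, not requirements.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained CLIP model and its matching image preprocessing pipeline.
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate captions double as class labels for zero-shot classification.
prompts = ["a photo of a dog", "a photo of a cat"]
text = clip.tokenize(prompts).to(device)

# "example.jpg" is a placeholder path for any image you want to classify.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # logits_per_image holds the similarity of the image to each caption.
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(prompts, probs[0])))
```

The caption with the higher probability is taken as the prediction; supplying a different set of captions extends the same model to a new classification task without any retraining.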
CLIP is part of a group of papers revisiting learning visual representations from natural language supervision in the past year. This line of work uses more modern architectures like the Transformer and includes VirTex, which explored autoregressive language modeling, ICMLM, which investigated masked language modeling, and ConVIRT, which studied the same contrastive objective we use for CLIP but in the field of medical imaging.
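For a concrete picture of that contrastive objective, below is a rough PyTorch sketch of the symmetric loss: embeddings from the image and text encoders are normalized, pairwise similarities over a batch are scaled by a temperature (which CLIP learns as a parameter; a fixed value is used here for simplicity), and cross-entropy pushes each image toward its own caption and vice versa. The function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) pairs.

    Both inputs have shape [batch, embed_dim]; row i of each tensor is
    assumed to come from the same image-text pair.
    """
    # Normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits for every image against every text.
    logits = image_features @ text_features.t() / temperature

    # The correct pairing for row i is column i.
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```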