Learning Transferable Visual Models From Natural Language Supervision
Title: Learning Transferable Visual Models From Natural Language Supervision
Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
Published: Feb 26, 2021
Link: https://arxiv.org/abs/2103.00020
Summary (Generated by Microsoft Copilot):
Introduction:
- The paper explores learning visual models from natural language supervision, aiming to overcome the limitations of traditional computer vision systems that rely on fixed object categories.
Challenges:
- Traditional models require additional labeled data for new visual concepts, limiting their generality and usability.
Methods:
- The authors propose a pre-training task of predicting which caption matches which image, using a dataset of 400 million (image, text) pairs collected from the internet.
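A minimal sketch of this contrastive pre-training objective, adapted from the pseudocode in the paper: given a batch of N matching (image, text) pairs, the model is trained to identify which of the N x N possible pairings actually occurred. The PyTorch function below assumes pre-computed image and text embeddings; the function name and temperature value are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities for a batch of
    N matching (image, text) pairs (CLIP-style contrastive objective)."""
    # L2-normalize embeddings so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] matrix of scaled pairwise similarities; the diagonal holds
    # the scores of the true (image, text) pairs.
    logits = image_features @ text_features.t() / temperature

    # The correct caption for image i is text i, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # image -> text
    loss_texts = F.cross_entropy(logits.t(), targets)   # text -> image
    return (loss_images + loss_texts) / 2
```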
Novelties:
- The approach enables zero-shot transfer to downstream tasks by referencing learned visual concepts through natural language.
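As a rough illustration of this zero-shot transfer, class names can be turned into text prompts (e.g. "a photo of a dog"), embedded with the text encoder, and compared against the image embedding; the highest-similarity prompt gives the prediction. The `encode_text` / `encode_image` interface below is an assumed simplification (the released CLIP code expects tokenized inputs), so treat this as a sketch rather than the library's API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(model, image, class_names):
    """Assign `image` to one of `class_names` using only natural-language
    prompts, with no dataset-specific training."""
    # Build one prompt per class and embed all prompts with the text encoder.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_features = F.normalize(model.encode_text(prompts), dim=-1)   # [C, D]

    # Embed the image and score it against every class prompt.
    image_features = F.normalize(model.encode_image(image), dim=-1)   # [1, D]
    similarities = image_features @ text_features.t()                 # [1, C]

    # The highest-scoring prompt determines the predicted class.
    return class_names[similarities.argmax(dim=-1).item()]
```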
Results:
- The model transfers to over 30 existing computer vision datasets and is often competitive with a fully supervised baseline, without needing any dataset-specific training.
Performances:
- Zero-shot CLIP matches the accuracy of the original fully supervised ResNet-50 on ImageNet without using any of its 1.28 million labeled training examples.
Limitations:
- The paper acknowledges that zero-shot performance on several common benchmarks still falls well below that of task-specific, fully supervised models.
Discussion:
- The findings suggest significant potential for scalable pre-training methods using natural language supervision in computer vision.