Title: BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Authors: Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon

Published: Mar 2, 2023

Link: https://arxiv.org/abs/2303.00915

Summary (Generated by Microsoft Copilot):

Introduction:

  • BiomedCLIP is a multimodal biomedical foundation model pretrained on 15 million image-text pairs from scientific articles, aimed at enhancing biomedical vision-language processing.

Challenges:

  • Existing biomedical datasets are small and lack diversity, often focusing on specific image types like chest X-rays, limiting generalizability.

Methods:

  • The PMC-15M dataset was built by extracting figure-caption pairs from 4.4 million scientific articles in PubMed Central, yielding 15 million pairs that span a diverse range of biomedical image types.
  • BiomedCLIP adapts the CLIP recipe to the biomedical domain, pairing a PubMedBERT text encoder with a Vision Transformer (ViT) image encoder and pretraining them contrastively on PMC-15M (sketched below).
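  • The core pretraining objective is the CLIP-style symmetric contrastive (InfoNCE) loss over matched image-caption pairs in a batch. The PyTorch sketch below is a minimal illustration of that objective; the function name, tensor shapes, and temperature value are assumptions for illustration, not details taken from the paper.

      import torch
      import torch.nn.functional as F

      def clip_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
          """Symmetric contrastive loss for a batch of matched image-text pairs.

          image_emb / text_emb: [batch, dim] embeddings from the image (ViT)
          and text (PubMedBERT) encoders; names and shapes are illustrative.
          """
          # L2-normalize so the dot product equals cosine similarity
          image_emb = F.normalize(image_emb, dim=-1)
          text_emb = F.normalize(text_emb, dim=-1)

          # Pairwise similarity matrix scaled by the temperature
          logits = image_emb @ text_emb.t() / temperature

          # The i-th image in the batch matches the i-th caption
          targets = torch.arange(logits.size(0), device=logits.device)

          # Average the image-to-text and text-to-image cross-entropy terms
          loss_i2t = F.cross_entropy(logits, targets)
          loss_t2i = F.cross_entropy(logits.t(), targets)
          return 0.5 * (loss_i2t + loss_t2i)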

Novelties:

  • PMC-15M is the largest biomedical multimodal dataset to date, substantially larger and more diverse in image types than prior datasets such as MIMIC-CXR.
  • Domain-specific adaptations for both text and image processing.

Results:

  • BiomedCLIP achieved state-of-the-art performance across a broad range of biomedical tasks, including cross-modal (image-text) retrieval, zero-shot and fine-tuned image classification, and visual question answering; a zero-shot classification sketch follows.
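  • For context, the released checkpoint can be applied to zero-shot classification in the usual CLIP manner: encode the image and a set of candidate label prompts, then rank labels by cosine similarity. The sketch below assumes the open_clip library and the Hugging Face model id microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224; the model id, image path, and candidate labels are illustrative, so check the official release for exact usage.

      import torch
      from PIL import Image
      import open_clip

      # Model id assumed; verify against the official BiomedCLIP release
      MODEL_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
      model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
      tokenizer = open_clip.get_tokenizer(MODEL_ID)
      model.eval()

      # Candidate labels are illustrative; any free-text prompts can be used
      labels = ["chest X-ray", "brain MRI", "hematoxylin and eosin histopathology"]
      image = preprocess(Image.open("example_image.png")).unsqueeze(0)  # hypothetical path
      texts = tokenizer([f"this is a photo of a {label}" for label in labels])

      with torch.no_grad():
          # Encode both modalities and normalize to unit length
          img_feat = model.encode_image(image)
          txt_feat = model.encode_text(texts)
          img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
          txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

          # Softmax over scaled cosine similarities gives per-label probabilities
          probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

      for label, p in zip(labels, probs[0].tolist()):
          print(f"{label}: {p:.3f}")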

Performances:

  • Outperformed general-domain models such as CLIP as well as prior biomedical vision-language models (e.g., PubMedCLIP, BioViL), and remained strong in low-resource settings such as few-shot classification.

Limitations:

  • Composite figures in PMC-15M were not split into their constituent sub-figures, so a single caption may describe multiple panels.
  • Computational constraints limited the exploration of larger models and higher image resolutions.

Discussion:

  • BiomedCLIP demonstrates the importance of large-scale, diverse pretraining for biomedical vision-language models, paving the way for future research and applications.