XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Title: XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Authors: Biao Wu, Yutong Xie, Zeyu Zhang, Minh Hieu Phan, Qi Chen, Ling Chen, Qi Wu
Published: Jul 28, 2024
Link: https://arxiv.org/abs/2407.19546
Summary (Generated by Microsoft Copilot):
Introduction:
- The paper introduces XLIP, a framework for medical language-image pre-training using cross-modal attention masked modeling.
Challenges:
- Scarcity of medical data makes accurate reconstruction of pathological features difficult.
- Existing methods often use only paired or unpaired data, not both.
Methods:
- Attention-masked image modeling (AttMIM) and entity-driven masked language modeling (EntMLM) enhance cross-modal feature learning (see the sketch after this list).
- Utilizes both paired and unpaired data via disease-kind prompts.
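The summary describes AttMIM only at a high level; as a rough illustration, the minimal PyTorch sketch below shows one way attention-guided patch masking could be realised. The function name `attention_guided_mask`, the tensor shapes, and the max-over-tokens relevance score are illustrative assumptions, not the paper's implementation.

```python
import torch

def attention_guided_mask(patch_feats, text_feats, mask_ratio=0.5):
    """Pick image patches to mask using cross-modal attention (AttMIM-style sketch).

    patch_feats: (B, N, D) image patch embeddings
    text_feats:  (B, T, D) embeddings of report tokens or disease-kind prompts
    Returns a (B, N) boolean mask; True marks patches hidden before reconstruction.
    """
    scale = patch_feats.size(-1) ** 0.5
    # Cross-modal attention scores between every patch and every text token.
    attn = torch.einsum("bnd,btd->bnt", patch_feats, text_feats) / scale
    # Per-patch relevance: how strongly the patch responds to any text token.
    relevance = attn.softmax(dim=-1).amax(dim=-1)            # (B, N)
    num_mask = max(1, int(mask_ratio * patch_feats.size(1)))
    # Hide the most text-relevant patches, forcing the model to reconstruct
    # the regions most likely to carry pathology.
    top_idx = relevance.topk(num_mask, dim=1).indices
    mask = torch.zeros(patch_feats.shape[:2], dtype=torch.bool,
                       device=patch_feats.device)
    mask.scatter_(1, top_idx, True)
    return mask
```

For example, with 196 ViT patches and 32 prompt tokens, `attention_guided_mask(torch.randn(2, 196, 768), torch.randn(2, 32, 768))` hides the 98 most prompt-relevant patches per image.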
Novelties:
- Cross-modal attention masking for better learning of pathological features.
- A blending masking strategy that integrates attention-guided and prompt-driven masking (a sketch follows this list).
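To make the blending idea concrete, the minimal sketch below combines an attention-guided mask and a prompt-driven mask by taking their union and then randomly trimming or topping it up to a fixed masking budget. The function `blend_masks` and the exact budgeting rule are illustrative assumptions, not the paper's recipe.

```python
import torch

def blend_masks(attn_mask, prompt_mask, mask_ratio=0.5):
    """Blend attention-guided and prompt-driven masks into one masking budget.

    attn_mask, prompt_mask: (B, N) boolean masks produced by the two cues.
    Returns a (B, N) boolean mask with roughly mask_ratio * N patches set.
    """
    B, N = attn_mask.shape
    budget = max(1, int(mask_ratio * N))
    blended = attn_mask | prompt_mask        # start from the union of both cues
    for b in range(B):
        on = blended[b].nonzero(as_tuple=False).squeeze(1)
        off = (~blended[b]).nonzero(as_tuple=False).squeeze(1)
        if on.numel() > budget:
            # Too many candidates: keep a random subset within the budget.
            keep = on[torch.randperm(on.numel(), device=on.device)[:budget]]
            blended[b].zero_()
            blended[b][keep] = True
        elif on.numel() < budget:
            # Too few candidates: top up with randomly chosen extra patches.
            extra = off[torch.randperm(off.numel(), device=off.device)[:budget - on.numel()]]
            blended[b][extra] = True
    return blended
```

The random trim/top-up keeps the overall masking ratio constant so that reconstruction difficulty stays comparable across batches, while still prioritising patches flagged by either cue.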
Results:
- Achieves state-of-the-art (SOTA) performance in zero-shot and fine-tuning classification on five datasets.
Performances:
- Outperforms existing models such as MedKLIP, GLoRIA, and CheXzero across a range of medical image classification tasks.
Limitations:
- The paper does not explicitly mention limitations, but challenges in data scarcity and model complexity are implied.
Discussion:
- The proposed XLIP framework shows clear improvements in the representation and classification of medical data, highlighting the potential of more advanced medical vision-language pre-training (VLP) methods.