
Title: XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

Authors: Biao Wu, Yutong Xie, Zeyu Zhang, Minh Hieu Phan, Qi Chen, Ling Chen, Qi Wu

Published: Jul 28, 2024

Link: https://arxiv.org/abs/2407.19546

Summary (Generated by Microsoft Copilot):

Introduction:

  • The paper introduces XLIP, a framework for medical language-image pre-training using cross-modal attention masked modeling.

Challenges:

  • Scarcity of medical data makes accurate reconstruction of pathological features difficult.
  • Existing methods typically exploit either paired or unpaired image-text data, but not both.

Methods:

  • Attention-masked image modeling (AttMIM) and entity-driven masked language modeling (EntMLM) enhance feature learning by masking disease-relevant image patches and medical-entity tokens (a masking sketch follows this list).
  • Utilizes both paired and unpaired data, with disease-kind prompts guiding the masking.
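
A minimal sketch of the AttMIM idea, assuming PyTorch-style tensors: score each image patch by the cross-modal attention it receives from text tokens (e.g. disease prompts), then mask the top-scoring patches. The function name, shapes, and mask ratio are illustrative assumptions, not the authors' implementation.

```python
import torch

def attention_guided_mask(patch_feats, text_feats, mask_ratio=0.5):
    """Mask the image patches that text tokens attend to most.

    patch_feats: (B, N, D) image patch embeddings
    text_feats:  (B, T, D) text token embeddings (e.g. disease prompts)
    Returns a boolean mask of shape (B, N); True = patch is masked.
    """
    # Cross-modal attention: how strongly each text token attends to each patch.
    attn = torch.softmax(
        text_feats @ patch_feats.transpose(1, 2) / patch_feats.shape[-1] ** 0.5,
        dim=-1,
    )  # (B, T, N)
    # Average over text tokens to get one relevance score per patch.
    patch_scores = attn.mean(dim=1)  # (B, N)
    # Mask the top-scoring (most disease-relevant) patches.
    num_masked = int(mask_ratio * patch_feats.shape[1])
    top_idx = patch_scores.topk(num_masked, dim=-1).indices
    mask = torch.zeros_like(patch_scores, dtype=torch.bool)
    mask.scatter_(1, top_idx, True)
    return mask
```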

Novelties:

  • Cross-modal attention masking that steers reconstruction toward pathological features.
  • A blending masking strategy that integrates attention-guided and prompt-driven masking (sketched after this list).
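
One plausible reading of the blending strategy, sketched below: take the union of the attention-guided and prompt-driven masks, then randomly top up or trim each sample to a target mask ratio. The blending rule and ratio here are assumptions; the paper defines its own combination.

```python
import torch

def blend_masks(attn_mask, prompt_mask, target_ratio=0.5):
    """attn_mask, prompt_mask: (B, N) boolean patch masks; returns (B, N)."""
    blended = attn_mask | prompt_mask  # start from the union of both strategies
    B, N = blended.shape
    target = int(target_ratio * N)
    for b in range(B):
        n_masked = int(blended[b].sum())
        if n_masked < target:
            # Top up with randomly chosen unmasked patches.
            free = (~blended[b]).nonzero(as_tuple=True)[0]
            extra = free[torch.randperm(len(free))[: target - n_masked]]
            blended[b, extra] = True
        elif n_masked > target:
            # Drop randomly chosen masked patches.
            used = blended[b].nonzero(as_tuple=True)[0]
            drop = used[torch.randperm(len(used))[: n_masked - target]]
            blended[b, drop] = False
    return blended
```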

Results:

  • Achieves state-of-the-art (SOTA) performance in both zero-shot and fine-tuned classification across five datasets (a zero-shot sketch follows).
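
Zero-shot classification in this setting is typically CLIP-style: score a normalized image embedding against text embeddings of class-name prompts. The sketch below uses placeholder encoders and prompt wording, not XLIP's released interface.

```python
import torch

def zero_shot_classify(image_feat, class_names, text_encoder):
    """image_feat: (D,) L2-normalized image embedding.
    text_encoder: callable mapping a list of strings to (C, D) embeddings.
    Returns per-class probabilities of shape (C,)."""
    prompts = [f"a chest X-ray showing {name}" for name in class_names]
    text_feats = text_encoder(prompts)                      # (C, D)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feats.T              # scaled cosine similarity
    return logits.softmax(dim=-1)
```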

Performances:

  • Outperforms existing medical VLP models such as MedKLIP, GLoRIA, and CheXzero across a range of medical image classification tasks.

Limitations:

  • The paper does not explicitly state limitations, but data scarcity and model complexity remain implied challenges.

Discussion:

  • The proposed XLIP framework shows significant improvements in medical representation learning and classification, highlighting the potential of advanced medical VLP methods.