Rich Human Feedback for Text-to-Image Generation, CVPR2024 Best Paper Award
Overview
Paper: Liang et al., Rich Human Feedback for Text-to-Image Generation (CVPR 2024 Open Access or arXiv).
This paper is one of the CVPR 2024 Best Paper Award winners.
(Figures and tables in this post are from the original paper)
Novelties of the Paper
- They proposed a new dataset, “Rich Human Feedback on 18K generated images” (RichHF-18K), which consists of three types of annotations:
- Point annotations: implausibility/artifact regions and text-image misalignment regions in the generated images.
- Labeled prompt words: words in the prompt whose concepts are missing from or misrepresented in the generated image.
- Four scores: plausibility, text-image alignment, aesthetics, and an overall rating for each generated image.
- The RichHF-18K dataset is publicly available in their GitHub repository.
- They also proposed a new multimodal transformer architecture based on ViT and T5X, called “Rich Automatic Human Feedback” (RAHF). It takes a prompt together with an image generated from that prompt, then predicts (1) heatmaps of implausibility and misalignment regions on the input image, (2) the four scores for the input, and (3) the misaligned words in the input prompt (see the sketch after this list).
- They showed that RAHF outputs can improve other tasks, including finetuning an existing generative model with the predicted scores, and inpainting with the predicted heatmaps and scores.
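Below is a minimal sketch of what such a multimodal predictor could look like, assuming PyTorch. It is not the paper's implementation: the actual RAHF builds on a pretrained ViT and T5X, while both encoders here are tiny stand-ins, and every name and dimension is a hypothetical choice made only to illustrate the three output heads.

```python
import torch
import torch.nn as nn

class RAHFSketch(nn.Module):
    """Toy multimodal predictor with the three RAHF-style output heads."""

    def __init__(self, dim=256, vocab_size=32000):
        super().__init__()
        self.img_proj = nn.Linear(768, dim)             # stand-in for ViT patch features
        self.txt_embed = nn.Embedding(vocab_size, dim)  # stand-in for T5X token features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.heatmap_head = nn.Linear(dim, 2)  # (1) per-patch implausibility / misalignment logits
        self.score_head = nn.Linear(dim, 4)    # (2) plausibility, alignment, aesthetics, overall
        self.keyword_head = nn.Linear(dim, 1)  # (3) per-token "misaligned word" logit

    def forward(self, patch_feats, token_ids):
        # patch_feats: (B, P, 768) image patch features; token_ids: (B, T) tokenized prompt
        img = self.img_proj(patch_feats)
        txt = self.txt_embed(token_ids)
        fused = self.fusion(torch.cat([img, txt], dim=1))  # joint image-text attention
        num_patches = patch_feats.shape[1]
        img_out, txt_out = fused[:, :num_patches], fused[:, num_patches:]
        heatmaps = self.heatmap_head(img_out)        # (B, P, 2)
        scores = self.score_head(fused.mean(dim=1))  # (B, 4), pooled over all tokens
        keywords = self.keyword_head(txt_out)        # (B, T, 1)
        return heatmaps, scores, keywords

model = RAHFSketch()
outputs = model(torch.randn(2, 196, 768), torch.randint(0, 32000, (2, 12)))
```

This sketch only shows the overall data flow from an (image, prompt) pair to the three kinds of feedback; the paper's model predicts dense heatmaps at image resolution rather than per-patch logits.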
Performance Evaluation Methods
- They evaluated RAHF by comparing it against ResNet-50, PickScore, and CLIP baselines.
- The Pearson linear correlation coefficient (PLCC) and Spearman rank correlation coefficient (SRCC) are used for score evaluation. Mean squared error (MSE), CC, KLD, SIM, NSS, and AUC-Judd are used for heatmaps. Precision, recall, and F1 score are used for the text (misaligned keyword) evaluation; a small example follows this list.
- Human raters judged which generated image is better: one from the original Muse or one from a Muse finetuned with the RAHF score.
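As a small, self-contained illustration of the score and text metrics above (the numbers are made up; the heatmap metrics CC, KLD, SIM, NSS, and AUC-Judd come from the saliency-evaluation literature and are omitted here):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import precision_recall_fscore_support

# Score evaluation: predicted vs. human-annotated plausibility scores.
pred = np.array([0.8, 0.4, 0.9, 0.3, 0.6])
gt = np.array([0.7, 0.5, 1.0, 0.2, 0.6])
plcc, _ = pearsonr(pred, gt)   # linear correlation of the raw values
srcc, _ = spearmanr(pred, gt)  # rank correlation, robust to monotone rescaling
mse = np.mean((pred - gt) ** 2)

# Text evaluation: 1 = word labeled as misaligned, 0 = aligned.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")

print(f"PLCC={plcc:.3f} SRCC={srcc:.3f} MSE={mse:.3f}")
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```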
Discussions
- RAHF outperforms the other models on most metrics.
- RAHF was worse than ResNet-50 on the misalignment heatmap evaluation, possibly because the ground-truth misalignment regions are poorly defined.
- Quantitative human evaluation shows that, in more than 50% of the samples, the finetuned Muse was judged significantly or slightly better than the original one.
- They mentioned there are numerous ways to utilize the RichHF-18K dataset and the RAHF model; one simple example is sketched below.
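One such use: a predicted implausibility heatmap can be thresholded into a binary mask telling an inpainting model which region to regenerate. A minimal sketch, where the threshold and array shapes are my own illustrative assumptions, not values from the paper:

```python
import numpy as np

def heatmap_to_inpaint_mask(heatmap: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Flag pixels whose predicted implausibility exceeds the threshold."""
    return (heatmap >= threshold).astype(np.uint8)

heatmap = np.random.rand(512, 512)       # stand-in for a RAHF implausibility heatmap
mask = heatmap_to_inpaint_mask(heatmap)  # 1 = region to regenerate via inpainting
print(f"{mask.mean():.1%} of the image flagged for inpainting")
```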
What I learned
- AI is used to train other AI models (the prompts for finetuning Muse were generated by another model, the LLM PaLM 2).
- Top-level research not only presents its achievements but also points out future research directions for AI researchers.