MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Yunfei Xie1,*, Ce Zhou1,*, Lang Gao1,*, Juncheng Wu2,*, Xianhang Li2, Hong-Yu Zhou3, Sheng Liu4, Lei Xing4, James Zou4, Cihang Xie2, Yuyin Zhou2
1 Huazhong University of Science and Technology, 2 UC Santa Cruz, 3 Harvard University, 4 Stanford University

Our textual descriptions are multigranular, with more attributes than the radiology reports of the chest X-ray dataset MIMIC-CXR, the visual QA dataset SLAKE, or the radiology object caption dataset ROCO.

Abstract

This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, and detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Unlike existing approaches, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal large language models to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. This dataset can be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain. The dataset is publicly available at [MedTrinity-25M].
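For quick inspection, the snippet below is a minimal sketch of streaming the dataset with the Hugging Face `datasets` library. The repository id and the record schema are assumptions on our part; consult the dataset card for the exact fields.

```python
# Minimal sketch: stream MedTrinity-25M via Hugging Face `datasets`.
# NOTE: the repo id "UCSC-VLAA/MedTrinity-25M" and the field names are
# assumptions; check the dataset card for the authoritative schema.
from datasets import load_dataset

ds = load_dataset("UCSC-VLAA/MedTrinity-25M", split="train", streaming=True)

# Each record is an image-ROI-description triplet: an image together with
# a multigranular caption covering modality, ROI, and inter-regional relations.
sample = next(iter(ds))
print(sample.keys())
```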

MedTrinity-25M Pipeline

Data construction pipeline. 1) Data processing: extracting essential information from the collected data, including metadata integration to generate coarse captions, ROI localization, and medical knowledge collection. 2) Multigranular textual description generation: using this information to prompt an MLLM to generate fine-grained textual descriptions, which together form MedTrinity-25M. A hypothetical sketch of these two stages follows.
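To make the two stages concrete, here is a minimal, hypothetical sketch of the triplet-construction step. The callables (`locate_roi`, `retrieve`, `generate`) stand in for the domain-specific expert model, the knowledge-base retrieval, and the MLLM respectively; they are illustrative placeholders, not the released implementation.

```python
# Hypothetical sketch of the two-stage pipeline described above.
# The injected callables are placeholders for the expert grounding model,
# knowledge-base retrieval, and MLLM; they are not the released code.
from typing import Any, Callable, Tuple

def build_triplet(
    image: Any,
    coarse_caption: str,                      # from metadata integration
    locate_roi: Callable[[Any], dict],        # expert model: ROI box/mask
    retrieve: Callable[[str], str],           # knowledge-base lookup
    generate: Callable[[Any, str], str],      # MLLM text generation
) -> Tuple[Any, dict, str]:
    # 1) Data processing: ground the image and gather domain knowledge.
    roi = locate_roi(image)
    knowledge = retrieve(coarse_caption)

    # 2) Multigranular description generation: prompt the MLLM with the
    #    identified ROI as guidance (retrieval-augmented generation).
    prompt = (
        f"Coarse caption: {coarse_caption}\n"
        f"ROI: {roi}\n"
        f"Background knowledge: {knowledge}\n"
        "Describe the modality, disease/lesion type, the region of "
        "interest, and inter-regional relationships."
    )
    description = generate(image, prompt)
    return image, roi, description            # image-ROI-description triplet
```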

Statistical overview of MedTrinity-25M

(a) Modality distribution in MedTrinity-25M.

(b) Anatomical and biological structures in MedTrinity-25M.

(c) Data size comparison.

(d) Word cloud of disease statistics in MedTrinity-25M.

Acknowledgement

We thank the Microsoft Accelerate Foundation Models Research Program, the OpenAI Researcher Access Program, TPU Research Cloud (TRC) program, Google Cloud Research Credits program, AWS Cloud Credit for Research program, and Lambda Cloud for supporting our computing needs.

BibTeX


@misc{xie2024medtrinity25mlargescalemultimodaldataset,
  title={MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine},
  author={Yunfei Xie and Ce Zhou and Lang Gao and Juncheng Wu and Xianhang Li and Hong-Yu Zhou and Sheng Liu and Lei Xing and James Zou and Cihang Xie and Yuyin Zhou},
  year={2024},
  eprint={2408.02900},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.02900},
}