Overview
Meteor visual instruction tuning dataset 2.1M is a 2.1-million-sample visual instruction tuning dataset built with the help of large multimodal models.
The dataset aims to provide a rich resource for training models on a variety of visual tasks, enhancing their ability to understand and reason about images in diverse contexts. It is meticulously curated to cover a wide array of visual information and corresponding question-answer pairs, making it a valuable tool for advancing visual instruction tuning and related research.
Key Components
Question-Answer Pairs:
• Fundamental Image Understanding & Common-Sense Knowledge: 664K pairs from ShareGPT4V-Instruct, including LLaVA-Instruct-665K.
• Non-Object Concepts & Cognitive Reasoning: 27K pairs from DocVQA, ChartQA, DVQA, and AI2D.
• Multi-Discipline Tasks & Integrated Abilities: 574K/27K pairs from DeepForm, InfoVQA, DocVQA, KleisterCharity, TabFact, TextVQA, WikiTable, TextCaps, and VisualMRC.
Image Descriptions:
• Common Categories: 91K descriptions for images from LAION, CC, SBU, MS-COCO, TextCaps, and various web sources, depicting landmarks, animals, celebrities, art, text, and nature.
Math Knowledge:
• Math-related Question-Answer Pairs: 177K pairs from GLLaVA-Align/Instruct and 3K from Math-Vision.
• Text-Only Math Samples: 566K from Math-Instruct/Plus.
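To make the per-sample structure concrete, here is a minimal inspection sketch. It assumes the data is distributed as LLaVA-style JSON, where each record carries an id, an optional image path, and a list of conversation turns; the file name and field names below are illustrative assumptions, not the dataset's confirmed schema.

```python
import json

# Hypothetical file name and schema: the records are assumed to follow the
# common LLaVA-style layout (id / image / conversations).
with open("meteor_2.1m.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["id"])         # unique sample identifier (assumed field)
print(sample.get("image"))  # relative image path; expected to be absent for
                            # text-only math samples such as Math-Instruct/Plus
for turn in sample["conversations"]:
    # Turns are assumed to alternate between the "human" question
    # and the "gpt" answer, as in LLaVA-format instruction data.
    print(f'{turn["from"]}: {turn["value"][:80]}')
```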
Gathered Dataset Layout