Overview
The TroL visual instruction tuning dataset (2.3M) is a collection of samples for instruction tuning large multimodal models.
The dataset aims to provide a balanced and diverse set of samples, ensuring broad coverage across various visual and cognitive tasks. With a total of 2.3 million samples, it is a valuable resource for training models on vision-language tasks, enhancing their ability to understand, reason, and perform complex tasks efficiently.
Key Components
Dataset Source:
• Fundamental Image Understanding & Common-Sense Knowledge: Samples from ShareGPT4V-Caption/Instruct.
• Non-Object Concepts & Cognitive Reasoning: Samples from ALLaVA4V-Text and MiniGemini-Instruct.
• Multi-Discipline Tasks & Integrated Abilities: Samples from GLLaVA-Align/Instruct and Math-Vision/Instruct/Plus.
Sample Collection:
• Real-World Images & Text-Only Samples: 899K samples.
• Documents, Charts, Diagrams, Signs, and Symbols: 627K samples.
• Math Samples: 747K samples (180.5K with images, 566.8K text-only).
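As a quick sanity check, the per-category counts above add up to the stated 2.3M total. A minimal sketch (the dictionary key names are illustrative, not official subset identifiers):

```python
# Per-category sample counts from the dataset description, in thousands.
counts_k = {
    "real_world_and_text_only": 899,    # Real-World Images & Text-Only Samples
    "documents_charts_diagrams": 627,   # Documents, Charts, Diagrams, Signs, Symbols
    "math": 747,                        # Math samples (images + text-only)
}

# The math category further splits into image-paired and text-only subsets.
math_split_k = {"with_images": 180.5, "text_only": 566.8}

total_k = sum(counts_k.values())
print(f"total: {total_k}K")  # 2273K, i.e. roughly 2.3M as stated
print(f"math subtotal: {sum(math_split_k.values()):.1f}K")  # 747.3K, rounded to 747K
```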
Gathered Dataset Layout