Notice

[#299]   2024-07-17 [ACM MM 2024]  Efficient Training for Multilingual Visual Speech Recognition (by Minsu Kim, Jeonghun Yeo) is accepted in ACM MM 2024

Title: Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

Authors: Minsu Kim*, Jeonghun Yeo*, Se Jin Park, Hyeongseop Rha, Yong Man Ro (* equal contributor)


This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, we propose to use a visual speech unit that can be obtained by discretizing the visual speech features extracted from the self-supervised visual speech model. Through analysis, we verify that the visual speech units mainly contain viseme information while suppressing non-linguistic information. By using the visual speech units as the inputs of our system, we propose to pre-train a VSR model to predict corresponding text outputs on multilingual data constructed by merging several VSR databases. As both the inputs (i.e., visual speech units) and outputs (i.e., text) are discrete, we can greatly improve the training efficiency compared to the standard VSR training. Specifically, the input data size is reduced to 0.016% of the original video inputs. In order to complement the insufficient visual information in speech recognition, we apply curriculum learning where the inputs of the system begin with audio-visual speech units and gradually change to visual speech units. After pre-training, the model is finetuned on continuous features. We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models, with a single trained model.
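
For readers unfamiliar with speech units, the following minimal sketch illustrates the general idea of discretizing self-supervised features into integer "visual speech units" with k-means. The feature extractor, feature dimension, and cluster count here are illustrative assumptions, not the paper's configuration.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def train_unit_quantizer(features, num_units=200):
    # features: (num_frames_total, feat_dim) array collected from training videos,
    # e.g., outputs of a self-supervised visual speech encoder (hypothetical here)
    km = MiniBatchKMeans(n_clusters=num_units, batch_size=1024)
    km.fit(features)
    return km

def video_to_units(km, video_features):
    # video_features: (num_frames, feat_dim) features of one silent-talking video
    units = km.predict(video_features)  # (num_frames,) integer "visual speech units"
    # collapse consecutive duplicates, a common step when forming unit sequences
    deduped = [units[0]] + [u for prev, u in zip(units, units[1:]) if u != prev]
    return np.array(deduped)

# usage with dummy features standing in for real encoder outputs
feats = np.random.randn(10000, 768).astype(np.float32)
quantizer = train_unit_quantizer(feats)
units = video_to_units(quantizer, np.random.randn(75, 768).astype(np.float32))

Because each frame collapses to a single integer, the pre-training inputs become far smaller than raw video, which is the source of the efficiency gain described above.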

[#298]   2024-07-03 [ECCV 2024]  MoAI: Mixture of All Intelligence for Large Language and Vision Models (by Byung-Kwan Lee) is accepted in ECCV 2024

Title: MoAI: Mixture of All Intelligence for Large Language and Vision Models

Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro


The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
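
A toy sketch of gate-weighted expert blending in this spirit is shown below; the dimensions, the three linear experts, and gating from the language feature are illustrative assumptions rather than MoAI's actual modules.

import torch
import torch.nn as nn

class ToyIntelligenceMixer(nn.Module):
    # Blends visual, auxiliary (verbalized CV outputs), and language features
    # with softmax gate weights, in a mixture-of-experts style.
    def __init__(self, dim=512):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.gate = nn.Linear(dim, 3)  # one weight per expert, computed from the language feature (assumption)

    def forward(self, visual, auxiliary, language):
        # each input: (batch, dim); in a real LLVM these would be token sequences
        expert_outs = torch.stack(
            [e(x) for e, x in zip(self.experts, (visual, auxiliary, language))], dim=1
        )  # (batch, 3, dim)
        weights = torch.softmax(self.gate(language), dim=-1).unsqueeze(-1)  # (batch, 3, 1)
        return (weights * expert_outs).sum(dim=1)  # (batch, dim)

mixer = ToyIntelligenceMixer()
out = mixer(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))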

[#297]   2024-07-03 [Pattern Recognition]  Text-Guided Distillation Learning to Diversify Video Embeddings (by Sangmin Lee) is accepted in Pattern Recognition

Title: Text-Guided Distillation Learning to Diversify Video Embeddings for Text-Video Retrieval

Authors: Sangmin Lee, Hyung-Il Kim, Yong Man Ro


Conventional text-video retrieval methods typically match a video with a text in a one-to-one manner. However, a single video can contain diverse semantics, and text descriptions can vary significantly. Therefore, such methods fail to match a video with multiple texts simultaneously. In this paper, we propose a novel approach to tackle this one-to-many correspondence problem in text-video retrieval. We devise diverse temporal aggregation and a multi-key memory to consider temporal and semantic diversity, consequently constructing multiple video embedding paths from a single video. Additionally, we introduce text-guided distillation learning that enables each video path to acquire meaningful, distinct competencies in representing varied semantics. Our video embedding approach is text-agnostic, allowing the prepared video embeddings to be used continuously for any new text query. Experiments show our method outperforms other methods on four datasets. We further validate the effectiveness of our designs with ablation studies and analyses on multiple video embeddings.
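
A minimal sketch of how a text query can be scored against several precomputed embedding paths of one video; the max-over-paths scoring and dimensions are illustrative assumptions, not the paper's exact retrieval function.

import torch
import torch.nn.functional as F

def text_video_score(text_emb, video_path_embs):
    # text_emb: (dim,) query embedding; video_path_embs: (num_paths, dim) precomputed,
    # text-agnostic embedding paths of a single video
    text_emb = F.normalize(text_emb, dim=-1)
    paths = F.normalize(video_path_embs, dim=-1)
    return (paths @ text_emb).max()  # best-matching path decides the retrieval score

score = text_video_score(torch.randn(256), torch.randn(4, 256))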

[#296]   2024-07-03 [ICIP 2024]  Weather-aware Drone-view Object Detection via Environmental Context Understanding (by Hyunjun Kim) is accepted in ICIP 2024

Title: Weather-aware Drone-view Object Detection via Environmental Context Understanding

Authors: Hyunjun Kim, Dahye Lee, Sungjune Park, Yong Man Ro


Drone-view object detection has shown noticeable performance and has been adopted in various real-world applications. However, there still exist several problems to be handled for its safe usage. While most existing methods have tried to manage a variety of object scales, very few works deal with diverse weather conditions. Therefore, in this paper, we propose a novel approach to build a drone-view object detector robust against the adverse effects of diverse environmental factors, such as fog, rain, and low illumination. To this end, we generate a weather content feature set using a multimodal large language model (MLLM) to describe diverse weather, illumination, and visibility conditions. These features are then adaptively selected based on the input image and applied to the detection framework to recognize the environmental semantics in the given visual images. Thereby, the detection framework gains environmental context understanding capability for drone-view images. Through comprehensive experiments and analysis, we corroborate the effectiveness of the proposed method, showing its robustness against adverse weather conditions.
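
A toy sketch of adaptively selecting precomputed weather-context features conditioned on the image feature; the context bank, dimensions, and additive fusion are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class WeatherContextSelector(nn.Module):
    # Selects from a bank of context embeddings (e.g., embeddings of MLLM-generated
    # descriptions such as "dense fog" or "night rain") via attention over the image feature.
    def __init__(self, dim=256, num_contexts=8):
        super().__init__()
        self.context_bank = nn.Parameter(torch.randn(num_contexts, dim))
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, image_feat):
        # image_feat: (batch, dim) pooled backbone feature of the drone-view image
        q = self.query_proj(image_feat)                           # (batch, dim)
        attn = torch.softmax(q @ self.context_bank.t(), dim=-1)   # (batch, num_contexts)
        selected = attn @ self.context_bank                       # (batch, dim)
        return image_feat + selected  # context-enriched feature for the detector

sel = WeatherContextSelector()
enriched = sel(torch.randn(2, 256))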

[#295]   2024-07-03 [ICIP 2024]  Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-driven Approach for Cross-modal Alignment Fusion (by Taeheon Kim, Sangyun Chung, Youngjoon Yu) is accepted in ICIP 2024 Workshop

Title: Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-driven Approach for Cross-modal Alignment Fusion

Authors: Taeheon Kim*, Sangyun Chung*, Youngjoon Yu*, Yong Man Ro (*equal contributor)


Multispectral pedestrian detection is a crucial component in various critical applications. However, a significant challenge arises due to the misalignment between the RGB and thermal modalities, particularly under real-world conditions where data often appear heavily misaligned. Conventional methods developed on well-aligned or minimally misaligned datasets fail to address these discrepancies adequately. This paper introduces a new framework for multispectral pedestrian detection designed specifically to handle heavily misaligned datasets without the need for costly and complex traditional pre-processing calibration. By leveraging Large-scale Vision-Language Models (LVLMs) for cross-modal semantic alignment, our approach seeks to enhance detection accuracy by aligning semantic information across the RGB and thermal domains. This method not only simplifies the operational requirements but also extends the usability of multispectral detection technologies in practical applications.

[#294]   2024-06-26 [2024 Fall Semester Admitted Students: Lab Openings]  Openings are available for 2 government-funded MS students, 1 KAIST-funded MS student, industry-sponsored students, and others.

We are recruiting 2 government-funded MS students, 1 KAIST-funded MS student, industry-sponsored students, and others.

Research areas for recruitment

Interested students are encouraged to email ymro@kaist.ac.kr.

[#293]   2024-05-19  [Recent Ph.D. graduate: postdoc]  Minsu Kim, a 2024 Ph.D. graduate, has joined AI research at META as a postdoctoral researcher.

Dr. Minsu Kim, who received his Ph.D. in February 2024, has joined the AI research group at META in London as a postdoctoral researcher. We extend our congratulations to him and hope that he will achieve outstanding results in AI research. By combining the research skills he developed during his Ph.D. at the IVY and LVL labs, particularly in human multimodal AI, with the cutting-edge research he will undertake at META, we believe Dr. Kim will make significant contributions to the field of AI.


[#292]   2024-05-19  [Amazon, Google Internships]   Sungjune and Se Jin will join Amazon and Google for research internships, respectively.

Two PhD students from the IVY lab have secured research internships at Amazon and Google in the USA, both leading institutions in the field of AI. Sungjune Park will join Amazon, and Se Jin Park will join Google to enhance their ongoing research during their PhD studies. Sungjune Park has published several top-tier papers on multimodal AI, focusing on integrating vision and language, while Se Jin Park has published several top-tier papers on human multimodal AI, specifically on the ability to process and understand human-relevant modalities such as spoken language and facial-audio expressions. They each expect to complete a paper as an outcome of their internships. This research internship experience will enable them to expand and deepen their PhD research, thereby building global competitiveness.


[#291]   2024-05-16  [ACL 2024]   CoLLaVO: Crayon Large Language and Vision mOdel (by Byung-Kwan Lee) is accepted in Findings of the Association for Computational Linguistics, ACL 2024

Title: CoLLaVO: Crayon Large Language and Vision mOdel

Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.
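
A toy sketch of turning a panoptic map into a per-pixel visual prompt by embedding panoptic ids; the class count, dimensions, and additive injection are illustrative assumptions, not CoLLaVO's exact Crayon Prompt.

import torch
import torch.nn as nn

class ToyCrayonPrompt(nn.Module):
    # Looks up a learnable embedding for each panoptic id and adds it to image features,
    # giving the model an object-aware "color map" style prompt.
    def __init__(self, num_panoptic_ids=134, dim=256):
        super().__init__()
        self.id_embed = nn.Embedding(num_panoptic_ids, dim)

    def forward(self, image_feats, panoptic_map):
        # image_feats: (batch, H, W, dim); panoptic_map: (batch, H, W) integer ids
        crayon = self.id_embed(panoptic_map)   # (batch, H, W, dim)
        return image_feats + crayon            # object-aware visual prompt

prompt = ToyCrayonPrompt()
out = prompt(torch.randn(1, 24, 24, 256), torch.randint(0, 134, (1, 24, 24)))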



[#290]   2024-05-16  [ACL 2024]   Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation (by Se Jin Park, Chae Won Kim) is accepted in Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL 2024

Title: Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Authors: Se Jin Park*, Chae Won Kim*, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, and Yong Man Ro (* equal contributor)

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 387 hours of approximately 10,000 dialogues, recorded based on the open-domain dialogue dataset TopicalChat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain by incorporating speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. All the data will be open-sourced.



[#289]   2024-04-26  [Pattern Recognition]  Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank (by Sungjune Park, Hyunjun Kim) is accepted in Pattern Recognition

Title: Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank 

Authors: Sungjune Park*, Hyunjun Kim*, and Yong Man Ro (* equal first authors)

Pedestrian detection is a crucial field of computer vision research which can be adopted in various real-world applications (e.g., self-driving systems). However, despite the noticeable evolution of pedestrian detection, the pedestrian representations learned within a detection framework are usually limited to the particular scene data on which they were trained. Therefore, in this paper, we propose a novel approach to construct a versatile pedestrian knowledge bank containing representative pedestrian knowledge which is applicable to various detection frameworks and diverse scenes. We extract generalized pedestrian knowledge from a large-scale pretrained model and curate it by quantizing the most representative features and guiding them to be more distinguishable from various background scenes. After they are stored in the versatile pedestrian knowledge bank, we leverage them to complement and enhance pedestrian features within a detection framework. Through comprehensive experiments, we validate the effectiveness of our method, demonstrating its versatility and achieving state-of-the-art detection performance.
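
A minimal sketch of complementing detector RoI features with a fixed bank of quantized pedestrian representations; the similarity-attention lookup and dimensions are illustrative assumptions, not the paper's exact mechanism.

import torch
import torch.nn.functional as F

def enhance_with_knowledge_bank(roi_feats, bank):
    # roi_feats: (num_rois, dim) detector features; bank: (num_entries, dim) quantized
    # pedestrian representations curated offline from a large-scale pretrained model
    sim = F.normalize(roi_feats, dim=-1) @ F.normalize(bank, dim=-1).t()
    attn = torch.softmax(sim, dim=-1)   # (num_rois, num_entries)
    retrieved = attn @ bank             # knowledge matched to each RoI
    return roi_feats + retrieved        # complemented pedestrian features

out = enhance_with_knowledge_bank(torch.randn(5, 256), torch.randn(64, 256))

Because the bank is built offline and queried only by feature similarity, the same bank can in principle be plugged into different detection frameworks, which is the "versatility" the abstract refers to.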




[#288]   2024-03-26  [IEEE TCSVT]  Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection (by Sungjune Park, Hyunjun Kim) is accepted in IEEE Trans. on CSVT

Title: Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection 

Authors: Sungjune Park*, Hyunjun Kim*, and Yong Man Ro (* equal first authors)

Large language models (LLMs) have shown their capability in understanding contextual and semantic information regarding the appearance of instances. In this paper, we introduce a novel approach to utilize the strength of an LLM in understanding contextual appearance variations and to leverage its knowledge into a vision model (here, pedestrian detection). While pedestrian detection is considered one of the crucial tasks directly related to our safety (e.g., intelligent driving systems), it is challenging because of varying appearances and poses in diverse scenes. Therefore, we propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus which includes numerous narratives describing various appearances of pedestrians and others. By feeding them through an LLM, we extract appearance knowledge sets that contain the representations of appearance variations. After that, we perform a task-prompting process to obtain appearance elements, i.e., representative appearance knowledge guided to be relevant to the downstream pedestrian detection task. The obtained knowledge elements are adaptable to various detection frameworks, so that we can provide plentiful appearance information by integrating the language-derived appearance elements with visual cues within a detector. Through comprehensive experiments with various pedestrian detectors, we verify the adaptability and effectiveness of our method, showing noticeable performance gains and achieving state-of-the-art detection performance on two public pedestrian detection benchmarks (i.e., CrowdHuman and WiderPedestrian).



[#287]   2024-03-12  [2024 Fall Semester Graduate Student Recruitment]  We are recruiting 2 government-funded MS students, 1 KAIST-funded PhD student, industry-sponsored students, and others. Interested students are encouraged to email ymro@kaist.ac.kr.

We are recruiting 2 government-funded MS students, 1 KAIST-funded PhD student, industry-sponsored students, and others.

Research areas for recruitment

Interested students are encouraged to email ymro@kaist.ac.kr.




[#286]   2024-02-27  [CVPR 2024]  Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection (by Taeheon Kim) is accepted in CVPR 2024

Title: Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection

Authors: Taeheon Kim*, Sebin Shin*, Youngjoon Yu, Hak Gu Kim, and Yong Man Ro (* equal first authors)

RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operations. However, the modality bias problem remains unsolved as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection mainly distribute between ROTO (day) and RXTO (night) data; the majority of the pedestrian labels statistically co-occur with their thermal features. As a result, multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation, such as ROTX data. To address this problem, we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover, we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not presented in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new ROTX-MP. We will release our new dataset to the public for future research.



[#285]   2024-02-27  [CVPR 2024]  AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation (by Se Jin Park, Minsu Kim) is accepted in CVPR 2024

Title: AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Authors: Jeongsoo Choi*, Se Jin Park*, Minsu Kim*, and Yong Man Ro (* equal first authors)

This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV brings two key advantages: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in the source audio-visual speech can be maintained in the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting.



[#284]   2024-02-27  [IEEE TMM]  AKVSR: Compressing Audio Knowledge of a Pretrained Model (by Jeong Hun Yeo) is accepted in IEEE Trans. on Multimedia

Title: AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

Authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, and Yong Man Ro

Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information in lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of the visual modality by using the audio modality. Different from previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of audio knowledge in a compact audio memory by discarding non-linguistic information from the audio through quantization, and 3) includes an Audio Bridging Module which can find the best-matched audio features from the compact audio memory, making training possible without audio inputs once the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments and achieve new state-of-the-art performance on the widely used LRS3 dataset.
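
A toy sketch of querying a compact audio memory from visual features so that no audio is needed at inference; the attention-based lookup, dimensions, and concatenation fusion are illustrative assumptions, not AKVSR's exact Audio Bridging Module.

import torch
import torch.nn as nn

class AudioBridging(nn.Module):
    # Visual features attend over a compact audio memory (built offline by quantizing
    # pretrained audio features) and are fused with the retrieved audio knowledge.
    def __init__(self, dim=512, memory_size=128):
        super().__init__()
        self.audio_memory = nn.Parameter(torch.randn(memory_size, dim))
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, visual_feats):
        # visual_feats: (batch, time, dim) lip-movement features
        q = self.query_proj(visual_feats)  # queries derived from video
        attn = torch.softmax(q @ self.audio_memory.t() / q.shape[-1] ** 0.5, dim=-1)
        matched_audio = attn @ self.audio_memory                 # best-matched audio knowledge
        return torch.cat([visual_feats, matched_audio], dim=-1)  # fused input for the VSR decoder

bridge = AudioBridging()
fused = bridge(torch.randn(2, 75, 512))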



[#283]   2024-02-22  Recruitment for PhD and MS Students

Title: Recruitment for PhD and MS Students

The IVY Laboratory is promoting international exchanges. For students applying to join the lab after September 2024, we prefer PhD candidates who are interested in international growth and aim to pursue international opportunities after completing their doctoral program. For master's degree candidates interested in joining our laboratory, we also welcome those who aspire to pursue a PhD abroad or seek international career paths. We particularly welcome students who already have strong interest and experience in studying and researching deep learning-based approaches. Interested students are encouraged to contact us via email at ymro@kaist.ac.kr.

We look forward to hearing from you.

[#282]   2024-02-21  Prof. Yong Man Ro Named ICT Endowed Chair Professor at KAIST

Title: Prof. Yong Man Ro Named ICT Endowed Chair Professor at KAIST

Prof. Yong Man Ro has been appointed as the ICT Endowed Chair Professor at KAIST. Since establishing the IVY Lab in 1997, Prof. Ro has been instrumental in advancing research in image processing, computer vision, artificial intelligence (AI), and multimedia.

Under his guidance, the IVY Lab has achieved remarkable milestones, including the graduation of 25 PhD and 70 Master's students, who have gone on to make significant contributions to the IT field worldwide. The laboratory's research output is highly competitive, comprising more than 520 peer-reviewed journal articles and top-conference papers.

