[#310] 2024-10-18 [IEEE TPAMI] Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition (by Minsu Kim) is accepted in IEEE Transactions on Pattern Analysis and Machine Intelligence
Title: Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition
Minsu Kim, Hyeong-Il Kim, Yong Man Ro
Visual Speech Recognition (VSR) aims to infer text from lip movements alone. Because it relies solely on visual information to model speech, its performance is inherently sensitive to personal lip appearances and movements, which makes VSR models degrade when applied to unseen speakers. In this paper, to remedy this performance degradation on unseen speakers, we propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive VSR. Specifically, motivated by recent advances in Natural Language Processing (NLP), we finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters. Unlike previous prompt tuning methods, which are mainly limited to Transformer-variant architectures, we explore different types of prompts, namely addition, padding, and concatenation prompts, that can be applied to a VSR model generally composed of a CNN and a Transformer. With the proposed prompt tuning, we show that the performance of a pre-trained VSR model on unseen speakers can be largely improved using a small amount of adaptation data (e.g., less than 5 minutes), even if the pre-trained model is already developed with large speaker variations. Moreover, by analyzing the performance and parameter counts of the different types of prompts, we investigate when prompt tuning is preferred over finetuning. The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases, LRW-ID and GRID.
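The sketch below illustrates, under assumptions, how the three prompt forms might attach to a CNN+Transformer VSR model: an addition prompt added to an early feature map, a padding prompt replacing zero-padding, and a concatenation prompt prepended to the Transformer input. Module names, shapes, and the single-frame treatment are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpeakerPrompts(nn.Module):
    """Illustrative speaker-adaptive prompts for a CNN+Transformer VSR model (shapes are assumed)."""
    def __init__(self, c=64, h=22, w=22, pad=1, n_concat=4, d_model=512):
        super().__init__()
        # Addition prompt: a learnable map added element-wise to an early CNN feature map.
        self.add_prompt = nn.Parameter(torch.zeros(1, c, h, w))
        # Padding prompt: learnable values used in place of zero-padding around the feature map.
        self.pad = pad
        self.pad_prompt = nn.Parameter(torch.zeros(1, c, h + 2 * pad, w + 2 * pad))
        # Concatenation prompt: learnable tokens prepended to the Transformer input sequence.
        self.concat_prompt = nn.Parameter(torch.zeros(1, n_concat, d_model))

    def apply_addition(self, feat):      # feat: (B, C, H, W)
        return feat + self.add_prompt

    def apply_padding(self, feat):       # surround the feature map with learned values instead of zeros
        padded = self.pad_prompt.expand(feat.size(0), -1, -1, -1).clone()
        padded[:, :, self.pad:-self.pad, self.pad:-self.pad] = feat
        return padded

    def apply_concat(self, tokens):      # tokens: (B, T, D) Transformer input
        prompts = self.concat_prompt.expand(tokens.size(0), -1, -1)
        return torch.cat([prompts, tokens], dim=1)

# During adaptation only these prompt parameters are trained; the pre-trained VSR model stays frozen.
```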
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), IEEE TPAMI 2024
[#309] 2024-10-15 [NVIDIA Internship] Byung Kwan Lee will join NVIDIA for a research internship.
Byung Kwan Lee will join NVIDIA to advance his ongoing Ph.D. research on Vision LLMs. He has recently published several top-tier papers on Vision LLMs, focusing on integrating vision and language as well as on efficient VLLMs. He expects to complete a paper as an outcome of his internship. This research internship experience will enable him to expand and deepen his Ph.D. research, as well as IVL lab research, thereby building global competitiveness.
[#308] 2024-10-09 [IEEE TNNLS] Advancing Causal Intervention in Image Captioning with Causal Prompt (by Youngjoon Yu) is accepted in IEEE Transactions on Neural Networks and Learning Systems
Title: Advancing Causal Intervention in Image Captioning with Causal Prompt
Youngjoon Yu, Yeonju Kim, Yong Man Ro
This paper introduces a novel approach, called Causal Prompting Network (CPNet), to enhance causal intervention in the context of image captioning. By leveraging visual prompt engineering in the feature space, the method aims to achieve superior performance in causal intervention tasks. Since CPNet is highly flexible and adaptable, it can be incorporated into any existing causal intervention-based image captioning framework. Specifically, two types of visual prompts, the Causal RoI Prompt (CRP) and the Causal Matching Prompt (CMP), are employed to refine the feature representations effectively. CRP is applied to the Region of Interest (RoI) features of objects to enhance them with deconfounded causal features, while CMP is used to strengthen the contextual representation of confounders linked to image captioning tasks. To evaluate the effectiveness of the proposed CPNet, an extensive range of experiments is conducted on the popular MS-COCO and Flickr30k datasets, and the results are validated using the Karpathy split. Experimental results demonstrate that the proposed CPNet surpasses the performance of other state-of-the-art image captioning methods.
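As a rough illustration of visual prompting in the feature space, the sketch below adds a learnable prompt to detector RoI features before captioning; the additive form, shapes, and module name are assumptions, not the paper's exact CRP design.

```python
import torch
import torch.nn as nn

class CausalRoIPromptSketch(nn.Module):
    """Refine RoI features with a learnable visual prompt (illustrative, not the released CRP)."""
    def __init__(self, d_roi: int = 1024):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(1, 1, d_roi))   # one prompt shared across all RoIs
        self.refine = nn.Linear(d_roi, d_roi)

    def forward(self, roi_feats):            # roi_feats: (B, N_roi, D) object region features
        return roi_feats + self.refine(roi_feats + self.prompt)   # prompt-conditioned refinement
```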
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), IEEE TNNLS 2024
[#307] 2024-09-27 [Current Lab Openings for Spring 2025] Openings for two government-funded M.S. students and for KAIST scholarship programs (KEPSI, EPSS, LGenius, EPSD).
We are recruiting two government-funded M.S. students, industry-sponsored scholarship students, and others.
Research areas for prospective students
Vision + LLM (large language model) / LVLM (large vision language model)
AI models: XAI, capability, memory, Human-Machine Interaction, robustness, Multimodal AI
Integration of Vision, Language, and Audio/Speech
Interested students are encouraged to email ymro@kaist.ac.kr.
[#306] 2024-09-26 [NeurIPS 2024] CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models (by Junho Kim, Hyunjun Kim) is accepted at NeurIPS 2024
Title: CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim*, Hyunjun Kim*, Yeonju Kim, Yong Man Ro (* equal contribution)
Large Multi-modal Models (LMMs) have recently demonstrated remarkable abilities in visual context understanding and coherent response generation. However, alongside these advancements, the issue of hallucinations has emerged as a significant challenge, producing erroneous responses that are unrelated to the visual contents. In this paper, we introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE), which leverages self-generated descriptions as contrasting references during the decoding phase of LMMs to address hallucination issues. CODE utilizes comprehensive descriptions generated by the model itself as a visual counterpart to correct and improve response alignment with the actual visual content. By dynamically adjusting the information flow and distribution of next-token predictions in the LMM's vocabulary, CODE enhances the coherence and informativeness of generated responses. Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs. Our method provides a simple yet effective decoding strategy that can be integrated into existing LMM frameworks without additional training.
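The sketch below shows a plain contrastive-decoding step in the spirit of CODE, where logits conditioned on the image are contrasted against logits conditioned on the model's own description; the fixed weight alpha and greedy selection are simplifying assumptions rather than the paper's dynamic adjustment.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def contrastive_next_token(model, image_inputs, description_inputs, alpha=1.0):
    """Choose the next token by contrasting image-conditioned and description-conditioned predictions.
    `model`, `image_inputs`, and `description_inputs` are placeholders for an LMM and its two contexts
    (the true visual input vs. the model's self-generated textual description of that input)."""
    logits_img = model(**image_inputs).logits[:, -1, :]          # conditioned on the actual image
    logits_desc = model(**description_inputs).logits[:, -1, :]   # conditioned on the self-description
    # Keep what the visual evidence supports beyond what the description alone would predict.
    contrasted = (1 + alpha) * logits_img - alpha * logits_desc
    return F.softmax(contrasted, dim=-1).argmax(dim=-1)
```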
[#305] 2024-09-26 [NeurIPS 2024] Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (by Byung-Kwan Lee) is accepted at NeurIPS 2024
Title: Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro
The rapid development of large language and vision models (LLVMs) has been driven by advances in visual instruction tuning. Recently, open-source LLVMs have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models in order to narrow the performance gap with powerful closed-source LLVMs. These advancements are attributed to multifaceted information required for diverse capabilities, including fundamental image understanding, real-world knowledge about common-sense and non-object concepts (e.g., charts, diagrams, symbols, signs, and math problems), and step-by-step procedures for solving complex questions. Drawing from the multifaceted information, we present a new efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages multifaceted rationale to enhance understanding and answering capabilities. To embed lengthy rationales containing abundant information, we employ the Mamba architecture, capable of processing sequential data with linear time complexity. We introduce a new concept of traversal of rationale that facilitates efficient embedding of rationale. Subsequently, the backbone multimodal language model (MLM) is trained to generate answers with the aid of rationale. Through these steps, Meteor achieves significant improvements in vision language performances across multiple evaluation benchmarks requiring diverse capabilities, without scaling up the model size or employing additional vision encoders and computer vision models.
[#304] 2024-09-21 [EMNLP 2024] From CoLLaVO (ACL 24) to MoAI (ECCV 24), Now TroL: Advancing Large Language and Vision Models (by Byung-Kwan Lee) is accepted at EMNLP 2024
Title: TroL: Traversal of Layers for Large Language and Vision Models
Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Along with scaling them up directly, these advances enable LLVMs to showcase powerful vision language (VL) performance by covering diverse tasks via natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters), having a larger number of layers. These large models demand costly, high-end resources for both training and inference. To address this issue, we present a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes, Traversal of Layers (TroL), which enables the reuse of layers in a token-wise manner. This layer traversing technique simulates the effect of looking back and retracing the answering stream while increasing the number of forward propagation layers without physically adding more layers. We demonstrate that TroL employs a simple layer traversing approach yet efficiently outperforms open-source LLVMs with larger model sizes and rivals the performance of closed-source LLVMs of substantial size.
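A minimal sketch of token-wise layer traversing under assumptions: one pre-trained Transformer block is applied twice and the two passes are mixed with a small token-wise gate; the gating design is illustrative, not the released TroL mixer.

```python
import torch
import torch.nn as nn

class TraversedLayer(nn.Module):
    """Reuse a single Transformer block twice and mix the passes token-wise (illustrative sketch)."""
    def __init__(self, block: nn.Module, d_model: int):
        super().__init__()
        self.block = block                      # an existing pre-trained block, reused rather than duplicated
        self.gate = nn.Linear(d_model, 1)       # tiny token-wise mixer; the only newly trained part

    def forward(self, x):                       # x: (B, T, D)
        first = self.block(x)                   # first forward pass through the block
        second = self.block(first)              # traverse the same block again ("looking back")
        w = torch.sigmoid(self.gate(first))     # per-token mixing weight in [0, 1]
        return w * second + (1 - w) * first     # extra effective depth without extra layers
```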
[#303] 2024-09-21 [EMNLP 2024] Where Visual Speech Meets Language: VSP-LLM (by Jeong Hun Yeo, Seunghee Han) is accepted at the Findings of EMNLP 2024
Title: Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
Authors: Jeong Hun Yeo*, Seunghee Han*, Minsu Kim, Yong Man Ro (* equal contributor)
In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by leveraging the power of LLMs. Specifically, VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Based on the fact that input frames contain redundant information, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data can translate lip movements more effectively than a recent model trained with 433 hours of data.
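The sketch below illustrates the deduplication idea under assumptions: consecutive frames that map to the same visual speech unit are collapsed into a single feature (here by averaging); the averaging choice is illustrative rather than the paper's exact reduction rule.

```python
import torch

def deduplicate_features(features: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
    """Merge consecutive frames that share the same discrete visual speech unit.
    features: (T, D) frame-level visual features; units: (T,) integer unit ids."""
    merged, run = [], [0]
    for t in range(1, units.size(0)):
        if units[t] == units[t - 1]:
            run.append(t)                                # extend the current run of identical units
        else:
            merged.append(features[run].mean(dim=0))     # collapse the finished run into one embedding
            run = [t]
    merged.append(features[run].mean(dim=0))
    return torch.stack(merged)                           # (T', D) shorter sequence fed to the LLM
```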
[#302] 2024-09-21 [EMNLP 2024] What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models (by Junho Kim) is accepted at the Findings of EMNLP 2024
Title: What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
Authors: Junho Kim, Yeon Ju Kim, Yong Man Ro
This paper presents a way of enhancing the reliability of Large Multi-modal Models (LMMs) in addressing hallucination, where the models generate cross-modal inconsistent responses. Without additional training, we propose Counterfactual Inception, a novel method that implants counterfactual thinking into LMMs using self-generated counterfactual keywords. Our method is grounded in the concept of counterfactual thinking, a cognitive process in which humans consider alternative realities, enabling more extensive context exploration. By bridging this human cognitive mechanism into LMMs, we aim for the models to engage with and generate responses that span a wider contextual scene understanding, mitigating hallucinatory outputs. We further introduce the Plausibility Verification Process (PVP), a simple yet robust keyword constraint that effectively filters out sub-optimal keywords to enable the consistent triggering of counterfactual thinking in the model responses. Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination and helps to broaden contextual understanding based on true visual clues.
[#301] 2024-08-19 [Outstanding Paper Award in ACL 2024] Se Jin Park and Chae Won Kim have won the Outstanding Paper Award at the ACL (Association for Computational Linguistics) 2024 conference.
PhD students Se Jin Park and Chae Won Kim have won the Outstanding Paper Award at the ACL (Association for Computational Linguistics) 2024 conference, held in Bangkok. ACL is recognized as the world’s leading conference in the field of Natural Language Processing (NLP) and is one of the top-tier international conferences in Artificial Intelligence (AI).
Their award-winning paper, titled "Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation," introduces an innovative model designed to make interactions between humans and AI more natural and human-like. Unlike traditional text-based or speech-based dialogue models, this research developed a Human Multimodal LLM (Large Language Model) that enables AI to comprehend both visual cues and vocal signals from humans. Additionally, it allows the AI to engage in conversations using human-like facial expressions and speech.
This breakthrough opens up new possibilities for improving the intuitiveness and effectiveness of human-AI interactions by simultaneously processing visual and auditory signals during conversations.
The paper was also presented as an oral presentation at the ACL 2024 conference in Bangkok, where it garnered significant attention.
Professor Yong Man Ro stated, "This research marks a significant advancement in human-AI interaction, and we hope this technology will be widely applied in various real-world applications. This award is yet another example of the international recognition of the excellence of AI research at KAIST’s School of Electrical Engineering."
[#300] 2024-08-03 [IEEE TASLP] Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation (by Minsu Kim) is accepted in IEEE Trans. on Audio, Speech, and Language Processing
Title: Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation
Authors: Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro
In this paper, we propose a pre-training method to learn unified representations of multilingual speech and text without using text, especially focusing on the purpose of multimodal-to-speech machine translation. To this end, we represent multilingual speech with speech units, the discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both speech and text modalities at the phonetic level. By setting both the inputs and outputs of our learning problem as speech units, we propose to pre-train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during pre-training, the model can build knowledge of how languages are comprehended and how to relate them to different languages. Since speech units can be easily obtained from both audio and text, by quantization and phonemization respectively, the UTUT pre-trained model can be easily transferred to text-related tasks even though it is trained without text. We show that a single UTUT pre-trained model can be employed for diverse multilingual speech- and text-related tasks: Speech-to-Speech Translation (STS), multilingual Text-to-Speech Synthesis (TTS), and Text-to-Speech Translation (TTST). By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, we show that the UTUT pre-trained model can also perform translation for novel language pairs that were not seen as pairs during training, which has not been well explored in the previous literature. Samples can be found on https://choijeongsoo.github.io/utut.
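A minimal sketch of the unit-to-unit setup under assumptions: an encoder-decoder Transformer over discrete speech units, with a source-language token prepended to the encoder input and a target-language token prepended to the decoder input. Vocabulary sizes, the prepending scheme, and the module layout are illustrative, not the released UTUT model.

```python
import torch
import torch.nn as nn

class UnitToUnitTranslator(nn.Module):
    """Encoder-decoder over discrete speech units, conditioned on language tokens (illustrative)."""
    def __init__(self, n_units=1000, n_langs=20, d_model=512):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)
        self.seq2seq = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_units)

    def forward(self, src_units, src_lang, tgt_units, tgt_lang):
        # Prepend the source-language token so the encoder knows which language it is reading.
        src = torch.cat([self.lang_emb(src_lang)[:, None, :], self.unit_emb(src_units)], dim=1)
        # Prepend the target-language token so the decoder knows which language to generate.
        tgt = torch.cat([self.lang_emb(tgt_lang)[:, None, :], self.unit_emb(tgt_units)], dim=1)
        return self.out(self.seq2seq(src, tgt))    # logits over the next target speech units
```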
[#299] 2024-07-17 [ACM MM 2024] Efficient Training for Multilingual Visual Speech Recognition (by Minsu Kim, Jeonghun Yeo) is accepted in ACM MM 2024
Title: Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
Authors: Minsu Kim*, Jeonghun Yeo*, Se Jin Park, Hyeongseop Rha, Yong Man Ro (* equal contributor)
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy that processes visual speech units. Motivated by the recent success of audio speech units, we propose to use visual speech units, obtained by discretizing the visual speech features extracted from a self-supervised visual speech model. Through analysis, we verify that the visual speech units mainly contain viseme information while suppressing non-linguistic information. By using the visual speech units as the inputs of our system, we propose to pre-train a VSR model to predict corresponding text outputs on multilingual data constructed by merging several VSR databases. As both the inputs (i.e., visual speech units) and outputs (i.e., text) are discrete, we can greatly improve the training efficiency compared to standard VSR training. Specifically, the input data size is reduced to 0.016% of the original video inputs. To complement the insufficient visual information in speech recognition, we apply curriculum learning, where the inputs of the system begin with audio-visual speech units and gradually change to visual speech units. After pre-training, the model is finetuned on continuous features. We set new state-of-the-art multilingual VSR performances, achieving results comparable to previous language-specific VSR models with a single trained model.
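A small sketch of how such visual speech units can be obtained, by k-means clustering of features from a self-supervised visual speech encoder; the specific encoder, the number of clusters, and the use of scikit-learn are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_unit_codebook(features: np.ndarray, n_units: int = 1000) -> KMeans:
    """Fit a k-means codebook on pooled (N, D) features from a self-supervised visual speech model."""
    return KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)

def to_visual_speech_units(codebook: KMeans, utterance_features: np.ndarray) -> np.ndarray:
    """Map the (T, D) frame features of one utterance to a sequence of discrete unit ids of shape (T,)."""
    return codebook.predict(utterance_features)
```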
[#298] 2024-07-03 [ECCV 2024] MoAI: Mixture of All Intelligence for Large Language and Vision Models (by Byung-Kwan Lee) is accepted in ECCV 2024
Title: MoAI: Mixture of All Intelligence for Large Language and Vision Models
Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
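The sketch below shows one simple way to blend the three feature types with a learned gate in the spirit of a Mixture of Experts; the mean-pooled gating, linear experts, and the assumption that the three streams share the same length are illustrative simplifications, not MoAI-Mixer's actual attention experts.

```python
import torch
import torch.nn as nn

class ThreeWayMixer(nn.Module):
    """Blend visual, auxiliary (verbalized CV outputs), and language features with a learned gate."""
    def __init__(self, d_model=512):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
        self.gate = nn.Linear(3 * d_model, 3)

    def forward(self, visual, auxiliary, language):   # each: (B, T, D), assumed aligned to the same length T
        summary = torch.cat([f.mean(dim=1) for f in (visual, auxiliary, language)], dim=-1)
        weights = torch.softmax(self.gate(summary), dim=-1)                     # (B, 3) mixture weights
        outputs = [e(f) for e, f in zip(self.experts, (visual, auxiliary, language))]
        return sum(w[:, None, None] * o for w, o in zip(weights.unbind(dim=-1), outputs))  # (B, T, D)
```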
[#297] 2024-07-03 [Pattern Recognition] Text-Guided Distillation Learning to Diversify Video Embeddings (by Sangmin Lee) is accepted in Pattern Recognition
Title: Text-Guided Distillation Learning to Diversify Video Embeddings for Text-Video Retrieval
Authors: Sangmin Lee, Hyung-Il Kim, Yong Man Ro
Conventional text-video retrieval methods typically match a video with a text in a one-to-one manner. However, a single video can contain diverse semantics, and text descriptions can vary significantly. Therefore, such methods fail to match a video with multiple texts simultaneously. In this paper, we propose a novel approach to tackle this one-to-many correspondence problem in text-video retrieval. We devise diverse temporal aggregation and a multi-key memory to consider temporal and semantic diversity, consequently constructing multiple video embedding paths from a single video. Additionally, we introduce text-guided distillation learning that enables each video path to acquire meaningful, distinct competencies in representing varied semantics. Our video embedding approach is text-agnostic, allowing the prepared video embeddings to be used continuously for any new text query. Experiments show our method outperforms other methods on four datasets. We further validate the effectiveness of our designs with ablation studies and analyses on multiple video embeddings.
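A rough sketch of producing multiple text-agnostic embeddings from one video by addressing a small learnable key memory; the number of keys, the attention-style pooling, and the shapes are assumptions for illustration, not the paper's exact multi-key memory design.

```python
import torch
import torch.nn as nn

class MultiKeyVideoEmbedder(nn.Module):
    """Produce several text-agnostic embeddings for one video via a learnable key memory (sketch)."""
    def __init__(self, d_model=512, n_keys=7):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_keys, d_model) * 0.02)   # one key per embedding path

    def forward(self, frame_feats):                    # frame_feats: (B, T, D) temporally aggregated features
        scores = self.keys @ frame_feats.transpose(1, 2) / frame_feats.size(-1) ** 0.5   # (B, K, T)
        attn = torch.softmax(scores, dim=-1)           # each key pools the frames differently
        return attn @ frame_feats                      # (B, K, D): K distinct embeddings, reusable for any text query
```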
[#296] 2024-07-03 [ICIP 2024] Weather-aware Drone-view Object Detection via Environmental Context Understanding (by Hyunjun Kim) is accepted in ICIP 2024
Title: Weather-aware Drone-view Object Detection via Environmental Context Understanding
Authors: Hyunjun Kim, Dahye Lee, Sungjune Park, Yong Man Ro
Drone-view object detection has shown noticeable performance and has been adopted in various real-world applications. However, several problems still need to be handled for its safe usage. While most existing methods have tried to manage a variety of object scales, very few works deal with diverse weather conditions. Therefore, in this paper, we propose a novel approach to build a drone-view object detector that is robust against the adverse effects of diverse environmental factors, such as fog, rain, and low illumination. To this end, we generate a weather content feature set using a multimodal large language model (MLLM) to describe diverse weather, illumination, and visibility conditions. These features are then adaptively selected based on the input image and applied to the detection framework to recognize the environmental semantics in the given visual images. Thereby, the detection framework gains environmental context understanding capability in drone-view images. Through comprehensive experiments and analysis, we corroborate the effectiveness of the proposed method, showing robustness against adverse weather conditions.
[#295] 2024-07-03 [ICIP 2024] Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-driven Approach for Cross-modal Alignment Fusion (by Taeheon Kim, Sangyun Chung, Youngjoon Yu) is accepted in ICIP 2024 Workshop
Title: Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-driven Approach for Cross-modal Alignment Fusion
Authors: Taeheon Kim*, Sangyun Chung*, Youngjoon Yu*, Yong Man Ro (*equal contributor)
Multispectral pedestrian detection is a crucial component in various critical applications. However, a significant challenge arises from the misalignment between the RGB and thermal modalities, particularly under real-world conditions where data often appear heavily misaligned. Conventional methods developed on well-aligned or minimally misaligned datasets fail to address these discrepancies adequately. This paper introduces a new framework for multispectral pedestrian detection designed specifically to handle heavily misaligned datasets without the need for costly and complex traditional pre-processing calibration. By leveraging Large-scale Vision-Language Models (LVLMs) for cross-modal semantic alignment, our approach seeks to enhance detection accuracy by aligning semantic information across the RGB and thermal domains. This method not only simplifies the operational requirements but also extends the practical usability of multispectral detection technologies in real-world applications.
[#294] 2024-06-26 [Lab Openings for Students Admitted for Fall 2024] Openings for two government-funded M.S. students, one KAIST-funded M.S. student, industry-sponsored scholarship students, and others.
We are recruiting two government-funded M.S. students, one KAIST-funded M.S. student, industry-sponsored scholarship students, and others.
Research areas
Vision + LLM (large language model) / LVLM (large vision language model)
Multimodal + LLM / LVLM
Integration of Vision, Language, and Speech/Sound
Interested students are encouraged to email ymro@kaist.ac.kr.
[#293] 2024-05-19 [Recent Ph.D. graduate: postdoc] Minsu Kim, a Ph.D. graduate of 2024, has joined META as a postdoctoral researcher in AI research.
Dr. Minsu Kim, who received his Ph.D. in February 2024, has joined the AI research group at META in London as a postdoctoral researcher. We extend our congratulations to him and hope that he will achieve outstanding results in AI research. By combining the research skills he developed during his Ph.D. at the IVY and LVL labs, particularly in human multimodal AI, with the cutting-edge research he will undertake at META, we believe Dr. Kim will make significant contributions to the field of AI.
[#292] 2024-05-19 [Amazon, Google Internships] Sungjune and Se Jin will join Amazon and Google for research internships, respectively.
Two PhD students from the IVY lab have secured research internships at Amazon and Google in the USA, both leading companies in the field of AI. Sungjune Park will join Amazon, and Se Jin Park will join Google to enhance their ongoing research during their PhD studies. Sungjune Park has published several top-tier papers on multimodal AI, focusing on integrating vision and language, while Se Jin Park has published several top-tier papers on human multimodal AI, specifically on the ability to process and understand human-relevant modalities such as spoken language and facial-audio expressions. They each expect to complete a paper as an outcome of their internships. This research internship experience will enable them to expand and deepen their PhD research, thereby building global competitiveness.
[#291] 2024-05-16 [ACL 2024] CoLLaVO: Crayon Large Language and Vision mOdel (Byung-Kwan Lee) is accepted in Findings of the Association for Computational Linguistics, ACL 2024
Title: CoLLaVO: Crayon Large Language and Vision mOdel
Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro
The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities, determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose the Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with the Crayon Prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), ACL 2024
[#290] 2024-05-16 [ACL 2024] Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation (Se Jin Park, Chae Won Kim) is accepted in Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL 2024
Title: Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
Authors: Se Jin Park*, Chae Won Kim*, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, and Yong Man Ro
In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 387 hours of approximately 10,000 dialogues, recorded based on the open-domain dialogue dataset TopicalChat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain by incorporating speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. All the data will be open-sourced.
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), ACL 2024
[#289] 2024-04-26 [Pattern Recognition] Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank (by Sungjune Park, Hyunjun Kim) is accepted in Pattern Recognition
Title: Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank
Authors: {Sungjune Park, Hyunjun Kim: equal first authors}, and Yong Man Ro
Pedestrian detection is a crucial field of computer vision research that can be adopted in various real-world applications (e.g., self-driving systems). However, despite the noticeable evolution of pedestrian detection, the pedestrian representations learned within a detection framework are usually limited to the particular scene data on which they were trained. Therefore, in this paper, we propose a novel approach to construct a versatile pedestrian knowledge bank containing representative pedestrian knowledge that can be applied to various detection frameworks and adopted in diverse scenes. We extract generalized pedestrian knowledge from a large-scale pretrained model and curate it by quantizing the most representative features and guiding them to be more distinguishable from various background scenes. Once stored in the versatile pedestrian knowledge bank, these features are leveraged to complement and enhance pedestrian features within a detection framework. Through comprehensive experiments, we validate the effectiveness of our method, demonstrating its versatility and outperforming state-of-the-art detection performances.
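The sketch below illustrates the general idea under assumptions: pedestrian features from a large pretrained model are quantized into a small bank of representative vectors, which a detector's RoI features can later attend to and absorb; the plain k-means quantization and soft-attention lookup are illustrative, not the paper's exact curation and integration steps.

```python
import torch

def build_knowledge_bank(ped_feats: torch.Tensor, n_entries: int = 256, iters: int = 20) -> torch.Tensor:
    """Quantize (N, D) pedestrian features from a pretrained model into an (n_entries, D) bank (plain k-means)."""
    bank = ped_feats[torch.randperm(ped_feats.size(0))[:n_entries]].clone()
    for _ in range(iters):
        assign = torch.cdist(ped_feats, bank).argmin(dim=1)     # nearest bank entry for every feature
        for k in range(n_entries):
            members = ped_feats[assign == k]
            if members.numel() > 0:
                bank[k] = members.mean(dim=0)
    return bank

def complement_roi_features(roi_feats: torch.Tensor, bank: torch.Tensor) -> torch.Tensor:
    """Enhance detector RoI features (M, D) with the most relevant bank entries via soft attention."""
    attn = torch.softmax(roi_feats @ bank.t() / roi_feats.size(-1) ** 0.5, dim=-1)   # (M, n_entries)
    return roi_feats + attn @ bank                                                    # residual complement
```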
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), Pattern Recognition
[#288] 2024-03-26 [IEEE TCSVT] Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection (by Sungjune Park, Hyunjun Kim) is accepted in IEEE Trans. on CSVT
Title: Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection
Authors: {Sungjune Park, Hyunjun Kim: equal first authors}, and Yong Man Ro
Large language models (LLMs) have shown their capability in understanding contextual and semantic information regarding the appearance knowledge of instances. In this paper, we introduce a novel approach that utilizes the strength of an LLM in understanding contextual appearance variations and leverages this knowledge in a vision model (here, pedestrian detection). While pedestrian detection is considered one of the crucial tasks directly related to our safety (e.g., intelligent driving systems), it is challenging because of varying appearances and poses in diverse scenes. Therefore, we propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus that includes numerous narratives describing various appearances of pedestrians and others. By feeding them through an LLM, we extract appearance knowledge sets that contain the representations of appearance variations. After that, we perform a task-prompting process to obtain appearance elements, representative appearance knowledge guided to be relevant to the downstream pedestrian detection task. The obtained knowledge elements are adaptable to various detection frameworks, so we can provide plentiful appearance information by integrating the language-derived appearance elements with visual cues within a detector. Through comprehensive experiments with various pedestrian detectors, we verify the adaptability and effectiveness of our method, showing noticeable performance gains and achieving state-of-the-art detection performance on two public pedestrian detection benchmarks (i.e., CrowdHuman and WiderPedestrian).
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), IEEE TCSVT
[#287] 2024-03-12 [Graduate Student Recruitment for Fall 2024] We are recruiting two government-funded M.S. students, one KAIST-funded Ph.D. student, industry-sponsored scholarship students, and others. Interested students are encouraged to email ymro@kaist.ac.kr.
We are recruiting two government-funded M.S. students, one KAIST-funded Ph.D. student, industry-sponsored scholarship students, and others.
Research areas
Vision + LLM (large language model) / LVLM (large vision language model)
Multimodal + LLM / LVLM
Integration of Vision, Language, and Speech/Sound
Interested students are encouraged to email ymro@kaist.ac.kr.
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST)
[#286] 2024-02-27 [CVPR 2024] Causal Mode Multiplexer: A Novel Framework for Unbiased Data (by Taeheon Kim) is accepted in CVPR 2024
Title: Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection
Authors: {Taeheon Kim, Sebin Shin: equal first authors}, Youngjoon Yu, Hak Gu Kim, and Yong Man Ro
RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operations. However, the modality bias problem remains unsolved, as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection are mainly distributed between ROTO (day) and RXTO (night) data, and the majority of the pedestrian labels statistically co-occur with their thermal features. As a result, multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation, such as ROTX data. To address this problem, we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover, we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not present in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new ROTX-MP. We will release our new dataset to the public for future research.
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), CVPR 2024
[#285] 2024-02-27 [CVPR 2024] AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation (by Se Jin Park, Minsu Kim) is accepted in CVPR 2024
Title: AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
Authors: {Jeongsoo Choi, Se Jin Park, Minsu Kim: equal first authors}, and Yong Man Ro
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV brings two key advantages: 1) We can hold real-like conversations with individuals worldwide in a virtual meeting using our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance of training the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, so the speaker in the source audio-visual speech can be maintained in the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting.
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), CVPR 2024
[#284] 2024-02-27 [IEEE TMM] AKVSR: Compressing Audio Knowledge of a Pretrained Model (by Jeong Hun Yeo) is accepted in IEEE Trans. on Multimedia
Title: AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
Authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, and Yong Man Ro
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because lip movements provide insufficient information. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of the visual modality by using the audio modality. Different from previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of the audio knowledge in a compact audio memory by discarding non-linguistic information from the audio through quantization, and 3) includes an Audio Bridging Module that can find the best-matched audio features from the compact audio memory, which makes training possible without audio inputs once the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments and achieve new state-of-the-art performance on the widely-used LRS3 dataset.
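The sketch below illustrates the bridging idea under assumptions: visual features query a small bank of quantized audio features and retrieve the best-matched entries, so no audio input is needed once the memory exists; the attention-style matching and the concatenation are illustrative, not the exact Audio Bridging Module.

```python
import torch
import torch.nn as nn

class AudioBridgeSketch(nn.Module):
    """Retrieve best-matched entries from a compact audio memory using visual features as queries."""
    def __init__(self, audio_memory: torch.Tensor, d_visual: int):
        super().__init__()
        # audio_memory: (K, D_a) quantized linguistic audio features from a pretrained audio model.
        self.register_buffer("memory", audio_memory)
        self.query_proj = nn.Linear(d_visual, audio_memory.size(1))

    def forward(self, visual_feats):                   # visual_feats: (B, T, D_v)
        q = self.query_proj(visual_feats)              # project visual features into the audio space
        attn = torch.softmax(q @ self.memory.t() / q.size(-1) ** 0.5, dim=-1)   # (B, T, K)
        audio_like = attn @ self.memory                # best-matched audio knowledge, no audio input required
        return torch.cat([visual_feats, audio_like], dim=-1)   # audio-complemented features for VSR
```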
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST), IEEE Transactions on Multimedia
[#283] 2024-02-22 Recruitment for PhD and MS Students
Title: Recruitment for PhD and MS Students
The IVY Laboratory is promoting international exchanges. For students applying to join the lab after September 2024, we prefer PhD candidates who are interested in international growth after completing their doctoral program and aim to pursue international opportunities upon graduation. Additionally, for master's degree candidates interested in joining our laboratory, we welcome even those who aspire to pursue a PhD abroad or seek international career paths. We particularly welcome students who already have strong interest and experience in studying and researching deep learning-based approaches. Interested students are encouraged to contact us via email at ymro@kaist.ac.kr.
We look forward to hearing from you.
[#282] 2024-02-21 Prof. Yong Man Ro Named ICT Endowed Chair Professor at KAIST
Title: Prof. Yong Man Ro Named ICT Endowed Chair Professor at KAIST
Prof. Yong Man Ro has been appointed as the ICT Endowed Chair Professor at KAIST. Since establishing the IVY Lab in 1997, Prof. Ro has been instrumental in advancing research in image processing, computer vision, artificial intelligence (AI), and multimedia.
Under his guidance, the IVY Lab has achieved remarkable milestones, including the graduation of 25 PhD and 70 Master's students, who have gone on to make significant contributions to the IT field worldwide. The laboratory's research output is highly competitive, comprising more than 520 peer-reviewed journal articles and top-conference papers.
IMAGE VIDEO SYSTEM (IVY.) KOREA ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY (KAIST)