Notice

[#252]   2022-11-22 [AAAI 2023]    Deep Visual Forced Alignment (by Minsu Kim) is accepted in AAAI 2023

Title: Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

Authors: Minsu Kim, Chae Won Kim, and Yong Man Ro 

Forced alignment refers to a technology that time-aligns a given transcription with the corresponding speech. However, since forced alignment technologies have been developed using speech audio, they may fail when the input audio is corrupted by noise or is not accessible. We focus on the fact that speech can also be inferred from another source: the speech video (i.e., talking face video). Since visual information can complement audio-based forced alignment when the audio signal is in poor condition, we aim to develop a novel video-based forced alignment method. However, unlike audio forced alignment, building a reliable visual forced alignment technology is challenging for two reasons: 1) Visual Speech Recognition (VSR) performs much worse than audio-based Automatic Speech Recognition (ASR), and 2) the translation from text to video is not reliable, so the methods typically used to build audio forced alignment cannot be utilized for visual forced alignment. To alleviate these challenges, in this paper, we propose a new method appropriate for visual forced alignment, namely Deep Visual Forced Alignment (DVFA). The proposed DVFA can align the input transcription (i.e., sentence) with the talking face video without accessing the speech audio. Moreover, by augmenting the alignment task with anomaly case detection, DVFA can detect mismatches between the input transcription and the input video while performing the alignment. Therefore, we can robustly align the text with the talking face video even if the text contains erroneous words. Through extensive experiments, we show the effectiveness of the proposed DVFA not only in the alignment task but also in interpreting the outputs of VSR models.
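
A minimal sketch of what a visual forced aligner outputs, assuming hypothetical frame-level token posteriors from a VSR front-end. This is a generic monotonic dynamic-programming alignment, shown only to illustrate the task interface; it is not the DVFA architecture, and the function and variable names are placeholders.

# Generic illustration of forced alignment (not the DVFA model): given frame-level
# log-posteriors over tokens and the transcription token ids, find the best
# monotonic assignment of tokens to frames and report each token's frame span.
# Assumes at least as many frames as tokens.
import numpy as np

def forced_align(log_probs: np.ndarray, token_ids: list[int]):
    """log_probs: (T frames, V tokens) log-posteriors; token_ids: transcription."""
    T, _ = log_probs.shape
    N = len(token_ids)
    dp = np.full((T, N), -np.inf)            # dp[t, n]: best score using frames 0..t, tokens 0..n
    back = np.zeros((T, N), dtype=int)       # 0 = stay on the same token, 1 = advance to next token
    dp[0, 0] = log_probs[0, token_ids[0]]
    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = dp[t - 1, n]
            move = dp[t - 1, n - 1] if n > 0 else -np.inf
            back[t, n] = int(move > stay)
            dp[t, n] = max(stay, move) + log_probs[t, token_ids[n]]
    # backtrack to recover the (start, end) frame span of each token
    spans, n = [[T - 1, T - 1] for _ in range(N)], N - 1
    for t in range(T - 1, 0, -1):
        if back[t, n]:
            spans[n][0] = t                  # token n starts at frame t
            n -= 1
            spans[n][1] = t - 1              # previous token ends at frame t - 1
    spans[0][0] = 0
    return dp[T - 1, N - 1], spans           # alignment score and per-token frame spans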

[#251]   2022-11-22 [AAAI 2023]    Multispectral Invisible Coating (by Taeheon Kim) is accepted in AAAI 2023

Title: Multispectral Invisible Coating: Laminated Visible-Thermal Physical Attack against Multispectral Object Detectors using Transparent Low-e films

Authors: Taeheon Kim,  Youngjoon Yu, and Yong Man Ro

Multispectral object detection plays a vital role in safety-critical vision systems that require around-the-clock operation and encounter dynamic real-world situations (e.g., self-driving cars and autonomous surveillance systems). Despite its crucial role in safety-related applications, its security against physical attacks is severely understudied. We investigate the vulnerability of multispectral detectors against physical attacks by proposing a new physical method: Multispectral Invisible Coating. Utilizing transparent Low-e films, we realize a laminated visible-thermal physical attack by attaching Low-e films over a visible attack printing. Moreover, we apply our physical method to manufacture a Multispectral Invisible Suit that hides persons from multiple view angles of multispectral detectors. To simulate our attack under various surveillance scenes, we constructed a large-scale multispectral pedestrian dataset, which we will release publicly. Extensive experiments show that our proposed method effectively attacks state-of-the-art multispectral detectors both in digital space and in the physical world.

[#250]   2022-10-06 [IEEE TIP]   Defending Person Detection Against Adversarial Patch Attack by using Universal Defensive Frame (by Youngjoon Yu and Hong Joo Lee) is accepted in IEEE Transactions on Image Processing 

Title: Defending Person Detection Against Adversarial Patch Attack by using Universal Defensive Frame

Authors: Youngjoon Yu*, Hong Joo Lee*, Hakmin Lee, and Yong Man Ro (*: equally contributed)

Person detection has attracted great attention in the computer vision field and is an imperative element of human-centric computer vision. Although the predictive performance of person detection networks has improved dramatically, they are vulnerable to adversarial patch attacks. Changing the pixels in a restricted region can easily fool person detection networks in safety-critical applications such as autonomous driving and security systems. Despite the necessity of countering adversarial patch attacks, very few efforts have been dedicated to defending person detection against them. In this paper, we propose a novel defense strategy that counters adversarial patch attacks by optimizing a defensive frame for person detection. The defensive frame alleviates the effect of the adversarial patch while maintaining person detection performance on clean images. The proposed defensive frame is generated with a competitive learning algorithm that sets up an iterative competition between a detection-threatening module and a detection-shielding module in person detection. Comprehensive experimental results demonstrate that the proposed method effectively defends person detection against adversarial patch attacks.
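
A minimal sketch of the competitive optimization described above, assuming the defensive frame is a learnable border composited around the input image. Here detector, detection_loss, and loader are hypothetical placeholders (a frozen, pre-trained person detector, its detection loss, and a data loader); the paper's exact compositing, losses, and schedule may differ.

# Hedged sketch: alternate between a detection-threatening step (the adversarial
# patch maximizes the detection loss) and a detection-shielding step (the universal
# defensive frame minimizes it on patched inputs). detector, detection_loss, and
# loader are assumed placeholders; the detector's weights are assumed frozen.
import torch

patch = torch.rand(3, 64, 64, requires_grad=True)        # adversarial patch (attacker)
frame = torch.zeros(3, 416, 416, requires_grad=True)     # universal defensive frame (defender)
opt_attack = torch.optim.Adam([patch], lr=1e-2)
opt_defend = torch.optim.Adam([frame], lr=1e-2)

def apply_patch(images, patch):
    out = images.clone()
    out[:, :, 176:240, 176:240] = patch.clamp(0, 1)      # fixed placement, for brevity
    return out

def apply_frame(images, frame, width=16):
    out, border = images.clone(), frame.clamp(0, 1)
    out[:, :, :width, :] = border[:, :width, :]          # overwrite only the image border
    out[:, :, -width:, :] = border[:, -width:, :]
    out[:, :, :, :width] = border[:, :, :width]
    out[:, :, :, -width:] = border[:, :, -width:]
    return out

for images, targets in loader:
    # detection-threatening step: the patch tries to break the (frozen) detector
    loss_atk = -detection_loss(detector(apply_frame(apply_patch(images, patch), frame.detach())), targets)
    opt_attack.zero_grad(); loss_atk.backward(); opt_attack.step()
    # detection-shielding step: the frame tries to restore detection on patched inputs
    loss_def = detection_loss(detector(apply_frame(apply_patch(images, patch.detach()), frame)), targets)
    opt_defend.zero_grad(); loss_def.backward(); opt_defend.step()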


[#249]   2022-10-06 [IEEE TBIOM]    Face Shape-Guided Deep Feature Alignment for Face Recognition Robust to Face Misalignment (by Hyung-Il Kim) is accepted in IEEE Transactions on Biometrics, Behavior, and Identity Science

Title: Face Shape-Guided Deep Feature Alignment for Face Recognition Robust to Face Misalignment


Authors: Hyung-Il Kim, Kimin Yun, and Yong Man Ro


For the past decades, face recognition (FR) has been actively studied in the computer vision and pattern recognition communities. Recently, thanks to advances in deep learning, FR technology has shown high performance on most benchmark datasets. However, when an FR algorithm is applied to real-world scenarios, its performance is known to be still unsatisfactory. This is mainly attributed to the mismatch between training and testing sets. Among such mismatches, face misalignment between training and testing faces is one of the factors that hinder successful FR. To address this limitation, we propose a face shape-guided deep feature alignment framework for FR that is robust to face misalignment. Based on a face shape prior (e.g., face keypoints), we train the proposed deep network by introducing alignment processes, i.e., pixel and feature alignments, between well-aligned and misaligned face images. Through the pixel alignment process, which decodes the aggregated feature extracted from a face image and the face shape prior, we add an auxiliary task to reconstruct the well-aligned face image. Since the aggregated features are linked to the face feature extraction network as a guide via the feature alignment process, we learn face features that are robust to face misalignment. Although face shape estimation is required in the training stage, the additional face alignment process usually incorporated in the conventional FR pipeline is not needed in the testing phase. Through comparative experiments on FR datasets, we validate the effectiveness of the proposed method under face misalignment.
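
A minimal sketch of the two alignment objectives described above: an auxiliary pixel-alignment branch that reconstructs the well-aligned face from the misaligned image plus the shape prior, and a feature-alignment loss that pulls the misaligned face feature toward the well-aligned one. The module names, loss choices, and weights are illustrative assumptions, not the paper's exact formulation.

# Hedged sketch of one training step. face_net, shape_net, decoder, and id_criterion
# (e.g., a softmax/margin-based identification head) are assumed placeholders.
import torch
import torch.nn.functional as F

def training_step(face_net, shape_net, decoder, aligned_img, misaligned_img, keypoints, label, id_criterion):
    feat_aligned = face_net(aligned_img).detach()         # reference feature from the well-aligned face
    feat_misaligned = face_net(misaligned_img)            # feature to be made robust to misalignment
    shape_feat = shape_net(keypoints)                     # face shape prior (e.g., keypoints)

    # pixel alignment: reconstruct the well-aligned face from the aggregated feature
    recon = decoder(torch.cat([feat_misaligned, shape_feat], dim=1))
    loss_pixel = F.l1_loss(recon, aligned_img)

    # feature alignment: the misaligned feature should match the well-aligned one
    loss_feat = F.mse_loss(feat_misaligned, feat_aligned)

    # standard identification loss on the misaligned branch
    loss_id = id_criterion(feat_misaligned, label)

    return loss_id + 1.0 * loss_pixel + 1.0 * loss_feat   # loss weights are illustrative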

[#248]   2022-09-25 Spring 2023 Student Recruitment

Congratulations to the students admitted for Spring 2023. For the Spring 2023 intake, our lab is recruiting two government-funded M.S. students, KAIST M.S. students, and industry-sponsored students (KEPSI, EPSS, LGenius).


Research areas for recruitment:

 - Deep learning-based artificial intelligence (explainable AI, competence-aware AI, robust AI)

 - Machine learning with multi-modal data

 - Computer vision

 - Multi-modal (vision-sound-language) co-learning


Recent deep learning-related international conference publications by our M.S./Ph.D. students - LINK

Recent international journal publications by our M.S./Ph.D. students - LINK

 

For inquiries about admission to the lab, please email Prof. Yong Man Ro (ymro@kaist.ac.kr) to arrange a meeting.

[#247]   2022-07-12 [ACM Multimedia 2022]    Defending Physical Adversarial Attack on Object Detection via Adversarial Patch-Feature Energy (by Taeheon Kim) is accepted in ACM Multimedia 2022

Title: Defending Physical Adversarial Attack on Object Detection via Adversarial Patch-Feature Energy

Authors: Taeheon Kim, Youngjoon Yu, and Yong Man Ro

Object detection plays an important role in security-critical systems such as autonomous vehicles but has been shown to be vulnerable to adversarial patch attacks. Existing defense methods against adversarial patches are restricted to localized noise attacks, as they remove noisy regions in the input image. However, adversarial patches have developed into natural-looking patterns that evade existing defenses. To address this issue, we propose a defense method based on a novel concept, “Adversarial Patch-Feature Energy” (APE), which exploits common deep feature characteristics of adversarial patches. Our proposed defense consists of APE-masking and APE-refinement, which can be employed to defend against any adversarial patch in the literature. Extensive experiments demonstrate that the APE-based defense achieves impressive robustness against adversarial patches both in digital space and in the physical world.
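
A minimal sketch of a feature-energy-based masking step in the spirit of APE-masking, assuming the "energy" is the channel-wise norm of an intermediate detector feature map and that anomalously high-energy locations are suppressed. The layer choice, normalization, and threshold are assumptions, not the paper's exact recipe.

# Hedged sketch: score each spatial location by the norm of a backbone feature map,
# build a keep-mask that zeros out suspiciously high-energy regions, and upsample it
# to the input resolution so it can mask the image (or feature) before detection.
import torch
import torch.nn.functional as F

def ape_style_mask(feat: torch.Tensor, image_size, percentile: float = 0.98):
    """feat: (B, C, H, W) intermediate feature map from the detector backbone."""
    energy = feat.norm(dim=1, keepdim=True)                          # (B, 1, H, W) patch-feature energy
    thresh = torch.quantile(energy.flatten(1), percentile, dim=1)    # per-image energy threshold
    keep = (energy <= thresh.view(-1, 1, 1, 1)).float()              # 0 where energy is anomalously high
    return F.interpolate(keep, size=image_size, mode="nearest")      # mask at input resolution

# usage (illustrative): masked_image = image * ape_style_mask(backbone_feat, image.shape[-2:])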

[#246]   2022-07-04 [ECCV 2022]    Speaker-adaptive Lip Reading with User-dependent Padding (by Minsu Kim) is accepted in ECCV 2022

Title: Speaker-adaptive Lip Reading with User-dependent Padding

Authors: Minsu Kim, Hyunjun Kim, and Yong Man Ro 

Lip reading aims to predict speech based on lip movements alone. As it relies on visual information to model speech, its performance is inherently sensitive to personal lip appearances and movements. This causes lip reading models to show degraded performance when applied to unseen speakers, due to the mismatch between training and testing conditions. Speaker adaptation techniques aim to reduce this mismatch between train and test speakers, thus guiding a trained model to focus on modeling the speech content without being disturbed by speaker variations. In contrast to the decades of effort made in audio-based speech recognition, speaker adaptation methods have not been well studied in lip reading. In this paper, to remedy the performance degradation of lip reading models on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding. The user-dependent padding is a speaker-specific input that can participate in the visual feature extraction stage of a pre-trained lip reading model. Therefore, the lip appearance and movement information of different speakers can be considered during visual feature encoding, adaptively for individual speakers. Moreover, the proposed method does not need 1) any additional layers, 2) modification of the learned weights of the pre-trained model, or 3) speaker labels for the training data used during pre-training. It can directly adapt to unseen speakers by learning the user-dependent padding only, in a supervised or unsupervised manner. Finally, to alleviate the lack of speaker information in public lip reading databases, we label the speakers of a well-known audio-visual database, LRW, and design an unseen-speaker lip reading scenario named LRW-ID. The effectiveness of the proposed method is verified on sentence- and word-level lip reading, and we show it can further improve the performance of a well-trained model with large speaker variations.
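
A minimal sketch of the user-dependent padding idea, assuming a 2D convolutional visual front-end: the pre-trained conv weights stay frozen, and only a speaker-specific tensor placed where the layer would normally use zero padding is learned. The wrapper class, shapes, and symmetric-padding assumption are illustrative, not the paper's exact implementation.

# Hedged sketch: replace a frozen pre-trained conv layer's zero padding with a
# learnable, speaker-specific border. Assumes symmetric, non-zero padding and a
# fixed feature-map size. Only pad_param is updated during speaker adaptation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserDependentPaddingConv(nn.Module):
    def __init__(self, pretrained_conv: nn.Conv2d, feat_size: tuple[int, int]):
        super().__init__()
        self.conv = pretrained_conv
        for p in self.conv.parameters():
            p.requires_grad_(False)                     # pre-trained weights stay fixed
        self.pad = self.conv.padding[0]                 # assumes symmetric padding > 0
        h, w = feat_size
        # learnable canvas; only its border region acts as the speaker-specific padding
        self.pad_param = nn.Parameter(torch.zeros(1, self.conv.in_channels, h + 2 * self.pad, w + 2 * self.pad))

    def forward(self, x):                               # x: (B, C, h, w)
        canvas = self.pad_param.expand(x.size(0), -1, -1, -1).clone()
        canvas[:, :, self.pad:-self.pad, self.pad:-self.pad] = x   # interior = real features
        return F.conv2d(canvas, self.conv.weight, self.conv.bias,
                        stride=self.conv.stride, padding=0,        # padding already applied
                        dilation=self.conv.dilation, groups=self.conv.groups)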

[#245]   2022-07-04 [ECCV 2022]    VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection (by Joanna Hong) is accepted in ECCV 2022

Title: VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

Authors: Joanna Hong, Minsu Kim, and Yong Man Ro 

The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance in synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge in video-to-speech synthesis and become even more critical in unseen-speaker settings. Distinct from previous methods, our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to independently focus on modeling the two representations, we can obtain highly intelligible speech from the model even when the input video of an unseen subject is given. To this end, we introduce a speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style-based synthesizer, which generates speech by coating the visage-style while maintaining the speech content. Thus, the proposed framework brings the advantage of synthesizing speech with the right content even when a silent talking face video of an unseen subject is given. We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT volunteer, and LRW datasets.
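
A minimal sketch of one way a selection module could split visual features into a speech-content stream and a visage-style (speaker identity) stream, using learned soft gates. The gating formulation, dimensions, and time-averaging of the style are assumptions for illustration, not the paper's exact module.

# Hedged sketch: soft selection weights route each visual feature dimension toward
# either the time-varying speech content or the time-averaged visage style, so a
# downstream synthesizer can consume the two representations separately.
import torch
import torch.nn as nn

class SpeechVisageSelection(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual_feat):                         # (B, T, dim) per-frame visual features
        w = self.gate(visual_feat)                          # soft selection weights in [0, 1]
        speech_content = w * visual_feat                    # time-varying content representation
        visage_style = ((1 - w) * visual_feat).mean(dim=1)  # time-averaged speaker-style representation
        return speech_content, visage_style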


[#244]   2022-07-04 [ECCV 2022]    Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment (by Sangmin Lee) is accepted in ECCV 2022

Title: Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment

Authors: Sangmin Lee, Sungjune Park, and Yong Man Ro 

Retrieving desired videos using natural language queries has attracted increasing attention in research and industry as a huge number of videos appear on the internet. Natural language queries made by humans vary greatly and often include details related to audio cues. Some existing methods have attempted to address this video retrieval problem by exploiting multi-modal information, especially the audio-visual data of videos. However, many videos have mismatched visual and audio cues for several reasons, including background music, noise, and even missing sound. Therefore, the naive fusion of such mismatched visual and audio cues can negatively affect the semantic embedding of video scenes when retrieving videos from text queries. The mismatch condition can be categorized into two cases: (i) the audio itself does not exist, and (ii) the audio exists but does not match the visual content. To deal with (i), we introduce an audio-visual associative memory (AVA-Memory) that associates audio cues even for videos without audio data. The audio cues associated from visual data can guide the video embedding feature to be aware of audio information even when the audio is missing. To address (ii), we propose audio embedding adjustment, which considers the degree of matching between visual and audio data. In this procedure, the constructed AVA-Memory makes it possible to estimate how well the visual and audio in a video match and to adjust the weighting between the actual audio and the associated audio. Experimental results show that the proposed method outperforms other state-of-the-art video retrieval methods. Further, we validate the effectiveness of the proposed network designs with analyses.
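
A minimal sketch of the two ideas above: addressing a learned memory with a visual embedding to recall an associated audio embedding (usable even when a clip has no audio), and blending actual and associated audio embeddings according to how well they match. The memory size, similarity function, and blending rule are illustrative assumptions, not the paper's exact design.

# Hedged sketch of an audio-visual associative memory with match-based adjustment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVAMemorySketch(nn.Module):
    def __init__(self, slots: int = 256, dim: int = 512):
        super().__init__()
        self.visual_keys = nn.Parameter(torch.randn(slots, dim))
        self.audio_values = nn.Parameter(torch.randn(slots, dim))

    def associate(self, visual_emb):                            # (B, dim) -> (B, dim)
        attn = F.softmax(visual_emb @ self.visual_keys.t(), dim=-1)
        return attn @ self.audio_values                         # audio embedding recalled from visuals

    def adjust(self, visual_emb, audio_emb):
        associated = self.associate(visual_emb)
        # matching degree between the clip's own audio and the visually associated audio
        match = torch.sigmoid(F.cosine_similarity(audio_emb, associated, dim=-1)).unsqueeze(-1)
        return match * audio_emb + (1 - match) * associated     # trust real audio only when it matches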

[#243]   2022-06-15 [Interspeech 2022]     Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition (by Joanna Hong and Minsu Kim) is accepted in Interspeech 2022

Title: Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition

Authors: Joanna Hong*, Minsu Kim*, Daehun Yoo, and Yong Man Ro (* equally contributed) 

This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system. To this end, we propose a Visual Context-driven Audio Feature Enhancement module (V-CAFE) that enhances the input noisy audio speech with the help of audio-visual correspondence. The proposed V-CAFE is designed to capture the transition of lip movements, namely visual context, and to generate a noise reduction mask by considering the obtained visual context. Through context-dependent modeling, the ambiguity in viseme-to-phoneme mapping can be refined for mask generation. The noisy representations are masked with the noise reduction mask, resulting in enhanced audio features. The enhanced audio features are fused with the visual features and fed to an encoder-decoder model composed of a Conformer and a Transformer for speech recognition. We show that the proposed end-to-end AVSR with V-CAFE can further improve the noise-robustness of AVSR. The effectiveness of the proposed method is evaluated in noisy speech recognition and overlapped speech recognition experiments using the two largest audio-visual datasets, LRS2 and LRS3.
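
A minimal sketch of a visual-context-driven enhancement step in the spirit described above: summarize lip-motion context with a temporal convolution, combine it with the noisy audio feature, and predict a sigmoid noise-reduction mask applied to the audio feature before audio-visual fusion. Layer sizes, the kernel width, and the fusion step are illustrative assumptions.

# Hedged sketch: visual context -> noise-reduction mask -> enhanced audio features.
# Assumes audio and visual features are already at the same frame rate.
import torch
import torch.nn as nn

class VisualContextAudioEnhancer(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.visual_context = nn.Conv1d(dim, dim, kernel_size=5, padding=2)   # lip-movement transitions
        self.mask_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio_feat, visual_feat):        # both (B, T, dim)
        ctx = self.visual_context(visual_feat.transpose(1, 2)).transpose(1, 2)
        mask = self.mask_head(torch.cat([audio_feat, ctx], dim=-1))           # noise-reduction mask
        return audio_feat * mask                       # enhanced audio features, ready for fusion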

[#242]   2022-03-04 [CVPR 2022]    Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network (by Byung-Kwan Lee and Junho Kim) is accepted in CVPR 2022

Title: Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network

Authors: Byung-Kwan Lee*, Junho Kim*, Yong Man Ro (*: equally contributed) 

Adversarial examples provoke weak reliability and potential security issues in deep neural networks. Although adversarial training has been widely studied to improve adversarial robustness, it works in an over-parameterized regime and requires high computation and large memory budgets. To bridge adversarial robustness and model compression, we propose a novel adversarial pruning method, Masking Adversarial Damage (MAD), that employs second-order information of the adversarial loss function. Using it, we can accurately estimate the adversarial saliency of model parameters and determine which parameters can be pruned without weakening adversarial robustness. Furthermore, we reveal that the model parameters of the initial layers are highly sensitive to adversarial examples and show that the compressed feature representation retains semantic information for the target objects. Through extensive experiments on three public datasets, we demonstrate that MAD effectively prunes adversarially trained networks without losing adversarial robustness and shows better performance than previous adversarial pruning methods.
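
A minimal sketch of second-order saliency pruning in the spirit described above, approximating the Hessian diagonal of the adversarial loss with squared gradients (an empirical-Fisher-style approximation) and removing the lowest-scoring weights. MAD's exact saliency and masking procedure may differ; adv_loader and adv_loss_fn are assumed to provide pre-generated adversarial examples and the adversarial loss.

# Hedged sketch: score = grad(adv_loss)^2 * weight^2 (diagonal second-order proxy),
# then zero out the lowest-scoring fraction of weights in all weight matrices.
import torch

def adversarial_saliency_prune(model, adv_loss_fn, adv_loader, prune_ratio=0.5, device="cuda"):
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for x_adv, y in adv_loader:                              # pre-generated adversarial examples
        model.zero_grad()
        adv_loss_fn(model(x_adv.to(device)), y.to(device)).backward()
        for n, p in model.named_parameters():
            if n in scores and p.grad is not None:
                scores[n] += p.grad.pow(2) * p.detach().pow(2)   # diag-Hessian proxy x weight^2
    all_scores = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(prune_ratio * all_scores.numel()))
    threshold = all_scores.kthvalue(k).values                # score of the k-th smallest weight
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in scores:
                p.mul_((scores[n] > threshold).float())      # zero out low-saliency weights
    return model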

[#241]   2022-03-04 [CVPR 2022]    Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory (by Sangmin Lee) is accepted in CVPR 2022

Title: Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory

Authors: Sangmin Lee, Hyung-Il Kim, and Yong Man Ro 

Data representation learning without labels has attracted increasing attention because it does not require human annotation. Recently, as data samples are acquired in multi-sensory environments, representation learning has been extended to bimodal data, especially sound and image, which are closely related to basic human senses. Existing sound and image representation learning methods necessarily require a large number of corresponding sound-image pairs. Therefore, it is difficult to ensure their effectiveness in the weakly paired condition, where paired bimodal data are scarce. In fact, according to human cognitive studies, the cognitive functions in the human brain for a certain modality can be enhanced by receiving other modalities, even ones that are not directly paired. Based on this observation, we pose a new problem for the weakly paired condition: how to boost the representation of a certain modality even by using other, unpaired modal data. To address this issue, we introduce a novel bimodal associative memory (BMA-Memory) with key-value switching that can store bimodal features in sound-image sub-memories and naturally associate them with one another. BMA-Memory makes it possible to build sound-image associations with a small amount of paired bimodal data and to strengthen the built associations with large amounts of easily obtainable unpaired data. Through the proposed associative learning, it is possible to reinforce the representation of a certain modality (e.g., sound) even by using other, unpaired modal data (e.g., images).
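
A minimal sketch of a bimodal memory with key-value switching: two sub-memories share slot-wise correspondence, and whichever modality is available acts as the key to recall the other modality's stored representation. Slot count, dimensions, and the addressing rule are illustrative assumptions, not the paper's exact memory design.

# Hedged sketch: the query modality's sub-memory serves as keys, the other sub-memory
# as values, so either sound or image can address the memory and associate the
# cross-modal representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BMAMemorySketch(nn.Module):
    def __init__(self, slots: int = 512, dim: int = 256):
        super().__init__()
        self.sound_mem = nn.Parameter(torch.randn(slots, dim))
        self.image_mem = nn.Parameter(torch.randn(slots, dim))

    def forward(self, query, query_modality: str):      # query: (B, dim)
        key, value = ((self.sound_mem, self.image_mem) if query_modality == "sound"
                      else (self.image_mem, self.sound_mem))   # key-value switching
        attn = F.softmax(query @ key.t(), dim=-1)        # address the slots with the query
        return attn @ value                              # associated cross-modal representation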

[#240]   2022-01-24 [ICASSP 2022]    2 papers have been accepted (Taeheon and Sungjune) in IEEE ICASSP 2022

1. Authors: Taeheon Kim, Hong Joo Lee, and Yong Man Ro

Title: MAP: Multispectral Adversarial Patch to Attack Person Detection

Recently, multispectral person detection has shown great performance in real-world applications such as autonomous driving and security systems. However, the reliability of person detection against physical attacks has not yet been fully explored for multispectral person detectors. To evaluate the robustness of multispectral person detectors in the physical world, we propose a novel Multispectral Adversarial Patch (MAP) generation framework. MAP is optimized with a Cross-spectral Mapping (CSM) loss and a Material Emissivity (ME) loss. This paper is the first to evaluate the reliability of a multispectral person detector against physical attacks. Throughout the experiments, our proposed adversarial patch successfully attacks the person detector, dropping the Average Precision (AP) score by 90.79% in digital space and 73.34% in physical space.
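
A minimal sketch of optimizing a patch pair against a multispectral detector, shown only to illustrate the general setup: a visible patch and a thermal patch are pasted onto the corresponding modality inputs and updated to suppress person confidences. The paper's CSM and ME losses are not reproduced here; detector and loader are hypothetical placeholders (a frozen multispectral person detector returning per-candidate person confidences, and a data loader).

# Hedged sketch: generic detection-suppression objective for a visible/thermal patch pair.
import torch

vis_patch = torch.rand(3, 80, 80, requires_grad=True)        # visible-spectrum patch
thermal_patch = torch.rand(1, 80, 80, requires_grad=True)    # thermal-spectrum patch
optimizer = torch.optim.Adam([vis_patch, thermal_patch], lr=1e-2)

def paste(img, patch, y=100, x=100):
    out = img.clone()
    out[:, :, y:y + patch.shape[-2], x:x + patch.shape[-1]] = patch.clamp(0, 1)
    return out

for vis_img, thermal_img, _ in loader:                        # assumed multispectral data loader
    scores = detector(paste(vis_img, vis_patch), paste(thermal_img, thermal_patch))
    loss = scores.max(dim=-1).values.mean()                   # push down the highest person confidence
    optimizer.zero_grad(); loss.backward(); optimizer.step()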


2. Authors: Sungjune Park, Dae Hwi Choi, Jung Uk Kim, and Yong Man Ro

Title: Robust Thermal Infrared Pedestrian Detection by Associating Visible Pedestrian Knowledges

Recently, pedestrian detection on thermal infrared images has shown robust detection performance. In this paper, we propose a novel thermal infrared pedestrian detection framework that can associate and utilize complementary pedestrian knowledge from visible images. Motivated by the fact that humans associate useful information from other senses to make more reliable decisions, we devise a Visible-sensory Pedestrian Associating (VPA) Memory to conduct robust pedestrian detection by explicitly utilizing complementary visible-sensory pedestrian knowledge. The VPA Memory is trained to store the pedestrian information of visible images and associate it with given thermal infrared pedestrian knowledge via memory associating learning. We verify the effectiveness of the proposed framework through extensive experiments, and it achieves state-of-the-art pedestrian detection performance on thermal infrared images.