Notice

[#239]   2021-12-02 [AAAI 2022]    Distinguishing Homophenes using Multi-head Visual-audio Memory for Lip Reading (by Minsu Kim) is accepted in AAAI 2022

Title: Distinguishing Homophenes using Multi-head Visual-audio Memory for Lip Reading


Authors: Minsu Kim, Jeong Hun Yeo, and Yong Man Ro

 

Recognizing speech from silent lip movement, known as lip reading, is a challenging task due to 1) the inherent insufficiency of lip movement for fully representing speech, and 2) the existence of homophenes, which have similar lip movements but different pronunciations. In this paper, we try to alleviate these two challenges by proposing a Multi-head Visual-audio Memory (MVM). First, MVM is trained with an audio-visual dataset and remembers audio representations by modelling the inter-relationships of paired audio-visual representations. At the inference stage, visual input alone can extract the saved audio representation from the memory by examining the learned inter-relationships. Therefore, the lip reading model can complement the insufficient visual information with the extracted audio representations. Second, MVM is composed of multi-head key memories for saving visual features and one value memory for saving audio knowledge, a design intended to distinguish homophenes. With the multi-head key memories, MVM extracts possible candidate audio features from the memory, which allows the lip reading model to consider which pronunciations the input lip movement could represent. This can also be viewed as an explicit implementation of the one-to-many viseme-to-phoneme mapping. Moreover, MVM is employed at multiple temporal levels to consider context when retrieving from the memory and to distinguish homophenes. Extensive experimental results verify the effectiveness of the proposed method in lip reading and in distinguishing homophenes.
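
To make the multi-head key addressing concrete, here is a minimal PyTorch sketch of the retrieval step only. The module name, slot count, temperature, and the averaging of per-head candidates are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadKeyValueMemory(nn.Module):
    """Illustrative sketch: multi-head key memories address one shared value memory."""
    def __init__(self, num_slots=96, dim=512, num_heads=4):
        super().__init__()
        # Each head keeps its own key memory for visual features.
        self.keys = nn.Parameter(torch.randn(num_heads, num_slots, dim) * 0.02)
        # A single value memory stores audio representations.
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, visual_feat):                        # visual_feat: (B, dim)
        # Addressing: cosine similarity between the visual query and each head's keys.
        q = F.normalize(visual_feat, dim=-1)               # (B, dim)
        k = F.normalize(self.keys, dim=-1)                 # (H, S, dim)
        attn = torch.softmax(torch.einsum('bd,hsd->bhs', q, k) / 0.1, dim=-1)
        # Each head reads a candidate audio representation from the shared value memory.
        candidates = torch.einsum('bhs,sd->bhd', attn, self.values)   # (B, H, dim)
        # One simple choice: average the per-head candidates before fusing with the visual feature.
        return candidates, candidates.mean(dim=1)

mvm = MultiHeadKeyValueMemory()
cands, recalled_audio = mvm(torch.randn(2, 512))
print(cands.shape, recalled_audio.shape)   # torch.Size([2, 4, 512]) torch.Size([2, 512])
```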

[#238]   2021-12-02 [AAAI 2022]    SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory (by Se Jin Park) is accepted in AAAI 2022

Title: SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory


Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, and Yong Man Ro


The challenge of talking face generation from speech lies in aligning two different modalities, audio and video, such that the mouth region corresponds to the input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize fine details of the lips that vary at the phoneme level, as they do not provide sufficient visual information about the lips at the video synthesis step. To overcome this limitation, our work proposes Audio-Lip Memory, which brings in visual information of the mouth region corresponding to the input audio and enforces fine-grained audio-visual coherence. It stores lip motion features from sequential ground-truth images in the value memory and aligns them with the corresponding audio features so that they can be retrieved using audio input at inference time. Therefore, using the retrieved lip motion features as visual hints, the model can easily correlate audio with visual dynamics in the synthesis step. By analyzing the memory, we demonstrate that unique lip features are stored in each memory slot at the phoneme level, capturing subtle lip motion through memory addressing. In addition, we introduce a visual-visual synchronization loss which, used along with the audio-visual synchronization loss in our model, further enhances lip-syncing performance. Extensive experiments verify that our method generates high-quality video with mouth shapes that best align with the input audio, outperforming previous state-of-the-art methods.
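
As a rough illustration of how an audio-visual synchronization loss and the additional visual-visual synchronization loss could be combined, the sketch below uses a symmetric InfoNCE-style contrastive term; the embedding names and the specific contrastive form are assumptions rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (B, D); positives are paired rows."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                       # (B, B)
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical embeddings: audio, memory-retrieved lip features, and ground-truth lip features.
audio_emb = torch.randn(8, 256)
retrieved_lip_emb = torch.randn(8, 256)
gt_lip_emb = torch.randn(8, 256)

# Audio-visual sync loss plus an additional visual-visual sync loss, as described above.
loss = info_nce(audio_emb, gt_lip_emb) + info_nce(retrieved_lip_emb, gt_lip_emb)
print(loss.item())
```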

[#237]   2021-11-09 [AAAI 2022]    Towards Versatile Pedestrian Detector with Multisensory-Matching and Multispectral Recalling Memory (by Jung Uk Kim) is accepted in AAAI 2022

Title: Towards Versatile Pedestrian Detector with Multisensory-Matching and Multispectral Recalling Memory


Authors: Jung Uk Kim, Sungjune Park, and Yong Man Ro


Recent automated surveillance cameras can switch between a visible sensor and a thermal sensor for all-day operation. However, existing single-modal pedestrian detectors mainly focus on detecting pedestrians in only one specific modality (i.e., visible or thermal), so they cannot effectively cope with the other modal input. In addition, recent multispectral pedestrian detectors have shown remarkable performance by adopting multispectral modalities, but they also have limitations in practical applications (e.g., different Field-of-View (FoV) and frame rates). In this paper, we introduce a versatile pedestrian detector that shows robust detection performance in any single modality. We propose a multisensory-matching contrastive loss to reduce the difference between the visual representations of pedestrians in the visible and thermal modalities. Moreover, to make the proposed method perform robust detection on a single modality, we design a Multispectral Recalling (MSR) Memory. The MSR Memory enhances the visual representation of single-modal features by recalling that of the multispectral modalities. To guide the MSR Memory to store the contexts of the multispectral modalities, we introduce a multispectral recalling loss. It enables the pedestrian detector to encode more discriminative features from a single input modality. We believe our method is a step toward a detector that can be applied to a variety of real-world applications. Comprehensive experimental results verify the effectiveness of the proposed method.
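
A minimal sketch of what a multisensory-matching contrastive loss could look like is given below, pulling visible and thermal features of the same pedestrian together and pushing mismatched pairs apart by a margin; the cosine-distance form, margin value, and tensor shapes are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def multisensory_matching_loss(vis_feat, thr_feat, same_id, margin=0.5):
    """Sketch: pull visible/thermal features of the same pedestrian together,
    push features of different pedestrians at least `margin` apart (cosine distance)."""
    vis = F.normalize(vis_feat, dim=-1)
    thr = F.normalize(thr_feat, dim=-1)
    dist = 1.0 - (vis * thr).sum(dim=-1)           # cosine distance per pair, (N,)
    pos = dist[same_id]                             # matched visible-thermal pairs
    neg = F.relu(margin - dist[~same_id])           # mismatched pairs
    return pos.mean() + neg.mean()

# Hypothetical RoI features for 6 pedestrian proposals in each modality.
vis_feat = torch.randn(6, 128)
thr_feat = torch.randn(6, 128)
same_id = torch.tensor([True, True, True, False, False, False])
print(multisensory_matching_loss(vis_feat, thr_feat, same_id).item())
```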


[#236]   2021-12-02 [IEEE TIP]    Robust Perturbation for Visual Explanation: Cross-checking Mask Optimization to Avoid Class Distortion (by Junho Kim) is accepted in IEEE Transactions on Image Processing

Title: Robust Perturbation for Visual Explanation: Cross-checking Mask Optimization to Avoid Class Distortion


Authors: Junho Kim, Seongyeop Kim, Seong Tae Kim, and Yong Man Ro


Along with the outstanding performance of deep neural networks (DNNs), considerable research effort has been devoted to finding ways to understand the decisions of DNN structures. In the computer vision domain, visualizing the attribution map is one of the most intuitive and understandable ways to achieve human-level interpretation. Among such methods, perturbation-based visualization can explain the “black box” property of a given network by optimizing perturbation masks that alter the network prediction of the target class the most. However, existing perturbation methods can make unexpected changes to network predictions after applying a perturbation mask to the input image, resulting in a loss of robustness and fidelity of the perturbation mechanism. In this paper, we define class distortion as the unexpected change of network predictions during the perturbation process. To handle this, we propose a novel visual interpretation framework, Robust Perturbation, which is robust against unexpected class distortion during mask optimization. With a new cross-checking mask optimization strategy, our proposed framework perturbs the target prediction of the network while upholding the non-target predictions, providing more reliable and accurate visual explanations. We evaluate our framework on three public datasets through extensive experiments. Furthermore, we propose a new metric for evaluating class distortion. In both quantitative and qualitative experiments, tackling the class distortion problem turns out to enhance the quality and fidelity of the visual explanation in comparison with existing perturbation-based methods.
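
The sketch below illustrates one possible cross-checking optimization step: the perturbation mask suppresses the target-class score while a KL term keeps the non-target prediction distribution close to that of the unperturbed input. The exact objective, regularizers, and hyperparameters of Robust Perturbation are not specified here, so everything in this snippet is an assumption.

```python
import torch
import torch.nn.functional as F

def cross_check_step(model, image, baseline, mask, target, lam_kl=1.0, lam_sparse=0.01):
    """One illustrative step: perturb the image with `mask` (1 = keep, 0 = replace by baseline),
    minimize the target-class score while keeping the non-target distribution unchanged."""
    perturbed = mask * image + (1 - mask) * baseline
    logits_p = model(perturbed)
    with torch.no_grad():
        logits_o = model(image)
    # Drop the target class and compare the remaining (non-target) distributions.
    keep = torch.ones(logits_p.size(1), dtype=torch.bool); keep[target] = False
    p_nt = F.log_softmax(logits_p[:, keep], dim=-1)
    o_nt = F.softmax(logits_o[:, keep], dim=-1)
    loss = (F.softmax(logits_p, dim=-1)[:, target].mean()            # suppress the target class
            + lam_kl * F.kl_div(p_nt, o_nt, reduction='batchmean')   # preserve non-target predictions
            + lam_sparse * mask.abs().mean())                        # simple sparsity regularizer
    return loss

# Toy usage with a hypothetical classifier; gradients flow into the mask.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
img = torch.randn(1, 3, 32, 32)
mask = torch.rand(1, 1, 32, 32, requires_grad=True)
cross_check_step(model, img, torch.zeros_like(img), mask, target=3).backward()
```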


[#235]   2021-10-29 [IEEE TASLP]    Speech Reconstruction with Reminiscent Sound via Visual Voice Memory (by Joanna Hong) is accepted in IEEE Transactions on Audio, Speech, and Language Processing

Title: Speech Reconstruction with Reminiscent Sound via Visual Voice Memory


Authors: Joanna Hong, Minsu Kim, Se Jin Park, Yong Man Ro


The goal of this work is to reconstruct speech from silent video, in both speaker-dependent and speaker-independent settings. Unlike previous works, which have mostly been restricted to a speaker-dependent setting, we propose Visual Voice memory to restore essential auditory information and generate proper speech from different, and even unseen, speakers. The proposed memory takes additional auditory information that corresponds to the input face movements and stores the auditory contexts so that they can be recalled from the given input visual features. Specifically, the Visual Voice memory contains value and key memory slots, where the value memory slots save the audio features and the key memory slots store the visual features at the same locations as the saved audio features. By guiding each memory to properly save each feature, the model can adequately produce speech through the auxiliary audio information. Hence, our method employs both video and audio information during training, but does not require any additional auditory input at inference time. Our key contributions are: (1) proposing the Visual Voice memory that brings rich audio information to complement the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and speaker-independent training by memorizing auditory features and the corresponding visual features. We validate the proposed framework on the GRID and Lip2Wav datasets and show that our method surpasses the performance of previous works. Moreover, we experiment on both multi-speaker and speaker-independent settings and verify the effectiveness of the Visual Voice memory. We also demonstrate that the Visual Voice memory contains meaningful information for reconstructing speech.
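
Below is a minimal sketch of how such a key-value memory could be trained so that video-only addressing recalls the stored audio: the visual addressing distribution is pulled toward the audio addressing distribution, and the recalled audio is reconstructed. The slot count, temperature, and the KL/MSE objectives are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualVoiceMemorySketch(nn.Module):
    """Sketch: key slots are addressed by visual features, value slots store audio features.
    During training the visual addressing is aligned with the audio addressing, so that
    at inference the audio value can be recalled from video alone."""
    def __init__(self, num_slots=150, dim=512):
        super().__init__()
        self.key = nn.Parameter(torch.randn(num_slots, dim) * 0.02)    # visual keys
        self.value = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # audio values

    def address(self, query, slots):
        return torch.softmax(F.normalize(query, dim=-1) @ F.normalize(slots, dim=-1).t() / 0.1, dim=-1)

    def forward(self, visual_feat, audio_feat=None):
        addr_v = self.address(visual_feat, self.key)            # (B, S) addressing from video
        recalled_audio = addr_v @ self.value                     # (B, dim)
        if audio_feat is None:                                   # inference: video only
            return recalled_audio, None
        addr_a = self.address(audio_feat, self.value)            # (B, S) addressing from audio
        # Alignment loss: make video addressing mimic audio addressing; plus reconstruct the audio.
        align = F.kl_div(addr_v.clamp_min(1e-8).log(), addr_a, reduction='batchmean')
        recon = F.mse_loss(recalled_audio, audio_feat)
        return recalled_audio, align + recon

mem = VisualVoiceMemorySketch()
recalled, loss = mem(torch.randn(4, 512), torch.randn(4, 512))
```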


[#234]   2021-10-20 [BMVC 2021]    M-CAM: Visual Explanation of Challenging Conditioned Dataset with Bias-reducing Memory (by Seongyeop Kim) is accepted in BMVC 2021

Title: M-CAM: Visual Explanation of Challenging Conditioned Dataset with Bias-reducing Memory


Authors: Seongyeop Kim and Yong Man Ro


We introduce a framework that enhances the visual explanation of class activation maps (CAM) with a key-value memory structure for deep networks. We reveal challenging conditions inherent in several datasets that degrade the visual explanation quality of existing CAM-based methods (e.g., imbalanced data and multi-object co-occurrence) and address them with the proposed framework. The proposed Bias-reducing memory module learns spatial feature representations of different classes from trained networks and stores each piece of semantic information in a separate memory slot, while requiring no modification to the existing networks. Furthermore, we propose a novel visual explanation method accompanied by a memory slot searching algorithm that retrieves semantically relevant spatial feature representations from the memory module and produces visual explanations of network decisions. We evaluate our visual explanation framework on datasets with challenging conditions, including several medical image datasets and multiclass classification datasets. We qualitatively and quantitatively compare it with existing CAM-based methods to demonstrate the strength of our framework.
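
As a rough sketch of how a retrieved memory slot could be turned into an activation map, the snippet below scores each spatial position by cosine similarity to a memorized class slot and upsamples the result; the paper's slot searching and map construction may differ, so this is purely illustrative.

```python
import torch
import torch.nn.functional as F

def slot_activation_map(feature_map, class_slot, out_size=(224, 224)):
    """Sketch: score each spatial position by its cosine similarity to a memorized class slot,
    then upsample to image resolution. feature_map: (B, C, H, W), class_slot: (C,)."""
    f = F.normalize(feature_map, dim=1)
    s = F.normalize(class_slot, dim=0).view(1, -1, 1, 1)
    cam = (f * s).sum(dim=1, keepdim=True)                   # (B, 1, H, W), in [-1, 1]
    cam = F.relu(cam)                                         # keep positive evidence only
    cam = F.interpolate(cam, size=out_size, mode='bilinear', align_corners=False)
    cam_min = cam.amin(dim=(2, 3), keepdim=True)
    cam_max = cam.amax(dim=(2, 3), keepdim=True)
    return (cam - cam_min) / (cam_max - cam_min + 1e-8)       # normalize to [0, 1]

heatmap = slot_activation_map(torch.randn(1, 512, 14, 14), torch.randn(512))
print(heatmap.shape)   # torch.Size([1, 1, 224, 224])
```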


[#233]   2021-09-29 [NeurIPS 2021]    Lip to Speech Synthesis with Visual Context Attentional GAN (by Minsu Kim) is accepted in NeurIPS 2021

Title: Lip to Speech Synthesis with Visual Context Attentional GAN


Authors: Minsu Kim, Joanna Hong, and Yong Man Ro


In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes speech from local lip visual features by finding a viseme-to-phoneme mapping function, while global visual context is embedded into the intermediate speech representation to refine the coarse speech representation in detail. To achieve this, a visual context attention module is proposed which encodes global representations from the local visual features and provides the global visual context corresponding to the given coarse speech representation to the generator. In addition to the explicit modelling of local and global visual representations, a synchronization technique is introduced through contrastive learning that guides the generator to synthesize speech in sync with the given input lip movements. Extensive experiments demonstrate that the proposed VCA-GAN outperforms existing state-of-the-art methods and can effectively synthesize speech in multi-speaker settings, which previous works have barely handled.
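
A minimal sketch of a visual context attention module is shown below: the coarse speech representation queries the global visual context via cross-attention and is refined with a residual connection. The use of nn.MultiheadAttention, the dimensions, and the residual/LayerNorm arrangement are assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class VisualContextAttentionSketch(nn.Module):
    """Sketch: coarse speech features (queries) attend over global visual context (keys/values),
    and the attended context refines the speech representation via a residual connection."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech_feat, visual_context):
        # speech_feat: (B, T_audio, dim), visual_context: (B, T_video, dim)
        ctx, _ = self.attn(query=speech_feat, key=visual_context, value=visual_context)
        return self.norm(speech_feat + ctx)   # refined speech representation

vca = VisualContextAttentionSketch()
refined = vca(torch.randn(2, 80, 256), torch.randn(2, 20, 256))
print(refined.shape)   # torch.Size([2, 80, 256])
```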


[#232]   2021-09-29 [NeurIPS 2021]    Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck (by Junho Kim and Byung-Kwan Lee) is accepted in NeurIPS 2021

Title: Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck


Authors: Junho Kim*, Byung-Kwan Lee*, and Yong Man Ro (*: equally contributed)


Adversarial examples, generated by carefully crafted perturbations, have attracted considerable attention in research fields. Recent works have argued that the existence of robust and non-robust features is a primary cause of adversarial examples, and have investigated their internal interactions in the feature space. In this paper, we propose a way of explicitly distilling the feature representation into robust and non-robust features using the Information Bottleneck. Specifically, we inject noise variation into each feature unit and evaluate the information flow in the feature representation to dichotomize feature units into either robust or non-robust, based on the noise variation magnitude. Through comprehensive experiments, we demonstrate that the distilled features are highly correlated with adversarial prediction, and that they carry human-perceptible semantic information by themselves. Furthermore, we present an attack mechanism that intensifies the gradient of non-robust features, which is directly related to the model prediction, and validate its effectiveness in breaking model robustness.
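
The toy sketch below is loosely inspired by this idea: a per-unit noise scale is learned on a fixed feature vector so that units tolerating large noise without changing the prediction are treated as carrying little label-relevant information. The objective, optimizer, and thresholding are simplifications and assumptions, not the paper's Information Bottleneck formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rank_units_by_noise(classifier_head, feat, label, steps=200, lr=0.05, beta=10.0):
    """Sketch: learn a per-unit noise scale on a fixed feature vector `feat` (D,).
    Units that tolerate large noise without hurting the prediction carry little
    label-relevant information; the learned scale is used to dichotomize units.
    The classifier head is kept fixed (its gradients are simply ignored)."""
    log_sigma = torch.zeros_like(feat, requires_grad=True)
    opt = torch.optim.Adam([log_sigma], lr=lr)
    for _ in range(steps):
        noisy = feat + torch.randn_like(feat) * log_sigma.exp()
        logits = classifier_head(noisy.unsqueeze(0))
        # Keep the prediction while encouraging as much injected noise as possible.
        loss = F.cross_entropy(logits, label.unsqueeze(0)) - beta * log_sigma.mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return log_sigma.detach()   # small learned scale => information-carrying unit

head = nn.Linear(512, 10)
feat, label = torch.randn(512), torch.tensor(3)
scores = rank_units_by_noise(head, feat, label)
selected_units = scores.argsort()[:256]   # e.g., keep the half that tolerates the least noise
```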


[#231]   2021-09-15 [IEEE TMM]    CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition (by Minsu Kim) is accepted in IEEE Transactions on Multimedia

Title: CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition


Authors: Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro


Visual Speech Recognition (VSR) is the task of recognizing speech as text from the external appearance of the face (i.e., the lips). Since the information from visual lip movements is not sufficient to fully represent the speech, VSR is considered a challenging problem. One possible way to resolve this problem is to additionally utilize audio, which contains rich information for speech recognition. However, audio may not always be available, such as in long-distance or crowded situations. Thus, it is necessary to find a way to provide enough information for speech recognition with visual inputs only. In this paper, we alleviate the information insufficiency of visual lip movement by proposing a cross-modal memory augmented VSR with a Visual-Audio Memory (VAM). The proposed framework utilizes the complementary information of audio even when audio inputs are not provided at inference time. Concretely, the proposed VAM learns to imprint short clip-level audio features into a memory network using the corresponding visual features. To this end, the VAM contains two memories, a lip-video key memory and an audio value memory. The audio value memory is guided to imprint the audio feature, and the lip-video key memory is guided to memorize the location of the imprinted audio. By doing this, the VAM can exploit rich audio information by accessing the memory using visual inputs only. Thus, the proposed VSR framework can refine its prediction with the imprinted audio information at inference time, when audio inputs are not provided. We validate the proposed method on popular benchmark databases, LRW, LRW-1000, GRID, and LRS2. Experimental results show that the proposed method achieves state-of-the-art performance on both word- and sentence-level visual speech recognition. In addition, we verify that the learned representations inside the VAM contain meaningful information for VSR by examining and visualizing them.


[#230]   2021-09-13     Student Recruitment for Spring 2022

We are recruiting Ph.D. students (government scholarship), M.S. students (government and KAIST scholarships), and industry-sponsored students (KEPSI, EPSS, LGenius) for the Spring 2022 semester.

(http://admission.kaist.ac.kr/graduate/)

 

Research areas for recruitment:

 - Deep learning (XAI, adversarial attack/defense, multimodal)

 - Machine learning with visual data

 - Computer vision (object segmentation/detection/classification)

 - Multimodal (Vision-Language) deep learning

 - Defense security

 

Ongoing research projects:

 - Explainable (Interpretable) Deep learning

 - Adversarial defense in Deep learning

 - Deep learning algorithms (detection/classification/segmentation) in computer vision

 - Multimodal deep learning

 

Recent research results of the lab - link (LINK)

Recent international conference publications on deep learning by the lab's M.S. and Ph.D. students - link (LINK)

Recent international journal publications by the lab's M.S. and Ph.D. students - link (LINK)

Please refer to the links above.

 

For inquiries about joining the lab, please email Prof. Yong Man Ro (ymro@kaist.ac.kr) or arrange a preliminary meeting.


[#229]   2021-08-05 [IEEE CSVT]    Assessing Individual VR Sickness through Deep Feature Fusion of VR Video and Physiological Response (by Sangmin Lee) is accepted in IEEE Transactions on Circuits and Systems for Video Technology

Title: Assessing Individual VR Sickness through Deep Feature Fusion of VR Video and Physiological Response


Authors: Sangmin Lee, Seongyeop Kim, Hak Gu Kim, and Yong Man Ro


Recently, VR sickness assessment for VR videos has been in high demand in industry and research to address VR viewing safety issues. In particular, it is difficult to evaluate the VR sickness of individuals due to individual differences. To achieve this challenging goal, we focus on deep feature fusion of sickness-related information. In this paper, we propose a novel deep learning-based assessment framework that estimates the VR sickness of individual viewers from VR videos and the corresponding physiological responses. We design a content stimulus guider imitating the phenomenon by which humans feel VR sickness. The content stimulus guider extracts a deep stimulus feature from a VR video to reflect VR sickness caused by VR videos. In addition, we devise a physiological response guider to encode physiological responses acquired while humans experience VR videos. Each physiology sickness feature extractor (EEG, ECG, and GSR) in the physiological response guider is designed to suit the corresponding physiological characteristics. The extracted physiology sickness features are then fused into a deep physiology feature that comprehensively reflects individual deviations of VR sickness. Finally, the VR sickness predictor assesses individual VR sickness effectively by fusing the deep stimulus feature and the deep physiology feature. To validate the proposed method extensively, we built two benchmark datasets that contain 360-degree VR videos with physiological responses (EEG, ECG, and GSR) and SSQ scores. Experimental results show that the proposed method achieves meaningful correlations with human SSQ scores. Further, we validate the effectiveness of the proposed network designs by conducting analysis on feature fusion and visualization.
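
A minimal sketch of the fusion idea is given below: per-signal encoders for EEG/ECG/GSR features are concatenated with a video-stimulus feature and regressed to a sickness score. All module names and dimensions are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SicknessFusionSketch(nn.Module):
    """Sketch: per-signal encoders (EEG/ECG/GSR) plus a video-stimulus feature are fused
    by concatenation and regressed to an SSQ-like sickness score."""
    def __init__(self, stim_dim=256, phys_dims=(64, 32, 32), hidden=128):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in phys_dims])
        self.predictor = nn.Sequential(nn.Linear(stim_dim + hidden * len(phys_dims), hidden),
                                       nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, stim_feat, phys_feats):
        phys = [enc(x) for enc, x in zip(self.encoders, phys_feats)]
        return self.predictor(torch.cat([stim_feat, *phys], dim=-1))   # predicted sickness score

model = SicknessFusionSketch()
score = model(torch.randn(4, 256), [torch.randn(4, 64), torch.randn(4, 32), torch.randn(4, 32)])
print(score.shape)   # torch.Size([4, 1])
```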


[#228]   2021-07-23 [ICCV 2021]    Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video (by Minsu Kim and Joanna Hong) is accepted in ICCV 2021

Title: Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video


Authors: Minsu Kim*, Joanna Hong*, Se Jin Park, and Yong Man Ro (*: equally contributed)

 

In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory network to achieve the multi-modal bridging, where the memory network consists of two modality-specific memories: a source-key memory and a target-value memory. These two modality-specific memories save source and target modal representations, respectively. Then, an associative bridge is constructed between the source-key memory and the target-value memory based on the interrelationship between the two memories. By learning this interrelationship through the associative bridge, it is possible to access the target-value memory using the source modality and the source-key memory, without the target modality. Accordingly, the proposed framework can recall the target modal representations with source modal inputs only and provides rich information for its downstream tasks. We apply the proposed framework to two tasks: lip reading and speech reconstruction from silent video. Through the proposed associative bridge and modality-specific memories, each task's knowledge is enriched with the recalled audio context, achieving state-of-the-art performance. We also verify that the associative bridge properly relates the source and target memories.


[#227]   2021-07-23 [ICCV 2021]    Robust Small-scale Pedestrian Detection with Cued Recall via Memory Learning (by Jung Uk Kim and Sungjune Park) is accepted in ICCV 2021

Title: Robust Small-scale Pedestrian Detection with Cued Recall via Memory Learning


Authors: Jung Uk Kim*, Sungjune Park*, and Yong Man Ro (*: equally contributed)

 

Although the visual appearances of small-scale objects are not well observed, humans can recognize them by associating the visual cues of small objects with their memorized appearance. This is called cued recall. In this paper, motivated by this human memory process, we introduce a novel pedestrian detection framework that imitates cued recall when detecting small-scale pedestrians. We propose large-scale embedding learning with a large-scale pedestrian recalling memory (LPR Memory). The purpose of the proposed large-scale pedestrian embedding learning is to memorize and recall the large-scale pedestrian appearance via the LPR Memory. To this end, we employ a large-scale pedestrian exemplar set, so that the LPR Memory can recall the information of large-scale pedestrians from small-scale pedestrians. Comprehensive quantitative and qualitative experimental results validate the effectiveness of the proposed framework with the LPR Memory.


[#226]   2021-06-01 [IEEE ICIP]    3 papers (by Junho, Hong Joo, and Youngjoon) are accepted in IEEE ICIP 2021

1. Authors: Junho Kim, Minsu Kim, Yong Man Ro

Title: 'Interpretation of Lesional Detection via Counterfactual Generation'

 

2. Authors: Hong Joo Lee, Yong Man Ro

Title: 'Adversarially Robust Multi-Sensor Fusion Model Training via Random Feature Fusion for Semantic Segmentation'


3. Authors: Byeong Cheon Kim*, Youngjoon Yu*, and Yong Man Ro (*: equally contributed)

Title: 'Robust Decision-based black-box adversarial attack via Coarse-to-fine Random Search'


[#225]   2021-04-27 [IEEE Access]    Adversarially Robust Hyperspectral Image Classification via Random Spectral Sampling and Spectral Shape Encoding (by Sungjune Park and Hong Joo Lee) is accepted in IEEE Access

Title: Adversarially Robust Hyperspectral Image Classification via Random Spectral Sampling and Spectral Shape Encoding


Authors: Sungjune Park, Hong Joo Lee, Yong Man Ro


Although hyperspectral image (HSI) classification has adopted deep neural networks (DNNs) and shown remarkable performance, there is a lack of studies on the adversarial vulnerability of HSI classification. In this paper, we propose a novel HSI classification framework that is robust to adversarial attacks. To this end, we focus on the unique spectral characteristic of HSIs (i.e., the distinctive spectral patterns of materials). Based on this spectral characteristic, we present random spectral sampling and spectral shape feature encoding for robust HSI classification. For the random spectral sampling, spectral bands are randomly sampled from the entire spectrum for each pixel of the input HSI. Also, the overall spectral shape information, which is robust to adversarial attacks, is fed into a shape feature extractor to acquire the spectral shape feature. The proposed framework thus provides adversarial robustness for HSI classifiers via randomization effects and spectral shape feature encoding. To the best of our knowledge, the proposed framework is the first work dealing with adversarial robustness in HSI classification. In experiments, we verify that our framework improves adversarial robustness considerably under diverse adversarial attack scenarios and outperforms existing adversarial defense methods.
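
A rough sketch of per-pixel random spectral sampling is given below; it keeps a random subset of bands per pixel by masking, which only approximates the sampling scheme described above, and the band counts are placeholders.

```python
import torch

def random_spectral_sampling(hsi, num_bands):
    """Sketch: for each pixel, keep a random subset of spectral bands and zero out the rest,
    so repeated forward passes see different band subsets (a randomization-style defense).
    hsi: (B, C, H, W) hyperspectral cube; num_bands: bands kept per pixel."""
    b, c, h, w = hsi.shape
    # Random scores per (pixel, band); the top-k bands per pixel are kept.
    scores = torch.rand(b, c, h, w, device=hsi.device)
    thresh = scores.topk(num_bands, dim=1).values.min(dim=1, keepdim=True).values
    mask = (scores >= thresh).float()
    return hsi * mask

sampled = random_spectral_sampling(torch.randn(2, 200, 32, 32), num_bands=50)
print((sampled != 0).sum(dim=1).float().mean())   # ~50 bands survive per pixel
```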


[#224]   2021-04-21 [IEEE CSVT]    Uncertainty-Guided Cross-Modal Learning for Robust Multispectral Pedestrian Detection (by Jung Uk Kim) is accepted in IEEE Transactions on Circuits and Systems for Video Technology

Title: Uncertainty-Guided Cross-Modal Learning for Robust Multispectral Pedestrian Detection


Authors: Jung Uk Kim, Sungjune Park, Yong Man Ro

 

Multispectral pedestrian detection has received great attention in recent years, as multispectral modalities (i.e., color and thermal) can provide complementary visual information. However, there are major inherent issues in multispectral pedestrian detection. First, the cameras of the two modalities have different fields of view (FoVs), so image pairs are often miscalibrated. Second, modality discrepancy is observed because image pairs are captured at different wavelengths. In this paper, to alleviate these issues, we propose a new uncertainty-aware multispectral pedestrian detection framework. In our framework, we consider two types of uncertainty: (1) Region of Interest (RoI) uncertainty and (2) predictive uncertainty. For the miscalibration issue, we propose RoI uncertainty, which represents the reliability of the RoI candidates. Using the RoI uncertainty when combining the two modal features, we devise an uncertainty-aware feature fusion (UFF) module to reduce the effect of RoI features with high RoI uncertainty. We also propose an uncertainty-aware cross-modal guiding (UCG) module for the modality discrepancy. In the UCG module, we use the predictive uncertainty, which indicates how reliable the prediction of an RoI feature is. Based on the predictive uncertainty, the UCG module guides the feature distribution of the modality with high predictive uncertainty (less reliable) to resemble that of the modality with low predictive uncertainty (more reliable). The UCG module can encode more discriminative features by guiding the feature distributions of the two modalities to be similar. With comprehensive experiments on public multispectral datasets, we verify that our method reduces the effect of miscalibration and alleviates the modality discrepancy, outperforming existing state-of-the-art methods.
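
The snippet below sketches one simple form of uncertainty-aware fusion: each modality's RoI feature is weighted by the inverse of its estimated uncertainty before being combined. The UFF module in the paper is more involved; the weighting rule and shapes here are assumptions.

```python
import torch

def uncertainty_weighted_fusion(feat_rgb, feat_thermal, unc_rgb, unc_thermal, eps=1e-6):
    """Sketch: down-weight RoI features whose estimated uncertainty is high.
    feats: (N, D); uncertainties: (N, 1), larger = less reliable."""
    w_rgb = 1.0 / (unc_rgb + eps)
    w_thm = 1.0 / (unc_thermal + eps)
    w_sum = w_rgb + w_thm
    return (w_rgb / w_sum) * feat_rgb + (w_thm / w_sum) * feat_thermal

fused = uncertainty_weighted_fusion(torch.randn(5, 256), torch.randn(5, 256),
                                    torch.rand(5, 1), torch.rand(5, 1))
print(fused.shape)   # torch.Size([5, 256])
```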


[#223]   2021-03-23 Student Recruitment for Fall 2021

We are recruiting Ph.D. students (KAIST scholarship), M.S. students (government and KAIST scholarships), and industry-sponsored students (KEPSI, EPSS, LGenius) for the Fall 2021 semester.

(http://admission.kaist.ac.kr/graduate/)

 

Research areas for recruitment:

 - Deep learning (XAI, adversarial attack/defense, multimodal)

 - Machine learning with visual data

 - Computer vision (object segmentation/detection/classification)

 - Multimodal (Vision-Language) deep learning

 - Defense security

 

Ongoing research projects:

 - Explainable (Interpretable) Deep learning

 - Adversarial defense in Deep learning

 - Deep learning algorithms (detection/classification/segmentation) in computer vision

 - Multimodal deep learning

 

Recent research results of the lab - link (LINK)

Recent international conference publications on deep learning by the lab's M.S. and Ph.D. students - link (LINK)

Recent international journal publications by the lab's M.S. and Ph.D. students - link (LINK)

Please refer to the links above.

 

For inquiries about joining the lab, please email Prof. Yong Man Ro (ymro@kaist.ac.kr) or arrange a preliminary meeting.


[#222]   2021-03-05 [CVPR 2021]    Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning (Oral presentation) (by Sangmin Lee) is accepted in CVPR 2021

Title: Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning


Authors: Sangmin Lee, Hak Gu Kim, Dae Hwi Choi, Hyung-Il Kim, Yong Man Ro


Our work addresses long-term motion context issues for predicting future frames. To predict the future precisely, it is required to capture which long-term motion context (e.g., walking or running) the input motion (e.g., leg movement) belongs to. The bottlenecks arising when dealing with long-term motion context are: (i) how to capture the long-term motion context naturally matching input sequences with limited dynamics, and (ii) how to capture the long-term motion context with high dimensionality (e.g., motion complexity). To address these issues, we propose novel motion context-aware video prediction. To solve bottleneck (i), we introduce a long-term motion context memory (LMC-Memory) with memory alignment learning. The proposed memory alignment learning enables storing long-term motion contexts in the memory and matching them with sequences containing limited dynamics. As a result, the long-term context can be recalled from the limited input sequence. In addition, to resolve bottleneck (ii), we propose memory query decomposition to store local motion contexts (i.e., low-dimensional dynamics) and recall the suitable local context for each local part of the input individually. This boosts the alignment effect of the memory. Experimental results show that the proposed method outperforms other sophisticated RNN-based methods, especially in the long-term condition. Further, we validate the effectiveness of the proposed network designs by conducting ablation studies and memory feature analysis.


[#221]   2021-03-05 [ICASSP 2021]    Towards Robust Training of Multi-Sensor data Fusion Network Against Adversarial Examples in Semantic Segmentation (by Youngjoon Yu) is accepted in ICASSP 2021

Title: Towards Robust Training of Multi-Sensor data Fusion Network Against Adversarial Examples in Semantic Segmentation


Authors: Youngjoon Yu, Hong Joo Lee, Byeong Cheon Kim, Jung Uk Kim, and Yong Man Ro

 

The success of multi-sensor data fusion in deep learning appears to be attributed to the use of complementary information among multiple sensor datasets. Compared to their predictive performance, relatively little attention has been devoted to the adversarial robustness of multi-sensor data fusion models. To achieve adversarially robust multi-sensor data fusion networks, we propose a novel robust training scheme called Multi-Sensor Cumulative Learning (MSCL). The motivation behind MSCL is based on the way human beings learn new skills. MSCL allows the multi-sensor fusion network to learn robust features from individual sensors, and then learn complex joint features from multiple sensors, just as people learn to walk before they run. The stepwise framework of MSCL enables the network to incorporate pre-trained knowledge of robustness with new joint information from multiple sensors. Extensive experimental evidence validates that MSCL outperforms other multi-sensor fusion training schemes in defending against adversarial examples.
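
A toy sketch of a cumulative (stepwise) schedule is given below: each sensor branch is first trained on adversarially perturbed inputs alone, then frozen while a fusion head learns joint features. The FGSM-style perturbation, modules, and dimensions are illustrative assumptions rather than the MSCL procedure itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

rgb_branch = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
depth_branch = nn.Sequential(nn.Linear(16, 64), nn.ReLU())
fusion_head = nn.Linear(128, 5)

def fgsm(x, branch, head, y, eps=0.1):
    """One-step adversarial perturbation of the input (illustrative attack only)."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(head(branch(x)), y).backward()
    return (x + eps * x.grad.sign()).detach()

x_rgb, x_d, y = torch.randn(8, 32), torch.randn(8, 16), torch.randint(0, 5, (8,))

# Stage 1: per-sensor adversarial training ("learn to walk"), each branch with a temporary head.
for branch, x in [(rgb_branch, x_rgb), (depth_branch, x_d)]:
    head = nn.Linear(64, 5)
    opt = torch.optim.SGD(list(branch.parameters()) + list(head.parameters()), lr=0.1)
    for _ in range(10):
        x_adv = fgsm(x, branch, head, y)
        loss = F.cross_entropy(head(branch(x_adv)), y)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the robust branches and train the joint fusion head ("then run").
for p in list(rgb_branch.parameters()) + list(depth_branch.parameters()):
    p.requires_grad_(False)
opt = torch.optim.SGD(fusion_head.parameters(), lr=0.1)
for _ in range(10):
    joint = torch.cat([rgb_branch(x_rgb), depth_branch(x_d)], dim=-1)
    loss = F.cross_entropy(fusion_head(joint), y)
    opt.zero_grad(); loss.backward(); opt.step()
```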


[#220]   2021-03-05 Seong Tae Kim is appointed as an assistant professor in the CS department at Kyung Hee University

Dr. Seong Tae Kim, who received his Ph.D. (Advisor: Prof. Yong Man Ro) in 2019, has been appointed as an assistant professor in the Department of Computer Science and Engineering at Kyung Hee University. He has established a research laboratory named 'Augmented Intelligence Lab' (https://sites.google.com/view/augilab).