Research Fields

Integrating Vision, Speech, and Language for AI

Integrating vision, speech, and language for AI is a challenging and exciting research area. It aims to build AI systems that can interact through multimodal data, and it is closely related to the multimodal deep learning topics below. Some of the ongoing works in the IVY&IVL Lab are as follows:

Multimodal Prompting with Large-Scale Models

Multimodal prompting with large-scale models is a research topic that explores how to design multimodal prompts that can guide large-scale models to solve multimodal tasks. These large-scale models can handle multiple types of data, such as images, videos, audio, and text. Some of the ongoing works in the IVY&IVL Lab are as follows:
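
To make the idea concrete, below is a minimal sketch of prompt-guided zero-shot image classification with a publicly available large-scale vision-language model (CLIP, accessed through the Hugging Face transformers library). The checkpoint, image path, labels, and prompt template are illustrative assumptions, not a description of the lab's own methods.

```python
# A minimal sketch of prompt-guided zero-shot image classification with a
# large-scale vision-language model (CLIP via Hugging Face transformers).
# The checkpoint, image path, labels, and prompt template below are
# illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")       # any input image
labels = ["a dog", "a cat", "a car"]
prompts = [f"a photo of {label}" for label in labels]  # the text "prompt"

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores, one per prompt;
# the softmax turns them into a distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```

Changing only the prompt strings re-purposes the same frozen model for a different task, which is the core appeal of prompting over task-specific fine-tuning.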

Multimodal Deep Learning

This research field combines different modalities of data, such as vision, language, and speech, to perform AI tasks. Currently, multimodal deep learning is being studied for human multimodality (speech, language, talking face) translation, which includes visual speech recognition, speech synthesis, talking face generation, and audio-visual speech recognition. Image, text, and sound are also translated among one another, e.g., image to text, text to image, and image to sound. Multimodal deep learning can improve the performance and robustness of AI models by exploiting complementary information from different modalities. Related papers published by the IVY&IVL Lab are as follows:
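
As a toy illustration of how complementary modalities can be combined (not a reconstruction of any published model), below is a minimal PyTorch sketch of late audio-visual fusion of the kind used in audio-visual speech recognition; the encoders, feature dimensions, and output vocabulary size are all assumptions.

```python
# A toy PyTorch sketch of late audio-visual fusion, as used in tasks such as
# audio-visual speech recognition. All encoders, feature dimensions, and the
# output vocabulary size are illustrative assumptions, not a published model.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=512, hidden=256, vocab=40):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_enc = nn.GRU(visual_dim, hidden, batch_first=True)
        # Late fusion: concatenate the per-frame features of both streams.
        self.classifier = nn.Linear(2 * hidden, vocab)

    def forward(self, audio, visual):
        # audio:  (batch, frames, audio_dim),  e.g. log-mel spectrogram frames
        # visual: (batch, frames, visual_dim), e.g. lip-region CNN features;
        # both streams are assumed to be aligned to the same frame rate.
        a, _ = self.audio_enc(audio)
        v, _ = self.visual_enc(visual)
        fused = torch.cat([a, v], dim=-1)
        return self.classifier(fused)  # per-frame class logits

model = AudioVisualFusion()
logits = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
print(logits.shape)  # torch.Size([2, 100, 40])
```

Because the visual stream still carries lip-movement information when the audio is noisy, such a fused model can keep recognizing speech where an audio-only model degrades.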

Inclusive Human Multimodal Conversation 

Inclusive human multimodal conversation is a research topic that explores how, in any circumstance, humans can communicate with each other through machines using different modes of conversation, such as speech, language, and talking faces. This topic is important for understanding how humans can interact more effectively and empathetically in diverse contexts and situations, such as across different cultures and languages. Some of the ongoing works in the IVY&IVL Lab are as follows:

Competency, Interpretability, Memorability, and Robustness of Deep Learning Models

These research topics aim to understand the behavior of deep learning models across various tasks and domains. Competency refers to the ability of a model to achieve high accuracy and efficiency on a specific task or domain, such as image classification or natural language processing. Interpretability refers to the ability of a model to produce outputs that are understandable and explainable to humans, for example by revealing the relevant features or generating multimodal descriptions. Memorability refers to the ability of a model to store and retrieve knowledge from previous inputs or outputs, for example through attention mechanisms or memory networks. Robustness refers to the ability of a model to maintain its performance under various adversarial attacks. Related papers published by the IVY&IVL Lab are as follows:
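
As a concrete illustration of the robustness axis, below is a minimal PyTorch sketch of the fast gradient sign method (FGSM), a standard adversarial attack; the epsilon value is an illustrative assumption, and any differentiable image classifier with inputs in [0, 1] can be plugged in.

```python
# A minimal sketch of the fast gradient sign method (FGSM) adversarial attack
# (Goodfellow et al., 2015), illustrating the robustness axis above. The
# epsilon value is an illustrative assumption; any differentiable image
# classifier with inputs in [0, 1] can be plugged in.
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=0.03):
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Perturb each pixel in the direction that increases the loss, then
    # clamp back to the valid input range.
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Usage with any classifier `model` and a batch (images, labels):
#   adv = fgsm_attack(model, images, labels)
#   accuracy_under_attack = (model(adv).argmax(1) == labels).float().mean()
```

Measuring accuracy on such perturbed inputs, rather than on clean inputs alone, is one simple way to quantify a model's robustness.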

Computer Vision and Multimedia 

Computer vision in AI deals with computational methods that enable machines to understand and interpret the content of visual data. Computer vision and multimedia research aims to make machines see and understand multimodal data from cameras or sensors and to interact based on that information. Related papers published by the IVY&IVL Lab are as follows: