Search results
(1 - 2 of 2)
- Title
- Multimodal Learning and Generation Toward a Multisensory and Creative AI System
- Creator
- Zhu, Ye
- Date
- 2023
- Description
-
We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed...
Show moreWe are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified intelligent system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various modalities including vision, audio, and text, has become an increasingly popular research area with emerging technical advances in recent years. Under the context of multimodal learning, the creativity to generate and synthesize novel and meaningful data is a critical criterion to assess machine intelligence.As a step towards a multisensory and creative AI system, we study the problem of multimodal generation in this thesis by exploring the field from multiple perspectives. Firstly, we analyze different data modalities in a comprehensive manner by comparing the data natures, the semantics, and their corresponding mainstream technical designs. We then propose to investigate three multimodal generation application scenarios, namely text generation from visual data, audio generation from visual data, and visual generation from textual data, with diverse approaches to give an overview of the field. For the direction of text generation from visual data, we study a novel multimodal task in which the model is expected to summarize a given video with textual descriptions, under a challenging condition where the video can only be partially seen. We propose to supplement the missing visual information via a dialogue interaction and introduce QA-Cooperative network with a dynamic dialogue history update learning mechanism to tackle the challenge. For the direction of audio generation from visual data, we present a new multimodal task that aims to generate music for a given silent dance video clip. Unlike most existing conditional music generation works that generate specific types of mono-instrumental sounds using symbolic audio representations (e.g., MIDI), and that heavily rely on pre-defined musical synthesizers, we generate dance music in complex styles (e.g., pop, breaking, etc.) by employing a Vector-Quantized (VQ) audio representation via our proposed Dance2Music-GAN (D2M-GAN) framework. For the direction of visual generation from textual data, we tackle a key desideratum in conditional synthesis, which is to achieve high correspondence between the conditioning input and generated output using the state-of-the-art generative model -- Diffusion Probabilistic Model. While most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound in model training. In this work, we take a different route by explicitly enhancing input-output connections by maximizing their mutual information, which is achieved by our proposed Conditional Discrete Contrastive Diffusion (CDCD) framework. For each direction, we conduct extensive experiments on multiple multimodal datasets and demonstrate that all of our proposed frameworks are able to effectively and substantially improve task performance in their corresponding contexts.
Show less
- Title
- Multimodal Learning and Generation Toward a Multisensory and Creative AI System
- Creator
- Zhu, Ye
- Date
- 2023
- Description
-
We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed...
Show moreWe are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified intelligent system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various modalities including vision, audio, and text, has become an increasingly popular research area with emerging technical advances in recent years. Under the context of multimodal learning, the creativity to generate and synthesize novel and meaningful data is a critical criterion to assess machine intelligence.As a step towards a multisensory and creative AI system, we study the problem of multimodal generation in this thesis by exploring the field from multiple perspectives. Firstly, we analyze different data modalities in a comprehensive manner by comparing the data natures, the semantics, and their corresponding mainstream technical designs. We then propose to investigate three multimodal generation application scenarios, namely text generation from visual data, audio generation from visual data, and visual generation from textual data, with diverse approaches to give an overview of the field. For the direction of text generation from visual data, we study a novel multimodal task in which the model is expected to summarize a given video with textual descriptions, under a challenging condition where the video can only be partially seen. We propose to supplement the missing visual information via a dialogue interaction and introduce QA-Cooperative network with a dynamic dialogue history update learning mechanism to tackle the challenge. For the direction of audio generation from visual data, we present a new multimodal task that aims to generate music for a given silent dance video clip. Unlike most existing conditional music generation works that generate specific types of mono-instrumental sounds using symbolic audio representations (e.g., MIDI), and that heavily rely on pre-defined musical synthesizers, we generate dance music in complex styles (e.g., pop, breaking, etc.) by employing a Vector-Quantized (VQ) audio representation via our proposed Dance2Music-GAN (D2M-GAN) framework. For the direction of visual generation from textual data, we tackle a key desideratum in conditional synthesis, which is to achieve high correspondence between the conditioning input and generated output using the state-of-the-art generative model -- Diffusion Probabilistic Model. While most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound in model training. In this work, we take a different route by explicitly enhancing input-output connections by maximizing their mutual information, which is achieved by our proposed Conditional Discrete Contrastive Diffusion (CDCD) framework. For each direction, we conduct extensive experiments on multiple multimodal datasets and demonstrate that all of our proposed frameworks are able to effectively and substantially improve task performance in their corresponding contexts.
Show less