Search results
(1 - 5 of 5)
- Title
- HARDWARE/SOFTWARE CO-DESIGN PARTITIONING ALGORITHM FOR MACHINE VISION APPLICATIONS
- Creator
- Gonnot, Thomas
- Date
- 2017, 2017-05
- Description
- Advancements in FPGA technologies now allow the implementation of machine vision using hardware components rather than processors for increased efficiency. Combining hardware and software implementations, however, can provide even more efficient results by drawing on the advantages of both technologies. This leads to the problem of partitioning machine vision algorithms between hardware and software. The hardware/software partitioning problem is NP-hard, so while a candidate partition can be checked in polynomial time, the time needed to find an optimal one is not predictable. Automated methods based on genetic algorithms or discrete particle swarm optimization allow a designer to implement computer vision algorithms without worrying about the hardware/software partitioning. Their reliance on randomness to explore different partitioning selections, however, means that the optimum result might not be reached and that the processing time cannot be predicted. This dissertation introduces a model that decomposes image processing and computer vision algorithms into a set of elementary blocks, each of which is assigned one or more configurations. A configuration can be either hardware or software and is linked to the corresponding resource utilization and performance. A procedure is also introduced to allocate the blocks to either hardware or software, and a cost function is defined to evaluate the quality of the generated design. The model and procedure allow any image processing algorithm to be partitioned in polynomial time by checking various implementations and selecting the optimum solution. This dissertation includes two test cases used to evaluate the efficiency of the method. The scale-invariant feature transform demonstrates the viability of the partitioning results on an algorithm containing multiple image convolution operations in parallel, while a neural network demonstrates the performance of the procedure when a machine vision algorithm contains many blocks. Finally, this dissertation presents a set of machine vision applications, such as object tracking, object recognition, optical character recognition, facial recognition, and visually impaired assistance. The proposed model and procedure could be included in the design flow of hardware/software co-design tools and provide a library of image processing blocks ready to be implemented. This would allow image processing and computer vision designers to implement any algorithm efficiently in hardware/software co-design without needing to know how to partition it.
Ph.D. in Electrical Engineering, May 2017
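To make the block-configuration model and cost-driven allocation concrete, the sketch below enumerates partitions of a toy three-block pipeline and scores each with a simple cost function. The block names, timing and area figures, and area budget are all hypothetical, and the brute-force enumeration is used only for readability; the dissertation's procedure reaches a partition in polynomial time.

```python
# Hedged sketch: hardware/software partitioning over a toy block model.
# Each block offers one or more configurations: (implementation, time, area).
# All numbers and names below are invented for illustration.
from itertools import product

blocks = {
    "convolution": [("hw", 2.0, 1200), ("sw", 9.0, 0)],
    "threshold":   [("hw", 0.5, 300),  ("sw", 1.0, 0)],
    "labeling":    [("sw", 4.0, 0)],   # a software-only block
}

AREA_BUDGET = 1400  # available hardware resources, hypothetical units

def cost(selection):
    """Total latency as the cost; infeasible if the area budget is exceeded."""
    time = sum(t for _, t, _ in selection)
    area = sum(a for _, _, a in selection)
    return float("inf") if area > AREA_BUDGET else time

# Brute-force enumeration of every configuration choice (exponential in the
# number of blocks; shown only to make the cost model explicit).
best = min(product(*blocks.values()), key=cost)
for name, (impl, t, a) in zip(blocks, best):
    print(f"{name}: {impl} (time={t}, area={a})")
print("total cost:", cost(best))
```

On this toy instance the convolution lands in hardware and the remaining blocks in software, since implementing both hardware candidates would exceed the area budget.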
- Title
- STEREO-BASED DEPTH MAP PROCESSING: ESTIMATION AND REFINEMENT
- Creator
- Loghman, Maziar
- Date
- 2016, 2016-12
- Description
- During the past decade, research in 3D video has become a hot topic owing to advancements in both hardware and software. Among the methods proposed for representing 3D data, the multi-view video plus depth (MVD) format has gained a lot of attention. Most 3D algorithms rely on a per-pixel depth representation of the scene called a depth map. Depth maps are very useful for rendering virtual views and have led to advancements in 3D compression algorithms. Generating an accurate and dense depth map is one of the important prerequisites for many 3D video applications. In this thesis, we highlight the following major problems in MVD:
* Depth map estimation
* Depth map refinement
* Depth map coding
In order to generate an accurate depth map, we propose a method based on the Census transform with adaptive window patterns and semi-global optimization. A modified cross-based cost aggregation technique is proposed that helps to calculate a more reliable depth map. To further enhance the quality of the generated depth map, a novel multi-resolution anisotropic-diffusion-based algorithm is presented. The proposed depth refinement algorithm computes a dense depth map in which the holes have been filled and the object boundaries are sharpened. The next part of the research addresses depth map coding. In depth map coding, a considerable amount of time is spent on the mode decision process for every block of depth pixels. For real-time purposes, however, the mode selection step can be partially skipped. In this thesis, we propose a novel depth intra-coding scheme for 3D video coding based on the HEVC standard. The core idea of the proposed method is motivated by the fact that depth maps have specific characteristics that distinguish them from color images. By analyzing the reference depth maps based on the homogeneity of different regions, the DMM full-RD search is skipped for some particular blocks and the mode is selected based on previous similar tree-blocks. By this means, the time complexity of the encoding process is significantly reduced.
Ph.D. in Electrical Engineering, December 2016
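For readers unfamiliar with the matching cost behind the estimation stage, here is a minimal sketch of a fixed-window Census transform and its Hamming-distance matching cost. The 5x5 window, image sizes, and function names are illustrative assumptions; the thesis builds on this with adaptive window patterns, cross-based cost aggregation, and semi-global optimization.

```python
# Hedged sketch: fixed-window Census transform and Hamming matching cost.
import numpy as np

def census_transform(img, win=5):
    """Encode each pixel as a bit string of neighbor-vs-center comparisons."""
    h, w = img.shape
    r = win // 2
    out = np.zeros((h, w), dtype=np.uint64)
    padded = np.pad(img, r, mode="edge")
    for dy in range(win):
        for dx in range(win):
            if dy == r and dx == r:
                continue  # skip the center pixel itself
            neighbor = padded[dy:dy + h, dx:dx + w]
            out = (out << np.uint64(1)) | (neighbor > img).astype(np.uint64)
    return out

def hamming_cost(census_left, census_right, d):
    """Matching cost at disparity d: Hamming distance of the census codes."""
    shifted = np.roll(census_right, d, axis=1)
    bits = np.unpackbits((census_left ^ shifted).view(np.uint8), axis=1)
    return bits.reshape(*census_left.shape, -1).sum(axis=2)

# Toy check: a right view that is the left view shifted by 3 pixels should
# have its lowest average cost at disparity 3.
left = np.random.randint(0, 256, (48, 64)).astype(np.uint8)
right = np.roll(left, -3, axis=1)
cl, cr = census_transform(left), census_transform(right)
costs = [hamming_cost(cl, cr, d).mean() for d in range(8)]
print(int(np.argmin(costs)))  # expected: 3
```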
- Title
- IMPROVING DEEP LEARNING BASED SEMANTIC SEGMENTATION USING CONTEXT INFORMATION
- Creator
- Xia, Zhengyu
- Date
- 2021
- Description
- Semantic segmentation is an important but challenging task in computer vision because it aims to assign each pixel a category label accurately. Nowadays, applications such as autonomous driving, path navigation, image search engines, and augmented reality require accurate semantic analysis and efficient segmentation mechanisms. In this thesis, we propose multiple models to improve the performance of semantic segmentation. In the first part, we focus on a single-task network that aims to improve the performance of semantic segmentation. Our research includes exploiting context information using mixed spatial pyramid pooling to extract dense context-embedded features in FCN-based semantic segmentation. We also propose a GAF module that generates a global context-based attention map to guide the shallow-layer feature maps toward better pixel localization. In the second part, we focus on a multi-task network that incorporates semantic segmentation to improve other computer vision tasks such as object detection. Specifically, a multi-task network, along with a learning strategy, is designed to let semantic segmentation and object detection assist each other, since the two tasks are highly correlated. We also include weakly-supervised multi-label semantic segmentation learning to deal with the shortage of high-quality training examples and to improve the performance of cross-domain object detection. In the third part, we focus on improving the performance of video panoptic segmentation, a unified task that combines semantic segmentation and instance segmentation on video streams. We design a new ConvLSTM pyramid to transmit spatio-temporal contextual information in our video panoptic segmentation network. Specifically, we propose a modified ConvLSTM to generate temporal contextual information, and we design an MSTPP module to obtain mixed spatio-temporal context-embedded feature maps. Experimental results on different datasets show that our proposed methods achieve better performance than state-of-the-art methods.
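Since the abstract leans on pyramid-style context modules, a minimal PyTorch sketch of one is given below. The pooling scales, channel widths, and fusion layer are illustrative assumptions in the spirit of the mixed spatial pyramid pooling, not the thesis's exact MSPP or MSTPP design.

```python
# Hedged sketch: a pyramid-pooling context module (scales and widths invented).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, scales=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),  # pool the feature map to an s x s grid
                nn.Conv2d(in_ch, in_ch // len(scales), 1, bias=False),
            )
            for s in scales
        )
        self.fuse = nn.Conv2d(in_ch * 2, in_ch, 3, padding=1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled context map back to the input resolution,
        # then fuse the original features with the context-embedded ones.
        ctx = [F.interpolate(stage(x), (h, w), mode="bilinear",
                             align_corners=False) for stage in self.stages]
        return self.fuse(torch.cat([x, *ctx], dim=1))

feat = torch.randn(1, 256, 64, 64)       # a stand-in backbone feature map
print(PyramidPooling(256)(feat).shape)   # torch.Size([1, 256, 64, 64])
```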
- Title
- AI IN MEDICINE: ENABLING INTELLIGENT IMAGING, PROGNOSIS, AND MINIMALLY INVASIVE SURGERY
- Creator
- Getty, Neil
- Date
- 2022
- Description
- While an extremely rich research field, AI in medicine has been much slower to reach real-world clinical settings than other applications of AI such as natural language processing (NLP) and image processing/generation. Often the stakes of failure are more dire, access to private and proprietary data is more costly, and the burden of proof required by expert clinicians is much higher. Beyond these barriers, the typical data-driven approach to validation is interrupted by the need for expertise to analyze results. Whereas the results of a trained ImageNet or machine translation model are easily verified by a computational researcher, analysis in medicine can demand far more multi-disciplinary expertise. AI in medicine is motivated by a great demand for progress in health care, but an even greater responsibility for high accuracy, model transparency, and expert validation. This thesis develops machine and deep learning techniques for medical image enhancement, patient outcome prognosis, and minimally invasive robotic surgery awareness and augmentation. Each of the works presented was undertaken in direct collaboration with medical domain experts, and the efforts could not have been completed without them. In pursuing medical image enhancement we worked with radiologists, neuroscientists, and a neurosurgeon. In patient outcome prognosis we worked with clinical neuropsychologists and a cardiovascular surgeon. For robotic surgery we worked with surgical residents and a surgeon expert in minimally invasive surgery. Each of these collaborations guided priorities for problem and model design, analysis, and long-term objectives, grounding this thesis as a concerted effort toward clinically actionable medical AI. The contributions of this thesis focus on three medical domains. (1) Deep learning for medical brain scans: we developed processing pipelines and deep learning models for image annotation, registration, segmentation, and diagnosis in both traumatic brain injury (TBI) and brain tumor cohorts. A major focus of these works is the efficacy of low-data methods and techniques for validating results without any ground-truth annotations. (2) Outcome prognosis for TBI and risk prediction for cardiovascular disease (CVD): we developed feature extraction pipelines and models for TBI and CVD patient clinical outcome prognosis and risk assessment. We design risk prediction models for CVD patients using traditional Cox modeling, machine learning, and deep learning techniques. In these works we conduct exhaustive data and model ablation studies, with a focus on feature saliency analysis, model transparency, and the use of multi-modal data. (3) AI for enhanced and automated robotic surgery: we developed computer vision and deep learning techniques for understanding and augmenting minimally invasive robotic surgery scenes. We developed models to recognize surgical actions from vision and kinematic data. Beyond models and techniques, we also curated novel datasets and prediction benchmarks from simulated and real endoscopic surgeries. We show the potential of self-supervised techniques in surgery, as well as multi-input and multi-task models.
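As a hedged sketch of the traditional Cox modeling baseline mentioned for CVD risk prediction, the toy example below fits a proportional-hazards model with the lifelines library. The two covariates and the six-patient synthetic dataset are invented for illustration; the thesis pairs such baselines with machine learning and deep learning models plus saliency analysis.

```python
# Hedged sketch: Cox proportional-hazards risk model on synthetic data.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "age":    [54, 61, 47, 70, 58, 66],
    "sys_bp": [130, 150, 120, 160, 140, 155],  # systolic blood pressure
    "years":  [5.0, 2.1, 8.3, 1.2, 4.4, 3.0],  # follow-up duration
    "event":  [1, 0, 0, 1, 1, 0],              # 1 = CVD event observed
})

cph = CoxPHFitter()
cph.fit(df, duration_col="years", event_col="event")
cph.print_summary()                             # hazard ratio per covariate
print(cph.predict_partial_hazard(df.iloc[:1]))  # relative risk, one patient
```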
- Title
- Multimodal Learning and Generation Toward a Multisensory and Creative AI System
- Creator
- Zhu, Ye
- Date
- 2023
- Description
- We perceive and communicate with the world in a multisensory manner, where different information sources are processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified, intelligent system. To endow machines with true intelligence, multimodal machine learning, which incorporates data from modalities including vision, audio, and text, has become an increasingly popular research area with emerging technical advances in recent years. In the context of multimodal learning, the creativity to generate and synthesize novel and meaningful data is a critical criterion for assessing machine intelligence. As a step toward a multisensory and creative AI system, this thesis studies the problem of multimodal generation from multiple perspectives. First, we analyze different data modalities comprehensively by comparing their data natures, their semantics, and the corresponding mainstream technical designs. We then investigate three multimodal generation scenarios, namely text generation from visual data, audio generation from visual data, and visual generation from textual data, with diverse approaches that give an overview of the field. For text generation from visual data, we study a novel multimodal task in which the model is expected to summarize a given video with textual descriptions under the challenging condition that the video can only be partially seen. We propose to supplement the missing visual information via dialogue interaction and introduce a QA-Cooperative network with a dynamic dialogue history update learning mechanism to tackle the challenge. For audio generation from visual data, we present a new multimodal task that aims to generate music for a given silent dance video clip. Unlike most existing conditional music generation works, which generate specific types of mono-instrumental sounds using symbolic audio representations (e.g., MIDI) and rely heavily on pre-defined musical synthesizers, we generate dance music in complex styles (e.g., pop, breaking) by employing a Vector-Quantized (VQ) audio representation via our proposed Dance2Music-GAN (D2M-GAN) framework. For visual generation from textual data, we tackle a key desideratum in conditional synthesis: achieving high correspondence between the conditioning input and the generated output using a state-of-the-art generative model, the Diffusion Probabilistic Model. While most existing methods learn such relationships implicitly by incorporating the prior into the variational lower bound during training, we take a different route and explicitly enhance input-output connections by maximizing their mutual information, achieved through our proposed Conditional Discrete Contrastive Diffusion (CDCD) framework. For each direction, we conduct extensive experiments on multiple multimodal datasets and demonstrate that all of our proposed frameworks effectively and substantially improve task performance in their corresponding contexts.
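Because CDCD's core idea is maximizing the mutual information between the conditioning input and the generated output, a minimal sketch of the standard InfoNCE contrastive bound on that quantity is shown below. The embedding dimensions and batch pairing are stand-in assumptions; CDCD integrates such a term into discrete contrastive diffusion training rather than using it in isolation.

```python
# Hedged sketch: InfoNCE loss, a contrastive lower bound on mutual information
# between condition embeddings and output embeddings.
import torch
import torch.nn.functional as F

def info_nce(cond_emb, out_emb, temperature=0.07):
    """Paired rows are positives; every other pairing in the batch is a negative."""
    c = F.normalize(cond_emb, dim=-1)
    z = F.normalize(out_emb, dim=-1)
    logits = c @ z.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(c.size(0), device=c.device)  # row i pairs with column i
    return F.cross_entropy(logits, labels)

cond = torch.randn(8, 128)  # e.g. embeddings of the conditioning text
out = torch.randn(8, 128)   # e.g. embeddings of the generated samples
print(info_nce(cond, out))
```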