Course Introduction
Feature representation of different modalities is the main focus of current cross-modal information retrieval research. Existing models typically project texts and images into the same embedding space. In this talk, we will introduce the basic ideas of text and image modeling and show how cross-modal relations can be built with deep learning models. In detail, we will discuss a joint model that uses metric learning to minimize the distance between representations of the same content from different modalities. We will also introduce some recent research developments in image captioning and visual question answering (VQA).
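To make the joint-model idea concrete, here is a minimal sketch (assuming PyTorch, hypothetical feature dimensions, and a triplet-style hinge loss over in-batch negatives) of how image and text features can be projected into a shared embedding space and trained with metric learning so that matching image-text pairs end up closer than mismatched ones. It illustrates the general technique, not the specific model covered in the course.

# Minimal cross-modal joint embedding sketch (assumption: PyTorch; dimensions are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # e.g. CNN image features -> shared space
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # e.g. word-vector text features -> shared space

    def forward(self, img_feat, txt_feat):
        # L2-normalize so cosine similarity is a plain dot product
        img_emb = F.normalize(self.img_proj(img_feat), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img_emb, txt_emb

def triplet_loss(img_emb, txt_emb, margin=0.2):
    # Hinge loss over in-batch negatives: a matching image-text pair should be
    # more similar than any mismatched pair by at least `margin`.
    sim = img_emb @ txt_emb.t()                        # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                      # similarities of matching pairs
    cost_txt = (margin + sim - pos).clamp(min=0)       # image anchor, text negatives
    cost_img = (margin + sim - pos.t()).clamp(min=0)   # text anchor, image negatives
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_txt.mean() + cost_img.mean()

# Toy usage with random features standing in for CNN / word-embedding outputs.
model = JointEmbedding()
img_feat, txt_feat = torch.randn(8, 2048), torch.randn(8, 300)
loss = triplet_loss(*model(img_feat, txt_feat))
loss.backward()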
[Workshop Outline]
1. The semantic gap
2. Image modeling and CNNs
3. Text modeling and word embeddings
4. Joint models
5. Automatic annotation
6. Text generation
7. Visual question answering
Target Benefits
Gain an understanding of cutting-edge deep learning research, learn how to use deep learning to jointly model image and text information, and learn how to build cross-modal semantic search and visual question answering systems.
Target Audience