NLP, Multimodal Intelligence, and Machine Learning · 10/31/2019 · [he, zhang, ren, sun, "deep residual...
TRANSCRIPT
1
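Alan Turing
The Turing test
Judges whether a machine is intelligent through conversation and interaction between a human and the machine.
"Computing Machinery and Intelligence" (1950)
The original aspiration and holy grail of AI: semantic understanding and human-machine dialogue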
2
Geoff Hinton
Yoshua Bengio
Yann LeCun
The most recent wave of AI development is driven by deep learning
[Hinton and Salakhutdinov, “Reducing the dimensionality of data with neural networks.” Science, July 2006]
4
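Deep learning made its first breakthrough in large-vocabulary speech recognition
5.1%
Microsoft 2017, Switchboard
In 2010, MSR (Li Deng's group) first demonstrated the power and potential of deep learning on a large-scale core AI task (ASR)
In 2017, MSR (XD Huang's group) reached human-level accuracy on Switchboard
This opened a series of deep learning breakthroughs in human language technology (HLT).
In 2010, deep learning improved large-vocabulary speech recognition performance by 20%; in 2017, accuracy on Switchboard reached human level!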
[Xiong, Wu, Alleva, Droppo, Huang, Stolcke, "The Microsoft 2017 conversational speech recognition system," ICASSP 2018]
[Dahl, Yu, Deng, Acero, "Large vocabulary continuous speech recognition with context-dependent DBN-HMMS," ICASSP 2011]
6
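Language understanding / semantic intent classification
Example review: "Roast pork, the best. Scallops? I don't like scallops. The cocktails here are amazing, fun, and tasty. Next time I am in this city, I will definitely come back. Highly recommended."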
[Yang, Yang, Dyer, He, Smola, Hovy, “HAN”, NAACL2016]
Hierarchical Attention Net (HAN) Introducing self-attention
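The Hierarchical Attention Network (HAN), proposed in 2016, models language at the word, sentence, and paragraph levels to understand text and determine intent, and offers a degree of interpretability by visualizing neuron activations.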
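The two-level attention idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the HAN implementation: the recurrent encoders of the original model are omitted, the word vectors are random stand-ins, and all names (`attention_pool`, `W_w`, `u_w`, and so on) are hypothetical. The returned weights are the quantities one would visualize for interpretability.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(H, W, b, u):
    """Attention pooling over the rows of H (n, d).

    Returns the weighted sum of rows and the attention weights.
    """
    scores = np.tanh(H @ W + b) @ u   # one score per row
    alpha = softmax(scores)           # weights sum to 1
    return alpha @ H, alpha

rng = np.random.default_rng(0)
d = 8
# Toy document: 3 sentences with 4, 2, and 5 word vectors (random stand-ins).
sentences = [rng.normal(size=(n, d)) for n in (4, 2, 5)]

# Hypothetical parameters for the word-level and sentence-level attention.
W_w, b_w, u_w = rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d)
W_s, b_s, u_s = rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d)

# Level 1: pool words into one vector per sentence.
sent_vecs, word_weights = [], []
for S in sentences:
    v, a = attention_pool(S, W_w, b_w, u_w)
    sent_vecs.append(v)
    word_weights.append(a)

# Level 2: pool sentence vectors into one document vector.
doc_vec, sent_weights = attention_pool(np.stack(sent_vecs), W_s, b_s, u_s)
```

`doc_vec` would feed a classifier, while `word_weights` and `sent_weights` indicate which words and sentences the model attended to.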
7
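"Xiao Ming couriered a bag of apples to his grandfather"
H3
Input
Neural network
Natural-language description
Semantic invariance is extracted step by step through a deep neural network
Abstract semantic representation
Semantic space
Extract semantics from natural language and project them into a semantic space to support search, recommendation, classification, question answering, and other applications
"Grandpa received a bag of Red Fuji from Xiao Ming": a semantically similar description
"Xiao Ming gave his girlfriend the latest-generation Apple X": a semantically different description
Language understanding / semantic representation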
[Huang, He, Gao, Deng, Acero, Heck, “DSSM”, CIKM2013]
8
[Figure: DSSM architecture with three parallel towers, for s, t+, and t-. Each tower: input word/phrase → bag-of-words vector (dim = 100M) → char-trigram encoding matrix (fixed) → dim = 50K → char-trigram embedding matrix → d = 500 → d = 500 → semantic vector (d = 300). Source-side weight matrices: Ws,1, Ws,2, Ws,3, Ws,4. Semantic vectors: $v_s$, $v_{t^+}$, $v_{t^-}$.]
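The fixed char-trigram encoding step in the figure can be illustrated as follows. This is a sketch under assumptions: words are padded with `#` boundary markers and split on whitespace, and the function names are hypothetical. It shows why the encoding shrinks the input dimension: the number of distinct letter trigrams is far smaller than the number of distinct words.

```python
from collections import Counter

def letter_trigrams(word):
    """Split a word into letter trigrams, with '#' marking word boundaries."""
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_bag(text):
    """Bag of letter trigrams for a whitespace-tokenized phrase."""
    counts = Counter()
    for word in text.split():
        counts.update(letter_trigrams(word))
    return counts

print(letter_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']
```

In a full model, each trigram would index one row of the trainable embedding matrix that follows the fixed encoding.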
DSSM: Deep Structured Semantic Model
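Training objective based on relative similarity:
$\frac{\exp(\cos(v_s, v_{t^+}))}{\sum_{t' \in \{t^+, t^-\}} \exp(\cos(v_s, v_{t'}))}$
Compute gradients: $\frac{\partial}{\partial W}$ of the objective above
Compute cosine similarity between semantic vectors: $\cos(v_s, v_{t^+})$, $\cos(v_s, v_{t^-})$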
Target-side weight matrices (one tower each for t+ and t-): Wt,1, Wt,2, Wt,3, Wt,4
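s: "Xiao Ming couriered a bag of apples to his grandfather"
t+: "Grandpa received a bag of Red Fuji from Xiao Ming"
t-: "Xiao Ming gave his girlfriend the latest-generation Apple X"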
[Huang, He, Gao, Deng, Acero, Heck, “DSSM”, CIKM2013; Shen, He, Gao, Deng, Mesnil, “CDSSM”, WWW2014&CIKM2014]
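The relative-similarity objective can be written as a softmax over cosine similarities, minimizing the negative log-probability of the positive target. The sketch below is illustrative only: the vectors stand in for tower outputs, `dssm_loss` is a hypothetical name, and the scaling factor `gamma` is an assumption (the slide's formula corresponds to `gamma = 1`).

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dssm_loss(v_s, v_pos, v_negs, gamma=1.0):
    """Negative log of exp(cos(s, t+)) / sum over t' of exp(cos(s, t'))."""
    sims = np.array([cosine(v_s, v_pos)] + [cosine(v_s, v) for v in v_negs])
    logits = gamma * sims
    logits = logits - logits.max()  # for numerical stability
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos

# Toy semantic vectors: t+ is aligned with s, t- is orthogonal to it.
v_s = np.array([1.0, 0.0])
v_pos = np.array([2.0, 0.0])
v_neg = np.array([0.0, 1.0])

good = dssm_loss(v_s, v_pos, [v_neg])  # low loss: t+ is the closer one
bad = dssm_loss(v_s, v_neg, [v_pos])   # higher loss: roles swapped
```

Training would backpropagate this loss into the tower weights so that semantically similar pairs move closer in the semantic space.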
9
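Knowledge reasoning and question answering
Knowledge base: $\lambda x.\ \mathrm{sister\_of}(\mathrm{justin\_bieber}, x)$
Who is Justin Bieber's sister?
sibling_of(justin_bieber, x) ∧ gender(x, female)
Semantic parsing
SQL search and matching
Jazmyn Bieber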
[Yih, He, Meek, ACL2014; Yih, Chang, He, Gao, ACL2015; Golub & He, EMNLP2016;…]
Represent knowledge, parse semantics, perform reasoning, and answer questions in a continuous vector space
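To make the logical form on this slide concrete, the query sibling_of(justin_bieber, x) ∧ gender(x, female) can be executed against a toy triple store. This is a symbolic sketch only; it does not show the continuous-vector approach of the cited papers, and the knowledge base contents and function names are illustrative assumptions.

```python
# Toy knowledge base of (subject, relation, object) triples; illustrative only.
KB = {
    ("justin_bieber", "sibling_of", "jazmyn_bieber"),
    ("justin_bieber", "sibling_of", "jaxon_bieber"),
    ("jazmyn_bieber", "gender", "female"),
    ("jaxon_bieber", "gender", "male"),
}

def objects(subject, relation):
    """All objects o such that (subject, relation, o) is in the KB."""
    return {o for s, r, o in KB if s == subject and r == relation}

def sisters_of(person):
    """Evaluate: lambda x. sibling_of(person, x) AND gender(x, female)."""
    return {x for x in objects(person, "sibling_of")
            if "female" in objects(x, "gender")}

print(sisters_of("justin_bieber"))  # {'jazmyn_bieber'}
```

Semantic parsing is the step that maps the natural-language question to such a query; the lookup itself is then a plain match over the knowledge base.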
11
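JD.com intelligent customer service
Stages: pre-sale, in-sale, after-sale, logistics
Intelligent shopping guide
Intelligent dispatching, intelligent navigation, intelligent summarization, real-time response assistance, intelligent quality inspection, intelligent event creation
Emotion-aware intelligent customer service
Intelligent voice response (phone)
Intelligent outbound voice calls (phone)
Intelligent human-machine dialogue and interaction services across the full retail chain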
12
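Significant progress from language understanding and question answering to human-machine dialogue
"Conversational bots need not only to respond to user requests and complete tasks, but also to satisfy users' needs for communication and emotion, and to build emotional connections with users." "We will be the first generation of humans in history to live in symbiosis with AI."
Source: "From Eliza to XiaoIce: Challenges and Opportunities with Social Chatbots," Heung-Yeung Shum, Xiaodong He, Di Li.
FITEE (journal of the Chinese Academy of Engineering), special issue "Artificial Intelligence 2.0: Theory and Applications" (10.1631/FITEE.1700826)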
13
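Visual intelligence
Human object recognition error rate: about 5%
Before 2012, mostly linear models; after 2012, the mainstream models are deep neural networks
MSR Jian Sun's Group (ResNet)
Toronto Geoff Hinton's group (AlexNet)
[Fei-Fei Li +]
(1000-class object recognition test)
In 2012, deep learning improved large-scale image recognition performance by more than 30%; in 2015, accuracy on ImageNet reached human level!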
[Krizhevsky, Sutskever, Hinton, "Imagenet classification with deep convolutional neural networks," NIPS 2012]
[He, Zhang, Ren, Sun, "Deep residual learning for image recognition," CVPR 2016]
16
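Building a multimodal semantic space: cross-modal representation learning
Use the Deep Structured Semantic Model (DSSM) to represent both images and text as vectors in a shared semantic space. Compute semantic similarity in this space to generate the text description that best matches the image content.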
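[Figure: two-branch DSSM. Image branch (input s): raw image pixels → CNN (convolution/pooling, fully connected) → image features → H1 → H2 → H3, with weights W1 to W4. Text branch (input t1): text description → H1 → H2 → H3, with weights W1 to W4. Example text description: "A man holding a racket on a tennis court."]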
Vision-language multimodal semantic space
[Fang, Gupta, Iandola, Srivastava, Deng, Dollar, Gao, He, et al., “From Captions to Visual Concepts and Back,” CVPR2015]
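Once images and text share a semantic space, picking the best caption reduces to ranking candidates by cosine similarity to the image vector. The sketch below assumes random stand-in features and hypothetical projection matrices (`W_img`, `W_txt`); it illustrates the ranking step only, not the trained model from the cited paper.

```python
import numpy as np

def l2_normalize(X):
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def rank_captions(image_vec, caption_vecs):
    """Sort caption indices from most to least similar to the image."""
    sims = l2_normalize(caption_vecs) @ l2_normalize(image_vec)
    return np.argsort(-sims), sims

rng = np.random.default_rng(0)
image_feat = rng.normal(size=16)       # stand-in for CNN image features
text_feats = rng.normal(size=(3, 12))  # stand-ins for 3 candidate captions
W_img = rng.normal(size=(16, 5))       # hypothetical image-side projection
W_txt = rng.normal(size=(12, 5))       # hypothetical text-side projection

# Project both modalities into the same 5-dimensional semantic space.
image_vec = np.tanh(image_feat @ W_img)
caption_vecs = np.tanh(text_feats @ W_txt)

order, sims = rank_captions(image_vec, caption_vecs)
best = order[0]  # index of the caption closest to the image
```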
17
a baseball player throwing a ball
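"A baseball player is throwing a ball."
Word-by-word generation: a baseball → a baseball player → a baseball player throwing → a baseball player throwing a ball
[Fang, Gupta, Iandola, Srivastava, Deng, Dollar, Gao, He, et al., “From Captions to Visual Concepts and Back,” CVPR2015]
Image captioning: understand the image, express it in language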
18
[Guo, Zhang, Hu, He, Gao, "MS-Celeb-1M: A dataset and benchmark for large-scale face recognition", ECCV 2016]
[Tran, He, Zhang, Sun, et al., "Rich image captioning in the wild," CVPR DeepVision Workshop 2016]
Fusing with entity knowledge
Jen-Hsun Huang, Xiaodong He, Jian Sun et al., that are posing for a picture.
19
Controllable language expression
[Gan, Gan, He, Gao, Deng, “StyleNet”, CVPR2017]
[Gan, Gan, He, Gao, Deng, “Semantic Compositional Net”, CVPR2017]
Control language generation so that AI expresses itself in a romantic or humorous style: StyleNet
20
Vision-language multimodal question answering (Visual QA)
Answer natural language questions according to the content of a reference image.
Visual Question Answering (VQA)
21
Spatial feature vectors of different regions of the image
Multiple steps of reasoning over the image to infer the answer
From image captioning to visual question answering: reasoning ability
To answer a question about an image:
Need to understand subtle relationships among multiple objects
Need to focus on the specific regions that are relevant to the answer.
22
Spatial feature vectors of different regions of the image
Stacked Attention Network (SAN)
SANs perform multi-step reasoning
1. Question model
2. Image model
3. Multi-level attention model
4. Answer predictor
5. End-to-end learning using SGD
[Yang, He, Gao, Deng, Smola, “Stacked Attention Networks,” CVPR 2016]
23
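[Figure: first attention layer. The question vector $v_Q$ and the spatial image feature vectors $\{v_i\}$ produce an attention map $\{p_i\}$ over the image regions ($p_1, \dots, p_{196}$).]
Attention 1: attention map $\{p_i\}$
Multimodal pooling (level 1): $v_I = \sum_i p_i v_i$, $u = v_I + v_Q$
$u$ is passed to the next attention level
First attention layer
Cross-modal representation fusion and grounding (Pooling & Grounding)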
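24
Query vector from the 1st level attention: $u$
[Figure: second attention layer. The query $u$ and the spatial image feature vectors $\{v_i\}$ produce a new attention map $\{p_i\}$ over the image regions ($p_1, \dots, p_{196}$).]
Attention 2: attention map $\{p_i\}$
Multimodal pooling (level 2): $v_I^{(2)} = \sum_i p_i v_i$, $u^{(2)} = v_I^{(2)} + u$
$u^{(2)}$ is passed to the answer predictor
Second attention layer
Cross-modal representation fusion and grounding (Pooling & Grounding)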
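The two attention levels can be sketched as a loop in which each level re-weights the same region vectors using the query refined by the previous level. The pooling and sum steps follow the equations on the slides; the scoring function that produces the attention map (a tanh layer followed by a softmax) is an assumed additive-attention form, and all parameters and names are random or hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_layer(V, q, W_v, W_q, b, w_p):
    """One attention level.

    V: (m, d) region vectors v_i; q: (d,) query vector.
    Returns the refined query and the attention map p over regions.
    """
    h = np.tanh(V @ W_v + (q @ W_q + b))  # (m, k), assumed scoring layer
    p = softmax(h @ w_p)                  # attention map {p_i}
    v_att = p @ V                         # v_I = sum_i p_i v_i
    return v_att + q, p                   # u = v_I + query

rng = np.random.default_rng(0)
m, d, k = 196, 32, 16                     # 196 regions, e.g. a 14 x 14 grid
V = rng.normal(size=(m, d))               # stand-in image region features
v_Q = rng.normal(size=d)                  # stand-in question vector

# Hypothetical parameters, one set per attention level.
params = [(0.1 * rng.normal(size=(d, k)), 0.1 * rng.normal(size=(d, k)),
           np.zeros(k), rng.normal(size=k)) for _ in range(2)]

u, maps = v_Q, []
for W_v, W_q, b, w_p in params:
    u, p = attention_layer(V, u, W_v, W_q, b, w_p)
    maps.append(p)
# After the loop, u plays the role of u^(2), the input to the answer predictor.
```

Comparing `maps[0]` and `maps[1]` is how one would inspect whether the second level narrows attention toward the regions relevant to the answer.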
25
Bottom-Up and Top-Down Attention(BUTD)
A new perspective on attention models
In the human visual system, there are two kinds of attention:
Top-down attention: proactively initiated by the current task (e.g., looking for something)
Bottom-up attention: spontaneously emerges from visually salient stimuli
26
Adopt terminology similar to that of the human attention system:
• attention mechanisms driven by non-visual or task-specific context are 'top-down'
• purely visual feed-forward attention mechanisms are 'bottom-up'
Top-down features: from CNN
Bottom-up features: from Faster R-CNN (F-RCNN)
Overall Attention Net for VQA:
Bottom-Up and Top-Down Attention(BUTD)
27
Because of Bottom-up Attention
[1] Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR18
[2] Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge, CVPR18
Since then, almost all VQA teams have used the "Bottom-Up and Top-Down (BUTD)" attention model or one of its variants.
VQA Challenge @ CVPR2017
28
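Vision-language multimodal navigation
By combining language understanding with modeling of visual information about the environment, an intelligent agent can follow instructions to walk from one place to another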
[Anderson et al., CVPR2018] [Wang et al., CVPR 2019]
29
Objective function:
[Reed et al., “Generative adversarial text-to-image synthesis”, ICML2016]
Understand language, express it through painting (Text-to-Image)
30
[Xu, Zhang, Huang, Zhang, Gan, Huang, He, “AttnGAN,” CVPR2018]
The final objective function: $L = L_{GAN} + \lambda L_{DAMSM}$
AttnGAN: GAN with Attention
31
A small bird with red feathers, a white belly, and a short beak
[Xu, Zhang, Huang, Zhang, Gan, Huang, He, “AttnGAN,” CVPR2018]
Drawing bot (AttnGAN): precise understanding, accurate drawing
32
this bird has a green crown black primaries and a white belly
this bird has wings that are blue and has a red belly
this bird has a yellow crown and a black eye ring that is round
a small red and white bird with a small curved beak
More examples
33
a fruit stand display with bananas and kiwi
a herd of sheep grazing on a lush green field
an old clock next to a light post in front of a steeple
a wild pack of family dogs came running through the yard one day
AI: Artificial Imagination
35
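Deep learning drives understanding and expression across language and vision
Multimodal representation learning / multimodal attention mechanisms
Multimodal intelligence research: perception, reasoning, and creation across language and vision, multimodal fusion (pooling), multimodal grounding, etc.
Q: what are sitting in the basket on a bicycle? A: dogs.
This bird is red with white and has a very short beak
Multimodal dialogue and interaction robots, digital assistants, virtual/mixed reality, robots with cognitive abilities, technology art …
Image captioning bot, visual question answering bot, intelligent drawing bot
Vision-language multimodal information processing: connecting the physical world and the symbolic world
… Language navigation bot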