Inferring and Executing Programs for Visual Reasoning

末安慶大

Minna no Wedding Co., Ltd. (株式会社みんなのウェディング)

NN論文を肴に酒を飲む会 #3 (a meetup for discussing neural network papers over drinks)

@ Google office

2017.07.12

About me

Name

末安慶大 (Keita Sueyasu)

@playgroundxxxx

Background

A technical college (kosen) in the countryside (2010-2015)

Advanced course (2015-2017)

Minna no Wedding (2017.04-)

Hobbyist

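Paper being presented

In a nutshell

For question answering about images, split the NN into modules by function, and jointly learn how to combine the modules and how to infer the answer.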

Problem setting

Q: Are there an equal number of large things and metal spheres?

Q: What size is the cylinder that is left of the brown metal thing that is left of the big sphere?

Q: There is a sphere with the same size as the metal cube; is it made of the same material as the small red sphere?

Q: How many objects are either small cylinders or red things?

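Problem setting

[Figure: two pipelines for the image x and the question q "Are there an equal number of large things and metal spheres?". In the first, a single NN outputs the answer a ("yes"). In the second, one NN acts as the Program Generator and another as the Execution Engine, which outputs a ("yes").]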

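Method

[Figure: overall architecture. The Program Generator reads the question q, and the Execution Engine takes the image x and outputs the answer a ("yes").]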

Program Generator

[Figure: the Program Generator π, a sequence-to-sequence model, maps the question q ("Are there more cubes than yellow things?") to a program z.]

z = π(q)

The Program Generator takes the query and outputs a program.

How many cylinders are in front of the tiny

thing and on the left side of the green object?

Program: the query expressed as a composition of functions

Program

[Figure: the program for the question above, built up step by step as a tree of functions ("What goes here?") whose two leaves are <SCENE>.]

Expressed in Polish (prefix) notation:

Count FilterShape ‘cylinder’ And Relate ‘left’ Unique FilterColor ‘green’ <SCENE>

Relate ‘in front’ Unique FilterSize ‘small’ <SCENE>
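To make the prefix notation concrete, here is a minimal sketch of how such a token sequence can be turned back into a tree. The arity table, the set of functions that take a quoted value, and the name FilterSize (used here for the size filter on the "tiny thing") are assumptions made for this illustration, not the paper's actual implementation.

```python
# Number of sub-program inputs each function takes (assumed table for this sketch).
ARITY = {
    "<SCENE>": 0,
    "Count": 1, "Unique": 1,
    "FilterShape": 1, "FilterColor": 1, "FilterSize": 1, "Relate": 1,
    "And": 2, "GreaterThan": 2,
}
# Functions that are followed by a quoted value such as 'cylinder' (assumed).
TAKES_VALUE = {"FilterShape", "FilterColor", "FilterSize", "Relate"}


def parse_prefix(tokens):
    """Turn a prefix-notation token list into a nested (name, value, children) tree."""
    def parse(pos):
        name = tokens[pos]
        pos += 1
        value = None
        if name in TAKES_VALUE:
            value = tokens[pos]
            pos += 1
        children = []
        for _ in range(ARITY[name]):
            child, pos = parse(pos)
            children.append(child)
        return (name, value, children), pos

    tree, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete program")
    return tree


# "How many cylinders are in front of the tiny thing and on the left side of the green object?"
program = [
    "Count", "FilterShape", "cylinder", "And",
    "Relate", "left", "Unique", "FilterColor", "green", "<SCENE>",
    "Relate", "in front", "Unique", "FilterSize", "small", "<SCENE>",
]
tree = parse_prefix(program)
```

Because every function has a fixed number of inputs, the flat sequence needs no parentheses: the parser always knows how many sub-programs to read next. That is what lets a sequence decoder emit a tree-structured program as a plain token sequence.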

Program

[Figure: the Program Generator as an LSTM encoder-decoder. The encoder reads q word by word ("Are there more cubes than yellow things"), and the decoder emits z token by token: GreaterThan Count FilterColor ‘yellow’ <SCENE> Count FilterShape ‘cube’ <SCENE>.]


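Execution Engine

[Figure: the Execution Engine φ, a Neural Module Network, takes the image x (through a CNN) and the program z = GreaterThan Count FilterColor ‘yellow’ <SCENE> Count FilterShape ‘cube’ <SCENE>, and outputs the answer a ("yes").]

a = φ(x, z)

Execution Engine: takes the image and the program and outputs the answer.

Neural Module Network (NMN): each function in the program corresponds to a small NN.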
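The idea of "one module per function, wired together according to the program" can be sketched symbolically. In the sketch below, each module is an ordinary Python function acting on a hand-written list of objects. This is only a stand-in: in the actual approach each module is a small neural network operating on CNN features. The scene contents, the function tables, and the argument order of GreaterThan (cubes first) are all assumptions chosen for readability.

```python
# Toy scene: each object is a dict of attributes (symbolic stand-in for CNN features).
scene = [
    {"shape": "cube", "color": "red"},
    {"shape": "cube", "color": "yellow"},
    {"shape": "cube", "color": "blue"},
    {"shape": "sphere", "color": "yellow"},
]

# One plain function per module; a real NMN would use a small NN for each.
FILTERS = {"FilterShape": "shape", "FilterColor": "color"}
UNARY = {"Count": lambda objs: len(objs)}
BINARY = {"GreaterThan": lambda a, b: "yes" if a > b else "no"}


def execute(tokens, scene):
    """Run a prefix-notation program by recursively composing the modules."""
    def run(pos):
        name = tokens[pos]
        if name == "<SCENE>":
            return list(scene), pos + 1
        if name in FILTERS:
            value = tokens[pos + 1]
            objs, pos = run(pos + 2)
            return [o for o in objs if o[FILTERS[name]] == value], pos
        if name in UNARY:
            arg, pos = run(pos + 1)
            return UNARY[name](arg), pos
        if name in BINARY:
            left, pos = run(pos + 1)
            right, pos = run(pos)
            return BINARY[name](left, right), pos
        raise KeyError(f"unknown module: {name}")

    answer, _ = run(0)
    return answer


# "Are there more cubes than yellow things?" (cube count first in this sketch)
program = [
    "GreaterThan",
    "Count", "FilterShape", "cube", "<SCENE>",
    "Count", "FilterColor", "yellow", "<SCENE>",
]
answer = execute(program, scene)
print(answer)  # yes (3 cubes vs. 2 yellow things)
```

The same set of modules answers a different question simply by being wired together according to a different program, which is the point of separating the Program Generator from the Execution Engine.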

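Motivation

An end-to-end model can only be used for the task it was trained on, right?

Idea

Why not split the network into modules and combine them to suit each task?

Personal opinion

Wouldn't it be great to build something like a Neural Module Zoo!?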

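Neural Module Network (NMN)

[Figure: modules that take one input and modules that take two inputs.]

Each module corresponds to a small NN.

Execution Engine

[Figure: the image passes through a CNN, and a Classifier at the end outputs a distribution over answers (yes, no).]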

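Experiments

1. Programs: supervised + unsupervised

2. Visualizing module outputs

3. Generalization across questions with different attributes

4. Generalization across different question types

5. Accuracy on questions containing unknown words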

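Experiments

1. Programs: supervised + unsupervised

With program supervision for only about 3% of the data, accuracy is nearly on par with training on fully supervised programs.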

Experiments

2. Visualizing module outputs

Each module learns its own role even though no module-specific supervision is given.

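Experiments

3. Generalization across questions with different attributes

Condition A: cubes are gray, blue, brown, or yellow; cylinders are red, green, purple, or cyan

Condition B: cubes are red, green, purple, or cyan; cylinders are gray, blue, brown, or yellow

After fine-tuning, the model generalizes across objects.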

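Experiments

4. Generalization across different question types

Experiment 1 gave good results with little supervision, but the questions are synthetic, so a small amount of data already covers most sentence structures. It therefore does not show generalization to unseen structures.

The questions were split into Short (average word count of 16 or fewer) and Long (the rest), and program supervision was given for Short only. Results: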

Experiments

Looks fairly tough…

Experiments

↑ "box" is correctly interpreted as "cube"

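Summary

By combining seq2seq with neural module networks, the model jointly learns the combination of modules that represents a question and the correct output for each module's assigned role.

Impressions

At ICLR there was also a paper on a reinforcement learning problem that separately learned the required skills and how to combine them.

Deep learning's early selling point seemed to be end-to-end training, but is splitting models along semantic lines becoming the trend lately?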

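Baselines

LSTM: uses the query only

CNN + LSTM: concatenates the CNN and LSTM outputs and passes them through a fully connected layer

CNN + LSTM + SA: the above with Stacked Attention applied

CNN + LSTM + SA + MLP: the above with the fully connected layer replaced by a multilayer perceptron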
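As a rough illustration of the CNN + LSTM baseline (concatenate the image and question features, then apply a fully connected layer), here is a minimal sketch in plain Python. The feature vectors, weights, and two-answer vocabulary are made-up placeholders, and the softmax over answers is an assumption of this sketch. A real baseline would use learned CNN and LSTM encoders and trained weights.

```python
import math


def linear(x, weights, bias):
    """Fully connected layer: one output score per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cnn_lstm_baseline(image_feat, question_feat, weights, bias):
    """Concatenate the two feature vectors, then apply one linear layer."""
    joint = image_feat + question_feat  # list concatenation
    return softmax(linear(joint, weights, bias))


# Made-up placeholder values, not real features or trained weights.
image_feat = [0.2, -0.1, 0.7]             # stand-in for the CNN output
question_feat = [0.5, 0.3]                # stand-in for the final LSTM state
weights = [[0.1, 0.4, -0.2, 0.3, 0.0],    # row for the answer "yes"
           [-0.3, 0.2, 0.5, -0.1, 0.2]]   # row for the answer "no"
bias = [0.0, 0.1]

probs = cnn_lstm_baseline(image_feat, question_feat, weights, bias)
answers = dict(zip(["yes", "no"], probs))
```

Unlike the modular approach, everything here is one fixed computation: the question only influences the answer through a single feature vector, with no explicit program structure.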

Dataset details
