
Tensor Networks in Machine Learning

Kohei Hayashi, June 20, 2017

Self-Introduction
• Background
  • 2012: Ph.D. (Engineering), Nara Institute of Science and Technology
  • 2012–2013: JSPS postdoctoral fellow, Yamanishi/Kashima Lab, University of Tokyo
  • 2013–2016: Project assistant professor, Kawarabayashi ERATO project
  • 2016–present: Researcher, AI Research Center, AIST
• Research areas
  • Matrix and tensor decomposition
  • Approximate Bayesian inference
  • Web data mining

Tensor in Machine Learning
• Multi-dimensional array
  – Vector and matrix are special cases of tensor
• Each dimension is a "mode"; # of modes = order
[Figure: a vector (order 1), a matrix (order 2), and an order-3 tensor, with one axis labeled "person"]
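The definition above can be checked directly in NumPy, where the order of a tensor is simply the number of array dimensions (a minimal sketch; the shapes are arbitrary):

```python
import numpy as np

# A tensor is a multi-dimensional array; vectors and matrices are
# special cases with one and two modes, respectively.
vector = np.zeros(4)           # order 1 (1 mode)
matrix = np.zeros((4, 5))      # order 2 (2 modes)
tensor = np.zeros((4, 5, 6))   # order 3 (3 modes)

# ndim counts the modes, i.e., the order of the tensor.
print(vector.ndim, matrix.ndim, tensor.ndim)  # 1 2 3
```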

Problem on Tensor Representation

Can be ultra-high dimensional, i.e.,
• Difficult to interpret
• Heavy usage of disk space

Can we have a more interpretable, light-weight representation?

Hypothesis: Real data = a few patterns + noise

Example 1: Communities in a Network

Example 2: Gene Expression

How to Find Patterns?

A small number of patterns can be found in low-dimensional space
• Low-dim space = low rank in matrices and tensors

Revisit: Matrix Rank

The number of rank-one matrices that suffices to recover the matrix:

X is of rank R ⟺ X = UVᵀ = Σ_{r=1}^{R} u_r v_rᵀ

[Figure: X = UVᵀ drawn as a sum of R rank-one matrices u_r v_rᵀ]

Rank-one matrix = pattern!
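The rank-one-patterns view above can be sketched in NumPy (illustrative sizes; the random `U`, `V` stand in for learned patterns):

```python
import numpy as np

rng = np.random.default_rng(0)
R, n, m = 3, 6, 5

# Build X as a sum of R rank-one matrices u_r v_r^T ("patterns").
U = rng.standard_normal((n, R))
V = rng.standard_normal((m, R))
X = sum(np.outer(U[:, r], V[:, r]) for r in range(R))

# Equivalent matrix form X = U V^T; its rank is (at most) R.
assert np.allclose(X, U @ V.T)
print(np.linalg.matrix_rank(X))  # 3
```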

Application

• Recommendation

CP Decomposition
• Sum of rank-one tensors

X = Σ_{r=1}^{R} u_r ∘ v_r ∘ w_r

[Figure: X drawn as a sum of R rank-one tensors]
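The sum of outer products can be written compactly with `numpy.einsum` (a sketch with arbitrary small dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
R, I, J, K = 2, 3, 4, 5

# CP decomposition: X = sum_r u_r outer v_r outer w_r.
U = rng.standard_normal((I, R))
V = rng.standard_normal((J, R))
W = rng.standard_normal((K, R))

X = np.zeros((I, J, K))
for r in range(R):
    # One rank-one tensor per pattern r.
    X += np.einsum('i,j,k->ijk', U[:, r], V[:, r], W[:, r])

# The same sum written as a single contraction over the shared index r.
X2 = np.einsum('ir,jr,kr->ijk', U, V, W)
assert np.allclose(X, X2)
```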

Tucker Decomposition
• Generalization of CP decomposition

X = Σ_{r=1}^{R₁} Σ_{j=1}^{R₂} Σ_{k=1}^{R₃} z_{rjk} (u_r ∘ v_j ∘ w_k)

[Figure: X drawn as a core tensor Z contracted with factor matrices U, V, W]
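The triple sum with the core tensor Z is again one `einsum` contraction (a sketch; the ranks and mode sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 4, 5, 6
R1, R2, R3 = 2, 3, 2

# Tucker decomposition: a core tensor Z mixes every combination of
# factor columns: X = sum_{r,j,k} z_{rjk} (u_r outer v_j outer w_k).
Z = rng.standard_normal((R1, R2, R3))
U = rng.standard_normal((I, R1))
V = rng.standard_normal((J, R2))
W = rng.standard_normal((K, R3))

X = np.einsum('abc,ia,jb,kc->ijk', Z, U, V, W)

# CP is the special case where Z is superdiagonal (R1 = R2 = R3 = R).
print(X.shape)  # (4, 5, 6)
```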

BEYOND TENSOR DECOMPOSITION

Motivation
Tensor decompositions are nice, but…
1. Computationally intractable for higher-order tensors.
   For an I×I×⋯×I tensor of order M,
   • Rank-R Tucker decomposition:
     • O(MIR² + MR^{M+1}) for time
     • O(MIR + R^M) for space
2. Other variations?

Tensor Network

• A model class for tensors under multilinearity
• Model structure is described by a tensor network diagram

What are Tensor Networks?

Tensor Network Diagram = (Undirected) Graphical Notation

Graph                                   Tensor
Node n                                  Tensor variable A⁽ⁿ⁾
Degree of node n                        The order of A⁽ⁿ⁾
Free edge of node n                     Free index of A⁽ⁿ⁾
The i-th edge between nodes n and m     Sum operation Σ_i A_i⁽ⁿ⁾ A_i⁽ᵐ⁾
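The diagram-to-algebra dictionary above can be sketched with `numpy.einsum`: an edge shared by two nodes is a summed index, and free edges remain as output indices (the node shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Node n: one free edge (size 4) and one shared edge (size 3).
A = rng.standard_normal((4, 3))
# Node m: the shared edge (size 3) and one free edge (size 5).
B = rng.standard_normal((3, 5))

# Contracting the shared edge i leaves the two free edges:
# C_xy = sum_i A_xi B_iy -- for two matrices this is just a matrix product.
C = np.einsum('xi,iy->xy', A, B)
assert np.allclose(C, A @ B)
```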

[Example data for recommendation:]

Customer   Restaurant   Weather   Situation   Rating
John       A            Sunny     Lunch       4
Bill       B            Rainy     Lunch       2
Emmy       C            Windy     Dinner      5
…

[Figure: the ratings arranged as a Customer × Restaurant × Weather tensor]

Examples
• Tucker decomposition
• Hierarchical Tucker
• Tensor train (TT)
• Other tensor networks

[Figure: the tensor network diagram of each model]

Benefits
• Complexity
  • E.g. tensor train: O(MIR²) for space --- linear in M!
• Flexibility
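The linear-in-M storage claim can be sketched concretely. Assuming the standard TT format with cores of shape (rank, mode size, rank) and boundary ranks 1 (the sizes below are illustrative), the cores take O(MIR²) numbers while the full tensor would take I^M:

```python
import numpy as np

rng = np.random.default_rng(4)
M, I, R = 8, 10, 3  # order, mode size, TT-rank

# Tensor-train cores G_1..G_M; boundary ranks are 1.
ranks = [1] + [R] * (M - 1) + [1]
cores = [rng.standard_normal((ranks[m], I, ranks[m + 1])) for m in range(M)]

def tt_entry(cores, idx):
    # One entry x[i_1,...,i_M] is a product of M small matrices.
    v = cores[0][:, idx[0], :]
    for m in range(1, len(cores)):
        v = v @ cores[m][:, idx[m], :]
    return v[0, 0]

storage = sum(c.size for c in cores)
print(storage, I ** M)  # 600 versus 100000000 for the dense tensor
```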

Tensor Networks

[Diagram: the entire model space for tensors, with CP, Tucker, Tensor Train, and Hierarchical Tucker as nested subclasses]

Getting Hot in ML
• ICML 2014
  – Putting MRFs on a Tensor Train. A. Novikov et al.
• NIPS 2015
  – Tensorizing Neural Networks. A. Novikov et al.
• COLT 2016
  – On the Expressive Power of Deep Learning: A Tensor Analysis. N. Cohen, O. Sharir, and A. Shashua.
• ICML 2016
  – Convolutional Rectifier Networks as Generalized Tensor Decompositions. N. Cohen and A. Shashua.
• NIPS 2016
  – Supervised Learning with Tensor Networks. E. Stoudenmire and D. J. Schwab.
• ICLR 2017
  – Exponential Machines. A. Novikov, M. Trofimov, and I. Oseledets.
  – Inductive Bias of Deep Convolutional Networks through Pooling Geometry. N. Cohen and A. Shashua.
  (The Cohen et al. papers study the connection to deep learning.)
• NIPS 2017
  – On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm. Imaizumi & H. NEW!

Blue Ocean?• Tensor networks have been developed in physics • ML people noticed TNs very recently (after 2010)

Wilderness!• Almost no theory• Many open problems

Challenges
1. Statistical performance
   "How many samples are necessary for estimation?"
2. Model selection
   "Which tensor network is best?"
3. Deep learning (skipped)
   "How are tensor networks and DNNs related?"

Statistical Performance

Learning tensor networks
• Given a tensor X ∈ 𝒳 and a tensor network g ≔ (V, E) ∈ 𝒢_𝒳 with parameter space 𝒫_g and rank R = (R₁, …, R_Q)
• Want to obtain an estimator Θ̂ ∈ 𝒫_g such that X ≈ F_g(Θ̂), where F_g : 𝒫_g → 𝒳

[Figure: X ≈ F_g(Θ), the network g drawn with parameter tensors Θ₁, Θ₂, Θ₃]

Optimization Problem
• How to obtain Θ̂?
• Minimize the approximation error:
  Θ̂ = argmin_{Θ∈𝒫_g} ‖X − F_g(Θ)‖²
• If we know the rank R, this is not difficult
• But in real cases we do not know R…
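In the simplest case, when g is plain matrix factorization with known rank R, the minimizer is available in closed form: by the Eckart-Young theorem, the rank-R truncated SVD solves argmin ‖X − UVᵀ‖². This is only the matrix special case, not the general tensor-network procedure; the sizes and noise level below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, R = 8, 6, 2

# Low-rank ground truth plus a small amount of noise.
X_true = rng.standard_normal((n, R)) @ rng.standard_normal((R, m))
X = X_true + 0.01 * rng.standard_normal((n, m))

# Truncated SVD: keep only the R largest singular triplets.
svd_u, svd_s, svd_vt = np.linalg.svd(X, full_matrices=False)
X_hat = svd_u[:, :R] @ np.diag(svd_s[:R]) @ svd_vt[:R, :]

# The residual is on the order of the noise, not the signal.
print(np.linalg.norm(X - X_hat))
```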

Error Analysis
• Suppose X contains some noise E: X = X* + E
• If the true data X* is a tensor network, X* ∈ 𝒳(g) ≔ {F_g(Θ) | Θ ∈ 𝒫_g}, with unknown rank R, what is the recovery error ‖X* − F_g(Θ̂)‖²?
• Tomioka+ [NIPS'13] analyzed the case where g is the Tucker decomposition
• What about when g is a tensor train?
  => [Imaizumi, Maehara, H. NIPS'17]

Model Selection

Network Determination
• The space of tensor networks 𝒢_𝒳 is infinitely large
• How can we find an optimal g?
• What does the optimal g mean, in terms of data analysis? How can we interpret it?

[Figure: searching over 𝒢_𝒳 for the network g that fits X]

Special Case: Order Determination in TT

[Figure: alternative orderings of the four modes of X in the train, e.g., 1-4-3-2, 2-4-3-1, 3-4-1-2]

• For an M-th order tensor, M!/2 candidates exist
• Which one should we choose?
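The M!/2 count follows because reversing a mode ordering yields the same chain; it can be verified by brute force (`tt_orderings` is a hypothetical helper name for this sketch):

```python
from itertools import permutations

def tt_orderings(M):
    # Count orderings of M modes, identifying each ordering with its reversal.
    seen = set()
    for p in permutations(range(1, M + 1)):
        if p[::-1] not in seen:
            seen.add(p)
    return len(seen)

for M in (3, 4, 5):
    print(M, tt_orderings(M))  # 3, 12, 60 -- i.e., M!/2
```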

Probabilistic Interpretation

x_ij = Σ_r d_rr u_ir v_jr

• If X, U, V, D are non-negative,
  p(i, j) = Σ_r p(r) p(i|r) p(j|r)
• This is a topic model called pLSI
• HMM is similarly written as TT

[Figure: the factorization X = U D V drawn as a tensor network over indices i, r, j]
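The pLSI reading above can be sketched numerically: with D = diag(p(r)), the columns of U as p(i|r), and the columns of V as p(j|r), the factorization produces a valid joint distribution (the random factors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
I, J, R = 4, 5, 3

# Nonnegative factors, normalized as probabilities.
p_r = rng.random(R); p_r /= p_r.sum()              # p(r)
p_i_r = rng.random((I, R)); p_i_r /= p_i_r.sum(0)  # p(i|r), columns sum to 1
p_j_r = rng.random((J, R)); p_j_r /= p_j_r.sum(0)  # p(j|r), columns sum to 1

# x_ij = sum_r d_rr u_ir v_jr becomes p(i,j) = sum_r p(r) p(i|r) p(j|r).
P = np.einsum('r,ir,jr->ij', p_r, p_i_r, p_j_r)
assert P.min() >= 0 and np.isclose(P.sum(), 1.0)
```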

Implication
• Finding a tensor network = automatic latent variable modeling!
  p(i, j) = Σ_r p(r) p(i|r) p(j|r)
• Latent structure is the key to representation learning
  • Suppose supervised learning with (y, x) ~ p(y, x)
  • A classifier directly learns p(y|x)
  • What if the data obey a cause-effect model: p(y, x) = Σ_z p(y|z) p(x|z)?
  • Inferring z should be beneficial

[Diagram: observed variables vs. latent structure]

Summary
• Tensors are a fundamental data format, but intractable
• Tensor decomposition sometimes helps, but it isn't enough
• Tensor networks look promising
