TRANSCRIPT
Tensor Networks in Machine Learning
Kohei Hayashi, June 20, 2017
Self-Introduction
• Career
  • 2012: Ph.D. (Engineering), NAIST
  • 2012–2013: JSPS Postdoctoral Fellow, Yamanishi/Kashima Lab, University of Tokyo
  • 2013–2016: Project Assistant Professor, Kawarabayashi ERATO Project
  • 2016–present: Researcher, AI Research Center, AIST
• Research areas
  • Matrix and tensor decomposition
  • Approximate Bayesian inference
  • Web data mining
[Figure: a vector, a matrix, and a third-order tensor with axes such as person and t]
Tensor in Machine Learning
• Multi-dimensional array
  – Vector and matrix are special cases of tensors
• Each axis is called a mode; # of modes = order
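As a minimal numpy sketch of the terminology above (`ndim` plays the role of the order):

```python
import numpy as np

# A vector is an order-1 tensor, a matrix is order-2,
# and higher-order tensors simply add more modes.
vector = np.zeros(5)          # 1 mode
matrix = np.zeros((5, 4))     # 2 modes
tensor = np.zeros((5, 4, 3))  # 3 modes (e.g., person x item x time)

print(vector.ndim, matrix.ndim, tensor.ndim)  # ndim = order (# of modes)
```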
ProblemonTensorRepresentation
Canbeultra-highdimension,i.e.,• Difficulttointerpret• Heavyusageofdiskspace
Canwehavemoreinterpretable,light-weightrepresentation?
Hypothesis:Realdata=afewpatterns+noise
Example 1: Communities in Network
Example 2: Gene Expression
How to Find Patterns?
A small number of patterns can be found in low-dim space
• Low-dim space = low rank in matrices and tensors
Revisit: Matrix Rank
The number of rank-one matrices that suffices to recover the matrix:
$X$ is of rank $R \iff X = UV^\top = \sum_{r=1}^{R} u_r v_r^\top$
[Figure: X = UV^T depicted as a sum of R rank-one matrices]
Rank-one Matrix = Pattern!
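A small numpy sketch of the "data = a few patterns + noise" hypothesis: the SVD recovers the rank-one patterns of a noisy low-rank matrix (the data here are synthetic, generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "real data = a few patterns + noise": rank-2 signal plus small noise
U_true = rng.normal(size=(30, 2))
V_true = rng.normal(size=(20, 2))
X = U_true @ V_true.T + 0.01 * rng.normal(size=(30, 20))

# SVD exposes the rank-one patterns: X ≈ sum_r s_r * u_r v_r^T
u, s, vt = np.linalg.svd(X, full_matrices=False)
X_rank2 = (u[:, :2] * s[:2]) @ vt[:2, :]  # keep the two dominant patterns

rel_err = np.linalg.norm(X - X_rank2) / np.linalg.norm(X)
print(rel_err)  # small: two rank-one matrices recover X almost exactly
```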
Application
• Recommendation
CP Decomposition
• Sum of rank-one tensors
$X = \sum_{r=1}^{R} u_r \circ v_r \circ w_r$
[Figure: X as a sum of R rank-one tensors]
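The CP sum of outer products can be written as one `einsum` contraction; a minimal sketch with randomly generated factors:

```python
import numpy as np

def cp_reconstruct(U, V, W):
    """X = sum_r u_r ∘ v_r ∘ w_r, expressed as a single einsum contraction."""
    return np.einsum('ir,jr,kr->ijk', U, V, W)

rng = np.random.default_rng(0)
R = 3
U, V, W = (rng.normal(size=(n, R)) for n in (4, 5, 6))
X = cp_reconstruct(U, V, W)

# Sanity check against the explicit sum of R rank-one (outer-product) tensors
X_check = sum(np.multiply.outer(np.multiply.outer(U[:, r], V[:, r]), W[:, r])
              for r in range(R))
assert np.allclose(X, X_check)
print(X.shape)  # (4, 5, 6)
```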
Tucker Decomposition
• Generalization of CP decomposition
[Figure: X = core tensor Z multiplied by factor matrices U, V, W]
$X = \sum_{r=1}^{R} \sum_{s=1}^{S} \sum_{t=1}^{T} z_{rst} \, (u_r \circ v_s \circ w_t)$
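The Tucker sum likewise collapses into one contraction of the core with the factor matrices; a sketch with random inputs (CP is the special case of a superdiagonal core):

```python
import numpy as np

def tucker_reconstruct(Z, U, V, W):
    """X = sum_{r,s,t} z_rst (u_r ∘ v_s ∘ w_t): contract core Z with each factor."""
    return np.einsum('rst,ir,js,kt->ijk', Z, U, V, W)

rng = np.random.default_rng(1)
Z = rng.normal(size=(2, 3, 4))  # core tensor, ranks (R, S, T) = (2, 3, 4)
U, V, W = (rng.normal(size=(n, r)) for n, r in ((5, 2), (6, 3), (7, 4)))
X = tucker_reconstruct(Z, U, V, W)
print(X.shape)  # (5, 6, 7)
```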
BEYOND TENSOR DECOMPOSITION
Motivation
Tensor decompositions are nice, but …
1. Computationally intractable for higher-order tensors.
For an $I \times I \times \cdots \times I$ tensor of order $M$,
• Rank-$R$ Tucker decomposition:
  • $O(MIR^2 + MR^{M+1})$ for time
  • $O(MIR + R^M)$ for space
2. Other variations?
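The exponential term in the Tucker space cost comes from the $R^M$ core; a quick sketch of the parameter count (the helper name is mine, not from the talk):

```python
# Tucker storage for an I x I x ... x I tensor of order M with rank R:
# M factor matrices of size I x R, plus an R^M core tensor.
def tucker_params(M, I, R):
    return M * I * R + R ** M

for M in (3, 5, 10, 20):
    print(M, tucker_params(M, I=100, R=5))
# The R^M core term explodes as the order M grows,
# which is why Tucker is intractable for higher-order tensors.
```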
Tensor Network
• A model class for tensors under multilinearity
• Model structure is described by a tensor network diagram
What are Tensor Networks?
Tensor Network Diagram = (Undirected) Graphical Notation

Graph                                      Tensor
Node $n$                                   Tensor variable $A^{(n)}$
Degree of node $n$                         The order of $A^{(n)}$
Free edge of node $n$                      Free index of $A^{(n)}$
The $i$-th edge between nodes $n$ and $m$  Sum operation $\sum_i A_i^{(n)} A_i^{(m)}$
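An edge between two nodes is just a summed-over shared index, which `einsum` expresses directly; a minimal sketch for a two-node network:

```python
import numpy as np

rng = np.random.default_rng(0)
# Node A has free index i and bond index r; node B has bond index r and free index j.
# Contracting the edge r means computing sum_r A[i, r] * B[r, j].
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 5))
C = np.einsum('ir,rj->ij', A, B)  # edge r disappears; free edges i, j remain
assert np.allclose(C, A @ B)      # for two matrices, the contraction is a product
print(C.shape)  # (4, 5)
```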
Customer | Restaurant | Weather | Situation | Rating
John     | A          | Sunny   | Lunch     | 4
Bill     | B          | Rainy   | Lunch     | 2
Emmy     | C          | Windy   | Dinner    | 5
…
[Figure: rating data as a tensor with modes Customer, Restaurant, Weather, …]
Examples
[Diagrams: Tucker decomposition, hierarchical Tucker, tensor train (TT), other tensor networks]
Benefits
• Complexity
  • E.g. tensor train: $O(MIR^2)$ for space --- linear in M!
• Flexibility
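A sketch of why TT storage stays linear in the order M: a chain of small cores represents a tensor with exponentially many entries (the `tt_reconstruct` helper is mine, written for illustration):

```python
import numpy as np

def tt_reconstruct(cores):
    """Contract a chain of TT cores G_m of shape (R_{m-1}, I_m, R_m), R_0 = R_M = 1."""
    x = cores[0]
    for G in cores[1:]:
        x = np.einsum('...r,rjs->...js', x, G)  # contract the shared bond index r
    return x.squeeze(axis=(0, -1))              # drop the dummy boundary bonds

rng = np.random.default_rng(0)
I, R, M = 4, 2, 5
cores = [rng.normal(size=(1 if m == 0 else R, I, 1 if m == M - 1 else R))
         for m in range(M)]
X = tt_reconstruct(cores)
print(X.shape)                              # (4, 4, 4, 4, 4)
print(sum(G.size for G in cores), I ** M)   # 64 TT parameters vs 1024 full entries
```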
Tensor Networks
[Figure: within the entire model space for tensors, tensor networks cover Tucker, CP, tensor train, and hierarchical Tucker]
Getting Hot in ML
• ICML 2014: Putting MRFs on a Tensor Train. A. Novikov et al.
• NIPS 2015: Tensorizing Neural Networks. A. Novikov et al.
• COLT 2016: On the Expressive Power of Deep Learning: A Tensor Analysis. N. Cohen, O. Sharir, and A. Shashua.
• ICML 2016: Convolutional Rectifier Networks as Generalized Tensor Decompositions. N. Cohen and A. Shashua.
• NIPS 2016: Supervised Learning with Tensor Networks. E. Stoudenmire, D. J. Schwab.
• ICLR 2017: Exponential Machines. A. Novikov, M. Trofimov, I. Oseledets.
• ICLR 2017: Inductive Bias of Deep Convolutional Networks through Pooling Geometry. N. Cohen and A. Shashua.
(Connection to deep learning)
• NIPS 2017: On Tensor Train Rank Minimization: Statistical Efficiency and Scalable Algorithm. Imaizumi & H. (NEW!)
Blue Ocean?
• Tensor networks have been developed in physics
• ML people noticed TNs very recently (after 2010)
Wilderness!
• Almost no theory
• Many open problems
Challenges
1. Statistical performance: "How many samples are necessary for estimation?"
2. Model selection: "What tensor network is the best?"
3. Deep learning (skipped): "How are tensor networks and DNNs related?"
Statistical Performance
Learning tensor networks
• Given a tensor $X \in \mathcal{X}$ and a tensor network $g := (V, E) \in \mathcal{G}_{\mathcal{X}}$ with parameter space $\mathcal{P}_{g,R}$ and rank $R = (R_1, \ldots, R_{|E|})$
• Want to obtain an estimator $\hat{\Theta} \in \mathcal{P}_{g,R}$ such that $X \approx F_g(\hat{\Theta})$, where $F_g : \mathcal{P}_{g,R} \to \mathcal{X}$
[Figure: X ≈ F_g(Θ), a network g with component tensors Θ₁, Θ₂, Θ₃]
Optimization Problem
• How to obtain $\hat{\Theta}$? Minimize the approximation error:
$\hat{\Theta} = \operatorname{argmin}_{\Theta \in \mathcal{P}_{g,R}} \| X - F_g(\Theta) \|^2$
• If we know rank R, it is not difficult
• But in real cases we do not know R...
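To illustrate why the known-rank case is "not difficult", here is a sketch of alternating least squares for the simplest tensor network, a rank-R matrix factorization (not the talk's algorithm, just a minimal instance of minimizing the approximation error):

```python
import numpy as np

rng = np.random.default_rng(0)
# Ground truth: an exactly rank-2 matrix, with the rank R assumed known
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))

R = 2
U = rng.normal(size=(30, R))
V = rng.normal(size=(20, R))
# Alternating least squares: each subproblem is a closed-form least squares
for _ in range(50):
    U = X @ V @ np.linalg.inv(V.T @ V)    # argmin_U ||X - U V^T||^2
    V = X.T @ U @ np.linalg.inv(U.T @ U)  # argmin_V ||X - U V^T||^2

err = np.linalg.norm(X - U @ V.T) / np.linalg.norm(X)
print(err)  # essentially zero: with the correct R, the error is driven to zero
```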
Error Analysis
• Suppose X contains some noise E: $X = X^* + E$
• If the true data $X^*$ is a tensor network, $X^* \in \mathcal{X}(g) := \{ F_g(\Theta) \mid \Theta \in \mathcal{P}_{g,R} \}$, with unknown rank R, what is the recovery error $\| X^* - F_g(\hat{\Theta}) \|^2$?
• Tomioka+ [NIPS'13] analyzed the case where $g$ is the Tucker decomposition
• What about when $g$ is a tensor train? ⟹ [Imaizumi, Maehara, H. NIPS'17]
Model Selection
Network Determination
• The space of tensor networks $\mathcal{G}_{\mathcal{X}}$ is infinitely large
• How can we find an optimal $g$?
• What does the optimal $g$ mean, in terms of data analysis? How can we interpret it?
[Figure: X and candidate networks in $\mathcal{G}_{\mathcal{X}}$ …]
Special Case: Order Determination in TT
[Figure: tensor-train chains over modes 1–4 in different orders, e.g., 1-4-3-2, 2-4-3-1, 3-4-1-2, …]
• For an M-th order tensor, M!/2 candidates exist
• Which one should we choose?
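The M!/2 count comes from the fact that a TT chain read backwards is the same network; a short sketch that enumerates the distinct orderings (the helper is mine, for illustration):

```python
from itertools import permutations

def tt_orderings(M):
    """Distinct mode orderings of a tensor-train chain: a permutation and
    its reverse describe the same chain, so M!/2 candidates remain."""
    seen = set()
    for p in permutations(range(1, M + 1)):
        if p[::-1] not in seen:  # skip orderings whose reverse was already kept
            seen.add(p)
    return sorted(seen)

for M in (3, 4):
    print(M, len(tt_orderings(M)))  # 3 -> 3 (= 3!/2), 4 -> 12 (= 4!/2)
```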
Probabilistic Interpretation
$x_{ij} = \sum_r d_{rr} u_{ir} v_{jr}$
• If X, U, V, D are non-negative,
$p(i, j) = \sum_r p(r)\, p(i \mid r)\, p(j \mid r)$
• This is a topic model called pLSI
• HMM is similarly written as TT
[Figure: network diagram of X = U D V with indices i, r, j]
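A quick numerical check of the pLSI reading above: with non-negative, normalized factors (drawn here from Dirichlet distributions for illustration), the contraction yields a valid joint distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
R, I, J = 3, 5, 4
p_r = rng.dirichlet(np.ones(R))                  # p(r), sums to 1
p_i_given_r = rng.dirichlet(np.ones(I), size=R)  # row r is p(i | r)
p_j_given_r = rng.dirichlet(np.ones(J), size=R)  # row r is p(j | r)

# p(i, j) = sum_r p(r) p(i|r) p(j|r): a non-negative rank-R factorization
P = np.einsum('r,ri,rj->ij', p_r, p_i_given_r, p_j_given_r)

assert (P >= 0).all()
print(P.sum())  # 1.0 — a valid joint distribution over (i, j)
```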
Implication
• Finding a tensor network = automatic latent variable modeling!
$p(i, j) = \sum_r p(r)\, p(i \mid r)\, p(j \mid r)$
• Latent structure is the key of representation learning
• Suppose supervised learning of (y, x) ~ p(y, x)
  • A classifier directly learns p(y|x)
  • What if the data obey a cause-effect model: $p(y, x) = \sum_z p(y \mid z)\, p(x \mid z)$?
  • Inferring z must be beneficial
[Figure: observed variables vs. latent structure]
Summary
• Tensors are a fundamental data format, but intractable
• Tensor decomposition sometimes helps, but it isn't enough
• Tensor networks look promising