
Algorithms for Big Data Analysis

Zaiwen Wen (文再文)

Course codes: 00136720 (undergraduate), 00100863 (combined undergraduate-graduate)

Beijing International Center for Mathematical Research, Peking University

http://bicmr.pku.edu.cn/~wenzw/bigdata2019.html
wenzw@pku.edu.cn

1/68

2/68

Course Information

Algorithms for big data analysis, with an emphasis on numerical linear algebra and optimization algorithms

Course codes: 00136720 (undergraduate), 00100863 (combined undergraduate-graduate)

Instructor: Zaiwen Wen, wenzw@pku.edu.cn, WeChat: wendoublewen

Teaching assistants: 杨明瀚 (Minghan Yang), 柳伊扬 (Yiyang Liu)

Classroom: Room 301, Classroom Building No. 2

Class time: weeks 1-16, Tuesday periods 1-2 every week and Thursday periods 1-2 in even weeks, 8:00am-9:50am

Course homepage: http://bicmr.pku.edu.cn/~wenzw/bigdata2020.html

3/68

References

Class notes, and reference books or papers:

"Mining of Massive Datasets", Jure Leskovec, Anand Rajaraman, Jeff Ullman, Cambridge University Press

"Convex Optimization", Stephen Boyd and Lieven Vandenberghe

"Numerical Optimization", Jorge Nocedal and Stephen Wright, Springer

"Optimization Theory and Methods", Wenyu Sun, Ya-Xiang Yuan

"Matrix Computations", Gene H. Golub and Charles F. Van Loan, The Johns Hopkins University Press

"Deep Learning", Ian Goodfellow, Yoshua Bengio and Aaron Courville

4/68

Websites related to big data analysis

Mining Massive Data Sets, Stanford University

Jure Leskovec @ Stanford - Stanford Computer Science

Mining Massive Data Sets: Hadoop Labs, Stanford University

Big Data Analytics, Columbia University

Theoretical Foundations of Big Data Analysis, The Simons Institute for the Theory of Computing, UC Berkeley

Core Methods in Data Science, University of Washington

"Theories of Deep Learning", STATS 385, Stanford University, Fall 2017

Courses from the Berkeley Artificial Intelligence Research (BAIR) Lab

5/68

Course Plan

Emphasis on models and algorithms from numerical linear algebra and optimization

Linear programming, semidefinite programming

Basic theory and algorithms of compressive sensing and sparse optimization

Basic theory and algorithms of low-rank matrix recovery

Randomized eigenvalue algorithms, stochastic optimization algorithms

Graph and network flow problems: shortest path, maximum flow, etc.

Machine learning algorithms: clustering, dimensionality reduction for high-dimensional data, link analysis, recommender systems, SVM

Modern medical imaging and high-dimensional image analysis: phase retrieval and cryo-electron microscopy

Optimal transport

Deep learning

Reinforcement learning

6/68

Course Information

Teaching format: lectures

Grading:

Late submissions are penalized 10% per day (24 hours); homework and projects more than 4 days late are not accepted (no excuses accepted at any time)

Homework, including exercises and programming: 40%

Course project 1: 30%

Course project 2: 30%

Homework requirements: i) For computational problems, show the necessary derivation steps; for proofs, give the key reasoning and arguments. For numerical experiments, submit both a written report and the code, where the report contains detailed derivations and the numerical results with analysis. ii) You may discuss with classmates or ask the TAs for help, but copying directly during discussion is not allowed; you must complete the work independently afterwards. iii) Copying from other students, the internet, previous years' solutions, other courses, or any other source is strictly forbidden. iv) If you discussed the problems or obtained help from any source, list the source.

Please register for this course with care.

7/68

Why Optimization in Machine Learning?

Many problems in ML can be written as

min_{w∈W}  ∑_{i=1}^N (1/2) ‖w^⊤ x_i − y_i‖_2^2 + λ‖w‖_1    (linear regression)

min_{w∈W}  (1/N) ∑_{i=1}^N log(1 + exp(−y_i w^⊤ x_i)) + λ‖w‖_1    (logistic regression)

min_{w∈W}  ∑_{i=1}^N ℓ(f(w, x_i), y_i) + λ r(w)    (general formulation)

The pairs (x_i, y_i) are given data; y_i is the label of the data point x_i
ℓ(·): measures model fit for data point i (avoids under-fitting)
r(w): measures model "complexity" (avoids over-fitting)
f(w, x): linear function or models constructed from deep neural networks
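To make the first formulation concrete, here is a minimal numpy sketch (not the course's reference code) that minimizes (1/2)‖Xw − y‖² + λ‖w‖_1 with a proximal-gradient (ISTA) iteration; the toy data, step size, and λ below are made-up values.

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    # 0.5 * ||Xw - y||_2^2 + lam * ||w||_1
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

def ista(X, y, lam, step, iters=500):
    # proximal gradient: gradient step on the smooth part, then soft-thresholding
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        z = w - step * (X.T @ (X @ w - y))                        # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox of lam*||.||_1
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20); w_true[:3] = [1.0, -2.0, 0.5]              # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(100)
w_hat = ista(X, y, lam=0.1, step=1.0 / np.linalg.norm(X, 2) ** 2)
print(lasso_objective(w_hat, X, y, 0.1), np.count_nonzero(np.abs(w_hat) > 1e-3))
```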

8/68

Convolutional neural network (CNN)

A pioneering digit classification neural network by LeCun et al. It was applied by several banks to recognise hand-written numbers on checks. https://stats385.github.io/networks

9/68

AlexNet

Designed by Krizhevsky, Sutskever and Hinton in 2012; it achieved a top-5 error of 15.3%. The network consisted of five convolutional layers, some of which were followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. All the convolutional and fully connected layers were followed by the ReLU nonlinearity. https://stats385.github.io/networks

10/68

VGGNet

A 19-layer convolutional neural network from the VGG group, Oxford, that was simpler and deeper than AlexNet. All large-sized filters in AlexNet were replaced by cascades of 3x3 filters (with nonlinearity in between). Max pooling was placed after two or three convolutions, and after each pooling the number of filters was always doubled. https://stats385.github.io/networks

11/68

Deconvolutional networks

A generative network that is a special kind of convolutional network using transpose convolutions, also known as deconvolutional layers. https://stats385.github.io/architectures

12/68

Recurrent neural networks (RNN)

RNNs are built on the same computational unit as the feed-forward neural network. RNNs do not have to be organized in layers and directed cycles are allowed. This allows them to have internal memory and, as a result, to process sequential data. One can convert an RNN into a regular feed-forward neural network by "unfolding" it in time, as depicted in the figures. https://stats385.github.io/architectures

13/68

Optimization algorithms in Deep learning

Stochastic gradient-type algorithms

Algorithms implemented in pytorch/caffe2: adadelta, adagrad, adam, nesterov, rmsprop, YellowFin. https://github.com/pytorch/pytorch/tree/master/caffe2/sgd

Algorithms in pytorch/torch: sgd, asgd, adagrad, rmsprop, adadelta, adam, adamax. https://github.com/pytorch/pytorch/tree/master/torch/optim

Algorithms implemented in tensorflow: Adadelta, AdagradDA, Adagrad, ProximalAdagrad, Ftrl, Momentum, adam, CenteredRMSProp. Implementation: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/training_ops.cc

14/68

Stochastic Algorithms in Deep learning

Consider the problem  min_{x∈R^d} (1/n) ∑_{i=1}^n f_i(x)

References: chapter 8 in http://www.deeplearningbook.org/

Gradient descent

x_{t+1} = x_t − (α_t/n) ∑_{i=1}^n ∇f_i(x_t)

Stochastic gradient descent (with the index i sampled at random)

x_{t+1} = x_t − α_t ∇f_i(x_t)

SGD with momentum

v_{t+1} = µ_t v_t − α_t ∇f_i(x_t)
x_{t+1} = x_t + v_{t+1}
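A minimal numpy sketch of the SGD-with-momentum update above, applied to a made-up least-squares sum ∑_i f_i(x); the step size, momentum constant, and data are illustrative assumptions, not the course's settings.

```python
import numpy as np

def sgd_momentum(grad_i, x0, n, alpha=0.01, mu=0.9, epochs=20, seed=0):
    # v_{t+1} = mu * v_t - alpha * grad f_i(x_t);  x_{t+1} = x_t + v_{t+1}
    rng = np.random.default_rng(seed)
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(epochs):
        for i in rng.permutation(n):      # one pass over randomly ordered components
            v = mu * v - alpha * grad_i(x, i)
            x = x + v
    return x

# toy components f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10))
x_true = rng.standard_normal(10)
b = A @ x_true
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
x_hat = sgd_momentum(grad_i, np.zeros(10), n=200)
print("distance to x_true:", np.linalg.norm(x_hat - x_true))
```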

15/68

Stochastic Algorithms in Deep learning

Adaptive Subgradient Method (Adagrad): let g_t = ∇f_i(x_t), g_t^2 = diag[g_t g_t^⊤] ∈ R^d, and initialize G_1 = g_1^2. At step t,

x_{t+1} = x_t − α_t / √(G_t + ε 1_d) ⊙ ∇f_i(x_t)
G_{t+1} = G_t + g_{t+1}^2

In the update above and in the following iterations, the vector-vector products are element-wise.
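The same toy setup also serves for a minimal Adagrad sketch; here the squared gradients are accumulated before the step, a common implementation variant of the update written above (illustrative code, not the course's).

```python
import numpy as np

def adagrad(grad_i, x0, n, alpha=0.5, eps=1e-8, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    x, G = x0.copy(), np.zeros_like(x0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = grad_i(x, i)
            G += g * g                           # G_{t+1} = G_t + g_t^2 (element-wise)
            x -= alpha / np.sqrt(G + eps) * g    # per-coordinate scaled step
    return x

# same least-squares components as in the SGD example
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10))
b = A @ rng.standard_normal(10)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
x_hat = adagrad(grad_i, np.zeros(10), n=200)
print("residual norm:", np.linalg.norm(A @ x_hat - b))
```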

16/68

Reinforcement Learning

AlphaGo: supervised learning + policy gradients + value functions + Monte-Carlo tree search

17/68

What is RL

Reinforcement learning

Branch of machine learning concerned with taking sequences of actions. Usually described in terms of policies (select the next action), value functions (measure the goodness of states or state-action pairs), and dynamics models (predict next states and rewards).

Reinforcement learning is learning what to do, i.e., how to map situations to actions, so as to maximize a numerical reward signal. The decision-maker is called the agent; the thing it interacts with is called the environment.

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP.

We assume that all RL tasks can be approximated by ones with the Markov property, so this lecture is based on MDPs.

18/68

Agent and Environment

The agent selects actions based on the observations and rewards received at each time-step t

The environment selects observations and rewards based on the actions received at each time-step t

19/68

Some applications

20/68

Definition

A Markov decision process is a tuple (S, A, P, r, γ):

S is a finite set of states, s ∈ S
A is a finite set of actions, a ∈ A
P is the transition probability distribution; the probability of moving from state s to state s′ under action a is P(s′ | s, a); also called the model or the dynamics
r is a reward function, r(s, a, s′); sometimes just r(s) or r_s^a, or r_t after time step t
γ ∈ [0, 1] is a discount factor

21/68

Tetris

State: the current board, the current falling tile
Action: rotate or shift the falling shape
One-step reward: if a level is cleared by the current action, reward 1, otherwise reward 0
Transition probability: the future tile is uniformly distributed

22/68

Optimization

Maximize the expected total discounted return of an episode

max_π  E[ ∑_{t=0}^∞ γ^t r_t | π ]

or,

max_π  E[ R(τ) ],  where the trajectory τ = (s_0, a_0, s_1, a_1, ...) ∼ π

The policy π(a | s) is a probability distribution over actions given the state

23/68

Linear Programming Formulation

Solving Markov decision processes (MDPs) can be formulated as an LP.

Primal

max  ∑_{a} r_a^⊤ x_a
s.t. ∀ s ∈ S:  ∑_{a∈A} x_a(s) = 1 + γ ∑_{a∈A} ∑_{s′∈S} P(s | s′, a) x_a(s′)
     x ≥ 0

Dual

min  ∑_{s} v_s
s.t. ∀ s ∈ S, a ∈ A:  v_s ≥ r_a(s) + γ ∑_{s′} P(s′ | s, a) v_{s′}
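The dual LP above can be handed directly to a generic LP solver. Below is a hedged sketch using scipy.optimize.linprog on a made-up 2-state, 2-action MDP; the transition probabilities and rewards are invented for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

# toy MDP: P[a][s, s'] = P(s' | s, a), r[a, s] = reward for action a in state s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, nS, nA = 0.9, 2, 2

# dual LP: min sum_s v_s  s.t.  v_s >= r(s,a) + gamma * sum_{s'} P(s'|s,a) v_{s'},
# rewritten as (gamma * P_a - I) v <= -r_a for each action a
A_ub = np.vstack([gamma * P[a] - np.eye(nS) for a in range(nA)])
b_ub = np.concatenate([-r[a] for a in range(nA)])
res = linprog(c=np.ones(nS), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
print("optimal value function v* =", res.x)
```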

24/68

Linear programming, semidefinite programming, conic programming

Primal

min  c^⊤ x
s.t. Ax = b
     x ∈ K

Dual

max  b^⊤ y
s.t. A^⊤ y + s = c
     s ∈ K*

"Large" problems:

LP: K is the nonnegative orthant; vector x ∈ R^n, millions of variables

SOCP: K is the circular or Lorentz cone; vector x ∈ R^n, millions of variables

SDP: K is the cone of positive semidefinite matrices; matrix X ∈ S^{n×n}, n up to 5000

25/68

Compressive Sensing

Find the sparsest solution. Given n = 256, m = 128: A = randn(m,n); u = sprandn(n,1,0.1); b = A*u;

(a) ℓ0-minimization:  min_x ‖x‖_0  s.t. Ax = b
(b) ℓ2-minimization:  min_x ‖x‖_2  s.t. Ax = b
(c) ℓ1-minimization:  min_x ‖x‖_1  s.t. Ax = b

(Figure: recovered coefficients for the three formulations; axis values omitted.)
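A rough numpy analogue of the MATLAB experiment above: solve the ℓ1 problem as a linear program (splitting x into positive and negative parts) and compare with the minimum ℓ2-norm solution. This is a hypothetical reconstruction of the experiment, not the slide's original code.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m = 256, 128
A = rng.standard_normal((m, n))
u = np.zeros(n)
support = rng.choice(n, size=int(0.1 * n), replace=False)
u[support] = rng.standard_normal(support.size)   # ~10% nonzeros, like sprandn
b = A @ u

# l1-minimization min ||x||_1 s.t. Ax = b, as an LP with x = xp - xm, xp, xm >= 0
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_l1 = res.x[:n] - res.x[n:]

# minimum l2-norm solution for comparison
x_l2 = np.linalg.pinv(A) @ b
print("l1 recovery error:", np.linalg.norm(x_l1 - u))
print("l2 recovery error:", np.linalg.norm(x_l2 - u))
```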

26/68

Wavelets and Images (Thanks: Richard Baraniuk)

27/68

Wavelet Approximation (Thanks: Richard Baraniuk)

28/68

Compressive sensing

Given (A, b, Ψ), find the sparsest point:

x* = arg min { ‖Ψx‖_0 : Ax = b }

From combinatorial to convex optimization:

x̄ = arg min { ‖Ψx‖_1 : Ax = b }

The 1-norm is sparsity promoting
Basis pursuit (Donoho et al. 98)
Many variants: ‖Ax − b‖_2 ≤ σ for noisy b

Theoretical question: when is ‖·‖_0 ↔ ‖·‖_1 ?

29/68

Restricted Isometry Property (RIP)

Definition (Candes and Tao [2005])
Matrix A obeys the restricted isometry property (RIP) with constant δ_s if

(1 − δ_s) ‖c‖_2^2 ≤ ‖Ac‖_2^2 ≤ (1 + δ_s) ‖c‖_2^2

for all s-sparse vectors c.

Theorem (Candes and Tao [2006])
If x is k-sparse and A satisfies δ_{2k} + δ_{3k} < 1, then x is the unique ℓ1 minimizer.

RIP essentially requires that every set of columns with cardinality less than or equal to s behaves like an orthonormal system.

30/68

MRI: Magnetic Resonance Imaging

(a) MRI Scan (b) Fourier Coefficients (c) Image

Is it possible to cut the scan time in half?

31/68

MRI (Thanks: Wotao Yin)

MR images often have sparse representations under some wavelet transform Φ

Solve

min_u ‖Φu‖_1 + (µ/2) ‖Ru − b‖^2

R: partial discrete Fourier transform
The higher the SNR (signal-to-noise ratio) is, the better the image quality is.

(a) full sampling  (b) 39% sampling, SNR=32.2

32/68

MRI: Magnetic Resonance Imaging

(a) full sampling  (b) 39% sampling, SNR=32.2  (c) 22% sampling, SNR=21.4  (d) 14% sampling, SNR=15.8

33/68

The Netflix Prize

Training data
100 million ratings, 480,000 users, 17,770 movies
6 years of data: 2000-2005

Test data
Last few ratings of each user (2.8 million)
Evaluation criterion: root mean squared error (RMSE)
Netflix Cinematch RMSE: 0.9514

Competition
2700+ teams
$1 million prize for 10% improvement on Cinematch

34/68

Netflix Problem: 1 million dollar award

Given m movies x ∈ X and n customers y ∈ Y
Predict the "rating" W(x, y) of customer y for movie x
Training data: known ratings of some customers for some movies
Goal: complete the matrix
Other applications: collaborative filtering, system identification, etc.

35/68

Matrix Rank Minimization

Given X ∈ R^{m×n}, A : R^{m×n} → R^p, b ∈ R^p, we consider

the matrix completion problem:

min rank(X),  s.t. X_{ij} = M_{ij}, (i, j) ∈ Ω

nuclear norm minimization:

min ‖X‖_*  s.t. A(X) = b

where ‖X‖_* = ∑_i σ_i and σ_i is the i-th singular value of the matrix X.

SDP reformulation:

min  Tr(W_1) + Tr(W_2)
s.t. [ W_1  X ; X^⊤  W_2 ] ⪰ 0
     A(X) = b
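One simple algorithm for the matrix completion problem above is singular value thresholding (SVT, in the style of Cai, Candès and Shen). The following numpy sketch on a made-up low-rank matrix is illustrative only; the parameters are rough defaults rather than tuned values.

```python
import numpy as np

def svt_complete(M_obs, mask, tau, step=1.2, iters=500):
    # iterate: X_k = shrink(Y_{k-1}, tau);  Y_k = Y_{k-1} + step * mask * (M_obs - X_k)
    Y = np.zeros_like(M_obs)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt       # singular value shrinkage
        Y = Y + step * mask * (M_obs - X)             # update on observed entries only
    return X

rng = np.random.default_rng(0)
m, n, r = 60, 50, 3
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-3 ground truth
mask = rng.random((m, n)) < 0.5                                  # observe ~50% of entries
X = svt_complete(M * mask, mask, tau=5 * np.sqrt(m * n))
print("relative error:", np.linalg.norm(X - M) / np.linalg.norm(M))
```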

36/68

Video separation

Partition the video into moving and static parts

37/68

Sparse and low-rank matrix separation

Given a matrix M, we want to find a low-rank matrix W and a sparse matrix E so that W + E = M.

Convex approximation:

min_{W,E} ‖W‖_* + µ‖E‖_1,  s.t. W + E = M

Robust PCA

38/68

Graph and network flow problems

39/68

Graph and network flow problems

2014 ACM SIGMOD Programming Contest

Shortest Distance Over Frequent Communication Paths: define the edges of the social network as pairs of people who know each other and have at least x direct replies to each other. Given two people p1 and p2 in the network and an integer x, find the path between p1 and p2 with the fewest nodes.

Interests with Large Communities

Socialization Suggestion

Most Central People (all-pairs shortest path): define the network over forum members with tag t who directly know each other. Given an integer k and a tag t, find the k people with the highest closeness centrality values.

40/68

DIMACS Implementation Challenges

http://dimacs.rutgers.edu/Challenges/

2014, 11th: Steiner Tree Problems
2012, 10th: Graph Partitioning and Graph Clustering
2005, 9th: The Shortest Path Problem
2001, 8th: The Traveling Salesman Problem
2000, 7th: Semidefinite and Related Optimization Problems
1998, 6th: Near Neighbor Searches
1995, 5th: Priority Queues, Dictionaries, and Multi-Dim. Point Sets
1994, 4th: Two Problems in Computational Biology: Fragment Assembly and Genome Rearrangements
1993, 3rd: Effective Parallel Algorithms for Combinatorial Problems
1992, 2nd: Maximum Clique, Graph Coloring, and Satisfiability
1991, 1st: Network Flows and Matching

41/68

Combinatorial optimization problems

Maximum cut problem
Given weights W for all edges in a graph, find the partition of the nodes of G into two sets V1 and V2 so that the sum of the weights of the edges between V1 and V2 is maximal.

Maximum k-cut problem (frequency assignment problem)
Given a graph G, assign colors to the vertices in such a way that the number of non-defect edges (with endpoints of different colors) is maximal.

Maximum stable set problems
ω(G): clique number of a graph G = size of the largest clique in G
χ(G): coloring number of a graph G = smallest number of colors needed to color the graph so that vertices connected by an edge have different colors
θ(G) (θ+(G)): Lovász theta function of G, defined by an SDP

ω(G) ≤ θ(G) ≤ θ+(G) ≤ χ(G).

42/68

Semidefinite programming (SDP) relaxations

Maximum cut problem

min_{X∈S^n} ⟨C, X⟩  s.t. X_{ii} = 1, X ⪰ 0

Frequency assignment problem (maximum k-cut problem)

min ⟨ (1/(2k)) diag(We) + ((k−1)/(2k)) W, X ⟩
s.t. X_{ij} ≥ −1/(k−1), ∀ (i, j) ∈ E\T,   X_{ij} = −1/(k−1), ∀ (i, j) ∈ T,
     diag(X) = e,  X ⪰ 0.

Maximum stable set problems

θ(G) := max ⟨ee^⊤, X⟩  s.t. X_{ij} = 0, (i, j) ∈ E,  ⟨I, X⟩ = 1,  X ⪰ 0,

θ+(G) := max ⟨ee^⊤, X⟩  s.t. X_{ij} = 0, (i, j) ∈ E,  ⟨I, X⟩ = 1,  X ⪰ 0,  X ≥ 0
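As an illustration of the first relaxation, here is a hedged cvxpy sketch (assuming cvxpy and its bundled conic solver are available) for max-cut on a toy 5-cycle, taking C as −1/4 of the graph Laplacian and adding a Goemans-Williamson-style random-hyperplane rounding step.

```python
import numpy as np
import cvxpy as cp

n = 5
W = np.zeros((n, n))
for i in range(n):                                   # unit-weight 5-cycle
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian
C = -0.25 * L                                        # max cut = max (1/4) <L, x x^T>

X = cp.Variable((n, n), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), [cp.diag(X) == 1, X >> 0])
prob.solve()

# random-hyperplane rounding from the Gram factorization X = V V^T
eigval, eigvec = np.linalg.eigh(X.value)
V = eigvec * np.sqrt(np.maximum(eigval, 0))
signs = np.sign(V @ np.random.default_rng(0).standard_normal(n))
print("SDP bound:", -prob.value, " rounded cut value:", 0.25 * signs @ L @ signs)
```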

43/68

Communities in the Networks

Many networks have community structures. Nodes in the same cluster have high connection intensity.

Figure: https://www.slideshare.net/NicolaBarbieri/community-detection

44/68

Communities in the Networks

Figure: Simmons College Facebook Network; the four clusters are labeled by graduation year: 2006 in green, 2007 in light blue, 2008 in purple and 2009 in red. Figure from Chen, Li and Xu, 2016.

45/68

Partition Matrix and Assignment Matrix

For any partition ∪_{a=1}^k C_a = [n], define the partition matrix X by

X_{ij} = 1 if i, j ∈ C_a for some a, and 0 otherwise.

Low-rank solution: X = Φ Φ^⊤, where Φ ∈ {0,1}^{n×k} is the assignment matrix with Φ_{ia} = 1 if i ∈ C_a. For example, with C_1 = {1, 2, 3} and C_2 = {4, 5},

X =
[ 1 1 1 0 0
  1 1 1 0 0
  1 1 1 0 0
  0 0 0 1 1
  0 0 0 1 1 ]
=
[ 1 0
  1 0
  1 0
  0 1
  0 1 ]
×
[ 1 1 1 0 0
  0 0 0 1 1 ]

46/68

Modularity Maximization

The modularity (M. E. J. Newman and M. Girvan, 2004) is defined by

Q = ⟨A − (1/(2λ)) d d^⊤, X⟩

where λ = |E| and d is the degree vector.

The integer modularity maximization problem:

max ⟨A − (1/(2λ)) d d^⊤, X⟩
s.t. X ∈ {0, 1}^{n×n} is a partition matrix.

Probably hard to solve.
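A small numpy check of the definition above on a made-up toy graph (two 4-cliques joined by one edge); note that the standard Newman-Girvan modularity divides this inner product by 2λ.

```python
import numpy as np

def modularity_slide(A, labels):
    # Q = <A - (1/(2*lam)) * d d^T, X> with lam = |E| and X the partition matrix
    d = A.sum(axis=1)                        # degree vector
    lam = A.sum() / 2.0                      # number of edges |E|
    B = A - np.outer(d, d) / (2.0 * lam)     # modularity matrix
    X = (labels[:, None] == labels[None, :]).astype(float)
    return np.sum(B * X)

# toy graph: two 4-cliques connected by a single edge
A = np.zeros((8, 8))
A[:4, :4] = 1; A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print("Q =", modularity_slide(A, labels))
```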

47/68

Convex Relaxation

Doubly nonnegative relaxation of modularity maximization:

min ⟨−A + (1/(2λ)) d d^⊤, X⟩
s.t. X_{ii} = 1, i = 1, . . . , n,
     X ⪰ 0,
     0 ≤ X_{ij} ≤ 1

This is an SDP problem
with theoretical guarantees under certain conditions
But it is a huge-scale problem to solve

48/68

Dimensionality Reduction

Given n data points x_i ∈ R^p, assume that this dataset has intrinsic dimensionality d, where d ≪ p. How do we transform the dataset X into a new dataset Y with dimensionality d?

max  ∑_{ij} ‖y_i − y_j‖^2
s.t. ‖y_i − y_j‖^2 = ‖x_i − x_j‖^2.

Maximum variance unfolding (MVU):

max  Tr(K)
s.t. K_{ii} + K_{jj} − 2 K_{ij} = ‖x_i − x_j‖^2
     ∑_{ij} K_{ij} = 0
     K ⪰ 0,

which is an SDP problem.

49/68

Support Vector Machine (SVM)

Suppose we have two-class discrimination data. We assign the first class the label 1 and the second the label −1. A powerful discrimination method is the support vector machine (SVM). With y_i = 1 (the first class), let data point i be given by a_i ∈ R^d, i = 1, . . . , n_1, and with y_j = −1 (the second class), let data point j be given by b_j ∈ R^d, j = 1, . . . , n_2. We wish to find a hyperplane in R^d separating the a_i's from the b_j's. Mathematically, we wish to find ω ∈ R^d and β ∈ R such that

a_i^⊤ ω + β > 1, ∀ i

and

b_j^⊤ ω + β < −1, ∀ j,

where {x : ω^⊤ x + β = 0} is the desired separating hyperplane. This is a linear program!

50/68

Support Vector Machine (SVM)

When the data is noisy, we solve

min  ∑_i |π_i| + ∑_j |σ_j|
s.t. a_i^⊤ ω + β ≥ 1 − π_i, ∀ i
     b_j^⊤ ω + β ≤ −1 + σ_j, ∀ j
     π, σ ≥ 0

which is a linear program, or

min  ω^⊤ ω + µ ( ∑_i π_i^2 + ∑_j σ_j^2 )
s.t. a_i^⊤ ω + β ≥ 1 − π_i, ∀ i
     b_j^⊤ ω + β ≤ −1 + σ_j, ∀ j
     π, σ ≥ 0

which is a quadratic program.
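A hedged cvxpy sketch of the quadratic program above on made-up 2-D Gaussian data (assuming cvxpy is available); a_i and b_j are the two classes and µ is chosen arbitrarily.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 2)) + np.array([2.0, 2.0])    # class +1 points a_i
B = rng.standard_normal((40, 2)) + np.array([-2.0, -2.0])  # class -1 points b_j
mu = 1.0

w = cp.Variable(2)
beta = cp.Variable()
pi = cp.Variable(40, nonneg=True)
sigma = cp.Variable(40, nonneg=True)
objective = cp.Minimize(cp.sum_squares(w) + mu * (cp.sum_squares(pi) + cp.sum_squares(sigma)))
constraints = [A @ w + beta >= 1 - pi,       # a_i^T w + beta >= 1 - pi_i
               B @ w + beta <= -1 + sigma]   # b_j^T w + beta <= -1 + sigma_j
cp.Problem(objective, constraints).solve()

pred_a = np.sign(A @ w.value + beta.value)
pred_b = np.sign(B @ w.value + beta.value)
print("training accuracy:", (np.sum(pred_a == 1) + np.sum(pred_b == -1)) / 80)
```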

51/68

Supply chain

52/68

Supply chain

53/68

Facility location

54/68

Medical imaging

55/68

Medical imaging

56/68

Medical imaging

57/68

Phase Retrieval

Phase carries more information than magnitude

(Figure panels)  Y,  |F(Y)|,  phase(F(Y)),  iF(|F(Y)| .* phase(F(S)))
                 S,  |F(S)|,  phase(F(S)),  iF(|F(S)| .* phase(F(Y)))

Question: can we recover a signal without knowing its phase?

58/68

Classical Phase Retrieval

Feasibility problem

find x ∈ S ∩ M   or   find x ∈ S_+ ∩ M

given Fourier magnitudes:

M := { x(r) : |x̂(ω)| = b(ω) },  where x̂(ω) = F(x(r)),  F: Fourier transform

given a support estimate:

S := { x(r) : x(r) = 0 for r ∉ D }

or

S_+ := { x(r) : x(r) ≥ 0 and x(r) = 0 if r ∉ D }

59/68

Ptychographic Phase Retrieval (Thanks: Chao Yang)

Given b_i = |F(Q_i ψ)| for i = 1, . . . , k, can we recover ψ?

Ptychographic imaging, along with advances in detectors and computing, has resulted in X-ray microscopes with increased spatial resolution without the need for lenses.

60/68

Recent Phase Retrieval Model Problems

Given A ∈ C^{m×n} and b ∈ R^m,

find x,  s.t. |Ax| = b.

(Candes et al. 2011b, Alexandre d'Aspremont 2013)

SDP relaxation: |Ax|^2 is a linear function of X = x x^*

min_X  Tr(X)
s.t.   Tr(a_i a_i^* X) = b_i^2,  i = 1, . . . , m,
       X ⪰ 0

Exact recovery conditions
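A hedged cvxpy sketch of this SDP relaxation on a small real-valued toy instance (the slide's model allows complex A as well); the data sizes and random measurements are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 10, 60
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = np.abs(A @ x_true)                      # magnitude-only measurements

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0] + [A[i] @ X @ A[i] == b[i] ** 2 for i in range(m)]
prob = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
prob.solve()

# extract a rank-one factor; the global sign of x is not identifiable from |Ax|
eigval, eigvec = np.linalg.eigh(X.value)
x_hat = np.sqrt(max(eigval[-1], 0)) * eigvec[:, -1]
err = min(np.linalg.norm(x_hat - x_true), np.linalg.norm(x_hat + x_true))
print("recovery error up to sign:", err)
```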

61/68

Single Particle Cryo-Electron Microscopy

Projection images P_i(x, y) = ∫_{−∞}^{∞} φ(x R_i^1 + y R_i^2 + z R_i^3) dz + "noise".

φ : R^3 → R is the electric potential of the molecule.
Cryo-EM problem: find φ and R_1, . . . , R_n given P_1, . . . , P_n.

62/68

A Toy Example

63/68

Single Particle Cryo-Electron Microscopy

Least squares approach

min_{R_1,...,R_K ∈ SO(3)}  ∑_{i≠j} w_{ij} ‖ R_i (c̃_{ij}, 0)^⊤ − R_j (c̃_{ji}, 0)^⊤ ‖^2

Since ‖R_i (c̃_{ij}, 0)^⊤‖ = ‖R_j (c̃_{ji}, 0)^⊤‖ = 1, we obtain

max_{R_1,...,R_K ∈ SO(3)}  ∑_{i≠j} w_{ij} ⟨ R_i (c̃_{ij}, 0)^⊤, R_j (c̃_{ji}, 0)^⊤ ⟩

SDP relaxation:

max_{G ∈ R^{2K×2K}}  trace((W ∘ S) G)    (∘: elementwise product)
s.t. G_{ii} = I_2,  i = 1, 2, · · · , K,
     G ⪰ 0

64/68

Randomized Linear Algebra Algorithms

(Thanks: Petros Drineas and Michael W. Mahoney)

Goal: to develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large (and later not so large!) matrices and tensors.

Matrix multiplication
Computation of the singular value decomposition
Computation of the CUR decomposition
Testing feasibility of linear programs
Least squares approximation
Tensor computations: SVD generalizations
Tensor computations: CUR generalization

Such computations generally require time which is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n × n matrices.
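To illustrate the flavor of these algorithms, here is a minimal numpy sketch of Monte Carlo matrix multiplication in the Drineas-Kannan-Mahoney style (column/row sampling with norm-proportional probabilities); the sizes and sample count are made-up.

```python
import numpy as np

def approx_matmul(A, B, c, seed=None):
    # sample c column/row pairs with probability proportional to ||A_i|| * ||B^i||,
    # rescale each rank-one term by 1/(c * p_i), and sum
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=c, p=p)
    return sum(np.outer(A[:, i], B[i, :]) / (c * p[i]) for i in idx)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 500))
B = rng.standard_normal((500, 80))
approx = approx_matmul(A, B, c=200, seed=1)
exact = A @ B
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```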

65/68

CPU vs. GPU

March 1, 2015:
Intel Xeon Processor E7-8880 v2, 15 cores, $5729.00
Intel Xeon Phi Coprocessor 7120A, 61 cores, $4235.00
Tesla K80, 4992 (2496 per GPU) cores, $4959.99

66/68

Super Computer

http://www.top500.org/lists/2014/11/

Top 1: Tianhe-2 (MilkyWay-2): TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P, NUDT, 3,120,000 cores

#24: Edison is NERSC's newest supercomputer, a Cray XC30, with a peak performance of 2.57 petaflops/sec, 133,824 compute cores, 357 terabytes of memory, and 7.56 petabytes of disk.

67/68

Optimization Formulation

Mathematical optimization problem

min  f(x)
s.t. c_i(x) = 0, i ∈ E
     c_i(x) ≥ 0, i ∈ I

x = (x_1, . . . , x_n)^⊤: variables
f(x) : R^n → R: objective function
c_i(x) : R^n → R: constraints
optimal solution x*: a feasible point with the smallest value of f

68/68

Classification

Continuous versus discrete optimization
Unconstrained versus constrained optimization
Global and local optimization
Stochastic and deterministic optimization
Linear/nonlinear/quadratic programming, convex/nonconvex optimization
Least squares problems, equation solving
Sparse optimization, PDE-constrained optimization, robust optimization
