
Page 1: Linear Convergent Decentralized Optimization with Compression

Linear Convergent Decentralized Optimization with Compression

Xiaorui Liu (http://cse.msu.edu/~xiaorui/)

Joint work with Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan

Data Science and Engineering Lab, Department of Computer Science and Engineering

Michigan State University

ICLR 2021, May 6th


Page 2: Linear Convergent Decentralized Optimization with Compression

Introduction

Problem

x^* := \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \Big]

f_i(·) is the local objective of agent i.

[Figure: centralized vs. decentralized communication topologies]
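To make the problem concrete, here is a minimal sketch (an assumed example, not taken from the slides) in which each agent holds a hypothetical local least-squares objective; the names A, b, local_grad, and global_obj are illustrative.

```python
import numpy as np

# Assumed example: each agent i holds a hypothetical local objective
#   f_i(x) = 1/(2m) * ||A_i x - b_i||^2,
# and the global objective is their average f(x) = (1/n) * sum_i f_i(x).
rng = np.random.default_rng(0)
n, d, m = 5, 10, 20                       # agents, dimension, samples per agent
A = [rng.standard_normal((m, d)) for _ in range(n)]
b = [rng.standard_normal(m) for _ in range(n)]

def local_grad(i, x):
    """Gradient of the local objective f_i at x."""
    return A[i].T @ (A[i] @ x - b[i]) / m

def global_obj(x):
    """Global objective f(x) = (1/n) * sum_i f_i(x)."""
    return np.mean([0.5 * np.linalg.norm(A[i] @ x - b[i]) ** 2 / m
                    for i in range(n)])
```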


Page 3: Linear Convergent Decentralized Optimization with Compression

Introduction

Matrix notations

X^k = \begin{bmatrix} (x_1^k)^\top \\ \vdots \\ (x_n^k)^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad \nabla F(X^k) = \begin{bmatrix} (\nabla f_1(x_1^k))^\top \\ \vdots \\ (\nabla f_n(x_n^k))^\top \end{bmatrix} \in \mathbb{R}^{n \times d}.

Symmetric W ∈ R^{n×n} encodes the communication network:

WX = X \iff x_1 = x_2 = \cdots = x_n,

-1 < \lambda_n(W) \le \lambda_{n-1}(W) \le \cdots \le \lambda_2(W) < \lambda_1(W) = 1.
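As an illustration of such a mixing matrix (an assumed setup; the slides do not fix a topology), the sketch below builds Metropolis-type weights for a ring of n agents and checks the stated properties.

```python
import numpy as np

# Assumed example: a symmetric mixing matrix W for a ring of n agents,
# built with Metropolis-type weights, satisfying the spectral properties above.
def ring_mixing_matrix(n):
    W = np.zeros((n, n))
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):   # the two ring neighbors
            W[i, j] = 1.0 / 3.0                # 1 / (max degree + 1)
        W[i, i] = 1.0 - W[i].sum()             # rows sum to 1
    return W

W = ring_mixing_matrix(6)
lam = np.sort(np.linalg.eigvalsh(W))
assert np.allclose(W, W.T)                       # symmetric
assert np.allclose(W @ np.ones(6), np.ones(6))   # W1 = 1, so WX = X on consensus
assert np.isclose(lam[-1], 1.0) and lam[-2] < 1.0 and lam[0] > -1.0
```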


Page 4: Linear Convergent Decentralized Optimization with Compression

Introduction

Communication Compression for decentralized optimization

DCD-SGD, ECD-SGD [TGZ+18]
QDGD, QuanTimed-DSGD [RMHP19, RTM+19]
DeepSqueeze [TLQ+19]
CHOCO-SGD [KSJ19]
. . .

They reduce to DGD-type algorithms, which suffer from a convergence bias:

X^* \neq W X^* - \eta \nabla F(X^*).

Their convergence degrades on heterogeneous data.

LEAD is the first primal-dual decentralized optimization algorithm with compression and attains linear convergence.


Page 5: Linear Convergent Decentralized Optimization with Compression

Algorithm: LEAD

NIDS [LSY19] / D2 [TLY+18] (stochastic version of NIDS)

X^{k+1} = \frac{I + W}{2} \Big( 2X^k - X^{k-1} - \eta \nabla F(X^k; \xi^k) + \eta \nabla F(X^{k-1}; \xi^{k-1}) \Big),

A two-step reformulation [LY19]:

D^{k+1} = D^k + \frac{I - W}{2\eta} \big( X^k - \eta \nabla F(X^k) - \eta D^k \big),

X^{k+1} = X^k - \eta \nabla F(X^k) - \eta D^{k+1},

Concise and conceptual form of LEAD:

Y^k = X^k - \eta \nabla F(X^k; \xi^k) - \eta D^k

\hat{Y}^k = \text{CompressionProcedure}(Y^k)

D^{k+1} = D^k + \frac{\gamma}{2\eta} (I - W) \hat{Y}^k

X^{k+1} = X^k - \eta \nabla F(X^k; \xi^k) - \eta D^{k+1}
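The following schematic sketch implements one iteration of the concise form above under simplifying assumptions: full gradients (σ = 0), illustrative step sizes η and γ, and a pass-through in place of the compression procedure.

```python
import numpy as np

# Schematic sketch of one LEAD iteration in matrix form (assumptions: full
# gradients, illustrative eta/gamma, and a pass-through "compression" that
# returns Y exactly).
def lead_step(X, D, W, grad_F, eta=0.1, gamma=1.0,
              compression_procedure=lambda Y: Y):
    n = W.shape[0]
    G = grad_F(X)                                             # ∇F(X^k)
    Y = X - eta * G - eta * D                                 # Y^k
    Y_hat = compression_procedure(Y)                          # \hat{Y}^k
    D_new = D + gamma / (2 * eta) * (np.eye(n) - W) @ Y_hat   # D^{k+1}
    X_new = X - eta * G - eta * D_new                         # X^{k+1}
    return X_new, D_new
```

Here grad_F maps the stacked iterate X ∈ R^{n×d} to the stacked local gradients ∇F(X); with γ = 1 and exact (pass-through) compression, the update coincides with the two-step NIDS reformulation above.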


Page 6: Linear Convergent Decentralized Optimization with Compression

Algorithm: LEAD

LEAD

Y^k = X^k - \eta \nabla F(X^k; \xi^k) - \eta D^k

\hat{Y}^k = \text{CompressionProcedure}(Y^k)

D^{k+1} = D^k + \frac{\gamma}{2\eta} (I - W) \hat{Y}^k = D^k + \frac{\gamma}{2\eta} \big( \hat{Y}^k - \hat{Y}_w^k \big)

X^{k+1} = X^k - \eta \nabla F(X^k; \xi^k) - \eta D^{k+1}

Compression Procedure

Q^k = \text{Compress}(Y^k - H^k) ▷ compression

\hat{Y}^k = H^k + Q^k

\hat{Y}_w^k = H_w^k + W Q^k ▷ communication

H^{k+1} = (1 - \alpha) H^k + \alpha \hat{Y}^k

H_w^{k+1} = (1 - \alpha) H_w^k + \alpha \hat{Y}_w^k
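Below is a sketch of this compression procedure (the compress argument stands for any operator satisfying the compression assumption on the next slide; α and the variable names are illustrative).

```python
import numpy as np

# Sketch of the compression procedure: compress only the difference Y - H,
# communicate Q, and update the reference states H and H_w.
# `compress` is any unbiased compression operator; alpha is in (0, 1].
def compression_procedure(Y, H, H_w, W, compress, alpha=0.5):
    Q = compress(Y - H)               # Q^k              (compression)
    Y_hat = H + Q                     # \hat{Y}^k
    Y_hat_w = H_w + W @ Q             # \hat{Y}_w^k      (communication)
    H_new = (1 - alpha) * H + alpha * Y_hat
    H_w_new = (1 - alpha) * H_w + alpha * Y_hat_w
    return Y_hat, Y_hat_w, H_new, H_w_new
```

Only Q^k needs to be communicated; if H_w^0 = W H^0, the last two updates keep H_w^k = W H^k, so Ŷ_w^k = W Ŷ^k and the two forms of the D^{k+1} update above agree.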


Page 7: Linear Convergent Decentralized Optimization with Compression

Algorithm: LEAD

How does LEAD work?

Gradient Correction

X^{k+1} = X^k - \eta \big( \nabla F(X^k; \xi^k) + D^{k+1} \big)

\nabla F(X^k; \xi^k) + D^{k+1} \to 0

Difference Compression

Q^k = \text{Compress}(Y^k - H^k)

Y^k \to X^*, \ H^k \to X^* \ \Rightarrow \ Y^k - H^k \to 0 \ \Rightarrow \ \| Q^k - (Y^k - H^k) \| \to 0

Implicit Error Compensation

E^k = \hat{Y}^k - Y^k

D^{k+1} = D^k + \frac{\gamma}{2\eta} \big( \hat{Y}^k - \hat{Y}_w^k \big) = D^k + \frac{\gamma}{2\eta} (I - W) Y^k + \frac{\gamma}{2\eta} \big( E^k - W E^k \big)
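A quick numeric check of this decomposition on random data (illustrative only), using Ŷ_w^k = W Ŷ^k, which holds when H_w^k = W H^k:

```python
import numpy as np

# Numeric check of the identity above with random data, assuming
# \hat{Y}_w^k = W \hat{Y}^k (which holds when H_w^k = W H^k).
rng = np.random.default_rng(1)
n, d = 6, 4
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                       # any symmetric W suffices here
Y = rng.standard_normal((n, d))         # Y^k (before compression)
E = rng.standard_normal((n, d))         # E^k = \hat{Y}^k - Y^k
Y_hat = Y + E
lhs = Y_hat - W @ Y_hat                 # \hat{Y}^k - \hat{Y}_w^k
rhs = (np.eye(n) - W) @ Y + (E - W @ E)
assert np.allclose(lhs, rhs)            # the gamma/(2*eta) factor cancels
```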


Page 8: Linear Convergent Decentralized Optimization with Compression

Assumption

Compression: \mathbb{E}[Q(x)] = x and \mathbb{E}\|x - Q(x)\|_2^2 \le C \|x\|_2^2 for some C ≥ 0 (an example operator is sketched below, after the gradient assumption).

f_i(·) is µ-strongly convex and L-smooth:

f_i(x) \ge f_i(y) + \langle \nabla f_i(y), x - y \rangle + \frac{\mu}{2} \|x - y\|^2,

f_i(x) \le f_i(y) + \langle \nabla f_i(y), x - y \rangle + \frac{L}{2} \|x - y\|^2.

Gradient: \mathbb{E}_\xi[\nabla f_i(x; \xi)] = \nabla f_i(x) and \mathbb{E}_\xi \|\nabla f_i(x; \xi) - \nabla f_i(x)\|_2^2 \le \sigma^2.
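One operator satisfying the compression assumption is rand-k sparsification with rescaling (an illustrative choice, not necessarily the compressor used in the paper's experiments): each coordinate is kept with probability k/d and scaled by d/k, giving E[Q(x)] = x and C = d/k − 1.

```python
import numpy as np

# Illustrative unbiased compressor: rand-k sparsification with rescaling.
# Each coordinate is kept with probability k/d and scaled by d/k, so
# E[Q(x)] = x and E||x - Q(x)||^2 = (d/k - 1) ||x||^2, i.e. C = d/k - 1.
def rand_k(x, k, rng):
    d = x.shape[0]
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

# Rough empirical check of unbiasedness on a fixed vector.
rng = np.random.default_rng(2)
x = np.arange(1.0, 9.0)                                     # d = 8
samples = np.stack([rand_k(x, 2, rng) for _ in range(20000)])
assert np.allclose(samples.mean(axis=0), x, atol=0.5)
```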


Page 9: Linear Convergent Decentralized Optimization with Compression

Theory

\kappa_f = \frac{L}{\mu}, \qquad \kappa_g = \frac{\lambda_{\max}(I - W)}{\lambda_{\min}^{+}(I - W)}

Theorem (Complexity bounds when σ = 0)

LEAD converges to an ε-accurate solution with iteration complexity

O\Big( \big( (1 + C)(\kappa_f + \kappa_g) + C \kappa_f \kappa_g \big) \log \frac{1}{\varepsilon} \Big).

When C = 0 (i.e., no compression) or C ≤ (κ_f + κ_g)/(κ_f κ_g + κ_f + κ_g), the iteration complexity is

O\Big( (\kappa_f + \kappa_g) \log \frac{1}{\varepsilon} \Big).

This recovers the convergence rate of NIDS [LSY19].


Page 10: Linear Convergent Decentralized Optimization with Compression

Theory

Theorem (Complexity bounds when σ = 0)

With C = 0 (or C ≤ (κ_f + κ_g)/(κ_f κ_g + κ_f + κ_g)) and a fully connected communication graph (i.e., W = \frac{\mathbf{1}\mathbf{1}^\top}{n}), the iteration complexity is

O\Big( \kappa_f \log \frac{1}{\varepsilon} \Big).

This recovers the convergence rate of gradient descent [Nes13].

Theorem (Error bound when σ > 0)

\frac{1}{n} \sum_{i=1}^{n} \mathbb{E} \big\| x_i^k - x^* \big\|^2 \lesssim O\Big( \frac{1}{k} \Big)


Page 11: Linear Convergent Decentralized Optimization with Compression

Experiment

[Figure: ‖X^k − X^*‖_F^2 versus epochs (left) and versus bits transmitted (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits).]

Linear regression (σ = 0)


Page 12: Linear Convergent Decentralized Optimization with Compression

Experiment

[Figure: consensus error (left) and compression error (right) versus epochs for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits); compression error is shown only for the compressed methods.]

Linear regression (σ = 0)


Page 13: Linear Convergent Decentralized Optimization with Compression

Experiment

[Figure: training loss f(X^k) versus epochs (left) and versus bits transmitted (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits).]

Logistic regression (σ > 0).


Page 14: Linear Convergent Decentralized Optimization with Compression

Experiment

[Figure: training loss f(X^k) versus bits transmitted on homogeneous data (left) and heterogeneous data (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits); on heterogeneous data, QDGD*, DeepSqueeze*, and CHOCO-SGD* diverge.]

Stochastic optimization on deep learning (∗ indicates divergence).


Page 15: Linear Convergent Decentralized Optimization with Compression

Conclusion

LEAD is the first primal-dual decentralized optimization algorithm with compression, and it attains linear convergence for strongly convex and smooth objectives.

LEAD supports unbiased compression of arbitrary precision.

LEAD works well for nonconvex problems such as training deep neural networks.

LEAD is robust to parameter settings and needs little effort for parameter tuning.

Please check our paper and poster for more details.


Page 16: Linear Convergent Decentralized Optimization with Compression

[KSJ19] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi, Decentralized stochastic optimization and gossip algorithms with compressed communication, Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 3479–3487.

[LSY19] Zhi Li, Wei Shi, and Ming Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, IEEE Transactions on Signal Processing 67 (2019), no. 17, 4494–4506.

[LY19] Yao Li and Ming Yan, On linear convergence of two decentralized algorithms, arXiv preprint arXiv:1906.07225 (2019).

[Nes13] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.

[RMHP19] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani, An exact quantized decentralized gradient descent algorithm, IEEE Transactions on Signal Processing 67 (2019), no. 19, 4934–4947.


Page 17: Linear Convergent Decentralized Optimization with Compression

[RTM+19] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani, Robust and communication-efficient collaborative learning, Advances in Neural Information Processing Systems, 2019, pp. 8388–8399.

[TGZ+18] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu, Communication compression for decentralized training, Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.

[TLQ+19] Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu, DeepSqueeze: Decentralization meets error-compensated compression, CoRR abs/1907.07346 (2019).

[TLY+18] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu, D²: Decentralized training over decentralized data, Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 4848–4856.
