
Page 1: Linear Convergent Decentralized Optimization with Compression

Linear Convergent Decentralized Optimization with Compression

Xiaorui Liu (http://cse.msu.edu/~xiaorui/)

Joint work with Yao Li, Rongrong Wang, Jiliang Tang, and Ming Yan

Data Science and Engineering Lab, Department of Computer Science and Engineering

Michigan State University

ICLR 2021, May 6th


Page 2: Linear Convergent Decentralized Optimization with Compression

Introduction

Problem

x^* := \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \Big]

f_i(·) is the local objective of agent i.

[Figure: centralized vs. decentralized communication topologies]
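To make the problem concrete, here is a minimal sketch (an assumed example, not taken from the slides) in which each agent holds a hypothetical local least-squares objective; the names A, b, local_grad, and global_obj are illustrative.

```python
import numpy as np

# Assumed example: each agent i holds a hypothetical local objective
#   f_i(x) = 1/(2m) * ||A_i x - b_i||^2,
# and the global objective is their average f(x) = (1/n) * sum_i f_i(x).
rng = np.random.default_rng(0)
n, d, m = 5, 10, 20                       # agents, dimension, samples per agent
A = [rng.standard_normal((m, d)) for _ in range(n)]
b = [rng.standard_normal(m) for _ in range(n)]

def local_grad(i, x):
    """Gradient of the local objective f_i at x."""
    return A[i].T @ (A[i] @ x - b[i]) / m

def global_obj(x):
    """Global objective f(x) = (1/n) * sum_i f_i(x)."""
    return np.mean([0.5 * np.linalg.norm(A[i] @ x - b[i]) ** 2 / m
                    for i in range(n)])
```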


Page 3: Linear Convergent Decentralized Optimization with Compression

Introduction

Matrix notations

X^k = \begin{bmatrix} (x_1^k)^\top \\ \vdots \\ (x_n^k)^\top \end{bmatrix} \in \mathbb{R}^{n \times d}, \qquad \nabla F(X^k) = \begin{bmatrix} (\nabla f_1(x_1^k))^\top \\ \vdots \\ (\nabla f_n(x_n^k))^\top \end{bmatrix} \in \mathbb{R}^{n \times d}.

Symmetric W ∈ R^{n×n} encodes the communication network:

WX = X \iff x_1 = x_2 = \cdots = x_n,

-1 < \lambda_n(W) \le \lambda_{n-1}(W) \le \cdots \le \lambda_2(W) < \lambda_1(W) = 1.
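As an illustration of such a mixing matrix (an assumed setup; the slides do not fix a topology), the sketch below builds Metropolis-type weights for a ring of n agents and checks the stated properties.

```python
import numpy as np

# Assumed example: a symmetric mixing matrix W for a ring of n agents,
# built with Metropolis-type weights, satisfying the spectral properties above.
def ring_mixing_matrix(n):
    W = np.zeros((n, n))
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):   # the two ring neighbors
            W[i, j] = 1.0 / 3.0                # 1 / (max degree + 1)
        W[i, i] = 1.0 - W[i].sum()             # rows sum to 1
    return W

W = ring_mixing_matrix(6)
lam = np.sort(np.linalg.eigvalsh(W))
assert np.allclose(W, W.T)                       # symmetric
assert np.allclose(W @ np.ones(6), np.ones(6))   # W1 = 1, so WX = X on consensus
assert np.isclose(lam[-1], 1.0) and lam[-2] < 1.0 and lam[0] > -1.0
```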


Page 4: Linear Convergent Decentralized Optimization with Compression

Introduction

Communication Compression for decentralized optimization

DCD-SGD, ECD-SGD [TGZ+18]
QDGD, QuanTimed-DSGD [RMHP19, RTM+19]
DeepSqueeze [TLQ+19]
CHOCO-SGD [KSJ19]
. . .

They reduce to DGD-type algorithms, which suffer from a convergence bias:

X^* \neq W X^* - \eta \nabla F(X^*).

Their convergence degrades on heterogeneous data.

LEAD is the first primal-dual decentralized optimization algorithm with compression and attains linear convergence.


Page 5: Linear Convergent Decentralized Optimization with Compression

Algorithm: LEAD

NIDS [LSY19] / D2 [TLY+18] (stochastic version of NIDS)

X^{k+1} = \frac{I + W}{2} \Big( 2X^k - X^{k-1} - \eta \nabla F(X^k; \xi^k) + \eta \nabla F(X^{k-1}; \xi^{k-1}) \Big),

A two-step reformulation [LY19]:

D^{k+1} = D^k + \frac{I - W}{2\eta} \big( X^k - \eta \nabla F(X^k) - \eta D^k \big),

X^{k+1} = X^k - \eta \nabla F(X^k) - \eta D^{k+1},

Concise and conceptual form of LEAD:

Y^k = X^k - \eta \nabla F(X^k; \xi^k) - \eta D^k

\hat{Y}^k = \text{CompressionProcedure}(Y^k)

D^{k+1} = D^k + \frac{\gamma}{2\eta} (I - W) \hat{Y}^k

X^{k+1} = X^k - \eta \nabla F(X^k; \xi^k) - \eta D^{k+1}
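The following schematic sketch implements one iteration of the concise form above under simplifying assumptions: full gradients (σ = 0), illustrative step sizes η and γ, and a pass-through in place of the compression procedure.

```python
import numpy as np

# Schematic sketch of one LEAD iteration in matrix form (assumptions: full
# gradients, illustrative eta/gamma, and a pass-through "compression" that
# returns Y exactly).
def lead_step(X, D, W, grad_F, eta=0.1, gamma=1.0,
              compression_procedure=lambda Y: Y):
    n = W.shape[0]
    G = grad_F(X)                                             # ∇F(X^k)
    Y = X - eta * G - eta * D                                 # Y^k
    Y_hat = compression_procedure(Y)                          # \hat{Y}^k
    D_new = D + gamma / (2 * eta) * (np.eye(n) - W) @ Y_hat   # D^{k+1}
    X_new = X - eta * G - eta * D_new                         # X^{k+1}
    return X_new, D_new
```

Here grad_F maps the stacked iterate X ∈ R^{n×d} to the stacked local gradients ∇F(X); with γ = 1 and exact (pass-through) compression, the update coincides with the two-step NIDS reformulation above.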


Page 6: Linear Convergent Decentralized Optimization with Compression

Algorithm: LEAD

LEAD

Y^k = X^k - \eta \nabla F(X^k; \xi^k) - \eta D^k

\hat{Y}^k = \text{CompressionProcedure}(Y^k)

D^{k+1} = D^k + \frac{\gamma}{2\eta} (I - W) \hat{Y}^k = D^k + \frac{\gamma}{2\eta} \big( \hat{Y}^k - \hat{Y}_w^k \big)

X^{k+1} = X^k - \eta \nabla F(X^k; \xi^k) - \eta D^{k+1}

Compression Procedure

Q^k = \text{Compress}(Y^k - H^k) ▷ compression

\hat{Y}^k = H^k + Q^k

\hat{Y}_w^k = H_w^k + W Q^k ▷ communication

H^{k+1} = (1 - \alpha) H^k + \alpha \hat{Y}^k

H_w^{k+1} = (1 - \alpha) H_w^k + \alpha \hat{Y}_w^k
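Below is a sketch of this compression procedure (the compress argument stands for any operator satisfying the compression assumption on the next slide; α and the variable names are illustrative).

```python
import numpy as np

# Sketch of the compression procedure: compress only the difference Y - H,
# communicate Q, and update the reference states H and H_w.
# `compress` is any unbiased compression operator; alpha is in (0, 1].
def compression_procedure(Y, H, H_w, W, compress, alpha=0.5):
    Q = compress(Y - H)               # Q^k              (compression)
    Y_hat = H + Q                     # \hat{Y}^k
    Y_hat_w = H_w + W @ Q             # \hat{Y}_w^k      (communication)
    H_new = (1 - alpha) * H + alpha * Y_hat
    H_w_new = (1 - alpha) * H_w + alpha * Y_hat_w
    return Y_hat, Y_hat_w, H_new, H_w_new
```

Only Q^k needs to be communicated; if H_w^0 = W H^0, the last two updates keep H_w^k = W H^k, so Ŷ_w^k = W Ŷ^k and the two forms of the D^{k+1} update above agree.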


Page 7: Linear Convergent Decentralized Optimization with Compression

Algorithm: LEAD

How does LEAD work?

Gradient Correction

X^{k+1} = X^k - \eta \big( \nabla F(X^k; \xi^k) + D^{k+1} \big)

\nabla F(X^k; \xi^k) + D^{k+1} \to 0

Difference Compression

Q^k = \text{Compress}(Y^k - H^k)

Y^k \to X^*, \ H^k \to X^* \ \Rightarrow \ Y^k - H^k \to 0 \ \Rightarrow \ \| Q^k - (Y^k - H^k) \| \to 0

Implicit Error Compensation

E^k = \hat{Y}^k - Y^k

D^{k+1} = D^k + \frac{\gamma}{2\eta} \big( \hat{Y}^k - \hat{Y}_w^k \big) = D^k + \frac{\gamma}{2\eta} (I - W) Y^k + \frac{\gamma}{2\eta} \big( E^k - W E^k \big)
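A quick numeric check of this decomposition on random data (illustrative only), using Ŷ_w^k = W Ŷ^k, which holds when H_w^k = W H^k:

```python
import numpy as np

# Numeric check of the identity above with random data, assuming
# \hat{Y}_w^k = W \hat{Y}^k (which holds when H_w^k = W H^k).
rng = np.random.default_rng(1)
n, d = 6, 4
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                       # any symmetric W suffices here
Y = rng.standard_normal((n, d))         # Y^k (before compression)
E = rng.standard_normal((n, d))         # E^k = \hat{Y}^k - Y^k
Y_hat = Y + E
lhs = Y_hat - W @ Y_hat                 # \hat{Y}^k - \hat{Y}_w^k
rhs = (np.eye(n) - W) @ Y + (E - W @ E)
assert np.allclose(lhs, rhs)            # the gamma/(2*eta) factor cancels
```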


Page 8: Linear Convergent Decentralized Optimization with Compression

Assumption

Compression: \mathbb{E}[Q(x)] = x and \mathbb{E}\|x - Q(x)\|_2^2 \le C \|x\|_2^2 for some C ≥ 0 (an example operator is sketched below, after the gradient assumption).

f_i(·) is µ-strongly convex and L-smooth:

f_i(x) \ge f_i(y) + \langle \nabla f_i(y), x - y \rangle + \frac{\mu}{2} \|x - y\|^2,

f_i(x) \le f_i(y) + \langle \nabla f_i(y), x - y \rangle + \frac{L}{2} \|x - y\|^2.

Gradient: \mathbb{E}_\xi[\nabla f_i(x; \xi)] = \nabla f_i(x) and \mathbb{E}_\xi \|\nabla f_i(x; \xi) - \nabla f_i(x)\|_2^2 \le \sigma^2.
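One operator satisfying the compression assumption is rand-k sparsification with rescaling (an illustrative choice, not necessarily the compressor used in the paper's experiments): each coordinate is kept with probability k/d and scaled by d/k, giving E[Q(x)] = x and C = d/k − 1.

```python
import numpy as np

# Illustrative unbiased compressor: rand-k sparsification with rescaling.
# Each coordinate is kept with probability k/d and scaled by d/k, so
# E[Q(x)] = x and E||x - Q(x)||^2 = (d/k - 1) ||x||^2, i.e. C = d/k - 1.
def rand_k(x, k, rng):
    d = x.shape[0]
    mask = np.zeros(d)
    mask[rng.choice(d, size=k, replace=False)] = 1.0
    return (d / k) * mask * x

# Rough empirical check of unbiasedness on a fixed vector.
rng = np.random.default_rng(2)
x = np.arange(1.0, 9.0)                                     # d = 8
samples = np.stack([rand_k(x, 2, rng) for _ in range(20000)])
assert np.allclose(samples.mean(axis=0), x, atol=0.5)
```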


Page 9: Linear Convergent Decentralized Optimization with Compression

Theory

\kappa_f = \frac{L}{\mu}, \qquad \kappa_g = \frac{\lambda_{\max}(I - W)}{\lambda_{\min}^{+}(I - W)}

Theorem (Complexity bounds when σ = 0)

LEAD converges to an ε-accurate solution with iteration complexity

O\Big( \big( (1 + C)(\kappa_f + \kappa_g) + C \kappa_f \kappa_g \big) \log \frac{1}{\varepsilon} \Big).

When C = 0 (i.e., no compression) or C ≤ (κ_f + κ_g)/(κ_f κ_g + κ_f + κ_g), the iteration complexity is

O\Big( (\kappa_f + \kappa_g) \log \frac{1}{\varepsilon} \Big).

This recovers the convergence rate of NIDS [LSY19].


Page 10: Linear Convergent Decentralized Optimization with Compression

Theory

Theorem (Complexity bounds when σ = 0)

With C = 0 (or C ≤ (κ_f + κ_g)/(κ_f κ_g + κ_f + κ_g)) and a fully connected communication graph (i.e., W = \frac{\mathbf{1}\mathbf{1}^\top}{n}), the iteration complexity is

O\Big( \kappa_f \log \frac{1}{\varepsilon} \Big).

This recovers the convergence rate of gradient descent [Nes13].

Theorem (Error bound when σ > 0)

\frac{1}{n} \sum_{i=1}^{n} \mathbb{E} \big\| x_i^k - x^* \big\|^2 \lesssim O\Big( \frac{1}{k} \Big)


Page 11: Linear Convergent Decentralized Optimization with Compression

Experiment

[Figure: ‖X^k − X^*‖_F^2 versus epochs (left) and versus bits transmitted (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits).]

Linear regression (σ = 0)


Page 12: Linear Convergent Decentralized Optimization with Compression

Experiment

[Figure: consensus error (left) and compression error (right) versus epochs for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits); compression error is shown only for the compressed methods.]

Linear regression (σ = 0)


Page 13: Linear Convergent Decentralized Optimization with Compression

Experiment

[Figure: training loss f(X^k) versus epochs (left) and versus bits transmitted (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits).]

Logistic regression (σ > 0).


Page 14: Linear Convergent Decentralized Optimization with Compression

Experiment

[Figure: training loss f(X^k) versus bits transmitted on homogeneous data (left) and heterogeneous data (right) for DGD (32 bits), NIDS (32 bits), QDGD (2 bits), DeepSqueeze (2 bits), CHOCO-SGD (2 bits), and LEAD (2 bits); on heterogeneous data, QDGD*, DeepSqueeze*, and CHOCO-SGD* diverge.]

Stochastic optimization on deep learning (∗ indicates divergence).


Page 15: Linear Convergent Decentralized Optimization with Compression

Conclusion

LEAD is the first primal-dual decentralized optimization algorithm with compression, and it attains linear convergence for strongly convex and smooth objectives.

LEAD supports unbiased compression of arbitrary precision.

LEAD works well for nonconvex problems such as training deep neural networks.

LEAD is robust to parameter settings and needs little effort for parameter tuning.

Please check our paper and poster for more details.


Page 16: Linear Convergent Decentralized Optimization with Compression

[KSJ19] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi, Decentralized stochastic optimization and gossip algorithms with compressed communication, Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019, pp. 3479–3487.

[LSY19] Zhi Li, Wei Shi, and Ming Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, IEEE Transactions on Signal Processing 67 (2019), no. 17, 4494–4506.

[LY19] Yao Li and Ming Yan, On linear convergence of two decentralized algorithms, arXiv preprint arXiv:1906.07225 (2019).

[Nes13] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.

[RMHP19] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani, An exact quantized decentralized gradient descent algorithm, IEEE Transactions on Signal Processing 67 (2019), no. 19, 4934–4947.


Page 17: Linear Convergent Decentralized Optimization with Compression

[RTM+19] Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani, Robust and communication-efficient collaborative learning, Advances in Neural Information Processing Systems, 2019, pp. 8388–8399.

[TGZ+18] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu, Communication compression for decentralized training, Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.

[TLQ+19] Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu, DeepSqueeze: Decentralization meets error-compensated compression, CoRR abs/1907.07346 (2019).

[TLY+18] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu, D²: Decentralized training over decentralized data, Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 4848–4856.
