Jenny Seidenschwarz
Technische Universität München
Seminar Course AutoML
Munich, 4th of July 2019
Searching for Activation Functions
Lehrstuhl für Musterverfahren · Fakultät für Mustertechnik · Technische Universität München
Jenny Seidenschwarz (TUM) | Seminar Course AutoML
Activation Functions
• Gradient-preserving property
• Easier to optimize
State-of-the-art default scalar activation function: ReLU, f(x) = max(0, x)
Figure: ReLU activation function
Figure: Basic structure of a neural network
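The two bullet points above can be made concrete with a minimal NumPy sketch (illustrative, not from the slides): ReLU passes gradients through unchanged wherever the input is positive.

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x) -- the default scalar activation function."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, 0 for x < 0.
    The constant-1 region is the gradient-preserving property
    that makes deep networks easier to optimize."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(x))       # zeros for negative inputs, identity for positive
print(relu_grad(x))  # gradient is exactly 1 on the positive side
```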
Research Goal
Find new scalar activation functions …
… using automated search techniques …
… compare them systematically to existing activation functions …
… across multiple challenging datasets!
Automated Search
Challenge: balance size and expressivity of search space
→ Simple binary expression tree [1]
→ Selection of unary and binary functions
Search Space
Unary: x, −x, |x|, x², x³, √x, βx, x + β, log(|x| + ε), exp(x), sin(x), cos(x), sinh(x), cosh(x), tanh(x), tan⁻¹(x), sinh⁻¹(x), sinc(x), max(0, x), min(0, x), σ(x), log(1 + exp(x)), exp(−x²), erf(x), β
Binary: x₁ + x₂, x₁ · x₂, x₁ − x₂, x₁ / (x₂ + ε), max(x₁, x₂), min(x₁, x₂), σ(x₁) · x₂, exp(−β(x₁ − x₂)²), exp(−β|x₁ − x₂|), βx₁ + (1 − β)x₂
Figure: Core Unit, adapted from [5]
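A core unit combines two unary transformations of the input with one binary function: f(x) = binary(unary1(x), unary2(x)). A minimal sketch with a small illustrative subset of the candidate sets above (function names are my own labels):

```python
import math

# Small subset of the unary/binary candidate sets of the search space
unaries = {
    "identity": lambda x: x,
    "neg": lambda x: -x,
    "tanh": math.tanh,
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
}
binaries = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "max": max,
}

def core_unit(u1, u2, b):
    """Build an activation f(x) = binary(unary1(x), unary2(x))."""
    return lambda x: binaries[b](unaries[u1](x), unaries[u2](x))

# Example: f(x) = x * sigmoid(x), i.e. Swish with beta = 1
f = core_unit("identity", "sigmoid", "mul")
print(f(1.0))
```

Deeper candidates are built by feeding one core unit's output into another, which is how the 1-2 core unit functions discussed later arise.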
Search Approach
• Small search space → exhaustive search
• Large search space → reinforcement learning-based search
• Train child networks with the found activation functions
• Get a list of the best-performing activation functions
• Update the search algorithm and find the best candidates
• Empirical evaluation and experiments
RNN controller [2] with a domain-specific language [1]
Train a batch of generated activation functions in a child network:
• ResNet-20
• Image classification on CIFAR-10
• 10k steps
RNN-Controller
Figure: RNN-Controller architecture [5]
RNN-Controller update
RNN Controller with PPO:
→ Clipping keeps updates within a "trust region"
→ Sample-efficient
• Objective function:
  L^CLIP(θ) = E_t[ min(σ_t G_t, clip(σ_t, 1 − ε, 1 + ε) G_t) ], with σ_t = π_θ(a_t | s_t) / π_θold(a_t | s_t)
RNN Controller with REINFORCE:
• Objective function: ℒ_θc = E_{π_θ, τ}[G_t], where G_t = Σ_{k=t}^{T−1} γ^{k−t} r_{k+1}
Policy gradient methods: π(a_t | s_t, θ_c) → Δθ ⇐ α ∇ℒ_θc
Figure: clipping function, adapted from [3]
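The clipped objective can be sketched numerically. The ratio and return values below are illustrative, not from the paper; the point is that ratios outside [1 − ε, 1 + ε] stop contributing extra gradient:

```python
import numpy as np

def ppo_clip_objective(ratio, returns, eps=0.2):
    """L^CLIP = E[ min(ratio * G, clip(ratio, 1-eps, 1+eps) * G) ].
    Clipping keeps the policy update inside a 'trust region'
    around the old policy."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * returns, clipped * returns))

ratio = np.array([0.5, 1.0, 1.5])  # pi_theta / pi_theta_old per step
G = np.array([1.0, 1.0, 1.0])      # returns, e.g. child-network accuracy
print(ppo_clip_objective(ratio, G))  # the 1.5 ratio is clipped to 1.2
```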
RNN Controller with PPO:
• Objective gradient
→ G = accuracy of the child network
→ b = exponential moving average of rewards
RNN Controller with REINFORCE:
• Objective gradient
Policy gradient methods: π(a_t | s_t, θ_c) → Δθ ⇐ α ∇ℒ_θc
One child network: ∇ℒ_θc = Σ_{t=1}^{T} ∇_θc log π(a_t | s_t, θ_c) · (G − b)
Figure: clipping function, adapted from [3]
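The REINFORCE update for one child network can be sketched directly from the formula above. The per-step gradients and reward values below are illustrative placeholders:

```python
import numpy as np

def reinforce_grad(log_prob_grads, G, b):
    """Policy gradient for one child network:
    grad L = sum_t grad log pi(a_t | s_t, theta) * (G - b),
    where G is the child network's accuracy and b an exponential
    moving average of past rewards (a variance-reducing baseline)."""
    return sum(log_prob_grads) * (G - b)

# Illustrative per-step gradients of log pi, one per sampled
# unary/binary choice of the core unit:
grads = [np.array([0.1, -0.2]), np.array([0.3, 0.0])]
G, b = 0.92, 0.90  # this child's accuracy vs. the moving-average baseline
print(reinforce_grad(grads, G, b))
```

A child that beats the baseline (G > b) pushes the controller toward its choices; one below the baseline pushes away from them.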
Findings on Activation Functions
1. Activation functions with 1-2 core units perform best
2. Top activation functions always take the raw preactivation x as input to the final binary function
3. Periodic functions (sin, cos, etc.) are used by some top-performing activation functions
4. Activation functions that use division perform poorly
x · σ(βx)
x · (sinh⁻¹(x))²
min(x, sin(x))
(tan⁻¹(x))² − x
max(x, σ(x))
cos(x) − x
max(x, tanh(x))
sinc(x) + x
Figure: best 8 activation functions [5]
Experiments to ensure generalization to deeper networks
Validation of Performance
Figure: Generalization to deeper architectures of 8 best activation functions found [5]
(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy
• Nonlinearly interpolates between the linear function and ReLU (controlled by β)
• Smooth function
• Non-monotonic function
• Unbounded above and bounded below (like ReLU)
Swish: f(x) = x · σ(βx)
Figure: Swish activation function for different 𝛽 values
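The properties above can be checked with a minimal NumPy sketch of Swish (illustrative, not the TensorFlow implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish: f(x) = x * sigmoid(beta * x).
    beta -> 0 gives f(x) -> x / 2 (scaled linear);
    large beta approaches ReLU; beta can also be trained."""
    return x * sigmoid(beta * x)

x = np.linspace(-3.0, 3.0, 7)
print(swish(x, beta=1.0))
# Non-monotonic: small negative outputs appear for x < 0,
# yet the function stays bounded below and unbounded above.
print(swish(np.array([-1.0]), beta=1.0))
```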
Benchmark of Swish
Benchmarked Swish against ReLU and other baseline activation functions
• Different models
• Different challenging real world datasets
• Test with fixed β = 1 and trainable β
Further Experiments with Swish
CIFAR-10 and CIFAR-100: ResNet-164, Wide ResNet 28-10, and DenseNet 100-12
• Median of 5 runs for comparison
ImageNet: Inception-ResNet-v2, Inception-v4, Inception-v3, MobileNet and Mobile NASNet-A
• Fixed number of steps, 3 learning rates with RMSProp
• Especially good performance on mobile-sized models; slightly underperforms on Inception-v4
English-German translation: 12-layer Base Transformer
• Two different learning rates, 300K steps with Adam optimizer
Figure: Overview of performance in experiments [5]
Performance of Swish
Learnable 𝛽:
Swish – learnable parameter β
Figure: distribution of trained 𝛽 on Mobile NASNet-A [5]
Non-monotonic bump for x < 0
Swish – Challenging Current Beliefs
Figure: Preactivations for β = 1 on ResNet-32 [5]
Figure: Swish function for different β values with non-monotonic bump
The derivative is not gradient-preserving:
Swish – Challenging Current Beliefs
Figure: Derivative of the Swish function for different β values
Figure: Derivative of ReLU
Main contributions:
• Used a search space as in [1] to find activation functions with an RNN controller [2] that was updated with PPO [3]
• Systematically compared activation functions
• Found a new activation function that consistently outperforms or is on par with ReLU
Critical aspects:
• Search space restricts results
• Search space designed after human intuition
• Restriction of training steps and training on small architectures might suppress even better
activation functions
Future research:
• Use only two core units, but more unary and binary functions
• Also take non-scalar activation functions into account
Conclusion
[1] Bello, I., Zoph, B., Vasudevan, V., & Le, Q. V. (2017). Neural Optimizer Search with Reinforcement Learning.
[2] Zoph, B., & Le, Q. V. (2017). Neural Architecture Search with Reinforcement Learning. arXiv:1611.01578.
[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
[4] Elfwing, S., Uchibe, E., & Doya, K. (2018). Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Networks, 107. doi:10.1016/j.neunet.2017.12.012.
[5] Ramachandran, P., Zoph, B., & Le, Q. V. (2018). Searching for Activation Functions. arXiv:1710.05941.
References
Back-up
If you want to use Swish:
• Already implemented in TensorFlow as tf.nn.swish(x)
• When using batch norm: set the scale parameter
• Derivative of Swish: f′(x) = β f(x) + σ(βx)(1 − β f(x))
Things to note
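The stated derivative f′(x) = β f(x) + σ(βx)(1 − β f(x)) can be checked numerically against a central finite-difference approximation (a pure-Python sketch, independent of TensorFlow):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    # f'(x) = beta * f(x) + sigmoid(beta * x) * (1 - beta * f(x))
    f = swish(x, beta)
    return beta * f + sigmoid(beta * x) * (1.0 - beta * f)

# Central finite-difference check at a few points (beta = 1)
h = 1e-6
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    numeric = (swish(x + h) - swish(x - h)) / (2.0 * h)
    assert abs(numeric - swish_grad(x)) < 1e-5
print("derivative formula verified")
```

Note that f′(0) = σ(0) = 0.5 for every β, so unlike ReLU the gradient is not exactly 1 over a whole region, matching the "not gradient-preserving" point above.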
Experiment Results - CIFAR
Figure: Benchmark experiments of Swish function to baseline functions on CIFAR [5]
(a) CIFAR-10 accuracy (b) CIFAR-100 accuracy
Experiments on ImageNet
Figure: Benchmark experiments of Swish function to baseline functions on ImageNet [5]
(a) Training curves of Mobile NASNet-A on ImageNet. Best viewed in color.
(b) Mobile NASNet-A on ImageNet, with 3 different runs ordered by top-1 accuracy. The additional 2 GELU experiments are still training at the time of submission.
Experiments on ImageNet
Figure: Benchmark experiments of Swish function to baseline functions on ImageNet [5]
(a) Inception-ResNet-v2 on ImageNet with 3 different runs. Note that the ELU sometimes has instabilities at the start of training, which accounts for the first result.
(b) MobileNet on ImageNet.
Experiments on ImageNet
Figure: Benchmark experiments of Swish function to baseline functions on ImageNet [5]
(a) Inception-v3 on ImageNet (b) Inception-v4 on ImageNet
Experiments on Machine Translation
Figure: Benchmark experiments of the Swish function against baseline functions on WMT English→German (BLEU score) [5]