

Master Thesis

Adaptive Bayesian Optimization

for Organic Material Screening

유기소재 스크리닝을 위한 적응적 베이지안 최적화

February 2016

Seoul National University

The Graduate School

Interdisciplinary Program in Neuroscience

Sangwoong Yoon


Adaptive Bayesian Optimization

for Organic Material Screening

유기소재 스크리닝을 위한 적응적 베이지안 최적화

Advisor: Professor 장병탁 (Byoung-Tak Zhang)

Submitted as a thesis for the degree of Master of Science

December 2015

The Graduate School, Seoul National University

Interdisciplinary Program in Neuroscience

윤상웅 (Sangwoong Yoon)

Confirming the Master of Science thesis of Sangwoong Yoon

December 2015

Committee Chair: 김건희 (seal)

Vice Chair: 장병탁 (seal)

Member: 이상훈 (seal)


Abstract

Adaptive Bayesian Optimization for Organic Material Screening

Sangwoong Yoon

Interdisciplinary Program in Neuroscience

The Graduate School

Seoul National University

Bayesian optimization (BO) is an efficient black-box optimization method which utilizes the power of statistical models built upon previously searched points. The efficacy of BO largely depends on the choice of the statistical model, but it is usually difficult to determine beforehand which model would yield the best optimization performance for a given task. This thesis investigates a modified problem setting for BO in which multiple candidate surrogate functions are available, and experiments with two novel strategies based on multi-armed bandit algorithms. The proposed strategies attempt to discriminate among the candidate models and are therefore referred to as adaptive BO strategies. The strategies are tested on optimization test-bed functions and on the chemical screening scheduling problem, where the issue of selecting a surrogate function becomes particularly salient. Surprisingly, it is discovered that the baseline strategy, which blends multiple candidate functions uniformly at random, achieves non-trivial performance. The results presented in this thesis show that relaxing the restriction on the number of surrogate functions in BO yields interesting dynamics.

Keywords: Bayesian Optimization, Multi-Armed Bandit, Gaussian Process,

Chemoinformatics

Student Number: 2014-21320


Contents

Abstract
Contents
List of Figures
Chapter 1  Introduction
Chapter 2  Preliminaries
  2.1  Bayesian Optimization
  2.2  Multi-Armed Bandit
Chapter 3  Organic Material Screening: The Motivational Application
  3.1  Beyond Structure-Property Relationship
  3.2  Dataset: Electronic Properties of Organic Molecules
Chapter 4  Bayesian Optimization with Multiple Surrogate Functions
  4.1  Problem Formulation
  4.2  Baseline: Random Arm
  4.3  Proposed Strategies
Chapter 5  Experiments
  5.1  Benchmark Functions
  5.2  Screening over Organic Molecules
Chapter 6  Discussion and Conclusion
  6.1  Discussion
  6.2  Conclusion
Bibliography
국문초록 (Abstract in Korean)

List of Figures

Figure 1.1, Figure 5.1, Figure 5.2, Figure 5.3, Figure 5.4, Figure 5.5, Figure 5.6

Chapter 1

Introduction

Bayesian optimization (BO) is a powerful model-based optimization technique which can handle difficult optimization problems where the gradient is unavailable and function evaluation is expensive. BO builds a statistical model, or a surrogate function, upon previously acquired data points, generalizing their information to unexplored areas of the data space. Then, BO calls the acquisition function to measure the acquisition priority of unseen data points. The point with the highest acquisition function value is chosen as the next search point, and its function value is queried from the (possibly noisy) oracle, augmenting the dataset. The whole task can be formally written as follows:

Given $\{g_i(x)\}_{i=1}^{K}$, $a(x)$, and $D$, find $x^* = \arg\max_{x_i \in D} f(x_i)$,

where the $g_i(x)$ are the prespecified candidate surrogate functions and $a(x)$ is the prespecified acquisition function. $D$, the domain of $f$, is either a continuous vector space or a set of finite points. The detailed procedure of BO is depicted in Algorithm 1.

In recent years, BO has attracted a great amount of interest from the machine learning community, and there have been advancements in both its theory and its applications. Notable theoretical achievements include proven bounds for popular heuristics [1, 2], information-theoretic acquisition functions [3], and strategies for conditional input spaces [4]. A great deal of effort has also been devoted to BO in high-dimensional spaces [5]. Numerous applications have demonstrated the effectiveness of BO; examples include adaptive robot gait control [6], reinforcement learning [7], and adaptive Monte Carlo [8]. One area that has drawn particularly sharp attention is automated machine learning, especially the tuning of hyperparameters in deep neural networks [9, 10]. More thorough reviews of the progress and applications of BO can be found in the recent surveys [7, 11].

Algorithm 1 Bayesian Optimization
Input: g(x): the surrogate function; a(x): the acquisition function; f(x): the underlying function to be optimized; D: the function domain
Output: f(x+): the incumbent best function value
1: T ← {(x_{i_1}, f(x_{i_1}))}    ▷ initialization
2: while termination condition do
3:   update g(x) with T
4:   x_new ← arg max_{x' ∈ D} a(x')
5:   evaluate f(x_new)
6:   T ← T ∪ {(x_new, f(x_new))}
7: end while

In addition to its effectiveness, Bayesian optimization is an intriguing research area because, in BO, modeling from data and decision making under uncertainty are intertwined. BO thus possesses interesting relationships with various neighboring machine learning fields. BO borrows the tricks of statistical modeling from supervised learning, in particular the Gaussian Process (GP). From the decision-making perspective, BO can be viewed as a special kind of bandit problem, which also faces the exploration-exploitation dilemma. Furthermore, BO and active learning share a similar iterative scheme in which data points are labeled one at a time, although the objectives, and therefore the criteria for data selection, are different.

Although it has never been proven explicitly, the effectiveness of Bayesian optimization is likely to arise from the generalization ability of its surrogate function g(x). BO exploits the information propagated from the previous search points to guide its search. The prediction from g(x) may not be perfect, yet it provides useful information for scheduling a more efficient search. The generalization power should depend on the choice of the surrogate function, whose inductive bias may or may not align with that of the unknown underlying function. In fact, the choice of surrogate function in BO significantly affects the optimization performance, as empirically shown in Figure 1.1. Hence the need to discriminate among the available set of surrogate functions arises. In other words, Bayesian optimization has to be performed adaptively in order to hedge against the risk of selecting a poorly performing surrogate function.

Model selection in supervised learning is usually performed by cross validation, but this is not directly applicable to BO. In BO scenarios, the amount of data is often too small to split into training and validation sets. Moreover, the data points are not independently sampled, possibly causing a highly biased estimate of generalization (a similar problem was reported in the active learning literature [12]). Therefore, a model selection method tailor-made for Bayesian optimization is needed.

It is most reasonable for model selection in Bayesian optimization to be done online. In other words, it should not require more than a single pass of Bayesian optimization. Since each function evaluation is assumed to be prohibitively costly, repeating multiple runs of BO, each of which involves a number of function evaluations, is infeasible. This assumption, however, justifies extra computation for model selection in between function evaluations.

From these motivations, this thesis proposes adaptive Bayesian optimization strategies which can select the best performing surrogate function among a set of putative surrogate functions. Identifying the best model and exploiting it can naturally be formulated as a multi-armed bandit (MAB) problem, where each surrogate function is a bandit arm. In this setting, on every round one surrogate function is probabilistically selected, and the next queried data point is decided by the recommendation of the selected surrogate function. The proposed strategies, as well as a baseline strategy which turned out to perform unexpectedly well, are tested on global optimization benchmarks and on the organic material screening task.

Another contribution of this thesis is to demonstrate a new application of Bayesian optimization which explicitly requires model selection. In this thesis, BO is applied to schedule the computational screening over candidate molecules, in order to find molecules with the desired electronic property in a minimum number of searches. In this task, a set of candidate molecules is given, and the electronic property, e.g., atomization energy or the energy levels of the Highest Occupied Molecular Orbital (HOMO) and the Lowest Unoccupied Molecular Orbital (LUMO), is evaluated via quantum mechanical simulation one molecule at a time. First, the functional form relating molecular property to molecular structure is not known in advance; fixing a type of surrogate function before the actual BO run can therefore be dangerous. Second, the quantum mechanical calculation of chemical properties is notorious for its computational burden. Even though various factors affect the computational cost, a simulation can generally take several hours to a few days [13, 14], preventing repeated runs for model selection.

There have been related works that share the idea of meta-decision making. One line of research explored online selection or blending of data acquisition policies: [15] used the Hedge algorithm to adaptively select among acquisition functions, and [16, 12] used a contextual bandit algorithm to blend among active learning strategies. In active learning, the data acquisition policy is of prime importance because the model to be trained is specified beforehand. On the contrary, in BO the difference between acquisition functions is not as clear-cut as in active learning. For example, both the Expected Improvement (EI) and Upper Confidence Bound (UCB) criteria are known to work well in practice, and it is not very clear exactly when one outperforms the other. Despite this, it is reasonable to dynamically switch from an exploratory strategy to an exploitative strategy as BO proceeds.

On the other hand, [17] focused on the very problem of dynamic kernel selection, which this thesis also concentrates on. It clearly showed that the performance of BO strongly depends on the choice of the kernel function. However, their proposed strategies lack justification such as a bandit formulation, and the reported performances were only modest.

[Figure 1.1] The performance of Bayesian optimization depends on the choice of the surrogate function and on the characteristics of the underlying function. (Regret versus number of data points on Hartmann 6D and Rastrigin 2D, for the kernels rbfard, rbf, poly3, linear, mat32ard, mat32, and expo.)

Chapter 2

Preliminaries

2.1 Bayesian Optimization

As depicted in Algorithm 1, in Bayesian optimization the underlying function (or objective function) f(x) is optimized using the model (or surrogate function) g(x) built upon known pairs of inputs x_i and their function values f(x_i). The acquisition function a(x) decides the next search point from the information provided by the surrogate function. The surrogate function and the acquisition function are the key components of BO.

Surrogate Functions: Gaussian Processes

Gaussian Processes in machine learning are defined by a mean function and a covariance function, and are realized by a multivariate Gaussian distribution over any given set of data points in the domain, whose mean and covariance are determined by the mean function m(x) and the covariance function k(x, x′):

$$f_{1:N} \sim \mathcal{N}(m_{1:N}, K)$$

where $f_{1:N} \equiv (f(x_1), \ldots, f(x_N))^\top$, $m_{1:N} \equiv (m(x_1), \ldots, m(x_N))^\top$, and $K_{ij} = k(x_i, x_j)$. The mean function is frequently set to the zero function, and the covariance function is set to a positive semi-definite kernel function.

GP can also be viewed as a kernelized Bayesian linear regression [18], and therefore enjoys advantages from both kernel methods and the Bayesian treatment: GP is flexible due to the nonlinearity of the kernel function, yet robust due to the Bayesian treatment. Moreover, as a kernel-based algorithm, it is possible to incorporate prior knowledge through the choice of kernel.

Inference on the predictive mean and variance at a test data point x∗ is done by conditioning the joint Gaussian density defined above:

$$\mathbb{E}[f_* \mid f_{1:N}] = m_* + k_*^\top K^{-1} (f_{1:N} - m_{1:N})$$
$$\mathrm{Var}[f_* \mid f_{1:N}] = k(x_*, x_*) - k_*^\top K^{-1} k_*$$

where $k_* = (k(x_1, x_*), \ldots, k(x_N, x_*))^\top$. It is highly desirable that the predictive variance is analytically obtainable, since this value is essential in calculating acquisition function values during BO.
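As a concrete illustration, the following is a minimal NumPy sketch of the predictive equations above for a zero-mean GP; the RBF kernel, its hyperparameters, and the noise level are illustrative assumptions, not the exact settings used in this thesis.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between row-wise points A and B."""
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    """Predictive mean and variance of a zero-mean GP:
    E[f*] = k_*^T K^{-1} y  and  Var[f*] = k(x*, x*) - k_*^T K^{-1} k_*."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_train, X_test)      # k_* for every test point (columns)
    K_inv = np.linalg.inv(K)                  # a Cholesky solve would be preferable in practice
    mean = K_star.T @ K_inv @ y_train
    var = rbf_kernel(X_test, X_test).diagonal() - np.sum(K_star * (K_inv @ K_star), axis=0)
    return mean, var

# toy usage: 5 random training points in 2D, 3 test points
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 2)), rng.normal(size=5)
mu, sigma2 = gp_posterior(X, y, rng.normal(size=(3, 2)))
```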

The main drawback of GP is its limited scalability. Since a matrix inversion is involved, GP scales cubically with the number of training data points. It is typically estimated that GP is not applicable to much more than roughly 10,000 training data points unless special techniques are applied. However, because the cost of data acquisition is assumed to be expensive, the size of the data is usually in the feasible regime for GP. Even so, work such as [19] has tried to improve the scalability of BO by replacing the GP with a neural network.

Acquisition Functions

An acquisition function in Bayesian optimization decides which point in the domain to explore next based on the current prediction of the surrogate function, and therefore plays a central role. To perform a successful optimization, the acquisition function needs to balance exploration (selecting a point whose predictive uncertainty is high) and exploitation (selecting a point whose predictive mean is high). An optimal balance might be calculated by a dynamic programming-like computation, but such a method is highly unlikely to be tractable. Instead, a few heuristic criteria, for example Expected Improvement (EI) and the Gaussian Process Upper Confidence Bound (GP-UCB), are popularly used. The effectiveness of these criteria was first demonstrated empirically and later proven theoretically in recent studies ([2] for EI, and [1] for GP-UCB).

The Expected Improvement (EI) is a decision-theoretic criterion which selects the point with the highest expected utility. Here, utility is defined as the improvement over the best value found so far. Formally, the improvement I at a point x is

$$I(y, y^* \mid x) = \begin{cases} y - y^*, & \text{if } y > y^* \\ 0, & \text{otherwise} \end{cases}$$

where $y^*$ is the incumbent optimum. Its expectation can be calculated exactly by exploiting the Gaussianity of the predictive distribution. Therefore the EI acquisition function $a_{EI}(x) = \mathbb{E}_y[I(y, y^* \mid x)]$ is calculated as

$$\mathbb{E}_y[I(y, y^* \mid x)] = \int_{y^*}^{\infty} (y - y^*)\, p(y \mid x)\, dy = (\mu(x) - y^*)\,\Phi\!\left(\frac{\mu(x) - y^*}{\sigma(x)}\right) + \sigma(x)\,\phi\!\left(\frac{\mu(x) - y^*}{\sigma(x)}\right)$$

where $\mu(x)$ and $\sigma(x)$ are the predictive mean and standard deviation at x, and $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density function and cumulative distribution function of the standard normal distribution. The equations above assume maximization and can be changed accordingly for the minimization setting.
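A minimal sketch of this EI formula, assuming the predictive mean and standard deviation are already available (e.g., from a GP) and that the task is maximization:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI for maximization: (mu - y*) * Phi(z) + sigma * phi(z), with z = (mu - y*) / sigma."""
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    z = np.where(sigma > 0, (mu - y_best) / np.maximum(sigma, 1e-12), 0.0)
    ei = (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
    # at zero-variance points, EI reduces to the plain (non-negative) improvement
    return np.where(sigma > 0, ei, np.maximum(mu - y_best, 0.0))

# usage: pick the candidate with the largest EI
mu = np.array([0.2, 0.5, 0.4]); sigma = np.array([0.3, 0.1, 0.0])
next_index = int(np.argmax(expected_improvement(mu, sigma, y_best=0.45)))
```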

The Gaussian Process Upper Confidence Bound (GP-UCB) acquisition function is motivated by upper confidence bound based strategies for the multi-armed bandit problem, for instance [20]. In GP-UCB, the point with the highest (or the lowest, in the minimization setting) confidence interval value is selected. In BO, the confidence interval can be obtained from the predictive distribution of the surrogate function:

$$a_{GP\text{-}UCB}(x) = \mu(x) + \beta\,\sigma(x)$$

where $\beta$ governs the exploration-exploitation tradeoff. $\beta$ is often scheduled to decrease in order to shift from exploration to exploitation as BO proceeds. As choosing the value of $\beta$ or setting its decay schedule is non-trivial, the EI acquisition function, which has no free parameter, is used throughout the experiments in this thesis.
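For completeness, a small sketch of the GP-UCB score with a simple decaying schedule for beta; the particular schedule is an illustrative assumption, not one prescribed by the thesis.

```python
import numpy as np

def gp_ucb(mu, sigma, beta):
    """GP-UCB acquisition for maximization: mu(x) + beta * sigma(x)."""
    return np.asarray(mu) + beta * np.asarray(sigma)

# example decaying schedule: beta_t = beta_0 / sqrt(t), shifting toward exploitation over time
scores = gp_ucb(mu=[0.2, 0.5, 0.4], sigma=[0.3, 0.1, 0.05], beta=2.0 / np.sqrt(10))
```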

2.2 Multi-Armed Bandit

The multi-armed bandit is the simplest problem that exhibits the exploration-exploitation tradeoff and learning from evaluative feedback. In the original MAB, there are K slot machines with different reward levels, and the player plays one machine at a time. The reward corresponding to the played machine is then given as feedback. Note that the reward from a single machine may differ from trial to trial, since the machine is a gambling device. The objective of the game is to maximize the expected cumulative sum of rewards. The objective is often stated in terms of regret, the difference between the rewards received from the selected actions and from the optimal action that could have been chosen, which should be minimized. In order to achieve the objective, the player must carefully balance exploitation (keep playing the best rewarding machine) and exploration (trying unplayed machines in case a better machine is among them). Due to its simple yet rich structure, extensive research has been done on the problem and numerous variants have been proposed, although they are not the main focus of this thesis.

Setting aside the modern variants of the multi-armed bandit, probably the most straightforward categorization of bandit problems is by whether or not the reward structure has a fixed statistical form. In the former, called the stochastic MAB, rewards are independently generated from a fixed distribution. The latter is called the adversarial MAB, and no particular structure other than the fact that the rewards are bounded can be assumed. The adversarial MAB is therefore a harder problem, which demands that the player constantly explore alternative arms. EXP3, a classical solution for the adversarial MAB, was presented in [21]. In the EXP3 strategy, the probability distribution over actions is mainly determined by the history of received rewards, but it is additionally mixed with a uniform random distribution, enforcing constant exploration. Another strategy, HEDGE [22], unlike EXP3, assumes that feedback from all arms is available.
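A minimal sketch of the EXP3 probability and update rules for K arms, assuming rewards have been scaled to [0, 1]; the variable names are illustrative.

```python
import numpy as np

def exp3_probabilities(cum_rewards, gamma):
    """EXP3 arm-selection probabilities: exponential weights of cumulative
    (importance-weighted) rewards, mixed with a uniform distribution controlled by gamma."""
    K = len(cum_rewards)
    w = np.exp(gamma * np.asarray(cum_rewards) / K)
    return (1.0 - gamma) * w / w.sum() + gamma / K

def exp3_update(cum_rewards, arm, reward, probs):
    """Importance-weighted update: only the pulled arm's cumulative reward is incremented."""
    cum_rewards = np.asarray(cum_rewards, dtype=float).copy()
    cum_rewards[arm] += reward / probs[arm]
    return cum_rewards

# one round of play against an (unknown) reward source
rng = np.random.default_rng(0)
c, gamma = np.zeros(3), 0.3
p = exp3_probabilities(c, gamma)
arm = rng.choice(3, p=p)
c = exp3_update(c, arm, reward=0.7, probs=p)   # 0.7 is a placeholder observed reward
```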


Chapter 3

Organic Material Screening: The Motivational Application

3.1 Beyond Structure-Property Relationship

Chemical research has strong motivation for using machine learning techniques. Many of its investigatory tools, such as wet experiments and quantum mechanics simulations, require non-trivial amounts of resources. In particular, finding a novel molecule with a desired property is one of the most arduous jobs in practice, because it usually involves experimental or computational screening over a candidate molecule set whose size often scales up to hundreds or thousands.

Machine learning researchers have contributed to alleviating the burden of the novel molecule discovery process. The most popular approach is to replace experiments or simulations with a machine learning algorithm, and it is termed quantitative structure property/activity relationship (QSPR/QSAR) analysis in chemoinformatics. The approach aims to build a statistical model that can predict molecular properties given the molecular structure. This endeavor has a long history and has advanced along with the improvement of artificial intelligence techniques. Algorithms including decision trees [23], multilayer perceptrons [24], and support vector machines [25, 26] have been applied to QSPR and have continuously made progress in terms of prediction accuracy. Deep neural networks have also been applied to chemoinformatics, thanks to their booming popularity. In [13], a multi-task deep neural network was trained to predict electronic properties of molecules that are usually calculated by time-consuming quantum mechanics simulations. Another multi-task deep neural network, trained in [27], predicts chemical/biological activities from molecular descriptors and won the first prize in the Merck molecular activity competition (https://www.kaggle.com/c/MerckActivity).

QSPR/QSAR approaches, however, have crucial limitations. First of all, the model's predictions become reliable only after training over a large amount of data gathered from expensive experiments. For example, the neural network in [13] needed more than 5,000 quantum simulation results to reach the reported performance. It would be preferable for an algorithm to function with a smaller amount of data. Furthermore, since the diversity of chemicals is massive, transferability between datasets can be very low. For instance, a model trained with candidate molecules for organic solar cells may perform poorly on drug candidate molecules, because the two sets of chemicals may have different characteristics. Moreover, even with recent improvements in performance, the prediction is still not error-free, and it is risky to make decisions solely based on the model's prediction.

Computer-guided screening may be a more plausible scenario. It is hard for machine learning algorithms to completely replace the conventional chemical investigation apparatus, but machine learning algorithms can be used to make the sequence of chemical investigations more efficient. Such approaches are called (optimal) experimental design [28], and Bayesian optimization can be viewed as one of them. BO, by recommending the next search points which are most likely to have optimal properties, can guide the search over the chemical compound space.

3.2 Dataset: Electronic Properties of Organic Molecules

Among the diverse molecular properties of interest, this thesis demonstrates the utility of Bayesian optimization in the chemistry domain on the task of screening for molecules with a desired electronic property, for two reasons. First, molecular electronic properties have huge industrial importance, with applications such as photovoltaic cells and light emitting devices. Second, building a unified automated system is more feasible: unlike biochemical properties, which are typically evaluated by wet experiments, electronic properties can be calculated in silico by quantum mechanics simulations. As the search guidance (Bayesian optimization) and the function evaluation (quantum simulation) can be performed on a single machine or cluster, the whole loop can be closed and automated.

Recently, there have been advancements in modeling the electronic properties of organic molecules, and a rich dataset was made public. QM7b [13], the released dataset, contains 7,211 organic molecules consisting of up to 23 atoms of C, H, O, N, S, and Cl, with 14 property values per molecule. The list of properties includes the Highest Occupied Molecular Orbital (HOMO) energy level, the Lowest Unoccupied Molecular Orbital (LUMO) energy level, polarizability, and a few others. Some properties are duplicated, since they are estimated from different quantum simulation methods. Among the properties, we focus on the band gap (LUMO energy level − HOMO energy level), because it determines which wavelengths of light the molecule will mostly interact with; such interactions are of crucial importance in applications involving light.
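As a toy illustration of the target quantity, a band-gap array can be derived from HOMO and LUMO energy levels and used to rank molecules; the arrays below are made-up placeholders, not values or the file format of QM7b.

```python
import numpy as np

# hypothetical HOMO/LUMO energy levels (eV) for a handful of molecules
homo = np.array([-6.1, -5.8, -6.4, -5.5])
lumo = np.array([ 1.2,  0.4,  0.9,  1.6])

band_gap = lumo - homo              # band gap = LUMO energy level - HOMO energy level
widest = int(np.argmax(band_gap))   # index of the molecule with the widest gap
```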


Chapter 4

Bayesian Optimization with Multiple Surrogate Functions

4.1 Problem Formulation

In this thesis, the typical setting of Bayesian optimization is modified to yield a novel BO setting which explicitly accounts for the effect of surrogate functions on the progress of optimization. In the proposed formulation, a set of candidate surrogate functions {g_i(x)} is given before the BO starts, yet it is unknown which surrogate function (or which blending strategy) will provide the best optimization performance. Assuming multiple possible choices of surrogate function imposes an additional layer of decision making on top of the conventional BO. The whole problem is still an optimization, and therefore the goal is to find the optimal point x∗ with the least number of function evaluations. The modified problem can be stated formally as follows (if the objective is minimization, arg max is replaced with arg min):

Given $\{g_i(x)\}_{i=1}^{K}$, $a(x)$, and $D$, find $x^* = \arg\max_{x_i \in D} f(x_i)$,

where the $g_i(x)$ are the prespecified candidate surrogate functions and $a(x)$ is the prespecified acquisition function. $D$, the domain of $f$, is either a continuous vector space or a set of finite points.

To tackle the problem of BO with multiple candidate surrogate functions, the multi-armed bandit framework is adopted in this thesis. Each candidate surrogate function is viewed as an arm in the MAB setting, and the queried value, i.e., the y value of the newly acquired data point, or a value derived from it, is considered as the reward. Note that it is also possible to take approaches which have no connection to MAB, and [17] pursued such a direction. The following sections describe the approaches investigated in this thesis.

4.2 Baseline: Random Arm

One of the most naive approaches to taking advantage of information from multiple surrogate functions is to randomly select a surrogate function to use on each round. By analogy to the MAB setting, this strategy is referred to as RandomArm in the rest of the thesis.

4.3 Proposed Strategies

Two multi-armed bandit-based strategies are proposed in this thesis. Both strategies continuously evaluate their arms (their candidate surrogate functions) as Bayesian optimization proceeds, but they differ in which information is used for the evaluation. The simpler one, referred to as the partial information strategy or the EXP3-based adaptive Bayesian optimization, feeds the newly acquired function value as the reward to the EXP3 algorithm, and EXP3 updates the probabilities assigned to each surrogate function. Despite its clarity, the EXP3-based strategy is expected to suffer from poor scalability with the number of arms. In order to remedy this problem, the full information strategy, or the HEDGE-based adaptive Bayesian optimization, is also devised. It attempts to calculate feedback for all surrogate functions from the posterior mean value of each surrogate function. The HEDGE-based adaptive Bayesian optimization is largely inspired by [15]. The strategies are described in detail in Algorithm 2 and Algorithm 3.


Algorithm 2 EXP3-based Adaptive Bayesian Optimization
Input: {g_i(x)}_{i=1}^K: the set of candidate surrogate functions; a(x; g_{i_t}): the acquisition function calculated based on g_{i_t}; f(x): the underlying function to be optimized; D: the function domain; γ ∈ (0, 1]
Output: f(x+): the incumbent best function value
1: T ← {(x_{i_1}, f(x_{i_1}))}    ▷ the initial search point
2: for i = 1, ..., K, set c_i(1) = 0    ▷ initialization of cumulative rewards
3: t = 1
4: while termination condition do
5:   update {g_i(x)}_{i=1}^K with T
6:   for i = 1, ..., K, set w_i(t) = exp(γ c_i(t) / K)
7:   for i = 1, ..., K, set p_i(t) = (1 − γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ / K
8:   select i_t ∈ {1, ..., K} probabilistically according to p_1(t), ..., p_K(t)
9:   x_new ← arg max_{x' ∈ D} a(x'; g_{i_t})
10:  evaluate f(x_new)
11:  receive reward r_{i_t}(t) = f(x_new)    ▷ can be scaled to [0, 1]
12:  for j = 1, ..., K, set r_j(t) = r_j(t) / p_j(t) if j = i_t, otherwise r_j(t) = 0
13:  for j = 1, ..., K, update c_j(t + 1) = c_j(t) + r_j(t)
14:  T ← T ∪ {(x_new, f(x_new))}
15:  t = t + 1
16: end while


Algorithm 3 HEDGE-based Adaptive Bayesian Optimization
Input: {g_i(x)}_{i=1}^K: the set of candidate surrogate functions; a(x; g_{i_t}): the acquisition function calculated based on g_{i_t}; f(x): the underlying function to be optimized; D: the function domain; γ ∈ (0, 1]
Output: f(x+): the incumbent best function value
1: T ← {(x_{i_1}, f(x_{i_1}))}    ▷ the initial search point
2: for i = 1, ..., K, set c_i(1) = 0    ▷ initialization of cumulative rewards
3: t = 1
4: update {g_i(x)}_{i=1}^K with T
5: while termination condition do
6:   for i = 1, ..., K, set w_i(t) = exp(γ c_i(t))
7:   for i = 1, ..., K, set p_i(t) = w_i(t) / Σ_{j=1}^K w_j(t)
8:   select i_t ∈ {1, ..., K} probabilistically according to p_1(t), ..., p_K(t)
9:   x_new ← arg max_{x' ∈ D} a(x'; g_{i_t})
10:  evaluate f(x_new)
11:  T ← T ∪ {(x_new, f(x_new))}
12:  for j = 1, ..., K and k = 1, ..., K, evaluate µ_j(x_k)    ▷ the posterior means of the surrogate functions at every suggested point x_k = arg max_{x' ∈ D} a(x'; g_k)
13:  receive rewards: for k = 1, ..., K, r_k(t) = Σ_{j=1}^K µ_j(x_k) p_j(t)
14:  for k = 1, ..., K, update c_k(t + 1) = c_k(t) + r_k(t)
15:  t = t + 1
16: end while
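To make the full-information feedback of Algorithm 3 concrete, here is a small NumPy sketch of the reward and probability computations in lines 6-7 and 13-14, assuming the surrogate posterior means at the K suggested points have already been collected into a matrix; the names and shapes are illustrative assumptions.

```python
import numpy as np

def hedge_probabilities(cum_rewards, gamma):
    """HEDGE selection probabilities: p_i proportional to exp(gamma * c_i)."""
    w = np.exp(gamma * np.asarray(cum_rewards))
    return w / w.sum()

def hedge_rewards(posterior_means, probs):
    """posterior_means[j, k] = mu_j(x_k): posterior mean of surrogate j at the point
    suggested by surrogate k. Reward of arm k is the p-weighted sum over j:
    r_k = sum_j mu_j(x_k) * p_j."""
    return np.asarray(probs) @ np.asarray(posterior_means)

# one round with K = 3 candidate surrogates
c, gamma = np.zeros(3), 0.5
p = hedge_probabilities(c, gamma)
mu = np.array([[0.2, 0.1, 0.3],      # surrogate 1's posterior mean at x_1, x_2, x_3
               [0.4, 0.2, 0.1],      # surrogate 2
               [0.1, 0.5, 0.2]])     # surrogate 3
c = c + hedge_rewards(mu, p)         # cumulative-reward update, as in line 14 of Algorithm 3
```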


Chapter 5

Experiments

The proposed strategies for BO with multiple candidate surrogate functions are demonstrated on global optimization benchmark functions and on the organic material screening task. In the experiments, Gaussian Processes with different kernel functions are used as the candidate surrogate functions, so the demonstrated task can be viewed as dynamic selection among kernels. It should be noted, however, that the proposed problem setting and strategies are not restricted to kernel-related schemes; for example, the selection between multiple possible data representations can also be performed in the proposed setting.

The implementation of Gaussian Processes and their kernels is from GPy [29], a Python package for GP. Hyperparameters are handled by optimizing the marginal likelihood with hyperpriors that are broad log-normal distributions. The Expected Improvement acquisition function is used for Bayesian optimization. All the following experiments assume that the domain is a finite set of elements given in advance. This follows the constraint of the motivational application, the screening of molecules, where each element x in the domain D corresponds to a molecule. To avoid sampling bias in this situation, the dataset D = {x_i} is resampled on every repeated experiment. For the benchmark functions, 1,000 data points are generated randomly in their specified domains, and for the quantum machine dataset, 2,000 molecules are randomly selected among the 7,211 molecules. Note that, in this case, the optimization of the acquisition function can be done by exhaustive search, and therefore heuristic optimization methods such as DiRect [30] are not needed.
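The following sketch illustrates one BO step in this finite-domain setup with GPy: fit a GP with one candidate kernel to the points queried so far, then score every remaining candidate with EI and pick the maximizer. The kernel choice, the toy objective, and the hyperparameter handling here are illustrative assumptions, not the exact configuration (e.g., the log-normal hyperpriors) used in the thesis.

```python
import numpy as np
import GPy
from scipy.stats import norm

def ei_over_candidates(model, X_candidates, y_best):
    """Exhaustive EI over a finite candidate set (maximization)."""
    mu, var = model.predict(X_candidates)              # GPy returns predictive mean and variance
    mu, sigma = mu.ravel(), np.sqrt(np.maximum(var.ravel(), 1e-12))
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
D = rng.uniform(-1, 1, size=(200, 4))                  # finite domain of 200 candidate points
f = lambda X: -np.sum(X**2, axis=1)                    # toy objective standing in for a simulation
queried = list(rng.choice(len(D), size=3, replace=False))
X_train, y_train = D[queried], f(D[queried])[:, None]

kernel = GPy.kern.Matern32(input_dim=4)                # one candidate surrogate; RBF, Linear, ... are alternatives
model = GPy.models.GPRegression(X_train, y_train, kernel)
model.optimize()                                       # maximize the marginal likelihood

remaining = [i for i in range(len(D)) if i not in queried]
scores = ei_over_candidates(model, D[remaining], y_best=float(y_train.max()))
next_point = D[remaining[int(np.argmax(scores))]]      # the next point (molecule) to evaluate
```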

Every figure that follows is from experiments repeated 100 times with subset resampling, unless specified otherwise. On every repetition, the starting point of BO is shared across all strategies or kernels being compared, to ensure fairness of the comparison. The performance is measured by the notion of regret r, which is the difference between the global optimum $f(x^*) = \max_{x \in D} f(x)$ and the optimum acquired so far $f(x^+) = \max_{x \in T} f(x)$. It should be noted that the actual value of the regret is unknown during BO, since the global optimum value is unknown.

[Figure 5.1] The proposed strategies are tested on the benchmark functions given two candidate surrogate functions. (Regret versus number of data points; left: Hartmann 6D with mat32, linear, EXP3, HEDGE, and random_arm; right: Rastrigin 2D with rbf, linear, EXP3, HEDGE, and random_arm.)
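A small sketch of how a regret curve of this kind can be computed for evaluation, assuming the true optimum of the finite candidate set is known offline:

```python
import numpy as np

def regret_curve(f_queried, f_optimum):
    """r_t = f(x*) - running best of the queried values; non-increasing by construction."""
    return f_optimum - np.maximum.accumulate(np.asarray(f_queried, dtype=float))

# e.g., five evaluations against a known optimum of 1.0
r = regret_curve([0.2, 0.6, 0.5, 0.9, 0.7], f_optimum=1.0)   # -> [0.8, 0.4, 0.4, 0.1, 0.1]
```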

5.1 Benchmark Functions

The proposed strategies are first tested on popular benchmarks for global optimization. Among the many benchmark functions in [31], the Rastrigin (2D) and Hartmann (6D) functions are chosen. Interestingly, the number of candidate surrogate functions and their combination (for example, two similarly performing surrogate functions, or one good and one bad function) affect the performance of the strategies. Therefore, varying conditions are tested, and the results are shown in Figure 5.1 through Figure 5.4.

[Figure 5.2] The probability of selecting each surrogate function (averaged over multiple runs). (Arm-selection probability versus number of data points for EXP3 on Hartmann 6D; left: mat32 vs. linear, right: rbf vs. linear.)

[Figure 5.3] The performance of the proposed strategies given four candidate surrogate functions, and the averaged probabilities of selecting each candidate surrogate function. (Regret on Hartmann 6D for mat32, rbfard, expo, linear, EXP3, HEDGE, and random_arm; arm-selection probabilities for EXP3 and HEDGE.)

[Figure 5.4] The performance of the proposed strategies given six candidate surrogate functions, and the averaged probabilities of selecting each candidate surrogate function. (Regret on Hartmann 6D for mat32, rbfard, expo, poly3, mat32ard, linear, EXP3, HEDGE, and random_arm; arm-selection probabilities for EXP3 and HEDGE.)

[Figure 5.5] The RandomArm strategy is applied with four similarly performing kernels (rbfard, mat32ard, expo, linear) on Hartmann 6D, outperforming all of them.

5.2 Screening over Organic Molecules

[Figure 5.6] (Left) Bayesian optimization is applied to find a molecule with the maximum HOMO-LUMO band gap in QM7b; the choice of the surrogate function affects the regret curve, and 'random' refers to a non-BO strategy that searches in a random order (kernels: rbf, poly3, linear, mat32, expo). (Right) The proposed strategies are tested given three popular candidate surrogate functions (mat32, rbf, expo) together with EXP3, HEDGE-MOD, and Randarm.

As stated in Chapter 3, Bayesian optimization may be an effective solution for finding a molecule with the most desirable electronic property

among a set of candidate molecules. When BO is applied to the chemical screening task, the molecular property of interest is the function value, and molecules are the elements of the domain. The structure-property relationship is then the function being optimized, and quantum simulations correspond to evaluations of the function.

To show the feasibility of the approach, the QM7b dataset is used. As the dataset already contains the results of quantum simulations, in the following experiments quantum simulations are not performed again; the property values are simply queried from the dataset. This saves time by preventing repeated execution of the same calculations, and therefore enables repeated experiments. Among the 14 electronic properties provided in the dataset, the HOMO-LUMO band gap calculated from the GW simulation is chosen as the target electronic property.

Standard Bayesian optimization is demonstrated on the QM7b dataset, and the result is shown in the left panel of Figure 5.6. The figure verifies the effectiveness of BO in the chemical screening task: for most surrogate functions, BO yields lower regret than randomly ordered screening. However, if the RBF kernel is chosen, BO yields roughly the same regret as the random search. Hence, choosing an adequate surrogate function, or at least hedging against the risk of choosing a wrong surrogate function, is important in this case.

It is intriguing that the less popular kernels (the linear kernel and the third-order polynomial kernel) achieved better performance. To consider a more realistic scenario, the adaptive strategies are tested with the remaining three kernels. The result is shown in the right panel of Figure 5.6. The proposed methods did not show notable improvement compared to the RandomArm strategy. More surprisingly, however, the regret of RandomArm is similar to or lower than that of BO with the exponential kernel, which showed the lowest regret among the three.


Chapter 6

Discussion and Conclusion

6.1 Discussion

One of the most counter-intuitive observations is that pulling an inferior arm, i.e., acquiring data points from a suboptimal surrogate function, does not always worsen the optimization performance. When a suboptimal function is only modestly inferior, it often enables the blending strategies (even RandomArm) to achieve performance superior to that of the optimal arm. This phenomenon is most clearly demonstrated in Figure 5.5, and also in the right panel of Figure 5.6. It can be hypothesized that the blending strategies work like ensemble models in supervised learning: merging information from multiple models is likely to improve the prediction. This may also explain the unexpected performance of RandomArm in Figure 5.1, where the regret curve of RandomArm is notably close to that of the better surrogate function. Even suboptimal surrogate functions are able to provide useful information for the optimization.

The approach taken in this thesis assumes that selecting the best-performing surrogate function (or ruling out suboptimal surrogate functions) would yield improved optimization performance. However, from the observation addressed above, this is not always the case. By sharing data points, the candidate surrogate functions share information, and therefore intriguing dynamics are generated. This is clearly different from the scenario with multiple acquisition functions investigated in [15].

In the experiments shown above, the MAB-based strategies did not always show pronounced improvement over the RandomArm strategy. One possible reason is that MAB-based strategies usually need a 'warm-up' period before making a sharp decision. As shown in the figures, the probabilities of selecting an arm are close to uniform at the beginning of BO and then evolve slowly. This is because the MAB methods used in this thesis make decisions based on cumulative rewards, and multiple rounds of playing the arms are needed to produce enough of a difference. BO typically runs in a 'small data' regime, so this 'slow decision' characteristic of MAB is not a perfect fit.

[15] used the posterior predictive mean value as a reward, but this can be problematic with multiple surrogate functions, because the predictive mean values may scale differently. For example, during the experiments conducted in this thesis, linear kernels tended to produce extreme posterior mean values, which can be expected from their linearity. If used without caution, those extreme posterior values may significantly skew the decision making process. One alternative is to exploit information expressed in the form of probabilities, available in different forms of BO such as [3]. This can be an interesting direction for future research.

Exploiting the information structure behind the problem is key to a successful solution of an MAB game (or other decision making problems). Currently, the puzzling dynamics among candidate surrogate functions are not properly accounted for, and this is probably the main reason why the tested strategies sometimes underperform the RandomArm strategy. In order to devise a strategy that incorporates this specific information structure, a deeper understanding of the model-reality discrepancy is needed. This may lead to research on a BO version of computational learning theory.

6.2 Conclusion

The problem of performing Bayesian optimization in the multiple-candidate-surrogate-function setting is investigated in this thesis. Two multi-armed bandit-based strategies, an EXP3-based and a HEDGE-based one, are proposed and tested against the RandomArm strategy as a baseline. The superiority of the proposed methods is not clear, since they outperform the baseline only occasionally. However, the experimental results provided in this thesis indicate interesting dynamics regarding surrogate functions and the performance of BO. Performing BO with multiple surrogate functions leads to a blending of information from the surrogate functions, often resulting in performance beyond expectation. This effect is prominent even for the RandomArm strategy, suggesting that it can be a useful hedging technique in situations where the risk of selecting a wrong surrogate function is high. The organic material screening task is a good example of such a situation. With BO, molecules with desired properties can be found with a smaller number of property simulations, and blending strategies can hedge the risk of selecting a wrong surrogate function as well as improve the optimization performance. The overall results in this thesis are closely related to an intriguing aspect of the model-based decision making problem, where a model may deviate from reality, or multiple candidate models may be available. Little is known about this aspect, and hence revealing the underlying mechanism of how models affect decision performance should be a very exciting direction for future research.


Bibliography

[1] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger, "Information-theoretic regret bounds for Gaussian process optimization in the bandit setting," IEEE Transactions on Information Theory, vol. 58, no. 5, pp. 3250–3265, 2012.
[2] A. D. Bull, "Convergence rates of efficient global optimization algorithms," The Journal of Machine Learning Research, vol. 12, pp. 2879–2904, Nov. 2011.
[3] J. M. Hernandez-Lobato, M. W. Hoffman, and Z. Ghahramani, "Predictive entropy search for efficient global optimization of black-box functions," in Advances in Neural Information Processing Systems, pp. 918–926, 2014.
[4] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kegl, "Algorithms for hyper-parameter optimization," in Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.
[5] K. Kandasamy, J. Schneider, and B. Poczos, "High dimensional Bayesian optimisation and bandits via additive models," in Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (D. Blei and F. Bach, eds.), pp. 295–304, JMLR Workshop and Conference Proceedings, 2015.
[6] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret, "Robots that can adapt like animals," Nature, vol. 521, no. 7553, pp. 503–507, 2015.
[7] E. Brochu, V. M. Cora, and N. de Freitas, "A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning," arXiv preprint arXiv:1012.2599, 2010.
[8] C. E. Rasmussen, "Gaussian processes to speed up hybrid Monte Carlo for expensive Bayesian integrals," Bayesian Statistics, vol. 7, pp. 651–659, 2008.
[9] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.
[10] K. Swersky, J. Snoek, and R. P. Adams, "Freeze-thaw Bayesian optimization," arXiv preprint arXiv:1406.3896, 2014.
[11] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, "Taking the human out of the loop: A review of Bayesian optimization," tech. rep., Universities of Harvard, Oxford, Toronto, and Google DeepMind, 2015.
[12] Y. Baram, R. El-Yaniv, and K. Luz, "Online choice of active learning algorithms," The Journal of Machine Learning Research, vol. 5, pp. 255–291, 2004.
[13] G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Muller, and O. A. von Lilienfeld, "Machine learning of molecular electronic properties in chemical compound space," New Journal of Physics, vol. 15, no. 9, p. 095003, 2013.
[14] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, A. V. Lilienfeld, and K.-R. Muller, "Learning invariant representations of molecules for atomization energy prediction," in Advances in Neural Information Processing Systems, pp. 440–448, 2012.
[15] M. D. Hoffman, E. Brochu, and N. de Freitas, "Portfolio allocation for Bayesian optimization," in Uncertainty in Artificial Intelligence, pp. 327–336, 2011.
[16] W.-N. Hsu and H.-T. Lin, "Active learning by learning," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[17] I. Roman, R. Santana, A. Mendiburu, and J. A. Lozano, "Dynamic kernel selection criteria for Bayesian optimization," in BayesOpt 2014: NIPS Workshop on Bayesian Optimization, 2014.
[18] C. E. Rasmussen, Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams, "Scalable Bayesian optimization using deep neural networks," in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015.
[20] P. Auer, "Using confidence bounds for exploitation-exploration trade-offs," The Journal of Machine Learning Research, vol. 3, pp. 397–422, 2003.
[21] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[22] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "Gambling in a rigged casino: The adversarial multi-armed bandit problem," in Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pp. 322–331, IEEE, 1995.
[23] D. M. Hawkins, S. S. Young, and A. Rusinko, "Analysis of a large structure-activity data set using recursive partitioning," Quantitative Structure-Activity Relationships, vol. 16, no. 4, pp. 296–302, 1997.
[24] J. Devillers, Neural Networks in QSAR and Drug Design. Academic Press, 1996.
[25] R. Burbidge, M. Trotter, B. Buxton, and S. Holden, "Drug design by machine learning: support vector machines for pharmaceutical data analysis," Computers & Chemistry, vol. 26, no. 1, pp. 5–14, 2001.
[26] U. Norinder, "Support vector machine models in drug design: applications to drug transport processes and QSAR using simplex optimisations and variable selection," Neurocomputing, vol. 55, no. 1, pp. 337–346, 2003.
[27] G. E. Dahl, N. Jaitly, and R. Salakhutdinov, "Multi-task neural networks for QSAR predictions," arXiv preprint arXiv:1406.1231, 2014.
[28] K. Chaloner and I. Verdinelli, "Bayesian experimental design: A review," Statistical Science, pp. 273–304, 1995.
[29] The GPy authors, "GPy: A Gaussian process framework in Python." http://github.com/SheffieldML/GPy, 2012–2015.
[30] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, "Lipschitzian optimization without the Lipschitz constant," Journal of Optimization Theory and Applications, vol. 79, no. 1, pp. 157–181, 1993.
[31] A.-R. Hedar, "Test functions for unconstrained global optimization." http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/TestGO_files/Page364.htm, 2012–2015.

국문초록 (Abstract in Korean)

Bayesian optimization is a model-based optimization technique that performs optimization based on a probabilistic model built from previously explored points. The performance of Bayesian optimization depends strongly on the kind of probabilistic model used, and in many cases it cannot be known in advance which model will work best. This thesis presents a modified Bayesian optimization problem in which multiple surrogate functions can be used, and experiments with two multi-armed bandit-based strategies to address it. The proposed strategies adaptively decide which of the surrogate functions to use. The strategies are tested on optimization benchmark functions and on an organic molecule screening task; in the screening task, the choice of surrogate function matters greatly, so the proposed strategies play a particularly important role. Surprisingly, the baseline strategy used as a reference point, which selects a surrogate function at random, showed respectable performance. These results suggest that relaxing the restriction on the number of surrogate functions in Bayesian optimization gives rise to interesting phenomena and is a meaningful direction for future research.

Keywords: Bayesian Optimization, Multi-Armed Bandit, Gaussian Process, Chemoinformatics

Student Number: 2014-21320