Introduction to Materials Informatics and Basic Methods - sjtucms.sjtu.edu.cn/doc/courseware/2019/materials...surya r....


Page 1: Introduction to Materials Informatics and Basic Methods - SJTUcms.sjtu.edu.cn/doc/courseware/2019/materials...Surya R. Kalidindi, in Informatics for Materials Science and Engineering: Data-driven Discovery

Introduction to Materials Informatics and Basic Methods

张澜庭 (Zhang Lanting), [email protected]

June 4, 2019

Page 2:

Demand-driven, data-centered, aimed at materials design

• What kind of data
  • Systematic
  • Purposeful and designed
  • Not a piecemeal accumulation of quantity

• How to generate the data
  • High-throughput experiments
  • High-throughput computation

• How to analyze/mine the data
  • The capacity to process large volumes of data (big data?)
  • Intelligent analysis (machine learning?)

Page 3:

The processing-structure-property-performance linkage (PSPP linkage) in materials science

Processing → Structure → Property → Performance

Page 4:

Surya R. Kalidindi, Hierarchical Materials Informatics: Novel Analytics for Materials Data, Elsevier 2016, p.13

Surya R. Kalidindi, in Informatics for Materials Science and Engineering: Data-driven Discovery for Accelerated Experimentation and Application, Krishna Rajan ed, Elsevier 2013, p.452

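Page 5:

Working modes of the Materials Genome approach

• 1) Rapid, large-scale experimentation, where quantity leads to qualitative change
  • Direct, rapid optimization and screening based on high-throughput synthesis and characterization experiments (akin to exhaustive search)

• 2) Computation leads, experiment verifies
  • Predict promising candidate materials by computational simulation and so narrow the experimental scope (work it out on paper first)

• 3) Machine learning and data mining (materials informatics)
  • Starting from large amounts of data, use machine learning to identify the characteristic parameters and predict candidate materials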

Page 6:

http://mits.nims.go.jp/index_en.html

Page 7:

Creep life of 38 Ni-base alloys

• Regression using alloy chemistry only: R=0.32

• Regression using chemistry and γ′ fraction: R=0.98

Zhao, J.-C. and Henry, M.F. (2002), CALPHAD—Is It Ready for Superalloy Design?. Adv. Eng. Mater., 4: 501–508.

Page 8:

Materials informatics

• Bioinformatics: an interdisciplinary field that develops methods and software tools for understanding biological data. (https://en.wikipedia.org/wiki/Bioinformatics)

• Materials informatics: a field of study that applies the principles of informatics to materials science and engineering to better understand the use, selection, development, and discovery of materials. (https://en.wikipedia.org/wiki/Materials_informatics)

Page 9:

Machine learning

Courtesy: Assoc. Prof. 岳晓东 (Yue Xiaodong), Shanghai University


Page 11:

Combining multiple methods

• Mastering the game of Go with deep neural networks and tree search (2016)

• Mastering the game of Go without human knowledge (2017)

Self-play reinforcement learning in AlphaGo Zero

• Monte Carlo tree search: sampling and pruning

• How to evaluate a board position? Deep neural networks!

Prof. Hinton, Science, 2006

1. data sets were big enough;

2. computers were fast enough;

3. the initial weights were close enough to a good solution.

Page 12:

How should we view artificial intelligence?

Will we become the next 柯洁 (Ke Jie)? X

If you can't beat it, lead it.


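Page 14:

Machine learning is changing how materials are discovered

What the May 2016 Nature article teaches us:

1) Machine learning can "learn" patterns from "failed" data and use them to predict new materials. These data used to be regarded as "failures" and lay dormant in lab notebooks for years, possibly never to see the light of day.

2) There are only bad "results", never bad data.

3) Measured against the manual judgment of experienced chemists, the machine's predictions won with a success rate of 89% versus 78%.

4) It demonstrates the power of machine-learning methods, much like the impact of AlphaGo on the game of Go.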

Page 15:

SVM-derived decision tree

Uncovering the correlations among existing causal conditions

Machine-learning-assisted materials discovery using failed experiments, Nature 533, 73–76 (05 May 2016)

Page 16:

Strategies for using artificial intelligence and data technology to accelerate the commercialization of advanced materials in multinational materials companies

• Compared with the software industry, investment in the new-materials industry carries risks similar to those of biotechnology: long timelines, high uncertainty, and enormous commercialization costs (hundreds of millions of US dollars)

Elicia Maine and Prunesh Seegopaul, Nature Materials 15, 487–491 (2016), DOI: 10.1038/nmat4625

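Page 17:

Strategies for using artificial intelligence and data technology to accelerate the commercialization of advanced materials in multinational materials companies

• To cope with this risk, giants such as BASF, Dow Chemical, and Evonik Industries have begun bringing computation-intensive scientific and business components into a new domain

• Companies are adopting computational methods from artificial intelligence and machine learning, drawing on the large amounts of data already available in laboratories and manufacturing plants, to speed up experimentation and broaden its scope

• As cloud computing has turned data storage and access into a cheap commodity, the focus of IT has shifted to developing software that helps extract meaning from the data generated by materials science R&D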

Page 18:

BASF case study: herbicide development

Herbicides developed with visual machine learning: 95% accuracy, 90% time saved, operator bias eliminated, lower cost, user-friendly

https://www.basf.com/documents/corp/en/investor-relations/calendar-and-publications/presentations/2017/170628_BASF_RD_Roundtable-2017_Trethewey.pdf

Page 19:

BASF case study: retrieving information from the literature

Machine learning and semantic ontologies are used to analyze the literature and raise the hit rate of literature searches, for example narrowing the 48,000 papers returned by a keyword search down to 38 relevant ones

https://www.basf.com/documents/corp/en/investor-relations/calendar-and-publications/presentations/2017/170628_BASF_RD_Roundtable-2017_Trethewey.pdf

Page 20:

Citrine Informatics case study

• Citrine Informatics: a Silicon Valley company, founded in 2013, that develops a materials data and artificial intelligence (AI) platform

• On April 19 it announced US$8 million in funding from Tencent, to meet international demand for the development of materials AI

• December 7, 2017: Panasonic set out to use materials informatics to optimize its materials development

Page 21:

Machine learning

• Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

• Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Page 22:

Basic categories of machine-learning algorithms

• Supervised learning ("right answers" given)
  • Regression
  • Classification

• Unsupervised learning


Page 25:

Regression problems (making predictions from known data)

Known data: the training set

N parameters: multiple regression analysis

Linear vs nonlinear regression models

$$f(x) = a_0 + \sum_{n=1}^{N} a_n x_n$$
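
To make the linear model concrete, here is a minimal sketch of a least-squares fit of f(x) = a0 + Σ a_n x_n in plain Python. The function names `fit_linear` and `predict` and the toy training set are illustrative assumptions, not part of the course material; the fit solves the normal equations with Gaussian elimination, which is only one of several ways to do it.

```python
def fit_linear(X, y):
    """Least-squares fit of f(x) = a0 + sum_n a_n * x_n via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # leading 1 carries the intercept a0
    m = len(rows[0])
    # A = X^T X and b = X^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(m)]
    # Forward elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda k: abs(A[k][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for k in range(col + 1, m):
            factor = A[k][col] / A[col][col]
            for j in range(col, m):
                A[k][j] -= factor * A[col][j]
            b[k] -= factor * b[col]
    # Back substitution
    coef = [0.0] * m
    for i in reversed(range(m)):
        s = sum(A[i][j] * coef[j] for j in range(i + 1, m))
        coef[i] = (b[i] - s) / A[i][i]
    return coef  # [a0, a1, ..., aN]


def predict(coef, x):
    """Evaluate the fitted linear model at one point x = (x1, ..., xN)."""
    return coef[0] + sum(a * v for a, v in zip(coef[1:], x))


# Hypothetical training set generated from y = 1 + 2*x1 - 3*x2
X_train = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y_train = [1 + 2 * x1 - 3 * x2 for x1, x2 in X_train]
coef = fit_linear(X_train, y_train)
```

Because the toy data are noise-free, the recovered coefficients should match the generating values; with real measurements the fit minimizes the squared residuals instead.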

Page 26:

Case study: fatigue strength predictor

http://info.eecs.northwestern.edu/SteelFatigueStrengthPredictor

Page 27:

Practice data (400+ records): download from http://cms.sjtu.edu.cn/download.html

• NT: normalizing temperature
• THT: hardening temperature
• THt: hardening time
• THQCr: cooling rate of the hardening treatment
• CT: carburizing temperature
• Ct: carburizing time
• DT: diffusion treatment temperature
• Dt: diffusion treatment time
• QmT: quenching medium temperature
• TT: tempering temperature
• Tt: tempering time
• TCr: cooling rate of the tempering treatment
• C, Ni, Cr, Mo: chemical composition
• Fatigue: the property value, fatigue strength (at 10^7 cycles)

Page 28:

In-class exercise

• Use the Microsoft Azure cloud computing platform to carry out a regression analysis of the data

• Sign in to Azure Machine Learning Studio: https://studio.azureml.net

• Example data (data for exercise.xlsx): download from http://cms.sjtu.edu.cn/download.html

• Practice cross-validation
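
For readers who want to see what cross-validation does before trying it in Azure, the following is a minimal k-fold sketch in plain Python. It is not the Azure workflow: the helper names (`fit_line`, `k_fold_rmse`), the single-variable model, and the synthetic data are all assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a0 + a1 * x for a single variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def k_fold_rmse(xs, ys, k=5, seed=0):
    """Return the held-out RMSE of each of the k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]     # k disjoint index sets
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a0, a1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - (a0 + a1 * xs[i])) ** 2 for i in held_out]
        scores.append((sum(sq_err) / len(sq_err)) ** 0.5)
    return scores


# Synthetic data (hypothetical): a straight line plus small Gaussian noise
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 0.1) for x in xs]
scores = k_fold_rmse(xs, ys, k=5)
mean_rmse = sum(scores) / len(scores)
```

Each record is used for testing exactly once and for training k-1 times, so the mean of the fold scores estimates how the model behaves on data it has not seen.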

Page 29:

Complex classification problems

Explicit variables vs latent variables

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \dots$$
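
One way to read the polynomial above is as a linear model over derived features (the products and squares of x1 and x2). The sketch below is a hedged illustration: `poly_features`, `classify`, and the coefficient values are hypothetical and chosen so that the decision boundary is the unit circle.

```python
def poly_features(x1, x2):
    """Second-order feature map: 1, x1, x2, x1*x2, x1^2, x2^2."""
    return [1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2]


def classify(theta, x1, x2):
    """Return 1 if the polynomial is non-negative at (x1, x2), else 0."""
    z = sum(t * f for t, f in zip(theta, poly_features(x1, x2)))
    return 1 if z >= 0 else 0


# Hypothetical coefficients: -1 + x1^2 + x2^2 >= 0 means "outside the unit circle"
theta = [-1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
inside = classify(theta, 0.2, 0.3)    # point inside the circle
outside = classify(theta, 1.5, 0.0)   # point outside the circle
```

The classifier is still linear in the coefficients, yet the boundary it draws in the original (x1, x2) plane is curved, which is what makes such models usable for classes that a straight line cannot separate.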

Page 30:

The glass-formation problem

• Most metals crystallize when they solidify; only a few form an amorphous phase, and only under special conditions. Understanding and predicting the secrets of amorphous-phase formation is a hot topic in materials research.

Alfred Ludwig, AFRL-AFOSR-UK-TR-2012-0047, 2012

Kawazoe, Y., Yu, J. Z., Tsai, A. P. & Masumoto T (eds). Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys. (Springer-Verlag, Berlin, Germany, 1997).

Page 31:

https://www.nature.com/articles/npjcompumats201628#supplementary-information

npj Computational Materials volume 2, Article number: 16028 (2016)

Given a composition AxByCz → (determines) → Attributes → (determines) → Target property

Page 32:

Magpie: A Materials-Agnostic Platform for Informatics and Exploration

1. Build the features: start from tables of elemental data (atomic number of each element, etc.); 145 features are generated for each compound

2. Algorithm: feed the data to an existing machine-learning package, "WEKA", and obtain a model (decision tree, neural network, …)

3. Evaluate the model: try several algorithms to build the best model, then use that model to make predictions

Magpie integrates these steps in a single platform (feature construction, the link to WEKA, evaluation, and experiment-related functions)

Page 33:

Weka: Waikato Environment for Knowledge Analysis

• open source software issued under the GNU General Public License

• https://www.cs.waikato.ac.nz/ml/weka/index.html

Page 34:

Composition → Attributes (145) → Target property

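Page 35:

145 attributes

• Stoichiometric attributes (6): determined solely by the atomic fractions of the different elements in the compound.

• Elemental property statistics (132): statistics such as the maximum, minimum, and mean of the elemental properties of the atoms in the compound (atomic number, mass, position in the periodic table, number of valence electrons, band energy).

• Valence orbital occupation attributes (4): the ratio of the average number of valence electrons in each of the s, p, d, and f orbitals to the total number of valence electrons.

• Ionic compound attributes (3): three quantities that describe ionic character. One indicates whether a neutral compound can be formed; the other two are computed from electronegativities.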

Page 36:

Stoichiometric attributes (6)

• These attributes capture the fractions of the elements present and are not affected by what those elements are.

• Lp norms, p = 0, 2, 3, 5, 7, 10

• L7 for Fe2O3:
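
As a worked illustration of these norms, the sketch below computes them for Fe2O3 from its atomic fractions (2/5 Fe, 3/5 O). It assumes the usual definitions, Lp = (Σ x_i^p)^(1/p) for p > 0 and L0 = the number of elements present; the function name `stoichiometric_norms` is hypothetical and this is not Magpie's own code.

```python
def stoichiometric_norms(fractions, ps=(0, 2, 3, 5, 7, 10)):
    """Lp norms of the atomic-fraction vector; the element identities never enter."""
    norms = {}
    for p in ps:
        if p == 0:
            # Assumed convention: L0 counts the elements that are present
            norms[p] = float(sum(1 for x in fractions if x > 0))
        else:
            norms[p] = sum(x ** p for x in fractions) ** (1.0 / p)
    return norms


fe2o3 = [2 / 5, 3 / 5]              # atomic fractions of Fe and O in Fe2O3
norms = stoichiometric_norms(fe2o3)
l7 = norms[7]                       # (0.4**7 + 0.6**7) ** (1/7), roughly 0.605
```

Any other binary compound with a 2:3 ratio gives exactly the same six numbers, which is the sense in which these attributes ignore what the elements are.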

Page 37:

Elemental property statistics (132)

• Statistics of the elemental properties in the following table, e.g. minimum, maximum, range, fraction-weighted mean, average deviation, and mode
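
The six statistics can be sketched for a single elemental property as follows. The example uses the atomic numbers of Fe (26) and O (8) in Fe2O3. Two definitions are assumptions here, since the slide does not spell them out: the average deviation is taken as the fraction-weighted mean absolute deviation from the mean, and the mode as the property value of the most abundant element. The function name `elemental_stats` is hypothetical.

```python
def elemental_stats(fractions, values):
    """Six summary statistics of one elemental property over a composition."""
    mean = sum(x * v for x, v in zip(fractions, values))
    most_abundant = max(range(len(fractions)), key=lambda i: fractions[i])
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean,  # fraction-weighted mean
        # Assumed definition: fraction-weighted mean absolute deviation
        "avg_dev": sum(x * abs(v - mean) for x, v in zip(fractions, values)),
        # Assumed definition: property of the most abundant element
        "mode": values[most_abundant],
    }


# Fe2O3: atomic fractions and atomic numbers of Fe and O
stats = elemental_stats([2 / 5, 3 / 5], [26, 8])
```

Repeating the same six statistics for each tabulated elemental property is how a short list of properties expands into 132 attributes.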

Page 38:

Valence orbital occupation attributes (4)

• Fraction-weighted average of the number of valence electrons in each orbital divided by the fraction-weighted average of the total number of valence electrons.

• e.g. the fraction of p electrons for Fe2O3:
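
A minimal sketch of this calculation for Fe2O3 follows. The valence-electron counts are taken from the ground-state configurations Fe: 4s2 3d6 and O: 2s2 2p4; the lookup table `VALENCE` and the function `orbital_fractions` are illustrative names, not Magpie's API.

```python
# Valence electrons per orbital type (Fe: 4s2 3d6, O: 2s2 2p4)
VALENCE = {
    "Fe": {"s": 2, "p": 0, "d": 6, "f": 0},
    "O": {"s": 2, "p": 4, "d": 0, "f": 0},
}


def orbital_fractions(composition):
    """Fraction of valence electrons in s, p, d, f for a composition such as {"Fe": 2, "O": 3}."""
    total_atoms = sum(composition.values())
    frac = {el: n / total_atoms for el, n in composition.items()}
    # Fraction-weighted average electron count in each orbital type
    avg = {orb: sum(frac[el] * VALENCE[el][orb] for el in frac) for orb in "spdf"}
    total = sum(avg.values())  # fraction-weighted average of all valence electrons
    return {orb: avg[orb] / total for orb in "spdf"}


fractions = orbital_fractions({"Fe": 2, "O": 3})
p_fraction = fractions["p"]   # (0.6 * 4) / (0.4 * 8 + 0.6 * 6) = 2.4 / 6.8
```

The four fractions always sum to one, so together they describe how the valence electrons of the compound are distributed over the s, p, d, and f orbitals.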

Page 39:

Ionic compound attributes (3)

• To determine whether a material is ionically bonded.

• 1st: a Boolean denoting whether it is possible to form a neutral, ionic compound assuming each element takes exactly one of its common charge states.

• 2nd: the maximum ionic character between any two elements in the material

• 3rd: mean ionic character
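
The three attributes can be sketched as below, under stated assumptions. The slide gives no formula, so the ionic character of a pair is taken here as 1 - exp(-0.25 (χa - χb)^2) with Pauling electronegativities, and the mean as the sum of x_a x_b I(χa, χb) over all ordered pairs; both should be checked against the Magpie paper. The tables `CHARGES` and `ELECTRONEGATIVITY` hold only the two elements needed for the Fe2O3 example, and all function names are hypothetical.

```python
import math
from itertools import combinations, product

# Illustrative lookup tables for the two elements in the example only
CHARGES = {"Fe": (2, 3), "O": (-2,)}             # common charge states
ELECTRONEGATIVITY = {"Fe": 1.83, "O": 3.44}      # Pauling scale


def can_be_neutral(composition):
    """True if one common charge state per element makes the formula unit neutral."""
    elements = list(composition)
    return any(
        sum(q * composition[el] for q, el in zip(states, elements)) == 0
        for states in product(*(CHARGES[el] for el in elements))
    )


def ionic_character(chi_a, chi_b):
    """Assumed pairwise form: 1 - exp(-0.25 * (difference in electronegativity)^2)."""
    return 1.0 - math.exp(-0.25 * (chi_a - chi_b) ** 2)


def ionic_attributes(composition):
    """Return (neutral?, maximum ionic character, mean ionic character)."""
    total = sum(composition.values())
    frac = {el: n / total for el, n in composition.items()}
    en = ELECTRONEGATIVITY
    max_ic = max(
        (ionic_character(en[a], en[b]) for a, b in combinations(frac, 2)),
        default=0.0,
    )
    # Assumed definition of the mean: fraction-weighted sum over all ordered pairs
    mean_ic = sum(
        frac[a] * frac[b] * ionic_character(en[a], en[b]) for a in frac for b in frac
    )
    return can_be_neutral(composition), max_ic, mean_ic


neutral, max_ic, mean_ic = ionic_attributes({"Fe": 2, "O": 3})
```

For Fe2O3 the neutrality check succeeds with Fe in the +3 state and O in the -2 state, while a formula such as FeO3 would fail it with the charge states listed here.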


Page 43:

Decision Tree

[Reference] Decision Trees, 周志华 (Zhou Zhihua)

Algorithm:

1) Node = root

2) Pick the best attribute to split the dataset

3) Create new descendants of the node

4) Sort the training examples according to the split

5) Loop to 2)

If everything is split => STOP

What is the best split?

The best split is the one that puts all positive examples on the right and all negative examples on the left.

The best separation of the different classes => maximum information gain

The more mixed a set is, the higher its entropy.

Information gain method: calculate the entropy of the data before and after the split
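
The information-gain criterion can be sketched in a few lines of Python. The sketch uses Shannon entropy, H = -Σ p_k log2 p_k, and defines the gain as the entropy before the split minus the size-weighted entropy after it; the four-row dataset and the attribute names are invented purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits; 0 for a pure set, 1 for a 50/50 two-class set."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy before the split on `attr` minus the weighted entropy after it."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after


def best_attribute(rows, labels):
    """Step 2 of the algorithm: choose the attribute with the largest gain."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: cooling rate separates the classes, element count does not
rows = [
    {"cooling": "fast", "elements": 3},
    {"cooling": "fast", "elements": 2},
    {"cooling": "slow", "elements": 3},
    {"cooling": "slow", "elements": 2},
]
labels = ["amorphous", "amorphous", "crystalline", "crystalline"]
chosen = best_attribute(rows, labels)
```

A full tree builder would apply `best_attribute` recursively to each group produced by the split until every group is pure.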


Page 46:

Ensemble learning

Problem of a single decision tree: it produces only one hypothesis

- This hypothesis works very well on the training dataset, but may not be as good when it is applied to predict other samples

- Reason: a single decision tree is a correct model, but may not be the true model.

Solution: ensemble model

Build several trees and let all the trees decide which result is most probable.

For each tree, introduce a factor of randomness when building the tree.

Random Forest: select the splitting attributes from a random subset of features

Output: y1, y2, …, yn => Y

1. Average the results

2. Majority voting

3. Weighted voting

……

Introducing more randomness: Random Subspace

1. Split the whole training data set and create subsets by randomly selecting features.

2. Then use a different subset to train each tree

Randomness in the dataset (random subspace) + randomness in training (random forest) = more likely to obtain the true model = higher accuracy
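
The voting idea can be sketched with a deliberately small ensemble. Several simplifications are assumptions of this sketch rather than the course's method: each "tree" is only a one-level decision stump, each stump sees a random subset of the features (random subspace) and a bootstrap resample of the rows (standard bagging, which the slide does not mention), and the outputs are combined by majority voting. The data are invented.

```python
import random
from collections import Counter


def train_stump(X, y, features):
    """Pick the (feature, threshold) with the fewest training errors among `features`."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]


def stump_predict(stump, row):
    """Apply a single stump: compare one feature with its threshold."""
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def train_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap rows (assumption) and random feature subsets."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]            # bootstrap resample
        feats = rng.sample(range(len(X[0])), n_features)    # random subspace
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def forest_predict(forest, row):
    """Majority voting over the individual predictions y1, ..., yn."""
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data: class 0 has small values, class 1 large values
X = [(1, 2), (2, 1), (3, 3), (4, 2), (6, 7), (7, 9), (8, 6), (9, 8)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(X, y)
low = forest_predict(forest, (0, 0))
high = forest_predict(forest, (10, 10))
```

Replacing the stumps with full decision trees and the vote with an average (for regression) gives the other output rules listed on the slide.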

Page 47:

Ensemble learning

[Reference] Ensemble Learning, 周志华 (Zhou Zhihua)

A figurative way to put it: a panel of experts consulting on a case; "three cobblers together are a match for Zhuge Liang" (many heads are better than one)

To ensure that the ensemble performs better than a single learner, the Random Forest model is commonly used when the individual learners have no strong dependence on one another and can be generated in parallel.

Page 48:

In-class exercise

• Use the Magpie code (42 MB) (guide and code from http://cms.sjtu.edu.cn/download.html) to predict amorphous-phase formation in a ternary system

• Walk through the code

• Predict the Sb-Te-Ge system using the supplementary data