Supplementary Material from Past Tech-Leader / Manager Training Classes
TRANSCRIPT
http://www.hmwu.idv.tw
Han-Ming Wu, Department of Statistics, National Taipei University
Why do we need to learn probability and statistics well in the age of artificial intelligence?
https://read01.com/zh-tw/5MxODzD.html
Example: Curve Fitting — [Using "chemical reaction rates" to estimate when the Wuhan pneumonia epidemic will end] An NTU professor used "a single chart" to estimate the epidemic's end date; netizens were impressed. Posted on 2020/02/04
https://buzzorange.com/2020/02/04/when-will-wuhan-virus-end/?fbclid=IwAR2h7gzgO2LAAHL6vJKcxPH-CfAt-bqdUkNqHdU08tFqjwWtjnuPBCRtiyU
Cheng-Chih Hsu (徐丞志), Department of Chemistry, National Taiwan University
The Economist, 2020/02/13
Playing with data? Can pneumonia do math? The "death rate" hides a shocking story; an expert's single chart reveals the truth. NOWnews 2020/02/06 13:18 https://times.hinet.net/news/22770288
"So you majored in data?"
Old wine in a new bottle?
No! Machine learning is not just glorified statistics! https://kknews.cc/tech/n3yrpyq.html
Artificial Intelligence (陳鍾誠, National Quemoy University) https://ccckmit.github.io/aibook/htm/basic.html
Statistics vs. Machine Learning
• Statistics vs. machine learning: the "Shaolin and Wudang" of the data world! https://read01.com/O3dPexn.html
• The "civil war" of data science: statistics vs. machine learning https://read01.com/ePRGMz7.html
• Machine learning vs. statistical models https://kknews.cc/zh-tw/tech/gz22r3y.html
• Computational thinking: understanding machine translation (AI) in one diagram https://web.ntnu.edu.tw/~samtseng/present/CT_STM.html [using computers for automatic translation: probability, Bayes' theorem]
• Yann LeCun, head of Facebook AI Research and professor at New York University: "Artificial intelligence is entirely mathematics."
• Mathematics education amid the AI wave https://www.ettoday.net/news/20180508/1161306.htm
• The big business opportunities of AI https://www.hbrtaiwan.com/article_content_AR0007381.html
• What exactly is the difference between statistics and machine learning? http://bangqu.com/iw4cp6.html
• Don't only care about optimizing the model; that is not all of machine learning http://bangqu.com/niYN6Z.html
https://kknews.cc/zh-tw/tech/5z36z28.html
https://udn.com/news/story/7266/3672385
Statistical learning
http://web.stanford.edu/~hastie/CASI/ https://web.stanford.edu/~hastie/pub.htm https://web.stanford.edu/~hastie/ElemStatLearn/
Lecture by Prof. Dennis K. J. Lin (林共進)
Terminology
Different Cultures
Claims from online articles:
1. The most important step in data mining is data cleaning and missing-value imputation: how to clean, and how to impute? This step is the key to modeling!
2. Model optimization and hyperparameter tuning: without understanding an algorithm's principles, you cannot tune it. LDA, SVD, SVM, random forests, neural networks, Bayesian methods, maximum entropy, EM, Gaussian mixtures, HMMs, and so on — which of these is not rigorously derived from convex optimization, probability models, or information theory? They all rest on solid foundations in mathematics, probability, and statistics.
3. Assessing model significance: even if an algorithm's metrics look excellent, a model whose factors fail hypothesis tests for significance cannot be trusted.
Prof. Wu Xizhi (吳喜之), renowned statistician, School of Statistics, Renmin University of China
https://www.jianshu.com/p/80adbe6f7213
Author: 吳喜之 (Wu Xizhi) https://kknews.cc/tech/vzj4vrq.html
https://kknews.cc/science/bk58y3m.html
https://kknews.cc/tech/4v4ymkg.html
https://kknews.cc/news/lnxbqz.html
Decision Tree (決策樹)
Yan-yan SONG and Ying LU, Decision tree methods: applications for classification and prediction, Shanghai Arch Psychiatry. 2015 Apr 25; 27(2): 130–135.
Image source: Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, 1st ed., Pearson, 2005.
The Class Probability Mass Function for a Partition
• If X is a numeric attribute, we have to evaluate split points of the form X ≤ v.
• Consider only the midpoints between two successive distinct values for X in the sample D.
• Let {v1, ..., vm} denote the set of all such midpoints, such that v1 < v2 < ∙∙∙ < vm.
• For each split point X ≤ v, we have to estimate the class PMFs:
Review: Breslow, L. A. and Aha, D. W. (1997). Simplifying decision trees: A survey. Knowledge Engineering Review, 12(1): 1–40.
Estimate the Class PMFs
Using Bayes' theorem
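A minimal pure-Python sketch of these two steps — enumerating the candidate midpoints of a numeric attribute, then estimating the class PMFs on each side of a split from empirical counts (the sample version of Bayes' theorem). The toy data and function names here are illustrative, not from the slides:

```python
from collections import Counter

def candidate_midpoints(values):
    """Midpoints between successive distinct values of a numeric attribute X."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def class_pmfs(values, labels, v):
    """Estimate P(class | X <= v) and P(class | X > v) by relative frequencies."""
    left = Counter(c for x, c in zip(values, labels) if x <= v)
    right = Counter(c for x, c in zip(values, labels) if x > v)
    n_l, n_r = sum(left.values()), sum(right.values())
    pmf_left = {c: k / n_l for c, k in left.items()} if n_l else {}
    pmf_right = {c: k / n_r for c, k in right.items()} if n_r else {}
    return pmf_left, pmf_right

# Toy sample D: one numeric attribute with class labels.
X = [1.0, 2.0, 2.0, 3.0, 4.0]
y = ['a', 'a', 'b', 'b', 'b']
mids = candidate_midpoints(X)                 # [1.5, 2.5, 3.5]
pmf_left, pmf_right = class_pmfs(X, y, 2.5)
```

A split-selection criterion (e.g. entropy or Gini) would then score each candidate v using these estimated PMFs.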
Estimate the Class PMFs
Decision Tree (決策樹)
Yan-yan SONG and Ying LU, Decision tree methods: applications for classification and prediction, Shanghai Arch Psychiatry. 2015 Apr 25; 27(2): 130–135.
QUEST (Quick, Unbiased, Efficient Statistical Tree), Wei-Yin Loh and Yu-Shan Shih (1997)
簡單線性迴歸 (Simple Linear Regression)
Parameter estimation: least squares (最小平方法)
Can be used as an evaluation metric.
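As a sketch, the closed-form least-squares estimates for simple linear regression, with the residual sum of squares usable as an evaluation metric. The toy data and function names are illustrative:

```python
def ols_fit(x, y):
    """Closed-form least-squares estimates for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx                  # slope
    b0 = ybar - b1 * xbar           # intercept
    return b0, b1

def sse(x, y, b0, b1):
    """Residual sum of squares -- usable as an evaluation metric."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
b0, b1 = ols_fit(x, y)              # b1 = 1.94, b0 = 0.15
```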
簡單線性迴歸 (Simple Linear Regression)
Parameter estimation: maximum likelihood (最大概似法)
Inference
Enables statistical inference.
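Under the usual Gaussian-error assumption, the maximum-likelihood estimates of the slope and intercept coincide with least squares, and the likelihood framework adds statistical inference, e.g. a t statistic for the slope. A minimal sketch, with illustrative toy data:

```python
import math

def slr_mle(x, y):
    """Maximum-likelihood fit of y = b0 + b1*x + e, e ~ N(0, sigma^2).
    Under Gaussian errors the MLEs of b0, b1 equal the least-squares
    estimates; sigma^2_MLE = SSE/n (the unbiased version divides by n-2)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_mle = sse / n               # MLE of the error variance
    s2 = sse / (n - 2)                 # unbiased estimate, used for inference
    t_b1 = b1 / math.sqrt(s2 / sxx)    # t statistic for H0: b1 = 0
    return b0, b1, sigma2_mle, t_b1

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
b0, b1, sigma2_mle, t_b1 = slr_mle(x, y)
```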
ML Engineer Interview Handbook http://lamda.nju.edu.cn/zhangh/ (probabilistic methods / support vector machines / ensemble learning / decision trees); Probabilistic methods in machine-learning algorithms http://bangqu.com/7Thm6P.html
https://buzzorange.com/techorange/2019/05/02/difference-between-statistics-and-machine-learning/?fbclid=IwAR2IC5pn0YYu0f-3hqsi5BLcmxdVdFlA24zlh_dhSufHYRZrjf92C51iOtE
• The main difference between machine learning and statistics lies in their purposes
• Differences between statistical models and machine learning in linear regression
• Machine learning is built on statistics
• Statistical learning theory: the statistical foundation of machine learning
• Examples
• So which approach is better?
K-means: special case of EM applied to Gaussian mixtures
Mixture Densities
• Given a sample x and k, learning = estimating the component densities and proportions.
• Assume p(x|Gi) follows a parametric model => only estimate its parameters.
Clustering: learning the mixture parameters from data.
Supervised and Unsupervised Learning
K-means Clustering
• K-means is a partitioning method for clustering.
• Data are classified into k groups, as specified by the user.
• Two different clusters cannot have any objects in common, and the k groups together constitute the full data set.
Optimization problem: minimize the sum of squared within-cluster distances

$$W(C) = \frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(j)=k} d_E(x_i, x_j)^2$$
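The alternating optimization that minimizes this criterion (assign each point to its nearest center, then recompute each center as its cluster mean) is Lloyd's algorithm. A minimal pure-Python sketch with toy data; the deterministic "first k points" initialization is only for reproducibility — in practice the initial centers are chosen at random:

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm for k-means: alternate an assignment step (each
    point to its nearest center) and an update step (each center to the
    mean of its cluster). A local search: the result depends on the
    initial centers; here we simply start from the first k points."""
    centers = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]   # update step
        if new == centers:                     # converged
            break
        centers = new
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(pts, 2)
```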
K-means Clustering
When x^t is represented by m_i, there is an error proportional to the distance. Find k reference vectors m_j (prototypes / codebook vectors / codewords) that best represent the data. K-means clustering is a special case of the EM algorithm.
K-means for clustering => find groups in the data => groups are represented by their centers.
K-means Clustering
• The best reference vectors are those that minimize the total reconstruction error.
• Disadvantage: a local search; the final m_i depend heavily on the initial m_i.
Expectation-Maximization (EM)
• EM looks for the component density parameters that maximize the likelihood of the sample.
• Log likelihood of the sample.
• EM is used in maximum-likelihood estimation where the problem involves two sets of random variables, of which one, X, is observed and the other, Z, is hidden.
• Goal: find the parameter vector Φ that maximizes the likelihood of the observed values of X, L(Φ|X).
• Since the Z values are not observed, we cannot work directly with the complete-data likelihood L_c.
• Instead, work with its expectation Q, given X and the current parameter values Φ^l.
Dempster, Laird, and Rubin (1977); Redner and Walker (1984).
EM Algorithm
• An increase in Q implies an increase in the incomplete-data likelihood.
• In the case of mixtures, the hidden variables are the sources of the observations: which observation belongs to which component. K-means:
  (1) calculation of b_i (E-step)
  (2) re-estimation of m_i (M-step)
Estimate the labels given the current components; update the components given the estimated labels.
EM Algorithm (cont.)
E-Step
• The expected value of the hidden variable, E[z], is the posterior probability that x^t is generated by component G_i.
M-Step
EM in Gaussian Mixtures
• Hard label: 0/1.
• Soft label: assign the input to a cluster with a certain probability.
• K-means is a special case of EM applied to Gaussian mixtures, where the inputs are assumed independent with equal, shared variances and the labels are hardened.
• K-means paves the input density with circles; EM uses ellipses of arbitrary shapes and orientations.
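The contrast above can be made concrete with EM for a two-component 1-D Gaussian mixture: the E-step computes soft labels (posterior responsibilities E[z]), and the M-step re-estimates the mixing weight, means, and per-component variances; hardening the labels and sharing one fixed variance recovers k-means. A minimal sketch with illustrative data and initialization:

```python
import math

def gauss(x, mu, var):
    """Normal density N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, iters=100):
    """EM for a two-component 1-D Gaussian mixture.
    E-step: soft labels h[i] = posterior P(component 1 | x_i), i.e. E[z].
    M-step: re-estimate the weight, means, and variances from the soft labels.
    Hardening h to 0/1 and sharing one fixed variance recovers k-means."""
    # crude initialization (illustrative): extremes of the sample
    pi, mu1, mu2, v1, v2 = 0.5, min(xs), max(xs), 1.0, 1.0
    for _ in range(iters):
        # E-step: responsibility of component 1 for each observation
        h = [pi * gauss(x, mu1, v1) /
             (pi * gauss(x, mu1, v1) + (1 - pi) * gauss(x, mu2, v2))
             for x in xs]
        # M-step: weighted re-estimates
        n1 = sum(h)
        n2 = len(xs) - n1
        pi = n1 / len(xs)
        mu1 = sum(hi * x for hi, x in zip(h, xs)) / n1
        mu2 = sum((1 - hi) * x for hi, x in zip(h, xs)) / n2
        v1 = max(sum(hi * (x - mu1) ** 2 for hi, x in zip(h, xs)) / n1, 1e-6)
        v2 = max(sum((1 - hi) * (x - mu2) ** 2 for hi, x in zip(h, xs)) / n2, 1e-6)
    return pi, (mu1, v1), (mu2, v2)

xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
pi, (mu1, v1), (mu2, v2) = em_gmm_1d(xs)
```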
Data Scientists
• What math should you prepare in order to learn machine learning? Posted on 2017/11/09 https://buzzorange.com/techorange/2017/11/09/how-to-learn-machine-learning/
• [IT career-change guide] Want to be a data scientist without learning math well? You'll need this guide. Posted on 2018/02/11 https://buzzorange.com/techorange/2018/02/11/data-scientists-have-to-improve-math/
• Want to be a data scientist without learning math? Not a chance. (大数据文摘) https://mp.weixin.qq.com/s/3d5UL3HajI2-0Z6QA6kNiA
• The ignorance trap of data scientists: they only code and forget to think about problems on a larger scale. Posted on 2017/11/14 https://buzzorange.com/techorange/2017/11/14/you-should-think-bigger/
• Five must-read books for data scientists: what matters is not coding, but the data-logic mindset behind it. Posted on 2018/09/11 https://buzzorange.com/techorange/2018/09/11/these-non-code-book-worth-reading-for-programmers/
Interval Data: Modeling and Visualization
2019 Jun 17 (Mon), 10:30 AM, Room 6005, Institute of Statistical Science, Academia Sinica (Environmental Changes Research Building A). Tea reception: 10:10 AM, same room.
Prof. Dennis K. J. Lin (林共進), Department of Statistics, Pennsylvania State University, USA.
Is statistics really that hard?
• Statistics feels difficult mainly because of the nature of science and the structure of statistical knowledge:
  1. the rigor demanded by the scientific attitude;
  2. the complexity of scientific problems;
  3. the layered structure of statistical knowledge;
  4. the difficulty of integrating statistical applications with research topics.
• Problems and solutions: what is hard is not statistics itself; the problem lies in people's anxiety and fear of the unknown. Once these psychological factors are overcome, statistics is simply a highly practical discipline.
Statistics Made Super Simple; The Manga Guide to Statistics (Simplified Chinese) http://www.zhukun.org/haoty/teaching/teaching_MStats/manhua_tjsrm.pdf
Popular-Science Books
Deep Learning with R
Pratap Dangeti (2017), Statistics for Machine Learning: Techniques for exploring supervised, unsupervised, and reinforcement learning models with Python and R, Packt Publishing.
Mathematics and Statistics in Machine Learning, Deep Learning, and AI
• 涌井良幸 and 涌井貞美, The Mathematics of Deep Learning (深度學習的數學), People's Posts and Telecommunications Press, 2019-06-01.
• 西內啟 (medical statistics expert), The Mathematical Foundations of Machine Learning: essential reading for AI and deep learning (機器學習的數學基礎), translated by 胡豐榮 (PhD) and 徐先正, Flag Technology (旗標), 2020-01-31.