20170415 When Julia Meets Data Science
TRANSCRIPT
1
When Julia Meets Data Science
Yueh-Hua Tu (杜岳華), founder of Julia Taiwan
2
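About me
Yueh-Hua Tu (杜岳華)
R&D substitute military service, Centers for Disease Control (Taiwan)
Aspiring biomedical data scientist
M.S., Institute of Biomedical Informatics, National Yang-Ming University
B.S., Department of Medical Laboratory Science and Biotechnology, National Cheng Kung University
B.S., Department of Computer Science and Information Engineering, National Cheng Kung University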
3
Why Julia?
4
In scientific computing and data science…
5
Other users
6
Avoid the two-language problem
One language for rapid development
Another for performance
Example: Python for rapid development, C for performance
7
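The performance of itertools
An article describing the trade-off between the two:
"In general, we do not optimize all of our code, because optimization has a large cost: generality and readability. Running fast and writing fast are usually a trade-off. An example is easy to imagine: just compare R code with Rcpp code."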
http://wush.ghost.io/itertools-performance/
8
With Julia, you no longer have to make that trade-off!!
9
Julia's features
Write like Python, run like C.
Python-like readability
C-like performance
Easy to parallelize
Built-in package manager
……
10
Julia code
a = [1, 2, 3, 4, 5]
function square(x)
    return x^2
end
for x in a
    println(square(x))
end
11
https://julialang.org/benchmarks/
Julia performance
12
Who uses Julia?
13
Thomas J. Sargent
Nobel prize in economic sciences
The founder of QuantEcon
"His team at NYU uses Julia for macroeconomic modeling and contributes to the Julia ecosystem."
https://juliacomputing.com/case-studies/thomas-sargent.html
14
In 2015, economists at the Federal Reserve Bank of New York (FRBNY) published FRBNY’s most comprehensive and complex macroeconomic models, known as Dynamic Stochastic General Equilibrium, or DSGE models, in Julia.
https://juliacomputing.com/case-studies/ny-fed.html
15
UK cancer researchers turned to Julia to run simulations of tumor growth (Nature Genetics, 2016).
Approximate Bayesian Computation (ABC) algorithms require potentially millions of simulations, so they must be fast.
BioJulia project for analyzing biological data in Julia
Bayesian MCMC methods: Lora.jl and Mamba.jl
https://juliacomputing.com/case-studies/nature.html
16
IBM and Julia Computing analyzed eye fundus images provided by Drishti Eye Hospitals.
Timely screening for changes in the retina can help get them to treatment and prevent vision loss. Julia Computing’s work using deep learning makes retinal screening an activity that can be performed by a trained technician using a low cost fundus camera.
https://juliacomputing.com/case-studies/ibm.html
17
Path BioAnalytics is a computational biotech company developing novel precision medicine assays to support drug discovery and development, and treatment of disease.
https://juliacomputing.com/case-studies/pathbio.html
18
The Sloan Digital Sky Survey contains nearly 5 million telescopic images of 12 megabytes each – a dataset of 55 terabytes.
In order to analyze this massive dataset, researchers at UC Berkeley and Lawrence Berkeley National Laboratory created a new code named Celeste.
https://juliacomputing.com/case-studies/intel-astro.html
19
http://pkg.julialang.org/pulse.html
Julia Package Ecosystem Pulse
20
Introduction to Julia
21
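Everything starts with numbers…
In Julia, numbers come in the following forms:
Integers
Floating-point numbers
Rational numbers
Complex numbers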
22
Julia's integers and floating-point numbers come in different bit widths
Integer: Int8, Int16, Int32, Int64, Int128
Unsigned: UInt8, UInt16, UInt32, UInt64, UInt128
Float: Float16, Float32, Float64
23
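Rational numbers
Rational number representation
Automatically reduced to lowest terms
Sign automatically moved to the numerator
A denominator of 0 is accepted
2//3 # 2//3
-6//12 # -1//2
5//-20 # -1//4
5//0 # 1//0
num(2//10) # 1
den(7//14) # 2
2//4 + 1//7 # 9//14
3//10 * 6//9 # 1//5
10//15 == 8//12 # true
float(3//4) # 0.75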
24
Complex numbers
1 + 2im
(1 + 2im) + (3 - 4im) # 4 - 2im
(1 + 2im)*(3 - 4im) # 11 + 2im
(-4 + 3im)^(2 + 1im) # 1.950 + 0.651im
real(1 + 2im) # 1
imag(3 + 4im) # 4
conj(1 + 2im) # 1 - 2im
abs(3 + 4im) # 5.0
angle(3 + 3im)/pi*180 # 45.0
25
Let's declare some variables!
With or without a type annotation
x = 5
y = 4::Int64
z = x + y
println(z) # 9
26
Variables can be casual
A feature of dynamically typed languages
Values are immutable
x = 5
println(x) # 5
println(typeof(x)) # Int64
x = 6.0
println(x) # 6.0
println(typeof(x)) # Float64
27
[Figure: the name x first refers to the value 5 (Int64) and is then rebound to the value 6.0 (Float64)]
28
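Static typing vs. dynamic typing
The biggest difference between static and dynamic typing is whether the type belongs to the variable or to the value.
[Figure: Static type: the type int is attached to the variable x that holds 5. Dynamic type: the type int is attached to the value 5 that x refers to.]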
29
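Play lying down, sitting up, or on your stomach: operators are still the most fun
+x : x itself
-x : sign change
x + y, x - y, x * y, x / y : the usual four arithmetic operations
div(x, y) : quotient
x % y : remainder, also available as rem(x, y)
x \ y : inverse division, equivalent to y / x
x ^ y : power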
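The operators listed above can be tried on two small integers. This is only a minimal sketch; the values 7 and 2 are chosen for illustration and do not come from the slides.

```julia
# Arithmetic operators from the slide, applied to two small integers.
x, y = 7, 2

println(x / y)      # true division: always yields a floating-point number
println(div(x, y))  # integer quotient
println(x % y)      # remainder, same as rem(x, y)
println(x \ y)      # inverse division, equivalent to y / x
println(x ^ y)      # power
```

Note that `/` on two integers returns a Float64, which is why `div` exists as a separate function for the integer quotient.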
30
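The mechanical core of manipulating numbers
~x : bitwise not
x & y : bitwise and
x | y : bitwise or
x $ y : bitwise xor
x >>> y : unsigned shift, moves the bits of x right by y positions
x >> y : sign-preserving shift, moves the bits of x right by y positions
x << y : moves the bits of x left by y positions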
https://www.technologyuk.net/mathematics/number-systems/images/binary_number.gif
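A short sketch of the bitwise operators follows. It assumes Julia 1.x, where bitwise xor is written `xor(x, y)` or `x ⊻ y` rather than the `x $ y` form shown on the slide (which is Julia 0.5 syntax). The operand values are illustrative only.

```julia
# Bitwise operators on small integers (assumes Julia 1.x).
x, y = 12, 10          # 0b1100 and 0b1010

println(~x)            # bitwise not: -13
println(x & y)         # bitwise and: 8  (0b1000)
println(x | y)         # bitwise or:  14 (0b1110)
println(xor(x, y))     # bitwise xor: 6  (0b0110); x ⊻ y is the operator form
println(3 << 2)        # left shift: 12

n = Int64(-16)
println(n >> 2)        # arithmetic shift keeps the sign: -4
println(n >>> 2)       # logical shift fills with zeros, giving a large positive number
```

The difference between `>>` and `>>>` only shows up for negative numbers, which is why the last two lines use a negative operand.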
31
Convenient updating operators
+= -= *= /= \= %= ^=
&= |= $= >>>= >>= <<=
x += 5 is equivalent to x = x + 5
32
The great comparison
x == y : equal to
x != y, x ≠ y : not equal to
x < y : less than
x > y : greater than
x <= y, x ≤ y : less than or equal to
x >= y, x ≥ y : greater than or equal to
a, b, c = (1, 3, 5)
a < b < c # true
33
Operations and conversions between different types
Arithmetic operations promote automatically
Strongly typed
3.14 * 4 # 12.56
parse("5") # 5
string(5) # "5"
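The conversions above can be sketched as follows. This assumes Julia 1.x, where `parse` takes the target type as its first argument; the variable names are illustrative only.

```julia
# Explicit conversions and automatic promotion (assumes Julia 1.x).
n = parse(Int, "5")        # string -> integer
s = string(5)              # integer -> string
f = convert(Float64, 5)    # integer -> floating point
p = 3 * 0.5                # Int * Float64 is promoted to Float64

# Strong typing: a string and an integer are never added implicitly.
mixed_add_failed = try
    "5" + 5
    false
catch err
    err isa MethodError
end
println(mixed_add_failed)
```

Arithmetic between numeric types promotes automatically, but crossing from strings to numbers always requires an explicit call such as `parse` or `string`.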
34
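Strong typing vs. weak typing
[Figure: Strong type: the int 5 and the string "5" cannot be added with + directly. Weak type: the int 5 and the string "5" are converted implicitly and then added with +.]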
35
This feels a bit dry
Let's write a small game instead
36
Let's write a rock-paper-scissors game
paper = 1 # represents paper
scissor = 2 # represents scissors
stone = 3 # represents rock
37
Deciding who wins
if statements
Short-circuit logic
if scissor > paper
    println("scissor win!!")
end
if <condition>
    <code>
end
if 3 > 5 && 10 > 0
    …
end
38
User input
println("Choose your move")
println("1 for paper, 2 for scissors, 3 for rock: ")
s = readline(STDIN)
x = parse(s)
39
Putting it together
if x == paper
    println("You played paper")
elseif x == scissor
    println("You played scissors")
elseif x == stone
    println("You played rock")
end
if <condition 1>
    <code 1>
elseif <condition 2>
    <code 2>
else
    <code 3>
end
40
How the computer picks its move
rand(): a random number between 0 and 1
rand([...]): picks one element from the collection
y = rand([1, 2, 3])
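A small sketch of the two forms of `rand` described above. The variable name `u` is illustrative, and `rand(1:3)` is shown only as an equivalent way to draw from a range.

```julia
# Two ways of drawing random values with rand.
u = rand()             # a Float64 drawn uniformly from [0, 1)
y = rand([1, 2, 3])    # one element drawn uniformly from the array
r = rand(1:3)          # same idea, drawing from a range instead of an array

println(u)
println(y)
println(r)
```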
41
Nested comparisons
if x == y
    println("Draw")
elseif x == paper
    println("You played paper")
    if y == scissor
        println("The computer played scissors")
        println("The computer wins")
    elseif y == stone
        println("The computer played rock")
        println("You win")
    end
...
42
My spaghetti code
if x == y
    println("Draw")
elseif x == paper
    println("You played paper")
    if y == scissor
        println("The computer played scissors")
        println("The computer wins")
    elseif y == stone
        println("The computer played rock")
        println("You win")
    end
elseif x == scissor
    println("You played scissors")
    if y == paper
        println("The computer played paper")
        println("You win")
    elseif y == stone
        println("The computer played rock")
        println("The computer wins")
    end
elseif x == stone
    println("You played rock")
    if y == scissor
        println("The computer played scissors")
        println("You win")
    elseif y == paper
        println("The computer played paper")
        println("The computer wins")
    end
end
43
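I see repetition
Functions are a good tool for eliminating repetition!
The many conditionals we wrote earlier are highly repetitive and feel clumsy. We can factor the handling of each move out into its own function.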
44
Functions to the rescue
function add(a, b)
    c = a + b
    return c
end
45
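How functions communicate
pass-by-sharing
Personally, I find it closest to call by reference
[Figure: the value 5 (int) is bound to the name x outside the function and to the argument a inside function foo(a) ... end]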
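Pass-by-sharing can be illustrated with a mutable array: the function receives the same object the caller holds, so mutation is visible outside, while rebinding the argument name is not. The function names `mutate!` and `rebind` are hypothetical and used only for this sketch.

```julia
# Pass-by-sharing: the argument name refers to the caller's object.
function mutate!(v)
    v[1] = 99        # changes the array object shared with the caller
    return nothing
end

function rebind(v)
    v = [0, 0, 0]    # rebinds the local name only; the caller's array is untouched
    return nothing
end

a = [1, 2, 3]
mutate!(a)
println(a)   # [99, 2, 3]
rebind(a)
println(a)   # [99, 2, 3]
```

This is why the behavior resembles call by reference for mutation, yet assignment inside the function never changes what the caller's variable points to.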
46
Reducing the repetition
function shape(x)
    if x == paper
        return "paper"
    elseif x == scissor
        return "scissors"
    elseif x == stone
        return "rock"
    end
end
47
How do we decide who wins?
The repetition is reduced
But deciding the winner is still not handled
48
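What you need is a matrix
Suddenly a divine voice spoke and rescued this mere mortal. XD
Yes, perhaps you need a table to look things up in.
         | paper  scissors  rock
---------------------------------
paper    |   0      -1        1
scissors |   1       0       -1
rock     |  -1       1        0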
49
Introducing Array
Homogeneous
Indexing starts from 1
Mutable
[Figure: an array holding the elements 2, 3, 5]
A = [2, 3, 5]
A[2] # 3
50
Multidimensional arrays
A = [0 -1 1; 1 0 -1; -1 1 0]
A[1, 2] # -1
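A minimal sketch of how the outcome table maps onto a matrix, assuming rows are the player's move and columns the computer's move (as the later `A[x, y]` lookup suggests). Elements within a row are separated by spaces and rows by semicolons.

```julia
# The rock-paper-scissors outcome table as a 3×3 matrix.
paper, scissor, stone = 1, 2, 3

A = [ 0 -1  1;
      1  0 -1;
     -1  1  0]

println(size(A))            # (3, 3)
println(A[paper, scissor])  # -1: paper loses to scissors
println(A[stone, scissor])  # 1: rock beats scissors
println(A == -A')           # true: swapping the two players flips every result
```

The last line is a quick sanity check on the table: a win for one side must be a loss for the other, so the matrix equals the negative of its transpose.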
51
Simple string operations
concatenate
x must be a string
"You played " * x
52
Simplification complete
This is called refactoring
x_shape = shape(x)
y_shape = shape(y)
println("You played " * x_shape)
println("The computer played " * y_shape)
win_or_lose = A[x, y]
if win_or_lose == 0
    println("Draw")
elseif win_or_lose == 1
    println("You win")
else
    println("The computer wins")
end
53
I want to play many times
while <condition>
    <code>
end
x = …
while <continue condition>
    ...
    x = …
end
54
Stopping condition
s = readline(STDIN)
x = parse(s)
while x != -1
    ...
    s = readline(STDIN)
    x = parse(s)
end
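The pieces built so far (lookup matrix, move names, random computer move, repeat until -1) can be combined into one sketch. This is not the speaker's final program: it assumes Julia 1.x (`stdin`, `tryparse`), and the names `OUTCOME`, `NAMES`, `judge`, and `play` are hypothetical. It also loops over input lines with `for` rather than the slide's `while`, and adds input validation that the slides do not cover.

```julia
# A sketch of the whole game (assumes Julia 1.x; names are illustrative).
const OUTCOME = [0 -1 1; 1 0 -1; -1 1 0]    # rows: player, columns: computer
const NAMES = ["paper", "scissors", "rock"]  # 1 = paper, 2 = scissors, 3 = rock

# Decide one round from the lookup table.
function judge(x, y)
    result = OUTCOME[x, y]
    if result == 0
        return "Draw"
    elseif result == 1
        return "You win"
    else
        return "The computer wins"
    end
end

# Read one move per line until -1 (or end of input); return the list of results.
function play(input::IO = stdin)
    results = String[]
    for line in eachline(input)
        x = tryparse(Int, strip(line))
        x == -1 && break
        if x === nothing || !(x in 1:3)
            println("Please enter 1, 2, 3, or -1 to quit")
            continue
        end
        y = rand(1:3)
        println("You played " * NAMES[x] * ", the computer played " * NAMES[y])
        push!(results, judge(x, y))
        println(results[end])
    end
    return results
end
```

Calling `play()` reads from the keyboard, while `play(IOBuffer("1\n3\n-1\n"))` feeds scripted moves, which makes the loop easy to try without typing.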
55
Other commonly used Julia syntax
For loop
Comprehension
Collections
56
For loop
for i = 1:5 # for loop: a finite number of iterations
    println(i)
end
57
Arrays with for loops
strings = ["foo","bar","baz"]
for s in strings
    println(s)
end
58
Numerical computing
Introducing various Array functions
zeros(Float64, 2, 2) # 2-by-2 matrix of zeros
ones(Float64, 3, 3) # 3-by-3 matrix of ones
trues(2, 2) # 2-by-2 matrix of true
eye(3) # 3-by-3 identity matrix
rand(2, 2) # 2-by-2 matrix of random numbers
59
Comprehension
[x for x = 1:3]
[x for x = 1:20 if x % 2 == 0]
["$x * $y = $(x*y)" for x=1:9, y=1:9]
[1, 2, 3]
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
["1 * 1 = 1", "1 * 2 = 2", "1 * 3 = 3" ...]
60
Tuple
Immutable
tup = (1, 2, 3)
tup[1] # 1
tup[1:2] # (1, 2)
(a, b, c) = (1, 2, 3)
61
Set
Mutable
filled = Set([1, 2, 2, 3, 4])
push!(filled, 5)
intersect(filled, other)
union(filled, other)
setdiff(Set([1, 2, 3, 4]), Set([2, 3, 5]))
Set([i for i=1:10])
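The set operations above refer to a second set named `other` that the slide never defines. The sketch below supplies a hypothetical `other` purely so the calls can be run end to end.

```julia
# Set operations with a concrete second set.
filled = Set([1, 2, 2, 3, 4])   # the duplicate 2 collapses, leaving 4 elements
push!(filled, 5)

other = Set([3, 4, 5, 6])       # hypothetical; not defined on the slide

common = intersect(filled, other)                        # elements in both sets
combined = union(filled, other)                          # elements in either set
only_left = setdiff(Set([1, 2, 3, 4]), Set([2, 3, 5]))   # in the first but not the second

println(common)
println(combined)
println(only_left)
```

Sets are unordered, so the printed element order may differ from the order in which the elements were inserted.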
62
Dict
Mutable
filled = Dict("one"=> 1, "two"=> 2, "three"=> 3)
keys(filled)
values(filled)
Dict(x=> i for (i, x) in enumerate(["one", "two", "three", "four"]))
63
Julia special features
64
Supports UTF-8 symbols
Type `\alpha<tab>` => α
α = 1 # used as a variable name
μ = 0
σ = 1
normal = Normal(μ, σ)
65
Easy to optimize
Allows generality and flexibility while still leaving room to optimize.
Hints:
Avoid global variables
Add type declarations
Measure performance with @time and pay attention to memory allocation
……
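The hints above can be sketched with two versions of the same loop. This is an illustrative example, not code from the talk: the names `data`, `sum_global`, and `sum_arg` are hypothetical, and the timings printed by `@time` depend on the machine.

```julia
# Hint 1: avoid non-constant globals inside hot loops.
data = rand(1000)

function sum_global()
    s = 0.0
    for v in data        # `data` is an untyped global here, so the compiler cannot specialize
        s += v
    end
    return s
end

# Hint 2: pass the data as an argument (optionally with a type declaration).
function sum_arg(xs::Vector{Float64})
    s = 0.0
    for v in xs
        s += v
    end
    return s
end

# Hint 3: measure with @time, after one warm-up call so compilation is not counted.
sum_global(); sum_arg(data)
@time sum_global()
@time sum_arg(data)
```

Besides the elapsed time, `@time` reports the number and size of allocations, which is often the first clue that a loop is not type-stable.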
66
Easy to profile
Use @time
ProfileView.view()
67
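Improving the performance of MATLAB-style code
Someone asked on a forum how to improve a program's performance. The author found that about 50% of the original code's run time was spent in garbage collection, meaning that half of the time went to allocating and freeing memory.
The author further notes that array-by-array operations are what people from a MATLAB background tend to write; switching to element-by-element operations gave a large improvement.
P.S. A feature added in v0.6 means that an expression such as cos(aEll).*gridX .- sin(aEll).*gridY no longer allocates memory three times, but only once.
http://kristofferc.github.io/post/vectorization_performance_study/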
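The two styles discussed above can be compared side by side. This is a minimal sketch, assuming Julia 0.6 or later (where dotted operations fuse into a single loop) and assuming `aEll` is a scalar angle; the function names and array sizes are hypothetical and not taken from the referenced article.

```julia
# Element-by-element version: an explicit loop writing into a preallocated output.
function rotate_loop!(out, a, X, Y)
    c, s = cos(a), sin(a)
    for i in eachindex(out, X, Y)
        out[i] = c * X[i] - s * Y[i]
    end
    return out
end

# Fused broadcast version: all dotted operations run as one loop,
# and `.=` stores the result in `out` without temporary arrays.
function rotate_fused!(out, a, X, Y)
    out .= cos(a) .* X .- sin(a) .* Y
    return out
end

aEll = 0.3                                   # assumed scalar angle
gridX, gridY = rand(100, 100), rand(100, 100)
out = similar(gridX)

rotate_loop!(out, aEll, gridX, gridY)
rotate_fused!(out, aEll, gridX, gridY)
@time rotate_fused!(out, aEll, gridX, gridY)
```

Writing into a preallocated `out` is what removes the remaining allocation; without `.=` the fused expression still allocates one array for its result.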
68
Easy to parallelize
for i = 1:100000
    do_something()
end
@parallel for i = 1:100000
    do_something()
end
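For readers on newer Julia versions, a hedged sketch of the same idea: from Julia 0.7 onward the `@parallel` macro is named `@distributed` and lives in the Distributed standard library. The loop body below is a stand-in for the slide's `do_something()`, and the `(+)` reducer is an addition for illustration so that the loop returns a value.

```julia
using Distributed   # standard library that provides @distributed

# Sum the loop variable across the available worker processes.
# If no workers were added with addprocs, the loop simply runs on the current process.
total = @distributed (+) for i = 1:100000
    i   # stand-in for do_something(); the last expression of each iteration is reduced with +
end

println(total)
```

With a reducer the macro blocks until all iterations finish and returns the combined result; without one it returns immediately and the iterations run asynchronously.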
69
Package manager
julia> Pkg.update()
julia> Pkg.add("Foo")
julia> Pkg.rm("Foo")
70
@code_native
function add(a, b)
    return a+b
end
julia> @code_native add(1, 2)
    .text
Filename: REPL[2]
    pushq %rbp
    movq %rsp, %rbp
Source line: 2
    leaq (%rcx,%rdx), %rax
    popq %rbp
    retq
    nopw (%rax,%rax)
71
@code_llvm
function add(a, b)
    return a+b
end
julia> @code_llvm add(1, 2.0)
; Function Attrs: uwtable
define double @julia_add_71636(i64, double) #0 {
top:
  %2 = sitofp i64 %0 to double
  %3 = fadd double %2, %1
  ret double %3
}
72
Julia packages
73
74
75
76
77
78
DataTables.jl
julia> using DataTables
julia> dt = DataTable(A = 1:4, B = ["M", "F", "F", "M"])
4×2 DataTables.DataTable
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ M │
│ 2   │ 2 │ F │
│ 3   │ 3 │ F │
│ 4   │ 4 │ M │
79
DataTables.jl
julia> dt[:A]
4-element NullableArrays.NullableArray{Int64,1}:
 1
 2
 3
 4
julia> dt[2, :A]
Nullable{Int64}(2)
80
DataTables.jl
julia> dt = readtable("data.csv")
julia> dt = DataTable(A = 1:10);
julia> writetable("output.csv", dt)
81
DataTables.jl
julia> names = DataTable(ID = [1, 2], Name = ["John Doe", "Jane Doe"])
julia> jobs = DataTable(ID = [1, 2], Job = ["Lawyer", "Doctor"])
julia> full = join(names, jobs, on = :ID)
2×3 DataTables.DataTable
│ Row │ ID │ Name     │ Job    │
├─────┼────┼──────────┼────────┤
│ 1   │ 1  │ John Doe │ Lawyer │
│ 2   │ 2  │ Jane Doe │ Doctor │
82
Query.jl
julia> q1 = @from i in dt begin
           @where i.age > 40
           @select {number_of_children=i.children, i.name}
           @collect DataTable
       end
83
StatsBase.jl
Mean Functions
  mean(x, w)
  geomean(x)
  harmmean(x)
Scalar Statistics
  var(x, wv[; mean=...])
  std(x, wv[; mean=...])
  mean_and_var(x[, wv][, dim])
  mean_and_std(x[, wv][, dim])
  zscore(X, μ, σ)
  entropy(p)
  crossentropy(p, q)
  kldivergence(p, q)
  percentile(x, p)
  nquantile(x, n)
  quantile(x)
  median(x, w)
  mode(x)
84
StatsBase.jl
Sampling from Population
  sample(a)
Correlation Analysis of Signals
  autocov(x, lags[; demean=true])
  autocor(x, lags[; demean=true])
  corspearman(x, y)
  corkendall(x, y)
85
Distributions.jl
Continuous Distributions
  Beta(α, β)
  Chisq(ν)
  Exponential(θ)
  Gamma(α, θ)
  LogNormal(μ, σ)
  Normal(μ, σ)
  Uniform(a, b)
Discrete Distributions
  Bernoulli(p)
  Binomial(n, p)
  DiscreteUniform(a, b)
  Geometric(p)
  Hypergeometric(s, f, n)
  NegativeBinomial(r, p)
  Poisson(λ)
86
GLM.jl
julia> data = DataFrame(X=[1,2,3], Y=[2,4,7])
3x2 DataFrame
|-------|---|---|
| Row # | X | Y |
| 1     | 1 | 2 |
| 2     | 2 | 4 |
| 3     | 3 | 7 |
87
GLM.jl
julia> OLS = glm(@formula(Y ~ X), data, Normal(), IdentityLink())
DataFrameRegressionModel{GeneralizedLinearModel,Float64}:
Coefficients:
              Estimate Std.Error  z value Pr(>|z|)
(Intercept)  -0.666667   0.62361 -1.06904   0.2850
X                  2.5  0.288675  8.66025   <1e-17
88
GLM.jl
julia> newX = DataFrame(X=[2,3,4]);
julia> predict(OLS, newX, :confint)
3×3 Array{Float64,2}:
 4.33333  1.33845   7.32821
 6.83333  2.09801  11.5687
 9.33333  1.40962  17.257
# The columns of the matrix are prediction, 95% lower and upper confidence bounds
89
Gadfly.jl
90
Plots.jl
using Plots

# initialize the attractor
n = 1500
dt = 0.02
σ, ρ, β = 10., 28., 8/3
x, y, z = 1., 1., 1.

# initialize a 3D plot with 1 empty series
plt = path3d(1, xlim=(-25,25), ylim=(-25,25), zlim=(0,50),
             xlab = "x", ylab = "y", zlab = "z",
             title = "Lorenz Attractor", marker = 1)

# build an animated gif, saving every 10th frame
@gif for i=1:n
    dx = σ*(y - x)     ; x += dt * dx
    dy = x*(ρ - z) - y ; y += dt * dy
    dz = x*y - β*z     ; z += dt * dz
    push!(plt, x, y, z)
end every 10
91
Data
JuliaData
  DataTables.jl
  CSV.jl
  DataStreams.jl
  CategoricalArrays.jl
JuliaDB
92
File
JuliaIO
  FileIO.jl
  JSON.jl
  LightXML.jl
  HDF5.jl
  GZip.jl
93
Differential equations
JuliaDiff
  ForwardDiff.jl: Forward Mode Automatic Differentiation for Julia
  ReverseDiff.jl: Reverse Mode Automatic Differentiation for Julia
  TaylorSeries.jl
JuliaDiffEq
  DifferentialEquations.jl
    Discrete Equations (function maps, discrete stochastic (Gillespie/Markov) simulations)
    Ordinary Differential Equations (ODEs)
    Stochastic Differential Equations (SDEs)
    Differential Algebraic Equations (DAEs)
    Delay Differential Equations (DDEs)
    (Stochastic) Partial Differential Equations ((S)PDEs)
94
Probability
JuliaStats
JuliaOpt
  JuMP.jl
  Convex.jl
JuliaML
  LearnBase.jl
  LossFunctions.jl
  ObjectiveFunctions.jl
  PenaltyFunctions.jl
Klara.jl: MCMC inference in Julia
Mamba.jl: Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia
95
Graph / Network
JuliaGraphs
  LightGraphs.jl
  GraphPlot.jl
96
Plot
Gadfly.jl
JuliaPlots
  Plots.jl
97
Glue
JuliaPy
  PyCall.jl
  pyjulia
  Conda.jl
  PyPlot.jl
  Pandas.jl
  Seaborn.jl
  SymPy.jl
JuliaInterop
  RCall.jl
  JavaCall.jl
  CxxWrap.jl
  MATLAB.jl
98
Programming
JuliaCollections
  Iterators.jl
  DataStructures.jl
  SortingAlgorithms.jl
  FunctionalCollections.jl
Combinatorics.jl
99
Web
JuliaWeb
  Requests.jl
  HttpServer.jl
  WebSockets.jl
  HTTPClient.jl
100
Comparison with other languages
Python
R
Perl
101
Jobs
Apple, Amazon, Facebook, BlackRock, Ford, Oracle
Comcast, Massachusetts General Hospital
Farmers Insurance
Los Alamos National Laboratory and the National Renewable Energy Laboratory
https://juliacomputing.com/press/2017/01/18/jobs.html
102
Julia Taiwan
Facebook group: https://www.facebook.com/groups/1787971081482186/