20170415 When Julia Meets Data Science

Post on 16-Apr-2017


TRANSCRIPT

1

When Julia Meets Data Science
Julia Taiwan founder 杜岳華

2

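About me: 杜岳華
R&D substitute military service at the Centers for Disease Control
Aspiring biomedical data scientist
M.S., Institute of Biomedical Informatics, National Yang-Ming University
B.S., Medical Laboratory Science and Biotechnology, National Cheng Kung University
B.S., Computer Science and Information Engineering, National Cheng Kung University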

3

Why Julia?

4

In scientific computing and data science…

5

Other users

6

Avoid the two-language problem

One language for rapid development, the other for performance

Example: Python for rapid development, C for performance

7

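The performance of itertools
One article describes the trade-off between the two:
"In general, we do not optimize all of our code, because optimization comes at a high price: generality and readability. Running fast and writing fast usually have to be traded off against each other. The example here is easy to picture: just compare R code with Rcpp code."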

http://wush.ghost.io/itertools-performance/

8

With Julia, you don't have to make that trade-off!!

9

Features of Julia: write like Python, run like C.

Python-like readability
C-like performance
Easy to parallelize
Built-in package manager
……

10

Julia code

    a = [1, 2, 3, 4, 5]

    function square(x)
        return x^2
    end

    for x in a
        println(square(x))
    end

11

https://julialang.org/benchmarks/

Julia performance

12

Who uses Julia?

13

Thomas Sargent
Nobel prize in economic sciences
The founder of QuantEcon
“His team at NYU uses Julia for macroeconomic modeling and contributes to the Julia ecosystem.”

https://juliacomputing.com/case-studies/thomas-sargent.html

14

In 2015, economists at the Federal Reserve Bank of New York (FRBNY) published FRBNY’s most comprehensive and complex macroeconomic models, known as Dynamic Stochastic General Equilibrium, or DSGE models, in Julia.

https://juliacomputing.com/case-studies/ny-fed.html

15

UK cancer researchers turned to Julia to run simulations of tumor growth (Nature Genetics, 2016).
Approximate Bayesian Computation (ABC) algorithms require potentially millions of simulations, so they must be fast.
BioJulia project for analyzing biological data in Julia
Bayesian MCMC methods: Lora.jl and Mamba.jl

https://juliacomputing.com/case-studies/nature.html

16

IBM and Julia Computing analyzed eye fundus images provided by Drishti Eye Hospitals.

Timely screening for changes in the retina can help get them to treatment and prevent vision loss. Julia Computing’s work using deep learning makes retinal screening an activity that can be performed by a trained technician using a low cost fundus camera.

https://juliacomputing.com/case-studies/ibm.html

17

Path BioAnalytics is a computational biotech company developing novel precision medicine assays to support drug discovery and development, and treatment of disease.

https://juliacomputing.com/case-studies/pathbio.html

18

The Sloan Digital Sky Survey contains nearly 5 million telescopic images of 12 megabytes each – a dataset of 55 terabytes.

In order to analyze this massive dataset, researchers at UC Berkeley and Lawrence Berkeley National Laboratory created a new code named Celeste.

https://juliacomputing.com/case-studies/intel-astro.html

19

http://pkg.julialang.org/pulse.html

Julia Package Ecosystem Pulse

20

Introduction to Julia

21

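Everything starts with numbers… Numbers in Julia come in the following forms:

Integers
Floating-point numbers
Rational numbers
Complex numbers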

22

Julia's integers and floating-point numbers come in versions of different bit widths:

Integer: Int8, Int16, Int32, Int64, Int128
Unsigned: UInt8, UInt16, UInt32, UInt64, UInt128
Float: Float16, Float32, Float64

23

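Rational numbers
Written as numerator//denominator
Automatically reduced to lowest terms
The sign is normalized automatically
A denominator of 0 is accepted

    2//3      # 2//3
    -6//12    # -1//2
    5//-20    # -1//4
    5//0      # 1//0

    num(2//10)    # 1
    den(7//14)    # 2

    2//4 + 1//7       # 9//14
    3//10 * 6//9      # 1//5
    10//15 == 8//12   # true
    float(3//4)       # 0.75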

24

Complex numbers

    1 + 2im
    (1 + 2im) + (3 - 4im)    # 4 - 2im
    (1 + 2im)*(3 - 4im)      # 11 + 2im
    (-4 + 3im)^(2 + 1im)     # 1.950 + 0.651im

    real(1 + 2im)            # 1
    imag(3 + 4im)            # 4
    conj(1 + 2im)            # 1 - 2im
    abs(3 + 4im)             # 5.0
    angle(3 + 3im)/pi*180    # 45.0

25

Let's declare some variables! The type may be specified or left out.

    x = 5
    y = 4::Int64
    z = x + y
    println(z)    # 9

26

Variables are easygoing
A dynamically typed language
Values are immutable

    x = 5
    println(x)            # 5
    println(typeof(x))    # Int64

    x = 6.0
    println(x)            # 6.0
    println(typeof(x))    # Float64

27

[Diagram: the name x first refers to the value 5 (Int64) and is then rebound to the value 6.0 (Float64)]

28

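Static typing vs. dynamic typing
The biggest difference between static and dynamic typing is whether the type belongs to the variable or to the value.

[Diagram: Static type: the type int is attached to the variable x, which holds 5. Dynamic type: the type int is attached to the value 5, which x refers to.]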

29

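However you play, operators are the most fun
+x : x itself
-x : sign change
x + y, x - y, x * y, x / y : the four basic arithmetic operations
div(x, y) : quotient
x % y : remainder, can also be written rem(x, y)
x \ y : inverse division, equivalent to y / x
x ^ y : power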
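The operators listed above can be tried on a pair of integers. This is a minimal sketch assuming Julia 1.x; the values 7 and 2 are arbitrary choices for illustration.

```julia
x, y = 7, 2

quotient  = div(x, y)   # integer quotient
remainder = x % y       # same result as rem(x, y)
ratio     = x / y       # `/` on two integers returns a Float64
inverse   = x \ y       # inverse division, equivalent to y / x
power     = x ^ y

println((quotient, remainder, ratio, inverse, power))
```

Quotient and remainder fit together: `div(x, y) * y + x % y` reproduces `x`.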

30

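The mechanical core of number manipulation
~x : bitwise not
x & y : bitwise and
x | y : bitwise or
x $ y : bitwise xor
x >>> y : unsigned (logical) shift, moves the bits of x right by y places
x >> y : sign-preserving (arithmetic) shift, moves the bits of x right by y places
x << y : moves the bits of x left by y places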

https://www.technologyuk.net/mathematics/number-systems/images/binary_number.gif
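The bitwise operators above can be checked on small binary literals. This sketch assumes Julia 1.x, where bitwise xor is spelled `xor(x, y)` or `x ⊻ y` (the `$` spelling on the slide belongs to the older releases the talk targets). The operand values are arbitrary.

```julia
x, y = 0b1100, 0b1010      # binary literals of this length are UInt8

bit_and = x & y            # 0b1000
bit_or  = x | y            # 0b1110
bit_xor = xor(x, y)        # 0b0110, also written x ⊻ y
bit_not = ~x               # 0b11110011 (all eight bits flipped)

# The two right shifts differ only for negative signed integers.
arith = -8 >> 1            # sign preserved: -4
logic = -8 >>> 1           # zero-filled from the left: a large positive number
left  = 1 << 4             # 16
```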

31

Convenient updating operators

    +=  -=  *=  /=  \=  %=  ^=
    &=  |=  $=  >>>=  >>=  <<=

x += 5 is equivalent to x = x + 5

32

Comparisons
x == y : equal
x != y, x ≠ y : not equal
x < y : less than
x > y : greater than
x <= y, x ≤ y : less than or equal
x >= y, x ≥ y : greater than or equal

    a, b, c = (1, 3, 5)
    a < b < c    # true

33

Operations and conversions between different types
Arithmetic operations convert (promote) automatically
Strongly typed

    3.14 * 4           # 12.56

    parse(Int, "5")    # 5
    string(5)          # "5"

34

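Strong typing vs. weak typing

[Diagram: 5 (int) + "5" (string). Strong type: the two operands keep their own types and are not mixed. Weak type: the int operand is implicitly converted to string before the +.]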
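The two behaviors described above, automatic promotion among numbers and no implicit mixing of strings with numbers, can be observed directly. This is a sketch assuming Julia 1.x; the variable names are made up for illustration.

```julia
# Arithmetic between numeric types promotes both operands to a common type.
p = promote(3.14, 4)          # (3.14, 4.0)
product = 3.14 * 4            # a Float64

# A string and a number are never mixed implicitly: `+` has no such method.
mixed_fails = try
    "5" + 5
    false
catch err
    err isa MethodError
end

# Conversions between strings and numbers have to be requested explicitly.
n = parse(Int, "5")           # String -> Int
s = string(5)                 # Int -> String
total = n + 5
```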

35

This feels a bit dry. Let's write a small game instead.

36

Let's write a rock-paper-scissors game

    paper = 1      # stands for paper
    scissor = 2    # stands for scissors
    stone = 3      # stands for rock

37

Deciding who wins
if statements
Short-circuit logic

    if scissor > paper
        println("scissor win!!")
    end

    if <condition>
        <code>
    end

    if 3 > 5 && 10 > 0
        …
    end

38

User input

    println("Enter your move")
    println("1 for paper, 2 for scissors, 3 for rock: ")
    s = readline(STDIN)
    x = parse(Int, s)

39

Putting it together

    if x == paper
        println("You played paper")
    elseif x == scissor
        println("You played scissors")
    elseif x == stone
        println("You played rock")
    end

    if <condition 1>
        <code 1>
    elseif <condition 2>
        <code 2>
    else
        <code 3>
    end

40

How the computer picks its move
rand() : a random number between 0 and 1
rand([...]) : picks one element out of the collection

    y = rand([1, 2, 3])

41

Nested comparisons

    if x == y
        println("Draw")
    elseif x == paper
        println("You played paper")
        if y == scissor
            println("Computer played scissors")
            println("Computer wins")
        elseif y == stone
            println("Computer played rock")
            println("You win")
        end
    ...

42

My spaghetti code

    if x == y
        println("Draw")
    elseif x == paper
        println("You played paper")
        if y == scissor
            println("Computer played scissors")
            println("Computer wins")
        elseif y == stone
            println("Computer played rock")
            println("You win")
        end
    elseif x == scissor
        println("You played scissors")
        if y == paper
            println("Computer played paper")
            println("You win")
        elseif y == stone
            println("Computer played rock")
            println("Computer wins")
        end
    elseif x == stone
        println("You played rock")
        if y == scissor
            println("Computer played scissors")
            println("You win")
        elseif y == paper
            println("Computer played paper")
            println("Computer wins")
        end
    end

43

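I see repetition
Functions are a great tool for eliminating repetition! We wrote a great many conditionals earlier, and they are highly repetitive, which feels silly. We can try to pull the move lookup out into its own function.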

44

Functions to the rescue

    function add(a, b)
        c = a + b
        return c
    end

45

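How functions talk: pass-by-sharing

Personally, I find it closest to call by reference.

[Diagram: the variable x refers to the value 5 (int); inside function foo(a) ... end, the argument a refers to that same value]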
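What pass-by-sharing means in practice can be seen by comparing mutation with rebinding. This is a minimal sketch assuming Julia 1.x; `mutate!`, `rebind`, and `bump` are hypothetical helper names, not part of the talk.

```julia
# Mutating the object the argument refers to is visible to the caller.
function mutate!(a)
    a[1] = 100
    return nothing
end

# Rebinding the local name leaves the caller's variable untouched.
function rebind(a)
    a = [0, 0, 0]
    return nothing
end

v = [1, 2, 3]
mutate!(v)      # v is now [100, 2, 3]
rebind(v)       # v is still [100, 2, 3]

# Numbers are immutable, so `a += 1` can only rebind the local name.
function bump(a)
    a += 1
    return a
end

n = 5
m = bump(n)     # m is 6, n is still 5
```

The argument and the caller's variable share one object, which is why it resembles call by reference for arrays, yet assignment inside the function never reaches back into the caller.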

46

Reducing the repetition

    function shape(x)
        if x == paper
            return "paper"
        elseif x == scissor
            return "scissors"
        elseif x == stone
            return "rock"
        end
    end

47

How do we decide who wins?

The repetition is reduced, but deciding the winner is still not handled.

48

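What you need is a matrix
Suddenly a god spoke and rescued this mere mortal. XD

Yes, perhaps what you need is a table to look things up in.

             | paper  scissors  rock
    ---------+----------------------
    paper    |   0      -1        1
    scissors |   1       0       -1
    rock     |  -1       1        0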

49

Introducing Array

homogeneous
indices start from 1
mutable

[Diagram: an array holding the elements 2, 3, 5]

    A = [2, 3, 5]
    A[2]    # 3

50

Multidimensional arrays

    A = [0 -1 1; 1 0 -1; -1 1 0]

    A[1, 2]    # -1

51

Simple string operations
concatenate

x must be a string

    "You played " * x

52

Simplification complete
This is called refactoring

    x_shape = shape(x)
    y_shape = shape(y)
    println("You played " * x_shape)
    println("Computer played " * y_shape)

    win_or_lose = A[x, y]
    if win_or_lose == 0
        println("Draw")
    elseif win_or_lose == 1
        println("You win")
    else
        println("Computer wins")
    end
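The refactored pieces above can be assembled into one self-contained program. This is a sketch assuming Julia 1.x; the constant names and the functions `judge` and `play` are choices made here for illustration, while the lookup table and the numbering of moves follow the slides.

```julia
const PAPER, SCISSOR, STONE = 1, 2, 3
const NAMES = ["paper", "scissors", "rock"]

# Rows are the player's move, columns the computer's move:
# 0 = draw, 1 = player wins, -1 = computer wins.
const OUTCOME = [ 0 -1  1;
                  1  0 -1;
                 -1  1  0]

shape(x) = NAMES[x]

function judge(x, y)
    result = OUTCOME[x, y]
    if result == 0
        return "Draw"
    elseif result == 1
        return "You win"
    else
        return "Computer wins"
    end
end

# The computer's move defaults to a random pick, as on the earlier slide.
function play(x, y = rand(1:3))
    println("You played " * shape(x))
    println("Computer played " * shape(y))
    outcome = judge(x, y)
    println(outcome)
    return outcome
end

play(PAPER, STONE)
```

Keeping `judge` free of printing makes the winner logic easy to check on its own.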

53

I want to play many times

    while <condition>
        <code>
    end

    x = …
    while <continue condition>
        ...
        x = …
    end

54

Stop condition

    s = readline(STDIN)
    x = parse(Int, s)
    while x != -1
        ...
        s = readline(STDIN)
        x = parse(Int, s)
    end
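The read-then-loop pattern above can be exercised without a keyboard by reading from an in-memory buffer. This is a sketch assuming Julia 1.x; `collect_moves` is a hypothetical helper, and the `IOBuffer` stands in for standard input.

```julia
# Read one integer per line until the sentinel -1 appears.
function collect_moves(io::IO)
    moves = Int[]
    x = parse(Int, readline(io))
    while x != -1
        push!(moves, x)
        x = parse(Int, readline(io))
    end
    return moves
end

moves = collect_moves(IOBuffer("1\n3\n2\n-1\n"))
println(moves)
```

If the input ends before a -1 arrives, `readline` returns an empty string and `parse` throws, so real input handling would need an extra check.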

55

Other commonly used Julia syntax
For loop
Comprehension
Collections

56

For loop

    for i = 1:5    # for loop with a fixed number of iterations
        println(i)
    end

57

Arrays with for loops

    strings = ["foo", "bar", "baz"]

    for s in strings
        println(s)
    end

58

Numerical computing
Introducing various Array functions

    zeros(Float64, 2, 2)    # 2-by-2 matrix of zeros

    ones(Float64, 3, 3)     # 3-by-3 matrix of ones

    trues(2, 2)             # 2-by-2 matrix of true

    eye(3)                  # 3-by-3 identity matrix

    rand(2, 2)              # 2-by-2 matrix of random numbers
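The constructors above can be run as follows. This sketch assumes Julia 1.x, where `eye` no longer exists and an identity matrix is built from `I` in the `LinearAlgebra` standard library; the other calls are unchanged from the slide.

```julia
using LinearAlgebra

Z = zeros(Float64, 2, 2)        # 2-by-2 matrix of zeros
O = ones(Float64, 3, 3)         # 3-by-3 matrix of ones
T = trues(2, 2)                 # 2-by-2 BitArray filled with true
E = Matrix{Float64}(I, 3, 3)    # 3-by-3 identity matrix, replaces eye(3)
R = rand(2, 2)                  # 2-by-2 matrix of uniform random numbers in [0, 1)

println(E)
```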

59

Comprehension

    [x for x = 1:3]
    # [1, 2, 3]

    [x for x = 1:20 if x % 2 == 0]
    # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

    ["$x * $y = $(x*y)" for x=1:9, y=1:9]
    # 9-by-9 matrix of strings; first row: "1 * 1 = 1", "1 * 2 = 2", "1 * 3 = 3", ...

60

Tuple

Immutable

tup = (1, 2, 3)

    tup[1]      # 1
    tup[1:2]    # (1, 2)

(a, b, c) = (1, 2, 3)

61

Set

Mutable

    filled = Set([1, 2, 2, 3, 4])
    push!(filled, 5)

    intersect(filled, other)
    union(filled, other)
    setdiff(Set([1, 2, 3, 4]), Set([2, 3, 5]))

Set([i for i=1:10])

62

Dict

Mutable

filled = Dict("one"=> 1, "two"=> 2, "three"=> 3)

    keys(filled)
    values(filled)

Dict(x=> i for (i, x) in enumerate(["one", "two", "three", "four"]))
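The Set and Dict operations from the last two slides can be run end to end once the second set is defined. This is a sketch assuming Julia 1.x; the contents of `other` and the dictionary name `numbers` are made up here, since the slides leave `other` undefined.

```julia
filled = Set([1, 2, 2, 3, 4])        # the duplicate 2 is stored once
push!(filled, 5)
other = Set([3, 4, 5, 6])            # hypothetical second set

common    = intersect(filled, other)
combined  = union(filled, other)
only_left = setdiff(Set([1, 2, 3, 4]), Set([2, 3, 5]))

numbers = Dict("one" => 1, "two" => 2, "three" => 3)
numbers["four"] = 4                  # insert a new key
two = numbers["two"]                 # look up an existing key
fallback = get(numbers, "five", 0)   # default value when the key is absent

# Dict comprehension: map each word to its position in the list.
index_of = Dict(x => i for (i, x) in enumerate(["one", "two", "three", "four"]))
```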

63

Julia special features

64

UTF-8 symbol support
Typing `\alpha<tab>` gives α

    α = 1    # usable as a variable name
    μ = 0
    σ = 1
    normal = Normal(μ, σ)

65

Easy to optimize

Julia allows generality and flexibility while still leaving room to optimize.

Hints:
Avoid global variables
Add type declarations
Measure performance with @time and pay attention to memory allocation
……
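The first and third hints above can be tried together: the same loop is written once against a non-constant global and once against a function argument, then timed. This is a sketch assuming Julia 1.x; the function names and array size are arbitrary, and the printed timings depend on the machine.

```julia
data = rand(1000)                 # non-constant global: its type may change at any time

# Reads the global directly, so the compiler cannot assume its type.
function sum_global()
    s = 0.0
    for v in data
        s += v
    end
    return s
end

# Receives the array as an argument; the annotation restricts the accepted input,
# and the loop is compiled for that concrete type.
function sum_arg(xs::Vector{Float64})
    s = 0.0
    for v in xs
        s += v
    end
    return s
end

# Call each once first so that compilation time is not measured.
sum_global()
sum_arg(data)

@time sum_global()                # compare the time and the reported allocations
@time sum_arg(data)
```

Declaring the global as `const` is another way to let the compiler rely on its type.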

66

Easy to profile

Use @time
ProfileView.view()

67

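Improving the performance of MATLAB-style code
Someone asked on a forum how to speed up their program. The author found that about 50% of the original code's run time was spent in garbage collection, meaning that half of the time went into allocating and freeing memory.
The author goes on to note that array-by-array operations are what people coming from a MATLAB background tend to write, and that switching to element-by-element operations gives a large improvement.
P.S. A feature added in v0.6 means that an expression such as cos(aEll).*gridX .- sin(aEll).*gridY no longer allocates memory three times, but only once.
http://kristofferc.github.io/post/vectorization_performance_study/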
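The contrast described above, array-by-array versus element-by-element, can be sketched on the expression quoted in the P.S. This assumes Julia 1.x, that `aEll` is a scalar angle, and that `gridX` and `gridY` are matrices of equal size; those shapes, the values, and the helper `rotate!` are assumptions made here, not taken from the linked post.

```julia
aEll  = 0.3
gridX = rand(100, 100)
gridY = rand(100, 100)

# Dotted operators fuse into a single loop that allocates one result array.
fused = cos(aEll) .* gridX .- sin(aEll) .* gridY

# Writing into an existing array with `.=` avoids even that allocation.
out = similar(gridX)
out .= cos(aEll) .* gridX .- sin(aEll) .* gridY

# The same computation written element by element.
function rotate!(dest, a, X, Y)
    c, s = cos(a), sin(a)
    for i in eachindex(dest, X, Y)
        dest[i] = c * X[i] - s * Y[i]
    end
    return dest
end

looped = rotate!(similar(gridX), aEll, gridX, gridY)
```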

68

Easy to parallelize

    for i = 1:100000
        do_something()
    end

    @parallel for i = 1:100000
        do_something()
    end
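The parallel loop above can be written for current releases as follows. This sketch assumes Julia 1.x, where `@parallel` was renamed to `@distributed` and moved into the `Distributed` standard library; `do_something(i)` is a hypothetical stand-in that just returns its argument so that the result can be checked.

```julia
using Distributed

# Hypothetical workload; @everywhere defines it on every available process.
@everywhere do_something(i) = i

# Serial version.
serial = sum(do_something(i) for i in 1:100000)

# Distributed version: iterations are split across worker processes and the
# partial results are combined with the reducer (+).
parallel = @distributed (+) for i = 1:100000
    do_something(i)
end

println(parallel)
```

Without extra worker processes (for example when Julia is started without `-p`), the loop simply runs on the main process and returns the same result.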

69

Package manager

julia> Pkg.update()

julia> Pkg.add("Foo")

julia> Pkg.rm("Foo")

70

@code_native

    function add(a, b)
        return a+b
    end

    julia> @code_native add(1, 2)
            .text
    Filename: REPL[2]
            pushq   %rbp
            movq    %rsp, %rbp
    Source line: 2
            leaq    (%rcx,%rdx), %rax
            popq    %rbp
            retq
            nopw    (%rax,%rax)

71

@code_llvm

julia> @code_llvm add(1, 2.0)

    ; Function Attrs: uwtable
    define double @julia_add_71636(i64, double) #0 {
    top:
      %2 = sitofp i64 %0 to double
      %3 = fadd double %2, %1
      ret double %3
    }

    function add(a, b)
        return a+b
    end

72

Julia packages

73

74

75

76

77

78

DataTables.jl

    julia> using DataTables
    julia> dt = DataTable(A = 1:4, B = ["M", "F", "F", "M"])

    4×2 DataTables.DataTable
    │ Row │ A │ B │
    ├─────┼───┼───┤
    │ 1   │ 1 │ M │
    │ 2   │ 2 │ F │
    │ 3   │ 3 │ F │
    │ 4   │ 4 │ M │

79

DataTables.jl

    julia> dt[:A]
    4-element NullableArrays.NullableArray{Int64,1}:
     1
     2
     3
     4

    julia> dt[2, :A]
    Nullable{Int64}(2)

80

DataTables.jl

julia> dt = readtable("data.csv")

julia> dt = DataTable(A = 1:10);
julia> writetable("output.csv", dt)

81

DataTables.jl

    julia> names = DataTable(ID = [1, 2], Name = ["John Doe", "Jane Doe"])
    julia> jobs = DataTable(ID = [1, 2], Job = ["Lawyer", "Doctor"])

    julia> full = join(names, jobs, on = :ID)
    2×3 DataTables.DataTable
    │ Row │ ID │ Name     │ Job    │
    ├─────┼────┼──────────┼────────┤
    │ 1   │ 1  │ John Doe │ Lawyer │
    │ 2   │ 2  │ Jane Doe │ Doctor │

82

Query.jl

    julia> q1 = @from i in dt begin
               @where i.age > 40
               @select {number_of_children=i.children, i.name}
               @collect DataTable
           end

83

StatsBase.jl

Mean Functions
    mean(x, w)
    geomean(x)
    harmmean(x)

Scalar Statistics
    var(x, wv[; mean=...])
    std(x, wv[; mean=...])
    mean_and_var(x[, wv][, dim])
    mean_and_std(x[, wv][, dim])
    zscore(X, μ, σ)
    entropy(p)
    crossentropy(p, q)
    kldivergence(p, q)
    percentile(x, p)
    nquantile(x, n)
    quantile(x)
    median(x, w)
    mode(x)
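What a few of the functions listed above compute can be spelled out with the `Statistics` standard library alone. This is a sketch assuming Julia 1.x; it does not call StatsBase.jl itself, and the data and weights are made up for illustration.

```julia
using Statistics

x = [1.0, 2.0, 4.0]
w = [1.0, 1.0, 2.0]                          # hypothetical weights

weighted_mean  = sum(w .* x) / sum(w)        # the quantity behind a weighted mean
geometric_mean = exp(mean(log.(x)))          # the quantity behind geomean(x)
harmonic_mean  = 1 / mean(1 ./ x)            # the quantity behind harmmean(x)
z = (x .- mean(x)) ./ std(x)                 # the quantity behind zscore(x, mean(x), std(x))

println((weighted_mean, geometric_mean, harmonic_mean))
```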

84

StatsBase.jl

Sampling from Population
    sample(a)

Correlation Analysis of Signals
    autocov(x, lags[; demean=true])
    autocor(x, lags[; demean=true])
    corspearman(x, y)
    corkendall(x, y)

85

Distributions.jl

Continuous Distributions
    Beta(α, β)
    Chisq(ν)
    Exponential(θ)
    Gamma(α, θ)
    LogNormal(μ, σ)
    Normal(μ, σ)
    Uniform(a, b)

Discrete Distributions
    Bernoulli(p)
    Binomial(n, p)
    DiscreteUniform(a, b)
    Geometric(p)
    Hypergeometric(s, f, n)
    NegativeBinomial(r, p)
    Poisson(λ)

86

GLM.jl

julia> data = DataFrame(X=[1,2,3], Y=[2,4,7])

    3x2 DataFrame
    |-------|---|---|
    | Row # | X | Y |
    | 1     | 1 | 2 |
    | 2     | 2 | 4 |
    | 3     | 3 | 7 |

87

GLM.jl

julia> OLS = glm(@formula(Y ~ X), data, Normal(), IdentityLink())

DataFrameRegressionModel{GeneralizedLinearModel,Float64}:

    Coefficients:
                  Estimate  Std.Error   z value  Pr(>|z|)
    (Intercept)  -0.666667    0.62361  -1.06904    0.2850
    X                  2.5   0.288675   8.66025    <1e-17

88

GLM.jl

    julia> newX = DataFrame(X=[2,3,4]);
    julia> predict(OLS, newX, :confint)

    3×3 Array{Float64,2}:
     4.33333  1.33845   7.32821
     6.83333  2.09801  11.5687
     9.33333  1.40962  17.257
    # The columns of the matrix are prediction, 95% lower and upper confidence bounds
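The point estimates and predictions shown above can be reproduced with plain least squares. This is a sketch assuming Julia 1.x; it uses the backslash operator from the `LinearAlgebra` standard library instead of GLM.jl, so it yields only the coefficients and predictions, not the standard errors or confidence bounds.

```julia
using LinearAlgebra

# Design matrix: a column of ones for the intercept, then the X values from the slide.
X = [1.0 1.0;
     1.0 2.0;
     1.0 3.0]
Y = [2.0, 4.0, 7.0]

β = X \ Y                       # least-squares solution: [intercept, slope]

# Predictions for the new X values 2, 3, 4.
newX = [1.0 2.0;
        1.0 3.0;
        1.0 4.0]
prediction = newX * β

println(β)
println(prediction)
```

The result agrees with the Estimate column and the first column of the prediction matrix on the slides.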

89

Gadfly.jl

90

Plots.jl

    using Plots

    # initialize the attractor
    n = 1500
    dt = 0.02
    σ, ρ, β = 10., 28., 8/3
    x, y, z = 1., 1., 1.

    # initialize a 3D plot with 1 empty series
    plt = path3d(1, xlim=(-25,25), ylim=(-25,25), zlim=(0,50),
                 xlab = "x", ylab = "y", zlab = "z",
                 title = "Lorenz Attractor", marker = 1)

    # build an animated gif, saving every 10th frame
    @gif for i=1:n
        dx = σ*(y - x)     ; x += dt * dx
        dy = x*(ρ - z) - y ; y += dt * dy
        dz = x*y - β*z     ; z += dt * dz
        push!(plt, x, y, z)
    end every 10

91

Data

JuliaData
    DataTables.jl
    CSV.jl
    DataStreams.jl
    CategoricalArrays.jl

JuliaDB

92

File

JuliaIO
    FileIO.jl
    JSON.jl
    LightXML.jl
    HDF5.jl
    GZip.jl

93

Differential equation

JuliaDiff
    ForwardDiff.jl: Forward Mode Automatic Differentiation for Julia
    ReverseDiff.jl: Reverse Mode Automatic Differentiation for Julia
    TaylorSeries.jl

JuliaDiffEq
    DifferentialEquations.jl
        Discrete Equations (function maps, discrete stochastic (Gillespie/Markov) simulations)
        Ordinary Differential Equations (ODEs)
        Stochastic Differential Equations (SDEs)
        Differential Algebraic Equations (DAEs)
        Delay Differential Equations (DDEs)
        (Stochastic) Partial Differential Equations ((S)PDEs)

94

Probability

JuliaStats

JuliaOpt
    JuMP.jl
    Convex.jl

JuliaML
    LearnBase.jl
    LossFunctions.jl
    ObjectiveFunctions.jl
    PenaltyFunctions.jl

Klara.jl: MCMC inference in Julia

Mamba.jl: Markov chain Monte Carlo (MCMC) for Bayesian analysis in Julia

95

Graph / Network

JuliaGraphs
    LightGraphs.jl
    GraphPlot.jl

96

Plot

Gadfly.jl

JuliaPlots
    Plots.jl

97

Glue

JuliaPy
    PyCall.jl
    pyjulia
    Conda.jl
    PyPlot.jl
    Pandas.jl
    Seaborn.jl
    SymPy.jl

JuliaInterop
    RCall.jl
    JavaCall.jl
    CxxWrap.jl
    MATLAB.jl

98

Programming

JuliaCollections
    Iterators.jl
    DataStructures.jl
    SortingAlgorithms.jl
    FunctionalCollections.jl

Combinatorics.jl

99

Web

JuliaWeb
    Requests.jl
    HttpServer.jl
    WebSockets.jl
    HTTPClient.jl

100

Comparison with other languages
Python
R
Perl

101

Jobs

Apple, Amazon, Facebook, BlackRock, Ford, Oracle

Comcast, Massachusetts General Hospital

Farmers Insurance

Los Alamos National Laboratory and the National Renewable Energy Laboratory

https://juliacomputing.com/press/2017/01/18/jobs.html

102

Julia Taiwan

Facebook group: https://www.facebook.com/groups/1787971081482186/
