

Introduction to Julia for Bioinformatics

Kenta Sato (佐藤建太)

@ Bioinformatics Research Unit, RIKEN ACCC

November 19, 2015


Topics

- About Me
- Julia
- BioJulia
- Julia Updates '15

About Me

- Graduate student at the University of Tokyo.
- About two years of experience programming in Julia.
- Contributing to Julia and its ecosystem:
    - https://github.com/docopt/DocOpt.jl
    - https://github.com/bicycle1885/IntArrays.jl
    - https://github.com/BioJulia/IndexableBitVectors.jl
    - https://github.com/BioJulia/WaveletMatrices.jl
    - https://github.com/BioJulia/FMIndexes.jl
    - https://github.com/isagalaev/highlight.js (Julia support)
    - etc.
- Core developer of BioJulia - https://github.com/BioJulia/Bio.jl
- Julia Summer of Code 2015 student - http://julialang.org/blog/2015/10/biojulia-sequence-analysis/

JuliaCon 2015 at MIT, Boston

https://twitter.com/acidflask/status/633349038226690048


Julia is ...

"Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. Julia’s Base library, largely written in Julia itself, also integrates mature, best-of-breed open source C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing."

Source: http://julialang.org/

Two-Language Problem

In technical computing, users write in scripting languages that are easier but slower, while developers write in compiled languages that are faster but harder.

With Julia, both users and developers can use one convenient language without sacrificing performance.

Three Virtues of the Julia Language

- Simple
- Fast
- Dynamic

Simple

- Syntax with least astonishment
    - no semicolons
    - no variable declarations
    - no argument types
    - Unicode support
    - 1-based indexing
    - blocks end with the end keyword
- No implicit type conversion
- Quicksort in 24 lines

    quicksort(xs) = quicksort!(copy(xs))
    quicksort!(xs) = quicksort!(xs, 1, endof(xs))

    function quicksort!(xs, lo, hi)
        if lo < hi
            p = partition(xs, lo, hi)
            quicksort!(xs, lo, p - 1)
            quicksort!(xs, p + 1, hi)
        end
        return xs
    end

    function partition(xs, lo, hi)
        pivot = div(lo + hi, 2)
        pvalue = xs[pivot]
        xs[pivot], xs[hi] = xs[hi], xs[pivot]
        j = lo
        @inbounds for i in lo:hi-1
            if xs[i] ≤ pvalue
                xs[i], xs[j] = xs[j], xs[i]
                j += 1
            end
        end
        xs[j], xs[hi] = xs[hi], xs[j]
        return j
    end

Fast

Comparable performance to compiled languages.

http://julialang.org/

The LLVM-backed JIT compiler emits machine code at runtime.

    julia> 4 >> 1  # bitwise right-shift function
    2

    julia> @code_native 4 >> 1
            .section __TEXT,__text,regular,pure_instructions
    Filename: int.jl
    Source line: 115
            pushq %rbp
            movq %rsp, %rbp
            movl $63, %ecx
            cmpq $63, %rsi
    Source line: 115
            cmovbeq %rsi, %rcx
            sarq %cl, %rdi
            movq %rdi, %rax
            popq %rbp
            ret

Dynamic

No need to precompile your program.

hello.jl:

    println("hello, world")

Output:

    $ julia hello.jl
    hello, world

In the REPL:

    julia> include("hello.jl")
    hello, world

High-level code generation at runtime (macros).

    julia> x = 5
    5

    julia> @assert x > 0 "x should be positive"

    julia> x = -2
    -2

    julia> @assert x > 0 "x should be positive"
    ERROR: AssertionError: x should be positive

    julia> macroexpand(:(@assert x > 0 "x should be positive"))
    :(if x > 0
            nothing
        else
            Base.throw(Base.Main.Base.AssertionError("x should be positive"))
        end)

Who Created?

Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman

"Soon the team was building their dream language. MIT, where Bezanson is a graduate student, became an anchor for the project, with much of the work being done within computer scientist and mathematician Alan Edelman’s research group. But development of the language remained completely distributed. “Jeff and I didn’t actually meet until we’d been working on it for over a year, and Viral was in India the entire time,” Karpinski says. “So the whole language was designed over email.”"

Source: "Out in the Open: Man Creates One Programming Language to Rule Them All" - http://www.wired.com/2014/02/julia/

Why Created?

"In short, because we are greedy."

Source: "Why We Created Julia" - http://julialang.org/blog/2012/02/why-we-created-julia/

The creators wanted a language that offers:

- the speed of C
- the dynamism of Ruby
- macros like Lisp
- mathematical notation like Matlab
- as usable for general programming as Python
- as easy for statistics as R
- as natural for string processing as Perl
- as powerful for linear algebra as Matlab
- as good at gluing programs together as the shell

Batteries Included

You can start technical computing without installing lots of libraries.

- Numeric types
    - {8, 16, 32, 64, 128}-bit {signed, unsigned} integers,
    - 16-, 32-, and 64-bit floating-point numbers,
    - and arbitrary-precision numbers.
- Numerical linear algebra
    - matrix multiplication, matrix decomposition/factorization, solvers for systems of linear equations, and more
    - sparse matrices
- Random number generator
    - Mersenne Twister method accelerated by SIMD
- Unicode support
- Perl-compatible regular expressions (PCRE)
- Parallel computing
- Dates and times
- Unit tests
- Profiler
- Package manager

Language Design


Literals

    # Int64
    42
    10_000_000

    # UInt8
    0x1f

    # Float64
    3.14
    6.022e23

    # Bool
    true
    false

    # UnitRange{Int64}
    1:100

    # ASCIIString
    "ascii string"

    # UTF8String
    "UTF8文字列"

    # Regex
    r"^>[^\n]+\n[ACGTN]+"

    # Array{Float64,1}
    # (Vector{Float64})
    [1.0, 1.1, 1.2]

    # Array{Float64,2}
    # (Matrix{Float64})
    [1.0 1.1; 2.0 2.2]

    # Tuple{Int,Float64,ASCIIString}
    (42, 3.14, "ascii string")

    # Dict{ASCIIString,Int64}
    Dict("one" => 1, "two" => 2)
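
The literal types listed above can be checked interactively. The sketch below assumes a 64-bit Julia 1.x installation rather than the 0.4 release used in these slides; the main visible difference is that ASCIIString and UTF8String have since been merged into a single String type.

```julia
# Each line prints `true` on a 64-bit Julia 1.x installation (an assumption;
# the slides themselves target Julia 0.4).
println(42 isa Int64)
println(0x1f isa UInt8)
println(6.022e23 isa Float64)
println((1:100) isa UnitRange{Int64})
println("UTF8文字列" isa String)   # one String type replaces ASCIIString/UTF8String
println(r"^>[^\n]+\n[ACGTN]+" isa Regex)
println([1.0, 1.1, 1.2] isa Vector{Float64})
println([1.0 1.1; 2.0 2.2] isa Matrix{Float64})
println((42, 3.14, "ascii string") isa Tuple{Int64,Float64,String})
println(Dict("one" => 1, "two" => 2) isa Dict{String,Int64})
```

typeof(x) returns the type itself when a literal is unfamiliar.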

Functions

All function definitions below are equivalent:

    function func(x, y)
        return x + y
    end

    function func(x, y)
        x + y
    end

    func(x, y) = return x + y

    func(x, y) = x + y

Force inlining:

    @inline func(x, y) = x + y

Note: this simple function would be inlined automatically by the compiler anyway.

Functions - Arguments

Optional arguments:

    function increment(x, by=1)
        return x + by
    end

    increment(3)     # 4
    increment(3, 2)  # 5

Keyword arguments (note the semicolon in the argument list):

    function increment(x; by=1)
        return x + by
    end

    increment(3)        # 4
    increment(3, by=2)  # 5

Variable number of arguments:

    function pushback!(list, vals...)
        for val in vals
            push!(list, val)
        end
        return list
    end

    pushback!([])        # []
    pushback!([], 1)     # [1]
    pushback!([], 1, 2)  # [1, 2]

Functions - Return Values

You can return multiple values from a function as a tuple:

    function divrem64(n)
        return n >> 6, n & 0b111111
    end

And you can receive the returned values with multiple assignment:

    julia> divrem64(1025)
    (16,1)

    julia> d, r = divrem64(1025)
    (16,1)

    julia> d
    16

    julia> r
    1

Functions - Documentation

A docstring can be attached to a function definition:

    """
    This function computes quotient and remainder
    divided by 64 for a non-negative integer.
    """
    function divrem64(n)
        return n >> 6, n & 0b111111
    end

In the REPL, you can read the attached documentation with the ? command:

    help?> divrem64
    search: divrem64 divrem

      This function computes quotient and remainder divided by 64 for a non-negative integer.

Types

Two kinds of types:

- concrete types: instantiable
- abstract types: not instantiable

Defining Types

Abstract type:

    abstract AbstractFloat <: Real

Composite type:

    # mutable
    type Point
        x::Float64
        y::Float64
    end

    # immutable
    immutable Point
        x::Float64
        y::Float64
    end

Bits type:

    bitstype 64 Int64 <: Signed

Type alias:

    typealias UInt UInt64

Enum:

    @enum Vote positive negative

Parametric Types

Types can take type parameters:

    type Point{T}
        x::T
        y::T
    end

- Point: abstract type
- Point{Int64}: concrete type
    - subtype of Point (Point{Int64} <: Point)
    - all of the members (i.e. x and y) are Int64s

    type NucleotideSequence{T<:Nucleotide} <: Sequence
        data::Vector{UInt64}
        ...
    end
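
The relationship between a parametric type and its instantiations can be checked directly. This is a minimal sketch assuming a current Julia 1.x release, where struct replaces the type and immutable keywords shown in these slides; Point2 is a hypothetical name chosen to avoid clashing with the Point definitions above.

```julia
# Parametric composite type in Julia 1.x syntax (an assumption; the slides use 0.4).
struct Point2{T}
    x::T
    y::T
end

p = Point2(1, 2)            # T is inferred from the arguments
q = Point2{Float64}(1, 2)   # arguments are converted to Float64

println(typeof(p))                    # concrete type of p
println(typeof(p) <: Point2)          # every Point2{T} is a subtype of Point2
println(isconcretetype(Point2{Int}))  # true: can be instantiated
println(isconcretetype(Point2))       # false: T is still free
println(isabstracttype(Real))         # true: declared abstract, never instantiated
```

Only concrete types such as Point2{Int} have instances; Point2 on its own acts as the common supertype of all its parameterizations.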

Constructors

Julia automatically generates default constructors.

- Point(1, 2) creates an object of type Point{Int}.
- Point(1.0, 2.0) creates an object of type Point{Float64}.
- Point{Float64}(1, 2) creates an object of type Point{Float64}.

Users can define custom constructors.

    type Point{T}
        x::T
        y::T
    end

    # outer constructor
    function Point(x)
        return Point(x, x)
    end

    p = Point(1)  #> Point{Int64}(1, 1)

Memory Layout

- Compact memory layout like C's structs
    - C-compatible memory layout
    - You can pass Julia objects to C functions without copying.
- This is especially important in bioinformatics
    - when defining data structures for efficient algorithms
    - when handling lots of small objects

    julia> @enum Strand forward reverse both unknown

    julia> immutable Exon
               chrom::Int
               start::Int
               stop::Int
               strand::Strand
           end

    julia> sizeof(Exon(1, 12345, 12446, forward))
    32
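
The effect of the compact layout on collections of small objects can be sketched as follows, assuming a 64-bit Julia 1.x release. StrandSketch, ExonSketch, and the enum member names are hypothetical stand-ins for the Strand and Exon definitions above.

```julia
# Hypothetical stand-ins for the slide's Strand and Exon (Julia 1.x syntax).
@enum StrandSketch plus_strand minus_strand both_strands unknown_strand

struct ExonSketch          # `immutable Exon` in the 0.4 syntax of the slides
    chrom::Int64
    start::Int64
    stop::Int64
    strand::StrandSketch
end

exons = [ExonSketch(1, 100i, 100i + 50, plus_strand) for i in 1:1000]

println(isbitstype(ExonSketch))                      # plain bits, no pointers inside
println(sizeof(ExonSketch))                          # bytes per object
println(sizeof(exons) == 1000 * sizeof(ExonSketch))  # elements are stored inline
```

Because the elements are stored inline, a vector of a thousand exons is one contiguous buffer rather than a thousand separately allocated objects.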

Multiple Dispatch

The combination of all argument types determines which method is called.

Single dispatch (e.g. Python): the first argument is special and determines the method.

    class Serializer:
        def write(self, val):
            if isinstance(val, int):
                # ...
            elif isinstance(val, float):
                # ...
            # ...

Multiple dispatch (e.g. Julia): all arguments are equally responsible for determining the method.

    function write(dst::Serializer, val::Int64)
        # ...
    end

    function write(dst::Serializer, val::Float64)
        # ...
    end

    # ...
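
A runnable toy version of the idea, assuming Julia 1.x: combine is a hypothetical function whose method is chosen from the types of both arguments, with an untyped fallback.

```julia
# Hypothetical toy function; each definition adds a method to the same function.
combine(x::Integer, y::Integer) = "integer, integer"
combine(x::Integer, y::AbstractFloat) = "integer, float"
combine(x::AbstractFloat, y::Integer) = "float, integer"
combine(x, y) = "fallback"   # same as combine(x::Any, y::Any)

println(combine(1, 2))
println(combine(1, 2.0))
println(combine(2.0, 1))
println(combine("a", 1))          # no specific method matches, so the fallback runs
println(length(methods(combine)))
```

Swapping the argument order selects a different method, which a dispatch rule based only on the first argument could not express without explicit type checks.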

Multiple Dispatch - Example (1)

base/char.jl:

    -(x::Char, y::Char) = Int(x) - Int(y)
    -(x::Char, y::Integer) = Char(Int32(x) - Int32(y))
    +(x::Char, y::Integer) = Char(Int32(x) + Int32(y))
    +(x::Integer, y::Char) = y + x

    julia> 'c' - 'a'
    2

    julia> 'c' - 1
    'b'

    julia> 'a' + 0x01
    'b'

    julia> 0x01 + 'a'
    'b'

Multiple Dispatch - Example (2)

    function has{T<:Integer}(range::UnitRange{Int}, target::T)
        return first(range) ≤ target ≤ last(range)
    end

    function has(iter, target)  # same as has(iter::Any, target::Any)
        for elm in iter
            if elm == target
                return true
            end
        end
        return false
    end

    julia> has(1:10, 4)
    true

    julia> has(1:10, -2)
    false

    julia> has([1,2,3], 2)
    true

Metaprogramming

Julia can represent its own program code as a data structure (Expr).

Three metaprogramming components in Julia:

- Macros: generate an expression from expressions (Expr ↦ Expr).
- Generated functions: generate an expression from types (Types ↦ Expr).
- Non-standard string literals: generate an expression from a string (String ↦ Expr).
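
To make "code as a data structure" concrete, here is a small sketch (assuming Julia 1.x) that quotes an expression, inspects its parts, rebuilds it by hand, and evaluates it.

```julia
ex = :(1 + 3 * 2)          # quoting yields an Expr instead of evaluating

println(typeof(ex))        # Expr
println(ex.head)           # the kind of expression (a function call here)
println(ex.args)           # the function symbol followed by its arguments

# The same expression built by hand from its parts.
built = Expr(:call, :+, 1, Expr(:call, :*, 3, 2))
println(built == ex)

println(eval(ex))          # evaluates the expression
```

Macros, generated functions, and string macros all return Expr values of this kind; they differ only in what they receive as input.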

Metaprogramming - Macros

- Generate an expression from expressions (Expr ↦ Expr).
- Denoted as @<macro name>.
    - Distinguishable from function calls
- We've already seen some macros.

    macro assert(ex)
        msg = string(ex)
        :($ex ? nothing : throw(AssertionError($msg)))
    end

    julia> x = -1
    -1

    julia> @assert x > 1
    ERROR: AssertionError: x > 1

Metaprogramming - Useful Macros (1)

@show: print variables, useful for debugging:

    julia> x = -1
    -1

    julia> @show x
    x = -1

@inbounds: skip bounds checking:

    @inbounds h[i,j] = h[i-1,j-1] + submat[a[i],b[j]]

@which: show which method will be called:

    julia> @which max(1, 2)
    max{T<:Real}(x::T<:Real, y::T<:Real) at promotion.jl:239

Metaprogramming - Useful Macros (2)

@time: measure the time taken to evaluate an expression (the first call includes JIT compilation time):

    julia> xs = rand(1_000_000);

    julia> @time sum(xs)
      0.022633 seconds (27.24 k allocations: 1.155 MB)
    499795.2805424741

    julia> @time sum(xs)
      0.000574 seconds (5 allocations: 176 bytes)
    499795.2805424741

@profile: profile the expression:

    julia> sort(xs); @profile sort(xs);

    julia> Profile.print()
    69 REPL.jl; anonymous; line: 92
     68 REPL.jl; eval_user_input; line: 62
    ...

Generated Functions

- Generate specialized program code for the argument types (Type(s) ↦ Expr).
- Called like an ordinary function.
    - The syntax is indistinguishable at the call site.

    @generated function _sub2ind{N,M}(dims::NTuple{N,Integer}, subs::NTuple{M,Integer})
        meta = Expr(:meta, :inline)
        ex = :(subs[$M] - 1)
        for i = M-1:-1:1
            if i > N
                ex = :(subs[$i] - 1 + $ex)
            else
                ex = :(subs[$i] - 1 + dims[$i]*$ex)
            end
        end
        Expr(:block, meta, :($ex + 1))
    end
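
A smaller generated function may make the mechanism easier to follow. This sketch assumes Julia 1.x syntax (where {N,T} instead of the 0.4 form above); unrolled_sum is a hypothetical example, not a Base function.

```julia
# Sum a homogeneous tuple with code unrolled for the tuple's length.
@generated function unrolled_sum(t::NTuple{N,T}) where {N,T}
    # Inside the generator, `t` is bound to the argument *type*; N and T are known.
    ex = :(zero(T))
    for i in 1:N
        ex = :($ex + t[$i])   # grows into zero(T) + t[1] + t[2] + ...
    end
    return ex                 # this expression becomes the method body
end

println(unrolled_sum((1, 2, 3)))
println(unrolled_sum((1.5, 2.5)))
```

The loop runs once per argument type while the code is generated, so the compiled method body contains no loop at all.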

Non-standard String Literals

- Generate an expression from a string (String ↦ Expr).
- Denoted as <literal name>"...".
- The regular expression literal (e.g. r"^>[^\n]+\n[ACGTN]+") is an example.
- In Bio.jl, dna"ACGT" is converted to a DNASequence object.

    macro r_str(s)
        Regex(s)
    end

    # Regex object
    r"^>[^\n]+\n[ACGTN]+"

    # DNASequence object
    dna"ACGT"
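
Defining a new literal only requires a macro whose name ends in _str. The following sketch assumes Julia 1.x; gc_str is a hypothetical macro, unrelated to Bio.jl, that computes the GC fraction of the literal once, at macro-expansion time.

```julia
# A macro named `gc_str` is invoked with the literal syntax gc"...".
# Hypothetical example: the returned number is spliced into the code as a constant.
macro gc_str(s)
    n = count(c -> c in ('G', 'C'), s)
    return n / length(s)
end

println(gc"ACGT")
println(gc"GGCC")
```

The work happens when the code is parsed and expanded, so using such a literal inside a hot loop costs nothing at runtime.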

Modules

- Modules are namespaces.
- Names defined directly under a module are its global names.
- The import/export system lets modules share names.

    module Foo

    export foo, gvar

    # function
    foo() = println("hello, foo")
    bar() = println("hello, bar")

    # global variable
    const gvar = 42

    end

    Foo.foo()
    Foo.bar()
    Foo.gvar

    import Foo: foo
    foo()

    import Foo: bar
    bar()

    using Foo
    foo()
    gvar

Packages

- A package manager is bundled with Julia.
    - There is no other package manager; this is the standard one.
    - The package manager can build, install, and create packages.
    - Almost all packages are hosted on GitHub.
- Registered packages
    - Registered packages are public packages that can be installed by name.
    - List: http://pkg.julialang.org/
    - Repository: https://github.com/JuliaLang/METADATA.jl

Packages - Management

- The package manager is accessible from the REPL.
    - Pkg.update(): update registered package data and upgrade packages
- How a package is installed depends on whether it is registered.
    - Pkg.add(<package>): install a registered package
    - Pkg.clone(<url>): install a package from a git URL

    julia> Pkg.update()

    julia> Pkg.add("DocOpt")

    julia> Pkg.clone("git@github.com:docopt/DocOpt.jl.git")

Packages - Create a Package

- A package template can be generated with Pkg.generate(<package>).
    - This generates a standard scaffold for developing a new package.
    - Generated packages are located in ~/.julia/v0.4/.
- Pkg.tag(<package>, <version>) tags the current commit of the package with a version.
    - This tag is treated as a release of the package.
    - Developers should follow Semantic Versioning.
        - major: incompatible API changes
        - minor: backwards-compatible functionality additions
        - patch: backwards-compatible bug fixes

    julia> Pkg.generate("DocOpt")

    julia> Pkg.tag("DocOpt", :patch)  # patch update
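
The three kinds of version bump can be illustrated with Julia's built-in VersionNumber type. This is a sketch assuming Julia 1.x; bump_patch, bump_minor, and bump_major are hypothetical helpers and not part of Pkg.

```julia
current = v"1.4.2"    # VersionNumber literal (itself a non-standard string literal)

# Hypothetical helpers mirroring the three kinds of release.
bump_patch(v) = VersionNumber(v.major, v.minor, v.patch + 1)
bump_minor(v) = VersionNumber(v.major, v.minor + 1, 0)
bump_major(v) = VersionNumber(v.major + 1, 0, 0)

println(bump_patch(current))
println(bump_minor(current))
println(bump_major(current))
println(v"1.4.2" < v"1.10.0")   # components compare numerically, not as text
```

Comparing versions as VersionNumber values avoids the mistake of ordering "1.10.0" before "1.4.2" as plain strings would.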

BioJulia

Collaborative project to build bioinformatics infrastructure for Julia.

Packages:

- Bio.jl - https://github.com/BioJulia/Bio.jl
- Other packages - https://github.com/BioJulia

BioJulia - Basic Principles

- BioJulia will be fast.
- All contributions undergo code review.
- We'll design it to suit modern bioinformatics and Julia, not just copy other Bio-projects.

https://github.com/BioJulia/Bio.jl/wiki/roadmap

Bio.jl

- Major modules:
    - Bio.Seq: biological sequences
    - Bio.Intervals: genomic intervals
    - Bio.Align: sequence alignments (coming soon!)
    - Bio.Phylo: phylogenetics (coming soon!)
- Under (active!) development.

Sequences

- Sequence types are defined in the Bio.Seq module:
    - DNASequence, RNASequence, AminoAcidSequence, Kmer

    julia> using Bio.Seq

    julia> dna"ACGTN"  # non-standard string literal
    5nt DNA Sequence ACGTN

    julia> rna"ACGUN"
    5nt RNA Sequence ACGUN

    julia> aa"ARNDCWYV"
    8aa Sequence: ARNDCWYV

    julia> kmer(dna"ACGT")
    DNA 4-mer: ACGT

Sequences - Packed Nucleotides

A/C/G/T are packed into an array with 2-bit encoding (+1 bit for N).

    type NucleotideSequence{T<:Nucleotide} <: Sequence
        data::Vector{UInt64}  # 2-bit encoded sequence
        ns::BitVector         # 'N' mask
        ...
    end

In Kmer, nucleotides are packed into a 64-bit type.

    bitstype 64 Kmer{T<:Nucleotide, K}
    typealias DNAKmer{K} Kmer{DNANucleotide, K}
    typealias RNAKmer{K} Kmer{RNANucleotide, K}
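
The 2-bit packing idea can be sketched independently of Bio.jl. The code below assumes Julia 1.x; the mapping A=0, C=1, G=2, T=3 and the names NUC_CODES, NUC_BASES, pack, and getbase are illustrative assumptions, not Bio.jl's actual encoding or API, and the separate 'N' mask is left out.

```julia
# Assumed mapping for illustration: A=00, C=01, G=10, T=11.
const NUC_CODES = Dict('A' => UInt64(0), 'C' => UInt64(1), 'G' => UInt64(2), 'T' => UInt64(3))
const NUC_BASES = ('A', 'C', 'G', 'T')

# Pack a string over A/C/G/T into 64-bit words, 32 nucleotides per word.
# Any other character raises a KeyError from the dictionary lookup.
function pack(seq::AbstractString)
    n = length(seq)
    data = zeros(UInt64, cld(n, 32))
    for (i, c) in enumerate(seq)
        word, offset = divrem(i - 1, 32)
        data[word + 1] |= NUC_CODES[c] << (2 * offset)
    end
    return data, n
end

# Read back the i-th nucleotide (1-based) from the packed words.
function getbase(data::Vector{UInt64}, i::Integer)
    word, offset = divrem(i - 1, 32)
    code = (data[word + 1] >> (2 * offset)) & 0x03
    return NUC_BASES[Int(code) + 1]
end

packed, n = pack("ACGTTGCA")
println(join(getbase(packed, i) for i in 1:n))
println(sizeof(packed))   # bytes used by the packed words
```

Eight nucleotides fit in a single 64-bit word here, whereas a one-byte-per-character string needs four times as much memory for long sequences.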

Sequences - Immutable by Convention

- Sequences are immutable by convention.
- No copy is made when creating a subsequence from an existing sequence.

    julia> seq = dna"ACGTATG"
    7nt DNA Sequence ACGTATG

    julia> seq[2:4]
    3nt DNA Sequence CGT

    # internal data is shared between
    # the original and its subsequences
    julia> seq.data === seq[2:4].data
    true

Intervals

- Genomic interval types are defined in the Bio.Intervals module:
    - Interval{T}: T is the type of the metadata attached to the interval.

    type Interval{T} <: AbstractInterval{Int64}
        seqname::StringField
        first::Int64
        last::Int64
        strand::Strand
        metadata::T
    end

This is useful when annotating a genomic range:

    julia> using Bio.Intervals

    julia> Interval("chr2", 5692667, 5701385, '+', "SOX11")
    chr2:5692667-5701385 + SOX11

Intervals - Indexed Collections

A set of intervals can be indexed with IntervalCollection:

    immutable CDS; gene::ASCIIString; index::Int; end

    ivals = IntervalCollection{CDS}()
    push!(ivals, Interval("chr6", 156777930, 156779471, '+', CDS("ARID1B", 1)))
    push!(ivals, Interval("chr6", 156829227, 156829421, '+', CDS("ARID1B", 2)))
    push!(ivals, Interval("chr6", 156901376, 156901525, '+', CDS("ARID1B", 3)))

intersect iterates over intersecting intervals:

    julia> query = Interval("chr6", 156829200, 156829300);

    julia> for i in intersect(ivals, query)
               println(i)
           end
    chr6:156829227-156829421 + CDS("ARID1B",2)
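
What an intersection query computes can be shown with a linear scan. This sketch assumes Julia 1.x; SimpleInterval, overlaps, and intersecting are hypothetical stand-ins written for illustration and do not reflect how IntervalCollection is implemented.

```julia
# Hypothetical stand-in for Bio.Intervals.Interval, for illustration only.
struct SimpleInterval{T}
    seqname::String
    first::Int
    last::Int
    metadata::T
end

# Closed intervals overlap when they share a sequence name and
# neither one ends before the other starts.
overlaps(a::SimpleInterval, b::SimpleInterval) =
    a.seqname == b.seqname && a.first <= b.last && b.first <= a.last

# Linear scan over every interval; an index exists to avoid exactly this.
intersecting(ivals, query) = [i for i in ivals if overlaps(i, query)]

cds = [
    SimpleInterval("chr6", 156777930, 156779471, ("ARID1B", 1)),
    SimpleInterval("chr6", 156829227, 156829421, ("ARID1B", 2)),
    SimpleInterval("chr6", 156901376, 156901525, ("ARID1B", 3)),
]
query = SimpleInterval("chr6", 156829200, 156829300, nothing)

for i in intersecting(cds, query)
    println(i.seqname, ":", i.first, "-", i.last, " ", i.metadata)
end
```

The scan visits every interval for every query, which is why a dedicated indexed collection matters once there are many intervals.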

Parsers

- Parsers are generated with the Ragel state machine compiler.
    - Finite state machines are described in a regular language.
    - The Ragel compiler generates pure Julia programs.
    - Actions can be injected into the state transitions.
- The next Ragel release (v7) will ship with the Julia generator.
    - http://www.colm.net/open-source/ragel/

Parsers - FASTA

    <name> = <expression> > <entering action> % <leaving action>;

FASTA parser:

    newline     = '\r'? '\n'     >count_line;
    hspace      = [ \t\v];
    whitespace  = space | newline;

    identifier  = (any - space)+            >mark  %identifier;
    description = ((any - hspace) [^\r\n]*) >mark  %description;
    letters     = (any - space - '>')+      >mark  %letters;
    sequence    = whitespace* letters? (whitespace+ letters)*;
    fasta_entry = '>' identifier (hspace+ description)? newline sequence whitespace*;

    main := whitespace* (fasta_entry %finish_match)**;

- https://github.com/BioJulia/Bio.jl/blob/master/src/seq/fasta.rl
- https://github.com/BioJulia/Bio.jl/blob/master/src/seq/fasta.jl
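
For comparison with the grammar above, here is a hand-written line-based reader, assuming Julia 1.x. It is not the Ragel-generated parser and not Bio.jl's API; read_fasta is a hypothetical function that only mirrors the identifier, description, and letters parts of the grammar.

```julia
# Read FASTA records from any IO and return (identifier, description, sequence) tuples.
function read_fasta(io::IO)
    records = Tuple{String,String,String}[]
    name, desc, seq = "", "", IOBuffer()
    started = false
    for line in eachline(io)
        line = strip(line)
        isempty(line) && continue
        if startswith(line, ">")
            # A new header closes the previous record, if there was one.
            started && push!(records, (name, desc, String(take!(seq))))
            parts = split(line[2:end], limit=2)
            name = isempty(parts) ? "" : String(parts[1])
            desc = length(parts) > 1 ? String(parts[2]) : ""
            started = true
        elseif started
            print(seq, line)   # sequence letters may span several lines
        end
    end
    started && push!(records, (name, desc, String(take!(seq))))
    return records
end

input = IOBuffer(">seq1 first record\nACGT\nNNAC\n\n>seq2\nTTGA\n")
for (name, desc, seq) in read_fasta(input)
    println(name, " | ", desc, " | ", seq)
end
```

A state machine generated from the grammar processes the input byte by byte without building intermediate line strings, which is one reason to generate the parser instead of writing it this way.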

Parsers - Fast

Ragel can generate fast parsers.

    julia> @time for rec in open("hg38.fa", FASTA)
               println(rec)
           end
    >chr1
    248956422nt Mutable DNA Sequence NNNNNNNNNNNNNNNNNNNNNNN…NNNNNNNNNNNNNNNNNNNNNNNN
    >chr10
    133797422nt Mutable DNA Sequence NNNNNNNNNNNNNNNNNNNNNNN…NNNNNNNNNNNNNNNNNNNNNNNN

    # ...

    >chrY_KI270740v1_random
    37240nt Mutable DNA Sequence TAATAAATTTTGAAGAAAATGAA…GAATGAAGCTGCAGACATTTACGG
     32.198314 seconds (174.92 k allocations: 1.464 GB, 1.14% gc time)

Alignments

The Bio.Align module supports various pairwise alignment types.

- Score maximization:
    - GlobalAlignment
    - SemiGlobalAlignment
    - OverlapAlignment
    - LocalAlignment
- Cost minimization:
    - EditDistance
    - LevenshteinDistance
    - HammingDistance

Alignments - Simple Interfaces (1)

    julia> affinegap = AffineGapScoreModel(match=5, mismatch=-4, gap_open=-3, gap_extend=-2);

    julia> pairalign(GlobalAlignment(), dna"ATGGTGACT", dna"ACGTGCCCT", affinegap)
    PairwiseAlignment{Int64,Bio.Seq.NucleotideSequence{Bio.Seq.DNANucleotide},Bio.Seq.NucleotideSequence{Bio.Seq.DNANucleotide}}:
      score: 12
      seq: ATGGTGAC-T
           | | || | |
      ref: ACG-TGCCCT

Alignments - Simple Interfaces (2)

    pairalign(<type>, <seq1>, <seq2>, <score/cost model>)

    pairalign(GlobalAlignment(), a, b, model)
    pairalign(SemiGlobalAlignment(), a, b, model)
    pairalign(OverlapAlignment(), a, b, model)
    pairalign(LocalAlignment(), a, b, model)
    pairalign(EditDistance(), a, b, model)
    pairalign(LevenshteinDistance(), a, b)
    pairalign(HammingDistance(), a, b)

Alignment options:

    pairalign(GlobalAlignment(), a, b, model, banded=true)
    pairalign(GlobalAlignment(), a, b, model, score_only=true)
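
As an illustration of the cost-minimization family, the Levenshtein distance can be computed with a two-row dynamic program. This is a textbook sketch assuming Julia 1.x, not Bio.Align's implementation; levenshtein is a hypothetical function name.

```julia
# Levenshtein distance: minimum number of substitutions, insertions, and deletions
# needed to turn `a` into `b`, keeping only two rows of the DP table.
function levenshtein(a::AbstractString, b::AbstractString)
    x, y = collect(a), collect(b)
    prev = collect(0:length(y))          # distances from the empty prefix of a
    curr = similar(prev)
    for i in 1:length(x)
        curr[1] = i
        for j in 1:length(y)
            cost = x[i] == y[j] ? 0 : 1
            curr[j + 1] = min(prev[j] + cost,   # match or substitution
                              prev[j + 1] + 1,  # deletion from a
                              curr[j] + 1)      # insertion into a
        end
        prev, curr = curr, prev
    end
    return prev[end]
end

println(levenshtein("kitten", "sitting"))
println(levenshtein("ATGGTGACT", "ACGTGCCCT"))
```

Score-maximizing alignments such as GlobalAlignment fill a table of the same shape, with a substitution matrix and gap penalties in place of unit costs.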

Alignments - Speed

Global alignment of titin sequences (human and mouse):

    affinegap = AffineGapScoreModel(BLOSUM62, -10, -1)
    a = first(open("Q8WZ42.fasta", FASTA)).seq
    b = first(open("A2ASS6.fasta", FASTA)).seq
    @time aln = pairalign(
        GlobalAlignment(),
        Vector{AminoAcid}(a),
        Vector{AminoAcid}(b),
        affinegap,
    )
    println(score(aln))

      8.012499 seconds (601.99 k allocations: 1.155 GB, 0.09% gc time)
    165611

vs. R (Biostrings):

    library(Biostrings, quietly=T)
    a = readAAStringSet("Q8WZ42.fasta")[[1]]
    b = readAAStringSet("A2ASS6.fasta")[[1]]
    t0 = proc.time()
    aln = pairwiseAlignment(a, b, type="global",
                            substitutionMatrix="BLOSUM62",
                            gapOpening=10, gapExtension=1)
    t1 = proc.time()
    print(t1 - t0)
    print(score(aln))

       user  system elapsed
     14.042   1.233  15.475

Indexable Bit Vectors

- Bit vectors that support bit counting in constant time.
    - rank1(bv, i): count the number of 1 bits within bv[1:i].
    - rank0(bv, i): count the number of 0 bits within bv[1:i].
- A fundamental building block for other data structures.
    - WaveletMatrix, a generalization of the indexable bit vector, depends on this data structure.
    - 'N' nucleotides in a reference sequence can be compressed using this data structure.

    julia> bv = SucVector(bitrand(10_000_000));

    julia> rank1(bv, 9_000_000);  # precompile

    julia> @time rank1(bv, 9_000_000)
      0.000006 seconds (149 allocations: 10.167 KB)
    4502258

Indexable Bit Vectors - Internals

A bit vector is divided into 256-bit large blocks, and each large block is divided into 64-bit small blocks:

    immutable Block
        # large block
        large::UInt32
        # small blocks
        smalls::NTuple{4,UInt8}
        # bit chunks (64bits × 4 = 256bits)
        chunks::NTuple{4,UInt64}
    end

Each block has a cache that counts the number of 1s.
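
The cached-count idea can be sketched with a simpler layout than the two-level Block above. The code assumes Julia 1.x; RankIndex is a hypothetical type that keeps one cumulative count per 64-bit chunk, so a rank query is one table lookup plus one popcount.

```julia
# Hypothetical, simplified layout: one cumulative count per 64-bit chunk.
struct RankIndex
    chunks::Vector{UInt64}   # raw bits, least-significant bit first
    cum::Vector{Int}         # cum[k] = number of 1s in chunks[1:k-1]
    len::Int
end

function RankIndex(bits::AbstractVector{Bool})
    n = length(bits)
    chunks = zeros(UInt64, cld(n, 64))
    for (i, b) in enumerate(bits)
        b || continue
        k, off = divrem(i - 1, 64)
        chunks[k + 1] |= UInt64(1) << off
    end
    cum = zeros(Int, length(chunks) + 1)
    for k in 1:length(chunks)
        cum[k + 1] = cum[k] + count_ones(chunks[k])
    end
    return RankIndex(chunks, cum, n)
end

# Number of 1 bits in positions 1..i: one cached count plus one popcount.
function rank1(idx::RankIndex, i::Integer)
    i = min(i, idx.len)
    i <= 0 && return 0
    k, off = divrem(i - 1, 64)               # 0-based chunk index, bit offset in chunk
    mask = typemax(UInt64) >> (63 - off)     # keeps bits 0..off of the chunk
    return idx.cum[k + 1] + count_ones(idx.chunks[k + 1] & mask)
end

rank0(idx::RankIndex, i::Integer) = min(max(i, 0), idx.len) - rank1(idx, i)

idx = RankIndex(Bool[1, 0, 1, 1, 0, 0, 1, 0])
println(rank1(idx, 4))
println(rank0(idx, 8))
```

Storing a full Int per chunk is wasteful; splitting the counts into a large per-block count and small per-chunk offsets, as in the Block layout above, keeps the cache compact.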

FM-Indexes

- Index for full-text search.
    - Fast, compact, and often used in short-read sequence mappers (Bowtie2, BWA, etc.).
- Product of Julia Summer of Code 2015
    - https://github.com/BioJulia/FMIndexes.jl
- This package is not specialized for biological sequences.
    - FMIndexes.jl does not depend on Bio.jl.
    - The JIT compiler can optimize code for a specific type at runtime.

    julia> fmindex = FMIndex(dna"ACGTATTGACTGTA");

    julia> count(dna"TA", fmindex)
    2

    julia> count(dna"TATT", fmindex)
    1
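
How counting works inside an FM-index can be sketched with a naive construction, assuming Julia 1.x. build_fm and count_occurrences are hypothetical names; the suffix array is built by sorting suffixes directly and the occurrence table is stored uncompressed, which FMIndexes.jl does not do.

```julia
# Build a naive FM-index: BWT from a sorted suffix array, plus C and occurrence tables.
function build_fm(text::AbstractString)
    s = collect(text * '$')                          # sentinel, smaller than A/C/G/T
    n = length(s)
    sa = sort(collect(1:n), by = i -> s[i:end])      # naive suffix array
    bwt = [s[i == 1 ? n : i - 1] for i in sa]        # symbol preceding each suffix
    alphabet = sort(unique(s))
    C = Dict(c => count(<(c), s) for c in alphabet)              # symbols smaller than c
    occ = Dict(c => [0; cumsum(bwt .== c)] for c in alphabet)    # occ[c][k+1] = #c in bwt[1:k]
    return (C = C, occ = occ, n = n)
end

# Backward search: narrow the suffix-array interval one pattern symbol at a time,
# starting from the last symbol of the pattern.
function count_occurrences(pattern::AbstractString, fm)
    lo, hi = 1, fm.n
    for c in reverse(collect(pattern))
        haskey(fm.C, c) || return 0
        lo = fm.C[c] + fm.occ[c][lo] + 1
        hi = fm.C[c] + fm.occ[c][hi + 1]
        lo > hi && return 0
    end
    return hi - lo + 1
end

fm = build_fm("ACGTATTGACTGTA")
println(count_occurrences("TA", fm))
println(count_occurrences("TATT", fm))
```

The search loop touches the index once per pattern symbol, so counting does not depend on the length of the indexed text; a rank structure such as the indexable bit vector is what makes the occurrence lookups compact in practice.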

FM-Indexes - Queries

Create an FM-Index for chromosome 22:

    julia> fmindex = FMIndex(first(open("chr22.fa", FASTA)).seq);

count(pattern, index): count the number of occurrences of pattern:

    julia> count(dna"ACGT", fmindex)
    37672

    julia> count(dna"ACGTACGT", fmindex)
    42

locate(pattern, index): locate positions of pattern:

    # locate returns an iterator
    julia> locate(dna"ACGTACGT", fmindex) |> collect
    42-element Array{Any,1}:
     20774876
            ⋮
     22729149

    # locateall returns an array
    julia> locateall(dna"ACGTACGT", fmindex)
    42-element Array{Int64,1}:
     20774876
            ⋮
     22729149

Other Julia Orgs You Should Know

- Statistics - JuliaStats https://github.com/JuliaStats
    - https://github.com/JuliaStats/StatsBase.jl
    - https://github.com/JuliaStats/DataFrames.jl
    - https://github.com/JuliaStats/Clustering.jl
    - https://github.com/JuliaStats/Distributions.jl
    - https://github.com/JuliaStats/MultivariateStats.jl
    - https://github.com/JuliaStats/NullableArrays.jl
    - https://github.com/JuliaStats/GLM.jl
- Optimization - JuliaOpt https://github.com/JuliaOpt
    - https://github.com/JuliaOpt/JuMP.jl
    - https://github.com/JuliaOpt/Optim.jl
    - https://github.com/JuliaOpt/Convex.jl
- Graphs - JuliaGraphs https://github.com/JuliaGraphs
    - https://github.com/JuliaGraphs/LightGraphs.jl
- Database - JuliaDB https://github.com/JuliaDB
    - https://github.com/JuliaDB/SQLite.jl
    - https://github.com/JuliaDB/PostgreSQL.jl

Julia Updates '15

- Julia Computing Inc. was founded.
    - "Why the creators of the Julia programming language just launched a startup" - http://venturebeat.com/2015/05/18/why-the-creators-of-the-julia-programming-language-just-launched-a-startup/
- The Moore Foundation granted Julia Computing $600,000.
    - "Bringing Julia from beta to 1.0 to support data-intensive, scientific computing" - https://www.moore.org/newsroom/in-the-news/2015/11/10/bringing-julia-from-beta-to-1.0-to-support-data-intensive-scientific-computing
- Multi-threading support
    - https://github.com/JuliaLang/julia/pull/13410
- Intel released ParallelAccelerator.jl
    - https://github.com/IntelLabs/ParallelAccelerator.jl