
Upload: bioinformaticsinstitute

Post on 20-Feb-2017


Page 1: Knime & bioinformatics

Copyright © 2015 KNIME.com AG

A Bioinformatician in the Far-Away Kingdom, or the Two Programmers from the KNIME Magic Chest

Oleg Yasnev

KNIME.com

Page 2: Knime & bioinformatics


So you're going to write the code for me, too?

Yep!

Still from the animated film “Vovka in the Far-Away Kingdom” © Soyuzmultfilm

Page 3: Knime & bioinformatics

KNIME.com

Page 4: Knime & bioinformatics


KNIME.com

• KNIME.com founded in 2008

• Offices in Zurich, San Francisco (Aug ‘13), Berlin (May ‘14) and Konstanz (October ‘15)

• 15 open source releases, 10 product releases (in 2014)

• >2m lines of code

• 600k lines of community code


Page 5: Knime & bioinformatics


Advanced Analytics

Pharma

Health Care

Finance

Retail

Customer Intelligence

Manufacturing

Broad Range of KNIME Application Areas

Page 6: Knime & bioinformatics

The KNIME Analytics Platform

Page 7: Knime & bioinformatics


From Access to Visualization and Deployment

Page 8: Knime & bioinformatics

Data Access

• Databases

– MySQL, PostgreSQL

– any JDBC (Oracle, DB2, MS SQL Server)

• Files

– CSV, txt

– Excel, Word, PDF

– SAS, SPSS

– XML

– PMML

– Images, texts, networks, chem

• Web, Cloud

– REST, Web services

– Twitter, Google

Page 9: Knime & bioinformatics


Big Data

• HDFS support

• Hive

• Impala

• HP Vertica

• In-database processing

Page 10: Knime & bioinformatics


Transformation

• Preprocessing

– Row, column, matrix based

• Data blending

– Join, concatenate, append

• Aggregation

– Grouping, pivoting, binning

• Feature Creation and Selection

Page 11: Knime & bioinformatics

Analyze & Data Mining

• Regression

– Linear, Logistic

• Classification

– Decision tree, ensembles, SVM, MLP, Naïve Bayes

• Clustering

– k-means, DBSCAN, hierarchical

• Validation

– Cross-validation, scoring, ROC

• Misc

– PCA, MDS, item set mining

• External

– R, Weka

Page 12: Knime & bioinformatics


Visualization

• Interactive

– Scatter plot, histogram, pie charts, box plot

– Highlighting (brushing)

• JFreeChart

• JavaScript

• Misc

– Tag cloud, open street map, networks, molecules

• External

– R

Page 13: Knime & bioinformatics


Deployment

• Database

• Files

– Excel, csv, txt

– XML

– PMML

– to: local, KNIME Server, SSH-, FTP-Server

• BIRT Reporting

Page 14: Knime & bioinformatics

Over 1000 native and embedded nodes included:

• Statistics; Data Mining; Machine Learning; Web Analytics; Text Mining; Network Analysis; Social Media Analysis; WEKA; R; Community / 3rd

• MySQL, Oracle, etc.; SAS, SPSS, etc.; Excel, Flat, etc.; Hive etc.; XML, PMML; Text, Doc, Image; Web Crawlers; Industry Specific; Community / 3rd

• ETL; Row, Column; Matrix; Text, Image; Time Series; Java; Python; Community / 3rd

• R; JFreeChart; Community / 3rd

• via BIRT; PMML; XML; Databases; Excel, Flat, etc.; Hive etc.; Text, Doc, Image; Industry Specific; Community / 3rd

Page 15: Knime & bioinformatics

KNIME: Integrating Data and Tools

Page 16: Knime & bioinformatics

Big Data. Pre-processing on Hadoop

Page 17: Knime & bioinformatics

In-Database Processing

Loads your pre-processed data into KNIME

Page 18: Knime & bioinformatics


Reader/Writer

• Table selection

• Load data into KNIME

• Create table as select

• Insert/append data

• Delete rows from table

• Update values in table


Page 19: Knime & bioinformatics


Hive/Impala Loader

• Upload a KNIME data table to Hive/Impala

• Part of the commercial Big Data Extension


Page 20: Knime & bioinformatics


Manipulation

• Filter rows and columns

• Join tables/queries

• Sort your data

• Write your own query

• Aggregate your data
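
One way to picture these in-database steps: each manipulation wraps the upstream query in a new SELECT, so the database does the work and no rows move until a reader executes the final statement. The Java below is a minimal sketch under that assumption. `QuerySketch`, its methods, and the table and column names (`meter_readings`, `meter_id`, `kwh`) are hypothetical and are not part of the KNIME API.

```java
/**
 * Hypothetical sketch of in-database processing: every step wraps the
 * upstream query in a subquery. Not the KNIME API.
 */
final class QuerySketch {
    private final String sql;
    private final int depth; // only used to give each subquery a unique alias

    private QuerySketch(String sql, int depth) {
        this.sql = sql;
        this.depth = depth;
    }

    /** Table selection: the starting point of the chain. */
    static QuerySketch table(String name) {
        return new QuerySketch("SELECT * FROM " + name, 0);
    }

    /** The upstream query as an aliased subquery, e.g. "(...) t0". */
    private String upstream() {
        return "(" + sql + ") t" + depth;
    }

    /** Filter rows with a WHERE condition. */
    QuerySketch filterRows(String condition) {
        return new QuerySketch(
                "SELECT * FROM " + upstream() + " WHERE " + condition, depth + 1);
    }

    /** Aggregate one expression per group. */
    QuerySketch groupBy(String column, String aggregation) {
        return new QuerySketch(
                "SELECT " + column + ", " + aggregation + " FROM " + upstream()
                        + " GROUP BY " + column, depth + 1);
    }

    /** Sort by a single column. */
    QuerySketch sortBy(String column) {
        return new QuerySketch(
                "SELECT * FROM " + upstream() + " ORDER BY " + column, depth + 1);
    }

    /** The SQL text that would finally be sent to the database. */
    String sql() {
        return sql;
    }
}

class QuerySketchDemo {
    public static void main(String[] args) {
        // Table and column names are made up for illustration.
        String sql = QuerySketch.table("meter_readings")
                .filterRows("kwh > 0")
                .groupBy("meter_id", "SUM(kwh) AS total_kwh")
                .sql();
        System.out.println(sql);
    }
}
```

Plain string concatenation keeps the sketch short. Code that accepts user input would need to validate identifiers and bind values as parameters.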


Page 21: Knime & bioinformatics

Database GroupBy – Manual Aggregation

Page 22: Knime & bioinformatics

Database GroupBy – Type Based Aggregation

Matches all cells

Matches all numeric cells

Page 23: Knime & bioinformatics


Utility

• Drop table

– missing table handling

– cascade option

• Execute any SQL statement e.g. DDL

• Manipulate existing queries


Page 24: Knime & bioinformatics


HDFS File Handling

• New nodes

– HDFS Connection

– HDFS File Permission

• Utilize the existing remote file handling nodes

– Upload/download files

– Create/list directories

– Delete files


Page 25: Knime & bioinformatics

HDFS File Handling

Page 26: Knime & bioinformatics

Workflow 1: Prepare Data

~ 2 days

Irish Smart Energy Meter Trials

• July 2009 – Dec 2010

• 6000 meters

• roughly 176m rows of data

Page 27: Knime & bioinformatics

Import Data from Database into KNIME

< 30 min

Page 28: Knime & bioinformatics

Big Data. Machine Learning on Hadoop

Page 29: Knime & bioinformatics


Machine Learning on Hadoop

• Based on Spark MLlib

• Scalable machine learning library

• Runs on Hadoop

• Algorithms for

– Classification (decision tree, naïve bayes, …)

– Regression (logistic regression, linear regression, …)

– Clustering (k-means)

– Collaborative filtering (ALS)

– Dimensionality reduction (SVD, PCA)


Page 30: Knime & bioinformatics


MLlib Integration

• Usage model and dialogs similar to existing nodes

• No coding required

Page 31: Knime & bioinformatics


MLlib Integration

• MLlib model ports for model transfer

• Native MLlib model learning and prediction

• Spark nodes start and manage Spark jobs

• Supports Spark job cancelation

Native MLlib model

Page 32: Knime & bioinformatics


MLlib Integration

• Spark RDDs as input/output format

• Data stays within your cluster

• No unnecessary data movements

• Several input/output nodes, e.g. Hive, HDFS files, …

Page 33: Knime & bioinformatics


Mass Learning – Fast Event Prediction

• Convert supported MLlib models to PMML

• Mass learning on Hadoop

• Fast event prediction based on compiled models

Page 34: Knime & bioinformatics


Mix and Match

• Combine with existing KNIME nodes

Page 35: Knime & bioinformatics


Modularize and Execute Your Own Spark Code

Page 36: Knime & bioinformatics


Spark Node Overview

Page 37: Knime & bioinformatics

And what about rocket science?

Page 38: Knime & bioinformatics

Community Contributors

Technology Partners

Distribution & Consulting Partners

Community Contributors

Community User Base

Donated by Companies

Contributions from Research Institutions

Maintained by KNIME

Page 39: Knime & bioinformatics

Community Contributors

Technology Partners

Distribution & Consulting Partners

Community Contributors

Community User Base

Academic Institutions:

• Universität Tübingen (BALL, OpenMS)

• Freie Universität Berlin (SeqAn)

• MPI Dresden (ImgLib)

• Universität Dresden (Palladin)

• ETH Zürich (OpenBIS)

• Dublin University (OMERO)

• University of Wisconsin (ImageJ2)

• …

Commercial Contributors:

• Dymatrix Consulting Group (Uplift Nodes)

• Eli Lilly (ChemInf suite)

• Novartis (RDKit, Indigo)

• Vernalis (Proteomics)

• Cenix (REST Nodes)

• Böhringer-Ingelheim (various sponsored nodes)

• …

Page 40: Knime & bioinformatics

Bioinformatics

https://tech.knime.org/bioinformatics-and-next-generation-sequencing-extensions

Page 41: Knime & bioinformatics


OpenMS

Open-source software C++ library for liquid chromatography–mass spectrometry data management and analyses.

https://tech.knime.org/community/bioinf/openms

Page 42: Knime & bioinformatics


SeqAn

Open-source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data.

https://tech.knime.org/seqan-nodes-for-knime

Page 43: Knime & bioinformatics


NGS

Nodes and workflows used for processing next generation sequencing results

https://tech.knime.org/community/next-generationsequencing

Page 44: Knime & bioinformatics


knime4bio

Set of custom nodes for analysing NGS data

https://code.google.com/p/knime4bio/

Page 45: Knime & bioinformatics

Image Processing

https://tech.knime.org/community/image-processing

Page 46: Knime & bioinformatics


Active Classification in Cell Assay Images

• Different modules for segmentation and feature extraction

• Active Learning

Page 47: Knime & bioinformatics


Active Classification in Cell Assay Images

CellMiner Nodes

Plate/Image Reading

– Plate Reader, Plate Editor, Plate View

Preprocessing

– Noise Filtering, Lowpass Filter

Segmentation

– Threshold based Segmentation, Voronoi Segmentation

Features

– Line, Histogram, Texture, RGB, Zernike Moments, Shape

Active Classification

Page 48: Knime & bioinformatics

Chemistry and Cheminformatics

https://tech.knime.org/cheminformatics-extensions

Page 49: Knime & bioinformatics

Selected Open Source extensions

Page 50: Knime & bioinformatics

Selected commercial extensions

Page 51: Knime & bioinformatics

Overview of types in KNIME

• Basic KNIME types

– string, integer, double

• KNIME core chemistry types:

– smiles, sdf, mol, mol2

– Structures in these formats can be rendered in KNIME tables

Page 52: Knime & bioinformatics

Nodes for type manipulation

• Molecule Type Cast

– Casts any string as a chemical type (i.e. it tells KNIME “This is a smiles string”)

– Useful when reading data from a csv file or database

• Marvin MolConverter

– Provided by Chemaxon/Infocom

– Translates seamlessly between types (smiles, sdf, mrv)

Page 53: Knime & bioinformatics

Nodes for reading and writing files

Readers and writers provided for:

– sdf, smiles, mol, mol2

Page 54: Knime & bioinformatics

Sketching chemical structures – use Marvin

MarvinSketch

• Provided by Chemaxon/Infocom

• Sketch structures in the configuration dialog

• Execute node to inject structures into workflow

Page 55: Knime & bioinformatics

RDKit

• Open source cheminformatics library in C++

• Wrappers for KNIME maintained by the open source community

• Useful for:

Descriptor calculation

Cleaning structures

InChI conversion

Standardizing smiles

Fingerprints

Scaffolds/substructures

Reaction simulation

and more…

Page 56: Knime & bioinformatics

Infocom JChem KNIME Nodes

Extensions of ChemAxon’s tools for KNIME workflows

Implemented by Infocom with the support of ChemAxon

Contains over 90% of ChemAxon's cheminformatics functionality

Page 57: Knime & bioinformatics

ChEMBL

A public database of bioactive drug-like compounds

~1.3 million compounds

~9k targets

~12 million bioactivities

Provided by the European Bioinformatics Institute

Accessible online at www.ebi.ac.uk/chembl or via EBI-provided KNIME nodes…

Page 58: Knime & bioinformatics

New Node: ChEMBLdb Connector

Access data in ChEMBL via a web service call (internet access required)

Lookup by ChEMBL ID or InChI Key

Retrieve structure and bioactivity data

Compound search using smiles: exact, similarity, or substructure

Page 59: Knime & bioinformatics


Tool Integrations

Page 60: Knime & bioinformatics

Install KNIME

• Select the KNIME version for your computer

– (Mac, Win, Linux)

• Copy to your local machine

• Unpack the file in a “nice” place

Page 61: Knime & bioinformatics

Start KNIME

Go to the installation directory and launch KNIME.

Page 62: Knime & bioinformatics

The Workspace

• The workspace is the folder in which the workflows (and potentially data files) for the current KNIME session are stored.

• Workspaces are portable (just like KNIME)


Page 63: Knime & bioinformatics

Starting KNIME for the first time

Install additional extensions

Goes straight to the KNIME workbench

Page 64: Knime & bioinformatics

The KNIME Workbench

Page 65: Knime & bioinformatics

A basic workflow

Page 66: Knime & bioinformatics

More on nodes…

A node can have 3 states:

Idle: The node is not yet configured and cannot be executed with its current settings.

Configured: The node has been set up correctly and may be executed at any time.

Executed: The node has been successfully executed. Results may be viewed and used in downstream nodes.
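
The three states can be read as a small state machine. The Java below is a minimal sketch of that reading, not KNIME's own implementation. `NodeState` and `StatefulNodeSketch` are hypothetical names, and the rule that invalid settings send a node back to idle is an assumption.

```java
/** Hypothetical model of the three node states; not KNIME's own classes. */
enum NodeState { IDLE, CONFIGURED, EXECUTED }

final class StatefulNodeSketch {
    private NodeState state = NodeState.IDLE;

    NodeState state() {
        return state;
    }

    /** Valid settings make the node executable; invalid ones leave it idle (assumption). */
    void configure(boolean settingsValid) {
        state = settingsValid ? NodeState.CONFIGURED : NodeState.IDLE;
    }

    /** Only a configured node may be executed. */
    void execute() {
        if (state != NodeState.CONFIGURED) {
            throw new IllegalStateException("Cannot execute node in state " + state);
        }
        state = NodeState.EXECUTED;
    }
}

class NodeStateDemo {
    public static void main(String[] args) {
        StatefulNodeSketch node = new StatefulNodeSketch();
        System.out.println(node.state()); // IDLE
        node.configure(true);
        System.out.println(node.state()); // CONFIGURED
        node.execute();
        System.out.println(node.state()); // EXECUTED
    }
}
```

Calling `execute()` on an idle node throws, which mirrors the rule that an unconfigured node cannot be executed.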

Page 67: Knime & bioinformatics


Node configuration

• Most nodes require configuration

• To access a node configuration window:

• Double-click the node

• Right-click > Configure


Page 68: Knime & bioinformatics


Node execution

• Right-click node

• Select Execute in context menu

• If execution is successful, status shows green light

• If execution encounters errors, status shows red light


Page 69: Knime & bioinformatics


Node views

• Right-click node

• Select Views in context menu

• Select output port to inspect execution results

Page 70: Knime & bioinformatics

Hotkeys (for future reference)

Page 71: Knime & bioinformatics

A Peek under the Hood: KNIME (Node) Development

Page 72: Knime & bioinformatics


Node Architecture

• KNIME interacts only with a Node

• Node takes care of embedding the node in the infrastructure

• New nodes implement Model/View/Dialog

class Node (final)

class NodeDialogPane (abstract)

class NodeView (abstract)

class NodeModel (abstract)

class NodeFactory (abstract)
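
The class layout above can be illustrated with a stripped-down sketch: a final wrapper class that the framework talks to, and abstract model, dialog, view and factory classes that a new node implements. All `Mini…` and `Prefix…` names below are hypothetical stand-ins. The real KNIME classes have different and much richer signatures, so this only illustrates the pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the classes on the slide; not the KNIME API.
abstract class MiniNodeModel {
    abstract List<String> execute(List<String> input);
}

abstract class MiniNodeDialog {
    abstract void applyTo(MiniNodeModel model);
}

abstract class MiniNodeView {
    abstract String render(List<String> output);
}

abstract class MiniNodeFactory {
    abstract MiniNodeModel createModel();
    abstract MiniNodeDialog createDialog();
    abstract MiniNodeView createView();
}

/** The only class the framework talks to; it wires the three parts together. */
final class MiniNode {
    private final MiniNodeModel model;
    private final MiniNodeDialog dialog;
    private final MiniNodeView view;
    private List<String> output = new ArrayList<>();

    MiniNode(MiniNodeFactory factory) {
        this.model = factory.createModel();
        this.dialog = factory.createDialog();
        this.view = factory.createView();
    }

    void configure() { dialog.applyTo(model); }
    void execute(List<String> input) { output = model.execute(input); }
    String view() { return view.render(output); }
}

/** Example node: prepends a configurable prefix to every row. */
final class PrefixModel extends MiniNodeModel {
    String prefix = "";

    @Override
    List<String> execute(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String row : input) {
            out.add(prefix + row);
        }
        return out;
    }
}

final class PrefixFactory extends MiniNodeFactory {
    @Override
    MiniNodeModel createModel() { return new PrefixModel(); }

    @Override
    MiniNodeDialog createDialog() {
        return new MiniNodeDialog() {
            @Override
            void applyTo(MiniNodeModel model) {
                // A real dialog would read this value from user input.
                ((PrefixModel) model).prefix = "row_";
            }
        };
    }

    @Override
    MiniNodeView createView() {
        return new MiniNodeView() {
            @Override
            String render(List<String> output) { return String.join(",", output); }
        };
    }
}
```

A caller creates `new MiniNode(new PrefixFactory())`, then calls `configure()`, `execute(...)` and `view()` without ever touching the model, dialog or view directly.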

Page 73: Knime & bioinformatics


Node Extension Wizard

• Included in the KNIME Developer Version

• Allows creation of plugin projects including functioning KNIME nodes (with sample code)

• Helpful to easily create all node classes

– Generates all Java classes

– Node is registered with the plugin project

– Launch KNIME and enjoy the new node working!


Page 74: Knime & bioinformatics

Node Extension Wizard

Page 75: Knime & bioinformatics

Node Extension Wizard

• Specify all settings to create a new KNIME node

– In a completely new plugin project, or

– Into an existing project

• Node type: Sink, Source, Learner, Predictor, Manipulator, Visualizer, Meta, or Other

• Include sample code or not


Page 76: Knime & bioinformatics


Node Extension Wizard

• Contains all Java classes (including sample code)

• Node is registered in the plugin.xml

• NodeDialog and NodeView classes are also created and registered with the NodeFactory

Page 77: Knime & bioinformatics

Node Development

Page 78: Knime & bioinformatics

Resources

• KNIME pages (www.knime.org)

– APPLICATIONS for example workflows

– LEARNING HUB under RESOURCES (www.knime.org/learning-hub)

• KNIME Tech pages (tech.knime.org)

– FORUM for questions and answers

– DOCUMENTATION for documentation, FAQ, changelogs, ...

– LABS for new experimental nodes

– COMMUNITY CONTRIBUTIONS for development instructions and third party nodes

• KNIME TV channel on

• KNIME on @KNIME