cehs 2016 poster

Interpretable deep learning models to predict protein phenotype from genotype

Eric J. Ma1, David K. Duvenaud2, Jonathan A. Runstadler1,3

1 Department of Biological Engineering & 3 Division of Comparative Medicine, MIT
2 School of Engineering and Applied Science, Harvard University

Problems
1. Pathogen risk determination is currently based on non-standardized measurements and simple heuristics.
2. Mapping from genotype to phenotype is complex.
3. Lack of standardized measurements hampers systematic study & reproducibility.
4. Current machine learning models cannot regress on inputs of variable length.

Convolution & Fingerprint

Introduction

Deep Learning Algorithm

Results

Future Work
• Learning Capacity: neural network architecture improvements; prevent overfitting.
• Interpretability: better visualizations of convolutional feature maps.
• Applications: pathogen genomic surveillance, chemical surveillance.

Vision
• Real-time dashboard for influenza surveillance
• Risk profile informs tailored interventions

Prior Work
• Duvenaud et al., 2016 (arXiv): prediction of chemical properties on chemical graphs
• Genotype: chemical structure; phenotype: chemical property.
• Applications in drug screening, toxicity prediction, etc.

[Figure. Panel titles: Predictive Features; Training Performance. Panel labels: toxicity, solubility. Plot: RMSE (log Mol/L), 0.8 to 2.0, against fingerprint radius, 0 to 6. Series: circular fingerprints; random conv with large parameters; random conv with small parameters.]

Goals
• Data set: HIV-1 Protease, Stanford HIV Drug Resistance Database
• Train convolutional network on protein graph.
• Develop software package for generalized graph regression.

Protein Graphs
[Figure: atomic structure alongside its graph representation.]
• Nodes: Amino acids with their identities
• Edges: Biochemical interactions between amino acids
• Biochemical knowledge of each node is encoded as a fixed vector.
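The graph representation described above can be sketched as a node feature matrix plus an adjacency matrix. This is a minimal illustration, not the poster's software package: the function names, the choice of features (one-hot identity plus degree), and the toy contact list are all assumptions made for the example.

```python
import numpy as np

# 24-letter amino acid alphabet, as listed on the poster's feature axis.
ALPHABET = "ABCDEFGHIJKLMNPQRSTVWXYZ"


def node_features(residue, degree):
    """Fixed-length vector: one-hot amino acid identity, then node degree."""
    vec = np.zeros(len(ALPHABET) + 1)
    vec[ALPHABET.index(residue)] = 1.0
    vec[-1] = degree
    return vec


def build_graph(sequence, contacts):
    """Return (feature matrix, adjacency matrix) for one protein.

    `contacts` is a list of (i, j, kind) tuples. `kind` is only a label
    here; a fuller model could encode it as an edge feature.
    """
    n = len(sequence)
    adj = np.zeros((n, n))
    for i, j, _kind in contacts:
        adj[i, j] = adj[j, i] = 1.0  # undirected interaction
    degrees = adj.sum(axis=1)
    feats = np.stack([node_features(r, d) for r, d in zip(sequence, degrees)])
    return feats, adj


# Hypothetical toy input: the first five residues of the protease sequence
# shown on the poster, with backbone edges between consecutive residues
# and one made-up contact (illustrative only, not a real interaction).
seq = "PQVTL"
contacts = [(i, i + 1, "backbone") for i in range(len(seq) - 1)]
contacts.append((0, 4, "h-phobic"))
features, adjacency = build_graph(seq, contacts)
print(features.shape)  # (5, 25)
```

Every protein, whatever its length, yields node vectors of the same width, which is what lets a single set of convolution weights be shared across graphs of different sizes.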

[Figure: worked example on a four-node integer graph. Input: 1, 4, -3, 1. Convolution: -2, 2, 3, 2. Non-Linearity: 0, 2, 3, 2. Fingerprint: 7.]

• Convolutions capture local structure of graph
• Non-linearities allow modelling of arbitrary functions
• Fingerprints represent a fixed-length representation of underlying graph.
• Graphs with identical nodes and edges will have identical fingerprints.
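The three stages listed above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the convolution is unweighted (each node plus the sum of its neighbors, with no learned parameters), the non-linearity is taken to be ReLU, and the fingerprint is a sum over nodes. The node values follow our reading of the poster's worked example; the edge set is an assumption chosen for illustration.

```python
import numpy as np


def graph_fingerprint(values, adj):
    """Unweighted convolution, ReLU, then sum pooling into a fingerprint."""
    conv = values + adj @ values   # each node plus the sum of its neighbors
    act = np.maximum(conv, 0.0)    # ReLU non-linearity (assumed)
    return conv, act, act.sum()    # sum pooling gives a fixed-length output


# Node values 1, 4, -3, 1; hypothetical undirected edges.
values = np.array([1.0, 4.0, -3.0, 1.0])
edges = [(0, 2), (1, 2), (1, 3), (2, 3)]
adj = np.zeros((4, 4))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

conv, act, fingerprint = graph_fingerprint(values, adj)
print(conv)         # [-2.  2.  3.  2.]
print(act)          # [0. 2. 3. 2.]
print(fingerprint)  # 7.0

# Relabeling the nodes leaves the fingerprint unchanged.
perm = [2, 0, 3, 1]
_, _, fingerprint_perm = graph_fingerprint(values[perm], adj[np.ix_(perm, perm)])
print(fingerprint_perm)  # 7.0
```

Because the pooling step is a sum over nodes, the result does not depend on node ordering or on the number of nodes, which is why graphs with identical nodes and edges produce identical fingerprints.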

Synthetic Data
[Figure: training loss panels for three synthetic targets: # Nodes, Conv::Sum, and Conv::Sine. Architectures compared: Fingerprint → Lin. Reg. and Convolution → Convolution → Fingerprint → Lin. Reg. Worked example on a graph with node values 1, 4, -3, 1: number of nodes = 4; Σ(node + Σ(neighbors)) = 5; Σ(sin(node) + Σsin(neighbors)) = 1.37.]

• Learn mathematical transforms on top of integer graphs.
• Deeper networks converge in fewer iterations with smaller error.
• Graph-based convolutional neural nets work on simple mathematical functions.
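As a rough sketch of how such synthetic regression targets could be generated, the snippet below computes the three quantities named on the poster (number of nodes, Σ(node + Σ(neighbors)), and Σ(sin(node) + Σsin(neighbors))) for one small integer graph. The function name and the neighbor lists are assumptions; the edge set was chosen so that the outputs agree with the example values 4, 5, and 1.37 shown on the poster.

```python
import math


def synthetic_targets(values, neighbors):
    """Return (number of nodes, sum target, sine target) for one graph."""
    n_nodes = len(values)
    conv_sum = sum(
        v + sum(values[j] for j in neighbors[i])
        for i, v in enumerate(values)
    )
    conv_sine = sum(
        math.sin(v) + sum(math.sin(values[j]) for j in neighbors[i])
        for i, v in enumerate(values)
    )
    return n_nodes, conv_sum, conv_sine


# Integer node values with a hypothetical neighbor list (undirected edges).
values = [1, 4, -3, 1]
neighbors = {0: [2], 1: [2, 3], 2: [0, 1, 3], 3: [1, 2]}

n_nodes, conv_sum, conv_sine = synthetic_targets(values, neighbors)
print(n_nodes, conv_sum, round(conv_sine, 2))  # 4 5 1.37
```

Targets like these have a known closed form, so any training error can be attributed to the network rather than to noise in the labels.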

[Figure: surveillance pipeline, Viral Samples → Sequences → Models → Risk Profile.]

Example sequences:
PQVTLWQRPI VTIRIGGQLK EALLDTGADD...
PQVTLWQRPI VTIKIGGQLK EALLDTGADD...
PQVTLWQRPI VTIKIGGDLK EALLDTGADD...

MNPNQKLITI GSICLVVGLI SLILQIGNII...
MNPNQKIITI GSICLVMGLI SLILQIGNII...
MNPNQKIITI GSICLVVGLI SLILQIGNII...

Risk profile components: Drug Resistance, Antigenicity, Replication Rate, Aerosol Viability, Thermal Stability.

Visualizing Learned Features
[Figure: learned features by Feature ID (0 to 35). Node properties: amino acid identity (A B C D E F G H I J K L M N P Q R S T V W X Y Z), pKa, mol. wt, degree, Σ(dist.). Edge properties: h-phobic, disulfide, h-bond, ionic, aromatic, arm_sulph, cation_pi, backbone.]

• Able to recapitulate known mutations involved in FPV drug resistance.
• Interpretable: hydrophobic network of amino acids implicated in FPV resistance.
• (left) Dark nodes: highly activating; light nodes: weakly activating
• (right) Green ribbon: backbone; Dark blue spheres: top activating nodes; Light blue spheres: neighbors.
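One way to pick out the nodes highlighted in such a visualization is to rank nodes by activation and then collect their graph neighbors. The poster does not specify how activations are computed or how many nodes are shown, so the function name, the activation values, the neighbor lists, and the cutoff `k` below are all hypothetical.

```python
def top_nodes_with_neighbors(activations, neighbors, k=2):
    """Return the k most strongly activating nodes and their other neighbors."""
    ranked = sorted(
        range(len(activations)), key=lambda i: activations[i], reverse=True
    )
    top = ranked[:k]
    # Neighbors of the top nodes, excluding the top nodes themselves.
    halo = sorted({j for i in top for j in neighbors[i]} - set(top))
    return top, halo


# Made-up activations on a five-node chain graph.
activations = [0.1, 0.9, 0.05, 0.7, 0.2]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

top, halo = top_nodes_with_neighbors(activations, neighbors, k=2)
print(top, halo)  # [1, 3] [0, 2, 4]
```

In the poster's rendering, `top` would correspond to the dark blue spheres and `halo` to the light blue ones.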

HIV Protease Drug Resistance

• Simple convolutional model comparable with random forest and gradient boost baseline models.

• Very little overfitting.

Model            Error
Random Forest    0.09
Gradient Boost   0.07
Convolution      0.14