HPE HPC & AI Forum 2018 presentation material...design and build of container as a service...

HPE HPC & AI Forum 2018

Hewlett Packard Enterprise
Pointnext Hybrid IT COE
Lead Architect 吉瀬淳一

An Innovation Platform That Accelerates the Use of AI

HPE Pointnext Support for Digital Transformation

Expertise That Supports Our Customers: HPE Pointnext

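Advisory / Professional / Operational

– Understanding the customer's desired outcomes and challenges
– Validating the approach through proofs of concept and pilots
– Designing the transformation plan
– Rapid deployment and implementation of solutions
– Design and configuration of large-scale IT solutions
– Providing continuous operation and support for solutions
– Optimizing flexible delivery models and consumption models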

Broader Use Cases of Container Based Transformation

"By 2018, more than 50% of new workloads will be deployed into containers in at least one stage of the application life cycle." (Gartner, Mar. 2016)

Adoption has been accelerating as more use cases are built in various areas. Container use originally started in application development.

Now it is expanding to:
• Replacement of VMs
• AI / Deep Learning
• CICD Automation

Replacement of Virtualization
• Use container technology instead of virtualization to minimize the infrastructure tax
• Benefit: Performance, TCO Optimization, Flexibility

AI / Deep Learning Platform
• Higher utilization and performance of AI / DL frameworks such as TensorFlow can be provided
• Benefit: Efficiency, Agility, Innovation

CICD Automation
• Seamless application development and deployment can be accelerated and automated by the use of containers
• Benefit: Agility, IP Protection

https://hpe.northernlight.com/document.php?docid=HO20160329130000080&datasource=HPESYND

References of Container based Transformation

CICD Automation @ 2017-

Source: https://h50146.www5.hpe.com/products/servers/news/casestudy/jcb-synergy/

Integration of a container-based application development lifecycle tool chain to transform and accelerate application development with HPE Pointnext Center of Excellence expertise. Design and build of container as a service (based on Mesosphere DC/OS) with an OSS CICD lifecycle tool chain such as Jenkins, TensorFlow, etc. on the HPE Synergy platform. With the transformation, JCB would be able to acquire “agility” and “flexibility” in their application development.

Replacement of Virtualization @ 2016-

Source: https://h20195.www2.hpe.com/v2/Getdocument.aspx?docname=a00045370enw

Adoption of container platform in order to bring efficiency, agility, and flexibility as one package for competitive semicon manufacturing process. Testing of SSD firmware could be

CICD Automation with OpenShift @ 2018-

Design and integration of CICD automation with OSS tools on Red Hat OpenShift in order to increase the speed of application development and to bring clarity and standardization to the application development platform for security and governance purposes.
Customer: Financial bank in Japan

Container based automated data analysis on AWS @ 2018-

Customer: Financial Institution in Singapore

DL Platform with Container based Distributed GPU @ 2018-

Bring efficiency and high utilization to the GPU platform for the DL framework TensorFlow by transforming the platform to containerization.
Customer: Manufacturer in Korea

Related Technologies and Products We Work With

– Cloud Native Platform (Kubernetes distributions)
  – Mesosphere DC/OS
  – SUSE CaaS Platform
  – Red Hat OpenShift
  – Docker Enterprise
– Infrastructure
  – HPE servers / storage / networking
  – Public cloud
    – AWS
    – Azure
    – GCP
  – Private cloud
    – OpenStack
    – VMware

Deep Learning Starter Package w/ TensorFlow

– Best Optimized Platform for Deep Learning
  – The HPE Apollo 6500 Gen10 System provides superior performance-per-dollar for GPU-intensive workloads, with eight NVIDIA Tesla V100 GPUs per server and NVLink interconnect, delivering up to 125 TFLOPS of single-precision compute for faster intelligence.
  – Unprecedented performance delivering economical AI and deep learning
  – Rock-solid, enterprise-level reliability, availability, and serviceability (RAS) features
  – Supports a wide range of workloads, including deep learning and HPC workloads such as complex simulation and modeling
  – An open source deep learning framework is loaded out of the box
  – TensorFlow, a major deep learning framework in the market, is configured for your innovation
– Open Source Deep Learning Library for Multiple GPUs
  – Easy to execute complex deep neural network structures with Python
  – Simple, visual management console for the deep learning process with TensorBoard
  – No code changes needed to enable multiple GPUs and maximize processing power
– Proven Architecture by HPE
  – Various services around the world leverage TensorFlow
  – Apollo is developed specifically for the innovation of our customers
– Deep Learning Ready Platform
  – Start innovating today with the built-in deep learning architecture
  – Architecture certified by Hewlett Packard Enterprise
  – Various platform support tools are enabled from the beginning, such as iLO Management, CUDA Toolkit, and TensorBoard

HPE Deep Learning Development Platform

8 x Tesla GPUs with NVLink 2.0 can be loaded on the HPE ProLiant XL270d Accelerator Tray

Bare-metal TensorFlow

Solution Characteristics
– Best GPU Density
– Leading DL Framework
– Flexible Storage Option
– Superior Management Tool

Benefit of HPE Deep Learning Development Platform

Service Description
– Service Duration: 2.0 weeks
  – Design & Integration
  – Configuration of Apollo System
  – Installation of CentOS or Ubuntu
  – NVIDIA Driver Implementation
  – CUDA Toolkit / cuDNN Implementation
  – TensorFlow Installation
  – Sample Program Test
– Skill Transfer
  – Sample Program Handover
  – Q&A for 1.0 week
– Output
  – Implementation Report
– Optional Services
  – Deep Learning Consulting Service
  – Inception Integration

[Figure: software stack with Sample DL Apps, Deep Learning Framework (TensorFlow), CUDA, and CentOS / Ubuntu on HPE Apollo 6500 Gen10 + GPU (NVIDIA)]

Deep Learning Starter Package w/ TensorFlow on Container based Distributed GPUs Platform w/ Red Hat

– Best Optimized and Scalable Container based Platform for Deep Learning
  – The HPE Apollo 6500 Gen10 System provides superior performance-per-dollar for GPU-intensive workloads, with eight NVIDIA Tesla V100 GPUs per server and NVLink interconnect, delivering up to 125 TFLOPS of single-precision compute for faster intelligence.
  – Supports a wide range of workloads, including deep learning and HPC workloads such as complex simulation and modeling
  – A container based deep learning framework is loaded out of the box
  – Provides the highest level of GPU resource utilization with container technologies
– Open Source Deep Learning Library for Multiple GPUs
  – Easy to execute complex deep neural network structures with Python
  – Simple, visual management console for the deep learning process with TensorBoard
  – No code changes needed to enable multiple GPUs and maximize processing power
– Innovative Architecture by HPE
  – Various services around the world leverage TensorFlow
  – Apollo is developed specifically for the innovation of our customers
  – Container based TensorFlow increases the efficiency of GPU resource usage
  – Easy to develop an entire application ecosystem by integrating the deep learning framework on Red Hat OpenShift
– Deep Learning Ready Platform
  – Start innovating today with the built-in deep learning architecture
  – Architecture certified by Hewlett Packard Enterprise
  – Various platform support tools are enabled from the beginning, such as Red Hat OpenShift, iLO Management, CUDA Toolkit, and TensorBoard

– Best GPU Density
– Leading DL Framework
– Flexible Storage Option
– Superior Management Tool

Service Description
– Installation of Red Hat OpenShift
– Configuration of master, worker, and infra nodes

[Figure: software stack with Sample DL Apps, Container Based TensorFlow, OpenShift, CUDA, and CentOS / Ubuntu on GPU (NVIDIA) servers]

Deep Learning Starter Package w/ TensorFlow on Container based Distributed GPUs Platform w/ DC/OS

– Best Optimized and Scalable Container based Platform for Deep Learning
  – The HPE Apollo 6500 Gen10 System provides superior performance-per-dollar for GPU-intensive workloads, with eight NVIDIA Tesla V100 GPUs per server and NVLink interconnect, delivering up to 125 TFLOPS of single-precision compute for faster intelligence.
  – Supports a wide range of workloads, including deep learning and HPC workloads such as complex simulation and modeling
  – A container based deep learning framework is loaded out of the box
  – Provides the highest level of GPU resource utilization with container technologies
– Open Source Deep Learning Library for Multiple GPUs
  – Easy to execute complex deep neural network structures with Python
  – Simple, visual management console for the deep learning process with TensorBoard
  – No code changes needed to enable multiple GPUs and maximize processing power
– Innovative Architecture by HPE
  – Various services around the world leverage TensorFlow
  – Apollo is developed specifically for the innovation of our customers
  – Container based TensorFlow increases the efficiency of GPU resource usage
  – Easy to develop an entire application ecosystem by integrating the deep learning framework on Mesosphere DC/OS
– Deep Learning Ready Platform
  – Start innovating today with the built-in deep learning architecture
  – Architecture certified by Hewlett Packard Enterprise
  – Various platform support tools are enabled from the beginning, such as Mesosphere DC/OS, iLO Management, CUDA Toolkit, and TensorBoard

– Best GPU Density
– Leading DL Framework
– Flexible Storage Option
– Superior Management Tool

Service Description
– Installation of Mesosphere DC/OS
– Configuration of master, worker, and infra nodes

[Figure: software stack with Sample DL Apps, Container Based TensorFlow, DC/OS, CUDA, and CentOS / Ubuntu on GPU (NVIDIA) servers]

How Is AI Put Together in the First Place?

The Machine Learning / Deep Learning Process

[Figure: raw data → preprocessing → training dataset (images labeled “dog” / “cat”) → model training (neural network model) → trained model → inference on new data → “cat”]

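This is roughly the part that people call "AI"

[Figure: the same process diagram, with one region marked as "AI"]

In other words, the general impression is that AI is something that:
• learns nicely from the training data
• uses what it has learned to give a nice answer to a question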
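As a rough illustration of the flow on these slides (preprocessing, training, trained model, inference), here is a minimal sketch in plain Python. It is not a deep learning implementation: it swaps in a nearest-centroid classifier, and the two numeric features per animal (weight and an "ear pointiness" score), the scaling constants, and the sample values are all hypothetical stand-ins for image data.

```python
from collections import defaultdict
from math import dist


def preprocess(raw):
    """Preprocessing: scale the two hypothetical raw features into roughly [0, 1]."""
    weight_kg, ear_pointiness = raw
    return (weight_kg / 50.0, ear_pointiness / 10.0)


def train(dataset):
    """Training: compute one centroid per label from (raw_features, label) pairs."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for raw, label in dataset:
        x = preprocess(raw)
        sums[label][0] += x[0]
        sums[label][1] += x[1]
        counts[label] += 1
    # The "trained model" is simply a mapping from label to centroid.
    return {label: (s[0] / counts[label], s[1] / counts[label])
            for label, s in sums.items()}


def infer(model, raw):
    """Inference: return the label whose centroid is closest to the new data."""
    x = preprocess(raw)
    return min(model, key=lambda label: dist(model[label], x))


# Labeled training dataset (made-up values).
training_dataset = [
    ((30.0, 2.0), "dog"), ((25.0, 3.0), "dog"), ((35.0, 1.0), "dog"),
    ((4.0, 9.0), "cat"), ((5.0, 8.0), "cat"), ((3.5, 9.5), "cat"),
]

model = train(training_dataset)
prediction = infer(model, (4.5, 8.5))
print(prediction)  # cat
```

Even in this toy form, the trained model is an artifact separate from both the training code and the inference code, which is the separation the process diagram draws.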

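What AI Development Looks Like

– What we want to do (for example)
  – An application that, given a photo of a pet, tells whether the subject is a cat or a dog
– Very well, then:
  – Prepare a large number of cat and dog images
  – Develop a training algorithm that uses machine learning to extract features and build a model
  – Train the model
  – Develop an application that performs inference using the trained model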

It is not that simple.

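The Machine Learning / Deep Learning Process: Points to Consider

[Figure: raw data → preprocessing → model training on the labeled “dog” / “cat” dataset → inference → “cat”, annotated with the points below]

- Allocating GPU power
- Developer convenience
- Managing models and frameworks/libraries
- How to store the data
- How the data should be processed
- Managing the data fed to training jobs
- Real-time processing
- Efficient application development
- Managing models
- Access from inference applications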

Google's paper: Hidden Technical Debt in Machine Learning Systems (NIPS 2015)

The cost of building even a small AI application is enormous

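On top of that

Requirements for using AI in business:
– Continuous retraining with constantly changing data
– Tracking and tuning model accuracy
– Training for different purposes from the same dataset
– Supporting a variety of inference applications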
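To make the first two requirements concrete, the following is a minimal sketch of tracking model accuracy over a sliding window and flagging when retraining is due. The class name, the window size, and the threshold are illustrative assumptions, not part of the presentation or of any particular product.

```python
from collections import deque


class AccuracyTracker:
    """Tracks accuracy over the most recent predictions (hypothetical helper)."""

    def __init__(self, window=100, threshold=0.9):
        self.results = deque(maxlen=window)  # True/False per recent prediction
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether the latest prediction matched the observed label."""
        self.results.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the window, or None if nothing is recorded yet."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_retraining(self):
        """Flag retraining once the window is full and accuracy is below threshold."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy() < self.threshold


tracker = AccuracyTracker(window=4, threshold=0.75)
for predicted, actual in [("cat", "cat"), ("dog", "dog"), ("cat", "cat"), ("dog", "cat")]:
    tracker.record(predicted, actual)
before = tracker.needs_retraining()   # 3 of 4 correct, not below 0.75

tracker.record("cat", "dog")          # oldest result drops out of the window
after = tracker.needs_retraining()    # 2 of 4 correct, below 0.75
print(before, after)  # False True
```

In a real platform the flag would start a training job on fresh data rather than just return a boolean; the point here is only that accuracy has to be measured continuously for retraining to be triggered at all.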

Facebook's paper: Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective (HPCA 2018)

A Paradigm Shift in AI Platforms: MLaaS

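DataOps: DevOps in Data Science and ML

• What is DataOps?
  • An automated, process-oriented methodology
  • Its purpose is for data and analytics teams to improve quality and shorten the data analysis cycle
• What it aims for
  • Continuous delivery of value
  • Less dependence on specific individuals
  • Loose coupling
  • Flexible use of resources
• In other words, it is a methodology that applies practices from application development (DevOps and agile development) to the fields of data science and machine learning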

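This is the kind of thing we need (MLaaS: Machine Learning as a Service)

[Figure: MLaaS concept]
– Shared resources: dynamically accumulated data, managed models, available frameworks, compute resources (GPU, CPU, memory)
– Something that neatly brokers access to the resources that are needed
– People and processes that use it: data ingestion, data preprocessing, data analysis, training, model evaluation, inference apps, tuning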

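This looks somehow familiar

[Figure: the same layout for a container platform]
– Shared resources: distributed data services, storage and network services, repositories and registries, compute resources (GPU, CPU, memory)
– Something that neatly brokers access to the resources that are needed
– Containerized application workloads (microservices)

The brokering layer is container orchestration + DevOps.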

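Incidentally, There Are Many Kinds of Container Platforms

[Figure: landscape of container platforms. Labels: Managed, Private, Proprietary; Kubernetes as a Service, Kubernetes-based, non-k8s; GKE; Azure Kubernetes Service; IBM, Oracle, Red Hat, Pivotal, etc.; DC/OS; pure k8s; distributions; part of something larger; multi-cluster; IBM Cloud Private; independently enhanced and extended; Docker]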

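Kubernetes: The De Facto Standard for Container Orchestration

– A reimplementation in Go of the concepts behind Google's service infrastructure, released as open source
– A core project of the Cloud Native Computing Foundation, an organization under the Linux Foundation
– Uses containers as the infrastructure technology for achieving "Cloud Native" application development and operations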

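An Example from the Kubernetes Ecosystem

– Mesosphere DC/OS
  – An integrated platform for a variety of distributed services
  – Supports Kubernetes as a container orchestrator
  – Data services, CI/CD tools, and more can be deployed from a catalog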

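Kubeflow: A Project for Realizing MLaaS on Kubernetes

– Integrates the various functions needed for MLaaS on top of Kubernetes
  – JupyterHub: a multi-team environment for building models (notebooks)
  – Frameworks: TensorFlow, PyTorch, Caffe, Chainer, etc.
  – Katib: hyperparameter tuning
  – Argo: container workflow engine
  – Pachyderm: data pipeline management
  – Serving: API serving for models and inference
– Currently at Ver 0.2
  – Ver 1.0 is scheduled for release on 2018/12/16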

Concrete Examples

Training (Model Training)

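Challenges in Training Machine Learning Models

– Preparing datasets
– Efficient use of GPUs by a growing number of projects and data scientists
– Making use of rapidly evolving frameworks and libraries

Capabilities required of the platform:
• Scalability and flexibility of the data store
• Data preprocessing / ETL
• GPU scheduling
• Image management and flexible deployment for machine learning job execution environments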
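The GPU scheduling capability can be pictured with a small sketch. This is a first-fit allocator written for illustration only; it is not how Kubernetes, DC/OS, or any other orchestrator implements scheduling, and the node names, job names, and GPU counts are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A server with a fixed number of GPUs and the jobs currently holding them."""
    name: str
    total_gpus: int
    allocations: dict = field(default_factory=dict)  # job name -> GPUs held

    @property
    def free_gpus(self):
        return self.total_gpus - sum(self.allocations.values())


def schedule(nodes, job, gpus):
    """First fit: place the job on the first node with enough free GPUs."""
    for node in nodes:
        if node.free_gpus >= gpus:
            node.allocations[job] = gpus
            return node.name
    return None  # no node can host the job right now


def release(nodes, job):
    """Return a finished job's GPUs to the shared pool."""
    for node in nodes:
        node.allocations.pop(job, None)


nodes = [Node("node-a", 8), Node("node-b", 8)]
placements = {
    "train-resnet": schedule(nodes, "train-resnet", 6),
    "train-lstm": schedule(nodes, "train-lstm", 4),
    "notebook": schedule(nodes, "notebook", 2),
}
blocked = schedule(nodes, "train-big", 8)   # no node has 8 free GPUs yet
release(nodes, "train-lstm")
retried = schedule(nodes, "train-big", 8)   # node-b is free again
print(placements, blocked, retried)
```

Sharing a pool this way is what lets several projects use the same GPU servers: a job that does not fit waits until another job releases its GPUs, instead of each team owning idle hardware.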

Container Orchestration for Machine Learning

[Figure: a container orchestration platform pooling GPU and CPU resources]
– Users: Data Scientist (training code), App Developer (application code), Inference User
– Container image build automation, with images such as TensorFlow 1.0 on Ubuntu 16.04 with CUDA 8.0 and cuDNN v6, or other frameworks/versions
– Workloads: training jobs on distributed workers, and inference apps
– Data: training data and the model

Example: Running a Training Job with DC/OS + Distributed TensorFlow

[Figure: a distributed TensorFlow job among the workloads that DC/OS can host]
– Inputs from the Data Scientist: DC/OS Universe package, learning code, scheduling parameters
– Scheduling of the distributed TensorFlow job: a TensorFlow Server and three TensorFlow Workers on CPU and GPU resources
– Data: training data from a data preprocessing pipeline, checkpoints, and the model
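For readers unfamiliar with how a distributed TensorFlow job learns its own topology: TensorFlow's distributed training support reads a cluster description from the `TF_CONFIG` environment variable, a JSON document listing the tasks per role and identifying the current task. The sketch below only builds that JSON. The host names, the port, and the `ps`/`worker` role layout are assumptions for illustration, and the source does not say whether the DC/OS package on the slide configures its tasks this way.

```python
import json


def build_tf_config(ps_hosts, worker_hosts, task_type, task_index):
    """Return the TF_CONFIG JSON string for one task of the cluster."""
    cluster = {"ps": list(ps_hosts), "worker": list(worker_hosts)}
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} task with index {task_index}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical addresses; an orchestrator would normally fill these in.
ps_hosts = ["ps-0.example.internal:2222"]
worker_hosts = [f"worker-{i}.example.internal:2222" for i in range(3)]

# One environment per worker task, as a scheduler might inject it.
worker_envs = [
    {"TF_CONFIG": build_tf_config(ps_hosts, worker_hosts, "worker", i)}
    for i in range(len(worker_hosts))
]
print(worker_envs[1]["TF_CONFIG"])
```

Every task receives the same `cluster` section and a different `task` section, which is why this kind of job is a natural fit for an orchestrator that already knows where each container was placed.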

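Example: Instantly Providing a Ready-to-Run Environment That Uses GPU Power

[Figure: layers shown: cloud management; container orchestration; data services and image registry; hardware resources (CPU, GPU, memory, disk); datasets and dataset management. Steps shown: flavor selection, deployment, defining requirements, use]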

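Inference

Challenges in Inference (inference applications that use trained models)

– Access to trained models
– Real-time processing of input data
– Making use of rapidly evolving frameworks and libraries
– Efficient application development

Capabilities required of the platform:
• Real-time data pipelines
• Data services suited to each use
• Image management for application runtimes
• CI/CD for applications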

Example: Real-Time Visualization and Money Laundering Detection

[Figure: financial transaction data → message queue (Kafka) → laundering detector → time-series DB (InfluxDB) → dashboard (Grafana)]

Detector output:
POTENTIAL MONEY LAUNDERING: 856 -> 804 totalling 8994 now
POTENTIAL MONEY LAUNDERING: 233 -> 954 totalling 8710 now
POTENTIAL MONEY LAUNDERING: 318 -> 273 totalling 8883 now
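The detector in this pipeline can be sketched as a streaming function. The rule used here (flag a sender/receiver pair once its running total passes a threshold) is a guess made only to reproduce the shape of the output lines above; the presentation does not describe the actual detection logic. An in-memory list stands in for the Kafka topic, and the threshold and transaction values are hypothetical.

```python
from collections import defaultdict


def detect_laundering(transactions, threshold=8000):
    """Yield an alert whenever a (source, destination) running total exceeds threshold."""
    totals = defaultdict(int)
    for src, dst, amount in transactions:
        totals[(src, dst)] += amount
        if totals[(src, dst)] > threshold:
            yield (f"POTENTIAL MONEY LAUNDERING: {src} -> {dst} "
                   f"totalling {totals[(src, dst)]} now")


# Stand-in for messages consumed from the queue: (source, destination, amount).
stream = [
    (856, 804, 5000),
    (233, 954, 3000),
    (856, 804, 3994),
    (318, 273, 100),
]

alerts = list(detect_laundering(stream))
for line in alerts:
    print(line)  # POTENTIAL MONEY LAUNDERING: 856 -> 804 totalling 8994 now
```

Because the function is a generator, alerts are produced as each transaction arrives rather than after the whole stream is read, which is the property a real-time pipeline needs from this stage.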

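Summary

⚫ Data matters. Without data, AI cannot even begin.

⚫ MLaaS built on container orchestration technology will be the platform that supports AI development and use from here on.

⚫ HPE Pointnext supports our customers' innovation with its extensive experience and forward-looking initiatives.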

Thank you
